Article

PPDD: Egocentric Crack Segmentation in the Port Pavement with Deep Learning-Based Methods

1 Department of Management Information Systems, Dong-A University, Busan 49236, Republic of Korea
2 Department of Urban Planning and Engineering, Dong-A University, Busan 49315, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(10), 5446; https://doi.org/10.3390/app15105446
Submission received: 2 April 2025 / Revised: 8 May 2025 / Accepted: 9 May 2025 / Published: 13 May 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Road infrastructure is a critical component of modern society, with its maintenance directly influencing traffic safety and logistical efficiency. In this context, automated crack detection technology plays a vital role in reducing maintenance costs and enhancing operational efficiency. However, previous studies are limited in that they provide only bounding box or segmentation mask annotations for a restricted number of crack classes and rely on relatively small datasets. To address these limitations and advance deep learning-based crack segmentation, this study introduces a novel crack segmentation dataset that reflects real-world road conditions. The proposed dataset includes various types of cracks and defects—such as slippage, rutting, and construction-related cracks—and provides polygon-based segmentation masks captured from an egocentric, vehicle-mounted perspective. Using this dataset, we evaluated the performance of semantic and instance segmentation models. Notably, SegFormer achieved the highest Pixel Accuracy (PA) and mean Intersection over Union (mIoU) for semantic segmentation, while YOLOv7 exhibited outstanding detection performance for the alligator crack class, recording an AP50 of 87.2% and an AP of 57.5%. In contrast, all models struggled with the reflection crack type, indicating its inherent segmentation challenges. Overall, this study provides a practical and robust foundation for future research in automated road crack segmentation. Additional resources, including the dataset and annotation details, can be found in our GitHub repository.

1. Introduction

Roads are among the most essential infrastructure for maintaining the economic and physical connectivity of society, and their quality and stability have both direct and indirect impacts on economic and social issues [1]. Pavement surfaces gradually deteriorate due to factors such as climate conditions, traffic volume, and vehicle loads, which can negatively affect traffic safety and the efficient operation of road networks [2]. Since pavement maintenance and repair strategies are typically tailored to the causes of cracking, accurate identification of crack types is a critical task. Fatigue cracking in pavements arises from diverse failure mechanisms and exhibits distinctive visual characteristics that reflect the underlying causes. Research on such pavement surface defects enables appropriate maintenance actions and supports regular monitoring and interventions to preserve road quality and stability. These efforts, in turn, bring about positive societal outcomes such as improved traffic safety, reduced vehicle maintenance costs, and enhanced pedestrian convenience [3].
Traditional pavement inspection methods have mainly relied on visual assessments by human inspectors or semi-automated techniques, where road surface measurements are collected using automated equipment and later reviewed manually. These methods are inherently limited by their reliance on subjective judgment [4] and often involve considerable time and financial costs. In addition, inspectors must occupy and control traffic lanes during inspections, increasing the risk of accidents and traffic disruptions. To overcome these limitations, deep learning-based detection models have gained traction in recent years, significantly advancing automated crack detection technologies [1]. However, training such models typically requires densely annotated crack labels, which poses a major barrier to large-scale adoption. Consequently, existing studies have mostly relied on bounding box annotations for entire images [5] or segmentation masks applied at the patch level (e.g., cropped images of cracks) [1,6].
In this study, we propose a novel benchmark dataset for deep learning-based crack detection. The dataset not only includes previously under-represented crack types—such as slippage, rutting, and construction-related cracks—but also provides polygon-based segmentation masks for enhanced annotation precision. Unlike existing crack segmentation datasets, which often rely on orthographic imaging—capturing cracks from a perpendicular viewpoint on surfaces such as bridges or building exteriors—our dataset adopts an egocentric acquisition method, capturing road cracks from the viewpoint of a moving vehicle. This acquisition setup allows the dataset to include oblique views and real-world perspective distortions, thereby more accurately reflecting the visual conditions encountered in practical, on-road applications. In particular, we constructed a new benchmark segmentation dataset for road cracks and expanded the crack taxonomy to encompass a wider variety of crack types. Furthermore, by conducting a comprehensive evaluation of various segmentation architectures using the proposed dataset, we demonstrated both its practical utility and its potential as a standardized benchmark for future research.
To establish baselines and analyze the characteristics of the proposed road crack segmentation dataset, we evaluate the performance of deep learning-based detection models on both semantic and instance segmentation tasks. For semantic segmentation, we adopt three representative models: Mask2Former [7], SegFormer [8], and Deeplabv3+ [9]. For instance segmentation, we utilize YOLOv7 [10] and Mask2Former [7]. Our experimental results provide both quantitative and qualitative evaluations of the proposed dataset, demonstrating its potential for training and benchmarking various baseline models.
Overall, the proposed dataset offers a diverse set of crack categories along with detailed segmentation annotations. Our benchmark evaluations on both semantic and instance segmentation models not only establish baseline performance but also highlight future research directions. Section 2 reviews existing road defect datasets and related deep learning models. Section 3 details the acquisition and preprocessing of our dataset. Section 4 presents our experimental results, while Section 5 and Section 6 discuss future directions and conclude this paper.

2. Related Work

2.1. Public Dataset for Crack Identification and Detection

Public datasets for analyzing the visual characteristics of cracks using deep learning-based visual perception models can largely be categorized into two research directions. The first approach focuses on segmentation for detailed representation of cracks. This involves identifying fine-grained visual features of cracks that appear in diverse contexts—such as roads, bridges, and concrete building facades—under varying causes of deterioration. To achieve this, crack segmentation datasets typically involve orthographic imaging, in which the surface affected by cracking is captured from a perpendicular viewpoint. From these images, crack regions are cropped into patches, and segmentation masks are annotated to delineate the fine fissures present in the crack areas [11,12]. While this approach enables detailed analysis of the visual characteristics and precise delineation of various types of cracks, it has limitations when applied to the development of automated road crack detection systems.

The second research direction focuses on crack detection specifically tailored for road surfaces. This approach utilizes annotations of road cracks captured from an ego-vehicle view, providing data that more closely resemble real-world driving scenarios. Recently, the SHREC 2022 Challenge [1] addressed this topic by organizing a task focused on detecting potholes and cracks in road pavement imagery. The participants analyzed the challenges of road surface defect detection and proposed solutions, thereby contributing to the advancement of research on deep learning-based detection systems using road-specific datasets. However, existing public datasets for road crack detection are limited in terms of both the diversity of crack categories and the availability of pixel-level dense annotations. Most of these datasets only provide bounding box annotations indicating the location of cracks within ego-vehicle images [13,14]. EdmCrack600 [15,16] attempted to address the limitations of front-mounted ego-vehicle views—such as obstructions caused by the vehicle hood—by mounting a GoPro camera on the rear of the vehicle. While the dataset includes segmentation annotations and real-world road surface data, its scale is limited to only 600 images.

In Table 1, we compare our dataset with existing road damage datasets in terms of data types, number of classes, types of deep learning tasks applied, and dataset size. This comparison reveals that, unlike previous datasets, which often focus on specific damage types or limited tasks, the dataset constructed in this study encompasses a wider range of data types and a greater number of classes. It also demonstrates scalability and flexibility for application to various deep learning tasks, including classification and detection. To enable more practical and challenging road crack analysis, we propose a new benchmark dataset consisting of ego-vehicle images annotated with segmentation masks for six distinct crack categories.

2.2. Deep Learning-Based Segmentation

Dense prediction in deep learning-based object recognition systems can be broadly categorized into object detection tasks, which identify the spatial location of target objects [20], and object segmentation tasks [21], which further predict pixel-level masks for the detected objects. In particular, image segmentation tasks are divided into semantic segmentation [21], which performs pixel-level classification by predicting the presence of object categories in both foreground and background, and instance segmentation [22], which predicts separate segmentation masks for each individual object instance. A typical semantic segmentation pipeline consists of an encoder that extracts pixel-level visual representations and a decoder that reconstructs the prediction masks to match the original image resolution. SegFormer [8], in particular, introduces a hierarchical transformer-based encoder designed to capture both coarse and fine features, along with a lightweight All-MLP decoder that fuses multi-level features for accurate mask prediction. DeepLabv3+ [9] utilizes Atrous Spatial Pyramid Pooling (ASPP) to effectively capture objects at multiple scales and introduces a decoder module that enhances the reconstruction of fine object boundaries. Instance segmentation tasks are typically implemented by attaching a mask-specific decoder to the instance representations predicted by a detector. YOLOv7 [10] is an advanced version of the original one-stage detection pipeline, YOLO [20], with multiple architectural improvements. By incorporating a dedicated segmentation mask decoder, it enables segmentation prediction while preserving the fast inference speed characteristic of YOLO-based models. Mask2Former [7] proposes a unified architecture for any image segmentation task—including semantic, instance, and panoptic segmentation—by introducing masked attention for cross-attention between predicted mask regions and a transformer decoder that efficiently leverages high-resolution features from a pixel decoder.
Recent advances in transformer-based segmentation models and domain-specific detectors have led to the development of increasingly powerful dense prediction techniques. For instance, OneFormer [23] introduces a multi-task unified transformer architecture that utilizes task-specific conditional queries, achieving superior performance across semantic, instance, and panoptic segmentation tasks compared to individually trained models such as Mask2Former. In the field of pavement crack analysis, Guo et al. [24] proposed the “Crack Transformer”, which leverages a Swin Transformer encoder to significantly improve pixel-level crack segmentation under challenging conditions such as shadows and complex crack structures, outperforming traditional CNN-based methods like DeepLabv3+. These recent studies highlight the continuous progress in dense prediction—from general-purpose segmentation transformers to task-specific crack detection networks—and underscore the importance of scalable and representative datasets to support further advancement.

3. Materials and Methods

3.1. Port Pavement Distress Dataset (PPDD)

The data were collected from roadways near the port area of Busan, South Korea. This region is characterized by high traffic volumes, including frequent passage of heavy-freight vehicles, which increases the likelihood of various types of cracking and road surface damage [25]. In this study, we systematically observed and collected a wide range of pavement defects occurring in such asphalt road environments.
The dataset used in this study was constructed from video data captured during driving using a GoPro10 camera (CMOS sensor, 23.0 MP, 5.3K resolution at 60 fps; GoPro, Inc., San Mateo, CA, USA) mounted on the upper front section of a vehicle. Images were recorded at a resolution of 3840 × 2160 pixels, and only frames containing at least one visible road defect were selected for analysis. The recordings were conducted during three time periods (10:00–12:00, 13:00–15:00, and 15:00–17:00) across different days and under two weather conditions: clear and overcast. As illustrated in Figure 1, crack and defect data were extracted from the lower part of each frame—specifically, a region extending 1000 pixels upward from the bottom edge of the image. This area corresponds to the road surface nearest the vehicle and includes the left, center, and right lanes of the roadway. The red polygon highlighted in the figure indicates the region used for annotation (labeling); however, this area was not cropped for use during model training or evaluation. To ensure privacy protection, personally identifiable elements such as vehicle license plates and commercial signage were anonymized in accordance with data protection standards.
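To make the inclusion criterion concrete, the minimal sketch below (our illustration, not the authors' released code) indexes the 1000-pixel annotation band out of a 3840 × 2160 frame; as noted above, this band guides labeling only, and frames are not cropped for training or evaluation.

```python
import numpy as np

# Frame geometry from Section 3.1; BAND_H follows the stated criterion.
FRAME_H, FRAME_W = 2160, 3840
BAND_H = 1000  # region extending 1000 px upward from the bottom edge

def annotation_band(frame: np.ndarray) -> np.ndarray:
    """Return the lower band of a frame used to guide labeling.

    Per the paper, this band only defines where annotations are drawn;
    the full frame is still used for model training and evaluation.
    """
    assert frame.shape[:2] == (FRAME_H, FRAME_W)
    return frame[FRAME_H - BAND_H:, :]  # rows 1160..2159, all columns

# Example with a dummy frame standing in for a decoded video frame.
frame = np.zeros((FRAME_H, FRAME_W, 3), dtype=np.uint8)
print(annotation_band(frame).shape)  # (1000, 3840, 3)
```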
The dataset constructed in this study was designed to comprehensively cover various types of cracks and defects commonly found on port roads. A total of six major crack categories were defined and labeled: reflection cracking (RC), longitudinal and edge cracking (LEC), corrugation, shoving, and slippage cracking (CSSC), rutting and depression cracking (RDC), construction joint cracking (CJC), and alligator cracking (AC). In total, 204,839 images were collected and used for analysis. The distribution of labeled instances across the dataset is as follows: 124,780 instances of RC, 183,094 of LEC, 20,210 of CSSC, 68,434 of RDC, 109,625 of CJC, and 88,571 of AC, resulting in a total of 594,714 labeled crack instances. Representative visual examples of each crack category are provided in Figure 2.
In this study, we aim to detect and analyze surface defects and cracks on roadways using semantic and instance segmentation techniques. To this end, the dataset was annotated with polygon-based labels and crack region masks, following the annotation format of the MS COCO dataset [26]. For semantic segmentation, the polygon annotations for each crack instance were converted into binary masks distinguishing background and object regions, and every pixel within the annotated regions was assigned a class label corresponding to one of the six crack categories. The labeling process was assisted by the Computer Vision Annotation Tool (CVAT) [27], an automated annotation platform, and the automatically generated labels were manually verified on a frame-by-frame basis to ensure accuracy. While inter-annotator agreement was not formally measured during dataset construction, we plan to evaluate annotation consistency using a subset of samples in a follow-up study. Figure 3 presents sample images illustrating the mask annotations for semantic segmentation and polygon annotations for instance segmentation. The constructed dataset reflects the distribution of crack types found on port-area roads and provides a reliable and generalizable foundation for real-world crack detection and analysis tasks.
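As a concrete illustration of this conversion, the hedged sketch below rasterizes COCO-style polygon annotations into an instance mask and a semantic class map using pycocotools; the image size and category ids 1–6 are assumptions for illustration and may differ from the released PPDD annotation files.

```python
import numpy as np
from pycocotools import mask as mask_utils

H, W = 2160, 3840  # assumed image size, matching the capture resolution

def polygons_to_instance_mask(polygons, h=H, w=W):
    """Rasterize one instance's COCO polygon list into a binary mask."""
    rles = mask_utils.frPyObjects(polygons, h, w)
    rle = mask_utils.merge(rles)
    return mask_utils.decode(rle).astype(bool)

def build_semantic_mask(annotations, h=H, w=W):
    """Paint each instance's pixels with its class id (0 = background)."""
    semantic = np.zeros((h, w), dtype=np.uint8)
    for ann in annotations:  # ann: {"segmentation": [...], "category_id": int}
        inst = polygons_to_instance_mask(ann["segmentation"], h, w)
        semantic[inst] = ann["category_id"]
    return semantic

# Toy example: one triangular instance of hypothetical class id 6 (e.g., AC).
ann = {"segmentation": [[100.0, 2100.0, 600.0, 2100.0, 350.0, 1900.0]],
       "category_id": 6}
print(build_semantic_mask([ann]).max())  # 6
```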

3.2. Baselines

Using the Port Pavement Distress Dataset (PPDD), we adopted Mask2Former, SegFormer, and DeepLabv3+ for semantic segmentation, as these models have demonstrated strong performance in fine-grained object segmentation tasks. For instance segmentation, we employed YOLOv7 and Mask2Former, chosen for their flexible architectures capable of performing both detection and segmentation simultaneously with high accuracy. All experiments were conducted at a fixed input resolution of 640 × 640. This resolution was selected based on two considerations: (1) it matches the input size used by widely adopted real-time detection models such as YOLOv5 and YOLOv7 [10], and (2) it offers a balanced trade-off between computational efficiency and segmentation performance. We therefore chose this resolution to ensure comparability with standard practices while maintaining feasibility for future deployment in resource-constrained systems. Model training and visualization were conducted using the MMDetection [28] and MMSegmentation [29] frameworks within the OpenMMLab ecosystem. Model performance was evaluated using standard metrics: Pixel Accuracy (PA) and mean Intersection over Union (mIoU) for semantic segmentation and mean Average Precision (AP) for instance segmentation. In addition, qualitative comparisons were conducted to assess the precision of crack boundary segmentation and to examine model-specific differences.
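For readers reproducing the setup, the brief sketch below shows one plausible 640 × 640 preprocessing pipeline in torchvision; the exact resize policy and normalization statistics used by each framework's configuration are assumptions here, not values taken from the paper.

```python
from torchvision import transforms

# A plausible fixed-resolution preprocessing chain (assumed, not the
# authors' exact MMDetection/MMSegmentation pipeline).
to_model_input = transforms.Compose([
    transforms.Resize((640, 640)),  # fixed 640 x 640 input, as in Section 3.2
    transforms.ToTensor(),          # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics,
                         std=[0.229, 0.224, 0.225]),  # a common default
])

# Usage: tensor = to_model_input(pil_image); batch = tensor.unsqueeze(0)
```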

4. Experiments

4.1. Experimental Setup

In this study, the road crack dataset was split into training, validation, and test sets in a ratio of 8:1:1 to enable effective model training and evaluation. The test set was used exclusively to assess the generalization performance of the segmentation models. Considering GPU memory constraints and training stability, the batch size was set to 64 and the learning rate to 0.0001. Training each epoch took approximately 60 min, although this may vary depending on GPU performance, dataset size, and model complexity. Model training was performed using two NVIDIA Ada 6000 GPUs in parallel, on a system running Ubuntu 20.04 with an Intel Xeon Silver CPU and 256 GB of RAM. The code and experimental settings used for training and evaluation are publicly available on GitHub: https://github.com/LeaYoon/PPDD (accessed on 1 April 2025).
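The toy sketch below illustrates an 8:1:1 split of the 204,839 images; the seed and shuffling order are assumptions, and the authoritative split files are those in the GitHub repository (note the slight rounding difference from the paper's 163,871/20,484/20,484 counts).

```python
import random

def split_dataset(image_ids, seed=0):
    """Shuffle ids deterministically and cut them 80/10/10."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (ids[:n_train],                 # training
            ids[n_train:n_train + n_val],  # validation
            ids[n_train + n_val:])         # test, held out for evaluation

train, val, test = split_dataset(range(204_839))
print(len(train), len(val), len(test))  # 163871 20483 20485
```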

4.2. Experimental Results

4.2.1. Quantitative Evaluation

To compare semantic segmentation performance across different types of pavement cracks, we conducted an evaluation using three models with distinct feature extraction mechanisms: Mask2Former, SegFormer, and DeepLabv3+. For Mask2Former, we experimented with two backbone networks: ResNet-101 and Swin Transformer. SegFormer uses MiT-B2, and DeepLabv3+ is implemented with ResNet-50. Table 2 presents the evaluation results on the test set, reporting Pixel Accuracy (PA) and Intersection over Union (IoU) scores across six crack categories for all four model configurations. We employed two standard metrics—Pixel Accuracy (PA) and mean Intersection over Union (mIoU)—to evaluate the performance of semantic segmentation models. Pixel Accuracy measures the ratio of correctly classified pixels to the total number of pixels in the image, providing an overall sense of classification accuracy. In contrast, mIoU quantifies the average overlap between the predicted and ground truth segmentation masks. It is calculated by computing the Intersection over Union (IoU) for each class, defined as the intersection area divided by the union area of the predicted and true regions, and then averaging across all classes. mIoU is regarded as a more robust metric in scenarios with class imbalance, as it accounts for the quality of segmentation on a per-class basis. Among the models, SegFormer demonstrated the highest semantic segmentation performance in terms of both PA and mIoU, achieving a particularly high IoU (97.52%) on the background class. In per-class performance, the CSSC category showed the lowest prediction scores in three out of the four models, while the AC type exhibited relatively higher accuracy. Overall, the transformer-based models—SegFormer and Mask2Former with a Swin Transformer backbone—achieved superior performance in semantic segmentation. These results highlight the effectiveness of transformer-based architectures in producing fine-grained masks for pavement crack and defect detection.
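For clarity, the short sketch below computes both metrics from a pixel-level confusion matrix; this is a generic implementation of the definitions above, not the evaluation code of MMSegmentation.

```python
import numpy as np

# C[i, j] counts pixels of true class i predicted as class j.
def pixel_accuracy(C):
    return np.diag(C).sum() / C.sum()

def mean_iou(C):
    tp = np.diag(C).astype(float)
    fp = C.sum(axis=0) - tp  # predicted as class k but wrong
    fn = C.sum(axis=1) - tp  # true class k but missed
    iou = tp / (tp + fp + fn)
    return np.nanmean(iou)   # average per-class IoU (nan if class absent)

# Tiny 3-class example (e.g., background plus two crack types).
C = np.array([[50, 2, 3],
              [4, 30, 1],
              [2, 2, 6]], dtype=float)
print(pixel_accuracy(C), mean_iou(C))
```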
To evaluate instance segmentation performance across different types of pavement cracks, we compared the Mask2Former and YOLOv7 models. For Mask2Former, experiments were conducted using two different backbone networks: ResNet-101 [30] and Swin Transformer [31]. Table 3 presents the AP50 and overall AP scores for the three model configurations across the six crack categories in the test dataset. AP50 (Average Precision at an IoU threshold of 50%) measures the accuracy of predictions with an Intersection over Union greater than 0.5. In contrast, AP refers to the mean Average Precision calculated across IoU thresholds from 0.5 to 0.95 in increments of 0.05, following the COCO evaluation metric [32], providing a more comprehensive assessment of overall model performance. Among the models, YOLOv7 achieved the best overall performance, demonstrating particularly strong results on the AC category, with an AP50 of 87.2% and an AP of 57.5%. On the other hand, the RC category yielded the lowest performance across all three models, suggesting that this type of crack is comparatively more challenging to segment and detect than others. Although the CSSC class accounts for only 3.3% of the dataset, its segmentation performance is not the lowest among the six crack types. For instance, the YOLOv7 model achieves an AP of 38.3% for CSSC, which is notably higher than that of the RC class (21.3%), despite the RC class occupying a larger portion of the dataset. This suggests that the challenge of segmentation may depend not only on class frequency but also on visual distinctiveness and contextual cues.
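The averaging scheme behind the COCO-style AP can be summarized in a few lines; in the sketch below, ap_at_iou is a hypothetical placeholder for the full matching and precision-recall computation implemented in tools such as pycocotools.

```python
import numpy as np

def coco_ap(ap_at_iou):
    """Average AP over the ten COCO IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = np.arange(0.50, 1.00, 0.05)
    return float(np.mean([ap_at_iou(t) for t in thresholds]))

# Toy stand-in: pretend per-threshold AP decays linearly with the threshold.
print(coco_ap(lambda t: max(0.0, 1.0 - t)))
```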

4.2.2. Qualitative Evaluation

Figure 4 presents a visual comparison of how accurately the semantic segmentation models distinguish different crack types in the PPDD dataset. Panel (a) shows the ground truth labels, depicting four types of pavement damage—RC, LEC, CJC, and AC—in their annotated forms. For RC, a horizontally oriented crack, SegFormer was the most effective in capturing the number and type of cracks, although it struggled with fine boundary delineation, as shown in (c). Mask2Former (d) also detected some portions of RC, but there were instances of omission and misclassification, particularly where RC was confused with AC. In the case of CJC, DeepLabv3+ (b) failed to deliver precise predictions. However, SegFormer (c), Mask2Former-r101 (d), and Mask2Former-swin-l (e) showed relatively detailed boundary representation and yielded outputs closely resembling the ground truth.
Figure 5 provides a visual comparison of instance-level segmentation performance, showing the predictions of YOLOv7 (b), Mask2Former with a ResNet-101 backbone (c), and Mask2Former with a Swin-L backbone (d), alongside the ground truth annotations (a). YOLOv7 demonstrated stable detection of AC, which spanned a wide area, and yielded relatively high confidence scores for the predicted instances (AC: 87.0, 89.8). However, it tended to miss finer details in CJC, indicating a limitation in capturing subtle damage patterns. In the case of Mask2Former, regardless of the backbone architecture, the model tended to overpredict the number of crack instances compared to the ground truth annotations. This suggests the presence of false-positive errors in the Mask2Former predictions.
Figure 6 visualizes the results of both semantic segmentation and instance segmentation on the same road image. It can be observed that cracks belonging to the RDC class are frequently misclassified as AC when they exhibit similar visual patterns. In both semantic and instance segmentation results ((a)–(c) and (d)–(f)), regions annotated as RDC are often predicted as AC, suggesting that the models tend to confuse the two classes when structural similarities are present. The predictions from Mask2Former (middle columns (b) and (e)) show a tendency to over-detect AC, RC, and LEC classes across both tasks. Numerous additional crack segments—colored as AC, RC, or LEC—appear even in areas without corresponding ground truth annotations. This over-detection may be attributed to the model’s high sensitivity and the class imbalance in the training dataset. In contrast, YOLOv7 (Subfigure (f)) produces fewer predictions in regions without actual cracks, whereas Mask2Former frequently generates false positives in such areas. Mask2Former also demonstrates stable IoU scores and accurate class predictions for long, horizontally, or vertically stretched cracks, indicating its robustness in capturing linear structures and correctly classifying their types. YOLOv7, on the other hand, does not excessively overestimate the area of such elongated cracks but sometimes misinterprets their orientation. Since YOLOv7 outputs axis-aligned bounding boxes, vertical cracks may be enclosed within horizontally shaped boxes, and vice versa.

5. Discussion

5.1. Road Crack Detection and Segmentation Dataset

In selecting representative models for evaluation, we referred to established practices in prior studies on pavement crack analysis to ensure methodological consistency. Accordingly, we incorporated both CNN-based segmentation architectures (e.g., YOLO, DeepLab) and ViT-based segmentation architectures (e.g., Mask2Former, SegFormer), which are widely recognized and frequently adopted as core architectures in the recent literature. This selection enables meaningful benchmarking while maintaining relevance to previous research.

Recent advancements in deep learning-based crack detection and segmentation technologies have been actively reported through competitive challenges [1], newly released datasets [6], and academic publications [33]. The fine-grained visual appearance of cracks—whether on natural surfaces or man-made structures—can vary significantly depending on the surrounding environment and the underlying causes of formation. OmniCrack30K [6] was developed to facilitate a comprehensive understanding of cracks by aggregating 30,000 diverse crack samples extracted from over 20 different datasets. The authors evaluated crack segmentation performance using both domain-specific models trained to represent crack-specific features and a group of general-purpose segmentation models, including DeepLabV3 [34] and Mask2Former [7]. Their findings revealed that the general-purpose models trained on the proposed dataset significantly outperformed the crack-specific models. This outcome highlights the critical importance of access to a sufficiently large and well-curated dataset in advancing the performance of deep learning-based models.

In the context of paved roads, cracks are defined not only by fissures but also by the overall visual degradation of the road surface, which constitutes an important visual characteristic for detection. This aspect is highly domain-specific and serves the independent purpose of supporting road maintenance through dedicated crack detection. To address this, the road damage dataset (RDD) [14] was introduced, providing annotated bounding boxes for classes such as longitudinal cracks, transverse cracks, alligator cracks, and potholes, collected from paved roads across multiple countries. This dataset has since supported the advancement of crack detection research through a series of ongoing competitions [5,18,35,36,37]. In the recent ORDDC2024 challenge [37], which utilized the RDD dataset, all of the top five teams adopted models from the YOLO series [20] as their backbone architectures, aiming to maximize detection performance relative to inference time. The highest-performing model achieved an overall F1-score of 79.27 across all participating countries. The SHREC2022 challenge [1] focused on evaluating semantic segmentation performance for various crack and pothole classes, using Weighted Pixel Accuracy (WPA) and mean Intersection over Union (mIoU) as evaluation metrics. To facilitate performance comparison with submitted models, the organizers employed DeepLabv3+ [9] as a baseline, which achieved a WPA of 0.598 and an mIoU of 0.676, establishing a representative reference level of performance for benchmark studies in crack segmentation.

In this study, we adopted widely used baseline models for crack segmentation—YOLOv7 [10], Mask2Former [7], and DeepLabv3+ [9]—which have been frequently employed in prior research for detection and semantic segmentation tasks. Based on our experimental results, we conclude that our work also establishes a valid and consistent baseline aligned with existing standards in the field.

5.2. Limitations and Opportunities

PPDD comes with several considerations for analysis. First, all data were collected by a single, consistent organization, whereas the RDD dataset [14] includes road crack images collected from multiple countries. Such multi-source diversity introduces challenges in evaluating model robustness to non-visual cues or assessing few-shot learning performance, as it becomes difficult to isolate the effects of domain gaps related to country- or city-specific factors. Although the consistency of PPDD ensures high-quality annotations, its lack of geographic variety may limit its ability to represent road damage patterns observed in other regions or climates. In other words, PPDD alone may not be sufficient for evaluating a model's generalization across geographic domains. However, it may serve as a valuable evaluation set for studies focusing on crack-centric visual characteristics or for investigating domain gap effects in a more controlled setting. Another important consideration is class imbalance. Since the categories and visual characteristics of road cracks are closely tied to their underlying causes, various environmental factors—such as the primary function of the road or the frequency of certain types of wear—can influence the distribution of classes within a dataset. The proposed dataset was collected from port roads in Busan, South Korea—a city with a high volume of container transport—making it possible to include crack categories associated with heavy loads, such as corrugation, shoving, and slippage (CSSC), as well as rutting and depression (RDC). However, the occurrence frequencies of these classes remain relatively low, comprising only 3.3% and 11.5% of the dataset, respectively. Given the active research within the deep learning community on mitigating learning bias caused by imbalanced data [38], this dataset can also serve as a benchmark for evaluating model performance in low-frequency class scenarios.
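As one simple illustration of imbalance mitigation, the sketch below derives inverse-frequency class weights from the instance counts reported in Section 3.1; treating instance counts as a proxy for pixel frequencies is an assumption, and such weights would typically be passed to a weighted cross-entropy loss.

```python
import numpy as np

# Instance counts per class from Section 3.1 of this paper.
counts = {"RC": 124_780, "LEC": 183_094, "CSSC": 20_210,
          "RDC": 68_434, "CJC": 109_625, "AC": 88_571}

freq = np.array(list(counts.values()), dtype=float)
freq /= freq.sum()
weights = 1.0 / freq
weights /= weights.mean()  # normalize so the average weight is 1

for name, w in zip(counts, weights):
    print(f"{name}: {w:.2f}")  # rare classes such as CSSC get larger weights
```

Weights of this form could, for example, be supplied to a standard weighted loss (e.g., torch.nn.CrossEntropyLoss(weight=...)) during training.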

5.3. Future Directions

Our new benchmark dataset introduces several promising directions for future research in road crack detection and analysis. We organize these directions into two broad categories: (1) domain-specific challenges within the expanded taxonomy of crack types, and (2) integration with emerging techniques in deep learning and computer vision. The high-quality segmentation annotations provided in PPDD offer detailed information on the extent of damage, enabling the development of more practical and real-world-applicable crack segmentation models. The dataset includes a diverse set of crack categories, contributing to the analysis of visual characteristics associated with different causes of pavement damage.

First, it would be worthwhile to explore methods for enhancing model sensitivity to visually subtle cracks with specific directional characteristics, such as reflection cracks. These cracks frequently appear perpendicular to the driving lane and often exhibit low visual contrast with nearby road markings, making them prone to misclassification. To address this challenge, architectural improvements that explicitly encode directional information and effectively learn spatial context may serve as effective solutions. Proposed strategies include adapting transformer-based architectures to better capture global contextual features [39,40], as well as applying connected component-based post-processing techniques for prediction refinement [41] (see the sketch below). These approaches aim to improve model robustness for complex crack types by leveraging both structural alignment and contextual dependencies inherent in road defect imagery.

While prior work [42] has considered an mIoU of 0.5 (i.e., 50%) to represent reasonably sufficient performance, our SegFormer baseline achieves a per-class IoU of 48.31% on the rarest class in the dataset (CSSC). Despite the extremely limited training opportunities for this class, the model is still able to capture meaningful patterns, indicating that it has learned to recognize the structural characteristics of even the most infrequent crack types. Future work applying a broader range of model architectures beyond the proposed baselines may further improve segmentation performance. Additionally, the use of ensemble techniques or hybrid architectures could enhance robustness across diverse crack types. Incorporating advanced data augmentation strategies and class balancing techniques [43] may also help mitigate issues arising from class imbalance and limited training samples, particularly for rare crack categories. Together, these approaches could lead to more accurate and generalizable models for real-world crack segmentation tasks.

Moreover, although the model maintained reasonable performance under overcast conditions, the number of such samples was limited, and thus a detailed condition-specific performance analysis was not included in this study. In future work, expanding annotations related to environmental factors such as weather and lighting, and introducing condition-aware evaluation protocols, will be crucial to further improve model robustness under varying real-world scenarios. To support real-world applications, we also plan to expand the dataset with more comprehensive weather variation annotations.
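The sketch below illustrates the connected component-based refinement idea cited above [41] with OpenCV: small isolated blobs in a predicted binary crack mask are suppressed. The area threshold is an arbitrary illustrative value, not a tuned parameter from this study.

```python
import cv2
import numpy as np

def remove_small_components(binary_mask: np.ndarray, min_area: int = 200) -> np.ndarray:
    """Keep only connected components whose pixel area meets min_area."""
    num, labels, stats, _ = cv2.connectedComponentsWithStats(
        binary_mask.astype(np.uint8), connectivity=8)
    cleaned = np.zeros_like(binary_mask, dtype=np.uint8)
    for k in range(1, num):  # label 0 is the background component
        if stats[k, cv2.CC_STAT_AREA] >= min_area:
            cleaned[labels == k] = 1
    return cleaned

# Toy example: one large blob survives, a single stray pixel is removed.
mask = np.zeros((100, 100), dtype=np.uint8)
mask[40:60, 10:40] = 1  # large component (area 600)
mask[5, 90] = 1         # speckle noise
print(remove_small_components(mask).sum())  # 600
```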
Secondly, recent advances in deep learning-based perception have increasingly focused on training predictive models under challenging constraints. This includes scenarios where only a limited number of accurate annotations are available, where annotations are weak [44,45] or incomplete [46,47], or where domain gaps [48] arise between training and deployment environments. The proposed crack segmentation dataset is well positioned to support emerging research directions driven by the expansion of deep learning technologies. One particularly promising avenue is the integration of vision–language models (VLMs), such as CLIP [49], which have been trained on large-scale image–text datasets. VLMs represent a significant advancement in the scope and applicability of AI perception, enabling models to leverage text-based alignment and commonsense reasoning in visual tasks. As studies increasingly explore the application of VLMs to perception tasks, their ability to transfer learned representations opens up new possibilities for identifying a broader range of crack categories. The diverse crack categories included in our dataset can thus serve as a critical benchmark for evaluating how well existing recognition models, including those based on VLMs, generalize to previously unseen or fine-grained categories (a toy zero-shot sketch is shown below).

Although this study does not primarily aim for direct application in real-time or embedded systems, the experimental environment—conducted at a resolution of 640 × 640 (VGA level)—reflects a configuration similar to that commonly used in lightweight inference scenarios for embedded platforms. Therefore, while embedded deployment was not the explicit objective, the resolution-constrained evaluation offers practical implications relevant to such environments. Our focus was on assessing segmentation accuracy and crack-type differentiation performance across state-of-the-art models. In future work, we plan to explore the possibility of adapting the proposed methods to embedded and real-time applications through model compression, quantization, or other efficiency-oriented strategies.
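As a minimal sketch of the zero-shot VLM direction discussed above, the snippet below scores an image patch against text prompts for the six PPDD categories using the Hugging Face CLIP implementation; the prompts, the patch file, and the idea of patch-level scoring are illustrative assumptions rather than an evaluated pipeline from this paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical prompts, one per PPDD crack category.
prompts = [f"a photo of asphalt pavement with {c}" for c in
           ["reflection cracking", "longitudinal and edge cracking",
            "corrugation, shoving, and slippage cracking",
            "rutting and depression", "construction joint cracking",
            "alligator cracking"]]

patch = Image.open("crack_patch.jpg")  # hypothetical cropped road region
inputs = processor(text=prompts, images=patch, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(probs)  # similarity-based class probabilities for the patch
```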
Therefore, we expect that the dataset proposed in this study will serve as a valuable objective resource for informing road maintenance and repair strategies. Furthermore, it is anticipated to contribute to the evaluation of crack detection model performance and the assessment of their generalization capabilities in future research.

6. Conclusions

In this study, we constructed a high-precision crack segmentation dataset that reflects real-world road conditions from an ego-vehicle perspective. Based on this dataset, we evaluated the performance of both semantic and instance segmentation models. SegFormer and YOLOv7 demonstrated strong performance in semantic segmentation and in detecting specific crack categories, respectively. However, certain crack types—such as reflective cracking—remained difficult to detect accurately, highlighting ongoing challenges in fine-grained crack recognition. The proposed dataset contributes not only by providing detailed polygon-based annotations for a diverse range of crack categories, but also by offering data captured in realistic driving environments, which are under-represented in existing datasets. These features enhance the practical relevance of the benchmark and allow for more robust evaluation across models. We believe this work establishes a valuable foundation for future research in road surface defect analysis. In particular, future studies may build on this dataset to improve model generalization to unseen road textures, evaluate cross-domain transferability, and explore efficient annotation strategies to support large-scale training and benchmarking.

Author Contributions

Conceptualization, H.Y. and S.K.; Methodology, H.Y.; Software, H.Y.; Validation, H.Y., S.K. and H.-K.K.; Resources, H.-K.K.; Data Curation, H.-K.K.; Writing—Original Draft Preparation, H.Y.; Writing—Review and Editing, S.K. and H.-K.K.; Visualization, H.Y.; Supervision, S.K.; Funding Acquisition, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Dong-A University in the Republic of Korea, grant number 10.13039/501100002468.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available at https://github.com/LeaYoon/PPDD.

Conflicts of Interest

The authors declare no competing interests.

References

  1. Thompson, E.M.; Ranieri, A.; Biasotti, S.; Chicchon, M.; Sipiran, I.; Pham, M.-K.; Nguyen-Ho, T.-L.; Nguyen, H.-D.; Tran, M.-T. SHREC 2022: Pothole and crack detection in the road pavement using images and RGB-D data. Comput. Graph. 2022, 107, 161–171. [Google Scholar] [CrossRef]
  2. Kanwal, S.; Rasheed, M.I.; Pitafi, A.H.; Pitafi, A.; Ren, M. Road and transport infrastructure development and community support for tourism: The role of perceived benefits, and community satisfaction. Tour. Manag. 2020, 77, 104014. [Google Scholar] [CrossRef]
  3. Kanwal, S.; Pitafi, A.H.; Rasheed, M.I.; Pitafi, A.; Iqbal, J. Assessment of residents’ perceptions and support toward development projects: A study of the China–Pakistan Economic Corridor. Soc. Sci. J. 2022, 59, 102–118. [Google Scholar] [CrossRef]
  4. Huyan, J.; Li, W.; Tighe, S.; Xu, Z.; Zhai, J. CrackU-net: A novel deep convolutional neural network for pixelwise pavement crack detection. Struct. Control Health Monit. 2020, 27, e2551. [Google Scholar] [CrossRef]
  5. Arya, D.; Maeda, H.; Ghosh, S.K.; Toshniwal, D.; Mraz, A.; Kashiyama, T.; Sekimoto, Y. Transfer learning-based road damage detection for multiple countries. arXiv 2020, arXiv:2008.13101. [Google Scholar]
  6. Benz, C.; Rodehorst, V. Omnicrack30k: A benchmark for crack segmentation and the reasonable effectiveness of transfer learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 3876–3886. [Google Scholar]
  7. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  8. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  9. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  10. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  11. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic road crack detection using random structured forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445. [Google Scholar] [CrossRef]
  12. Liu, Y.; Yao, J.; Lu, X.; Xie, R.; Li, L. DeepCrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 2019, 338, 139–153. [Google Scholar] [CrossRef]
  13. Hedeya, M.A.; Samir, E.; El-Sayed, E.; El-Sharkawy, A.A.; Abdel-Kader, M.F.; Moussa, A.; Abdel-Kader, R.F. A low-cost multi-sensor deep learning system for pavement distress detection and severity classification. In Proceedings of the International Conference on Advanced Machine Learning Technologies and Applications, Cairo, Egypt, 5–7 May 2022; pp. 21–33. [Google Scholar]
  14. Arya, D.; Maeda, H.; Ghosh, S.K.; Toshniwal, D.; Sekimoto, Y. RDD2020: An annotated image dataset for automatic road damage detection using deep learning. Data Brief 2021, 36, 107133. [Google Scholar] [CrossRef]
  15. Mei, Q.; Gül, M. A cost effective solution for pavement crack inspection using cameras and deep neural networks. Constr. Build. Mater. 2020, 256, 119397. [Google Scholar] [CrossRef]
  16. Mei, Q.; Gül, M.; Shirzad-Ghaleroudkhani, N. Towards smart cities: Crowdsensing-based monitoring of transportation infrastructure using in-traffic vehicles. J. Civ. Struct. Health Monit. 2020, 10, 653–665. [Google Scholar] [CrossRef]
  17. Chen, T.; Cai, Z.; Zhao, X.; Chen, C.; Liang, X.; Zou, T.; Wang, P. Pavement crack detection and recognition using the architecture of segNet. J. Ind. Inf. Integr. 2020, 18, 100144. [Google Scholar] [CrossRef]
  18. Arya, D.; Maeda, H.; Ghosh, S.K.; Toshniwal, D.; Omata, H.; Kashiyama, T.; Sekimoto, Y. Global road damage detection: State-of-the-art solutions. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 5533–5539. [Google Scholar]
  19. Maeda, H.; Sekimoto, Y.; Seto, T.; Kashiyama, T.; Omata, H. Road damage detection using deep neural networks with images captured through a smartphone. arXiv 2018, arXiv:1801.09454. [Google Scholar]
  20. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  21. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  22. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  23. Jain, J.; Li, J.; Chiu, M.T.; Hassani, A.; Orlov, N.; Shi, H. Oneformer: One transformer to rule universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2989–2998. [Google Scholar]
  24. Guo, F.; Qian, Y.; Liu, J.; Yu, H. Pavement crack detection based on transformer network. Autom. Constr. 2023, 145, 104646. [Google Scholar] [CrossRef]
  25. Faruk, A.N.; Liu, W.; Lee, S.I.; Naik, B.; Chen, D.H.; Walubita, L.F. Traffic volume and load data measurement using a portable weigh in motion system: A case study. Int. J. Pavement Res. Technol. 2016, 9, 202–213. [Google Scholar] [CrossRef]
  26. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  27. Computer Vision Annotation Tool (CVAT), Version 2.25.0. Available online: https://github.com/cvat-ai/cvat (accessed on 29 June 2018).
  28. OpenMMLab Detection Toolbox and Benchmark, Version 3.0.0. Available online: https://github.com/open-mmlab/mmdetection (accessed on 22 August 2018).
  29. OpenMMLab Semantic Segmentation Toolbox and Benchmark, Version 1.0.0. Available online: https://github.com/open-mmlab/mmsegmentation (accessed on 14 June 2020).
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  31. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  32. Saltık, A.O.; Allmendinger, A.; Stein, A. Comparative analysis of yolov9, yolov10 and rt-detr for real-time weed detection. arXiv 2024, arXiv:2412.13490. [Google Scholar]
  33. Liu, H.; Miao, X.; Mertz, C.; Xu, C.; Kong, H. Crackformer: Transformer network for fine-grained crack detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3783–3792. [Google Scholar]
  34. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  35. Arya, D.; Maeda, H.; Ghosh, S.K.; Toshniwal, D.; Sekimoto, Y. RDD2022: A multi-national image dataset for automatic road damage detection. Geosci. Data J. 2024, 11, 846–862. [Google Scholar] [CrossRef]
  36. Arya, D.; Maeda, H.; Ghosh, S.K.; Toshniwal, D.; Omata, H.; Kashiyama, T.; Sekimoto, Y. Crowdsensing-based road damage detection challenge (crddc’2022). In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; pp. 6378–6386. [Google Scholar]
  37. Arya, D.; Omata, H.; Maeda, H.; Sekimoto, Y. Orddc’2024: State of the art solutions for optimized road damage detection. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; pp. 8430–8438. [Google Scholar]
  38. Tang, K.; Niu, Y.; Huang, J.; Shi, J.; Zhang, H. Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3716–3725. [Google Scholar]
  39. Shin, S.-P.; Kim, K.; Le, T.H.M. Feasibility of advanced reflective cracking prediction and detection for pavement management systems using machine learning and image detection. Buildings 2024, 14, 1808. [Google Scholar] [CrossRef]
  40. Wang, C.; Liu, H.; An, X.; Gong, Z.; Deng, F. SwinCrack: Pavement crack detection using convolutional swin-transformer network. Digit. Signal Process. 2024, 145, 104297. [Google Scholar] [CrossRef]
  41. He, Z.; Su, C.; Deng, Y. A novel hybrid approach for concrete crack segmentation based on deformable oriented-YOLOv4 and image processing techniques. Appl. Sci. 2024, 14, 1892. [Google Scholar] [CrossRef]
  42. Ha, J.; Kim, D.; Kim, M. Assessing severity of road cracks using deep learning-based segmentation and detection. J. Supercomput. 2022, 78, 17721–17735. [Google Scholar] [CrossRef]
  43. Ochoa-Ruiz, G.; Angulo-Murillo, A.A.; Ochoa-Zezzatti, A.; Aguilar-Lobo, L.M.; Vega-Fernández, J.A.; Natraj, S. An asphalt damage dataset and detection system based on retinanet for road conditions assessment. Appl. Sci. 2020, 10, 3974. [Google Scholar] [CrossRef]
  44. Lin, C.-S.; Wang, C.-Y.; Wang, Y.-C.F.; Chen, M.-H. SemPLeS: Semantic prompt learning for weakly-supervised semantic segmentation. arXiv 2024, arXiv:2401.11791. [Google Scholar]
  45. Jo, S.; Pan, F.; Yu, I.-J.; Kim, K. DHR: Dual Features-Driven Hierarchical Rebalancing in Inter-and Intra-Class Regions for Weakly-Supervised Semantic Segmentation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 231–248. [Google Scholar]
  46. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  47. Wang, H.; Zhang, Q.; Li, Y.; Li, X. Allspark: Reborn labeled features from unlabeled in transformer for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 3627–3636. [Google Scholar]
  48. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; March, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 2096–2030. [Google Scholar]
  49. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
Figure 1. Example image illustrating the inclusion criteria for the annotated region on paved road surfaces. The red polygon highlighted in the figure indicates the area used for annotation. In detail, subfigure (a) corresponds to the left-side region of the vehicle, (b) to the right-side region, and (c) to the front-facing region. This region was not cropped or directly utilized during model training or evaluation.
Figure 2. Representative visual examples of pavement crack types included in our dataset. Each image illustrates the typical appearance of a specific crack category: (a) reflection cracking characterized by cracks aligned with underlying pavement joints, (b) longitudinal and edge cracking that runs parallel to the direction of traffic, (c) slippage-related cracking patterns such as corrugation and shoving, (d) rutting and surface depression resulting from repeated traffic loads, (e) construction joint cracking typically formed along paving lane boundaries, and (f) alligator cracking defined by interconnected, fatigue-induced cracks.
Figure 3. Example of crack annotations for semantic and instance segmentation. (a) shows a raw road image. (b) illustrates the semantic segmentation mask, where each pixel is labeled by crack type. (c) shows the instance segmentation mask, with each crack instance annotated separately using polygons. Colors indicate different crack categories as shown in the legend.
Figure 4. An example of ground truth and model predictions in semantic segmentation. (a) Ground truth; predictions from DeepLabv3+ (b), SegFormer (c), Mask2Former with ResNet-101 (d), and Mask2Former with the largest Swin Transformer setting (e).
Figure 5. An example of ground truth and model predictions in instance segmentation. (a) Ground truth; predictions from YOLOv7 (b), Mask2Former with ResNet-101 (c), and Mask2Former with the largest Swin Transformer setting (d).
Figure 6. Qualitative comparison of road crack segmentation results. Top row (ac): semantic segmentation results—(a) ground truth, (b) Mask2Former (Swin-Large) prediction, (c) SegFormer prediction. Bottom row (df): instance segmentation results—(d) ground truth, (e) Mask2Former (Swin-Large) prediction, (f) YOLOv7 prediction.
Table 1. Comparison of road crack datasets.

Dataset | Image Type | Task | Number of Classes | Dataset Size (Train/Val/Test)
Chen, T., et al. [17] | Patch | SEG | 1 | 10,000 (5000/2500/2500)
EdmCrack600 [16] | Ego-vehicle | SEG | 1 | 600 (420/60/120)
CPRID | Ego-vehicle | SEG | 1 | 2235 (2000/200/35)
RCDC [18] | Ego-vehicle | OD | 4 | 23,705 (18,930/2111/2664)
RDD [19] | Ego-vehicle | OD | 8 | 9053 (-)
OmniCrack30k [6] | Patch | SEG | 1 | 30,017 (22,158/3277/4582)
PPDD (Ours) | Ego-vehicle | SEG | 6 | 204,839 (163,871/20,484/20,484)

Image type: perspective of the captured images (e.g., patch-based, ego-vehicle view); task: type of annotation task (SEG: semantic segmentation, OD: object detection); dataset size: number of images (training/validation/test splits).
Table 2. Quantitative evaluation results for each class and overall performance in semantic segmentation.

(%) | RC | LEC | CSSC | RDC | CJC | AC | Background | Total
Mask2Former-r101, PA | 48.71 | 53.90 | 42.50 | 69.33 | 60.33 | 64.39 | 97.46 | 62.37
Mask2Former-r101, mIoU | 33.80 | 38.70 | 29.13 | 36.79 | 40.31 | 54.62 | 95.42 | 46.96
Mask2Former-swin-l, PA | 48.45 | 56.03 | 40.67 | 64.75 | 62.90 | 65.84 | 97.98 | 62.37
Mask2Former-swin-l, mIoU | 34.36 | 41.82 | 34.92 | 39.89 | 41.76 | 56.19 | 95.90 | 49.26
SegFormer, PA | 40.84 | 58.64 | 53.92 | 73.10 | 58.17 | 81.96 | 99.10 | 66.53
SegFormer, mIoU | 35.13 | 50.74 | 48.31 | 62.27 | 49.19 | 72.55 | 97.52 | 59.39
DeepLabv3+, PA | 39.76 | 48.67 | 36.29 | 46.54 | 46.45 | 68.38 | 98.70 | 54.97
DeepLabv3+, mIoU | 31.95 | 38.14 | 27.05 | 37.86 | 36.93 | 57.18 | 96.19 | 46.47

Performance metrics are expressed as percentages and rounded to two decimal places. RC: reflection cracking, LEC: longitudinal–edge cracking, CSSC: corrugation, shoving, and slippage cracking, RDC: rutting and depression cracking, CJC: construction joint cracking, AC: alligator cracking.
Table 3. Quantitative evaluation results for each class and overall performance in instance segmentation.

(%) | RC | LEC | CSSC | RDC | CJC | AC | Total
Mask2Former-r101, AP50 | 45.90 | 50.70 | 49.10 | 62.60 | 64.90 | 69.70 | 57.15
Mask2Former-r101, AP | 15.40 | 17.10 | 18.70 | 36.60 | 29.60 | 43.30 | 26.78
Mask2Former-swin-s, AP50 | 51.90 | 56.50 | 57.60 | 66.70 | 70.10 | 76.30 | 63.18
Mask2Former-swin-s, AP | 18.50 | 20.00 | 22.70 | 41.40 | 34.60 | 49.40 | 31.10
YOLOv7, AP50 | 57.40 | 65.10 | 80.20 | 84.00 | 84.30 | 87.20 | 76.37
YOLOv7, AP | 21.30 | 25.20 | 38.30 | 55.90 | 45.20 | 57.50 | 40.57

Performance metrics are expressed as percentages and rounded to two decimal places. RC: reflection cracking, LEC: longitudinal–edge cracking, CSSC: corrugation, shoving, and slippage cracking, RDC: rutting and depression cracking, CJC: construction joint cracking, AC: alligator cracking.
