1. Introduction
Road infrastructure is fundamental to economic growth, social mobility, and overall quality of life. Global road length is estimated at approximately 39.7 million km [1]. In most regions, the share of paved roads ranges between 60 and 80% [2], although sizable gaps remain in parts of Asia and Africa. OpenStreetMap (OSM) data indicate that about 47% of roads in Germany are asphalted [3]. Given OSM incompleteness, this figure should be regarded as a lower bound; actual coverage is likely higher. Public investment in land transport infrastructure (rail, road, and waterways) typically amounts to ~0.1–3% of Gross Domestic Product (GDP); in most countries, roads account for the majority of expenditures. For example, Poland has averaged ~0.7% of GDP since 2014, with 75–80% directed to roads [4,5].
Among defects of asphalt pavements, cracks are both ubiquitous and critical [6]. They serve as initiation points for more severe failures (potholes, raveling, premature layer degradation), driving life-cycle costs upward and elevating safety risks. Reliable, timely crack detection is therefore essential for preventive maintenance planning and sustainable asset management.
Against this backdrop, computer vision and deep learning have become central to automated inspection pipelines, increasingly coupled with unmanned aerial vehicles (UAVs) and vehicle-mounted acquisition systems [7,8,9,10,11]. Contemporary studies span both detection and pixel-level segmentation. Reported detection systems include members of the You Only Look Once (YOLO) and Faster R-CNN families, often augmented with attention mechanisms; high mean average precision (mAP) values (e.g., ~91.48%) and even crack-width estimation with ~10% error have been documented [12,13]. Pixel-level semantic segmentation provides finer localization on heterogeneous and noisy backgrounds. Improvements to fully convolutional network (FCN) and U-Net-like architectures via spatial/channel attention or enhanced decoders have yielded higher Intersection over Union (IoU) and F1-score values [14,15]. To address data scarcity, synthetic datasets have been explored [16], while multimodal fusion (e.g., UAV imagery with Light Detection and Ranging (LiDAR)) has demonstrated advantages on complex surfaces [17]. Survey papers note the growing prevalence of convolutional neural network (CNN) and Transformer-based pixel segmentation while emphasizing persistent challenges: dependence on high-quality annotations, limited cross-surface generalization, and sensitivity to capture conditions (illumination changes, perspective, shadows) [18,19,20].
In recent years, most works on asphalt crack segmentation have relied on CNN-based architectures enhanced with targeted decoder or attention improvements. For instance, modifications of DeepLabV3+ with multi-level attention modules have been shown to improve generalization over U-Net on complex asphalt backgrounds, though such solutions require substantial architectural changes and careful hyperparameter tuning [21]. Specialized networks such as MixCrackNet employ deformable convolutions, weighted loss functions, and other techniques to increase robustness against noise, but these improvements come at the cost of greater architectural complexity and computational overhead [22]. Research on lightweight variants is also emerging; however, such models are often trained and tested on controlled datasets using small image fragments captured at close range, which limits transferability to field conditions with greater variation in perspective, illumination, and surface textures [23].
In addition to general-purpose CNN and Transformer backbones, several crack-specific architectures have been proposed. DeepCrack employs a hierarchical CNN with multi-scale deep supervision to capture fine crack structures, achieving high accuracy on benchmark datasets but being limited to relatively low-resolution imagery and small training sets [24]. CrackFormer and its extensions introduce Transformer-based modules tailored for thin-structure extraction and deliver state-of-the-art results on public benchmarks, but their high computational demands and reliance on lower-resolution inputs constrain their applicability in large-scale, high-resolution field monitoring [25].
In parallel, Transformer-based approaches and modern benchmark architectures have gained attention. Enhanced versions of Swin-UNet for road crack segmentation have demonstrated improved contour continuity, while studies on SegFormer report advantages over classical CNNs on specific datasets. Nevertheless, these models generally require larger datasets and computational resources, and are frequently evaluated on images obtained very close to the surface (or on small tiles), which does not always reflect operational monitoring conditions. Recent initiatives involving UAV-based datasets (e.g., DronePavSeg) and encoder–decoder designs such as ConvNeXt-UPerNet underscore a trend toward heavier architectures that are better suited to large-scale datasets and structured acquisition protocols [26,27,28,29].
Another important direction is multimodality and the use of external priors. Fusion of UAV imagery with low-cost LiDAR (intensity/height) or the use of hyperspectral data has been shown to significantly improve segmentation performance on asphalt cracks but requires additional sensors, synchronization, and more complex data acquisition pipelines. This indirectly demonstrates that supplementary information beyond raw RGB can help models focus on relevant regions; however, conventional approaches for obtaining such information increase system cost and complicate deployment [17,30].
Across this body of work, several recurring limitations can be identified. Many methods rely on substantial architectural modifications, such as complex attention modules, deformable convolutions, or customized decoder designs, which increase model complexity and computational costs. Others depend on close-range imagery or small input sizes, limiting their transferability to large-scale monitoring scenarios. Performance is often sensitive to dataset scale and annotation quality, while computational demands remain high, making deployment in real-world infrastructure monitoring challenging. These limitations motivate the use of auxiliary prior information in the form of a crack-probability map. In the present study, the probability map is produced by an ensemble of classification models operating at multiple spatial scales (including lightweight CNNs and Transformer architectures) and then injected as a fourth input channel to the segmentation network. In contrast to work conducted under standardized laboratory conditions, the evaluation relies on field imagery (UAV and action cameras) without controlled distance or illumination, aligning the methodology with real-world monitoring scenarios. The primary objective is a quantitative assessment of the fourth-channel integration for improving crack segmentation, together with an evaluation of the role of Transformer-based classifiers within the ensemble relative to lightweight CNN alternatives. The novelty of this work lies in showing that ensemble-derived probability maps can systematically enhance segmentation accuracy of asphalt pavement cracks under real-world conditions while keeping computational costs nearly unchanged. Treated as an external, modular prior, the probability map offers an interpretable complement or alternative to built-in attention mechanisms while remaining practical for large-scale infrastructure monitoring. Performance is benchmarked against three-channel baselines, a reference model (YOLO11x), and two recent Transformer-based segmentation architectures (SegFormer B2 and SwinUNet Tiny), enabling conclusions on accuracy, robustness, and computational practicality for deployment.
The remainder of the paper is organized as follows: Section 2 details the methodology (dataset, models, and ensemble construction), Section 3 presents the results with quantitative and qualitative comparisons of three- and four-channel configurations, Section 4 provides the discussion, and Section 5 concludes the paper.
2. Methods
2.1. Dataset Description
The dataset was constructed using asphalt pavement images obtained from two open-access sources: UAV flights [31] and the EdmCrack600 dataset collected with an action camera mounted on a passenger vehicle [32]. The UAV imagery (117 images in total) was acquired with a DJI Mini 2 (SZ DJI Technology Co., Ltd., Shenzhen, China) drone at varying altitudes under sunny conditions, capturing top-down views with high spatial resolution. The action camera imagery (818 images in total) was recorded using a GoPro Hero 7 Black (GoPro Inc., San Mateo, CA, USA) at an oblique 45° angle during real traffic, including challenging conditions such as the presence of other vehicles, roadside vegetation, snow and ice, shadows, and wet pavement with sun glare. All original images were captured at a resolution of 1920 × 1080 pixels.
The dataset was then processed to create uniform training samples. Images were cropped into square tiles of 1080 × 1080 pixels and subsequently resized to 1280 × 1280 to avoid distortions while preserving crack details. In cases where cracks spanned the entire image width, frames were split into two tiles to enlarge the dataset. The final dataset comprised 935 images in total, with 702 allocated to the training set and 233 to the validation set. To prevent data leakage, tiles derived from the same source frame were always kept within the same subset.
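A minimal sketch of this tiling step, assuming OpenCV; the choice of left/right crop positions and the interpolation mode are our assumptions and may differ from the original preprocessing:

```python
import cv2

def tile_frame(image, tile=1080, out_size=1280):
    """Crop a 1920x1080 frame into square tiles and upscale them.

    Mirrors the preprocessing described above: two 1080x1080 crops
    (used when a crack spans the full width), each resized to
    1280x1280. Crop positions and interpolation are assumptions.
    """
    h, w = image.shape[:2]                      # expected 1080 x 1920
    crops = [image[:, :tile], image[:, w - tile:]]
    return [cv2.resize(c, (out_size, out_size), interpolation=cv2.INTER_CUBIC)
            for c in crops]
```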
For segmentation tasks, ground-truth annotations were prepared as binary masks in PNG format, with each pixel labeled either “crack” or “background.” For YOLO11x-seg, these masks were further converted using OpenCV (Open Source Computer Vision Library) [22], which identified all contour points of the masks. The coordinates of these points were normalized by dividing them by the width and height of the image and then stored in a list for subsequent use. Examples of training and validation images together with their corresponding masks are illustrated in Figure 1.
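A minimal sketch of this mask-to-polygon conversion, assuming OpenCV 4.x and the single-class YOLO-seg label format; the function name and contour filtering are illustrative:

```python
import cv2

def mask_to_yolo_polygons(mask_path, class_id=0):
    """Convert a binary PNG mask into normalized YOLO-seg polygons,
    as outlined above: extract contour points with OpenCV and divide
    their coordinates by the image width and height."""
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    h, w = mask.shape
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    lines = []
    for cnt in contours:
        if len(cnt) < 3:                        # skip degenerate contours
            continue
        coords = []
        for x, y in cnt.reshape(-1, 2):
            coords += [x / w, y / h]            # normalize to [0, 1]
        lines.append(f"{class_id} " + " ".join(f"{c:.6f}" for c in coords))
    return lines
```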
For classification tasks, additional image patches of 32 × 32, 64 × 64, and 128 × 128 pixels were generated. Patches containing cracks were selected such that at least 10% of pixels belonged to the crack class, and an equal number of crack-free patches were added to maintain class balance. In total, 51,494 training patches and 17,244 validation patches were produced, evenly distributed between the two classes.
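The patch-sampling rule can be sketched as follows; only the ≥10% crack-pixel criterion and the class balancing come from the text, while the grid-based scan and helper names are assumptions:

```python
import random

def extract_patches(image, mask, size, crack_frac=0.10, seed=0):
    """Grid-sample square patches: keep those with at least 10% crack
    pixels as positives and an equal number of crack-free patches as
    negatives, yielding a balanced two-class set."""
    positives, negatives = [], []
    h, w = mask.shape
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            m = mask[y:y + size, x:x + size]
            if (m > 0).mean() >= crack_frac:
                positives.append(image[y:y + size, x:x + size])
            elif not (m > 0).any():
                negatives.append(image[y:y + size, x:x + size])
    random.Random(seed).shuffle(negatives)
    return positives, negatives[:len(positives)]   # balanced classes
```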
2.2. Segmentation Models
Several representative segmentation architectures were employed to address the crack segmentation task:
U-Net [33]—a symmetric encoder–decoder network with skip connections that transfer information from early layers to the corresponding decoding blocks. This design helps reconstruct fine crack boundaries and improves accuracy on small datasets.
Efficient Neural Network (ENet) [34]—a compact real-time segmentation architecture with a large encoder and minimalistic decoder. The use of PReLU activations improves the model’s capacity to represent negative values and reduces information loss.
High-Resolution Network (HRNet) [35]—a high-resolution network that maintains multi-scale feature representations throughout the processing pipeline. This enables the accurate capture of thin cracks against complex asphalt textures.
DeepLabV3+ [36]—a model that integrates an Atrous Spatial Pyramid Pooling (ASPP) module for multi-scale feature extraction with a decoder for boundary refinement. Dilated convolutions make the network robust to textural variations.
All four models are widely available in open-source toolchains, can be modified to accept a fourth input channel by adjusting only the first convolutional layer, and were trained under a unified optimization setup (Section 2.3). Taken together, they cover a broad spectrum of accuracy–efficiency trade-offs (from real-time to high-capacity settings), allowing an informative analysis of whether the probability map consistently improves segmentation quality across architectures and how such gains relate to computational constraints relevant for field deployment.
2.3. Training of Segmentation Models
All segmentation models were trained under consistent hyperparameters to ensure comparability across architectures. The Adam optimizer was employed with an initial learning rate of 0.0001, and BCEWithLogitsLoss was selected as the loss function, being well suited for binary pixel-level classification tasks. Batch size varied from 4 to 16 depending on GPU memory constraints. To mitigate overfitting, an early stopping mechanism was applied, halting training if validation loss or IoU did not improve for 50 consecutive epochs.
The hyperparameters used for segmentation model training are summarized in Table 1.
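A condensed PyTorch sketch of this unified setup; the data loaders are assumed to yield image/mask tensor pairs, and for brevity only the IoU-based checkpoint is shown, whereas the study also tracks a loss-based checkpoint:

```python
import torch
from torch import nn

@torch.no_grad()
def evaluate_iou(model, loader, device, thresh=0.5):
    """Validation IoU over the whole split (binary masks)."""
    model.eval()
    inter = union = 0
    for images, masks in loader:
        pred = torch.sigmoid(model(images.to(device))) > thresh
        gt = masks.to(device) > 0.5
        inter += (pred & gt).sum().item()
        union += (pred | gt).sum().item()
    return inter / max(union, 1)

def train_segmentation(model, train_loader, val_loader, device="cuda", patience=50):
    """Adam (lr = 1e-4), BCEWithLogitsLoss, early stopping after 50
    epochs without validation-IoU improvement, as in Table 1."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.BCEWithLogitsLoss()
    best_iou, stale = 0.0, 0
    while stale < patience:
        model.train()
        for images, masks in train_loader:      # masks: float tensors in {0, 1}
            optimizer.zero_grad()
            loss = criterion(model(images.to(device)), masks.to(device))
            loss.backward()
            optimizer.step()
        val_iou = evaluate_iou(model, val_loader, device)
        if val_iou > best_iou:
            best_iou, stale = val_iou, 0
            torch.save(model.state_dict(), "best_by_iou.pt")
        else:
            stale += 1
    return best_iou
```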
The training was conducted on a personal workstation with the following specifications:
CPU: AMD Ryzen 5 5600
RAM: 32 GB
GPU: Nvidia RTX 4060 Ti
OS: Ubuntu 22.04
In Table 2, the training results of segmentation models with three-channel input are presented, allowing for a direct comparison of baseline architectures under identical settings.
In summary, DeepLabV3+ attains the highest IoU (up to 0.5697 when saved by IoU). U-Net shows very high precision (≈0.93) but low recall (≈0.23–0.37), which depresses IoU. ENet reaches moderate IoU (up to 0.3063) with a lightweight design, while HRNet underperforms on this dataset (IoU ≈ 0.07–0.08). These observations motivate selecting DeepLabV3+ as the baseline backbone for evaluating probability-map variants in Section 2.7.
2.4. Evaluation Metrics
All models were evaluated using consistent metrics, with additional comparisons made against the original YOLO11x-seg, SegFormer B2, and SwinUNet Tiny models. The machine learning models in this study were assessed based on the following metrics: precision, recall, F1-score, IoU, and loss [37].
In semantic segmentation, precision quantifies the ratio of correctly predicted pixels of the target class to the total number of pixels predicted as that class:

\[ \text{Precision} = \frac{TP}{TP + FP} \]

where TP (True Positive) denotes the number of pixels correctly classified as belonging to the target class, and FP (False Positive) represents the number of pixels incorrectly assigned to that class.
Recall reflects the model’s ability to identify all relevant pixels of the target class. It is defined as the fraction of correctly detected positive pixels out of all actual positive pixels:

\[ \text{Recall} = \frac{TP}{TP + FN} \]

where FN (False Negative) indicates the number of pixels from the target class that were not detected.
To balance precision and recall, the F1-score is used. It is the harmonic mean of these two metrics, providing a single measure that accounts for both false positives and false negatives:

\[ F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
Another widely adopted metric is the Intersection over Union (IoU), also known as the Jaccard Index, which evaluates the spatial agreement between the predicted segmentation mask and the ground truth mask:

\[ \text{IoU} = \frac{\text{Intersection}}{\text{Union}} = \frac{TP}{TP + FP + FN} \]

where Intersection corresponds to the number of pixels shared by both masks, and Union is the total number of pixels present in at least one of them (either predicted or ground truth).
These metrics collectively assessed both localization accuracy and completeness of crack detection.
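For reference, these pixel-level metrics can be computed directly from binary masks; the epsilon guard against empty masks is an implementation convenience, not part of the definitions:

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Pixel-level precision, recall, F1, and IoU for binary masks,
    following the definitions above (pred, gt: boolean numpy arrays)."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    eps = 1e-9                                   # guard against empty masks
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, f1, iou
```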
2.5. Classification Models
To generate probability maps, a broad spectrum of models was investigated, covering both classical CNNs and Transformer-based architectures: AlexNet, MobileNetV2, MobileNetV3, ShuffleNet V2, Swin Transformer (Swin-Tiny), and Vision Transformer (ViT Tiny).
AlexNet [38]—one of the pioneering deep CNNs, comprising five convolutional layers (some followed by max pooling) and three fully connected layers, with ~60 million parameters and 650,000 neurons. ReLU activations improved gradient propagation and nonlinearity, while Dropout in fully connected layers reduced overfitting by randomly deactivating neurons during training.
MobileNetV2/V3 [39,40]—lightweight networks optimized for mobile and low-power devices. MobileNetV2 introduces inverted residual blocks, linear bottlenecks (with no ReLU activation at compressed stages), and depthwise separable convolutions with an expansion factor, allowing efficient feature retention. MobileNetV3 builds upon this with automated neural architecture search (NAS), the hard swish activation function, and Squeeze-and-Excitation (SE) blocks for dynamic channel recalibration, achieving an improved trade-off between accuracy and efficiency.
ShuffleNet V2 [41]—an efficient CNN optimized for mobile devices, evolving from ShuffleNet V1 [42]. While V1 relied on group convolutions and channel shuffle operations to reduce complexity, V2 removed group convolutions due to their memory overhead and introduced the channel split operation. At the start of each block, channels are divided into two branches: One remains unchanged, while the other undergoes convolutional transformations before recombination and shuffling. This structure enhances information exchange with reduced computational cost. Multiple variants (0.5×, 1.0×, 1.5×, 2.0×) were proposed to balance accuracy and speed.
ViT Tiny [43]—the first model to apply a Transformer encoder directly to image processing. Images are divided into fixed-size patches (e.g., 16 × 16), linearly projected into vectors, and processed by a standard Transformer encoder, as in NLP tasks. This enables the learning of global dependencies without convolutions. The Tiny version uses fewer layers and parameters, making it suitable for small datasets and resource-limited tasks.
Swin-Tiny [44]—a hierarchical Transformer with shifted windows. Instead of processing entire images globally, it divides them into fixed-size local windows and gradually shifts them to improve cross-window communication. This design scales efficiently to larger images while capturing both fine local details (e.g., hairline cracks) and global structural patterns.
2.6. Training of Classification Models
The classification models were trained on balanced sets of patches (with and without cracks). The AdamW optimizer was used with a learning rate of 0.0001, while BCELoss served as the loss function. A batch size of 64 ensured stable and efficient convergence. An early stopping strategy monitored loss and F1 score to avoid overfitting. Both randomly initialized and pre-trained Transformer-based models were tested, with the pre-trained variants showing substantially better performance on the relatively small dataset.
The performance of the classification models was evaluated on balanced datasets of patches containing cracks and non-crack regions. The results of training are summarized in Table 3, which reports the values of loss, accuracy, precision, recall, and F1-score for all architectures considered.
A review of the results shows that among convolutional architectures, MobileNetV2 achieved the most stable and competitive performance across all metrics (accuracy = 0.906, F1-score = 0.903). Both MobileNetV3-Large and ShuffleNetV2 also demonstrated strong results, slightly trailing behind MobileNetV2. In contrast, AlexNet, while historically important, performed considerably worse, highlighting its limitations compared to more recent lightweight CNNs.
When comparing Transformer-based approaches, it is evident that pretrained models significantly outperformed randomly initialized counterparts. For instance, Swin-Tiny Pretrained and ViT-Tiny Pretrained attained accuracy levels of 0.932 and 0.920, respectively, with F1-scores above 0.92, surpassing all CNN-based models. This finding underscores the effectiveness of transfer learning in scenarios with limited training data.
Overall, the best-performing models were Swin-Tiny Pretrained and ViT-Tiny Pretrained, confirming the suitability of Transformer-based architectures, particularly when pretrained weights are utilized, for crack versus non-crack classification tasks.
2.7. Ensemble Construction
The primary aim of this study is to rigorously evaluate the effectiveness of using an auxiliary (fourth) input channel—composed of crack-probability maps generated by an ensemble of binary classification models—for improving segmentation accuracy. In addition, the study assesses the advisability of incorporating Transformer-based classifiers into the ensemble in comparison with lightweight CNN alternatives.
At the first stage, 54 datasets were generated: six for each classification model. These six datasets resulted from three patch scales (32 × 32, 64 × 64, 128 × 128) combined with two model checkpoints (saved by loss and F1 criteria). Each dataset was used to train the DeepLabV3+ segmentation model, allowing the evaluation of probability maps produced by classification architectures. The IoU metric served as the primary comparison criterion.
Table 4 reports results for all classification models, while Table 5 and Table 6 present the best results along with representative cases from Transformer-based models. To avoid redundancy, only the best result (between loss and F1 checkpoints) was reported. Results were ranked in descending order of IoU. The analysis revealed that MobileNetV2 consistently achieved the highest scores across all patch sizes (32 × 32, 64 × 64, 128 × 128), while Transformer architectures (ViT Tiny, Swin Tiny) generally underperformed in generating accurate probability maps.
The comparative analysis highlights that, although Transformer-based models initially achieved superior results during the classification training phase, their effectiveness in generating probability maps for crack segmentation was notably lower compared to CNN-based models. Furthermore, CNN architectures consistently demonstrated faster inference, with substantially reduced processing time per image, which makes them more practical for large-scale applications.
At the second stage, based on the results in Table 4, MobileNetV2 was chosen as the baseline for ensemble construction due to its superior IoU performance. Additional tests combined it with the next three best models at 32 × 32 and 64 × 64 scales. The aim was to determine whether architectural diversity could reduce false predictions and improve stability. The best performance at this stage was obtained with the combination of 32 × 32 MobileNetV2 and 64 × 64 MobileNetV2, which consistently achieved the highest IoU and recall among the tested ensembles. The corresponding results are reported in Table 7.
The third stage examined whether incorporating models trained on larger patches (128 × 128) could further enhance performance. Building on the best two-model ensemble identified in Table 7, the top three models at the 128 × 128 scale were added sequentially. As summarized in Table 8, the optimal configuration was achieved with the ensemble of 32 × 32 MobileNetV2 + 64 × 64 MobileNetV2 + 128 × 128 MobileNetV2, where the 128 × 128 model was selected based on the F1-score checkpoint. This three-model combination provided the highest overall IoU performance at this stage.
At the final stage, the best-performing three-model ensemble identified in Table 8 was extended by incorporating an additional 32 × 32 model. For this purpose, the second- to fourth-highest-ranked models from Table 4 at the 32 × 32 scale were sequentially tested as candidates. This approach was intended to evaluate whether increasing model diversity at the finest resolution could further improve the accuracy of the generated probability maps. The detailed results of these experiments are presented in Table 9.
According to Table 9, the final ensemble comprised:
32 × 32 MobileNetV2 (loss)
32 × 32 ShuffleNet V2 × 1.0 (loss)
64 × 64 MobileNetV2 (loss)
128 × 128 MobileNetV2 (F1)
Figure 2 shows the probability maps produced by each model in the ensemble, along with the combined probability map indicating the presence of a crack.
As illustrated, employing a pair of classifiers on 32 × 32 patches together with another pair on larger segments markedly reduced false-positive findings while maintaining fine detail in the resulting probability maps.
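A schematic of how such a multi-scale probability map might be assembled; the non-overlapping scan (stride equal to the patch size), the scalar sigmoid output per patch, and plain averaging across scales are assumptions rather than the exact fusion rule used in the study:

```python
import torch

@torch.no_grad()
def ensemble_probability_map(image, classifiers, device="cuda"):
    """Build the auxiliary channel from the multi-scale ensemble.

    `classifiers` maps patch size -> binary classifier returning a
    sigmoid crack probability per patch; `image` is a 3 x H x W tensor.
    """
    _, h, w = image.shape
    per_scale = []
    for size, model in classifiers.items():
        model.eval()
        prob = torch.zeros(h, w)
        for y in range(0, h - size + 1, size):
            for x in range(0, w - size + 1, size):
                patch = image[:, y:y + size, x:x + size].unsqueeze(0).to(device)
                # broadcast the patch score over its spatial footprint
                prob[y:y + size, x:x + size] = model(patch).item()
        per_scale.append(prob)
    return torch.stack(per_scale).mean(dim=0)    # combined map in [0, 1]
```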
Based on the probability maps generated by the ensemble, all considered segmentation models were additionally retrained with a four-channel input (RGB + probability map). For this purpose, only the first convolutional layer of each network was modified to accept four channels, with its weights initialized as the mean of the RGB filters; no other architectural changes were introduced. Detailed numerical results and comparisons are presented in Section 3, providing a comprehensive evaluation of ensemble effectiveness.
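The first-layer modification can be expressed compactly in PyTorch; the helper name is illustrative, but the initialization (the new filter set to the mean of the RGB filters) follows the description above:

```python
import torch
from torch import nn

def expand_first_conv(conv: nn.Conv2d) -> nn.Conv2d:
    """Replace a 3-channel first convolution with a 4-channel one,
    copying the RGB filters and initializing the new probability-map
    filter as their mean; all other layers stay unchanged."""
    new = nn.Conv2d(4, conv.out_channels, conv.kernel_size,
                    stride=conv.stride, padding=conv.padding,
                    bias=conv.bias is not None)
    with torch.no_grad():
        new.weight[:, :3] = conv.weight                  # reuse RGB filters
        new.weight[:, 3:] = conv.weight.mean(dim=1, keepdim=True)
        if conv.bias is not None:
            new.bias.copy_(conv.bias)
    return new
```

In practice this replaces, for example, a backbone's first convolution module (the exact attribute path depends on the implementation), leaving all remaining weights intact.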
3. Results
Based on the multi-stage evaluation detailed in Section 2.7, the final ensemble consisted of four models: 32 × 32 MobileNetV2, 64 × 64 MobileNetV2, 128 × 128 MobileNetV2 (F1 checkpoint), and 32 × 32 ShuffleNetV2. Using this ensemble, all segmentation architectures were retrained with RGB + probability input and compared against their RGB (3-channel) baselines on the same validation split. For benchmarking, we also included Transformer-based segmentation models (SegFormer B2, SwinUNet Tiny) and YOLO11x-seg. This setup enabled direct assessment of three- vs. four-channel configurations in comparison with both Transformer-based architectures and a widely used object-detection-driven baseline.
Table 10, Table 11, Table 12, Table 13 and Table 14 summarize validation precision, recall, F1-score, IoU, and loss for each model in two settings—best checkpoint selected by loss and by IoU—and report the relative change (Δ%) from three-channel to four-channel inputs. Overall, the four-channel configuration improves performance across all architectures. For example, DeepLabV3+ improves IoU from 0.5697 → 0.6062 (Δ +6.41%) when selected by IoU, with consistent gains in precision, recall, and F1-score; ENet shows larger relative gains (e.g., IoU +59.48% when selected by IoU); and U-Net exhibits IoU improvements of +1.44–23.72% depending on the checkpointing criterion. Very large relative increases observed for HRNet stem from low three-channel baselines. Among Transformer-based models, SegFormer B2 and SwinUNet Tiny achieve IoU values of 0.2719 and 0.5649, respectively, which remain below the best-performing CNN-based four-channel configurations. As a reference, YOLO11x-seg attains an IoU of 0.4987, also lower than DeepLabV3+.
The remainder of this section presents consolidated, metric-wise comparisons for all architectures (3- vs. 4-channel inputs) and a concise qualitative illustration of typical segmentation improvements achieved with the additional probability map.
Table 10 reports validation precision for each segmentation architecture under two configurations: baseline three-channel (RGB) input and four-channel (RGB + probability map) input. Results are reported for the best checkpoint selected by loss and by IoU, with YOLO11x-seg, SegFormer B2, and SwinUNet Tiny included as external benchmarks.
Adding the probability map increases precision for all reported models. DeepLabV3+ improves by +4.05% (saved by loss) and +2.09% (saved by IoU), ENet by +6.63%/+4.21%, and U-Net by +2.18%/+3.03%, which also attains the highest absolute precision among all configurations (0.9512 when saved by loss). HRNet exhibits very large relative gains (+977%/+649%) due to extremely low three-channel baselines; its four-channel precision (≈0.79–0.80) falls within the range of the other CNNs. Although the Transformer-based models show strong precision overall—with SegFormer B2 achieving 0.9247 (loss) and 0.8940 (IoU), and SwinUNet Tiny reaching 0.8418 (loss) and 0.8165 (IoU)—they do not surpass the best CNN results. In particular, both the baseline and the four-channel U-Net configurations outperform them, underscoring the effectiveness of the proposed probability map approach. The reference YOLO11x-seg precision (0.6428) is lower than all four-channel CNN-based models as well as the Transformer benchmarks.
Table 11 lists validation recall for each architecture with three-channel (RGB) and four-channel (RGB + probability map) inputs under two checkpointing criteria (saved by loss/by IoU). Δ% denotes the relative change from the three-channel baseline.
Incorporating the probability map leads to clear recall improvements for CNN-based models. The gains are moderate for DeepLabV3+ (+5.46% when saved by loss, +6.29% when saved by IoU) and marginal for U-Net under the IoU criterion (+0.11%), while ENet and HRNet exhibit the largest relative increases (+110.73%/+72.43% and +142.22%/+268.03%, respectively) due to low three-channel baselines. The Transformer-based benchmarks also achieve competitive results: SegFormer B2 reports a recall of 0.2816 (loss) and 0.4395 (IoU), while SwinUNet Tiny reaches 0.6100 (loss) and 0.6485 (IoU). Although strong, these values do not surpass the top-performing four-channel CNNs such as HRNet and DeepLabV3+. The YOLO11x reference (≈0.708) is comparable to the three-channel DeepLabV3+ but remains below its four-channel configuration.
Table 12 reports the F1-score values for the three- and four-channel models. In contrast to precision and recall, here the improvements appear more consistent across nearly all architectures. The relative gains range from modest (+1.65% for U-Net, IoU) to very substantial (over +67% for ENet, loss; and +518% for HRNet, IoU).
Adding the probability map improves the harmonic mean of precision and recall across most CNN-based models. DeepLabV3+ shows consistent but moderate gains (+4.87%/+4.32%), ENet achieves the largest relative increases (+67.17%/+40.62%), while U-Net records smaller improvements (+19.98%/+1.65%). HRNet again exhibits very large relative growth due to weak three-channel baselines, reaching competitive values in its four-channel form. Transformer-based models (SegFormer B2, SwinUNet Tiny) perform strongly in baseline settings, with F1 ≈ 0.57–0.72, but do not surpass the top CNN results with probability-map integration. The YOLO11x-seg baseline (0.6461) remains below the stronger configurations.
Table 13 summarizes the IoU metrics. Here, the integration of the probability map yields particularly strong gains, with HRNet and ENet showing the most dramatic relative improvements (over +700% and +95%, respectively). Even in stronger baselines such as DeepLabV3+, the additional channel still contributes to a measurable increase (+7.40% by loss, +6.41% by IoU).
Probability maps yield systematic improvements in overlap accuracy for all CNN-based models. DeepLabV3+ increases by +7.40% (loss) and +6.41% (IoU), ENet by +95.42%/+59.48%, and U-Net by +23.72%/+1.44%. HRNet again shows very large relative gains, surpassing 0.64 in its four-channel configuration. SegFormer B2 and SwinUNet Tiny demonstrate solid IoU values in baseline form (0.27–0.56) but remain below the strongest CNN configurations with probability-map integration. YOLO11x-seg attains an IoU of 0.4987, also lower than four-channel DeepLabV3+.
Table 14 presents the validation losses. In most cases, the loss values slightly increase after adding the probability map. However, it is important to note that the highest recorded value (0.0252) still remains very low in absolute terms.
Validation losses show mixed trends. For DeepLabV3+ and HRNet, the additional channel increases the loss slightly (+12.31%/+37.14%), but absolute values remain very low (<0.023). ENet records minor reductions in loss when selected by loss (−9.28%), while U-Net stays within a narrow range (−3.03%/+12.70%). SegFormer B2 achieves the lowest absolute loss (≈0.0126–0.0158), while SwinUNet Tiny reports higher values (>0.23), reflecting dataset scale sensitivity. These results confirm that even when the loss metric rises locally, the overlap-based metrics (IoU/F1) consistently improve with the probability map.
Figure 3 presents qualitative examples of crack segmentation, comparing baseline three-channel models with their four-channel counterparts, where the additional channel corresponds to the probability map generated by the classification ensemble. The examples include both drone-captured and action camera images, enabling an assessment of model robustness across different acquisition conditions.
As illustrated, only DeepLabV3+ consistently produces accurate crack delineations across both types of images, showing high alignment with the ground truth. Other architectures, particularly when processing action camera inputs, exhibit weaker baseline performance; however, the integration of the probability map systematically improves the coverage and continuity of predicted masks. HRNet demonstrates the most notable relative gains, with its weak three-channel baseline transformed into competitive four-channel outputs. U-Net and ENet also benefit, albeit to different extents, confirming that the proposed strategy is adaptable across diverse CNN families.
Figure 3 thus reinforces the quantitative findings from Table 10, Table 11, Table 12, Table 13 and Table 14: The probability map enhances both detection sensitivity and boundary consistency while limiting fragmented predictions. Although Transformer-based models such as SegFormer B2 and SwinUNet Tiny achieve strong precision and recall without modification, they do not surpass the best four-channel CNNs, such as DeepLabV3+. Importantly, all four-channel CNN configurations outperform the YOLO11x-seg reference under identical validation conditions.
Overall, the results highlight that incorporating ensemble-derived probability maps as an auxiliary channel provides a robust and computationally efficient means of improving crack segmentation, offering a practical complement or alternative to heavier state-of-the-art architectures.
4. Discussion
The proposed approach integrates a crack probability map—generated by a classification-model ensemble—as a fourth input channel for segmentation networks. The final ensemble used to produce probability maps comprised 32 × 32 MobileNetV2 + 64 × 64 MobileNetV2 + 128 × 128 MobileNetV2 (F1 checkpoint) + 32 × 32 ShuffleNet V2 × 1.0. Subsequently, all segmentation models were retrained with four-channel input (RGB + probability map) and compared against their three-channel baselines on the same validation split. For benchmarking, YOLO11x-seg was included as a reference, while SegFormer B2 and SwinUNet Tiny were added as Transformer-based baselines for comparison.
4.1. Quantitative Summary
Across Table 10, Table 11, Table 12, Table 13 and Table 14, the four-channel configuration yields systematic improvements. For IoU, gains range from moderate for DeepLabV3+ (from 0.5697 to 0.6062; Δ +6.41%) to large relative increases for ENet (e.g., +59.48% when the checkpoint is selected by IoU) and HRNet (very high relative gains due to extremely low three-channel baselines). F1-score improves for most architectures, with stable gains for DeepLabV3+ and pronounced improvements for ENet; HRNet exhibits the largest relative rise because of its weak baseline. Precision consistently increases across models, while recall also improves, with effect sizes varying by architecture. Although the validation loss often rises slightly after adding the fourth channel, absolute values remain very low (maximum around 0.0252), and these local optimization effects do not offset the overlap-based gains (IoU/F1). For reference, YOLO11x-seg reaches IoU 0.4987, while Transformer-based baselines (SegFormer B2, SwinUNet Tiny) exceed several CNN results but remain below the best four-channel DeepLabV3+ and HRNet configurations.
4.2. Comparison with Integrated Attention Mechanisms
The proposed probability map can be regarded as a complementary strategy to conventional built-in attention modules rather than a direct replacement. Unlike mechanisms such as Squeeze-and-Excitation Networks (SE-Nets) [45] or the Convolutional Block Attention Module (CBAM) [46], which adaptively reweight features during training in an implicit manner, the probability map provides an explicit and interpretable prior that highlights the regions most likely to contain cracks. This design offers several advantages:
Modularity, as classification and segmentation models can be trained, updated, or replaced independently.
Computational flexibility, since classification and segmentation can be executed on different devices, enabling scalable deployment.
Minimal architectural modifications, as only the first convolutional layer is adapted to four channels, leaving the segmentation backbone unchanged.
The probability map can itself act as a lightweight screening tool, allowing segmentation to be omitted when no cracks are detected (a minimal gating sketch follows this list).
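Such a gate can be as simple as the following sketch; the threshold and minimum pixel count are illustrative values, not taken from the paper:

```python
def needs_segmentation(prob_map, thresh=0.5, min_pixels=50):
    """Run the segmentation network only when the ensemble map contains
    enough confident crack pixels. `prob_map` is a 2-D array/tensor of
    probabilities in [0, 1]; both parameter values are illustrative."""
    return (prob_map >= thresh).sum() >= min_pixels
```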
At the same time, certain trade-offs exist. Segmentation accuracy becomes sensitive to the quality of the classification ensemble, false activations may propagate to the segmentation stage, and the two-stage pipeline introduces additional inference overhead compared to end-to-end attention. Nevertheless, the explicit nature of the probability prior makes the approach transparent and adaptable, providing a degree of interpretability and modularity that built-in attention mechanisms cannot directly offer.
4.3. Qualitative Observations
The examples in Figure 3 corroborate the quantitative findings. Four-channel models produce cleaner masks with fewer fragmented regions and better agreement with the ground truth. DeepLabV3+ remains robust across both drone and action camera imagery. HRNet benefits markedly from the additional channel relative to its weak three-channel baseline. For U-Net, the effect is limited and depends on the checkpoint selection criterion.
In addition to the successful examples, several challenging cases are illustrated in Figure 4 using DeepLabV3+ as the segmentation backbone.
The first case demonstrates cracks located in a hard shadow cast by a vehicle, which were not detected by either the ensemble or the segmentation model. In the second example, motion blur caused by UAV movement led to missed detection of partially blurred cracks, although sharp cracks were correctly identified; notably, cracks partially covered with sand were successfully detected, and vegetation did not produce false positives. In the third case, the ensemble generated a noisy probability map that highlighted all cracks, but the segmentation network failed to fully utilize this information, resulting in discontinuities in the predicted mask. The fourth case involves low-contrast cracks filled with dirt, where the ensemble identified only a short fragment with low confidence, while the segmentation model captured approximately half of the crack pattern.
To complement the qualitative examples in Figure 4, we also conducted a quantitative failure-case analysis. Figure 5 presents the cumulative distribution of per-image IoU scores on the validation set for three-channel and four-channel models. The four-channel configuration consistently achieves a higher proportion of high-IoU predictions, confirming the stability of the proposed approach. Using IoU < 0.4 as a threshold for failure, 21 images (≈9% of the validation set) were identified as low-quality cases. Their distribution across categories is summarized in Table 15: 11 involved low-contrast cracks, 5 were caused by false activations in the probability maps, 4 by shadows, and 2 by noisy probability maps.
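The failure-threshold analysis can be reproduced from per-image IoU scores; names and the dictionary input are illustrative:

```python
import numpy as np

def failure_analysis(per_image_iou, thresh=0.4):
    """Per-image IoU CDF and failure flagging (IoU < 0.4), as in the
    analysis above; `per_image_iou` maps image id -> IoU and is
    assumed to come from a prior evaluation pass."""
    ious = np.sort(np.array(list(per_image_iou.values())))
    cdf = np.arange(1, len(ious) + 1) / len(ious)        # Figure-5-style curve
    failures = [k for k, v in per_image_iou.items() if v < thresh]
    return ious, cdf, failures
```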
This quantitative assessment confirms that failure cases are not isolated but occur systematically under challenging acquisition conditions. Moreover, they highlight the dual role of ensemble activations: Sufficiently confident false activations may propagate into segmentation and cause spurious detections, while strong true activations do not always guarantee successful segmentation. These insights reinforce the importance of ensemble calibration, adaptive fusion strategies, and more diverse training data to further improve robustness.
4.4. Computational Aspects
The ensemble that delivered the best probability maps achieves the top IoU at a practical processing time of approximately 0.6–0.7 s/image (≈0.63 s/image for the best four-model configuration, with alternatives at ≈0.59 and ≈0.72 s/image). On the classification side, CNN-based models infer substantially faster than pretrained Transformer counterparts, which supports the choice of CNNs as the ensemble backbone for large-scale or field deployments.
An additional observation concerns the relative performance of Transformer-based classifiers compared with lightweight CNNs. Despite their strong results on large-scale vision benchmarks, pre-trained Transformer backbones underperformed in our setting. Three factors likely explain this outcome. First, Transformers are known to require very large training datasets to exploit their self-attention capacity effectively, whereas our patch-level datasets (32 × 32, 64 × 64, 128 × 128) were comparatively limited. Second, the architectural complexity of Transformers may constitute an over-design for the relatively simple binary classification task of distinguishing crack versus non-crack patches, resulting in higher computational overhead without proportional accuracy gains. Third, the self-attention mechanism of Transformers aggregates information across patch windows (e.g., 7 × 7 or 16 × 16), which favors capturing global context but may overlook very fine local structures. By contrast, convolutional kernels in lightweight CNNs (e.g., 3 × 3 filters in MobileNet or ShuffleNet) are inherently suited to emphasize local textures and thin crack patterns. Together, these factors explain why CNNs produced more reliable probability maps in this study, making them preferable for small-data regimes and real-time applications.
4.5. Practical Implications for Deployment
From a practical standpoint, the proposed approach is attractive for deployment. UAV and vehicle-mounted imagery can be processed with minimal hardware modification: Classification can run on one device, segmentation on another. This modularity supports integration into pavement management systems, where the method can simplify adoption since it does not substantially increase computational complexity. In practice, deployment could rely on existing servers, accepting a modest increase in processing time, or alternatively on minimal digital infrastructure upgrades that distribute classification and segmentation tasks across separate devices. The ensemble itself may also serve as a lightweight, standalone screening tool on low-power hardware such as entry-level laptops or GPUs, enabling preliminary assessments of pavement condition before full segmentation is performed. This flexibility enables scalable inspection pipelines without requiring expensive multimodal sensors or large investments in new hardware.
4.6. Limitations and Potential Improvements
Despite the overall improvement in segmentation quality, this study has some limitations:
Segmentation quality is sensitive to the reliability of the probability maps; occasional false activations from the ensemble can propagate to the four-channel input and locally reduce model confidence.
Generalization was assessed under specific acquisition conditions; broader validation on diverse pavement types and capture setups is warranted.
Not all architectures exploit the additional channel equally well; some (e.g., U-Net) may require tailored integration (e.g., attention mechanisms, adaptive channel fusion, or deeper insertion points for the probability map).
Final performance can depend on the checkpoint selection criterion (loss vs. IoU), suggesting that systematic analysis of selection strategies is beneficial.
Direct comparison with crack-specific architectures (e.g., DeepCrack, CrackFormer) was not feasible due to low-resolution datasets and high computational requirements, though they remain important references for future work.
A more detailed examination of these aspects highlights their implications. First, the dataset consisted of 935 images (702 training, 233 validation), captured in relatively narrow acquisition settings (sunny UAV flights and action camera recordings during winter). While sufficient to demonstrate feasibility, this limited coverage restricts the generalization of trained models to other pavement types (e.g., concrete, composite, or different asphalt mixes) and environmental conditions such as nighttime, rain, or strong shadows. Second, the effectiveness of segmentation is tied to the reliability of the probability maps: localized false activations (e.g., shadows, vegetation, or debris) can reduce confidence in the four-channel input. Third, the improvements obtained with probability maps are not uniform across segmentation backbones: Stronger architectures tend to show moderate yet stable gains, while weaker baselines benefit more substantially, in some cases surpassing initially stronger models. Finally, model selection based on different checkpointing criteria (loss vs. IoU) led to noticeable variations in outcomes, underscoring sensitivity to training dynamics.
Addressing these issues through ensemble refinement, dataset expansion (including synthetic augmentation and domain adaptation), broader validation campaigns, and architecture-level adaptation would strengthen both accuracy and robustness. In addition, exploring synthetic dataset generation, semi-supervised annotation, and generative approaches for dataset enrichment may extend applicability and help unlock the potential of Transformer-based classifiers, complementing the strengths of lightweight CNNs.
5. Conclusions
Incorporating an ensemble-derived probability map as a fourth channel is an effective and minimally invasive means to improve crack segmentation on asphalt imagery. The proposed method delivers consistent gains in IoU, F1-score, precision, and recall across multiple architectures and surpasses the YOLO11x-seg reference under identical validation conditions. While Transformer-based segmentation models have shown strong potential in recent literature, in this study, the best-performing four-channel CNN configurations (DeepLabV3+, HRNet) achieved superior results across most metrics. It should be noted, however, that the extent of improvement varies by backbone: stronger models generally gain moderately, while weaker baselines benefit more substantially, in some cases exceeding initially stronger configurations. This highlights that modern CNN backbones, when combined with ensemble-derived probability maps, can offer more robust generalization on relatively small, heterogeneous datasets, making them particularly attractive for practical infrastructure monitoring. Given its moderate computational cost and compatibility with common backbones, the method is well suited to practical inspection pipelines (both ground and aerial).
Future research directions include:
Refining and calibrating the classification ensemble to reduce false activations.
Exploring alternative integration strategies for the probability map (e.g., attention-based fusion, adapter layers, or multi-layer injection rather than input-only).
Extending evaluation to additional surface types and acquisition regimes.
Analyzing the effect of checkpoint selection strategies on final performance.
Assessing probability-map integration for Transformer-based segmentation models.
Using generative AI for dataset expansion and developing pre-trained models tailored to road and infrastructure inspection.
Overall, this study demonstrates that probability maps derived from lightweight classifier ensembles can serve as a scalable solution for automated pavement crack segmentation, bridging the gap between academic research and deployable systems for road management.