The Crack Diffusion Model: An Innovative Diffusion-Based Method for Pavement Crack Detection

Zhang, Haoyuan; Chen, Ning; Li, Mei; Mao, Shanjun

doi:10.3390/rs16060986

Institute of Remote Sensing and Geographic Information System, Peking University, Beijing 100871, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Remote Sens. 2024, 16(6), 986; https://doi.org/10.3390/rs16060986

Submission received: 12 January 2024 / Revised: 29 February 2024 / Accepted: 8 March 2024 / Published: 11 March 2024

Abstract

Pavement crack detection is of significant importance in ensuring road safety and smooth traffic flow. However, pavement cracks come in various shapes and forms which exhibit spatial continuity, and algorithms need to adapt to different types of cracks while preserving their continuity. To address these challenges, an innovative crack detection framework, CrackDiff, based on the generative diffusion model, is proposed. It leverages the learning capabilities of the generative diffusion model for the data distribution and latent spatial relationships of cracks across different sample timesteps and generates more accurate and continuous crack segmentation results. CrackDiff uses crack images as guidance for the diffusion model and employs a multi-task UNet architecture to predict mask and noise simultaneously at each sampling step, enhancing the robustness of generations. Compared to other models, CrackDiff generates more accurate and stable results. Through experiments on the Crack500 and DeepCrack pavement datasets, CrackDiff achieves the best performance (F1 = 0.818 and mIoU = 0.841 on Crack500, and F1 = 0.841 and mIoU = 0.862 on DeepCrack).

Keywords: pavement crack detection; diffusion model; multi-task; spatial continuity

1. Introduction

Pavement crack detection plays a crucial role in the maintenance and management of road infrastructure worldwide. Cracks on road surfaces are common indicators of pavement distress, often resulting from various factors such as traffic load, environmental conditions, and material deterioration. These cracks, if left unattended, can lead to a multitude of issues, including compromised road safety, increased risk of accidents, and hindered traffic flow. Traditional methods of crack detection primarily rely on manual visual inspection, where engineers or inspectors visually examine the surfaces of structures or materials to identify cracks and defects. This approach relies on human expertise, demanding significant time and labor efforts.

In recent years, numerous research efforts have been dedicated to developing automated pavement crack detection algorithms and systems, utilizing various imaging modalities such as visible light photography [1], infrared thermography [2], and LiDAR scanning [3]. These technologies offer the potential for faster, more accurate, and cost-effective crack detection compared to manual inspection methods. Methods utilizing visible light images for pavement crack detection offer the advantages of cost-effectiveness, enhanced detection speed, and automation. Image-based crack detection methods mainly include threshold segmentation [4], edge detection [5,6,7], traditional machine learning [8,9,10], and deep learning techniques [11,12,13,14,15,16,17,18]. Deep learning models, particularly convolutional neural networks (CNNs) [19], have exhibited significant promise in the field of crack detection because they can automatically learn image features, leading to highly accurate crack detection. Influenced by materials and environment, pavement cracks exhibit various shapes and features, but they typically present spatial continuity. However, limited by the receptive field of the kernel of CNNs, deep learning methods based on CNNs often produce fragmented and discontinuous detection results. There are mainly two reasons for producing such results: 1. Cracks exhibit continuity in space, while CNN-based methods are unable to learn this underlying spatial relationship between pixels. 2. Cracks exhibit significant scale differences between length and width dimensions. Moreover, annotating ground truth for cracks can be challenging as it is difficult for humans to accurately delineate crack contours, which results in a decline in CNN performance.

In recent years, generative models have garnered substantial popularity in the domain of deep learning. Deep generative models are essentially designed to seek and express the probability distribution of (multivariate) data in some way. These models, by capturing the data generation process, possess a certain level of robustness and probabilistic inference capability. They can handle missing and unlabeled data, learn high-level feature representations, and, as a result, find wide application in tasks such as visual object recognition, information retrieval, classification, and regression. Particularly, diffusion models [20] have emerged as a cutting-edge generative model, surpassing traditional models like Generative Adversarial Networks (GANs) [21] and Variational Autoencoders (VAEs) [22] in image generation tasks. Diffusion models learn through the processes of adding noise and sampling, and they possess two major advantages in feature representation. Firstly, diffusion models can predict pixel-level noise at each diffusion step, thereby characterizing the joint probability distribution of all pixels. This implies that they can learn broader spatial relationships, capturing more details and structural information during the image generation process. Secondly, the sampling process of diffusion models progresses from noise to image and from coarse to fine, showcasing their strong learning ability for both shallow and deep image features. This sampling process helps the model gradually understand the structure and content of images, leading to the generation of higher-quality images. Therefore, this paper aims to leverage the capability of the diffusion model for data distribution to learn the latent spatial relationships of pavement cracks and address the issues present in the CNN models mentioned above and, finally, generate accurate and continuous crack segmentation results.

However, currently, diffusion models are primarily used in the field of image generation. Unlike discriminative models, which can easily compute the correlation between predicted results and ground truth, there are only a few generative tasks that have well-defined ground truth, such as image super-resolution [23]. Currently, there is no research on incorporating the learning capability of diffusion models into crack detection. Addressing this issue, this paper proposes the Crack Diffusion Model (CrackDiff), a pavement crack detection model based on the diffusion model. It is designed to learn the spatial distribution characteristics of pavement cracks, generating accurate and continuous crack segmentation results.

CrackDiff uses the framework of the denoising diffusion probabilistic model (DDPM) [20], where the crack image is embedded and input into a multi-task UNet [24] network alongside the noisy image. In this network, one branch is responsible for predicting the image’s segmentation result, while the other branch predicts the noise at the current step, as illustrated in Figure 1. With this design, CrackDiff is capable of accurately learning the distribution and spatial relationships of pavement cracks. During the initial phases of the sampling process, CrackDiff produces the approximate outline of the cracks. In the later stages of sampling, it generates their detailed contours. Finally, it produces highly confident segmentation results.

The primary contributions of this paper include:

This study introduces a novel framework for pavement crack detection based on the diffusion model, CrackDiff, which is capable of learning both surface and deep features related to the distribution and spatial relationships of cracks, leading to accurate and continuous crack segmentation results.
This study proposes a diffusion model structure based on multi-task UNet, which enhances the guidance effect on the original image by predicting image segmentation results, resulting in robust crack segmentation.
Through experiments conducted on the Crack500 and DeepCrack datasets, CrackDiff achieves state-of-the-art results in pavement crack detection.

The subsequent structure of this paper is outlined as follows: In Section 2, a concise overview of the existing research on crack detection and diffusion models is presented. Section 3 elaborates on the principles of the diffusion model, CrackDiff, and the specific architecture of the network. Section 4 presents the experimental settings and qualitative and quantitative analyses of the performance of CrackDiff and demonstrates its advantages through ablation experiments. Section 5 discusses the limitations of CrackDiff, emphasizes existing challenges, and proposes potential directions for future research. Finally, Section 6 provides a summary of the paper’s conclusions.

2. Related Works

2.1. Pavement Crack Detection

Image-based pavement crack detection methods mainly include threshold segmentation, edge detection, traditional machine learning, and deep learning techniques. Threshold segmentation methods classify pixels in a given image into object and background classes based on pixel thresholds [4]. These methods are computationally efficient and fast, but they are highly sensitive to noise. Edges are the main feature of cracks in an image, and edge detection methods identify them using operators such as Sobel, Canny, and Prewitt [5,6,7]. However, the choice of parameters in these algorithms can significantly impact the detection of cracks, and finding the optimal parameters for different images, especially those with complex backgrounds, can be challenging. Traditional machine learning methods involve manually extracting crack features from images and then using techniques like support vector machines [8,9] or random forests [10] for feature classification. These methods offer better accuracy compared to traditional image processing techniques but rely on manually extracted crack features.

Deep learning is currently the most popular method. Semantic segmentation models are widely used in pavement crack detection because they can classify images pixel by pixel, allowing for quantitative measurement of road crack severity and density while simultaneously expressing the size and location of the targets. Various image segmentation models, including FCN [25], UNet [24], DeepLabv3+ [26], and SegNet [27], have been applied extensively in pavement crack detection research [11,12,13,14,15]. To capture the multi-scale spatial features of cracks, techniques such as spatial pyramids [16,28,29], dilated convolutions [30], and residual connections [31] have been incorporated into network designs. Additionally, attention mechanisms have been widely applied due to their advantages in representing long-range spatial dependencies and inter-channel correlations [17,32,33,34], especially transformer-based approaches [18,35,36,37,38].

Despite the integration of dilated convolutions, attention mechanisms, and various other techniques into CNN-based models, the performance of crack detection based on CNNs is still limited by the local receptive field. This limitation hinders the ability to identify long cracks and capture the entire image background, leading to discontinuous detection results and susceptibility to noise interference. Additionally, due to the difficulty in accurately labeling ground truth and data imbalance, the network may easily converge to treating all pixels as background.

Some studies have already applied the feature learning capability of generative models to crack detection. CrackGAN [39] proposed a crack-patch-only (CPO) supervised GAN network for generating crack-GT images. Deep convolutional GAN [40] is employed to generate a crack image dataset. Conditional GAN [41] is used to first extract the road and then detect cracks. However, due to the simultaneous training of both the generator and discriminator, GANs are challenging to balance, leading to training instability.

2.2. Diffusion Model

The diffusion model is a type of probabilistic generative model that can be divided into two main stages: the forward noise-adding stage and the backward denoising stage. In the forward noise-adding stage, original images are progressively corrupted by adding Gaussian noise. In the backward denoising stage, the generative model’s task is to learn to invert the noise-adding of the forward process and recover the original input data from the noisy data. Currently, diffusion models can be categorized into three main types, namely Denoising Diffusion Probabilistic Models (DDPMs) [20], Score-based Generative Models [42], and Generative Models based on Stochastic Differential Equations [43]. CrackDiff is founded on the architecture of the DDPM.

The inspiration for the DDPM comes from non-equilibrium thermodynamics, and the training process involves two stages: the forward noise-adding diffusion process and the reverse denoising process, which will be explained in the next section. To accelerate sampling, Denoising Diffusion Implicit Models (DDIMs) [44] replace the traditional forward Markov process with a non-Markovian one and demonstrate that this still yields good generative results while improving the sampling speed. The IDDPM [45] reduced the number of sampling steps by using a cosine noise schedule in the forward process and introducing learnable variance in the reverse denoising process. In terms of model structure, D2C [46] applies contrastive representation learning to the diffusion decoder model and improves the quality of representations through contrastive self-supervised learning. Peebles [47] replaced the commonly used UNet network for reverse image generation with transformer networks, achieving better training results. In the realm of conditional diffusion probabilistic models, the IDDPM [45] incorporates class information as a conditional embedding added to the timestep embedding, generating images of a given class. SRDiff [48] applies diffusion generative models to image super-resolution (SR): it uses the low-resolution information encoded by an encoder as the condition to progressively denoise high-resolution images and generate super-resolution images. SegDiff [49] adds original image embeddings to the network and generates segmentation results for the original image.

Apart from the applications mentioned above, diffusion models have been successfully applied to various complex visual problems, such as image-to-image translation [50] and image blending [51], demonstrating the excellent capability of diffusion models in capturing latent patterns and relationships between samples.

3. Methodology

This section first introduces the foundational diffusion model and then describes the enhanced diffusion model used for pavement crack detection, followed by the details of the network architecture.

3.1. Diffusion Model

The diffusion model includes two processes: the forward noise-adding process and the backward denoising process. Both the forward and backward processes are parameterized Markov chains, where the backward process can be employed for data generation.

The forward process is also referred to as the diffusion process. It gradually adds Gaussian noise to the data until the data become random noise. For the original ground truth image (segmentation mask) $GT_0 \sim q(GT_0)$, the diffusion process consists of a total of $T$ steps. In each step, Gaussian noise is added to the data obtained in the previous step, as shown in Equation (1):

$$q(GT_t \mid GT_{t-1}) = \mathcal{N}\left(GT_t;\ \sqrt{1 - \sigma_t^2}\, GT_{t-1},\ \sigma_t^2 I\right) \tag{1}$$

$\{\sigma_t^2\}_{t=1}^{T}$ represents the variance of the noise used at each step. It is manually specified and falls within the range of 0 to 1. As the noise added at each step is independent of the previous steps, the sequence $GT_1, \dots, GT_t$ generated from $GT_0$ forms a Markov chain:

$$q(GT_{1:t} \mid GT_0) = \prod_{i=1}^{t} q(GT_i \mid GT_{i-1}) \tag{2}$$

With the reparameterization technique (similar to the VAE), $GT_t$ can be calculated directly from $GT_0$ without iteration:

$$q(GT_t \mid GT_0) = \mathcal{N}\left(GT_t;\ \sqrt{\bar{\alpha}_t}\, GT_0,\ (1 - \bar{\alpha}_t) I\right) \tag{3}$$

where $\alpha_t = 1 - \sigma_t^2$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$.
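The equivalence between applying Equation (1) step by step and jumping directly with Equation (3) can be checked numerically. The sketch below is only an illustration: the function names and the example variances are assumptions, not taken from the paper. It tracks the signal scale and the accumulated noise variance of $GT_t$ relative to $GT_0$ in both ways.

```python
import math

def closed_form_coeffs(variances):
    """Signal scale and noise variance of GT_t given GT_0, from Equation (3)."""
    alpha_bar = 1.0
    for var in variances:
        alpha_bar *= 1.0 - var          # alpha_i = 1 - sigma_i^2
    return math.sqrt(alpha_bar), 1.0 - alpha_bar

def iterated_coeffs(variances):
    """The same two quantities, obtained by applying Equation (1) step by step."""
    signal, noise_var = 1.0, 0.0
    for var in variances:
        alpha = 1.0 - var
        signal *= math.sqrt(alpha)              # the mean is scaled by sqrt(alpha_t)
        noise_var = alpha * noise_var + var     # old noise shrinks, new noise is added
    return signal, noise_var

# Illustrative per-step variances (hypothetical values, not the paper's schedule).
example_variances = [0.01, 0.02, 0.05, 0.10]
print(closed_form_coeffs(example_variances))
print(iterated_coeffs(example_variances))
```

Both calls should agree up to floating-point rounding, which is why training can draw $GT_t$ for a random $t$ in a single step instead of simulating the whole chain.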

In the denoising process, given random noise $GT_T \sim \mathcal{N}(0, I)$, a noise-free predicted segmentation image can be generated step by step if the distribution of each step $q(GT_{t-1} \mid GT_t)$ is known. This distribution is what the network needs to learn. Based on Bayes’ theorem and Equation (3), the following result is derived:

$$\begin{aligned} q(GT_{t-1} \mid GT_t, GT_0) &= q(GT_t \mid GT_{t-1}, GT_0)\, \frac{q(GT_{t-1} \mid GT_0)}{q(GT_t \mid GT_0)} \\ &= \mathcal{N}\left(GT_{t-1};\ \tilde{\mu}_t(GT_t, GT_0),\ \tilde{\sigma}_t^2 I\right) \end{aligned} \tag{4}$$

where $\tilde{\mu}_t$ and $\tilde{\sigma}_t$ are calculated using the reparameterization technique, as shown below.

$$\tilde{\mu}_t(GT_t, t) = \frac{1}{\sqrt{\alpha_t}}\, GT_t - \frac{\sigma_t^2}{\sqrt{\alpha_t (1 - \bar{\alpha}_t)}}\, \epsilon_\theta(GT_t, t) \tag{5}$$

$$\tilde{\sigma}_t = \sqrt{\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}} \cdot \sigma_t \tag{6}$$
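The step from Equation (4) to Equation (5) is left implicit. As a sketch following the standard DDPM derivation [20] (the intermediate closed form below is not stated in this paper and is included only as a reasoning aid), the posterior mean is first written in terms of $GT_0$; then $GT_0$ is eliminated by solving Equation (3) for it, with the true noise replaced by the network estimate $\epsilon_\theta$:

```latex
\begin{align*}
\tilde{\mu}_t(GT_t, GT_0)
  &= \frac{\sqrt{\bar{\alpha}_{t-1}}\,\sigma_t^2}{1-\bar{\alpha}_t}\, GT_0
   + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, GT_t, \\
GT_0
  &= \frac{1}{\sqrt{\bar{\alpha}_t}}
     \left( GT_t - \sqrt{1-\bar{\alpha}_t}\;\epsilon_\theta(GT_t, t) \right), \\
\tilde{\mu}_t(GT_t, t)
  &= \frac{\sigma_t^2 + \alpha_t (1-\bar{\alpha}_{t-1})}{\sqrt{\alpha_t}\,(1-\bar{\alpha}_t)}\, GT_t
   - \frac{\sigma_t^2}{\sqrt{\alpha_t (1-\bar{\alpha}_t)}}\;\epsilon_\theta(GT_t, t) \\
  &= \frac{1}{\sqrt{\alpha_t}}\, GT_t
   - \frac{\sigma_t^2}{\sqrt{\alpha_t (1-\bar{\alpha}_t)}}\;\epsilon_\theta(GT_t, t).
\end{align*}
```

The last simplification uses $\sigma_t^2 + \alpha_t(1-\bar{\alpha}_{t-1}) = 1 - \alpha_t + \alpha_t - \bar{\alpha}_t = 1 - \bar{\alpha}_t$, which recovers Equation (5).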

$\epsilon_\theta(GT_t, t)$ is the neural network’s prediction of the noise added at timestep $t$. $\tilde{\sigma}_t$ can be computed from the predefined variance without the need for learning.

Then, $GT_{t-1}$ can be calculated using Equations (4) to (6):

$$GT_{t-1} = \tilde{\mu}_t(GT_t, t) + \tilde{\sigma}_t\, \eta \tag{7}$$

$\eta$ is random noise sampled from a standard normal distribution. $GT_0$ can be generated step by step using Equation (7).
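A single denoising step combining Equations (5) to (7) can be sketched as follows. This is a minimal illustration on flat lists of pixel values; the function name and argument layout are assumptions, and `eps_pred` stands in for the output of the noise prediction network.

```python
import math

def reverse_step(gt_t, eps_pred, var_t, alpha_bar_t, alpha_bar_prev, eta):
    """Compute GT_{t-1} from GT_t.

    var_t          : sigma_t^2, the forward variance at step t
    alpha_bar_t    : cumulative product of alphas up to step t
    alpha_bar_prev : cumulative product up to step t-1 (1.0 when t == 1)
    eta            : standard normal noise, one value per pixel
    """
    alpha_t = 1.0 - var_t
    # Equation (5): coefficient of the predicted noise in the posterior mean.
    coef = var_t / math.sqrt(alpha_t * (1.0 - alpha_bar_t))
    # Equation (6): posterior standard deviation, fixed by the schedule.
    sigma_tilde = math.sqrt((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)) * math.sqrt(var_t)
    # Equation (7): posterior mean plus scaled random noise.
    return [x / math.sqrt(alpha_t) - coef * e + sigma_tilde * n
            for x, e, n in zip(gt_t, eps_pred, eta)]
```

At $t = 1$ the previous cumulative product is 1, so $\tilde{\sigma}_1 = 0$ and the last step is deterministic: if the noise prediction were exact, the step would return $GT_0$ itself.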

3.2. Crack Diffusion Model

This section introduces the Crack Diffusion Model (CrackDiff). The main differences between CrackDiff and the standard diffusion model lie in the model input and output. At the training stage, the input of the noise estimation network $\epsilon_\theta$ is the combination of the embeddings of the noisy ground truth at the current step and of the crack image. At the sampling (test) stage, the input is the combination of the embeddings of the current estimate and of the crack image, as shown in Figure 1.

The embedding of the ground truth or the estimate is denoted as $Emb_{GT}$, and the embedding of the crack image is denoted as $Emb_I$. These features are summed and passed to the encoder network $E$ of $\epsilon_\theta$.

$\epsilon_\theta$ is structured as a multi-task UNet, comprising one encoding branch and two decoding branches. The decoding branches are responsible for predicting the segmentation (mask) and $\epsilon_\theta(GT_t, t)$ (noise), respectively. The encoder part of $\epsilon_\theta$ is denoted as $\epsilon_{\theta_{\mathrm{encode}}}$, the two decoder parts of $\epsilon_\theta$ are denoted as $\epsilon_{\theta_{\mathrm{decode}}}^{\mathrm{mask}}$ and $\epsilon_{\theta_{\mathrm{decode}}}^{\mathrm{noise}}$, and the network $\epsilon_\theta$ can be formulated as follows:

$$\epsilon_{\theta_{\mathrm{encode}}}(GT_t, I, t) = E\left(Emb_{GT}(GT_t) + Emb_I(I),\ Emb_t(t)\right) \tag{8}$$

$$\epsilon_{\theta_{\mathrm{decode}}}^{\mathrm{mask}}(GT_t, I, t) = D_{\mathrm{mask}}\left(E,\ Emb_t(t)\right) \tag{9}$$

$$\epsilon_{\theta_{\mathrm{decode}}}^{\mathrm{noise}}(GT_t, I, t) = D_{\mathrm{noise}}\left(E,\ Emb_t(t),\ D_{\mathrm{mask}}\right) \tag{10}$$

The mask decoder network $D_{\mathrm{mask}}$ follows the same architecture as in UNet. The noise decoder network $D_{\mathrm{noise}}$ takes both the encoder feature map and the mask decoder feature map as inputs. The current step index $t$ is encoded and added to both the encoder and decoder parts. The details of the network architecture will be explained in the following section.

The variance parameter $\sigma_t^2$ in the diffusion process is a predefined constant that increases linearly with $t$, the same as in the DDPM [20], as shown below.

$$\sigma_t^2 = \frac{10^{-4}\,(T - t) + 2 \times 10^{-2}\,(t - 1)}{T - 1} \tag{11}$$
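The schedule in Equation (11), together with the quantities $\alpha_t$ and $\bar{\alpha}_t$ defined for Equation (3), can be precomputed once. The sketch below is illustrative; the function names are assumptions, and $T = 500$ is the value reported in the experimental settings.

```python
def linear_variances(T, first=1e-4, last=2e-2):
    """sigma_t^2 for t = 1..T, following Equation (11)."""
    if T < 2:
        raise ValueError("Equation (11) needs T >= 2")
    return [(first * (T - t) + last * (t - 1)) / (T - 1) for t in range(1, T + 1)]

def cumulative_alphas(variances):
    """alpha_bar_t = product of (1 - sigma_i^2) for i = 1..t."""
    alpha_bars, running = [], 1.0
    for var in variances:
        running *= 1.0 - var
        alpha_bars.append(running)
    return alpha_bars

variances = linear_variances(500)
alpha_bars = cumulative_alphas(variances)
print(variances[0], variances[-1], alpha_bars[-1])
```

The first and last variances are the two endpoints of the schedule ($10^{-4}$ and $2 \times 10^{-2}$), and $\bar{\alpha}_t$ decreases monotonically, so later timesteps retain progressively less of the original mask.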

Algorithm 1 depicts the training procedure. The input training dataset comprises crack images alongside their corresponding ground truth segmentation images. The total number of diffusion steps $T$ is a predefined hyperparameter. In the training procedure, a random integer $t$ is sampled from a uniform distribution as the current timestep, and random noise $\epsilon$ is sampled from a standard normal distribution. $\sigma_t$, $\alpha_t$, $\bar{\alpha}_t$, and $GT_t$ are sequentially calculated according to Equations (3) and (11). The loss function is introduced in Section 3.3.

Algorithm 1 Training Algorithm

Input: images and masks $D = \{(I_i, GT_i)\}_{i=1}^{N}$, total diffusion steps $T$
repeat
    Sample $t \sim \{1, \dots, T\}$, $\epsilon \sim \mathcal{N}(0, I)$, $(I_i, GT_i) \sim D$
    Calculate $\sigma_t, \alpha_t, \bar{\alpha}_t, GT_t$ according to Equations (3) and (11)
    Calculate loss $L_t$ and gradient $\nabla_\theta L_t$ according to Equation (12)
    Backward
until iteration stops
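The "Sample" and "Calculate" steps of Algorithm 1 can be sketched as below. This is only an illustration on flat lists of pixel values: the function names are assumptions, the $\bar{\alpha}_t$ values in the usage lines are hypothetical, and the network forward pass, the mask term of Equation (12), and the optimizer update are deliberately left out.

```python
import math
import random

def training_example(gt0, alpha_bars, rng):
    """Draw one (t, eps, GT_t) triple for a ground truth mask gt0."""
    T = len(alpha_bars)
    t = rng.randint(1, T)                        # t ~ Uniform{1, ..., T}
    eps = [rng.gauss(0.0, 1.0) for _ in gt0]     # eps ~ N(0, I)
    a = alpha_bars[t - 1]
    # Equation (3): noisy mask computed directly from GT_0.
    gt_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(gt0, eps)]
    return t, eps, gt_t

def noise_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise (second term of Equation (12))."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

rng = random.Random(0)
t, eps, gt_t = training_example([0.0, 1.0, 1.0, 0.0], [0.9, 0.7, 0.4], rng)
print(t, gt_t)
```

In a full implementation, `gt_t`, the crack image, and `t` would be passed to the multi-task network, whose noise output is compared against `eps` and whose mask output is compared against the clean ground truth.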

The inference procedure is depicted in Algorithm 2. It takes random noise $GT_T$ sampled from a standard normal distribution as input and a crack image as guidance. The total number of diffusion steps $T$ is the same as in the training procedure. For $t$ from $T$ down to 1, the network predicts the noise at the current timestep, which is used to denoise $GT_t$ and calculate the estimate $GT_{t-1}$ according to Equation (7).

Algorithm 2 Inference Algorithm

Input: image $I$, total diffusion steps $T$
Sample $GT_T \sim \mathcal{N}(0, I)$
for $t = T, T - 1, \dots, 1$ do
    Sample $\eta \sim \mathcal{N}(0, I)$
    Calculate $\sigma_t, \alpha_t, \bar{\alpha}_t, \tilde{\sigma}_t$
    Calculate noise prediction $\epsilon_{\theta_{\mathrm{decode}}}^{\mathrm{noise}}(GT_t, I, t)$
    Calculate $GT_{t-1}$ according to Equation (7)
end for
return $GT_0$
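Algorithm 2 can be sketched as a plain loop. The code below is a simplified illustration on flat lists of pixel values; `predict_noise` is a hypothetical callable standing in for the noise branch of the trained network, and the short variance list in the usage lines is an arbitrary example rather than the paper's schedule.

```python
import math
import random

def sample_mask(image, variances, predict_noise, rng, n_pixels):
    """Run the reverse process from GT_T ~ N(0, I) down to GT_0."""
    alpha_bars, running = [], 1.0
    for var in variances:
        running *= 1.0 - var
        alpha_bars.append(running)

    T = len(variances)
    gt = [rng.gauss(0.0, 1.0) for _ in range(n_pixels)]     # GT_T
    for t in range(T, 0, -1):
        var_t = variances[t - 1]
        alpha_t = 1.0 - var_t
        abar_t = alpha_bars[t - 1]
        abar_prev = alpha_bars[t - 2] if t > 1 else 1.0
        # Equation (6): zero at t == 1, so the final step adds no noise.
        sigma_tilde = math.sqrt((1.0 - abar_prev) / (1.0 - abar_t) * var_t)
        eps_pred = predict_noise(gt, image, t)
        coef = var_t / math.sqrt(alpha_t * (1.0 - abar_t))
        # Equations (5) and (7), with a fresh eta drawn per pixel.
        gt = [x / math.sqrt(alpha_t) - coef * e + sigma_tilde * rng.gauss(0.0, 1.0)
              for x, e in zip(gt, eps_pred)]
    return gt

# Hypothetical stand-in predictor that always returns zero noise.
zero_predictor = lambda gt, image, t: [0.0] * len(gt)
print(sample_mask(None, [0.1, 0.2, 0.3], zero_predictor, random.Random(0), 4))
```

Because the starting point and every $\eta$ are random, two runs with different seeds generally give different outputs, which is the variability that the multiple-inference averaging in Section 4 addresses.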

3.3. Structure of the Denoising Network

The network structure used for predicting ground truth and noise at each step is illustrated in Figure 2a. The network takes the crack image, the ground truth image at timestep t, and the time step t as input. The output consists of the segmentation image and the noise image at timestep t.

The crack image encoder is constructed using Residual-in-Residual Dense Blocks (RRDBs) [52], as shown in Figure 2b. It consists of multiple 3 × 3 convolutional layers with residual connections to the higher-level modules. The output dimensions of the convolutional layers are indicated in parentheses. The noisy ground truth image $GT_t$ at timestep $t$ is encoded using a single 3 × 3 convolutional layer. The structure of the encoder layers for timestep $t$ and of the residual block is depicted in Figure 3. First, a lookup table is defined to map $t$ into a feature vector, which is then encoded through a series of fully connected layers. The encoded crack image and $GT_t$ are used as the initial inputs to the network, and the encoded timestep $t$ is added to each residual module. The attention block in the residual block is the self-attention block proposed in non-local neural networks [53].

The encoder part of the network consists of four sets of modules. Each set contains two residual modules and a downsampling module, with the feature dimensions after each set marked in the figure. The residual module, as depicted in Figure 3, includes group normalization, SiLU activation, and 2D convolution layers. The residual blocks receive time embeddings through a linear layer, and each residual block is followed by a multi-head attention layer. The downsampling module is a 3 × 3 convolution layer with a stride of 2.

The decoder part of the network comprises two branches, each consisting of four sets of modules. Similar to the encoder, each set contains an upsampling module and three residual modules. The residual modules are the same as in the encoder, and the upsampling module includes a nearest-neighbor interpolation function and a 2D convolution layer. Each layer in the encoder has a skip connection to both the left branch and the right branch of the decoder, and each layer in the decoder’s left branch also has a skip connection to the right branch.

The loss function of the multi-task network includes two parts: the noise prediction loss and the mask prediction loss, as shown in Equation (12).

$$\begin{aligned} L_t = \mathbb{E}_{GT_0, \epsilon, I, t}\Big[\, & Loss_I\Big(GT_0,\ \epsilon_{\theta_{\mathrm{decode}}}^{\mathrm{mask}}\big(\sqrt{\bar{\alpha}_t}\, GT_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ I,\ t\big)\Big) \\ & + \Big\| \epsilon - \epsilon_{\theta_{\mathrm{decode}}}^{\mathrm{noise}}\big(\sqrt{\bar{\alpha}_t}\, GT_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ I,\ t\big) \Big\|^2 \,\Big] \end{aligned} \tag{12}$$

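The image loss function $Loss_I$ used in this article is the Focal Tversky Loss, with $\alpha = 0.7$, $\beta = 0.3$, $\gamma = 0.75$, and $\delta = 1$ (here $\alpha$ is a loss weight, unrelated to $\alpha_t$ in the diffusion process), as shown in Equation (13).

$$Loss_I = \left(1 - \frac{TP + \delta}{TP + \alpha \cdot FP + \beta \cdot FN + \delta}\right)^{\gamma} \tag{13}$$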
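Equation (13) can be sketched in a few lines. The version below is an illustration on flat lists of per-pixel values; computing TP, FP, and FN as soft counts from predicted probabilities is an assumption made here so that the loss is differentiable, since the text does not state how the counts are obtained.

```python
def focal_tversky_loss(pred, target, alpha=0.7, beta=0.3, gamma=0.75, delta=1.0):
    """Focal Tversky Loss of Equation (13).

    pred   : predicted crack probabilities in [0, 1], one per pixel
    target : ground truth labels (0 or 1), one per pixel
    """
    tp = sum(p * g for p, g in zip(pred, target))            # soft true positives
    fp = sum(p * (1.0 - g) for p, g in zip(pred, target))    # soft false positives
    fn = sum((1.0 - p) * g for p, g in zip(pred, target))    # soft false negatives
    tversky = (tp + delta) / (tp + alpha * fp + beta * fn + delta)
    # Clamp guards against a tiny negative base caused by rounding.
    return max(0.0, 1.0 - tversky) ** gamma

print(focal_tversky_loss([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))   # perfect prediction
print(focal_tversky_loss([1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]))
```

With the weights as written in Equation (13), a false positive is weighted by 0.7 and a false negative by 0.3, so the same number of false positives costs more than the same number of false negatives.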

4. Experiments

4.1. Experimental Settings

In order to test the performance of CrackDiff, this study conducted experiments on two public pavement crack datasets, Crack500 [1] and DeepCrack [12]. These two datasets offer diverse representations of road crack images, encompassing various pavement materials, lighting conditions, and weather scenarios, thereby enabling a more comprehensive training and evaluation of models, enhancing their generalization capabilities. Additionally, Crack500 and DeepCrack are widely used datasets in the field of pavement crack detection and have been extensively applied in academic research. Therefore, utilizing these two datasets enables better comparison and validation with existing studies, thereby enhancing the credibility and reproducibility of the research.

The Crack500 dataset comprises 500 pavement images, each approximately 2000 × 1500 pixels in size, captured by cell phones across the main campus of Temple University. The predominant material of the pavement is asphalt. These images were subdivided into 16 non-overlapping regions, with only those regions containing over 1000 pixels of cracks retained. After removing images with significant annotation errors and resizing all images to 448 × 448 pixels, 1896 training images, 348 validation images, and 400 test images are randomly selected for our experiment.

The DeepCrack dataset comprises 537 pavement images meticulously annotated for segmentation, with each image measuring 544 × 384 pixels. The pavement predominantly consists of asphalt and concrete materials. The dataset is partitioned into 300 training images, 137 validation images, and 100 test images for comprehensive evaluation and analysis.

Our method is compared with the widely used models UNet [24], FCN [25], and DeepLabV3+ [26]; the improved models PSPNet [28] and HRNet [31]; the transformer-based model SegFormer-B5 [35]; the crack-specific models DeepCrack [12], MFPANet [17], CrackFormer [36], and CT-crackseg [38]; and the diffusion-based model SegDiff [49]. The UNet baseline shares the backbone of the CrackDiff network but keeps only the crack prediction decoder. Precision, recall, F1 score, and mIoU (Mean Intersection over Union) are commonly used evaluation metrics in crack detection and image segmentation tasks. The calculation of precision and recall is influenced by threshold settings. The F1 score is the harmonic mean of precision and recall, providing a comprehensive evaluation of the model’s performance. It does not fluctuate significantly with changes in probability thresholds, thus offering a more objective assessment. The mIoU calculates the accuracy of pixel positioning, and its value also remains relatively stable when probability thresholds change. It provides a comprehensive assessment of the model’s segmentation performance for cracks and backgrounds, rather than simply evaluating overall pixel-level accuracy.

$$F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{14}$$

$$mIoU = \frac{1}{2} \left( \frac{TP}{TP + FP + FN} + \frac{TN}{TN + FN + FP} \right) \tag{15}$$
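Equations (14) and (15) can be evaluated directly from pixel-level confusion counts. The sketch below is illustrative; the function names and the example counts are hypothetical and are not results from the paper.

```python
def f1_score(tp, fp, fn):
    """Equation (14): harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

def mean_iou(tp, fp, fn, tn):
    """Equation (15): average of the crack IoU and the background IoU."""
    crack_iou = tp / (tp + fp + fn)
    background_iou = tn / (tn + fn + fp)
    return 0.5 * (crack_iou + background_iou)

# Hypothetical counts for a 1000-pixel image with 5% crack pixels.
print(f1_score(tp=40, fp=10, fn=10))
print(mean_iou(tp=40, fp=10, fn=10, tn=940))
```

With these counts the background IoU is close to 1 while the crack IoU is about 0.67, which shows how averaging the two classes keeps the dominant background from masking errors on the crack class. The functions do not guard against empty classes (zero denominators).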

During the inference process, as the initial values are sampled from a standard normal distribution, each inference result exhibits a certain level of variability. Therefore, this study conducts multiple inferences and averages the results. The variability and impact of multiple inferences will be discussed in the analysis section. In this study, inference is performed 10 times.

The AdamW optimizer is employed to update the network parameters, with an initial learning rate of 1 × 10⁻⁴. The number of diffusion steps T is 500, and the network is trained for 500 epochs. The size of the input images is 256 × 256. No data augmentation was applied to the input data. All algorithms in this study, including the comparative algorithms, were run on two RTX 4090 GPUs.

4.2. Experiment Results

The comparison of the crack segmentation results of CrackDiff and other methods on Crack500 is shown in Table 1 and Figure 4. CrackDiff achieved slightly lower precision and recall than some other models, but it obtained the best scores in terms of the F1 and mIoU metrics. The comparison of crack segmentation results on DeepCrack is shown in Table 2 and Figure 5. CrackDiff obtains the best scores for precision, F1, and mIoU.

The highest F1 and mIoU scores demonstrate that CrackDiff excels in comprehensive performance and strikes a good balance between precision and recall. In the crack segmentation task, which involves an imbalance between positive (cracks) and negative (background) samples, it performs well in handling both the positive and negative classes. The mIoU score indicates that CrackDiff performs well in terms of both localization and classification accuracy.

In Figure 4 and Figure 5, the characteristics of the model that extend beyond the scope of most metrics can be observed. Given the complexity of ground truth annotations in the task of crack segmentation, many cracks cannot be accurately labeled, and their contours are not well represented in the ground truth. Comparing the prediction of CrackDiff with the original image, CrackDiff demonstrates high precision, as evidenced by the sharper boundaries in its segmentation results and its ability to predict less prominent cracks even better than ground truth. This, to some extent, explains why CrackDiff may not exhibit the best precision or recall but achieves the highest F1 and mIoU scores.

Figure 6 shows the confusion matrices for the detection results. From the results, we can observe that pavement crack detection is an imbalanced task, with over 90% of pixels belonging to the background, while crack pixels only account for around 5%. CrackDiff misclassified approximately 2% of the pixels. On DeepCrack, there are very few background pixels classified as cracks, which explains why CrackDiff has the highest precision on DeepCrack.

Furthermore, the formation of cracks often exhibits a certain level of continuity, meaning there is a spatial intrinsic connection between them. In CrackDiff’s prediction results, the cracks demonstrate better continuity and there are few isolated predictions, which indicates that CrackDiff has learned both the shallow and deep features of the cracks.

4.3. Model Analysis

4.3.1. Visualization of Diffusion Steps

Figure 7 depicts CrackDiff’s prediction results at different timesteps t during the sampling process, along with the predictions from the crack prediction branch. In the figure, blue represents 0, and red represents 1. The results show that as sampling proceeds, the pixel values in the crack region gradually tend toward 1, while those in the background region gradually tend toward 0. In the initial stages, the model primarily outlines the approximate contour of the crack. In the last 100 steps, the results change significantly, yielding fine-grained crack segmentation results.

The trend of the crack prediction branch aligns with the diffusion results. In the early stages, the model can predict the approximate contour of the crack. As the diffusion steps progress, the crack prediction results become more refined, and the boundaries become sharper.
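The way intermediate results can be recorded during sampling is sketched below. This is an illustration under stated assumptions, not the CrackDiff implementation: it uses the standard DDPM ancestral sampling update, a linear noise schedule with 1000 steps (both assumed), and an oracle noise estimate computed from a known toy mask in place of the trained denoising network. The function name `ddpm_reverse_with_snapshots` is hypothetical.

```python
import numpy as np

def ddpm_reverse_with_snapshots(x0_target, num_steps=1000, snapshot_every=250, seed=0):
    """Run DDPM ancestral sampling and keep snapshots of the intermediate state.

    Assumptions: linear beta schedule; the noise estimate is an oracle derived
    from `x0_target`, standing in for a trained noise-prediction network.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(x0_target.shape)  # start from Gaussian noise
    snapshots = {}
    for t in range(num_steps - 1, -1, -1):
        # Oracle noise estimate (a real model would predict this from x, t and the image).
        eps_hat = (x - np.sqrt(alpha_bars[t]) * x0_target) / np.sqrt(1.0 - alpha_bars[t])
        # Posterior mean of the DDPM reverse step.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added at the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
        if t % snapshot_every == 0:
            snapshots[t] = x.copy()
    return x, snapshots

# Toy mask: 1 on a horizontal crack, 0 on the background.
target = np.zeros((16, 16))
target[7:9, :] = 1.0

final, snapshots = ddpm_reverse_with_snapshots(target)
for t in sorted(snapshots, reverse=True):
    error = np.mean(np.abs(snapshots[t] - target))
    print(f"t={t}: mean absolute deviation from the target mask = {error:.4f}")
```

With the oracle estimate, the state drifts toward the target mask as t decreases, which mirrors the qualitative behavior described for Figure 7; the exact trajectory of a trained model depends on its schedule and network.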

4.3.2. Ablation Study

To validate the effectiveness of the diffusion model and multi-task architecture in the CrackDiff model, this study compared it with models that only use the backbone for segmentation and models that perform segmentation without a multi-task structure. The results are presented in Table 3.

From Table 3, it can be observed that compared to using the UNet architecture backbone directly for crack segmentation, the diffusion model enhances precision, F1, and mIoU, with a slight decrease in recall. When compared to the single-task diffusion model that only predicts noise, the multi-task diffusion model shows improvements in all metrics, indicating that the task of predicting the ground truth effectively enhances the guidance effect on the original input images.
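The comparison can be made explicit by computing the difference between CrackDiff and each ablated variant. The short script below copies the values from Table 3; the helper name `gain_over` is introduced only for illustration.

```python
# Precision, recall, F1 and mIoU (%) copied from Table 3.
TABLE3 = {
    "Crack500": {
        "Backbone": (76.00, 87.53, 80.20, 83.00),
        "Single task": (80.47, 79.63, 80.05, 80.90),
        "CrackDiff": (81.34, 84.10, 81.78, 84.09),
    },
    "DeepCrack": {
        "Backbone": (79.76, 88.08, 81.82, 84.62),
        "Single task": (85.60, 68.10, 75.85, 84.07),
        "CrackDiff": (91.93, 79.45, 84.06, 86.24),
    },
}
METRICS = ("precision", "recall", "f1", "miou")

def gain_over(dataset, baseline):
    """Return CrackDiff minus `baseline` for each metric, in percentage points."""
    full = TABLE3[dataset]["CrackDiff"]
    base = TABLE3[dataset][baseline]
    return {m: round(f - b, 2) for m, f, b in zip(METRICS, full, base)}

for dataset in TABLE3:
    for baseline in ("Backbone", "Single task"):
        print(dataset, "vs", baseline, gain_over(dataset, baseline))
```

On these numbers, the gain over the single-task model is positive for every metric on both datasets, while the recall difference relative to the backbone is negative (3.43 points on Crack500 and 8.63 points on DeepCrack).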

4.3.3. Impact of the Number of Generations

As noted in the experimental settings, the diffusion model starts sampling from a standard normal distribution, which introduces variability in the generated segmentation results. To mitigate this variability, this study sampled multiple times (10 times) and took the mean of the results. This section discusses the influence of averaging the results of multiple samplings.

The variation in metrics with the number of result accumulations is shown in Figure 8. From the figure, it can be observed that as the number of samplings increases, precision gradually decreases, while recall, F1, and mIoU all show improvement. The results from two samplings are significantly better than those from one sampling, and as the number of samplings increases, the improvement becomes less pronounced. When 10 samplings are performed, F1 and mIoU no longer show improvement. Therefore, by increasing the number of samplings appropriately, the accuracy of the results can be improved.
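The mechanics of accumulating several samplings can be sketched as follows. This is a toy illustration, not the authors’ pipeline: `mock_sample_once` is a hypothetical stand-in for one full diffusion sampling run (it simply adds independent Gaussian noise to a known mask), and the 0.5 threshold applied to the averaged mask is an assumption. The mock therefore illustrates only how averaging suppresses run-to-run noise, not the precision and recall trends in Figure 8.

```python
import numpy as np

def mock_sample_once(clean_mask, rng, noise_std=0.4):
    """Hypothetical stand-in for one sampling run: returns a noisy soft mask in [0, 1]."""
    noisy = clean_mask + rng.normal(0.0, noise_std, size=clean_mask.shape)
    return np.clip(noisy, 0.0, 1.0)

def averaged_prediction(clean_mask, n_samples, rng, threshold=0.5):
    """Average `n_samples` independent samplings and binarize the mean mask."""
    samples = [mock_sample_once(clean_mask, rng) for _ in range(n_samples)]
    mean_mask = np.mean(samples, axis=0)
    return (mean_mask >= threshold).astype(np.uint8)

rng = np.random.default_rng(0)
truth = np.zeros((64, 64))
truth[30:34, :] = 1.0  # toy horizontal crack

for n in (1, 2, 5, 10):
    pred = averaged_prediction(truth, n, rng)
    error_rate = float(np.mean(pred != truth))
    print(f"{n:2d} sampling(s): pixel error rate = {error_rate:.4f}")
```

Each additional sampling requires another complete run of the reverse process, so the accuracy gained from averaging has to be weighed against the added inference time.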

4.3.4. Robustness of CrackDiff

To verify that the multi-task design improves the stability of the diffusion model, this study analyzed the metrics of each individual sampling result (without accumulation) for both the multi-task and single-task structures. The results are presented in Figure 9. The sampling results of the single-task structure vary more in every metric than those of the multi-task structure, indicating that the multi-task structure generates more consistent samples.

For a quantitative depiction of the model’s stability, the standard deviation of each metric was computed, as shown in Table 4. The table agrees with the figure: the standard deviation of each metric for the multi-task structure is significantly smaller than that of the single-task structure. For the multi-task structure, the standard deviation of precision across multiple samplings is 0.112%, and the standard deviations of recall, F1, and mIoU are all less than 0.1%. This indicates that CrackDiff produces reliable results.
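The stability measure can be sketched as the standard deviation of a metric over independent sampling runs. The per-run scores below are made-up placeholders, not the values behind Table 4, and the use of the sample standard deviation (rather than the population form) is an assumption; `metric_spread` is a hypothetical helper name.

```python
import statistics

def metric_spread(per_run_scores):
    """Return the mean and the sample standard deviation of one metric over several runs."""
    return statistics.mean(per_run_scores), statistics.stdev(per_run_scores)

# Placeholder F1 scores (%) for five independent samplings; illustrative only.
single_task_f1 = [79.4, 80.6, 80.1, 79.7, 80.5]
multi_task_f1 = [81.7, 81.8, 81.9, 81.8, 81.7]

for name, scores in (("single-task", single_task_f1), ("multi-task", multi_task_f1)):
    mean, std = metric_spread(scores)
    print(f"{name}: mean F1 = {mean:.2f}, std = {std:.3f}")
```

A smaller standard deviation means that any single sampling is closer to the average result, so fewer samplings are needed to obtain a representative prediction.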

4.3.5. Model Generalization

Model generalization refers to the ability of a model to perform effectively on unseen data beyond the training dataset. It is a critical aspect of model evaluation, ensuring that the model’s performance extends to real-world scenarios. In this section, we use two different pavement datasets that were not used in the previous experiment to evaluate our model generalization.

The CrackForest-Dataset (CFD) [10] comprises 118 pavement images captured using a mobile phone, with each image having a resolution of 480 × 320 pixels. These images contain a notable presence of noisy pixels, including oil spots and water stains. CrackTree [54] consists of 260 pavement images captured using an area-array camera, boasting a higher resolution of 800 × 600 pixels. However, a significant portion of the images in CrackTree exhibit shadows, which poses challenges for accurate crack detection.

The test results for CrackDiff (trained on Crack500) on representative images from CFD and CrackTree are shown in Figure 10 and Figure 11, respectively. Crack500 contains no data affected by road markings or shadows, so these test images challenge the model’s generalization. The results show that CrackDiff performs better on the CFD dataset, with results similar to those on Crack500, and that road markings have only a minor impact. However, its performance is poorer on the CrackTree dataset, indicating that shadows have a significant effect on CrackDiff: cracks in shadows are challenging to detect, and some shadows may be falsely detected as cracks.

5. Discussion

In this section, we discuss the limitations of the model. The first is the model’s inference time. The sampling process of the diffusion model starts from random noise and generates the result step by step; because each step depends on the previous one, the steps cannot be parallelized, which slows inference. Table 5 compares the inference time of the proposed method with that of the comparison methods. CrackDiff requires significantly more time (892.57 s on Crack500, versus at most 30.83 s for the other methods). Accelerating the sampling of diffusion models, as in DDIM and IDDPM, is currently a key research focus. In future research, we will investigate the performance of these faster diffusion models in pavement crack detection tasks.
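The size of the gap can be quantified directly from Table 5. The script below copies the reported times and computes how many times longer CrackDiff takes than each comparison method; the helper name `slowdown` is introduced only for illustration.

```python
# GPU inference time in seconds (Crack500, DeepCrack), copied from Table 5.
INFERENCE_TIME_S = {
    "UNet": (15.01, 4.15),
    "DeepLabV3+": (16.19, 4.55),
    "MFPANet": (30.83, 8.60),
    "DeepCrack": (18.76, 5.40),
    "SegFormer": (26.48, 7.37),
    "PSPNet": (7.90, 2.66),
    "CrackDiff": (892.57, 243.88),
}

def slowdown(method, dataset_index):
    """Ratio of CrackDiff's inference time to that of `method` (0 = Crack500, 1 = DeepCrack)."""
    return INFERENCE_TIME_S["CrackDiff"][dataset_index] / INFERENCE_TIME_S[method][dataset_index]

for name in INFERENCE_TIME_S:
    if name != "CrackDiff":
        print(f"{name}: {slowdown(name, 0):.1f}x on Crack500, {slowdown(name, 1):.1f}x on DeepCrack")
```

On Crack500, the ratio ranges from roughly 29 (relative to MFPANet) to roughly 113 (relative to PSPNet).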

Another limitation of CrackDiff is inherent to 2D visible light images themselves. A significant limitation of visible light images is their sensitivity to lighting conditions. For instance, in model generalization analysis, it becomes challenging for models to detect cracks within shadows. Additionally, variations in pavement textures can affect the accuracy of crack detection algorithms. To address these issues, some methods involving the use or fusion of other sensors have been proposed, such as employing infrared thermography [2] and LiDAR scanning [3]. Compared to 2D cameras, 3D laser scanning can capture the depth information for the cracks, providing additional metrics for assessing structural conditions. The detection system based on infrared thermography utilizes the contrasting reflectivity and surface temperature characteristics between cracks and intact pavement, resulting in a pronounced contrast in the infrared images of the two targets. These approaches can capture more useful information even under poor lighting conditions. We will also explore how to leverage them to enhance the effectiveness of pavement crack detection.

6. Conclusions

This paper introduces a diffusion model structure, CrackDiff, which is the first application of the diffusion model for pavement crack detection. In this architecture, crack images are employed as a guide for the sampling process, and a multi-task UNet framework is utilized to predict crack truth and noise simultaneously. Robust crack segmentation results are obtained by sampling random noise. Compared to the other models, CrackDiff excels in capturing both shallow and deep image features. Through comparisons with advanced algorithms on the Crack500 and DeepCrack datasets, CrackDiff achieves superior performance.

The paper also proposes insights regarding conditional diffusion models. In comparison to single-task network structures, multi-task structures exhibit stronger robustness and can produce results that align with conditions. However, the diffusion model faces challenges such as slow inference speed and difficulty in detecting cracks within shadows. Further research is needed to enhance the inference speed of diffusion models and explore broader applications.

The code will be made publicly accessible. Please refer to https://github.com/zhysess/CrackDiff for the address (accessed on April 2024).

Author Contributions

Conceptualization, H.Z. and N.C.; methodology, H.Z. and N.C.; software, H.Z. and N.C.; validation, H.Z. and N.C.; formal analysis, H.Z.; investigation, H.Z.; resources, H.Z.; data curation, H.Z. and N.C.; writing—original draft preparation, H.Z. and N.C.; writing—review and editing, H.Z., N.C. and M.L.; visualization, H.Z. and N.C.; supervision, M.L.; project administration, M.L.; funding acquisition, S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (2022YFC3004701).

Data Availability Statement

Data will be made publicly accessible at https://github.com/zhysess/CrackDiff (accessed on April 2024).

Acknowledgments

This work was supported by the National Key R&D Program of China (2022YFC3004701). The authors express their appreciation to the editor and anonymous reviewers for their insightful recommendations, which significantly contributed to enhancing the initial manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1525–1535.
2. Yang, J.; Wang, W.; Lin, G.; Li, Q.; Sun, Y.; Sun, Y. Infrared Thermal Imaging-Based Crack Detection Using Deep Learning. IEEE Access 2019, 7, 182060–182077.
3. Li, Q.; Zhang, D.; Zou, Q.; Lin, H. 3D laser imaging and sparse points grouping for pavement crack detection. In Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017; pp. 2036–2040.
4. Kheradmandi, N.; Mehranfar, V. A critical review and comparative study on image segmentation-based techniques for pavement crack detection. Constr. Build. Mater. 2022, 321, 126162.
5. Zhao, H.; Qin, G.; Wang, X. Improvement of canny algorithm based on pavement edge detection. In Proceedings of the 2010 3rd International Congress on Image and Signal Processing, Yantai, China, 16–18 October 2010; Volume 2, pp. 964–967.
6. Hoang, N.D.; Nguyen, Q.L. Metaheuristic optimized edge detection for recognition of concrete wall cracks: A comparative study on the performances of roberts, prewitt, canny, and sobel algorithms. Adv. Civ. Eng. 2018, 2018, 7163580.
7. Zhang, Y.; Lu, Y.; Duan, Y.; Wei, D.; Zhu, X.; Zhang, B.; Pang, B. Robust surface crack detection with structure line guidance. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103527.
8. Lin, J.; Liu, Y. Potholes detection based on SVM in the pavement distress image. In Proceedings of the International Symposium DCABES, Hong Kong, China, 10–12 August 2010; pp. 544–547.
9. O’Byrne, M.; Schoefs, F.; Ghosh, B.; Pakrashi, V. Texture analysis based damage detection of ageing infrastructural elements. Comput.-Aided Civ. Infrastruct. Eng. 2013, 28, 162–177.
10. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic road crack detection using random structured forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445.
11. David Jenkins, M.; Carr, T.A.; Iglesias, M.I.; Buggy, T.; Morison, G. A Deep Convolutional Neural Network for Semantic Pixel-Wise Segmentation of Road and Pavement Surface Cracks. In Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), Roma, Italy, 3–7 September 2018; pp. 2120–2124.
12. Liu, Y.; Yao, J.; Lu, X.; Xie, R.; Li, L. DeepCrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 2019, 338, 139–153.
13. Alipour, M.; Harris, D.K.; Miller, G.R. Robust pixel-level crack detection using deep fully convolutional neural networks. J. Comput. Civ. Eng. 2019, 33, 04019040.
14. Chen, T.; Cai, Z.; Zhao, X.; Chen, C.; Liang, X.; Zou, T.; Wang, P. Pavement crack detection and recognition using the architecture of segNet. J. Ind. Inf. Integr. 2020, 18, 100144.
15. Sun, X.; Xie, Y.; Jiang, L.; Cao, Y.; Liu, B. DMA-Net: DeepLab with multi-scale attention for pavement crack segmentation. IEEE Trans. Intell. Transp. Syst. 2022, 23, 18392–18403.
16. Ji, A.; Xue, X.; Wang, Y.; Luo, X.; Xue, W. An integrated approach to automatic pixel-level crack detection and quantification of asphalt pavement. Autom. Constr. 2020, 114, 103176.
17. Jiang, X.; Mao, S.; Li, M.; Liu, H.; Zhang, H.; Fang, S.; Yuan, M.; Zhang, C. MFPA-Net: An efficient deep learning network for automatic ground fissures extraction in UAV images of the coal mining area. Int. J. Appl. Earth Obs. Geoinf. 2022, 114, 103039.
18. Xiao, S.; Shang, K.; Lin, K.; Wu, Q.; Gu, H.; Zhang, Z. Pavement crack detection with hybrid-window attentive vision transformers. Int. J. Appl. Earth Obs. Geoinf. 2023, 116, 103172.
19. Ali, R.; Chuah, J.H.; Talip, M.S.A.; Mokhtar, N.; Shoaib, M.A. Structural crack detection using deep convolutional neural networks. Autom. Constr. 2022, 133, 103989.
20. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 6–12 December 2020.
21. Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65.
22. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
23. Han, L.; Zhao, Y.; Lv, H.; Zhang, Y.; Liu, H.; Bi, G.; Han, Q. Enhancing Remote Sensing Image Super-Resolution with Efficient Hybrid Conditional Diffusion Model. Remote Sens. 2023, 15, 3452.
24. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241.
25. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
26. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the Computer Vision – ECCV 2018: 15th European Conference, Berlin/Heidelberg, Germany, 8–14 September 2018; pp. 833–851.
27. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
28. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 21–26 July 2017; pp. 6230–6239.
29. Ren, M.; Zhang, X.; Chen, X.; Zhou, B.; Feng, Z. YOLOv5s-M: A deep learning network model for road pavement damage detection from urban street-view imagery. Int. J. Appl. Earth Obs. Geoinf. 2023, 120, 103335.
30. Song, W.; Jia, G.; Zhu, H.; Jia, D.; Gao, L. Automated pavement crack damage detection using deep multiscale convolutional features. J. Adv. Transp. 2020, 2020, 6412562.
31. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364.
32. Pan, Y.; Zhang, G.; Zhang, L. A spatial-channel hierarchical deep learning network for pixel-level automated crack detection. Autom. Constr. 2020, 119, 103357.
33. Cui, X.; Wang, Q.; Dai, J.; Xue, Y.; Duan, Y. Intelligent crack detection based on attention mechanism in convolution neural network. Adv. Struct. Eng. 2021, 24, 1859–1868.
34. Zhu, W.; Zhang, H.; Eastwood, J.; Qi, X.; Jia, J.; Cao, Y. Concrete crack detection using lightweight attention feature fusion single shot multibox detector. Knowl.-Based Syst. 2023, 261, 110216.
35. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Proc. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
36. Liu, H.; Miao, X.; Mertz, C.; Xu, C.; Kong, H. Crackformer: Transformer network for fine-grained crack detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 3763–3772.
37. Xu, Z.; Guan, H.; Kang, J.; Lei, X.; Ma, L.; Yu, Y.; Chen, Y.; Li, J. Pavement crack detection from CCD images with a locally enhanced transformer network. Int. J. Appl. Earth Obs. Geoinf. 2022, 110, 102825.
38. Tao, H.; Liu, B.; Cui, J.; Zhang, H. A Convolutional-Transformer Network for Crack Segmentation with Boundary Awareness. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 9–12 October 2023; pp. 86–90.
39. Zhang, K.; Zhang, Y.; Cheng, H.D. CrackGAN: Pavement Crack Detection Using Partially Accurate Ground Truths Based on Generative Adversarial Learning. IEEE Trans. Intell. Transp. Syst. 2021, 22, 1306–1319.
40. Liu, Y.; Gao, W.; Zhao, T.; Wang, Z.; Wang, Z. A Rapid Bridge Crack Detection Method Based on Deep Learning. Appl. Sci. 2023, 13, 9878.
41. Kyslytsyna, A.; Xia, K.; Kislitsyn, A.; Abd El Kader, I.; Wu, Y. Road Surface Crack Detection Method Based on Conditional Generative Adversarial Networks. Sensors 2021, 21, 7405.
42. Song, Y.; Durkan, C.; Murray, I.; Ermon, S. Maximum likelihood training of score-based diffusion models. Adv. Neural Inf. Process. Syst. 2021, 34, 1415–1428.
43. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456.
44. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502.
45. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 139, pp. 8162–8171.
46. Ranzato, M.; Beygelzimer, A.; Dauphin, Y.; Liang, P.; Vaughan, J.W. D2C: Diffusion-decoding models for few-shot conditional generation. Adv. Neural Inf. Process. Syst. 2021, 34, 12533–12548.
47. Peebles, W.; Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 4195–4205.
48. Li, H.; Yang, Y.; Chang, M.; Chen, S.; Feng, H.; Xu, Z.; Li, Q.; Chen, Y. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing 2022, 479, 47–59.
49. Amit, T.; Shaharbany, T.; Nachmani, E.; Wolf, L. Segdiff: Image segmentation with diffusion probabilistic models. arXiv 2021, arXiv:2112.00390.
50. Saharia, C.; Chan, W.; Chang, H.; Lee, C.; Ho, J.; Salimans, T.; Fleet, D.; Norouzi, M. Palette: Image-to-image diffusion models. In Proceedings of the ACM SIGGRAPH 2022 Conference Proceedings, Vancouver, BC, Canada, 7–11 August 2022; pp. 1–10.
51. Meng, Q.; Shi, W.; Li, S.; Zhang, L. PanDiff: A Novel Pansharpening Method Based on Denoising Diffusion Probabilistic Model. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5611317.
52. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual Dense Network for Image Super-Resolution. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2472–2481.
53. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803.
54. Zou, Q.; Cao, Y.; Li, Q.; Mao, Q.; Wang, S. CrackTree: Automatic crack detection from pavement images. Pattern Recognit. Lett. 2012, 33, 227–238.

Figure 1. Overall framework of CrackDiff. The framework consists of a forward diffusion process and a backward denoising process. At any timestep t in the denoising process, a multi-task UNet network is employed for the simultaneous prediction of mask and noise. The network takes the sum of the embedding of the current estimation $GT_{t}$ and the crack image as inputs and embeds timestep t into every residual block. Based on the predicted noise, the sampled result $GT_{t-1}$ is obtained.

Figure 2. Denoising network structure of CrackDiff. (a) Overall network architecture. Taking the crack image and $GT_{t}$ as inputs, the network outputs predictions for crack and noise. Skip connections exist between the encoding layers and decoding layers and between the decoding layers. The time step t is embedded into various residual blocks of the network. (b) Image encoding RRDB structure, featuring residual connections between each convolutional layer.

Figure 3. Time embedding and residual block structure. The time step t is encoded into a fixed-length vector through a lookup table and linear layers. It is then added to the intermediate results within the residual block.

Figure 4. Crack extraction results for Crack500 dataset. Images from left to right are original photos, ground truth, results of UNet [24], HRNet [31], SegFormer [35], SegDiff [49], and CrackDiff. The red box indicates results where CrackDiff differs from the comparison algorithms.

Figure 5. Crack extraction results for DeepCrack dataset. Images from left to right are original photos, ground truth, results of UNet [24], HRNet [31], SegFormer [35], SegDiff [49], and CrackDiff. The red box indicates results where CrackDiff differs from the comparison algorithms.

Figure 6. Confusion matrices (percentage) for detection results. (a) Result for Crack500. (b) Result for DeepCrack.

Figure 7. Prediction of CrackDiff at different timesteps T. The results in the top row are the output of the diffusion process, while the results in the bottom row are the predictions from the model’s crack prediction branch.

Figure 8. Variation in metrics with the number of result accumulations for Crack500.

Figure 9. Comparison of multiple sampling results between multi-task structure and single-task structure for Crack500. (a) Comparison of precision (pr) and recall (re). (b) Comparison of F1 and mIoU.

Figure 10. The test results for CrackDiff (trained on Crack500) on representative images from CFD. The images from top to bottom are pavement images, ground truth, and predicted results.

Figure 11. The test results for CrackDiff (trained on Crack500) on representative images from CrackTree. The images from top to bottom are pavement images, ground truth, and predicted results.

Table 1. Comparison between CrackDiff and alternative methods on the Crack500 dataset.

Method	Precision	Recall	F1	mIoU
UNet [24]	76.00	87.53	80.20	83.00
FCN [25]	77.58	84.17	79.67	82.64
DeepLabV3+ [26]	77.99	85.88	80.72	83.39
PSPNet [28]	76.01	73.45	72.93	78.01
HRNet [31]	85.04	74.83	78.49	81.56
MFPANet [17]	75.73	86.88	80.92	82.83
SegFormer [35]	85.89	78.00	80.84	83.32
CrackFormer [36]	84.06	79.19	81.55	82.47
CT-crackseg [38]	68.51	76.37	72.14	74.89
DeepCrack [12]	81.17	78.97	78.21	81.66
SegDiff [49]	80.47	79.63	80.05	80.90
CrackDiff (ours)	81.34	84.10	81.78	84.09

Table 2. Comparison between CrackDiff and alternative methods on the DeepCrack dataset.

Method	Precision	Recall	F1	mIoU
UNet [24]	79.76	88.08	81.82	84.62
FCN [25]	78.68	82.08	77.90	82.02
DeepLabV3+ [26]	68.93	88.56	74.98	80.10
PSPNet [28]	63.93	48.23	51.46	66.84
HRNet [31]	89.58	68.51	78.42	82.13
MFPANet [17]	81.82	85.67	83.70	84.62
SegFormer [35]	91.79	71.09	79.42	82.56
CrackFormer [36]	88.32	78.11	82.90	85.46
CT-crackseg [38]	87.43	79.87	83.48	85.83
DeepCrack [12]	81.56	84.02	80.49	83.85
SegDiff [49]	85.60	68.10	75.85	84.07
CrackDiff (ours)	91.93	79.45	84.06	86.24

Table 3. Results of ablation study.

	Method	Precision	Recall	F1	mIoU
Crack500	Backbone	76.00	87.53	80.20	83.00
	Single task	80.47	79.63	80.05	80.90
	CrackDiff	81.34	84.10	81.78	84.09
DeepCrack	Backbone	79.76	88.08	81.82	84.62
	Single task	85.60	68.10	75.85	84.07
	CrackDiff	91.93	79.45	84.06	86.24

Table 4. Standard deviations (%) of each metric over multiple generations. Smaller is better.

Structure	Precision	Recall	F1	mIoU
Single	0.514	0.630	0.508	0.289
Multi	0.112	0.084	0.095	0.067

Table 5. GPU inference time (s) for CrackDiff and comparison methods.

Method	Crack500	DeepCrack
UNet [24]	15.01	4.15
DeepLabV3+ [26]	16.19	4.55
MFPANet [17]	30.83	8.60
DeepCrack [12]	18.76	5.40
SegFormer [35]	26.48	7.37
PSPNet [28]	7.90	2.66
CrackDiff	892.57	243.88


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

