Article

CrackCLIP: Adapting Vision-Language Models for Weakly Supervised Crack Segmentation

1 Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education, Beijing 100044, China
2 Frontiers Science Center for Smart High-Speed Railway System, Beijing Jiaotong University, Beijing 100044, China
3 Department of Computer Science, Aalborg University, 9200 Aalborg, Denmark
* Author to whom correspondence should be addressed.
Entropy 2025, 27(2), 127; https://doi.org/10.3390/e27020127
Submission received: 16 December 2024 / Revised: 22 January 2025 / Accepted: 24 January 2025 / Published: 25 January 2025

Abstract

Weakly supervised crack segmentation aims to create pixel-level crack masks with minimal human annotation, which often only distinguishes crack patches from crack-free ones. This task is crucial for assessing structural integrity and safety in real-world industrial applications, where manually labeling the location of cracks at the pixel level is both labor-intensive and impractical. To address the uncertainty inherent in such weak labels, this paper presents CrackCLIP, a novel approach that leverages language prompts to augment the semantic context and employs the Contrastive Language–Image Pre-Training (CLIP) model to enhance weakly supervised crack segmentation. Initially, a gradient-based class activation map is used to generate coarse pixel-level pseudo-labels from a trained crack patch classifier. The estimated coarse pseudo-labels are then used to fine-tune additional linear adapters, which are integrated into the frozen image encoder of CLIP to adapt the model to the specialized task of crack segmentation. Moreover, specific textual prompts are crafted for crack characteristics, which are input into the frozen text encoder of CLIP to extract features encapsulating the semantic essence of the cracks. The final crack segmentation is obtained by comparing the similarity between text prompt features and visual patch token features. Comparative experiments on the Crack500, CFD, and DeepCrack datasets demonstrate that the proposed framework outperforms existing weakly supervised crack segmentation methods, and that the pre-trained vision-language model exhibits strong potential for crack feature learning, thereby enhancing the overall performance and generalization capability of the framework.

1. Introduction

Crack segmentation is a specialized application within the field of semantic segmentation, aimed at generating binary masks at the pixel level to identify and outline cracks [1,2,3,4]. This technique is essential for various critical applications, including crack detection in pavements [5,6,7], crack extraction from concrete surfaces [8], road pattern recognition from aerial imagery [9], and blood vessel segmentation in medical diagnostics [10].
Visually, cracks manifest as linear topologies with higher pixel intensity relative to the surrounding background pixels. However, they often exhibit poor continuity and low contrast against the background due to noise interference [11,12,13]. Over the past few years, deep learning-based models have emerged as a leading approach for crack detection, significantly enhancing detection capabilities across diverse scenarios [14]. However, these models heavily rely on extensive manual annotated datasets, which necessitate laborious pixel-level labeling, particularly for small and intricate cracks [15].
To mitigate the reliance on extensive manual annotations, weakly supervised learning methods are proposed for crack image segmentation [15,16,17,18,19]. These methods typically employ binary labels to annotate image patches as either containing cracks or not. Currently, two-stage weakly supervised crack segmentation methods, which include the generation of crack pixel pseudo-labels and the training of segmentation models, achieve superior performance. The methods [15,16] focus on enhancing the accuracy of pseudo-labels through various post-processing operations to refine the initial pixel-level pseudo-labels. Moreover, the approaches [17,18,19] emphasize continuously optimizing the reliability of pseudo-labels during the training process of the segmentation network. Nevertheless, the methods for generating pseudo-labels using only such binary annotations are insufficient to address the significant challenges arising from complex topological structures, irregular crack edges, and low-contrast backgrounds.
Beyond the binary annotation of image patches, we propose that incorporating textual language can introduce additional contextual information. By providing generalized descriptions of the inherent topological features of cracks, textual language not only captures the unique morphological characteristics of cracks but also enhances their generalization across different scenarios. Additionally, by providing a detailed description of the differences between cracks and the surrounding normal background, the characteristics of cracks in images can be highlighted more clearly, thereby significantly improving the accuracy and reliability of crack detection. To leverage textual descriptions to provide additional supervisory information for the crack segmentation task, the Contrastive Language–Image Pre-training (CLIP) [20] model is introduced to jointly train crack images and their corresponding text representations through contrastive learning. This approach enhances the model’s ability to understand and identify crack features. However, CLIP primarily focuses on global alignment between images and text rather than fine-grained pixel-level alignment [21].
In this paper, we propose CrackCLIP, a weakly supervised crack segmentation method that adapts vision-language models to the crack domain. The method leverages the generalization ability of CLIP, a vision-language model, to learn expressive representations that capture broad concepts. The proposed model comprises two main phases: pixel-level pseudo-label generation and CLIP-based crack segmentation. In the pixel-level pseudo-label generation phase, a patch-based crack classifier is trained and crack pixel pseudo-labels are generated using Gradient-based Class Activation Mapping (Grad-CAM) [22]. In the crack segmentation phase, the CLIP-based vision-language model is employed to align crack pixels with crack compositional text prompts, achieving precise crack pixel segmentation. The frozen CLIP model provides a robust foundation for feature extraction, while additional linear layers are fine-tuned using the generated pixel-level pseudo-labels to ensure that the model effectively captures the specific characteristics of crack images. Finally, the crack text features are aligned with the features of the crack image patches to achieve accurate crack image segmentation. To summarize, our main contributions are as follows:
(1) We propose a CLIP-based model, CrackCLIP, for weakly supervised crack image segmentation. The model leverages a pre-trained CLIP with frozen parameters to extract features from both text prompts and crack images, thereby enhancing its generalization capability.
(2) Additional linear layers are introduced to adaptively train the model on crack images, achieving alignment between crack pixels and text prompts. Furthermore, textual prompts for cracks are designed based on their apparent and topological features, enabling the text encoder to capture rich semantic information and improve generalization across diverse crack scenarios.
(3) The proposed weakly supervised method achieves competitive results on the Crack500 [23], CFD [24], and DeepCrack [6] datasets. Notably, the method demonstrates excellent generalization capabilities when trained on the Crack500 training set and tested on the Crack500 testing set, as well as on the CFD and DeepCrack datasets.

2. Related Work

2.1. Weakly Supervised Crack Segmentation Methods

In recent years, the field of automatic crack detection has witnessed significant advancements, particularly in the realm of weakly supervised learning for crack image segmentation [25,26,27,28]. Within this domain, image-level labels are predominantly utilized in weakly supervised segmentation approaches.
The prevailing methodologies, which are based on image-level labels, typically adopt a two-stage framework encompassing the derivation of crack pixel-level pseudo-labels and the subsequent training of segmentation models. In the derivation of pseudo-labels, a common approach is to employ image-level labels to train classifiers for crack detection, which then generate class activation maps as preliminary crack pixel-level pseudo-labels [29]. König et al. [15] introduced an innovative thresholding technique to refine these initial pseudo-labels, thereby enhancing the precision of crack segmentation models. While such classifiers provide a coarse yet effective localization of cracks and mitigate background noise, they may fail to capture subtle crack details. Dong et al. [16] advanced this approach by integrating conditional random fields to post-process the initial pseudo-labels, thereby improving the quality of pixel-level pseudo-labels. Wang et al. [30] proposed Crack-CAM, a pixel-level weakly supervised segmentation method that leverages clustering within Convolutional Neural Networks (CNNs) to accentuate crack features and elevate the fidelity of pseudo-labels. These methodologies are primarily aimed at enhancing the quality of crack pixel pseudo-labels.
Beyond pseudo-label refinement, existing weakly supervised segmentation techniques have also focused on optimizing the training process of segmentation models. Al-Huda et al. [18] proposed a multi-scale class activation map approach to enhance the completeness of initial pseudo-labels and introduced an Incremental Annotation Refinement module to progressively improve these pseudo-labels. In another contribution, Al-Huda et al. [17] combined class activation maps from CNN classifiers with features extracted by the encoder in the segmentation model, feeding the amalgamated features into the decoder to bolster crack segmentation quality.
Similar to the methods above, our proposed method leverages the advantage of Grad-CAM in generating initial pseudo-labels, which effectively provide a coarse representation of crack regions within an image. In contrast to these methods, however, the initial pseudo-labels are further enhanced with natural-language descriptions of crack appearance. By integrating a vision-language model, we capture the intricate relationships between the visual features of cracks and the corresponding semantic features in the text, leveraging the robust representational capabilities of large-scale, pre-trained vision-language frameworks, which have demonstrated exceptional performance in understanding and processing multi-modal data. This dual-modal strategy extends beyond traditional crack detection by not only enriching the feature space for crack detection but also significantly improving segmentation accuracy, since it offers a more holistic understanding of the crack context, which is crucial for precise image segmentation.

2.2. Vision-Language Modeling Methods

Recent advancements in large pre-trained vision-language models have been notably successful, with Contrastive Language–Image Pre-Training (CLIP) [20] standing out for its ability to model the relevance between images and text. This model enhances understanding by aligning visual content with textual descriptions.
This model has been instrumental in image segmentation, particularly in the Referring Image Segmentation (RIS) setting, which uses textual prompts to identify and segment objects. Liu et al. [31] introduced a weakly supervised RIS method leveraging CLIP, employing a bilateral prompt strategy that includes target and background text prompts. This strategy effectively bridges the domain gap between visual and linguistic features, enabling comprehensive target activation for accurate pixel-level pseudo-labeling. Rao et al. [32] proposed a dense prediction framework that transforms the image–text problem in CLIP into a pixel–text problem. By utilizing pixel–text scores, their model guides dense predictions while enhancing vision-language alignment through contextual image information.
In image anomaly segmentation, CLIP leverages natural language to address the paucity of diverse anomaly samples and precise annotations. Jeong et al. [33] presented WinCLIP, a few-shot/zero-shot anomaly detection approach that employs a multi-scale window-based CLIP model to pinpoint anomaly locations based on textual descriptions. Chen et al. [34] proposed a zero-shot anomaly detection model based on CLIP, which incorporates additional linear layers within the image encoder to map image features to a joint embedding space for anomaly detection.
Inspired by the methods above, this paper presents a method to integrate CLIP into the task of weakly supervised crack segmentation. To adapt the CLIP model, which is pre-trained on natural images, to the specific characteristics of industrial crack images, we devised a strategy incorporating customized textual prompts for cracks and a linear adapter module. This linear adapter is integrated into the frozen image encoder of CLIP, allowing us to fine-tune the adapter using pseudo-labels. This refinement enables CLIP to better adapt to the distinct features of crack images, enhancing segmentation performance and accuracy.

3. Methodology

3.1. Approach Overview

We propose CrackCLIP, a novel approach that adapts a vision-language model to the task of weakly supervised crack segmentation. Our approach, as illustrated in Figure 1, integrates pixel-level pseudo-labels with textual prompts to enhance the segmentation of cracks. The framework is divided into two main phases: (a) pixel-level pseudo-label generation and (b) CLIP-based crack segmentation.
The first phase, depicted in Figure 1a, is designed to generate the initial pixel-level pseudo-labels. With patch-level labels indicating the presence or absence of cracks, Gradient-based Class Activation Mapping (Grad-CAM) is employed to highlight potential crack regions. Specifically, the process begins with a CNN classifier that is trained to recognize cracks within image patches. By computing the gradients of the classifier’s output with respect to the feature maps and using them to weight the importance of each feature map, Grad-CAM produces a class activation map. This map highlights areas crucial for crack identification and is up-scaled to the original image size to provide a preliminary segmentation.
The CLIP-based crack segmentation process, shown in Figure 1b, is comprised of two key branches: a text encoder and an image encoder. The text encoder processes textual prompts that describe the state of the pavement, such as “perfect”, “flawless”, “narrow break and opening”, and “dark curve”. These textual descriptions are designed to guide the model in understanding the context of the cracks and overcome the insufficiency and uncertainty of the weak supervision. The image encoder of CLIP is employed to extract visual features from the input image and is frozen without further fine-tuning. The linear adapter modules, as indicated in the diagram, are added to the image encoder to adapt to the specific features of crack images. These linear layers are optimized through a segmentation loss function, which fits the model’s output with the initial pixel pseudo-labels, thereby enhancing the precision of the crack segmentation.

3.2. Crack Pseudo-Label Generation

Following Grad-CAM [22], this method generates crack activation maps as initial pixel pseudo-labels for the cracks in the image. A dataset $D = \{(x, y) \mid y \in \{0, 1\}\}$ of crack image patches, including patches with cracks ($y = 1$) and patches without cracks ($y = 0$), is used to train a patch-based crack classifier. For a given input image x and corresponding image-level label y, the classification network first embeds x into a high-level feature map $Z \in \mathbb{R}^{C \times H \times W}$, where C and $H \times W$ denote the number of channels and the spatial dimensions, respectively. A global average pooling layer and a $1 \times 1$ convolution layer with a learnable matrix $A \in \mathbb{R}^{C \times N}$, where N denotes the number of classes, are then applied to Z to obtain the prediction result $\hat{y} \in [0, 1]$. The cross-entropy loss is used to train the crack classifier.
Given the trained image patch classification network, the high-level feature map Z is weighted by the parameters A to generate the initial crack pixel activation map M, which is denoted as:
$$M(h, w) = A^{T} Z(h, w),$$
where $Z(h, w)$ represents the feature vector located at position $(h, w)$.
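For illustration, the following is a minimal PyTorch sketch of this CAM-style weighting, assuming a `backbone` that returns the feature map Z and a classifier head whose weight matrix A has the crack class in its second column; these handles, the class index, and the normalization step are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def crack_activation_map(backbone, head_weight, image):
    """Sketch of the activation map M(h, w) = A^T Z(h, w) for the crack class.

    backbone    -- feature extractor returning Z of shape (1, C, H, W)  (assumed handle)
    head_weight -- classifier weight A of shape (C, N); column 1 assumed to be 'crack'
    image       -- input patch tensor of shape (1, 3, H_img, W_img)
    """
    with torch.no_grad():
        z = backbone(image)                          # Z: (1, C, H, W)
    a_crack = head_weight[:, 1]                      # weights of the crack class, (C,)
    m = torch.einsum("c,bchw->bhw", a_crack, z)      # per-location weighted sum
    m = F.relu(m)                                    # keep positively contributing regions
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)   # normalise to [0, 1]
    # up-sample to the original resolution to obtain the coarse pseudo-label
    m = F.interpolate(m.unsqueeze(1), size=image.shape[-2:],
                      mode="bilinear", align_corners=False)
    return m.squeeze(1)                              # (1, H_img, W_img)
```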
While Grad-CAM effectively identifies key regions for crack detection, it may overlook less prominent crack pixels. Given the elongated nature of cracks, the up-sampling process used to generate the crack activation map can inadvertently activate surrounding pixels, potentially reducing the precision of the pseudo-labels. To address this, our method adopts a post-processing strategy, as detailed in [15,16], which refines the pseudo-labels and accurately delineates fine crack pixels.

3.3. Crack Segmentation with Vision-Language Alignment

Crack Compositional Text Prompts. This section provides text prompts for crack images, which describe the semantic information of the cracks and the background. CrackCLIP defines these two pieces of semantic information using a crack compositional text prompting strategy. Specifically, CrackCLIP employs a combination of predefined text templates and state words to describe a given image, rather than freely written definitions [34]. For example, given an image of a pavement with cracks, the image category is considered to be “pavement”, i.e., the “o” in Figure 2 can be described as “pavement”. The text template is “a photo of a {}”, and the state word for the image is “{} with narrow break and opening”, which together are expressed as “a photo of a {pavement with narrow break and opening}”. The state words include not only the common states of the target in natural images, such as a normal background described as “flawless” and cracks described as “damaged”, but also textual descriptions defined based on the apparent and topological features of the cracks, e.g., cracks denoted as “narrow break and opening”. Finally, the full list of crack text prompt templates is shown in Figure 2; a small sketch of how such prompts can be composed follows.
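The snippet below composes prompts from templates and state words in the manner described above; the template and state-word lists here are abbreviated examples rather than the exact lists used in the paper (those are given in Figure 2).

```python
# Abbreviated example templates and state words; the full lists are in Figure 2.
TEMPLATES = ["a photo of a {}", "a close-up photo of a {}"]
NORMAL_STATES = ["flawless {}", "perfect {}"]
CRACK_STATES = ["damaged {}", "{} with narrow break and opening", "{} with dark curve"]

def build_prompts(obj: str = "pavement"):
    """Combine templates and state words into normal/crack prompt lists."""
    normal = [t.format(s.format(obj)) for t in TEMPLATES for s in NORMAL_STATES]
    crack = [t.format(s.format(obj)) for t in TEMPLATES for s in CRACK_STATES]
    return normal, crack

normal_prompts, crack_prompts = build_prompts("pavement")
# e.g. "a photo of a pavement with narrow break and opening"
```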
Crack Segmentation. The CLIP-based weakly supervised crack segmentation framework is depicted in Figure 1b. This framework is composed of two main branches: a text encoder and an image encoder, which are designed to work in concert to assess the similarity between textual and visual data. In the text feature learning process, the template is combined with the state of the object to construct a crack text description. The text description is encoded by the text encoder of the CLIP model, and the encoded text features are represented as $F_t \in \mathbb{R}^{N \times C}$, where N denotes the number of categories of the image, and C denotes the number of channels of the text feature.
The frozen CLIP image encoder is utilized to learn the image feature representation. However, these features are not fully suited to the crack image scenario and cannot be directly compared with the crack text features. Therefore, this paper follows the approach in [34] by adding additional linear layers to learn specific crack image features and then compares these features with the text features. The image encoder of CLIP in Figure 1 uses the Vision Transformer (ViT) framework [35], where all feature layers are divided into four stages. A linear layer is applied after each stage to map the output features into the joint embedding space, which is learned based on both the pre-trained data and the crack data. The joint feature $\hat{F}_v$ is represented as:
$$\hat{F}_v = k F_v + b,$$
where $F_v$ denotes the crack patch token features extracted by the pre-trained CLIP image encoder, and k and b denote the weights and biases of the linear layer, respectively.
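A minimal PyTorch sketch of such a per-stage adapter is given below; the class name and dimensions are illustrative, and the actual CrackCLIP implementation may differ.

```python
import torch
import torch.nn as nn

class LinearAdapter(nn.Module):
    """Per-stage linear adapter implementing the joint feature F_v_hat = k F_v + b."""

    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)   # learnable weights k and bias b

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, L, in_dim) from one stage of the frozen CLIP ViT
        return self.proj(patch_tokens)             # mapped into the joint embedding space
```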
Since the CLIP model is primarily designed for classification tasks, it cannot be directly applied to pixel-level classification. Our objective is to achieve text-to-pixel alignment. Specifically, the CLIP model predicts the probability of a pixel belonging to a crack by computing the similarity between the features of patch tokens and the text features. The image encoder’s network architecture is divided into Q stages (Q is set to 4 in this work), and the joint feature of the patch tokens at the i-th stage is denoted as $\hat{F}_v^{i}$. To obtain the final crack segmentation map, the proposed CrackCLIP method takes the similarity between the joint image feature $\hat{F}_v^{i}$ and the crack text feature $F_t$ as the probability of belonging to the crack class. Following the predictive outcomes of crack detection at the various stages, the framework employs a multi-scale strategy to effectively integrate shallow and deep features, thereby generating the final crack segmentation map S, which is represented as:
$$S = \sum_{i=1}^{Q} \operatorname{softmax}\!\left( \hat{F}_v^{i} \, F_t^{T} \right).$$
To ensure that the segmentation image matches the dimensions of the original input image, the method applies an up-sampling process to the final segmentation map.
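To make this concrete, here is a minimal sketch of the per-stage similarity, multi-scale summation, and up-sampling; feature L2-normalisation, the omission of CLIP's temperature scaling, and the choice of the crack class index are simplifying assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def crack_segmentation_map(stage_tokens, text_features, grid_size, image_size):
    """Sketch of S = sum_i softmax(F_v_hat^i F_t^T), followed by up-sampling.

    stage_tokens  -- list of Q adapted patch-token tensors, each of shape (B, L, C)
    text_features -- F_t of shape (N, C), one row per prompt class (e.g. normal, crack)
    grid_size     -- (H_p, W_p) of the patch grid, with H_p * W_p == L
    image_size    -- (H_img, W_img) of the original input image
    """
    maps = []
    text = F.normalize(text_features, dim=-1)          # assumed cosine-style similarity
    for tokens in stage_tokens:
        tokens = F.normalize(tokens, dim=-1)
        logits = tokens @ text.t()                     # (B, L, N) similarity scores
        maps.append(logits.softmax(dim=-1))
    s = torch.stack(maps).sum(dim=0)                   # sum over the Q stages
    b, l, n = s.shape
    s = s.permute(0, 2, 1).reshape(b, n, *grid_size)   # back onto the spatial grid
    s = F.interpolate(s, size=image_size, mode="bilinear", align_corners=False)
    return s[:, 1]                                     # crack-class probability map (assumed index 1)
```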

3.4. Segmentation Loss

In the proposed weakly supervised crack segmentation framework, the parameters of both the image encoder and the text encoder based on the CLIP model are frozen, meaning that only the pre-trained model parameters are utilized. However, the additional linear layer is a learnable module that is trained to adapt to the crack image scenario. Therefore, we supervise the crack segmentation map prediction using a linear combination of focal loss [36], dice loss [37], and edge loss [38].
The focal loss $L_{focal}$ addresses class imbalance by reshaping the standard cross-entropy loss and is defined as:
$$L_{focal}(p_u) = -\alpha_u \left( 1 - p_u \right)^{\gamma} \log\left( p_u \right),$$
where $\gamma$ is the focusing parameter, $\alpha_u$ balances the importance of positive and negative examples, and $p_u$ is defined as:
$$p_u = \begin{cases} p & \text{if } y = 1, \\ 1 - p & \text{otherwise}, \end{cases}$$
where $p \in [0, 1]$ denotes the probability in the final crack segmentation map S and $y \in \{0, 1\}$ specifies the ground truth.
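A minimal PyTorch sketch of this focal loss is shown below; the default α and γ values are common choices from the original focal loss paper, not values reported for CrackCLIP.

```python
import torch

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-6):
    """Focal loss over a predicted crack probability map p and binary ground truth y."""
    p = p.clamp(eps, 1.0 - eps)
    p_u = torch.where(y == 1, p, 1.0 - p)                   # p_u as defined above
    alpha_u = torch.where(y == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha))  # class-balancing weight
    return (-alpha_u * (1.0 - p_u) ** gamma * torch.log(p_u)).mean()
```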
The dice loss function $L_{dice}$ creates a balance between the crack foreground and background classes by implicitly measuring the overlap between the predicted masks and the ground truth. $L_{dice}$ is defined as:
$$L_{dice} = 1 - \frac{2 \sum_{i}^{N} p_i y_i}{\sum_{i}^{N} p_i^2 + \sum_{i}^{N} y_i^2},$$
where $p$ and $y$ denote the crack prediction probability and the ground truth, respectively.
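A corresponding sketch of the dice loss (soft Dice over all pixels, with a small constant added to avoid division by zero):

```python
import torch

def dice_loss(p, y, eps=1e-6):
    """Soft dice loss between a probability map p and a binary ground truth y."""
    inter = (p * y).sum()
    denom = (p ** 2).sum() + (y ** 2).sum()
    return 1.0 - 2.0 * inter / (denom + eps)
```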
The edge loss $L_{edge}$ is designed to encourage the segmentation model to produce more accurate predictions in the boundary region. $L_{edge}$ is defined as:
$$L_{edge} = 1 - \frac{\sum_{i}^{N} p_i \cdot E(x_i)}{\sum_{i}^{N} E(x_i)},$$
where $p$ denotes the crack prediction probability, and $E(x_i)$ indicates whether the pixel $x_i$ belongs to the edge or not: $E(x_i) = 1$ means that $x_i$ is a crack edge pixel and $E(x_i) = 0$ means that $x_i$ is not a crack edge pixel.
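A matching sketch of the edge loss is given below; how the binary edge map E is obtained (e.g. a morphological gradient of the pseudo-label) is an assumption here, as the paper only defines E as the crack-edge indicator.

```python
import torch

def edge_loss(p, edge_mask, eps=1e-6):
    """Edge loss: mean predicted probability on crack-edge pixels, subtracted from 1."""
    return 1.0 - (p * edge_mask).sum() / (edge_mask.sum() + eps)
```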
Finally, the total combined loss $L_{total}$ is denoted as:
$$L_{total} = L_{focal} + L_{dice} + L_{edge}.$$
During the testing process, images are input into the CLIP image encoder of CrackCLIP to extract image features and perform similarity calculations with text features, thereby obtaining the results of crack segmentation. By integrating visual and linguistic information, the model can more accurately identify cracks.

4. Results

This section first details the dataset and experimental settings. We then compare our proposed method against other prevalent approaches and conduct ablation studies to evaluate its effectiveness.

4.1. Dataset

To demonstrate the performance of the proposed CrackCLIP method, we utilize three challenging and widely used crack image datasets: Crack500 [23], CFD [24], and DeepCrack [6]. In our experiments, the CrackCLIP model is trained using the training set from the Crack500 dataset. The test set comprises the Crack500 testing set, as well as the CFD and DeepCrack datasets, to evaluate the model’s generalization capabilities across diverse datasets.
The Crack500 dataset [23] was collected using a mobile phone at the main campus of Temple University and serves as a pavement cracking dataset. The dataset consists of 1896 crack images used as a training set and 1124 crack images used as a testing set. The resolution of each original image is either 648 × 484 pixels or 640 × 360 pixels. To generate crack pixel pseudo-labels for training the CrackCLIP model, a crack image patch dataset is constructed from the Crack500 training set. The Crack500 training dataset is sliced into image patches of 128 × 128 pixels. Additionally, this work employs rotation and flipping to augment the crack data. As a result, 556,448 images are used to train the crack patch classification network, comprising 238,820 crack images and 317,628 images without cracks. Finally, the original images in the Crack500 training set, along with the generated pseudo-labels of crack pixels, are used to train the crack segmentation network, CrackCLIP.
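As an illustration of this patch construction, the sketch below slices an image and its pixel mask into 128 × 128 patches and labels a patch as “crack” if the mask crop contains any crack pixels; the labelling rule and the use of the pixel mask to derive patch labels are assumptions for illustration, since the paper only states the patch size and the resulting patch counts.

```python
import numpy as np

def slice_into_patches(image, mask, patch=128):
    """Cut an image and its binary crack mask into non-overlapping patches
    and assign each patch a binary crack/no-crack label."""
    patches, labels = [], []
    h, w = image.shape[:2]
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            patches.append(image[i:i + patch, j:j + patch])
            labels.append(int(mask[i:i + patch, j:j + patch].any()))
    return np.stack(patches), np.array(labels)
```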
The CFD dataset [24] was captured using a smartphone, the iPhone 5, in Beijing, China. The images depict road conditions and are taken with a focal length of 4 mm, an aperture of f/2.4, and an exposure time of 1/134 s. The dataset consists of 118 crack images, each with a resolution of 320 × 480 pixels. The cracks in this dataset are fine, with widths ranging from 1 to 3 mm, and the background contains various types of noise, such as shadows, oil spots, and water stains.
The DeepCrack dataset [6] consists of 537 Red–Green–Blue (RGB) crack images, each with a resolution of 544 × 384 pixels or 384 × 544 pixels. The dataset includes crack images with multiple textures, scenes, and scales. In terms of scene distribution, 22% of the cracks belong to asphalt scenes, while the remaining 78% belong to concrete scenes. Regarding texture distribution, 40% of the crack images are classified as rough, 22.4% as stained, and the remaining 37.6% as smooth. Across the entire dataset, 3.54% of the pixels represent cracks, while the remaining 96.46% correspond to background pixels.

4.2. Evaluation Metrics

To quantitatively evaluate the detection performance of different models on the crack datasets, we follow existing crack segmentation methods [5] and measure the similarity between model predictions and ground truth using the $F_1$-score under the following evaluation strategies: Optimal Dataset Scale (ODS), Optimal Image Scale (OIS), and Average Precision (AP). ODS selects a single fixed threshold across the entire dataset to obtain the best $F_1$, whereas OIS selects, for each image, the threshold corresponding to that image's optimal $F_1$. OIS and ODS are defined as:
$$OIS = \frac{1}{N} \sum_{i=1}^{N} \max\left\{ F_1^{\tau, i} : \tau \in \{0.01, 0.02, \ldots, 0.99\} \right\},$$
$$ODS = \max\left\{ \frac{1}{N} \sum_{i=1}^{N} F_1^{\tau, i} : \tau \in \{0.01, 0.02, \ldots, 0.99\} \right\}.$$
Here, $\tau \in [0, 1]$ represents the selected threshold, N is the number of images, and $F_1^{\tau, i}$ denotes the F1-score of the i-th image at threshold $\tau$. The precision ($PR$), recall ($RE$), and $F_1$ are defined as:
$$PR^{\tau} = \frac{TP^{\tau}}{TP^{\tau} + FP^{\tau}},$$
$$RE^{\tau} = \frac{TP^{\tau}}{TP^{\tau} + FN^{\tau}},$$
$$F_1^{\tau} = \frac{2 \times RE^{\tau} \times PR^{\tau}}{RE^{\tau} + PR^{\tau}}.$$
Here, $TP$, $FP$, $TN$, and $FN$ denote true positives, false positives, true negatives, and false negatives, respectively. In addition, the average precision ($AP$) of the model at different recall rates is measured by calculating the area under the precision–recall curve, and $AP$ is denoted as:
$$AP = \sum_{\tau=0.01}^{\tau=1} \frac{1}{T} \left( RE^{\tau} - RE^{\tau - 0.01} \right) PR^{\tau},$$
where $\tau \in [0, 1]$ is the selected threshold, and T is set to 100.
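For concreteness, the following NumPy sketch (not taken from the paper's code) computes ODS and OIS from a set of probability maps and binary ground-truth masks; AP can be obtained analogously as the area under the precision–recall curve built from the same thresholds.

```python
import numpy as np

def f1_at_threshold(pred, gt, tau, eps=1e-12):
    """F1-score of one probability map `pred` against a binary mask `gt` at threshold tau."""
    b = pred >= tau
    tp = np.logical_and(b, gt == 1).sum()
    fp = np.logical_and(b, gt == 0).sum()
    fn = np.logical_and(~b, gt == 1).sum()
    pr = tp / (tp + fp + eps)
    re = tp / (tp + fn + eps)
    return 2 * pr * re / (pr + re + eps)

def ods_ois(preds, gts, taus=np.arange(0.01, 1.0, 0.01)):
    """ODS: best single threshold shared by the dataset; OIS: best threshold per image."""
    f1 = np.array([[f1_at_threshold(p, g, t) for t in taus]
                   for p, g in zip(preds, gts)])   # shape (num_images, num_thresholds)
    ois = f1.max(axis=1).mean()
    ods = f1.mean(axis=0).max()
    return ods, ois
```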

4.3. Comparison Methods

We compare the performance of CrackCLIP to existing weakly supervised crack segmentation methods. These methods are described as follows:
  • Grad-CAM [22]. This method employs Gradient-based Class Activation Mapping to generate pseudo-labels, which can be directly utilized for training crack segmentation models without any post-processing.
  • PWSC [16]. This method employs patch-based Grad-CAM combined with conditional random field (CRF) post-processing for the weakly supervised crack segmentation task.
  • GPLL [15]. This method generates crack pseudo-labels based on Grad-CAM using localization with a classifier and thresholding to implement the weakly supervised crack segmentation task.
  • CAC [19]. This method utilizes crack pseudo-labels with varying confidence levels to co-train a weakly supervised crack segmentation framework.
We utilized crack pixel-level pseudo-labels to train existing crack segmentation backbones and compared the performance of CrackCLIP with other network backbones in a weakly supervised crack segmentation task.
  • U-Net [39]. U-Net extends the encoder–decoder architecture by incorporating skip connections, which combine feature maps from the encoder with those from the decoder. This design retains more spatial information and enhances localization accuracy.
  • DeepCrack1 [6]. DeepCrack1 aggregates multi-scale and multi-level features using a fully convolutional neural network to predict crack pixels. A deep supervised network is employed to directly supervise the crack features at each convolutional stage, ensuring robust and accurate feature extraction.
  • DeepCrack2 [5]. DeepCrack2 fuses the convolutional features generated in the encoder and decoder networks based on the SegNet [40] network.
  • OED [41]. Based on a fully convolutional U-Net, OED exploits residual connectivity within the convolutional blocks and adds an attention-based gating mechanism between the encoder and decoder parts of the architecture.

4.4. Implementation Details

4.4.1. Environment

Our experiments were conducted on a deep learning workstation with Ubuntu 16.04 LTS, equipped with an Nvidia Titan XP GPU (Santa Clara, CA, USA). The framework used for the experiments is PyTorch 1.12 [42].

4.4.2. Experimental Setting

This section details the experimental setup for generating crack pixel-level pseudo-labels, as well as the setup for the training and testing phases of the CrackCLIP model. In the crack pixel-level pseudo-label generation process, a ResNet50 [43] network is used to train a binary classification model for crack image patches. The crack classification model is trained for 10 epochs with a batch size of 16. The initial learning rate is set to $1 \times 10^{-3}$, and the learning rate is reduced by a factor of 10 after each epoch. Stochastic Gradient Descent is used as the optimizer with a momentum value of 0.9. The Grad-CAM method [22] is employed to generate pseudo-labels for crack pixels. Subsequently, two post-processing methods are used to generate fine-grained pseudo-labels: threshold segmentation of the pseudo-labels following [15], and conditional random field refinement of the pseudo-labels following [16].
In the crack segmentation stage of CrackCLIP, our CLIP image encoder utilizes the ViT-L/14 model with an input image resolution of 518 × 518 pixels. The image encoder consists of a total of 24 layers, which are divided equally into 4 stages of 6 layers each. Four additional linear layers are added to extract the crack features. The model is trained using the Adam optimizer with a fixed learning rate of $1 \times 10^{-3}$ and requires only 3 epochs of training with a batch size of 8. Since an edge-based loss function is employed to preserve the boundaries of the cracks, the predicted cracks may be wider than the ground truth, with background pixels around the cracks incorrectly classified as cracks. To address this issue, we apply morphological erosion as post-processing to the experimental results. Specifically, we use a 3 × 3 kernel to perform three morphological erosion operations on the crack segmentation maps.
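As a sketch of this post-processing step (using OpenCV; the file names are placeholders and the input is assumed to be an already binarised prediction):

```python
import cv2
import numpy as np

# Load a binarised crack prediction (placeholder path) and thin it with
# three erosions using a 3x3 kernel, as described above.
seg = cv2.imread("crack_prediction.png", cv2.IMREAD_GRAYSCALE)
kernel = np.ones((3, 3), np.uint8)
seg_eroded = cv2.erode(seg, kernel, iterations=3)
cv2.imwrite("crack_prediction_eroded.png", seg_eroded)
```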

4.5. Comparison with State-of-the-Art Methods

Table 1 demonstrates the performance of CrackCLIP compared to other weakly supervised crack segmentation (WSCS) methods on the Crack500, CFD, and DeepCrack datasets. To evaluate the accuracy and generalization of CrackCLIP, its performance is compared with several existing weakly supervised crack segmentation methods, including Grad-CAM [22], PWSC [16], GPLL [15], and CAC [19]. The results show that CrackCLIP performs well on the Crack500 dataset, achieving an ODS of 61.31%, an OIS of 68.58%, and an AP of 59.33%. To further validate the generalization of our proposed model, CrackCLIP and the compared methods are each trained on the Crack500 training set and subsequently tested on the CFD and DeepCrack datasets. On the CFD dataset, CrackCLIP achieves an ODS of 40.80%, an OIS of 41.74%, and an AP of 31.81%. As illustrated in Figure 3, cracks in the CFD dataset are typically very thin and exhibit much lower contrast with the background compared to those in the training set (Crack500). This significant difference in crack appearance makes generalization considerably more challenging. CrackCLIP leverages a large vision-language model augmented with crack text descriptions. By improving model generalization through semantic mining of crack categories, this approach demonstrates clear advantages over all comparative methods. Specifically, compared to the previous best method, CAC, CrackCLIP shows substantial improvements: 17.49 percentage points in ODS, 10.19 in OIS, and 13.26 in AP. On the DeepCrack dataset, CrackCLIP achieves an ODS of 68.26%, an OIS of 73.29%, and an AP of 68.82%, outperforming most other methods in these tests. The DeepCrack dataset exhibits substantial similarity to Crack500, thereby diminishing the apparent generalization advantage of CrackCLIP. Additionally, this dataset includes a significant amount of noise that closely resembles crack features semantically, which heightens the risk of false detections. As a result, while CrackCLIP generally yields superior performance compared to most benchmark methods, it does not exceed CAC, which has undergone specialized iterative optimization to effectively address noisy data challenges.
Figure 3 shows a visual comparison of the qualitative results of different WSCS methods on the Crack500, CFD, and DeepCrack datasets. In Figure 3, rows (1)–(3) are from the Crack500 testing set, rows (4)–(6) are from the CFD dataset, and rows (7)–(9) are from the DeepCrack dataset. From rows (1)–(3) of Figure 3, it can be observed that CrackCLIP is more robust to background noise on the Crack500 dataset and can accurately detect cracks even when the contrast between the cracks and the background is poor, as in row (1) of Figure 3. In addition, the three rows (4), (5), and (6) in the middle of Figure 3 show the visualization comparison on the CFD dataset. The cracks in the CFD dataset are mostly thinner and the contrast between the cracks and the background is low. In contrast to other models, which suffer from false detection and missed detection, CrackCLIP is more responsive to crack pixels. Rows (7)–(9) in Figure 3 show the prediction results of different methods on the DeepCrack dataset, which has a complex image background with varying crack scales. CrackCLIP demonstrates stable performance compared to other models. This can be attributed to the pre-trained CLIP model in CrackCLIP, which effectively exploits the semantic features of cracks. In summary, CrackCLIP not only achieves optimal accuracy on the Crack500 testing set but also demonstrates good generalization to complex crack scenarios in other datasets.
Experimental results on three publicly available datasets demonstrate that the CrackCLIP model outperforms most existing methods. Specifically, on the Crack500 dataset, the CrackCLIP model, which leverages vision-language alignment, exhibits excellent performance in handling crack image scenes with significant background noise interference. By utilizing the pre-trained CLIP model, only the parameters of four linear layers need to be fine-tuned to achieve effective crack segmentation. On the CFD and DeepCrack datasets, the CrackCLIP model demonstrates robust performance in detecting cracks with varying scales, textures, and thicknesses, particularly excelling in identifying thin cracks. In summary, the weakly supervised crack detection task is significantly improved by CrackCLIP, enhancing both the accuracy of crack prediction and the generalization capabilities in diverse crack scenarios.

4.6. Ablation Studies

To further validate the effectiveness of the pseudo-label types, backbone networks, and crack-specific language prompts in CrackCLIP for weakly supervised crack segmentation, we conduct several analyses. We first examine the impact of different pseudo-label types on segmentation performance, with the results presented in Table 2. Next, we perform ablation studies on various backbone networks, with the results shown in Figure 4 and Table 3. Finally, we compare the experimental performance of general defect descriptions against specific crack descriptions, with the results illustrated in Figure 5. Through these analyses, we comprehensively demonstrate the contribution of each component to the model's performance.

4.6.1. Pseudo-Label Type

The generation of pseudo-labels is a fundamental component of weakly supervised learning, as it directly influences the quality of the final segmentation. To verify the effectiveness of pseudo-labels of different quality for CrackCLIP, we use two types of crack pixel-level pseudo-labels: CAM-CRF [16] and CAM-Location [15]. CAM-CRF [16] generates pixel-level pseudo-labels from class activation maps refined with a conditional random field (CRF), while CAM-Location [15] generates them by combining class activation maps, crack patch localization, and threshold segmentation. Table 2 shows the quantitative crack prediction results of the CrackCLIP model using the different types of pseudo-labels. As shown in Table 2, the fully supervised variant (FSV) demonstrates the best performance across all datasets and provides an upper bound for our weakly supervised methods, as it can utilize complete annotation information for training. However, with merely weak supervision, the performance of the proposed CrackCLIP approach approaches this upper bound on all three testing datasets. On the DeepCrack dataset, the CrackCLIP model using CAM-Location pseudo-labels even outperforms the fully supervised approach, which suggests reduced overfitting and improved generalization. CAM-CRF partially reduces background noise through the CRF but does not eliminate it; the remaining noise may cause the pseudo-labels to contain incorrect positive samples, increasing the risk of overfitting. In contrast, the pixel-level pseudo-labels generated by CAM-Location [15] eliminate background noise as much as possible.

4.6.2. Backbones

In order to evaluate the feature learning and generalization ability of the CLIP backbone, we compare the weakly supervised crack segmentation results of CrackCLIP with those of several mainstream crack segmentation frameworks, including U-Net, DeepCrack1, DeepCrack2, and OED. Table 3 presents the performance of CrackCLIP and the different backbone networks for image feature extraction on the Crack500, CFD, and DeepCrack datasets. The results indicate that CrackCLIP performs well on the Crack500 dataset compared to the other network frameworks, achieving an ODS of 63.00%, an OIS of 68.07%, and an AP of 60.54%. To further validate the effectiveness and generalization of CrackCLIP, we conducted additional experimental analyses using the CFD and DeepCrack datasets. On the CFD dataset, CrackCLIP achieved an ODS of 39.15%, an OIS of 39.82%, and an AP of 32.01%. On the DeepCrack dataset, the ODS was 58.77%, the OIS was 67.05%, and the AP was 55.66%. These results demonstrate that CrackCLIP outperforms the other backbone networks on both the CFD and DeepCrack datasets.
Figure 4 presents a visualization of the qualitative results for each crack segmentation framework. CrackCLIP demonstrates superior visual performance compared to the other models. In Figure 4, rows (1) and (2) are from the Crack500 test set, rows (3) and (4) are from the CFD dataset, and rows (5) and (6) are from the DeepCrack dataset. From Figure 4, it can be observed that all crack segmentation frameworks achieve better detection performance in scenes with low image background noise, wide cracks, and prominent structures. Additionally, it is evident in rows (2), (5), and (6) of Figure 4 that in cases of strong background noise, where the contrast between the crack and the background is reduced, the other frameworks are more susceptible to interference from the noise, whereas CrackCLIP exhibits better robustness against such disturbances. This is primarily attributed to the fact that CrackCLIP employs a vision-language alignment-based approach, leveraging the strong alignment capabilities of the pre-trained CLIP model. The image encoder of CLIP, which is based on the Transformer architecture, effectively captures long-range dependencies in crack images and understands the global contextual information of cracks, thereby reducing the likelihood of background noise being incorrectly detected. However, from the visualization and quantitative results of CrackCLIP in the last column of Figure 4, it is noted that the crack width predicted by CrackCLIP is broader, leading to an improvement in recall but a decrease in precision: background pixels surrounding the crack pixels are often incorrectly detected. In terms of the quantitative AP metric, the AP value of CrackCLIP is lower than that of several other methods. A possible reason is that during image encoding, the input image is divided into sequential image patches, and the features of these patches are aligned with text features rather than individual pixels. Consequently, the edges of the cracks may not be as sharp as desired.

4.6.3. Crack Text Prompts

We conducted an ablation study of crack text prompts on three datasets, Crack500, CFD, and DeepCrack, to evaluate the effectiveness of our method. Figure 5 shows the generalization of normal defect text prompts versus specific crack text prompts in the weakly supervised crack segmentation task. In this study, generic text is used for the normal defect prompts, such as “a photo of a {damaged {pavement}}”, while the specific prompts describe cracks based on their apparent characteristics, such as “a photo of a {{pavement} with narrow break and opening}”. The proposed CrackCLIP is trained on the Crack500 training set and tested on the Crack500 testing set as well as the CFD and DeepCrack datasets. From Figure 5, it is evident that the CrackCLIP model using crack text prompts does not perform as well as the model using normal text prompts on the Crack500 dataset. However, on the CFD and DeepCrack datasets, the model using crack text prompts performs better. The experimental results indicate that normal text prompts help reduce overfitting of CrackCLIP on the Crack500 dataset, while specific crack text prompts provide richer auxiliary supervisory signals for the crack text domain, thereby improving the generalization of the weakly supervised model on the CFD and DeepCrack datasets.

5. Conclusions and Future Work

We propose CrackCLIP, a weakly supervised crack segmentation framework based on vision-language models. To achieve the alignment of crack images with text, we design specific text prompts that capture the apparent features and topology of cracks, enabling the learning of generalized crack semantic features and enhancing the generalization of crack detection. Additional linear layers are added to the image encoder module of the CLIP model to adapt it to the crack scenario. The frozen pre-trained CLIP model provides a powerful feature representation for weakly supervised crack segmentation, allowing our approach to achieve better performance with reduced training costs. Furthermore, we evaluate the effectiveness of CrackCLIP on different crack datasets, demonstrating its robustness and versatility. By leveraging textual prompts to enhance the generalization of crack detection, we introduce a novel crack segmentation paradigm that offers innovative insights into this field. However, we also recognize limitations in our approach. Surface cracks are highly complex, and the limited nature of existing datasets may not adequately capture all variations across different scenarios. Additionally, weakly supervised information inherently introduces uncertainty. To address these challenges, we will collect more diverse training data from a wider range of scenarios and design more refined textual prompts. We will also explore finer-grained initial segmentation methods to mitigate the uncertainty associated with weak supervision. We believe that these improvements will enable CrackCLIP to better address real-world challenges and advance research in this direction.

Author Contributions

Conceptualization, F.L. and Q.L.; methodology, F.L. and W.W.; software, F.L.; validation, F.L.; formal analysis, F.L. and Q.L.; investigation, F.L.; resources, F.L. and Q.L.; data curation, F.L.; writing—original draft preparation, F.L.; writing—review and editing, Q.L., H.Y. and W.W.; visualization, F.L.; supervision, Q.L.; project administration, F.L.; funding acquisition, Q.L. and W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the Fundamental Research Funds for the Central Universities under Grants 2022JBMC055 and 2023JBZY037, in part by the Beijing Natural Science Foundation under Grant L231019, and in part by the Shanghai Industrial Development Project under Grant HCXBCY-2023-033.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. Implementation of the proposed framework is publicly available on GitHub at the following link: https://github.com/liangfengjiao/CrackCLIP (accessed on 15 December 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Guo, J.; Liu, P.; Xiao, B.; Deng, L.; Wang, Q. Surface defect detection of civil structures using images: Review from data perspective. Autom. Constr. 2024, 158, 105186. [Google Scholar] [CrossRef]
  2. Yu, X.; Kuan, T.W.; Tseng, S.P.; Chen, Y.; Chen, S.; Wang, J.F.; Gu, Y.; Chen, T. EnRDeA U-net deep learning of semantic segmentation on intricate noise roads. Entropy 2023, 25, 1085. [Google Scholar] [CrossRef] [PubMed]
  3. Zhao, Y.; Yan, J.; Wang, Y.; Jing, Q.; Liu, T. Porcelain insulator crack location and surface states pattern recognition based on hyperspectral technology. Entropy 2021, 23, 486. [Google Scholar] [CrossRef] [PubMed]
  4. Wang, Y.; He, Z.; Zeng, X.; Zeng, J.; Cen, Z.; Qiu, L.; Xu, X.; Zhuo, Q. GGMNet: Pavement-Crack Detection Based on Global Context Awareness and Multi-Scale Fusion. Remote Sens. 2024, 16, 1797. [Google Scholar] [CrossRef]
  5. Zou, Q.; Zhang, Z.; Li, Q.; Qi, X.; Wang, Q.; Wang, S. DeepCrack: Learning Hierarchical Convolutional Features for Crack Detection. IEEE Trans. Image Process. 2019, 28, 1498–1512. [Google Scholar] [CrossRef]
  6. Liu, Y.; Yao, J.; Lu, X.; Xie, R.; Li, L. DeepCrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 2019, 338, 139–153. [Google Scholar] [CrossRef]
  7. Zhang, H.; Chen, N.; Li, M.; Mao, S. The Crack Diffusion Model: An Innovative Diffusion-Based Method for Pavement Crack Detection. Remote Sens. 2024, 16, 986. [Google Scholar] [CrossRef]
  8. Yeum, C.M.; Dyke, S.J. Vision-based automated crack detection for bridge inspection. Comput.-Aided Civ. Infrastruct. Eng. 2015, 30, 759–770. [Google Scholar] [CrossRef]
  9. Bastani, F.; He, S.; Abbar, S.; Alizadeh, M.; Balakrishnan, H.; Chawla, S.; Madden, S.; DeWitt, D. Roadtracer: Automatic extraction of road networks from aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4720–4728. [Google Scholar]
  10. Zhang, J.; Wang, G.; Xie, H.; Zhang, S.; Huang, N.; Zhang, S.; Gu, L. Weakly supervised vessel segmentation in X-ray angiograms by self-paced learning from noisy labels with suggestive annotation. Neurocomputing 2020, 417, 114–127. [Google Scholar] [CrossRef]
  11. Sironi, A.; Türetken, E.; Lepetit, V.; Fua, P. Multiscale centerline detection. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 1327–1341. [Google Scholar] [CrossRef]
  12. Zhang, Z.; Xing, F.; Shi, X.; Yang, L. Semicontour: A semi-supervised learning approach for contour detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 251–259. [Google Scholar]
  13. Yuan, Q.; Shi, Y.; Li, M. A Review of Computer Vision-Based Crack Detection Methods in Civil Infrastructure: Progress and Challenges. Remote Sens. 2024, 16, 2910. [Google Scholar] [CrossRef]
  14. Wang, W.; Su, C. Automatic concrete crack segmentation model based on transformer. Autom. Constr. 2022, 139, 104275. [Google Scholar] [CrossRef]
  15. König, J.; Jenkins, M.D.; Mannion, M.; Barrie, P.; Morison, G. Weakly-Supervised Surface Crack Segmentation by Generating Pseudo-Labels Using Localization With a Classifier and Thresholding. IEEE Trans. Intell. Transp. Syst. 2022, 23, 24083–24094. [Google Scholar] [CrossRef]
  16. Dong, Z.; Wang, J.; Cui, B.; Wang, D.; Wang, X. Patch-based weakly supervised semantic segmentation network for crack detection. Constr. Build. Mater. 2020, 258, 120291. [Google Scholar] [CrossRef]
  17. Al-Huda, Z.; Peng, B.; Algburi, R.N.A.; Al-antari, M.A.; AL-Jarazi, R.; Zhai, D. A hybrid deep learning pavement crack semantic segmentation. Eng. Appl. Artif. Intell. 2023, 122, 106142. [Google Scholar] [CrossRef]
  18. Al-Huda, Z.; Peng, B.; Algburi, R.N.A.; Alfasly, S.; Li, T. Weakly supervised pavement crack semantic segmentation based on multi-scale object localization and incremental annotation refinement. Appl. Intell. 2023, 53, 14527–14546. [Google Scholar] [CrossRef]
  19. Liang, F.; Li, Q.; Li, X.; Liu, Y.; Wang, W. CAC: Confidence-Aware Co-Training for Weakly Supervised Crack Segmentation. Entropy 2024, 26, 328. [Google Scholar] [CrossRef]
  20. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning Research (PMLR), Virtual, 18–24 July 2021. [Google Scholar]
  21. Yong, G.; Jeon, K.; Gil, D.; Lee, G. Prompt engineering for zero-shot and few-shot defect detection and classification using a visual-language pretrained model. Comput.-Aided Civ. Infrastruct. Eng. 2023, 38, 1536–1554. [Google Scholar] [CrossRef]
  22. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  23. Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature Pyramid and Hierarchical Boosting Network for Pavement Crack Detection. IEEE Trans. Intell. Transp. Syst. 2020, 21, 1525–1535. [Google Scholar] [CrossRef]
  24. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic road crack detection using random structured forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445. [Google Scholar] [CrossRef]
  25. Mishra, A.; Gangisetti, G.; Eftekhar Azam, Y.; Khazanchi, D. Weakly supervised crack segmentation using crack attention networks on concrete structures. Struct. Health Monit. 2024, 23, 3748–3777. [Google Scholar] [CrossRef]
  26. Tao, H. Weakly-Supervised Pavement Surface Crack Segmentation Based on Dual Separation and Domain Generalization. IEEE Trans. Intell. Transp. Syst. 2024, 25, 19729–19743. [Google Scholar] [CrossRef]
  27. Inoue, Y.; Nagayoshi, H. Weakly-supervised Crack Detection. IEEE Trans. Intell. Transp. Syst. 2023, 24, 12050–12061. [Google Scholar] [CrossRef]
  28. Wang, Z.; Leng, Z.; Zhang, Z. A weakly-supervised transformer-based hybrid network with multi-attention for pavement crack detection. Constr. Build. Mater. 2024, 411, 134134. [Google Scholar] [CrossRef]
  29. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  30. Wang, H.; Li, Y.; Dang, L.M.; Lee, S.; Moon, H. Pixel-level tunnel crack segmentation using a weakly supervised annotation approach. Comput. Ind. 2021, 133, 103545. [Google Scholar] [CrossRef]
  31. Liu, F.; Liu, Y.; Kong, Y.; Xu, K.; Zhang, L.; Yin, B.; Hancke, G.; Lau, R. Referring Image Segmentation Using Text Supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023. [Google Scholar]
  32. Rao, Y.; Zhao, W.; Chen, G.; Tang, Y.; Zhu, Z.; Huang, G.; Zhou, J.; Lu, J. DenseCLIP: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  33. Jeong, J.; Zou, Y.; Kim, T.; Zhang, D.; Ravichandran, A.; Dabeer, O. Winclip: Zero-/few-shot anomaly classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  34. Chen, X.; Han, Y.; Zhang, J. A Zero-/Few-Shot Anomaly Classification and Segmentation Method for CVPR 2023 VAND Workshop Challenge Tracks 1&2: 1st Place on Zero-shot AD and 4th Place on Few-shot AD. arXiv 2023, arXiv:2305.17382. [Google Scholar]
  35. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 4 May 2021. [Google Scholar]
  36. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  37. Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016. [Google Scholar]
  38. Ma, X.; Wu, Q.; Zhao, X.; Zhang, X.; Pun, M.O.; Huang, B. SAM-Assisted Remote Sensing Imagery Semantic Segmentation With Object and Boundary Constraints. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5636916. [Google Scholar] [CrossRef]
  39. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  40. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  41. König, J.; Jenkins, M.D.; Barrie, P.; Mannion, M.; Morison, G. A convolutional neural network for pavement surface crack segmentation using residual connections and attention gating. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019. [Google Scholar]
  42. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  43. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
Figure 1. Overview of the proposed CrackCLIP framework. (a) Crack pixel-level pseudo-label generation. (b) CLIP-based crack segmentation. Phase (b) includes a crack image encoder and a crack text encoder. The crack text prompt features are aligned with the crack image patch token features to enable pixel-level crack prediction.
Figure 2. List of crack compositional text prompts. “o” denotes the object in the image.
Figure 3. Visualization of crack prediction results for different WSCS methods. Rows (1)–(3) are from the Crack500 testing set, rows (4)–(6) are from the CFD dataset, and rows (7)–(9) are from the DeepCrack dataset.
Figure 4. Visualization of crack prediction results for different crack segmentation backbones. Rows (1) and (2) are from the Crack500 test set, rows (3) and (4) are from the CFD dataset, and rows (5) and (6) are from the DeepCrack dataset.
Figure 5. ODS, OIS, and AP values for CrackCLIP with different text prompts on the Crack500, CFD, and DeepCrack. “Normal” refers to normal defect text prompts, and “Crack” denotes specific crack text prompts.
Table 1. Evaluation of the segmentation results in ODS, OIS, and AP of different WSCS methods on the Crack500 testing set, CFD, and DeepCrack datasets (%).

Methods         |      Crack500      |        CFD         |     DeepCrack
                | ODS   OIS   AP     | ODS   OIS   AP     | ODS   OIS   AP
Grad-CAM [22]   | 53.12 56.86 49.89  | 23.16 17.52 14.07  | 44.88 52.42 37.33
PWSC [16]       | 56.64 63.73 65.13  |  8.56 14.46  7.72  | 37.05 43.95 44.31
GPLL [15]       | 45.04 56.69 45.46  | 18.74 19.41 14.88  | 65.97 73.19 72.28
CAC [19]        | 60.43 64.60 63.65  | 23.31 31.55 18.55  | 71.01 77.98 75.51
CrackCLIP       | 61.31 68.58 59.33  | 40.80 41.74 31.81  | 68.26 73.29 68.82
Table 2. The CrackCLIP segmentation results in ODS, OIS, and AP with different pseudo-label types, evaluated on the Crack500 testing set, CFD, and DeepCrack datasets (%).

Methods    | Pseudo-Label Types |      Crack500      |        CFD         |     DeepCrack
           |                    | ODS   OIS   AP     | ODS   OIS   AP     | ODS   OIS   AP
CrackCLIP  | FSV                | 67.49 71.62 62.51  | 43.38 45.19 35.36  | 66.20 71.70 63.78
CrackCLIP  | CAM-CRF [16]       | 63.00 68.07 60.54  | 39.15 39.82 32.01  | 58.77 67.05 55.66
CrackCLIP  | CAM-Location [15]  | 61.31 68.58 59.33  | 40.80 41.74 31.81  | 68.26 73.29 68.82
Table 3. Evaluation of the segmentation results in ODS, OIS, and AP of different crack segmentation backbones on the Crack500 testing set, CFD, and DeepCrack datasets (%).

Methods          |      Crack500      |        CFD         |     DeepCrack
                 | ODS   OIS   AP     | ODS   OIS   AP     | ODS   OIS   AP
U-Net [39]       | 55.88 63.20 65.17  |  9.09 16.74  9.58  | 44.17 54.32 56.55
DeepCrack1 [6]   | 56.64 63.73 65.13  |  8.56 14.46  7.72  | 37.05 43.95 44.31
DeepCrack2 [5]   | 57.93 64.44 62.23  | 20.18 26.25 14.79  | 44.57 49.61 48.89
OED [41]         | 52.01 61.14 54.02  | 10.24 15.89  4.43  | 26.16 32.71  9.42
CrackCLIP        | 63.00 68.07 60.54  | 39.15 39.82 32.01  | 58.77 67.05 55.66

