Article

UCrack-DA: A Multi-Scale Unsupervised Domain Adaptation Method for Surface Crack Segmentation

by Fei Deng 1,†, Shaohui Yang 1,*,†, Bin Wang 1, Xiujun Dong 2 and Siyuan Tian 1

1 College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu 610059, China
2 State Key Laboratory of Geohazard Prevention and Geoenvironment Protection, Chengdu University of Technology, Chengdu 610059, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Remote Sens. 2025, 17(12), 2101; https://doi.org/10.3390/rs17122101
Submission received: 19 May 2025 / Revised: 15 June 2025 / Accepted: 17 June 2025 / Published: 19 June 2025
(This article belongs to the Section AI Remote Sensing)

Abstract

Surface cracks serve as early warning signals for potential geological hazards, and their precise segmentation is crucial for disaster risk assessment. Owing to differences in acquisition conditions and the diversity of crack morphology, scale, and surface texture, a significant domain shift exists between different crack datasets, necessitating transfer training. However, in real work areas, the sparse distribution of cracks yields only a limited number of samples, and the difficulty of crack annotation makes it highly inefficient to annotate a large proportion of samples for transfer training merely to predict the few that remain. Domain adaptation methods can achieve transfer training without relying on manual annotation, but traditional domain adaptation methods struggle to accommodate the characteristics of cracks. To address this issue, we propose an unsupervised domain adaptation method for crack segmentation. By employing a hierarchical adversarial mechanism and a prediction entropy minimization constraint, we extract domain-invariant features in a multi-scale feature space and sharpen decision boundaries. Additionally, by integrating a Mix-Transformer encoder, a multi-scale dilated attention module, and a mixed convolutional attention decoder, we effectively address the challenges of cross-domain data distribution differences and complex-scene crack segmentation. Experimental results show that UCrack-DA achieves superior performance compared to existing methods on both the Roboflow-Crack and UAV-Crack datasets, with significant improvements in metrics such as mIoU, mPA, and Accuracy. In UAV images captured in field scenarios, the model demonstrates excellent segmentation Accuracy for multi-scale and multi-morphology cracks, validating its practical application value in geological hazard monitoring.

1. Introduction

In geological disaster prevention and emergency response, ground surface cracks serve as early warning signs of potential hazards [1], revealing hidden safety risks in rock masses, soil bodies, and artificial structures. Accurate monitoring of crack distributions under complex surface structures is of paramount importance for safety prevention and geological disaster early warning. In recent years, the rapid advancement of unmanned aerial vehicle (UAV) technology has facilitated the widespread use of high-resolution UAV imagery in geological monitoring [2]. These images offer notable advantages, including wide spatial coverage, high temporal resolution, and efficient detection of crack distributions in complex surface environments.
Crack detection and segmentation, as fundamental computer vision tasks, have long been of interest in fields such as engineering safety and geological disaster prevention. Early methods primarily relied on traditional image processing techniques, such as Canny edge detection [3] and the Hough transform [4], which were effective for regular cracks but tended to fail under complex backgrounds, varying lighting conditions, and noise interference. With the development of deep learning technologies, particularly the widespread application of CNNs [5], crack segmentation entered a data-driven end-to-end learning phase. Yang et al. [6] applied FCN [7] to crack detection tasks, achieving efficient pixel-level crack segmentation through end-to-end training. DeepCrack [8], based on VGG-16 [9], integrated batch normalization layers and side network prediction fusion techniques, significantly improving the Accuracy of FCN [7] in crack detection across various scenarios. CrackUNet [10], leveraging UNet [11]’s encoder–decoder structure with symmetric skip connections, enhanced the detection of small-scale cracks by increasing the number of encoder layers, but this also demanded a larger number of training samples. DenseCrack [12], built on DenseNet [13], utilized the concept of dense connections to propose an improved deep architecture, demonstrating superior performance compared to other crack segmentation methods. With the introduction of the Transformer [14] architecture, attention-based models have also been applied to crack detection. Liu et al. [15] proposed Crackformer, a Transformer-based model specifically optimized for fine-grained crack detection. Taha et al. [16] employed ConvNeXt [17] as the encoder and adopted the UPerNet [18] architecture as the decoder to learn both local and global semantic features of road cracks, significantly improving segmentation Accuracy. SCSegamba [19] utilizes the state space model (SSM) [20] to extract crack morphology and background texture information, enhancing the semantic continuity among crack pixels. SAM [21] is a general-purpose foundation model for image segmentation, whose core innovation is a prompt-based mechanism that enables zero-shot transfer, making it applicable to crack semantic segmentation tasks. Although such large models demonstrate outstanding performance, their training typically requires massive amounts of data.
Despite the promising performance of existing crack segmentation algorithms on specific datasets, their generalization capability remains limited when the sample size is small. Due to the characteristics of geological structures and the limitations of imaging conditions, the distribution of crack regions in work areas is typically highly imbalanced [22]. In large-scale field scenes, cracks exhibit diverse morphologies and are locally densely distributed, which makes manual annotation challenging and leaves an extremely limited number of obtainable crack sample images. Supervised learning paradigms require a substantial amount of labeled data to establish semantic segmentation models: a large share of samples must be annotated to train a model that then only predicts the small remainder, which significantly diminishes the marginal benefit of manual annotation.
In this context, unsupervised domain adaptation methods demonstrate unique application value. Currently, research on crack semantic segmentation largely relies on datasets collected from single surface types or ideal environments, such as pavements [16], concrete [23], and tunnels [24], which fail to fully reflect the complexity of various surface structures in real-world scenarios, including rocks, soil, and highways. As illustrated in Figure 1, domain shift exists among different crack datasets, manifested in the diversity of factors such as background noise, surface types, and crack quantity and morphology, as well as crack scales. Additionally, there are significant differences in shooting angles, lighting conditions, and surface details across different datasets during collection. These factors lead to a severe domain shift problem between crack semantic segmentation datasets, resulting in models trained on existing crack datasets performing poorly when directly applied to new datasets, failing to achieve the expected Accuracy.
In recent years, domain adaptation (DA) has emerged as an effective transfer learning technique and has been widely applied in fields such as object detection [26], medical image analysis [27], and remote sensing [28], aiming to alleviate the degradation of model performance caused by domain shift. Zhu et al. [29] used Generative Adversarial Networks (GANs) [30] to transform source domain images into synthetic images with the style of the target domain. Wang et al. [31] employed a category-adaptive threshold mechanism to generate pseudo-labels on the target domain for domain adaptation. When dealing with imbalanced sample categories, self-training methods can easily suffer from the “long-tail effect” [32], causing the network’s performance on low-frequency categories to decline continuously. Hoffman et al. [33] were the first to propose the application of adversarial training to unsupervised domain adaptation for semantic segmentation. Srivastava et al. [34] achieved unsupervised domain adaptation for crack segmentation through incremental training with adversarial learning, without significantly reducing the Accuracy of the source domain. When applying adversarial training to unsupervised domain adaptation for semantic segmentation, the high-level features generated by the encoder are typically input into the discriminator, allowing the discriminator and encoder to engage in adversarial training to extract domain-invariant features across different datasets.
To extract domain-invariant features at multiple levels and enhance the Accuracy and robustness of crack semantic segmentation, the method designed in this paper combines a hierarchical adversarial mechanism and prediction entropy minimization constraints and utilizes an advanced semantic segmentation network to process multi-scale features. The contributions of this paper are summarized in the following three points:
  • We design a hierarchical adversarial mechanism to extract domain-invariant features at different scales, combined with a prediction entropy minimization mechanism to sharpen decision boundaries, which improves the model’s domain adaptation capability for crack segmentation tasks in various scenes.
  • To enhance the model’s ability to capture features of cracks at different scales and morphologies, we design a U-shaped crack semantic segmentation network. It extracts multi-scale receptive-field features by stacking dilated convolutions with multiple dilation rates and optimizes the reconstruction of multi-morphological crack structures in the upsampling stage by combining mixed convolutional kernels.
  • We construct a UAV ground surface crack dataset containing a variety of complex factors from real-world scenes, which verifies the applicability of the proposed method on UAV images and provides important data support for research on ground surface crack segmentation.

2. Related Work

2.1. Crack Semantic Segmentation

In recent years, crack segmentation techniques [6,8,10,12,15] based on deep learning have developed rapidly, significantly improving the Accuracy of crack recognition in specific scenarios. The overall workflow is shown in Figure 2. First, a pretrained crack semantic segmentation network is obtained by learning crack features from public datasets. Then, it is necessary to construct training samples for specific work areas, which involves manually annotating cracks in images to create a dataset for network transfer training. The training set is input into the initial network model to generate a preliminary segmentation result. Next, by comparing the model’s output with the true labels, the loss function is calculated, and the model parameters are optimized based on the loss values. This process continues to iterate until the model converges and achieves stable segmentation performance. Finally, the model with the best performance is selected for crack segmentation prediction.
Existing crack semantic segmentation methods rely heavily on supervised learning. When dealing with complex cracks accompanied by various noise interferences, they often require large-scale, high-quality annotated data for training support. However, within a single work area, regions containing cracks are limited, and the labeling process is cumbersome. A single crack image dataset may only contain a few hundred samples [35], making the data volume insufficient to support the transfer training of large models. In this context, the introduction of unsupervised domain adaptation methods is particularly crucial, significantly reducing the human resources needed for crack detection tasks.

2.2. Unsupervised Domain Adaptation

Domain adaptation aims to solve the problem of feature distribution differences between source and target domains, which has significant application value in semantic segmentation tasks. The current mainstream domain adaptation methods can be roughly divided into two categories: self-training-based methods [36,37] and adversarial learning-based methods [34,38,39], as illustrated in Figure 3.
Self-training methods generate pseudo-labels [40] by selecting high-confidence results from the target domain, which are then used as supervisory signals to fine-tune the model, thereby optimizing its performance on the target domain. The core advantage of such methods is their simplicity and efficiency, without the need for additional adversarial training modules. However, self-training methods face the challenge of the long-tail effect [32] in semantic segmentation tasks: the quality of pseudo-labels for small sample categories in the target domain is low, causing the model to fit high-frequency categories during training while neglecting the feature learning of tail categories. Adversarial learning methods reduce the impact of domain shift by introducing a domain discriminator, which encourages the model to learn domain-invariant features. Through adversarial training between the feature extractor and the discriminator, the feature distributions of the source and target domains are induced to converge. However, adversarial learning methods are prone to training instability issues, as the dynamic interaction between the generator and the discriminator can lead to gradient oscillation, a problem that is particularly pronounced when dealing with high-dimensional semantic segmentation tasks.

3. Method

3.1. Overview

To effectively extract domain-invariant features in both source and target domains and enhance the precision of crack semantic segmentation, we propose UCrack-DA, whose overall structure is shown in Figure 4. To strengthen the model’s ability to model the local details and overall shapes of cracks, we designed a U-shaped semantic segmentation network with a Mix-Transformer [41] backbone, specifically optimized for the diversity of crack scales and morphologies. In the U-shaped network encoder, features produced at each downsampling step are concatenated with features of the same size during upsampling, meaning that the multi-scale features generated by the encoder all influence the model’s output. Traditional adversarial training methods usually input only the high-level features produced by the final stage of the encoder into the discriminator, generating domain-invariant features through adversarial training. To enable the encoder to produce multi-scale domain-invariant features, we designed a discriminator tailored to the Mix-Transformer [41] encoder, inputting multi-scale features into the discriminator for adversarial training to generate domain-invariant features at these scales. Unlike existing methods that only constrain the encoder to extract domain-invariant features, we train the decoder to learn target domain features and sharpen the decision boundary in the target domain by using the prediction entropy of the target domain images as a loss. Simultaneously, to enhance the stability of the domain adaptation method and reduce the network’s forgetting of crack features, we conduct supervised incremental learning on the source domain data.
The network training strategy proposed in this paper consists of two stages: the first stage involves the U-shaped semantic segmentation network fully learning from the source domain data; the second stage achieves unsupervised domain adaptation through adversarial learning between the encoder and the domain discriminator, along with entropy minimization constraints.
The semantic segmentation network employs an encoder–decoder architecture to achieve end-to-end crack segmentation. In the encoding phase, the Overlap Patch Embedding layer constructs an image embedding representation using convolution operations with overlapping regions. This is followed by hierarchical feature extraction through four cascaded Mix-Transformer [41] Blocks, progressively establishing feature representations from local details to global semantics, ultimately outputting four feature maps of different scales. The deepest feature map is processed by the Multi-Scale Dilated Attention Module (MSDAM) for multi-receptive-field information enhancement, effectively fusing context information across different scales before being input into the decoder. The decoding phase uses a mixed convolutional decoder structure, which includes four Mixed Convolutional Attention Modules (MCAMs) and five Patch Expanding [42] operations for upsampling. Through skip connection mechanisms, the decoder deeply fuses features from corresponding levels of the encoder, effectively compensating for the multi-scale feature information that may be lost during downsampling. MCAM precisely reconstructs the geometric morphology of cracks by fusing features produced by heterogeneous convolution kernels through attention mechanisms, while Patch Expanding [42] gradually restores the resolution of the feature maps through channel expansion and feature rearrangement strategies, ultimately achieving high-precision pixel-level crack semantic segmentation.
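As a structural illustration of this dataflow, the following PyTorch sketch wires a four-stage encoder, an MSDAM bottleneck, and skip-connected MCAM decoder blocks. It is a minimal sketch under stated assumptions, not the released implementation: the encoder, MSDAM, and MCAM modules are injected as placeholders, only three skip levels are shown, and the five Patch Expanding steps are collapsed into bilinear upsampling for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UCrackNetSketch(nn.Module):
    """Illustrative U-shaped wiring: 4-stage encoder -> MSDAM -> skip-connected decoder."""

    def __init__(self, encoder, msdam, mcams, head_channels=64):
        super().__init__()
        self.encoder = encoder             # assumed to return [f1, f2, f3, f4], f4 deepest
        self.msdam = msdam                 # multi-scale dilated attention on f4
        self.mcams = nn.ModuleList(mcams)  # one mixed-conv attention block per decoder level
        self.head = nn.Conv2d(head_channels, 1, kernel_size=1)  # binary crack logits

    def forward(self, x):
        f1, f2, f3, f4 = self.encoder(x)   # multi-scale encoder features
        d = self.msdam(f4)                 # enhance the deepest feature map
        for skip, mcam in zip((f3, f2, f1), self.mcams):
            # Bilinear upsampling stands in for Patch Expanding in this sketch.
            d = F.interpolate(d, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            d = mcam(torch.cat([d, skip], dim=1))  # fuse skip features, reconstruct cracks
        logits = self.head(d)
        return F.interpolate(logits, size=x.shape[-2:], mode="bilinear", align_corners=False)
```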

3.2. Proposed Unsupervised Domain Adaptation Method

The overall workflow of the domain adaptation method is shown in Figure 5. The encoder and discriminator extract multi-scale domain-invariant features through hierarchical adversarial training. Meanwhile, to enable the decoder to learn target domain features and sharpen decision boundaries, a prediction entropy minimization constraint is imposed on the output results of the target domain. To enhance the stability of the training process, supervised training on the source domain data continues.

3.2.1. Hierarchical Adversarial Training

During adversarial training, we use multi-scale features as input for the discriminator, enabling it to effectively distinguish between the feature distributions of the source and target domains. At the same time, the encoder learns to generate feature representations that can confuse the discriminator through adversarial training. Specifically, this training mechanism adopts an alternating optimization strategy: first, the discriminator is trained with fixed encoder parameters to accurately determine the origin of the features; subsequently, the encoder is optimized with fixed discriminator parameters to generate features that can deceive the discriminator. Through this adversarial training paradigm, the encoder and discriminator continuously optimize in a dynamic game, ultimately achieving effective alignment of inter-domain features.
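The alternating scheme can be summarized in code. The sketch below is a schematic PyTorch training step under assumptions: the encoder returns a list of multi-scale features, the discriminator maps such a list to a domain logit, `bce` is `nn.BCEWithLogitsLoss()`, and the supervised segmentation term of the encoder objective (Equation (3)) is omitted for brevity.

```python
import torch

def adversarial_step(encoder, discriminator, opt_disc, opt_enc,
                     x_s, x_t, bce, lambda_adv=0.001):
    """One alternating round: train the discriminator, then train the encoder to fool it."""
    # Phase 1: update the discriminator with the encoder frozen (features detached).
    feats_s = [f.detach() for f in encoder(x_s)]
    feats_t = [f.detach() for f in encoder(x_t)]
    pred_s, pred_t = discriminator(feats_s), discriminator(feats_t)
    loss_d = bce(pred_s, torch.zeros_like(pred_s)) + bce(pred_t, torch.ones_like(pred_t))
    opt_disc.zero_grad()
    loss_d.backward()
    opt_disc.step()

    # Phase 2: update the encoder with the discriminator frozen; target features are
    # pushed toward the source label (0) so that they confuse the discriminator.
    pred_t = discriminator(encoder(x_t))
    loss_adv = lambda_adv * bce(pred_t, torch.zeros_like(pred_t))
    opt_enc.zero_grad()
    loss_adv.backward()
    opt_enc.step()
    return loss_d.item(), loss_adv.item()
```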
Source domain images $x_s \in X_s$ and target domain images $x_t \in X_t$ are input into the Mix-Transformer [41] encoder to extract features at four different scales. The source domain features are denoted as $F_s$, and the target domain features are denoted as $F_t$. The four sets of features are input into the discriminator, where they are first mapped to a shared embedding space [43] using four different MLP layers. Because the feature dimensions differ across levels, bilinear interpolation [44] is used to unify the resolution. After the feature dimensions are unified, channel concatenation and convolution operations are employed to fuse and compress the spatial dimensions. Finally, the multi-dimensional tensors are converted into one-dimensional tensors, and an MLP is used for domain classification, outputting the domain category corresponding to the features.
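A possible realization of this discriminator is sketched below. The embedding width, channel counts, and the stride-2 convolutions used for spatial compression are assumptions, while the overall flow (per-scale MLP projection implemented as 1 × 1 convolutions, bilinear resolution unification, channel concatenation, convolutional compression, and a final MLP classifier) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDiscriminator(nn.Module):
    """Maps four encoder feature maps to one domain logit (0 = source, 1 = target)."""

    def __init__(self, in_channels=(64, 128, 320, 512), embed_dim=128):
        super().__init__()
        # Per-scale "MLP" projections into a shared embedding space (1x1 convolutions).
        self.proj = nn.ModuleList([nn.Conv2d(c, embed_dim, 1) for c in in_channels])
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * embed_dim, embed_dim, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(embed_dim, embed_dim, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.AdaptiveAvgPool2d(1),  # compress the spatial dimensions
        )
        self.classifier = nn.Sequential(nn.Flatten(), nn.Linear(embed_dim, 1))

    def forward(self, feats):
        size = feats[0].shape[-2:]  # unify all levels to the highest resolution
        embedded = [F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
                    for p, f in zip(self.proj, feats)]
        fused = self.fuse(torch.cat(embedded, dim=1))
        return self.classifier(fused)  # raw logit; pair with BCEWithLogitsLoss
```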
When training the discriminator, the label for source domain features is set to 0, and the label for target domain features is set to 1. The training objective function for the discriminator is as follows:
$$\min_{\theta_{DISC}} \; \frac{1}{|X_s|} \sum_{x_s} \mathcal{L}_D(F_s, 0) + \frac{1}{|X_t|} \sum_{x_t} \mathcal{L}_D(F_t, 1) \tag{1}$$
where $\theta_{DISC}$ denotes the network parameters of the discriminator. We use BCE Loss [45] as the loss function $\mathcal{L}_D$ for domain classification, defined as follows:
$$\mathcal{L}_D = - d_i \log(\hat{d}_i) - (1 - d_i) \log(1 - \hat{d}_i) \tag{2}$$
where $d_i$ denotes the true domain label of the sample and $\hat{d}_i$ denotes the output of the discriminator.
During the training of the encoder, to make the data distribution of the target domain features approximate that of the source domain features, only the target domain features are input into the discriminator, and the loss is calculated between the output domain label and the true domain label 0 of the source domain. Simultaneously, supervised learning is performed on the source domain data. The training objective function for the encoder is as follows:
$$\min_{\theta_{ENC}} \; \frac{\lambda_{adv}}{|X_t|} \sum_{x_t} \mathcal{L}_D(F_t, 0) + \frac{1}{|X_s|} \sum_{x_s} \mathcal{L}_{seg}(y_s, \hat{y}_s) \tag{3}$$
where $\theta_{ENC}$ denotes the network parameters of the encoder, $\lambda_{adv}$ is the weight factor, $y_s$ is the pixel-level label of the source domain image, and $\hat{y}_s$ is the network’s prediction. We use Weighted BCE Loss [46] and Dice Loss [47] as the loss functions for training on source domain data, defined as follows:
$$\begin{aligned} \mathcal{L}_{Dice} &= 1 - \frac{2 \sum_{i=1}^{N} y_i \hat{y}_i + \varepsilon}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i + \varepsilon} \\ \mathcal{L}_{WBCE} &= -\frac{1}{N} \sum_{i=1}^{N} \left[ w_1 y_i \log(\hat{y}_i) + w_2 (1 - y_i) \log(1 - \hat{y}_i) \right] \\ \mathcal{L}_{seg} &= \mathcal{L}_{Dice} + \mathcal{L}_{WBCE} \end{aligned} \tag{4}$$
where $N$ denotes the total number of pixels, $y_i$ represents the ground-truth label value of pixel $i$, $\hat{y}_i$ indicates the predicted value output by the network, $w_1$ and $w_2$ are the weights for crack pixels and background pixels, respectively, and $\varepsilon = 1 \times 10^{-30}$ is a numerical stability coefficient. During training, we alternately optimize objectives (1) and (3) to train the network to extract domain-invariant features from both the source and target domains.
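For concreteness, the combined segmentation loss of Equation (4) can be written directly in PyTorch. This is a plain reading of the formulas rather than the authors’ exact code, with `pred` assumed to be a sigmoid probability map and the class weights taken from the settings later reported in Section 4.2.

```python
import torch

def dice_loss(pred, target, eps=1e-30):
    """Dice loss of Eq. (4); pred and target are float tensors of the same shape."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def weighted_bce_loss(pred, target, w_crack=1.5, w_bg=1.0, eps=1e-30):
    """Weighted BCE of Eq. (4); upweights the sparse crack (positive) pixels."""
    pos = w_crack * target * torch.log(pred + eps)
    neg = w_bg * (1.0 - target) * torch.log(1.0 - pred + eps)
    return -(pos + neg).mean()

def seg_loss(pred, target):
    """L_seg = L_Dice + L_WBCE."""
    return dice_loss(pred, target) + weighted_bce_loss(pred, target)
```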

3.2.2. Prediction Entropy Minimization

By optimizing the decoder parameters through the entropy loss on target domain data, the model is forced to produce high-confidence predictions for target domain samples, thereby sharpening decision boundaries, reducing class confusion, and improving the network’s performance on the target domain. Meanwhile, to enhance the robustness of training, supervised learning with source domain data is incorporated. The objective function is defined as follows:
$$\min_{\theta_{DEC}} \; \frac{\lambda_{ent}}{|X_t|} \sum_{x_t} \mathcal{L}_{ent}(\hat{y}_t) + \frac{1}{|X_s|} \sum_{x_s} \mathcal{L}_{seg}(y_s, \hat{y}_s) \tag{5}$$
where $\theta_{DEC}$ denotes the network parameters of the decoder, $\lambda_{ent}$ denotes the weighting factor, and $\hat{y}_t$ refers to the output of the decoder for target domain features. Considering the pixel-level prediction nature of semantic segmentation tasks, a normalized entropy loss function $\mathcal{L}_{ent}$ is employed, which is defined as
$$\mathcal{L}_{ent} = -\frac{1}{HW \log_2 C} \sum_{h,w=1}^{H,W} \sum_{c=1}^{C} p_{i,c}^{(h,w)} \log_2\!\left(p_{i,c}^{(h,w)} + \varepsilon\right) \tag{6}$$
where $p_{i,c}^{(h,w)} = \mathrm{softmax}\big(z_i^{(h,w)}\big)_c$ denotes the predicted probability that the $i$-th sample at location $(h, w)$ belongs to class $c$, $z$ represents the output value of the decoder, $\varepsilon = 1 \times 10^{-30}$ is a small constant used to prevent numerical overflow, $C$ indicates the total number of classes, and $\log_2 C$ is the normalization factor that scales the loss value to the $[0, 1]$ range and eliminates the influence of the number of classes on the loss.
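Equation (6) reduces to a few lines of PyTorch. The sketch below assumes the decoder outputs per-class logits of shape (B, C, H, W); for the binary crack case (C = 2) the normalization factor log2(C) is simply 1.

```python
import math
import torch
import torch.nn.functional as F

def normalized_entropy_loss(logits, eps=1e-30):
    """Normalized prediction entropy of Eq. (6); logits has shape (B, C, H, W)."""
    probs = F.softmax(logits, dim=1)
    num_classes = logits.shape[1]
    # Per-pixel entropy in bits, then averaged over batch and spatial dimensions.
    ent = -(probs * torch.log2(probs + eps)).sum(dim=1)  # (B, H, W)
    return ent.mean() / math.log2(num_classes)           # scales the loss into [0, 1]
```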

3.3. Multi-Scale Dilated Attention Module

To address the challenges commonly encountered in complex crack semantic segmentation tasks, such as scale diversity and background noise, the final features extracted by the encoder are input into a Multi-Scale Dilated Attention Module (MSDAM). This module utilizes convolutions with different dilation rates to extract multi-scale features, while the attention mechanism is used to enhance crack features and suppress background noise. Its structure is shown in Figure 6.
The module constructs a progressive feature extraction path using a four-level cascaded dilated convolution, with dilation rates increasing exponentially (d = 1, 2, 4, 8). The structural design of this module is inspired by Atrous Spatial Pyramid Pooling (ASPP) [48], whose improved variants [49,50,51] have been widely adopted for multi-scale feature extraction. Hu et al. [52] applied the ASPP module to remote sensing image segmentation tasks with the same resolution as used in this study and demonstrated that an exponentially increasing dilation rate enhances the effectiveness of multi-scale feature extraction. Therefore, we adopted this design in our work. Under this structure, shallow convolutional kernels focus on capturing local texture features around crack edges, while deeper convolutional kernels leverage progressively dilated receptive fields to capture long-range continuity features of cracks. In the feature fusion stage, multi-scale feature maps are aggregated through channel concatenation, followed by the introduction of the CBAM [53] dual attention mechanism. The channel attention [54] weights automatically enhance the response strength of crack-related channels across scales, while the spatial attention [55] suppresses background noise and highlights the spatial distribution of crack pixels.
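The following PyTorch sketch captures this structure under stated assumptions: each cascaded stage is a 3 × 3 dilated convolution with BatchNorm and ReLU, the four intermediate outputs are concatenated and reduced by a 1 × 1 convolution, and a CBAM-style block (injected here as a placeholder) applies the channel and spatial attention.

```python
import torch
import torch.nn as nn

class MSDAMSketch(nn.Module):
    """Cascaded dilated convs (d = 1, 2, 4, 8) -> channel concat -> CBAM attention."""

    def __init__(self, channels, cbam):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for d in (1, 2, 4, 8)
        ])
        self.reduce = nn.Conv2d(4 * channels, channels, 1)  # fuse the four scales
        self.cbam = cbam  # channel + spatial attention, e.g., a standard CBAM block

    def forward(self, x):
        outs, feat = [], x
        for stage in self.stages:  # progressive path: each level feeds the next
            feat = stage(feat)
            outs.append(feat)
        fused = self.reduce(torch.cat(outs, dim=1))
        return self.cbam(fused)
```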

3.4. Mixed Convolutional Attention Module

In crack semantic segmentation tasks, the highly diverse shapes, scales, and orientations of cracks in natural scenes pose significant challenges for feature reconstruction. Cracks typically exhibit extreme geometric properties, with widths often spanning only a few pixels, while their lengths may extend across different regions of the image, forming complex spatial distribution patterns. To address this, we use the Mixed Convolutional Attention Module (MCAM) as the core component of the decoder. Its structure is shown in Figure 7. MCAM is a decoder module that integrates various types of convolutional kernels, with the core idea of extracting features using multi-scale, multi-morphology convolutional kernels and combining them with the CBAM [53] dual attention mechanism to handle the complex characteristics of cracks, such as diverse shapes, slender edges, and random orientations.
Structurally, the module consists of four parallel convolution branches, each using different combinations of convolution kernels: 1 × 1, 3 × 3, 3 × 7/7 × 3, and 5 × 11/11 × 5. Fixed-size convolution kernels struggle to simultaneously capture the diversity of cracks. Multi-morphology convolutional kernels effectively address the directionality and length diversity of cracks. Specifically, the 1 × 1 convolution is used for channel feature fusion, the 3 × 3 convolution extracts basic texture details and short-range crack orientations, while the 3 × 7 and 7 × 3 asymmetric kernels, with their differentiated receptive fields in the horizontal and vertical directions, specialize in capturing the orientation features of elongated cracks. The 5 × 11 and 11 × 5 large-scale kernels target more complex large cracks or branching structures, enhancing context integration by expanding the receptive field. After the outputs of the four branches are concatenated with the original input features, they are processed by the CBAM [53] attention module to enhance the features, dynamically increasing the response weights of crack-related channels and spatial positions while suppressing background noise.
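A minimal sketch of such a block is given below, assuming that each asymmetric pair (e.g., 3 × 7 followed by 7 × 3) is applied sequentially within its branch and that a CBAM-style module is injected for the attention step; whether the pairs run sequentially or as parallel sub-branches in the released model is not specified here, so this is one plausible reading.

```python
import torch
import torch.nn as nn

class MCAMSketch(nn.Module):
    """Four parallel mixed-kernel branches, concatenated with the input, then CBAM."""

    def __init__(self, channels, cbam):
        super().__init__()
        def branch(*kernel_sizes):
            layers = []
            for k in kernel_sizes:
                pad = (k[0] // 2, k[1] // 2)  # "same" padding for odd kernel sizes
                layers += [nn.Conv2d(channels, channels, k, padding=pad),
                           nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)
        self.b1 = branch((1, 1))            # channel feature fusion
        self.b2 = branch((3, 3))            # local texture, short-range orientation
        self.b3 = branch((3, 7), (7, 3))    # asymmetric kernels for elongated cracks
        self.b4 = branch((5, 11), (11, 5))  # large kernels for wide/branching cracks
        self.fuse = nn.Conv2d(5 * channels, channels, 1)  # 4 branches + identity
        self.cbam = cbam

    def forward(self, x):
        cat = torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x), x], dim=1)
        return self.cbam(self.fuse(cat))
```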

4. Experimental Results and Analysis

4.1. Datasets

To evaluate the performance of the proposed method in cross-domain crack semantic segmentation tasks, we designed and created a new target domain dataset, UAV-Crack, and conducted experiments using an existing source domain dataset (CrackSeg9k) [35] and another target domain dataset (Roboflow-Crack) [56]. All data samples were uniformly resized to a resolution of 512 × 512. Figure 8 shows some representative sample images from the three datasets.
UAV-Crack is a custom-built dataset constructed in this study, collected from real-world field scenes captured by UAVs. It covers various realistic scenarios, including rock fractures, soil shrinkage cracks, and road alligator cracking. The dataset exhibits significant multi-scale object distribution, complex background interference, and diverse imaging conditions. It shows a noticeable domain shift from the source domain data, making it highly challenging and suitable for research and evaluation in cross-domain crack segmentation tasks.
The source-domain dataset, CrackSeg9k [35], integrates and refines crack data from multiple previous studies. The collected images cover a wide range of lighting conditions, viewing angles, and resolutions. The dataset includes various crack patterns—such as linear, branched, and reticular cracks—across more than ten types of substrate surfaces, including concrete and asphalt. Another target domain dataset, Roboflow-Crack [56], is sourced from the Roboflow platform and includes 886 crack images covering various surface materials. Its data distribution is relatively close to the source domain, making it suitable for testing the model’s adaptability under low domain shift conditions. The names of the datasets along with the number of training and validation/test samples are listed in Table 1.

4.2. Implementation Details

The training process consists of two stages. The first stage aims to allow the semantic segmentation model to fully learn from the source domain data. During this process, to enhance the model’s generalization ability, data augmentation techniques such as random vertical flipping, random horizontal flipping, random rotation, and random color space transformation are applied to the training data. When learning from the source domain data, to address the class imbalance between cracks and background, the background pixel weight in the Weighted BCE Loss [46] is set to 1, while the crack pixel weight is set to 1.5. During training, the batch size is set to 8, the number of epochs is set to 100, and the initial learning rate ($lr_{\mathrm{init}}$) is set to $5 \times 10^{-5}$. The training process adopts a two-stage learning rate scheduling strategy: during the first five training epochs, the learning rate rises exponentially to the initial learning rate as a warmup. In the subsequent training stages, a cosine annealing scheduling strategy [57] is applied, where the learning rate follows a cosine function with period $T$, decaying to a predefined minimum value ($lr_{\min}$), thus balancing the model’s convergence speed and Accuracy. The schedule is defined as follows:
$$lr = \begin{cases} lr_{\min} \times \left( \dfrac{lr_{\mathrm{init}}}{lr_{\min}} \right)^{\frac{t+1}{5}}, & 0 \le t < 5 \\[1.5ex] lr_{\min} + \dfrac{lr_{\mathrm{init}} - lr_{\min}}{2} \left( 1 + \cos\left( \pi \times \dfrac{(t - 5) \,\%\, T}{T} \right) \right), & t \ge 5 \end{cases} \tag{7}$$
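As a sanity check, Equation (7) can be implemented as a small scheduling function; `period` stands in for the cosine period T, whose exact value is not stated in the text, so the defaults below are illustrative assumptions.

```python
import math

def scheduled_lr(t, lr_init=5e-5, lr_min=1e-8, warmup=5, period=95):
    """Learning rate of Eq. (7): exponential warmup for `warmup` epochs,
    then cosine annealing with period `period` (epoch index t is 0-based)."""
    if t < warmup:
        return lr_min * (lr_init / lr_min) ** ((t + 1) / warmup)
    phase = ((t - warmup) % period) / period
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * phase))
```

In practice, such a function can be applied once per epoch, for example by assigning its value to each `param_group["lr"]` of the optimizer.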
In the second stage, while performing cross-domain feature alignment, incremental training on the source domain data is conducted to reduce model forgetting in the crack segmentation task and enhance the stability of the training process. During the adversarial training phase, the semantic segmentation network and the discriminator are optimized using different optimizers. The initial learning rates are set to $5 \times 10^{-5}$ and $2 \times 10^{-5}$, respectively, and cosine annealing is used to decay them to a lower bound of $1 \times 10^{-8}$. The weight factors $\lambda_{adv}$ and $\lambda_{ent}$ in objective functions (3) and (5) are set to 0.001 and 0.0002, respectively.
All experiments in this study were conducted on an NVIDIA GeForce RTX 4090D, using PyTorch 2.1.0 to build, train, and test the model. Adam [58] was used as the optimization algorithm, with first-order and second-order momentum coefficients set to 0.9 and 0.999, respectively, and a weight decay coefficient of $1 \times 10^{-5}$.

4.3. Evaluation Metrics

Given the characteristics of multi-form, multi-scale crack semantic segmentation tasks in noisy environments, an evaluation system is constructed using three metrics: mean Intersection over Union (mIoU), mean Pixel Accuracy (mPA), and Accuracy. mIoU quantifies the spatial overlap between the predicted region and the true cracks, focusing on the model’s ability to recognize crack shape details. mPA evaluates pixel-level classification Accuracy from a class-balanced perspective, effectively revealing the model’s stability in distinguishing sparse crack pixels. Accuracy, as a basic metric, reflects the global pixel classification correctness. This multi-metric coupling strategy balances segmentation precision, class balance, and global consistency, making it particularly suitable for complex scenarios with strong lighting interference, shadow occlusion, and small crack distributions, providing a basis for evaluating the model’s reliability in real-world environments. Their computational expressions are as follows:
$$mIoU = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap \hat{Y}_i|}{|Y_i \cup \hat{Y}_i|}, \qquad mPA = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap \hat{Y}_i|}{|Y_i|}, \qquad Accuracy = \frac{\sum_{i=1}^{N} |Y_i \cap \hat{Y}_i|}{\sum_{i=1}^{N} |Y_i|} \tag{8}$$
where $N$ denotes the number of classes, $Y_i$ denotes the set of pixels that truly belong to class $i$, and $\hat{Y}_i$ denotes the set of pixels predicted by the model as class $i$.
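These three metrics can be computed from a pair of integer label maps in a few lines; the NumPy sketch below is an illustrative implementation of Equation (8), with the convention (an assumption, since edge cases are not discussed in the text) that a class absent from both prediction and ground truth scores 1.0.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes=2):
    """mIoU, mPA, and Accuracy of Eq. (8) from integer label maps of equal shape."""
    ious, pas, correct = [], [], 0
    for c in range(num_classes):
        p, g = pred == c, gt == c
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union else 1.0)  # absent class counts as perfect
        pas.append(inter / g.sum() if g.sum() else 1.0)
        correct += inter
    return float(np.mean(ious)), float(np.mean(pas)), correct / gt.size
```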

4.4. Ablation Study

We validate the effectiveness of the designed MSAdv (Multi-Scale Adversarial), MinEnt (Minimize Entropy), MSDAM, and MCAM components through ablation experiments on UAV-Crack. The results of the ablation experiments are presented in Table 2, and the quantitative metrics demonstrate that the various modules of UCrack-DA can effectively improve segmentation performance.
Removing MSAdv leads to a significant decrease in mIoU by 2.06% (65.33% vs. 63.27%), indicating that the multi-scale feature alignment strategy effectively enhances the feature consistency of the model in cross-domain scenarios. The primary role of MinEnt is to generate clearer decision boundaries in the decoder; its removal results in a 0.53% decrease in mIoU, suggesting it has a certain optimization effect on the segmentation results of the target domain data. The removal of the MSDAM and MCAM modules causes mIoU decreases of 0.37% and 1.17%, respectively, highlighting their importance in capturing multi-scale and multi-morphological cracks. Moreover, when multiple modules are removed simultaneously, the model performance degrades further compared to removing individual modules, which highlights the synergistic effect of the proposed unsupervised domain adaptation strategy and feature processing modules in enhancing segmentation performance. Notably, removing certain modules results in a slight drop in mIoU while yielding higher Accuracy, indicating that the model tends to make conservative predictions by assigning most pixels to the background class, thereby increasing overall Accuracy.
The complete model (UCrack-DA) achieves a high Accuracy of 96.31% while also attaining 71.76% mPA and 65.33% mIoU, indicating that each module collectively optimizes the model’s segmentation performance in domain adaptation scenarios through different mechanisms: MSAdv addresses inter-domain discrepancies, MinEnt sharpens the decision boundaries, further enhancing segmentation performance in the target domain, while MSDAM and MCAM enhance the model’s ability to represent multi-scale and multi-morphological cracks from a feature enhancement perspective and jointly suppress background noise through attention mechanisms.

4.5. Comparison with Other Methods

To validate the model’s ability to extract domain-invariant features across different levels of domain shift, we conducted experiments using CrackSeg9k [35] as the source domain data and Roboflow-Crack [56] and UAV-Crack as the target domain data. The compared domain adaptation methods fall into two categories by training paradigm: (a) self-training methods, represented by DACS [36] and DAFormer [37], which achieve target domain knowledge transfer through iterative generation of pseudo-labels; (b) adversarial training methods, such as AdaptSegNet [38], ADVENT [39], and CrackUDA [34], which utilize discriminative networks to align cross-domain feature distributions. All methods are evaluated under a unified experimental setting, including the use of identical data augmentation strategies, input resolution, batch size, optimizer, and learning rate scheduling scheme, with an MiT [41] encoder pretrained on ImageNet [59]. To comprehensively evaluate the domain adaptation performance of the proposed method, we additionally established two important benchmarks for comparison and visualized the segmentation results under these benchmarks: (1) Source Only: a standard model trained using only source domain data, representing the baseline performance of the model without domain adaptation. This result intuitively reflects the degree of distribution discrepancy between the source and target domains; (2) Oracle: an ideal model trained with full supervision on the target domain training set, representing the upper limit of the model’s performance on the target domain. This benchmark is used to measure the gap between the domain adaptation methods and the theoretical optimal solution.

4.5.1. CrackSeg9k→Roboflow-Crack

Table 3 presents the metric scores of all methods on Roboflow-Crack [56]. Here, “Source Only” indicates the results of directly applying the model to the target domain without domain adaptation, while “Oracle” represents the results of training the model with supervised learning on the target domain. The results show that self-training models generally exhibit insufficient performance, primarily due to the “long-tail effect,” [32] which often leads to a continuous decline in performance for low-frequency categories in the target domain, especially when there is a severe imbalance between the number of crack and background categories in the samples. Our UCrack-DA outperforms all methods in comprehensive metrics, achieving an Accuracy of 97.92%, mPA of 90.90%, and mIoU of 81.34%. This represents a significant improvement compared to DAFormer [37], which uses the same feature extraction network and is based on a self-training approach. CrackUDA [34] also employs adversarial training to extract domain-invariant features, but it lacks task-specific optimization for crack segmentation and only extracts high-level domain-invariant features, resulting in limited performance.
In Figure 9, we present the segmentation results of the top three methods ranked by mIoU, selected from representative samples in the Roboflow-Crack [56] dataset. The visual comparison results are consistent with the quantitative metrics. By examining the first row, we can observe surface textures in the image that closely resemble cracks in appearance, making them highly misleading. ADVENT [39] misclassifies these textures as cracks due to its inability to suppress background noise. In contrast, our network is barely affected by such noise, as the proposed method leverages attention mechanisms to suppress background interference and enhance crack-related features. In the second row, the image features weak crack characteristics, which greatly tests the network’s feature extraction capability. By obtaining global features through the encoder and aggregating contextual information using MCAM and by minimizing prediction entropy in the target domain to sharpen decision boundaries, our network successfully identifies most of the cracks. The third and fourth rows represent large-scale and small-scale cracks, respectively, posing significant challenges to the network’s multi-scale feature processing ability. Other methods perform poorly on samples containing small-scale cracks due to their failure to extract multi-scale domain-invariant features. Our network combines multi-scale feature fusion and attention mechanisms to capture superior contextual information from the samples and enables the encoder to extract domain-invariant features at multiple scales during the domain adaptation stage, thereby comprehensively extracting small-scale cracks and optimizing the edges of large-scale cracks. In the fifth row, the image contains numerous cracks of varying sizes and shapes, which severely tests the network’s comprehensive capabilities. Through comparison, it is evident that our model maintains high Accuracy. Overall, compared to other domain adaptation schemes, UCrack-DA achieves better connectivity and Accuracy in segmenting cracks from images.

4.5.2. CrackSeg9k→UAV-Crack

Table 4 presents the metric scores of all networks on the UAV-Crack dataset. This dataset is quite challenging, yet our approach still achieved the best overall results, with an Accuracy of 96.31%, mPA of 71.76%, and mIoU of 65.33%. CrackUDA [34] attains the highest Accuracy, but its core metric, mIoU, is very low. The Accuracy metric is more sensitive to the majority class (background), indicating that under this method, areas with weak crack features are often misjudged as background.
In Figure 10, we present the segmentation results of the top three methods ranked by the mIoU metric, selected from representative samples in the UAV-Crack test set. From the figure, it is evident that the difficulty of the samples in this dataset has significantly increased, as they contain a large amount of noise, various types of terrain, and cracks of multiple scales and morphologies. The network’s ability to resist interference and extract information about crack width is particularly crucial for improving performance metrics. The first two rows represent cases of mixed surfaces consisting of roads and soil. UCrack-DA achieves better segmentation results by enhancing crack-related semantic features and suppressing background noise through attention mechanisms. CrackUDA [34], with limited handling of crack features, results in poor continuity of the extracted cracks. ADVENT [39], lacking specific treatment for background noise, misclassifies some background textures as cracks. Rows three to five represent various soil surfaces with different degrees of vegetation coverage and relatively small-scale cracks. Other methods fail to detect some of the finer cracks due to the lack of multi-scale domain-invariant feature extraction. In contrast, UCrack-DA successfully identifies most cracks by extracting multi-scale domain-invariant features and enhancing the decision boundaries. Row six depicts a rocky surface with cracks of varying scales; our method not only detects the cracks but also preserves finer local details more effectively. The results demonstrate that, faced with these situations, UCrack-DA can extract multi-scale domain-invariant features from both the source and target domains to handle cracks of various forms and scales, and it suppresses noise through attention mechanisms.

4.5.3. Comprehensive Analysis

Through the validation of cross-dataset comparative experiments, the UCrack-DA method proposed in this study demonstrates significant advantages in tests on two target domain datasets. It is worth noting that although the mPA and mIoU metrics vary considerably across different models in our experiments, the overall Accuracy remains consistently high. In crack semantic segmentation tasks, cracks typically occupy only a small portion of the image, with background pixels far outnumbering crack pixels. As a result, overall pixel Accuracy tends to be dominated by the majority class (background), often leading to inflated Accuracy scores. In our dataset, crack pixels account for only 3.376% of the total pixels. Due to this extreme class imbalance, the performance on crack segmentation has a limited effect on the overall Accuracy, making it less sensitive to the model’s true capability in detecting cracks. Therefore, we primarily use mIoU as the core evaluation metric, while Accuracy and mPA serve as supplementary indicators to validate model performance. To demonstrate the domain adaptation capability of UCrack-DA, we randomly sample 180 images from each of the three datasets and extract features using the Mix-Transformer [41] encoder. These high-dimensional features are then reduced to two dimensions using t-SNE [60], resulting in the feature distribution shown in Figure 11a. Subsequently, CrackSeg9k [35] is designated as the source domain, while Roboflow-Crack [56] and UAV-Crack are used as target domains. Our method is applied to mitigate domain shift. The same samples are then encoded by the same encoder, and their features are reduced to two dimensions using the same t-SNE [60] method, producing the distribution shown in Figure 11b. It can be observed that the feature points from the three datasets are well mixed, indicating that the domain discrepancy has been significantly reduced. The MMD (Maximum Mean Discrepancy) [61] between the feature distributions of CrackSeg9k [35] and Roboflow-Crack [56] is reduced from 0.0335 to 0.0201, while the MMD [61] with UAV-Crack decreases from 0.0404 to 0.0241.
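For readers who wish to reproduce this kind of analysis, a common way to estimate the MMD between two sets of encoder features is the RBF-kernel estimator sketched below; the kernel choice and the median-heuristic bandwidth are assumptions, since the text does not specify how the reported MMD values were computed.

```python
import torch

def mmd_rbf(x, y, sigma=None):
    """Biased RBF-kernel MMD^2 between feature sets x (n, d) and y (m, d)."""
    z = torch.cat([x, y], dim=0)
    d2 = torch.cdist(z, z).pow(2)  # pairwise squared Euclidean distances
    if sigma is None:
        sigma = d2[d2 > 0].median().sqrt()  # median heuristic bandwidth
    k = torch.exp(-d2 / (2 * sigma ** 2))
    n = x.shape[0]
    kxx, kyy, kxy = k[:n, :n], k[n:, n:], k[:n, n:]
    return (kxx.mean() + kyy.mean() - 2 * kxy.mean()).item()
```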
The experimental data indicates that the performance metrics of the method proposed in this paper surpass those of existing methods, and it maintains stable generalization capabilities across different data distributions. As visualized in Figure 9 and Figure 10, the method effectively suppresses complex background interference while accurately extracting multi-scale crack features and crack morphologies. The output results not only preserve the spatial coherence of the crack topology but also retain a significant amount of crack width information. This is primarily attributed to the hierarchical adversarial mechanism, which enables the encoder to generate multi-scale domain-invariant features. Meanwhile, the decoder sharpens decision boundaries and more effectively completes feature aggregation and upsampling.

4.6. Efficiency Analysis

To compare the model size and efficiency of different methods, we evaluate the number of parameters and computational complexity for each approach under a unified setting, using images with a resolution of 512 × 512 as input. For inference performance, one image is fed at a time, and each method is tested 10 times on the same hardware environment (NVIDIA RTX 3090 GPU), with the average inference time and standard deviation recorded. All results are summarized in Table 5. Overall, our method achieves the best performance in crack semantic segmentation, with only slightly higher parameter count and computational cost.

5. Discussion

5.1. Limitations

The Transformer-based backbone network adopted in this study possesses strong modeling capabilities and effectively mitigates overfitting on the source domain. Its core mechanism is self-attention, whose computational complexity grows quadratically with the number of input tokens. The network contains approximately 97.64 million parameters and incurs a forward inference cost of 73.36 GFLOPs. While it maintains high modeling capacity, it also comes with significant computational overhead, making it suitable for scenarios that demand high Accuracy and have ample computing resources. In remote sensing semantic segmentation tasks, the input is typically high-resolution imagery, which dramatically increases the computational load and poses challenges for real-time performance. As a result, this model is not well-suited for deployment on edge devices such as drones.

5.2. Future Works

With targeted processing of multi-scale features and enhanced capability for detecting sparse objects, the innovative architecture of UCrack-DA demonstrates strong potential for extension to other remote sensing tasks, such as road and vehicle recognition. Future research can further investigate the application of unsupervised domain adaptation in remote sensing image analysis, particularly its adaptability across diverse scenarios like cross-region, cross-temporal, and cross-sensor conditions. Furthermore, incorporating more lightweight model architectures may reduce computational overhead, facilitating deployment on resource-limited edge devices and enhancing the model’s practicality, responsiveness, and stability in real-world settings.

6. Conclusions

This paper proposes a multi-scale unsupervised domain adaptation method for crack semantic segmentation, UCrack-DA, aimed at addressing the domain shift issue in surface crack segmentation in images and achieving high-quality segmentation results in the target domain. By designing a hierarchical adversarial mechanism and prediction entropy minimization constraints and integrating key modules such as the Mix-Transformer [41] encoder, MSDAM, and MCAM, UCrack-DA has achieved significant results in cross-domain feature alignment and crack segmentation in complex scenes. Experimental results demonstrate that the method performs exceptionally well on two target domain datasets, particularly in complex background and multi-scale crack segmentation, where its mIoU, mPA, and Accuracy significantly outperform existing methods.

Author Contributions

Conceptualization, F.D.; Data curation, S.Y. and X.D.; Formal analysis, S.Y.; Investigation, S.Y., B.W., X.D., and S.T.; Methodology, S.Y. and B.W.; Supervision, F.D.; Validation, S.Y.; Visualization, S.Y.; Writing—original draft, S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Major Science and Technology Project of China State Railway Group Co., Ltd. (Grant No. K2023G032).

Data Availability Statement

The UAV-Crack dataset used in this study is publicly available at https://github.com/ohouyang/UAV-Crack.git (accessed on 18 June 2025), providing a foundation for further research; proper citation is required when using this dataset.

Acknowledgments

The successful completion of this study was made possible through the collective research contributions of all participating authors. We would also like to sincerely thank the editors and reviewers for their valuable suggestions.

Conflicts of Interest

The authors declare that this study received funding from China State Railway Group. The funder had the following involvement with the study: Data curation; Investigation.

Abbreviations

UAV: unmanned aerial vehicle
CNN: Convolutional Neural Network
DA: domain adaptation
UDA: unsupervised domain adaptation
ENC: encoder
DEC: decoder
DISC: discriminator
MSDAM: Multi-Scale Dilated Attention Module
MCAM: Mixed Convolutional Attention Module
MLP: Multi-layer Perceptron
WBCE: Weighted Binary Cross-Entropy
CBAM: Convolutional Block Attention Module
t-SNE: t-distributed Stochastic Neighbor Embedding
LR: learning rate
mIoU: mean Intersection over Union
mPA: mean Pixel Accuracy
MMD: Maximum Mean Discrepancy
M: million
GFLOPs: Giga Floating Point Operations

References

  1. Lian, X.; Li, Z.; Yuan, H.; Liu, J.; Zhang, Y.; Liu, X.; Wu, Y. Rapid identification of landslide, collapse and crack based on low-altitude remote sensing image of UAV. J. Mt. Sci. 2020, 17, 2915–2928. [Google Scholar] [CrossRef]
  2. Colica, E.; D’Amico, S.; Iannucci, R.; Martino, S.; Gauci, A.; Galone, L.; Galea, P.; Paciello, A. Using unmanned aerial vehicle photogrammetry for digital geological surveys: Case study of Selmun promontory, northern of Malta. Environ. Earth Sci. 2021, 80, 551. [Google Scholar] [CrossRef]
  3. Canny, J. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 6, 679–698. [Google Scholar] [CrossRef]
  4. Ballard, D.H. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognit. 1981, 13, 111–122. [Google Scholar] [CrossRef]
  5. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  6. Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1525–1535. [Google Scholar] [CrossRef]
  7. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar] [CrossRef]
  8. Liu, Y.; Yao, J.; Lu, X.; Xie, R.; Li, L. DeepCrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 2019, 338, 139–153. [Google Scholar] [CrossRef]
  9. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
  10. Huyan, J.; Li, W.; Tighe, S.; Xu, Z.; Zhai, J. CrackU-net: A novel deep convolutional neural network for pixelwise pavement crack detection. Struct. Control Health Monit. 2020, 27, e2551. [Google Scholar] [CrossRef]
  11. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  12. Mei, Q.; Gül, M. Multi-level feature fusion in densely connected deep-learning architecture and depth-first search for crack segmentation on images collected with smartphones. Struct. Health Monit. 2020, 19, 1726–1744. [Google Scholar] [CrossRef]
  13. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar] [CrossRef]
  14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar] [CrossRef]
  15. Liu, H.; Miao, X.; Mertz, C.; Xu, C.; Kong, H. Crackformer: Transformer network for fine-grained crack detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3783–3792. [Google Scholar] [CrossRef]
  16. Taha, H.; El-Habrouk, H.; Bekheet, W.; El-Naghi, S.; Torki, M. Pixel-level pavement crack segmentation using UAV remote sensing images based on the ConvNeXt-UPerNet. Alex. Eng. J. 2025, 124, 147–169. [Google Scholar] [CrossRef]
  17. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar] [CrossRef]
  18. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 418–434. [Google Scholar] [CrossRef]
  19. Liu, H.; Jia, C.; Shi, F.; Cheng, X.; Chen, S. SCSegamba: Lightweight structure-aware vision mamba for crack segmentation in structures. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 29406–29416. [Google Scholar] [CrossRef]
  20. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  21. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 4015–4026. [Google Scholar] [CrossRef]
  22. Huang, H.; Wu, Z.; Shen, H. A three-stage detection algorithm for automatic crack-width identification of fine concrete cracks. J. Civ. Struct. Health Monit. 2024, 14, 1373–1382. [Google Scholar] [CrossRef]
  23. Joshi, D.; Singh, T.P.; Sharma, G. Automatic surface crack detection using segmentation-based deep-learning approach. Eng. Fract. Mech. 2022, 268, 108467. [Google Scholar] [CrossRef]
  24. Wang, H.; Li, Y.; Dang, L.M.; Lee, S.; Moon, H. Pixel-level tunnel crack segmentation using a weakly supervised annotation approach. Comput. Ind. 2021, 133, 103545. [Google Scholar] [CrossRef]
  25. Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; Vaughan, J.W. A theory of learning from different domains. Mach. Learn. 2010, 79, 151–175. [Google Scholar] [CrossRef]
  26. Yu, F.; Wang, D.; Chen, Y.; Karianakis, N.; Shen, T.; Yu, P.; Lymberopoulos, D.; Lu, S.; Shi, W.; Chen, X. Unsupervised domain adaptation for object detection via cross-domain semi-supervised learning. arXiv 2019, arXiv:1911.07158. [Google Scholar] [CrossRef]
  27. Mahapatra, D.; Korevaar, S.; Bozorgtabar, B.; Tennakoon, R. Unsupervised domain adaptation using feature disentanglement and GCNs for medical image classification. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 735–748. [Google Scholar] [CrossRef]
  28. Wang, B.; Deng, F.; Wang, S.; Luo, W.; Zhang, Z.; Jiang, P. SiamSeg: Self-Training with Contrastive Learning for Unsupervised Domain Adaptation Semantic Segmentation in Remote Sensing. arXiv 2024, arXiv:2410.13471. [Google Scholar] [CrossRef]
  29. Zhu, X.; Zhou, H.; Yang, C.; Shi, J.; Lin, D. Penalizing top performers: Conservative loss for semantic segmentation adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 568–583. [Google Scholar] [CrossRef]
  30. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar] [CrossRef]
  31. Wang, Z.; Luo, Y.; Huang, D.; Ge, N.; Lu, J. Category-adaptive domain adaptation for semantic segmentation. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 3773–3777. [Google Scholar] [CrossRef]
32. Chen, Z.; Xiao, R.; Li, C.; Ye, G.; Sun, H.; Deng, H. ESAM: Discriminative domain adaptation with non-displayed items to improve long-tail performance. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 25–30 July 2020; pp. 579–588. [Google Scholar] [CrossRef]
33. Hoffman, J.; Wang, D.; Yu, F.; Darrell, T. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv 2016, arXiv:1612.02649. [Google Scholar] [CrossRef]
  34. Srivastava, K.; Kancharla, D.D.; Tahereen, R.; Ramancharla, P.K.; Sarvadevabhatla, R.K.; Kandath, H. CrackUDA: Incremental Unsupervised Domain Adaptation for Improved Crack Segmentation in Civil Structures. In Proceedings of the International Conference on Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2025; pp. 74–89. [Google Scholar] [CrossRef]
  35. Kulkarni, S.; Singh, S.; Balakrishnan, D.; Sharma, S.; Devunuri, S.; Korlapati, S.C.R. CrackSeg9k: A collection and benchmark for crack segmentation datasets and frameworks. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 179–195. [Google Scholar] [CrossRef]
36. Tranheden, W.; Olsson, V.; Pinto, J.; Svensson, L. DACS: Domain adaptation via cross-domain mixed sampling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual Conference, 5–9 January 2021; pp. 1379–1389. [Google Scholar] [CrossRef]
37. Hoyer, L.; Dai, D.; Van Gool, L. DAFormer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9924–9935. [Google Scholar] [CrossRef]
  38. Tsai, Y.H.; Hung, W.C.; Schulter, S.; Sohn, K.; Yang, M.H.; Chandraker, M. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7472–7481. [Google Scholar] [CrossRef]
39. Vu, T.H.; Jain, H.; Bucher, M.; Cord, M.; Pérez, P. ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2517–2526. [Google Scholar] [CrossRef]
  40. Wang, Y.; Wang, H.; Shen, Y.; Fei, J.; Li, W.; Jin, G.; Wu, L.; Zhao, R.; Le, X. Semi-supervised semantic segmentation using unreliable pseudo-labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4248–4257. [Google Scholar] [CrossRef]
  41. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar] [CrossRef]
42. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 205–218. [Google Scholar] [CrossRef]
  43. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
  44. Jähne, B. Digital Image Processing; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2005. [Google Scholar]
  45. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations; Rumelhart, D.E., McClelland, J.L., Eds.; MIT Press: Cambridge, MA, USA, 1986; pp. 318–362. ISBN 0-262-68053-X. [Google Scholar]
  46. King, G.; Zeng, L. Logistic regression in rare events data. Political Anal. 2001, 9, 137–163. [Google Scholar] [CrossRef]
  47. Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar] [CrossRef]
  48. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  49. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar] [CrossRef]
50. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar] [CrossRef]
  51. Zhu, Y.; Wan, L.; Xu, W.; Wang, S. ASPP-DF-PVNet: Atrous spatial pyramid pooling and distance-filtered PVNet for occlusion resistant 6D object pose estimation. Signal Process. Image Commun. 2021, 95, 116268. [Google Scholar] [CrossRef]
  52. Hu, L.; Zhou, X.; Ruan, J.; Li, S. ASPP+-LANet: A Multi-Scale Context Extraction Network for Semantic Segmentation of High-Resolution Remote Sensing Images. Remote Sens. 2024, 16, 1036. [Google Scholar] [CrossRef]
53. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar] [CrossRef]
  54. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  55. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164. [Google Scholar] [CrossRef]
  56. Thesis. Crack Segmentation Dataset. 2024. Available online: https://universe.roboflow.com/thesis-bvx2g/crack-segmentation-mtjxf (accessed on 29 April 2025).
57. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar] [CrossRef]
  58. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar] [CrossRef]
59. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
  60. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
61. Long, M.; Cao, Y.; Wang, J.; Jordan, M. Learning transferable features with deep adaptation networks. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; PMLR: New York, NY, USA, 2015; pp. 97–105. [Google Scholar] [CrossRef]
Figure 1. Domain shift [25] between the UAV dataset and other datasets. The content within the red dashed boxes represents background noise present in the crack images.
Figure 2. Workflow of deep learning-based crack semantic segmentation.
Figure 3. Unsupervised domain adaptation methods. (a) Adversarial learning-based method. (b) Self-training-based method.
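To make the adversarial scheme of Figure 3a concrete, the following minimal PyTorch sketch pairs a segmentation network with a domain discriminator. The toy module definitions, learning rates, and loss weight are illustrative assumptions, not the UCrack-DA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder networks; the real UCrack-DA architecture is described in the paper.
segmenter = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 2, 1))           # 2 classes: crack / background
discriminator = nn.Sequential(nn.Conv2d(2, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                              nn.Conv2d(64, 1, 4, stride=2, padding=1))

opt_seg = torch.optim.Adam(segmenter.parameters(), lr=2.5e-4)
opt_dis = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(x_src, y_src, x_tgt, lambda_adv=0.001):
    # y_src: LongTensor (N, H, W) of source-domain labels.
    # 1) Supervised segmentation loss on the labelled source domain.
    pred_src = segmenter(x_src)
    loss_seg = F.cross_entropy(pred_src, y_src)

    # 2) Adversarial loss: push target-domain predictions to look source-like (label 0).
    pred_tgt = segmenter(x_tgt)
    d_tgt = discriminator(torch.softmax(pred_tgt, dim=1))
    loss_adv = bce(d_tgt, torch.zeros_like(d_tgt))

    opt_seg.zero_grad()
    (loss_seg + lambda_adv * loss_adv).backward()
    opt_seg.step()

    # 3) Discriminator learns to tell source (0) from target (1) prediction maps.
    d_src = discriminator(torch.softmax(pred_src.detach(), dim=1))
    d_tgt = discriminator(torch.softmax(pred_tgt.detach(), dim=1))
    loss_d = bce(d_src, torch.zeros_like(d_src)) + bce(d_tgt, torch.ones_like(d_tgt))
    opt_dis.zero_grad()
    loss_d.backward()
    opt_dis.step()
    return loss_seg.item(), loss_adv.item(), loss_d.item()
```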
Figure 4. The overall structure of UCrack-DA. UCrack-DA is primarily composed of a domain discriminator, an encoder, and a decoder. The semantic segmentation network consists of a Mix-Transformer encoder, a Multi-Scale Dilated Attention Module (MSDAM), and a decoder, with the Mixed Convolutional Attention Module (MCAM) serving as the core component of the decoder.
Figure 5. Illustration of the proposed unsupervised domain adaptation method. The red arrow represents the source domain, and the blue arrow represents the target domain.
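The prediction entropy minimization constraint used alongside the adversarial training penalizes high-entropy (uncertain) target-domain predictions so that decision boundaries sharpen, in the spirit of ADVENT [39]. A minimal sketch, assuming standard per-pixel Shannon entropy; the weighting name lambda_ent is hypothetical:

```python
import torch
import torch.nn.functional as F

def entropy_loss(logits, eps=1e-8):
    """Mean per-pixel Shannon entropy of the softmax predictions.
    Generic sketch of an entropy-minimization term, not the exact UCrack-DA loss."""
    p = torch.softmax(logits, dim=1)               # (N, C, H, W) class probabilities
    ent = -(p * torch.log(p + eps)).sum(dim=1)     # per-pixel entropy map
    return ent.mean()

# Usage sketch: add lambda_ent * entropy_loss(segmenter(x_tgt)) to the target-side objective.
```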
Figure 6. Illustration of the Multi-Scale Dilated Attention Module.
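As a rough illustration of the idea behind the Multi-Scale Dilated Attention Module, the sketch below fuses parallel dilated convolutions at several rates and gates the result with an attention map. The dilation rates and fusion scheme here are assumptions for illustration, not the exact MSDAM design.

```python
import torch
import torch.nn as nn

class MultiScaleDilatedAttention(nn.Module):
    """Illustrative multi-scale dilated attention block (assumed design):
    parallel dilated 3x3 convolutions enlarge the receptive field at several
    rates, and a sigmoid gate re-weights the fused features."""
    def __init__(self, channels, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)  # same spatial size
            for r in rates)
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        multi = torch.cat([b(x) for b in self.branches], dim=1)  # multi-scale features
        fused = self.fuse(multi)
        return x + fused * self.gate(fused)                      # attention-gated residual

# Usage sketch: y = MultiScaleDilatedAttention(256)(torch.randn(1, 256, 32, 32))
```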
Figure 7. Illustration of the Mixed Convolutional Attention Module.
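A mixed convolutional attention block of the kind shown in Figure 7 combines convolutional feature extraction with attention re-weighting. Below is a hypothetical sketch in the same spirit, pairing squeeze-and-excitation-style channel attention [54] with CBAM-style spatial attention [53]; the exact MCAM composition is defined in the paper, so this pairing is an assumption.

```python
import torch
import torch.nn as nn

class MixedConvAttention(nn.Module):
    """Assumed sketch: channel attention (SE style [54]) followed by
    convolutional spatial attention (CBAM style [53])."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)                           # re-weight channels
        s = torch.cat([x.mean(1, keepdim=True),
                       x.amax(1, keepdim=True)], dim=1)   # avg + max spatial maps
        return x * self.spatial(s)                        # re-weight spatial locations
```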
Figure 8. Sample images from the datasets used in the experiments.
Figure 9. Visualization of results on Roboflow-Crack. (a) Image, (b) ground truth, (c) UCrack-DA, (d) CrackUDA, (e) ADVENT, (f) Source Only, (g) Oracle. The red dashed boxes highlight regions where our model achieves notably better results, illustrating its effectiveness.
Figure 10. Visualization of results on UAV-Crack. (a) Image, (b) ground truth, (c) UCrack-DA, (d) CrackUDA, (e) ADVENT, (f) Source Only, (g) Oracle. The red dashed boxes highlight regions where our model achieves notably better results, illustrating its effectiveness.
Figure 11. Feature distribution. (a) Before domain adaptation. (b) After domain adaptation.
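A Figure-11-style feature distribution plot can be produced by projecting pooled encoder features from both domains to two dimensions with t-SNE [60]. A sketch with placeholder random arrays standing in for real encoder outputs (one row per image):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Stand-ins for pooled encoder features; in practice these come from the model.
feats_src = np.random.randn(200, 256)   # source-domain features
feats_tgt = np.random.randn(200, 256)   # target-domain features

# Joint 2-D embedding of both domains.
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(np.vstack([feats_src, feats_tgt]))

plt.scatter(emb[:200, 0], emb[:200, 1], s=5, label="source")
plt.scatter(emb[200:, 0], emb[200:, 1], s=5, label="target")
plt.legend()
plt.savefig("feature_distribution.png", dpi=200)
```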
Table 1. Dataset split and sample statistics.
| Dataset Name | Training Samples | Validation/Test Samples | Total |
| CrackSeg9K [35] | 7332 | 1827 | 9159 |
| Roboflow-Crack [56] | 748 | 138 | 886 |
| UAV-Crack | 150 | 45 | 195 |
Table 2. Ablation study of each module combination.
| | MSAdv | MinEnt | MSDAM | MCAM | Accuracy (%) | mPA (%) | mIoU (%) |
| Remove | | | | | 96.55 | 68.28 | 63.27 |
| | | | | | 96.39 | 70.31 | 64.80 |
| | | | | | 96.33 | 70.97 | 64.96 |
| | | | | | 96.58 | 69.66 | 64.16 |
| | | | | | 96.20 | 68.48 | 62.33 |
| | | | | | 95.68 | 68.14 | 60.74 |
| | | | | | 96.51 | 67.45 | 63.51 |
| | | | | | 96.43 | 68.75 | 64.05 |
| UCrack-DA (Ours) | | | | | 96.31 | 71.76 | 65.33 |
The best results are shown in bold. ✓ indicates the module is included; ✗ indicates it is removed.
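The Accuracy, mPA, and mIoU values reported in Tables 2–4 follow the standard confusion-matrix definitions of these metrics. A sketch for the binary crack/background case, assuming pred and gt are integer label maps of the same shape with values in [0, num_classes):

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes=2):
    """Accuracy, mPA, and mIoU from a confusion matrix (standard definitions)."""
    k = (gt >= 0) & (gt < num_classes)                    # valid-pixel mask
    cm = np.bincount(num_classes * gt[k].astype(int) + pred[k].astype(int),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    accuracy = np.diag(cm).sum() / cm.sum()               # overall pixel accuracy
    pa = np.diag(cm) / cm.sum(axis=1)                     # per-class pixel accuracy
    iou = np.diag(cm) / (cm.sum(axis=1) + cm.sum(axis=0) - np.diag(cm))
    return accuracy, np.nanmean(pa), np.nanmean(iou)      # Accuracy, mPA, mIoU
```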
Table 3. Experimental results on Roboflow-Crack.
| Method | Accuracy (%) | mPA (%) | mIoU (%) |
| DACS [36] | 96.08 | 70.77 | 66.93 |
| DAFormer [37] | 96.61 | 75.04 | 71.19 |
| AdaptSegnet [38] | 97.07 | 77.06 | 71.69 |
| ADVENT [39] | 97.27 | 79.66 | 73.84 |
| CrackUDA [34] | 97.60 | 82.17 | 76.95 |
| UCrack-DA | 97.92 | 90.90 | 81.34 |
| Source Only | 96.81 | 76.79 | 69.54 |
| Oracle | 98.71 | 93.26 | 87.19 |
The best results are shown in bold.
Table 4. Experimental results on UAV-Crack.
| Method | Accuracy (%) | mPA (%) | mIoU (%) |
| DACS [36] | 96.14 | 62.12 | 58.28 |
| DAFormer [37] | 95.58 | 68.11 | 60.49 |
| AdaptSegnet [38] | 95.82 | 66.77 | 61.13 |
| ADVENT [39] | 95.78 | 67.09 | 61.23 |
| CrackUDA [34] | 96.39 | 68.45 | 62.88 |
| UCrack-DA | 96.31 | 71.76 | 65.33 |
| Source Only | 96.21 | 63.18 | 59.15 |
| Oracle | 96.97 | 79.36 | 70.44 |
The best results are shown in bold.
Table 5. Efficiency analysis of each method.
| Method | Params (M) | FLOPs (GFLOPs) | Inference Time (ms) |
| DACS [36] | 74.54 | 57.71 | 485.55 ± 9.68 |
| DAFormer [37] | 64.55 | 102.01 | 835.24 ± 10.8 |
| AdaptSegnet [38] | 74.54 | 57.71 | 489.88 ± 12.73 |
| ADVENT [39] | 74.54 | 57.71 | 486.35 ± 11.29 |
| CrackUDA [34] | 88.21 | 70.22 | 577.97 ± 9.35 |
| UCrack-DA (ours) | 97.64 | 73.36 | 652.13 ± 13.11 |
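Numbers like those in Table 5 are typically obtained by counting model parameters and timing repeated forward passes. The sketch below shows one such protocol; the input resolution, warm-up count, and run count are assumptions, and FLOPs counting would additionally require a profiler such as fvcore or thop:

```python
import time
import torch

def measure_efficiency(model, input_shape=(1, 3, 512, 512), runs=100):
    """Parameter count (millions) and mean/std inference latency (ms)."""
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(*input_shape)
    model.eval()
    with torch.no_grad():
        for _ in range(10):                  # warm-up iterations
            model(x)
        times = []
        for _ in range(runs):                # timed runs
            t0 = time.perf_counter()
            model(x)
            times.append((time.perf_counter() - t0) * 1e3)   # milliseconds
    times = torch.tensor(times)
    return params_m, times.mean().item(), times.std().item()
```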
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
