2.1. DTA-Unet Model
For crack detection, extracting the morphological characteristics of cracks is essential for subsequent safety assessments of the target structure. Therefore, performing semantic segmentation to identify crack morphology is more valuable for engineering practice than merely identifying image patches containing cracks. Pixel-level classification, enabled by fully convolutional networks, allows for crack segmentation and the identification of crack geometry. Unet is a prominent deep neural network architecture for semantic segmentation and has been widely used by researchers as a backbone model in various semantic segmentation tasks.
One of the significant challenges in training deep neural networks is the excessive number of parameters that require optimization. An abundance of trainable parameters necessitates a larger and more diverse dataset for effective model training; otherwise, the model risks underfitting or overfitting. Concurrently, a greater number of parameters imposes higher demands on hardware performance during training, leading to increased computational costs. For semantic segmentation tasks such as crack detection, which are essentially pixel-level binary classification problems (determining whether a pixel belongs to a crack or the background), enhancing the model’s expressive capacity is crucial. Before dynamic convolution was introduced, conventional convolutions used the same kernel parameters for all input images, which restricted the network’s capacity for feature representation. Typically, network depth or width is increased to boost expressive power, but this approach significantly escalates computational costs. Therefore, Yang [26] proposed a conditionally parameterized convolution in which an attention mechanism is assigned to multiple parallel convolutional kernels, breaking the characteristic of traditional convolutions that share parameters across all inputs. This method demonstrated excellent performance in detection, providing a new direction for enhancing model expressiveness. Building upon conditionally parameterized convolutions, Chen [27] introduced dynamic convolution, which accelerated model training and reduced the number of parameters. Li [28], analyzing the issues of large parameter counts in dynamic convolution and the difficulty in jointly optimizing dynamic attention with conventional kernels from a matrix decomposition perspective, proposed a Dynamic Convolution Decomposition (DCD) model incorporating a dynamic channel fusion mechanism. Compared to conditionally parameterized convolutions and dynamic convolutions, DCD features fewer parameters and improved performance. Consequently, this study replaces all standard convolutions in the encoder–decoder part of the Unet with DCD, aiming to achieve substantial performance gains with a relatively small increase in parameter count.
In deep neural networks, attention mechanisms assign varying weight parameters to input feature maps, enabling models to focus more on critical information while disregarding irrelevant details such as background. To facilitate information interaction across channel dimensions, the Squeeze-and-Excitation (SE) block was proposed [29]. However, the excitation phase primarily relies on dimensionality reduction and expansion operations, where reduction may hinder the effective learning of channel interdependencies. Wang [30] introduced an Efficient Channel Attention (ECA) mechanism, which eliminates dimensionality reduction and employs one-dimensional convolution to enable partial cross-channel interaction, thereby reducing the parameter count and improving performance. Liu [31] introduced a Normalization-based Attention Module (NAM), which leverages variance from batch normalization to represent the importance of channels and spatial locations, with greater variance indicating richer information and higher importance. Woo [32] proposed the Convolutional Block Attention Module (CBAM), integrating both spatial and channel attention mechanisms. While CBAM adaptively refines features and can be seamlessly integrated into CNN architectures, its spatial and channel attention components operate independently, lacking beneficial cross-dimensional interaction. Consequently, a nearly parameter-free Triplet Attention (TA) mechanism was introduced. It leverages residual connections and rotational transformations between tensors to achieve information interaction across three dimensions: channels, height, and width, while simultaneously avoiding the adverse effects associated with dimensionality reduction present in modules like CBAM. Compared to the aforementioned attention mechanisms, TA not only implements attention across spatial and channel dimensions but also facilitates cross-dimensional information exchange. Therefore, TA was selected for feature refinement. Furthermore, in crack identification, the background occupies a large area of the image, and the target cracks exhibit significant variations in shape and size. To address this, the present study combines TA with Attention Gates (AG). This combined approach not only enables the identification of cracks of various sizes but also mitigates the adverse effects of irrelevant regions and background information introduced by skip connections.
The proposed crack identification network, DTA-Unet, is illustrated in Figure 1. It is built upon the Unet architecture, with DCD and TA incorporated into its fundamental network. First, a conventional encoder–decoder structure was employed in the network. The encoder consists of dual DCD blocks, TA, and max-pooling layers, responsible for feature extraction, feature refinement, reduction in image resolution, and increasing channel depth. The decoder comprises bilinear upsampling layers, dual DCD blocks, skip connections, TA, and 1 × 1 convolutions, tasked with restoring the encoded abstract feature maps to the original image size. Conventional CNNs are constrained by computational cost, limiting significant increases in network depth or width, which in turn restricts their expressive power. Replacing standard convolutions with DCD throughout the encoder–decoder process helps the network capture more detailed semantic information while effectively balancing computational cost and representational capacity. The fundamental principle of DCD is as follows:
$$W(x) = \sum_{k=1}^{K} \pi_k(x)\, W_k$$

where $x$ represents the input feature map, $W(x)$ denotes the dynamic convolution kernel, $W_k$ signifies the $k$-th standard convolution kernel, and $K$ indicates the number of convolution kernels, with the index $k$ taking values in $\{1, \dots, K\}$. $\pi_k(x)$ refers to the attention weight coefficient, which ranges between 0 and 1. Additionally, $\sum_{k=1}^{K} \pi_k(x) = 1$.
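The attention-weighted aggregation of kernels described above can be illustrated with a minimal NumPy sketch. The linear attention head `attn_weights`, the pooled feature vector `x`, and the function name `dynamic_kernel` are hypothetical stand-ins chosen for illustration; they are not taken from the DTA-Unet implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dynamic_kernel(x, kernels, attn_weights):
    """Aggregate K static kernels into one input-dependent kernel.

    x            : (C_in,) pooled feature vector (assumed input summary)
    kernels      : (K, C_out, C_in) static 1x1 kernels W_k
    attn_weights : (K, C_in) hypothetical linear attention head
    """
    pi = softmax(attn_weights @ x)          # (K,), each in [0, 1], sums to 1
    W = np.tensordot(pi, kernels, axes=1)   # sum_k pi_k * W_k -> (C_out, C_in)
    return W, pi

rng = np.random.default_rng(0)
x = rng.normal(size=8)
kernels = rng.normal(size=(4, 5, 8))
attn_weights = rng.normal(size=(4, 8))
W, pi = dynamic_kernel(x, kernels, attn_weights)
```

Because the attention coefficients depend on `x`, two different inputs generally yield two different aggregated kernels, which is what distinguishes this scheme from a standard convolution.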
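By leveraging residuals to represent each standard convolution, $W_k$ can be defined as:

$$W_k = W_0 + \Delta W_k, \qquad W_0 = \frac{1}{K} \sum_{k=1}^{K} W_k$$

where $W_0$ is the average convolution kernel. Applying singular value decomposition to the residual $\Delta W_k$ gives:

$$\Delta W_k = U_k S_k V_k^{T}$$

where $V_k$ signifies the right singular matrix, $S_k$ denotes the diagonal matrix, and $U_k$ is the left singular matrix corresponding to $\Delta W_k$.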
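Thus, the complete dynamic convolution is obtained as:

$$W(x) = \sum_{k=1}^{K} \pi_k(x)\, W_0 + \sum_{k=1}^{K} \pi_k(x)\, U_k S_k V_k^{T} = W_0 + U\, \Pi(x)\, S\, V^{T}$$

where $U = [U_1, \dots, U_K]$ collects the left singular matrices of the residuals $\Delta W_k = W_k - W_0$, with shape $C \times KC$ for $C$ channels. $S = \mathrm{diag}(S_1, \dots, S_K)$ is a diagonal matrix with shape $KC \times KC$, and $\Pi(x) = \mathrm{diag}(\pi_1(x) I, \dots, \pi_K(x) I)$, with $I$ the $C \times C$ unit matrix, is a diagonal matrix representing the convolutional weight coefficients obtained via the attention mechanism, with shape $KC \times KC$. And $V = [V_1, \dots, V_K]$ collects the right singular matrices of $\Delta W_k$, with shape $C \times KC$.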
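The equivalence between the attention-weighted sum of kernels and the residual form built from singular value decompositions can be checked numerically. The sketch below assumes square 1 × 1 kernels and uses illustrative names (`W0`, `U`, `S`, `V`, `Pi`) in the style of the matrix-decomposition view of dynamic convolution; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C = 4, 6                                   # illustrative sizes
kernels = rng.normal(size=(K, C, C))          # static kernels W_k
pi = rng.random(K)
pi /= pi.sum()                                # attention weights summing to 1

W0 = kernels.mean(axis=0)                     # average kernel
U_blocks, V_blocks = [], []
S = np.zeros((K * C, K * C))
for k in range(K):
    # SVD of the residual: W_k - W0 = U_k diag(s_k) V_k^T
    U_k, s_k, Vt_k = np.linalg.svd(kernels[k] - W0)
    U_blocks.append(U_k)
    V_blocks.append(Vt_k.T)
    S[k * C:(k + 1) * C, k * C:(k + 1) * C] = np.diag(s_k)

U = np.hstack(U_blocks)                       # C x KC
V = np.hstack(V_blocks)                       # C x KC
Pi = np.kron(np.diag(pi), np.eye(C))          # KC x KC, pi_k repeated C times

W_direct = np.tensordot(pi, kernels, axes=1)  # sum_k pi_k W_k
W_decomp = W0 + U @ Pi @ S @ V.T              # static kernel + dynamic residual
```

The two results agree only because the attention weights sum to one; the latent space of the residual branch has `K * C` channels, which is the redundancy that the decomposition is meant to expose.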
If the constraint
is relaxed, and the channel dimension is utilized to represent dynamic convolution, the channel count for a single standard convolution is $C$. Consequently, the total channel count for $K$ standard convolutions becomes $KC$:
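Finally, setting the dimensionality of the hidden space to $L$ and substituting $P$, $\Phi(x)$ and $Q$ into the equation, the complete expression for DCD is expressed as:

$$W(x) = \Lambda(x) \times_2 W_0 + \Phi(x) \times_1 Q \times_2 P \times_3 R$$

where $x$ represents the input feature map and $\times_n$ denotes the $n$-mode product. Both $W(x)$ and $W_0$ are tensors of order $C \times C \times k^2$, where $C$ denotes the number of input channels, and $k$ indicates the convolution kernel size ($k \times k$). $W(x)$ denotes the Dynamic Convolution Decomposition. $\Lambda(x)$ is a diagonal matrix of order $C \times C$, generated by an attention mechanism. $W_0$ denotes the average convolution kernel. $Q$ is a matrix of order $C \times L$ that compresses the number of input channels from $C$ to $L$. $R$ is a matrix of order $k^2 \times L_k$, designed to compress the number of kernel elements from $k^2$ to $L_k$. $\Phi(x)$ is a tensor of order $L \times L \times L_k$, whose function is to dynamically fuse the $L$ latent channels and $L_k$ kernel elements, significantly reducing the dimensionality of the hidden space. $P$ is a matrix of order $C \times L$, used to increase the dimensionality back to the desired output channel count. To reduce the total number of parameters, the parameter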
is manually set to
.
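To make the role of the low-dimensional hidden space concrete, the following sketch builds a DCD-style kernel for the simplified case of a 1 × 1 convolution, where the kernel-element compression drops out and the kernel reduces to a static term scaled by channel attention plus a low-rank dynamic residual. The matrix form `diag(lam) @ W0 + P @ phi @ Q.T`, the sizes, and the random stand-ins for the attention outputs are assumptions made for illustration, not the configuration used in DTA-Unet.

```python
import numpy as np

def dcd_kernel(W0, P, Q, lam, phi):
    """Assumed 1x1 DCD form: diag(lam) @ W0 + P @ phi @ Q.T.

    W0  : (C, C) static (average) kernel
    P   : (C, L) expands the hidden space back to C output channels
    Q   : (C, L) compresses C input channels to L hidden channels
    lam : (C,)   channel-wise attention (diagonal of Lambda(x))
    phi : (L, L) dynamic channel-fusion matrix Phi(x)
    """
    return np.diag(lam) @ W0 + P @ phi @ Q.T

rng = np.random.default_rng(0)
C, L, K = 64, 8, 4                        # illustrative sizes only
W0 = rng.normal(size=(C, C))
P = rng.normal(size=(C, L))
Q = rng.normal(size=(C, L))
lam = rng.random(C)                       # stand-in for attention output
phi = rng.normal(size=(L, L))             # stand-in for attention output
W = dcd_kernel(W0, P, Q, lam, phi)

residual_rank = np.linalg.matrix_rank(P @ phi @ Q.T)
static_params_dcd = W0.size + P.size + Q.size   # C^2 + 2CL
static_params_dynamic = K * C * C               # K full kernels
```

With these sizes the static parameters drop from `K * C * C` for vanilla dynamic convolution to `C * C + 2 * C * L`, and the dynamic residual has rank at most `L`. The small network that would generate `lam` and `phi` from the input is omitted here and would add its own parameters.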
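Furthermore, for further feature refinement, Triplet Attention (TA) is incorporated into both the encoder and decoder processes, enhancing the model’s focus on the segmentation target. Let the feature extracted by DCD be denoted as $\chi$:

$$y = \frac{1}{3}\left(\overline{\hat{\chi}_1 \odot \sigma\left(\psi_1(\hat{\chi}_1^{*})\right)} + \overline{\hat{\chi}_2 \odot \sigma\left(\psi_2(\hat{\chi}_2^{*})\right)} + \chi \odot \sigma\left(\psi_3(\hat{\chi}_3)\right)\right)$$

where $\odot$ denotes element-wise multiplication, $\sigma$ is the Sigmoid function, and $\psi_1$, $\psi_2$, and $\psi_3$ are the convolutions of the three branches. $\hat{\chi}_1$ and $\hat{\chi}_2$ are the rotated versions of $\chi$, $\hat{\chi}_1^{*}$ and $\hat{\chi}_2^{*}$ are their Z-Pool outputs, $\hat{\chi}_3$ is the Z-Pool output of $\chi$, and the overline denotes rotation back to the original orientation. The shape of $\chi$ is $C \times H \times W$, where $C$ represents the number of channels, $H$ represents the height, and $W$ represents the width.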
The architecture of TA is illustrated in Figure 2. It employs a three-branch structure. The first two branches are similar, primarily using rotation, Z-Pool, convolution, and Sigmoid functions to facilitate dimensional interaction between the C and H dimensions and between the C and W dimensions, respectively. The third branch is designed to compute spatial attention weights. Finally, the outputs of the three branches are averaged. The resulting feature map not only incorporates the advantages of channel attention (CA) and spatial attention (SA) but also encompasses cross-dimensional interaction information, thereby achieving the objective of feature refinement.
The Z-Pool layer reduces the zeroth dimension of the tensor to two by concatenating the max-pooled and average-pooled features along that dimension. This allows the layer to retain a rich representation of the original tensor while reducing its depth, thereby lowering the computational cost. The expression for Z-Pool is as follows:

$$\text{Z-Pool}(\chi) = \left[\mathrm{MaxPool}_{0d}(\chi),\ \mathrm{AvgPool}_{0d}(\chi)\right]$$

where $\mathrm{MaxPool}_{0d}$ and $\mathrm{AvgPool}_{0d}$ denote the max pooling and average pooling operations applied along the zeroth dimension.
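The pooling step can be sketched in a few lines of NumPy. The function name `z_pool` and the axis permutation used to mimic a rotated branch are illustrative assumptions; a full TA branch would additionally apply a convolution and a Sigmoid to the pooled tensor.

```python
import numpy as np

def z_pool(x):
    """Stack max- and average-pooled maps taken over dimension 0.

    x : (D0, D1, D2) tensor  ->  (2, D1, D2) tensor
    """
    return np.stack([x.max(axis=0), x.mean(axis=0)], axis=0)

# A toy feature map with C=2 channels, H=3, W=4.
x = np.arange(24, dtype=float).reshape(2, 3, 4)
pooled = z_pool(x)                           # (2, H, W): spatial branch

# Permuting the axes puts another dimension in position 0, so the same
# pooling then summarizes that dimension instead of the channels.
rotated = np.transpose(x, (2, 1, 0))         # (W, H, C)
pooled_rotated = z_pool(rotated)             # (2, H, C)
```

Whatever the size of the pooled dimension, the output always has depth two, which is why the subsequent convolution stays cheap.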
2.3. Optimization Algorithm
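This study employs the Adam optimization algorithm [33] to update the model parameters, which is expressed as:

$$\theta_t = \theta_{t-1} + \Delta\theta$$

where $t$ and $t-1$ represent the current and previous iteration steps, respectively. The symbol $\theta$ is the model parameter to be updated, and $\Delta\theta$ refers to the parameter update increment, which is determined by the following equations:

$$\Delta\theta = -\varepsilon\, \frac{\hat{s}_t}{\sqrt{\hat{r}_t} + \delta}, \qquad \hat{s}_t = \frac{s_t}{1 - \rho_1^{t}}, \qquad \hat{r}_t = \frac{r_t}{1 - \rho_2^{t}}$$

where $\varepsilon$ is the learning rate, $\hat{s}_t$ and $\hat{r}_t$ are the bias-corrected first and second moment estimates, $\delta$ is a small constant added for numerical stability, and $\rho_1$ and $\rho_2$ are the decay rates for the first and second moment estimates, respectively. The first moment estimate $s_t$ and the second moment estimate $r_t$ are calculated according to the following formulas:

$$s_t = \rho_1 s_{t-1} + (1 - \rho_1)\, g_t, \qquad r_t = \rho_2 r_{t-1} + (1 - \rho_2)\, g_t \odot g_t$$

$$g_t = \frac{1}{m} \nabla_{\theta} \sum_{i=1}^{m} L\left(f(x_i; \theta),\, y_i\right)$$

In these equations, $g_t$ represents the mean gradient, $m$ is the number of images in a training batch, $y_i$ is the ground truth label of a pixel, $L(f(x_i; \theta), y_i)$ is the loss function, and $f(x_i; \theta)$ is the predicted output from the neural network.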
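A single Adam update can be written compactly in NumPy. This is a minimal sketch in which the default hyperparameter values are the commonly used Adam defaults, assumed here for illustration and not reported as the settings of this study; the quadratic toy objective is likewise hypothetical.

```python
import numpy as np

def adam_step(theta, grad, s, r, t,
              lr=1e-3, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update. Returns the new parameters and moment estimates."""
    s = rho1 * s + (1.0 - rho1) * grad           # first moment estimate
    r = rho2 * r + (1.0 - rho2) * grad * grad    # second moment estimate
    s_hat = s / (1.0 - rho1 ** t)                # bias correction
    r_hat = r / (1.0 - rho2 ** t)
    delta_theta = -lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta + delta_theta, s, r

# Toy usage: minimise f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta0 = np.array([1.0, -2.0])
theta = theta0.copy()
s = np.zeros_like(theta)
r = np.zeros_like(theta)
for t in range(1, 201):                          # t starts at 1
    theta, s, r = adam_step(theta, 2.0 * theta, s, r, t, lr=0.05)
```

On the first step the bias correction makes the update magnitude close to the learning rate for every parameter, regardless of the gradient scale.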
2.4. Two-Stage Model Training Method
Concrete crack images captured in the field often contain complex background patterns, which can lead to identification errors even in trained deep neural networks. To address this, a two-stage model training method is proposed that selects, from the crack-free noise images, those that actually mislead the network.
After cropping and labeling crack images obtained during field inspections, this study constructed a dataset containing 6785 annotated crack images and 4285 noise images without cracks.
Manually selected noise images resembling cracks may reflect human bias and may not necessarily confuse a neural network. Training with such images does not substantially contribute to improving model robustness, so it is unnecessary to use all non-crack images for training. Moreover, training with larger datasets requires more powerful hardware resources. To enhance training efficiency, the following two-stage training method was proposed.
In the first stage, the deep neural network model is trained using only the 6785 pixel-wise annotated crack images, as shown in Figure 3. Subsequently, the initially trained model is applied to the 4285 non-crack noise images. Noise images that cause significant errors in the initial model are retained, while those that are easily distinguished are discarded.
In the second stage, the retained noise images are combined with the 6785 crack images to form an enhanced crack dataset. All subsequent network models are then trained using this consolidated dataset.
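The selection step between the two stages can be sketched as a simple filter. The text does not state how a "significant error" is measured, so the criterion below (the fraction of pixels falsely predicted as crack exceeding a threshold) and the names `select_hard_noise_images`, `predict_mask`, and `max_false_positive_ratio` are hypothetical choices made for illustration.

```python
def select_hard_noise_images(noise_images, predict_mask,
                             max_false_positive_ratio=0.01):
    """Keep crack-free images that the stage-one model gets wrong.

    noise_images : iterable of images known to contain no cracks
    predict_mask : callable returning a 2-D list of 0/1 predictions
                   (stands in for the stage-one model)
    """
    retained = []
    for image in noise_images:
        mask = predict_mask(image)
        total = sum(len(row) for row in mask)
        # Every predicted crack pixel is a false positive on these images.
        false_positives = sum(sum(row) for row in mask)
        if total and false_positives / total > max_false_positive_ratio:
            retained.append(image)
    return retained

# Toy usage with a stub model: "wall" is predicted clean, "joint" is not.
stub_predictions = {
    "wall": [[0, 0], [0, 0]],
    "joint": [[1, 1], [0, 0]],
}
hard = select_hard_noise_images(["wall", "joint"], stub_predictions.get)
```

The retained images would then be merged with the annotated crack images, using all-background masks as their labels, to form the second-stage training set.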