2.1. Overall Architecture
To address these challenges, we propose a crack segmentation model based on feature integration and triple attention. The model adopts DeepLabv3+, which is known for its multi-scale context capture capability, as the baseline framework and introduces three core modules on this foundation. The overall network architecture is shown in Figure 1.
The network first employs a ResNet-50 backbone, augmented with dilated convolutions, to extract feature maps rich in semantic information. The residual structure of ResNet eases the optimization of the network and extracts hierarchical visual representations more effectively. In addition, dilated convolution is introduced into ResNet to expand the receptive field while maintaining the resolution of the feature map. This design is widely used in semantic segmentation [14]; it alleviates the vanishing-gradient problem in deep networks and retains the characteristics of small crack structures during feature extraction.
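As a rough illustration of why dilation enlarges the receptive field without shrinking the feature map, the following sketch applies standard convolution arithmetic. The function names and the 64-pixel input size are illustrative assumptions and are not taken from the proposed network.

```python
def effective_kernel(k: int, d: int) -> int:
    """Span (in pixels) covered by a k-tap kernel with dilation d."""
    return k + (k - 1) * (d - 1)


def conv_out_size(n: int, k: int, stride: int = 1, padding: int = 0,
                  dilation: int = 1) -> int:
    """Output length along one axis for a standard (dilated) convolution."""
    return (n + 2 * padding - dilation * (k - 1) - 1) // stride + 1


n = 64  # hypothetical feature-map side length
for d in (1, 2, 4):
    # With stride 1 and padding = d, a 3-tap kernel keeps the size at n
    # while the covered span grows with d.
    print(d, effective_kernel(3, d), conv_out_size(n, 3, padding=d, dilation=d))
# prints: 1 3 64 / 2 5 64 / 4 9 64

# A stride-2 convolution, by contrast, halves the resolution.
print(conv_out_size(n, 3, stride=2, padding=1))  # prints 32
```

The covered span grows with the dilation rate while the output size stays fixed, which is the property the backbone relies on to keep thin crack structures visible in the feature map.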
Following the ResNet-50 feature extraction network, a triple-dimensional interactive attention mechanism (TDIA) is introduced. This module models the interdependencies among the channel, height, and width dimensions, enhancing the capacity of the network to spatially localize crucial crack features. After the TDIA module, a multi-groups dilation feature fusion module (MDFF) is applied. This module enriches the diversity of multi-scale crack feature representations, thereby improving the model’s ability to capture complex crack characteristics. In the decoding stage, a feature integration branch is added to the original decoding path. This branch integrates feature maps from three stages of ResNet-50 (Stage 1, Stage 2, and Stage 3) and uses a dimension aware selective integration module (DASI) so that the network adaptively selects and integrates features from different layers, thereby enhancing the saliency of small cracks in complex backgrounds.
Before the detailed mathematical formulations of the three modules are presented, their intuitive roles in the proposed framework are summarized in Table 1. This summary is intended to help readers understand the functional motivation of each module and its relationship to the challenges of low-quality pavement crack segmentation. Specifically, TDIA focuses on directional crack localization, MDFF focuses on anisotropic multi-scale slender crack modeling, and DASI focuses on the adaptive recovery of weak crack details through cross-layer feature integration.
2.2. Triple-Dimensional Interactive Attention (TDIA)
Attention mechanisms are inspired by the human cognitive ability to selectively focus on important information. In deep neural networks, they model interdependencies among feature channels and generate spatial attention weights, allowing the network to enhance target-relevant features while suppressing irrelevant background interference. By integrating attention mechanisms, deep convolutional neural networks can better capture critical features and achieve improved performance in large-scale visual tasks.
Although existing attention mechanisms, such as the convolutional block attention module (CBAM) [15] and coordinate attention (CA) [16], have made significant progress in computer vision tasks, they have limitations when applied to road cracks. Cracks usually have a slender, linear topology and low contrast with the background. These attention mechanisms focus on the independent weighting of channels and space and rely on isotropic square receptive fields. When they process the directional characteristics of cracks, surrounding pavement texture information is easily introduced, which weakens the continuity and directional spatial localization of the crack features. To investigate this issue objectively, TDIA is quantitatively compared with CBAM [15] and CA [16] under the same backbone network and training settings in Section 4.6.
Therefore, this study proposes a triple-dimensional interactive attention mechanism placed after the feature extraction part of the ResNet-50 backbone network. This module aims to establish interactive relationships among the channel, height, and width dimensions. It models channel and spatial attention more efficiently without reducing dimensionality, thereby improving the network's ability to locate key targets.
TDIA differs from CBAM and coordinate attention in its explicit cross-dimensional interaction strategy. CBAM models channel attention and spatial attention sequentially, while coordinate attention embeds position information into channel attention through two one-dimensional direction encodings. In contrast, TDIA constructs three pairwise interaction branches, namely the C × H, C × W, and H × W branches, to jointly model channel semantics, directional spatial extension, and spatial continuity. The C × H and C × W branches capture the interaction between the channel dimension C and the height H or width W, respectively, whereas the H × W branch models the internal relationship between the two spatial dimensions H and W. Finally, the attention weights computed by the three branches are multiplied. This three-way interactive design is important for road crack segmentation because pavement cracks usually exhibit strong directional extension, such as long longitudinal and transverse strips. The C × H and C × W branches capture the interdependencies between channel characteristics and the vertical or horizontal direction, respectively, thus retaining the morphological continuity of slender cracks in the feature map. At the same time, the convolution kernels used in the three branches filter the interference generated by the complex road surface, which helps maintain the continuous shape of the crack and improves crack localization accuracy under weak contrast. The structure of the TDIA module is shown in Figure 2.
The TDIA module can be understood as a three-branch attention block designed for the elongated geometry of pavement cracks. Instead of computing channel attention and spatial attention separately, TDIA builds three pairwise interactions. The C × H branch emphasizes crack continuity along the vertical direction, the C × W branch emphasizes crack continuity along the horizontal direction, and the H × W branch enhances the spatial structure of crack regions. The outputs of these three branches are then multiplied to generate a crack-aware attention map. In this way, TDIA aims to strengthen weak and slender crack responses while suppressing irrelevant pavement textures.
For clarity, X ∈ ℝ^(C × H × W) denotes the input feature map, where C, H, and W represent the channel number, height, and width, respectively. The C × H, C × W, and H × W branches each generate one attention map.
The first branch, the C × H branch, establishes the interaction between the channel and height dimensions. First, strip average pooling and strip max pooling are applied to the input feature X along the width dimension, and the results are summed to obtain a C × H × 1 feature map encoding the localization information, as in Equation (1).
In Equation (1), H represents the height, and the feature map is obtained by summing the results of average pooling and max pooling, with each height position as the unit.
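The pooling step described for Equation (1) can be sketched in NumPy as follows. This is a minimal illustration under the assumption of a single (C, H, W) feature map without a batch axis; the function name `strip_pool_width` is hypothetical.

```python
import numpy as np


def strip_pool_width(x: np.ndarray) -> np.ndarray:
    """Sum of average and max pooling along the width axis.

    x has shape (C, H, W); the result has shape (C, H, 1), so every
    (channel, row) pair is summarized by one value.
    """
    return x.mean(axis=2, keepdims=True) + x.max(axis=2, keepdims=True)


x = np.arange(24, dtype=np.float64).reshape(2, 3, 4)  # toy input, C=2, H=3, W=4
z = strip_pool_width(x)
print(z.shape)     # (2, 3, 1)
print(z[0, 0, 0])  # mean(0, 1, 2, 3) + max(0, 1, 2, 3) = 1.5 + 3.0 = 4.5
```

Pooling along the height axis instead (`axis=1`) gives the corresponding C × 1 × W map used by the C × W branch.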
Based on the location-embedded feature map obtained from Equation (1), the attention feature map is generated by the following encoding method. First, a one-dimensional convolution (Conv1D) with a kernel size of 7 is used to enhance the position feature map. This choice follows representative attention modules: CBAM [15] uses a 7 × 7 convolution to generate the spatial attention map in its spatial attention branch, and the triplet attention module [17] also uses a kernel size of 7. This kernel size balances a larger receptive field against limited computational overhead. Group normalization (GN) [18] is then applied to improve the generalizability of training with mini-batches. Finally, the sigmoid activation function generates attention weights across the height dimension. The encoding method is shown in Equation (2).
Equation (2) uses the sigmoid function as its final activation.
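One plausible way to write the C × H branch from the description above is sketched here. The symbols Z^{CH} (pooled feature map) and A^{CH} (attention map), the operator names, and the use of σ for the sigmoid are assumed notation for illustration and may differ from the notation of Equations (1) and (2).

```latex
\begin{align}
Z^{CH} &= \mathrm{AvgPool}_{W}(X) + \mathrm{MaxPool}_{W}(X),
  \qquad Z^{CH} \in \mathbb{R}^{C \times H \times 1}, \\
A^{CH} &= \sigma\!\left(\mathrm{GN}\!\left(\mathrm{Conv1D}_{7}\!\left(Z^{CH}\right)\right)\right).
\end{align}
```

Here the subscript W indicates pooling along the width dimension and the subscript 7 indicates the kernel size.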
Similar to the C × H branch, the C × W branch first applies strip average pooling and strip max pooling to the input feature map X along the height dimension, as defined in Equation (3). It then employs a similar encoding method to generate an attention feature map for the width dimension, as shown in Equation (4).
In Equation (3), W represents the width, and the feature map is obtained by summing the results of average pooling and max pooling, with each width position as the unit. Equation (4) again uses the sigmoid function as its final activation.
The final branch computes the attention feature map between the two spatial dimensions. First, the input tensor X is compressed along the channel dimension via both average pooling and max pooling to obtain the spatial information feature map, as shown in Equation (5). Subsequently, the spatial information feature map is processed by two strip convolutions with kernel sizes of 1 × 7 and 7 × 1. The combination of horizontal and vertical strip convolutions enables the network to capture directional features along both spatial axes, enhancing the representation of slender and narrow crack structures in the spatial dimension [19]. It also avoids the irrelevant background interference caused by the square receptive field of traditional convolution and improves the directionality of feature extraction. Finally, the spatial attention feature map is output through the sigmoid activation function, as shown in Equation (6).
In Equation (5), the average pooling and max pooling results are computed across the channels at each spatial position and then summed to obtain the feature map. Equation (6) uses the sigmoid function as its final activation.
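The H × W branch can be sketched in NumPy as below. This is an illustration under stated assumptions, not the exact layer configuration: the two strip convolutions are assumed to be applied one after the other, uniform kernel weights stand in for learned weights, and the helper names `strip_conv` and `hw_attention` are hypothetical.

```python
import numpy as np


def sigmoid(a: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-a))


def strip_conv(z: np.ndarray, kernel: np.ndarray, axis: int) -> np.ndarray:
    """Zero-padded, same-size 1-D correlation of a 2-D map along one axis."""
    k = len(kernel)
    pad_width = [(0, 0), (0, 0)]
    pad_width[axis] = (k // 2, k // 2)
    zp = np.pad(z, pad_width)
    n = z.shape[axis]
    out = np.zeros_like(z, dtype=np.float64)
    for t, w in enumerate(kernel):
        # Shifted window of the padded map, weighted by the t-th tap.
        out += w * np.take(zp, np.arange(t, t + n), axis=axis)
    return out


def hw_attention(x: np.ndarray, k_row: np.ndarray, k_col: np.ndarray) -> np.ndarray:
    """x: (C, H, W) -> spatial attention map of shape (1, H, W)."""
    z = x.mean(axis=0) + x.max(axis=0)   # channel pooling, (H, W)
    z = strip_conv(z, k_row, axis=1)     # 1 x 7 strip along the width
    z = strip_conv(z, k_col, axis=0)     # 7 x 1 strip along the height
    return sigmoid(z)[None, :, :]


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 20))
k = np.full(7, 1.0 / 7.0)                # placeholder for learned weights
a_hw = hw_attention(x, k, k)
print(a_hw.shape)                        # (1, 16, 20)
```

Each strip kernel touches only 7 positions along one axis, so the pair covers a cross-shaped neighborhood with 14 weights instead of the 49 weights of a full 7 × 7 kernel.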
Finally, the three attention feature maps obtained from Equations (2), (4), and (6) are multiplied to obtain the comprehensive attention feature map. This comprehensive weight map is then multiplied by the input feature map X to obtain the output T of the TDIA module, as shown in Equation (7).
The final output of the TDIA module is T. This module captures cross-dimensional interactions across the channel, height, and width dimensions. It enhances key crack features and improves localization while preserving the original semantic information.
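The fusion step can be illustrated with NumPy broadcasting. The sketch assumes the three attention maps have shapes (C, H, 1), (C, 1, W), and (1, H, W), which follows from the pooling directions described above; the variable names and random values are placeholders for illustration only.

```python
import numpy as np

C, H, W = 4, 5, 6
rng = np.random.default_rng(0)

x = rng.standard_normal((C, H, W))   # input feature map X
a_ch = rng.uniform(size=(C, H, 1))   # stand-in for the C x H attention map
a_cw = rng.uniform(size=(C, 1, W))   # stand-in for the C x W attention map
a_hw = rng.uniform(size=(1, H, W))   # stand-in for the H x W attention map

# Broadcasting expands each map over its missing axis, so the product
# is a full (C, H, W) weight map.
attn = a_ch * a_cw * a_hw
t = attn * x                         # output T of the attention block

print(attn.shape, t.shape)           # (4, 5, 6) (4, 5, 6)
```

Because each weight lies in (0, 1), a position is emphasized only when all three branches agree, which matches the stated goal of suppressing irrelevant pavement texture.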
In summary, TDIA is designed to enhance crack-related features through three complementary interactions. The C × H and C × W branches preserve the directional continuity of slender cracks along vertical and horizontal directions, while the H × W branch strengthens the spatial crack structure. Therefore, TDIA provides a more task-specific attention mechanism for low-quality crack images than conventional channel–spatial attention modules.
2.3. Multi-Groups Dilation Feature Fusion (MDFF)
After feature extraction, the traditional DeepLabv3+ model feeds the feature map into the atrous spatial pyramid pooling (ASPP) module before sending the output feature map to the decoder. The ASPP module consists of five parallel branches: a 1 × 1 convolution, three 3 × 3 dilated convolutions with dilation rates of 6, 12, and 18, and global average pooling. The three parallel dilated convolutions extract features from different receptive fields and yield feature representations of crack targets at different scales. Global average pooling captures global contextual information, and its output is upsampled to match the feature map size of the other four branches.
However, because the traditional ASPP module [4] uses a fixed square receptive field, it encounters significant limitations in complex pavement environments. Cracks typically manifest as slender linear structures. Consequently, square convolutions inevitably introduce substantial background noise into the feature maps when expanding the receptive field, which dilutes fine crack features. Furthermore, the ASPP module applies the same convolution operation across all channels and therefore fails to adequately model morphological diversity. We therefore improve the ASPP module by incorporating a grouping strategy [20] and propose the multi-groups dilation feature fusion module (MDFF), whose structure is shown in Figure 3.
To make the feature extraction process clearer, the MDFF module can be understood as a grouped improved ASPP-like structure. First, the input feature map is divided into four sub-feature maps along the channel dimension, and each sub-feature map has a size of C/4 × H × W. Then, each sub-feature map is independently processed by the same multi-branch feature extraction block. For each group, one branch uses a 1 × 1 convolution to preserve local channel information, while the other three branches use strip dilated convolutions with dilation rates of 4, 8, and 12 to capture crack features at different receptive-field scales. The outputs of the four branches within each group are fused to obtain one corresponding output sub-feature map. Finally, the four output sub-feature maps are concatenated along the channel dimension to recover the final output feature map with size C × H × W.
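The split, process, and concatenate pattern described above can be sketched as follows. The per-group block is replaced by a hypothetical identity placeholder (`process_group`), since the sketch only aims to show how the channel grouping preserves the C × H × W size.

```python
import numpy as np


def process_group(g: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the per-group four-branch block.

    It returns its input unchanged; the real block would combine a 1 x 1
    convolution with three strip dilated convolutions.
    """
    return g


def grouped_forward(x: np.ndarray, n_groups: int = 4) -> np.ndarray:
    """Split (C, H, W) into n_groups along channels, process, re-concatenate."""
    groups = np.split(x, n_groups, axis=0)       # requires C % n_groups == 0
    outs = [process_group(g) for g in groups]    # each of shape (C/N, H, W)
    return np.concatenate(outs, axis=0)          # back to (C, H, W)


x = np.arange(8 * 5 * 6, dtype=np.float64).reshape(8, 5, 6)
y = grouped_forward(x)
print([g.shape for g in np.split(x, 4, axis=0)])  # four groups of (2, 5, 6)
print(y.shape)                                    # (8, 5, 6)
```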
Specifically, the output T of the TDIA attention module is used as the input feature map of the MDFF module and is divided into N groups (N = 4) along the channel dimension, generating four sub-feature maps, each with C/N channels. Each of the four channel groups is independently processed by the same four-branch improved ASPP-style block. In this block, one branch adjusts the number of channels with a 1 × 1 convolution, and the other three branches apply strip convolution modules with dilation rates d of 4, 8, and 12; together, the four branches generate the corresponding group-wise output feature map. The calculation process for the strip convolution module is shown in Equation (8).
Equation (8) takes the i-th sub-feature map obtained after channel splitting (i = 1, 2, 3, 4) as input. The strip dilated convolution operation is applied to each channel group under a given dilation rate d, where d takes values of 4, 8, and 12 in the strip dilated convolution branches, and yields the branch-level feature representation for that group. This operation enables each channel group to extract elongated crack features at different receptive-field scales.
After the strip convolution module, the outputs of the branches within each group are concatenated along the channel dimension to obtain the group-wise feature map, as shown in Equation (9). The feature maps of the four groups are then concatenated along the channel dimension. Finally, a 1 × 1 convolution, batch normalization (BN), and nonlinear activation (ReLU) let the sub-feature maps interact across groups and recombine them into the output feature map. The recombination operation is shown in Equation (10).
Equations (9) and (10) involve the feature maps generated from the four channel groups, the channel-wise concatenation operator, and the final output feature map obtained after 1 × 1 convolution, batch normalization, and ReLU activation.
In summary, MDFF first divides the input feature map into several channel groups and then applies an improved ASPP-style multi-branch operation to each group independently. Within each group, strip dilated convolutions with different dilation rates are used to extract multi-scale slender crack features, while the 1 × 1 convolution branch preserves local channel information. The group-wise outputs are finally concatenated to form the complete output feature map. This design improves multi-scale crack representation while reducing redundant full-channel computation and limiting the interference introduced by surrounding pavement textures.
2.4. Hierarchical Feature Integration Branches
During decoding, the traditional DeepLabv3+ model fuses only the semantically rich high-level features from the end of the encoder with the detailed low-level features from Stage 1. This two-layer fusion strategy is limited. The deepest feature map has undergone multiple downsampling operations, which causes the loss of fine crack details. Conversely, the shallowest feature map is rich in detail but lacks sufficient semantic context. The crack information contained in the equally important middle-layer features (Stages 2 and 3) is ignored. If only high-level and low-level feature maps are fused, it is difficult to capture the complex and varied features of cracks, and the faint features of small cracks are easily obscured by a complex background.
To address this issue, we introduce a feature integration branch into the backbone network. This branch integrates three key feature maps from Stages 1, 2, and 3 of the ResNet-50 feature extractor, which have 256, 512, and 1024 channels, respectively. First, 1 × 1 convolutions adjust the number of channels in the three feature maps to 128, 256, and 512. Subsequently, the dimension aware selective integration (DASI) module, designed by Shibiao Xu et al. [21], performs adaptive feature fusion and produces the final output of this branch. The hierarchical feature integration branch is shown in Figure 4.
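The channel adjustment performed by the 1 × 1 convolutions can be sketched as a per-pixel linear map over channels. The channel counts below come from the text; the spatial sizes (16, 8, 4), the random weights standing in for learned ones, and the helper name `conv1x1` are assumptions made for illustration.

```python
import numpy as np


def conv1x1(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """1 x 1 convolution without bias.

    x: (C_in, H, W), weight: (C_out, C_in) -> (C_out, H, W).
    Every pixel's channel vector is multiplied by the same weight matrix.
    """
    return np.einsum("oc,chw->ohw", weight, x)


rng = np.random.default_rng(0)
# stage name -> (input channels, output channels, hypothetical spatial size)
plan = {"stage1": (256, 128, 16), "stage2": (512, 256, 8), "stage3": (1024, 512, 4)}

reduced = {}
for name, (c_in, c_out, size) in plan.items():
    feat = rng.standard_normal((c_in, size, size))
    weight = rng.standard_normal((c_out, c_in)) / np.sqrt(c_in)
    reduced[name] = conv1x1(feat, weight)
    print(name, feat.shape, "->", reduced[name].shape)
```

Halving the channel count of each stage in this way reduces the cost of the subsequent fusion while leaving the spatial layout of each feature map untouched.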
The dimension aware selective integration (DASI) module is a channel-partitioning selection mechanism. It adaptively sets the weight of each layer's feature map according to the scale and shape of cracks of different sizes and selectively fuses these feature maps, as shown in Figure 5.
Intuitively, the DASI module is used to decide how much information should be taken from shallow detail features and deep semantic features under the guidance of intermediate-layer features. For crack segmentation, shallow features contain edge and texture details, deep features contain semantic discrimination, and intermediate features provide scale-related structural information. Instead of directly concatenating these features, DASI adaptively assigns their contributions, so that small cracks can retain local details while still benefiting from high-level semantic context.
The dimension aware selective integration module comprises two main steps. First, convolution is applied to the high-dimensional feature map, and bilinear interpolation upsampling is performed on the low-dimensional feature map so that both are aligned with the intermediate-layer feature map. Second, the three feature maps are each divided into four equal parts along the channel dimension, where i indexes the i-th part. For each part, a weight is computed from the middle-layer feature via the sigmoid activation function. The weights of the high-dimensional and low-dimensional features are then dynamically allocated according to this weight, and the weighted features are added together to obtain the feature integration result for that part. The calculation process is shown in Equation (11).
Depending on the value of this weight, the integrated features are biased either toward local details or toward global contextual features.
Then, the integration results of all parts are concatenated into the output feature map of the branch, and the calculation process is shown in Equation (12).
The output feature map integrates features from the shallow, middle, and deep layers of the encoder. It not only preserves detailed information, but also possesses strong semantic expressive power. Through the adaptive weight allocation mechanism of the DASI module, the network can dynamically adjust the contribution of each layer’s feature map according to the scale and morphology of the different cracks, thereby effectively mitigating the missed detection phenomenon caused by improper feature fusion.
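The selective integration step can be sketched as a gated blend. This sketch assumes that the three inputs are already aligned to the same (C, H, W) size, that the gate is the sigmoid of the middle-layer part, and that the gate weights the detail-oriented features while its complement weights the context-oriented features. The argument names `detail`, `mid`, and `context` and the function name `dasi_fuse` are hypothetical, and any convolution or normalization applied after concatenation is omitted.

```python
import numpy as np


def sigmoid(a: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-a))


def dasi_fuse(detail: np.ndarray, mid: np.ndarray, context: np.ndarray,
              n_parts: int = 4) -> np.ndarray:
    """Gated fusion over channel parts; all inputs have shape (C, H, W)."""
    parts = []
    for d, m, c in zip(np.split(detail, n_parts),
                       np.split(mid, n_parts),
                       np.split(context, n_parts)):
        gate = sigmoid(m)                        # weight from the middle layer
        parts.append(gate * d + (1.0 - gate) * c)
    return np.concatenate(parts, axis=0)         # back to (C, H, W)


rng = np.random.default_rng(0)
detail = rng.standard_normal((8, 4, 4))
mid = rng.standard_normal((8, 4, 4))
context = rng.standard_normal((8, 4, 4))

fused = dasi_fuse(detail, mid, context)
print(fused.shape)  # (8, 4, 4)
```

Since the gate lies in (0, 1), every fused value is a convex combination of the two sources, so the middle-layer features decide pixel by pixel how much detail or context is kept.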
In summary, DASI acts as an adaptive cross-layer feature selector. It avoids the limitation of directly fusing only shallow and deep features in the original DeepLabv3+ decoder. By introducing intermediate-layer guidance, DASI helps the network preserve fine crack boundaries and enhance weak crack responses in complex pavement textures.