Article

ETAFHrNet: A Transformer-Based Multi-Scale Network for Asymmetric Pavement Crack Segmentation

College of Geodesy and Geomatics, Shandong University of Science and Technology, Qingdao 266590, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 6183; https://doi.org/10.3390/app15116183
Submission received: 25 April 2025 / Revised: 23 May 2025 / Accepted: 26 May 2025 / Published: 30 May 2025
(This article belongs to the Special Issue Object Detection and Image Classification)

Abstract

Accurate segmentation of pavement cracks from high-resolution remote sensing imagery plays a crucial role in automated road condition assessment and infrastructure maintenance. However, crack structures often exhibit asymmetry, irregular morphology, and multi-scale variations, posing significant challenges to conventional CNN-based methods in real-world environments. In this work, we present ETAFHrNet, a novel attention-guided segmentation network designed to address the limitations of traditional architectures in detecting fine-grained and asymmetric patterns. ETAFHrNet focuses on two predominant pavement-distress morphologies—linear cracks (transverse and longitudinal) and alligator cracks—and has been empirically validated on their intersections and branching patterns over both asphalt and concrete road surfaces. The network integrates Transformer-based global attention with multi-scale hybrid feature fusion, enhancing both contextual perception and detail sensitivity, and introduces two key modules: the Efficient Hybrid Attention Transformer (EHAT), which captures long-range dependencies, and the Cross-Scale Hybrid Attention Module (CSHAM), which adaptively fuses features across spatial resolutions. To support model training and benchmarking, we also propose QD-Crack, a high-resolution, pixel-level annotated dataset collected from real-world road inspection scenarios. Experimental results show that ETAFHrNet significantly outperforms existing methods—including U-Net, DeepLabv3+, and HRNet—in both segmentation accuracy and generalization ability. These findings demonstrate the effectiveness of interpretable, multi-scale attention architectures in complex object detection and image classification tasks, making our approach relevant to broader applications such as autonomous driving, remote sensing, and smart infrastructure systems.

1. Introduction

Pavement cracks are critical indicators of road infrastructure integrity, and their early and accurate detection plays a vital role in supporting preventive maintenance, extending service life, and ensuring traffic safety. According to global statistics, surface cracks contribute to over 30% of road-related traffic accidents annually [1]. If left unrepaired, they allow moisture penetration, accelerating substructure deterioration and significantly increasing maintenance costs. Studies have shown that untreated cracks can raise annual road maintenance expenditures by approximately 15%.
Conventional manual inspections suffer from low efficiency and high subjectivity. Reported detection rates fall below 80%, with false detection rates exceeding 30% [2]. While experienced inspectors can recognize visible damage, manual approaches are difficult to scale and insufficiently accurate for large road networks. In contrast, automated vision-based systems have reduced false positive rates to under 5% [3], offering a promising direction for smart pavement monitoring.
Traditional methods based on threshold segmentation [4] or edge detection [5] perform poorly under complex lighting and noise conditions. The emergence of deep learning has led to substantial progress in crack segmentation. Encoder-decoder networks such as U-Net [6] enable end-to-end detection. DeepCrack [7], for instance, achieves high IoU through multi-scale fusion but struggles to retain fine-grained structural details.
To improve spatial resolution, HRNet [8] was introduced, maintaining high-resolution representations via parallel branches. Yang et al. [9] and Fan et al. [10] adopted multi-resolution and adaptive thresholding strategies, yet their models still underperform in detecting fine or net-like cracks.
In recent years, researchers have increasingly adopted attention mechanisms and Transformer-based architectures to improve global perception and feature representation in crack segmentation tasks. Chen et al. [11] and Wang et al. [12] introduced channel and non-local spatial attention, significantly enhancing discriminability and context awareness. However, these methods typically incur high computational overhead.
Further developments such as SENet [13], CBAM [14], and Pyramid Attention Networks [15] improved adaptability to crack morphology but still lack flexible weighting mechanisms and robust generalization under noisy backgrounds.
In the Transformer domain, ViT [16] enables long-range modeling but demands extensive computation. Swin Transformer [17] reduces complexity via window partitioning but compromises spatial continuity. SegFormer [18] merges CNN and Transformer strengths, achieving balanced performance, but fixed attention fusion often weakens fine-detail segmentation [19,20].
Recent improvements by Zheng et al. [21], Ding et al. [22], and Huang et al. [23] demonstrate progress in contextual interaction and structural awareness. However, many models still struggle to address high-resolution, multi-scale, and complex crack geometries encountered in practical deployments.
Two major challenges remain unresolved:
(1) Accurate identification of intersecting cracks. In real scenarios, cracks often branch or intersect. Without sufficient receptive field or contextual awareness, models tend to miss or misclassify these areas [24].
(2) Continuous modeling of long-range cracks. Cracks are typically thin and extended. In the absence of strong global context modeling, segmentation results become fragmented, particularly under high-resolution or multi-scale settings [7].
To address these issues, we propose a novel segmentation framework—ETAFHrNet (Efficient Transformer-Enhanced and Adaptive Fusion Attention Network)—which integrates convolutional and Transformer paradigms to balance accuracy and efficiency.
(1) Global-local collaborative feature modeling: We introduce an Efficient Hybrid Attention Transformer (EHAT) module into HRNet’s high-resolution branches, combining axial positional encoding and window attention to capture long-range dependencies while controlling computation [25,26].
(2) Adaptive multi-scale fusion: A novel Cross-Scale Hybrid Attention Module (CSHAM) adaptively weights spatial and directional features through cascaded axial and cross-scale attention, enhancing the detection of intersecting or subtle crack patterns [27,28].
The overall workflow of the proposed method is illustrated in Figure 1, which outlines the entire pipeline from data acquisition to output segmentation. This structured design ensures reproducibility and operational scalability in pavement crack detection tasks.

2. Related Work

In recent years, deep learning has demonstrated substantial potential in the domain of image segmentation, particularly within structural health monitoring applications. Among emerging trends, convolutional neural networks (CNNs) enhanced by attention mechanisms and Transformer-based architectures have attracted increasing research attention. Against this backdrop, this section presents a structured review of recent advancements in pavement crack detection, organized from three key perspectives: (1) the evolution of classical segmentation models; (2) the development and refinement of attention mechanisms; (3) recent breakthroughs in global context modeling. In addition, we critically assess the applicability and limitations of these methods in addressing the unique challenges posed by crack detection, including scale variation, spatial discontinuity, and background complexity.

2.1. Segmentation Model Evolution

Regarding the evolution of classical segmentation models, Fully Convolutional Networks (FCNs) [29] were the first to introduce end-to-end, pixel-level prediction frameworks, thereby laying the groundwork for modern semantic segmentation. U-Net [30] advanced this concept by proposing an encoder-decoder architecture with skip connections, achieving notable success in biomedical image segmentation and inspiring subsequent model designs. Building on these foundations, multi-scale feature fusion strategies have been widely adopted to further enhance segmentation performance. The DeepLab series [11] employed atrous (dilated) convolutions to expand the receptive field without compromising spatial resolution. PSPNet [31] introduced a Pyramid Pooling Module to effectively capture multi-scale contextual information. More recently, HRNet improved segmentation accuracy by maintaining high-resolution representations through parallel multi-branch architectures, enabling the precise localization of fine structural details.
Zhang et al. [32] applied HRNet to pavement crack detection and demonstrated the advantages of multi-resolution feature fusion for identifying elongated cracks under complex background conditions. However, conventional CNN-based models often rely on local convolutional operations, which struggle to simultaneously achieve global semantic understanding and fine-grained representation, particularly when cracks are morphologically diverse, sparsely distributed, or embedded in noisy surfaces [10]. For example, although U-Net preserves low-level features through skip connections, its fixed receptive field restricts its ability to capture the global topological continuity of cracks across varying scales. DeepLabv3+ extends contextual awareness through atrous convolutions, yet remains vulnerable to false positives caused by background surface noise and exhibits inadequate continuity modeling for slender, elongated cracks. PSPNet incorporates pyramid pooling for multi-scale context aggregation, but its coarse-grained feature integration tends to overlook small or subtle crack patterns. Even though HRNet excels at maintaining high-resolution features via parallel multi-branch structures, its reliance on traditional convolutions limits its capacity to model long-range dependencies in complex scenes.
Recent work by Yin et al. [33] introduced DCRNet, a dual-context residual network that jointly models local detail and global structure using parallel pathways. This dual-path design aligns closely with our use of the EHAT and CSHAM modules, which aim to enhance crack connectivity and multiscale representation. DCRNet has shown strong performance in capturing complex crack morphologies and thus provides a meaningful comparative reference for dual-context segmentation architectures.
These observations underscore two persistent challenges in CNN-based crack detection: (1) insufficient global perception to capture long-range crack structures, and (2) limited local feature representation for accurately identifying fine, fragmented, or intersecting cracks.

2.2. Attention Mechanisms

The introduction of attention mechanisms has provided new opportunities for addressing the challenges of feature selection and fusion in pavement crack detection. Early methods such as SENet [13] utilized global average pooling to capture inter-channel dependencies and dynamically reweight channel responses. However, due to the absence of spatial interaction, SENet remains insufficient for detecting elongated or spatially distributed crack structures. CBAM [14], which incorporates both channel and spatial attention, improves segmentation performance by focusing on locally salient features. Yet, it still lacks the capability to model long-range spatial dependencies, which are critical for capturing the continuity of dispersed crack segments.
To enhance global context modeling, DANet [20] introduced a dual-path attention structure, while non-local modules [34] employed self-attention mechanisms to establish pixel-level global correlations. Despite these advancements, many existing approaches rely on fixed-weight fusion strategies. For instance, in the work by Li et al. [35], CBAM and the non-local module are combined in series, yet the static integration scheme fails to adapt to the morphological diversity of cracks. In contrast, Guo et al. [36] proposed a dynamic convolutional attention network that employs learnable weights to adaptively assign attention across features, offering a promising approach for handling multi-scale and complex crack patterns.
Li et al. [37] proposed CrackCLF, a closed-loop feedback-based segmentation network that iteratively refines predictions by incorporating previous outputs as inputs. This dynamic correction mechanism complements our adaptive attention design and represents a promising direction for improving segmentation stability in noisy environments.
The design of attention mechanisms is especially critical in pavement crack detection, where the structures of interest are typically slender, elongated, and oriented in diverse directions. Detection algorithms must therefore balance the preservation of local detail with the need for global continuity [38]. While traditional attention modules such as CBAM are effective in enhancing local contrast, they often fail to establish relationships between spatially separated crack fragments. Recent studies suggest that directional and topological cues are key to improving detection accuracy. In particular, axial attention has been shown to enhance the recognition of horizontal and vertical crack components by independently modeling one-dimensional spatial dependencies [39].
Furthermore, Chen et al. [40] proposed an edge-aware attention network that explicitly enhances boundary preservation and continuity through guided refinement. This is particularly relevant to the directional and topological modeling objectives of our EHAT and CSHAM modules.
Nonetheless, many attention mechanisms continue to employ fixed-weight configurations, which limits their adaptability across diverse scenes. In environments where reticular and linear cracks coexist, this rigidity results in suboptimal segmentation performance [22]. Therefore, the development of dynamically adaptive attention mechanisms capable of modeling multi-scale, morphologically diverse crack structures remains an open and significant research challenge.

2.3. Transformer Architectures

In recent years, the remarkable performance of the Vision Transformer (ViT) in computer vision has spurred widespread exploration of global context modeling methods. ViT achieves holistic semantic representation by segmenting images into fixed-size patches and applying a multi-head self-attention mechanism to process them. However, its substantial computational cost and reliance on large-scale datasets limit its practicality in real-world deployment scenarios.
To balance accuracy and efficiency, a variety of Transformer-CNN hybrid architectures have been proposed. For instance, TransUNet [41] embeds local features extracted by convolutional layers into a Transformer encoder while employing skip connections to preserve spatial detail. CMT [42] introduces a dual-branch structure that facilitates dynamic interactions between local and global representations. Similarly, Mobile-Former [43] adopts a lightweight architecture to reduce computational overhead for mobile and embedded scenarios.
While these hybrid approaches have demonstrated success in general semantic segmentation tasks, their applicability to pavement crack detection remains limited. For example, the window-based partitioning strategy in the Swin Transformer [17] can disrupt the continuity of linear crack patterns, impairing segmentation accuracy. Likewise, the dual-branch structure in Mobile-Former incurs significant memory consumption when applied to high-resolution inputs [44]. Moreover, many hybrid models are based on the U-Net framework, which may not align well with the architectural design of HRNet, particularly in terms of maintaining high-resolution feature representations throughout the network [8].
As a representative Transformer-CNN fusion method, SegFormer has achieved a mean Intersection over Union (mIoU) of 79.5% in general segmentation tasks, attributed to its robust global context modeling capabilities [18]. However, its window partitioning mechanism may introduce discontinuities in crack representation, particularly under ultra-high-resolution inputs. Additionally, its ability to support real-time detection remains limited. In response, several lightweight Transformer modules have been proposed to reduce computational burden through local window attention and dimensionality reduction strategies.
Despite these improvements, a fundamental challenge persists: how to effectively represent directional crack features while maintaining global perceptual awareness [45]. Recent advances in axial positional encoding and directional feature enhancement mechanisms offer promising solutions, particularly for capturing the elongated and linear nature of pavement cracks [36]. Future research should continue to explore adaptive attention mechanisms and high-resolution feature preservation strategies. In particular, integrating the local sensitivity of CNNs with the global dependency modeling strengths of Transformers represents a promising direction for achieving both fine-grained precision and real-time performance in practical pavement crack detection systems.

3. Methods

3.1. ETAFHrNet Architecture

This paper presents a novel network architecture, termed the Efficient Transformer-Enhanced and Adaptive Fusion High-Resolution Network (ETAFHrNet). The overall architecture is illustrated in Figure 2. Building upon the HRNet framework, the proposed model preserves HRNet’s strength in maintaining multi-resolution feature representations, while integrating two key innovations: the Efficient Hybrid Attention Transformer (EHAT) and the Cross-Scale Hybrid Attention Module (CSHAM). These components are specifically designed to enhance the network’s capacity for crack representation by improving global context modeling and multi-scale feature fusion.
The architecture comprises three primary components. First, the HRNet backbone extracts multi-resolution features through parallel branches, maintaining high-resolution feature flow while generating rich semantic representations. Second, the EHAT module is embedded into the high-resolution branch to perform lightweight long-range dependency modeling. This is achieved through a combination of adaptive channel dimensionality reduction, axial positional encoding, and local window attention, mechanisms that are particularly effective in enhancing the perception of linear and directional crack structures. Third, following feature alignment via cross-resolution upsampling, the CSHAM module conducts cross-scale adaptive fusion and directional enhancement of multi-level features. This ensures that the segmentation head receives a comprehensive, high-resolution feature representation, enabling the generation of accurate prediction maps aligned with the input resolution.
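As a point of reference for the description above, the following minimal PyTorch sketch shows how the three components might compose; the backbone, EHAT, and CSHAM interfaces, the fused channel width, and the class count are illustrative assumptions rather than the authors' reference implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class ETAFHrNetSketch(nn.Module):
    """Illustrative composition of the three components described above.
    The backbone/EHAT/CSHAM interfaces, channel width, and class count
    are assumptions, not the authors' reference implementation."""
    def __init__(self, backbone, ehat, csham, fused_channels=64, num_classes=2):
        super().__init__()
        self.backbone = backbone   # assumed to return a list of multi-resolution maps
        self.ehat = ehat           # refines the highest-resolution branch
        self.csham = csham         # adaptive cross-scale fusion of aligned branches
        self.head = nn.Conv2d(fused_channels, num_classes, kernel_size=1)

    def forward(self, x):
        feats = list(self.backbone(x))             # e.g., branches at 1/4 ... 1/32 scale
        feats[0] = self.ehat(feats[0])             # long-range modeling on the high-res branch
        size = feats[0].shape[-2:]
        feats = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
                 for f in feats]                   # cross-resolution upsampling
        logits = self.head(self.csham(feats))      # fused features -> segmentation head
        return F.interpolate(logits, size=x.shape[-2:], mode="bilinear",
                             align_corners=False)  # prediction at input resolution
```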
To overcome the limitations of conventional HRNet in pavement crack detection—specifically, its limited global semantic modeling and rigid feature fusion—this study introduces two architectural innovations. First, the EHAT module improves linear feature representation by integrating axial positional encoding and local window attention. Its hybrid MLP structure combines the local inductive bias of convolutional operations with the global modeling capacity of Transformers, making it highly effective in capturing crack features across multiple orientations and scales, particularly in complex or subtle crack scenarios. Second, the CSHAM module applies a cross-scale attention mechanism to adaptively weight multi-resolution features and leverages axial attention to enhance directional feature expression. This design alleviates the information loss often caused by static fusion in conventional HRNet and performs robustly in scenes where slender, intersecting, and multi-scale cracks coexist. The following subsections provide a detailed explanation of the proposed EHAT and CSHAM modules.

3.2. Efficient Hybrid Attention Transformer (EHAT) Module

This module takes an input feature map of shape $B \times C \times H \times W$, where $B$ denotes the batch size, $C$ the number of channels, and $H$ and $W$ the height and width, respectively. To preserve the linear structural features of cracks while reducing computational complexity, the Efficient Hybrid Attention Transformer (EHAT) employs a series of optimization strategies, including adaptive channel reduction [46], axial enhancement, and local window attention [47]. An overview of the EHAT module’s architecture is presented in Figure 3.
A $1 \times 1$ convolution reduces the input channels from $C$ to $C' = (C_r \cdot c) / r$, where $r$ is the reduction ratio, $C_r$ the base number of reduced channels, and $c$ a learnable channel bias. Spatial downsampling via bilinear interpolation with ratio $s$ yields dimensions $H' = H / s$ and $W' = W / s$, thereby compressing both spatial and channel dimensions to alleviate the computational burden. Axial positional encoding is subsequently applied to independently enhance features along the horizontal and vertical axes, introducing directional priors well suited to representing elongated crack structures.
$$F_{\text{axial}} = \text{Concat}(F_h + P_h,\; F_v + P_v)$$
The downsampled feature map is partitioned into two components, $F_h$ and $F_v$, corresponding to horizontal and vertical orientations. Learnable positional parameters $P_h$ and $P_v$ are added to each, after which the results are concatenated via $\text{Concat}(\cdot)$. This axial positional encoding introduces explicit directional cues, enhancing the model’s capacity to identify and preserve linear crack features. To balance computational efficiency with structural sensitivity, EHAT employs local attention within each axis, enabling efficient modeling of elongated patterns without incurring the overhead of full self-attention.
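A minimal PyTorch sketch of this axial positional encoding follows (the window attention used within each axis is formalized next); the channel-wise split into horizontal and vertical groups and the broadcast shapes of $P_h$ and $P_v$ are our assumptions.

```python
import torch
import torch.nn as nn

class AxialPositionalEncoding(nn.Module):
    """Adds learnable 1-D positional parameters along each axis, then
    concatenates, as in the equation above. The channel-wise split into
    horizontal/vertical groups (channels assumed even) is our assumption."""
    def __init__(self, channels, height, width):
        super().__init__()
        half = channels // 2
        self.p_h = nn.Parameter(torch.zeros(1, half, 1, width))   # varies along x
        self.p_v = nn.Parameter(torch.zeros(1, half, height, 1))  # varies along y

    def forward(self, f):                        # f: (B, C, H', W') after downsampling
        f_h, f_v = torch.chunk(f, 2, dim=1)      # horizontal / vertical components
        return torch.cat([f_h + self.p_h, f_v + self.p_v], dim=1)
```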
$$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
Here, $Q$, $K$, and $V$ denote the query, key, and value matrices, respectively, and $d_k$ represents the dimensionality of the query vectors. The local window attention mechanism divides the feature map into non-overlapping regions, where self-attention is calculated independently within each window. This formulation preserves local structural integrity while significantly reducing the computational cost associated with full self-attention. To further enhance directional sensitivity, EHAT integrates an axial feature enhancement module, refining the representation of elongated crack patterns across horizontal and vertical orientations.
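The windowed computation can be sketched as follows, assuming square non-overlapping windows; the query, key, and value projections are omitted for brevity (Q = K = V), and the window size is an assumed hyperparameter.

```python
import torch

def window_attention(x, window=8):
    """Scaled dot-product attention computed independently inside
    non-overlapping windows. Projections for Q, K, V are omitted
    (Q = K = V = x); H and W must be divisible by `window`."""
    b, c, h, w = x.shape
    # partition (B, C, H, W) into (B*nWin, window*window, C) token sequences
    t = x.view(b, c, h // window, window, w // window, window)
    t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, window * window, c)
    attn = torch.softmax(t @ t.transpose(1, 2) / c ** 0.5, dim=-1)  # QK^T / sqrt(d_k)
    out = attn @ t
    # reverse the partitioning back to (B, C, H, W)
    out = out.view(b, h // window, w // window, window, window, c)
    return out.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
```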
$$F_{\text{enhanced}} = F_{\text{axial}} + \text{Conv}_{\text{axial}}(F_{\text{axial}})$$
Here, $\text{Conv}_{\text{axial}}(\cdot)$ denotes a convolution operation applied along the axial direction, designed to further refine crack-related feature representations. Departing from the standard Transformer paradigm, EHAT adopts a hybrid MLP structure [48], integrating convolutional layers with multilayer perceptrons [49]. This design leverages the local inductive bias of convolutions alongside the global modeling capacity of MLPs, enhancing the network’s ability to capture both fine-grained details and long-range dependencies.
$$F_{\text{hybrid}} = \text{MLP}(F_{\text{enhanced}}) + \text{Conv}(F_{\text{enhanced}})$$
$\text{MLP}(\cdot)$ denotes a multilayer perceptron (feedforward network), and $\text{Conv}(\cdot)$ denotes a conventional convolution operation. By summing the outputs of both operations, the model effectively fuses global context with fine-grained spatial features.
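A minimal sketch of this hybrid block, assuming the MLP is realized with 1 × 1 convolutions and the convolutional path uses a 3 × 3 kernel (the paper does not specify these kernel sizes):

```python
import torch.nn as nn

class HybridMLP(nn.Module):
    """Sums a channel-wise MLP path with a standard convolution path, as in
    the equation above. Realizing the MLP with 1x1 convolutions and using a
    3x3 kernel on the convolutional path are assumptions."""
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.mlp = nn.Sequential(                # global/channel-mixing path
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f_enhanced):
        return self.mlp(f_enhanced) + self.conv(f_enhanced)
```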
The axial feature enhancement module reinforces the linear characteristics of cracks along both horizontal and vertical directions. Meanwhile, the hybrid MLP architecture combines the inductive biases of convolution with the expressive capacity of Transformers. To complete the EHAT module’s processing pipeline, feature maps are first upsampled to their original resolution, followed by a 1 × 1 convolution to align channel dimensions.
The EHAT module incorporates several technical innovations within its overall architecture: it reduces computational complexity through adaptive channel reduction and local window attention; enhances directional feature perception via axial positional encoding and feature enhancement mechanisms; and integrates the local inductive bias of convolution with the global modeling capacity of Transformers through a hybrid MLP structure. These design innovations enable the EHAT module to maintain a lightweight structure while substantially enhancing the network’s capacity to detect cracks across diverse orientations and scales. It performs particularly well in complex backgrounds and subtle crack scenarios, providing a robust feature representation foundation for high-precision pavement crack segmentation.

3.3. Cross-Scale Hybrid Attention Module (CSHAM)

In the original HRNet architecture, the multi-scale feature fusion stage performs feature alignment across different resolution branches via upsampling, followed by direct fusion through summation or concatenation. This rigid fusion strategy lacks both adaptive weighting across scales and directional feature enhancement, which limits its effectiveness in pavement crack segmentation, particularly for elongated structures and scenes involving the coexistence of multi-scale cracks. In such scenarios, critical morphological information is often lost due to the uniform treatment of features with varying semantic granularity.
To overcome these limitations, we introduce the Cross-Scale Hybrid Attention Module (CSHAM) into the multi-scale fusion stage of HRNet. Specifically, after all feature maps are upsampled to a unified spatial resolution, CSHAM is inserted in place of naive fusion (as illustrated in Figure 4), enabling both cross-scale adaptive fusion [50] and axial attention enhancement [39]. This design ensures that the segmentation head receives a comprehensive, structurally-aware representation that integrates multi-scale contextual information while preserving directional cues critical for detecting complex crack morphologies.
The Cross-Scale Hybrid Attention Module (CSHAM) adopts a hierarchical connection structure designed to perform adaptive scale-wise feature weighting alongside directional feature enhancement. The process begins with a cross-scale attention mechanism, which evaluates the relative importance of each scale-specific feature by aggregating global contextual information. The computation is formally expressed as:
$$G = \frac{1}{N} \sum_{i=1}^{N} F_i$$
Let $F_i$ denote the feature map at scale $i$, where $i = 1, 2, \ldots, N$ and $N$ is the total number of scales. To compute the importance of each scale, a channel dimension reduction operation is first applied, followed by a weight prediction module that assigns learnable importance scores. These weights are then used to reweight the corresponding feature maps, enabling the network to dynamically emphasize informative scales while suppressing less relevant representations.
$$W = \text{Softmax}(W_2\, \delta(W_1 G))$$
$$F_{\text{fused}} = \sum_{i=1}^{N} W_i \cdot F_i$$
Here, $W_1$ and $W_2$ are learnable projection matrices used for dimensionality reduction and expansion, respectively, and $\delta$ represents the ReLU activation function. Once the attention weights are computed and scale-wise feature reweighting is performed, a fused feature map $F_{\text{fused}}$ is obtained. To further enhance directional awareness, directional attention is applied by employing one-dimensional convolutions to generate attention maps along the horizontal and vertical axes. These attention maps are then element-wise multiplied with $F_{\text{fused}}$ to selectively amplify features aligned with directional crack patterns. This process is formally defined as:
$$F_{\text{axial}} = \sigma(\text{Conv}_h(F_{\text{fused}})) \odot \sigma(\text{Conv}_v(F_{\text{fused}}))$$
In this context, $\text{Conv}_h$ and $\text{Conv}_v$ refer to one-dimensional convolution operations performed along the horizontal and vertical axes, respectively. The function $\sigma$ represents the Sigmoid activation, and $\odot$ denotes element-wise multiplication. This design is particularly effective in enhancing the model’s sensitivity to linear crack structures, regardless of orientation.
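A compact sketch combining the cross-scale weighting and directional attention described above; the reduction ratio, the 1-D kernel size, and the final multiplication of the axial attention map with $F_{\text{fused}}$ are assumptions consistent with this description.

```python
import torch
import torch.nn as nn

class CSHAMSketch(nn.Module):
    """Cross-scale weighting followed by directional attention, mirroring
    the equations above. The reduction ratio, 1-D kernel size, and the
    final application of the axial map to F_fused are assumptions."""
    def __init__(self, channels, num_scales, reduction=4, k=7):
        super().__init__()
        self.w1 = nn.Linear(channels, channels // reduction)    # reduction (W1)
        self.w2 = nn.Linear(channels // reduction, num_scales)  # weight prediction (W2)
        self.conv_h = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))
        self.conv_v = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))

    def forward(self, feats):                    # list of N tensors, each (B, C, H, W)
        g = torch.stack(feats).mean(dim=0)       # cross-scale average G
        g = g.mean(dim=(2, 3))                   # global context vector (B, C)
        w = torch.softmax(self.w2(torch.relu(self.w1(g))), dim=-1)  # scale weights
        fused = sum(w[:, i, None, None, None] * f for i, f in enumerate(feats))
        axial = torch.sigmoid(self.conv_h(fused)) * torch.sigmoid(self.conv_v(fused))
        return fused * axial                     # directionally reweighted output
```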
During backpropagation, the gradients of the cross-scale attention weights can be computed as:
$$\frac{\partial L}{\partial W_i} = \frac{\partial L}{\partial F_{\text{out}}} \cdot \frac{\partial F_{\text{out}}}{\partial F_{\text{fused}}} \cdot F_i$$
Let $L$ denote the loss function, which quantifies the discrepancy between model predictions and ground truth labels. The parameter $W_i$ represents the learnable attention weight for scale $i$, while $F_i$ denotes the corresponding input feature map. Through gradient backpropagation, the model dynamically adjusts $W_i$ to optimize the importance of each scale relative to the segmentation objective. The fused feature map $F_{\text{fused}}$ encapsulates aggregated multi-scale representations, and the final output produced by the CSHAM module is represented as $F_{\text{out}}$. This formulation enables the network to selectively enhance both scale-sensitive and directionally discriminative features, thereby improving segmentation accuracy and promoting continuity in predicted crack structures.
The CSHAM module serves as a key component in facilitating multi-scale feature fusion within the HRNet architecture. Beyond optimizing the adaptive integration of features across resolutions, it significantly strengthens directional feature representation. By capturing diverse crack morphologies and preserving fine structural details, CSHAM contributes directly to improved segmentation performance and more precise edge localization across a wide range of road surface conditions.

4. Experimental Details

4.1. Dataset Preparation

As a pixel-level classification task, the performance of image segmentation models is highly dependent on the quality and diversity of the training dataset. However, existing publicly available road crack detection datasets often lack finely annotated templates capable of capturing the wide morphological variability of cracks across diverse real-world conditions. To address this gap, we constructed a dedicated multi-scene segmentation dataset focused on road surface cracks, referred to as the QD-Crack dataset. This dataset is based on high-resolution road surface imagery collected by professional pavement inspection vehicles operating on expressways in Shandong Province, China, since May 2023. All data collection activities were conducted with the authorization of relevant municipal authorities. To ensure data privacy and regulatory compliance, all original images were preprocessed to remove identifiable elements, such as license plates and prominent landmarks. The base dataset comprises 500 high-resolution images, captured under a wide range of environmental conditions, including varying lighting, pavement materials, and crack types. These images reflect diverse forms of pavement distress and are stored in JPG format. Annotation was performed by a team of experienced road maintenance engineers—each with over three years of professional experience—using the Labelme tool for detailed, vector-based labeling of crack morphology. To ensure annotation quality and consistency, all labels were cross-verified by two independent inspection engineers. Discrepancies were resolved through panel-based expert review. The overall dataset construction workflow is illustrated in Figure 5. The QD-Crack dataset was collected by the authors from municipal roads in Qingdao, China. While it is not currently publicly available, it can be accessed upon reasonable request to the corresponding author for research purposes.
Given the high cost and labor intensity of pixel-level annotation, we adopted a semi-automatic labeling strategy inspired by the approach of Jia et al. [51]. This method focuses on annotating the primary crack structures rather than the intact road surface, thereby improving labeling efficiency while preserving semantic relevance. Initial annotations were generated with the assistance of edge detection algorithms, which provided a contour-based approximation of crack boundaries. The final annotated dataset comprises a JSON file containing approximately 13.5 million labeled points, subsequently converted into PNG-format semantic segmentation masks using a custom-developed Python 3.10 script (a sketch of this conversion is shown below). To examine the influence of annotation granularity on model performance, we designed two distinct labeling schemes: (1) a binary scheme with two categories, background and crack; (2) a three-class scheme comprising background, linear cracks, and alligator (reticular) cracks. Representative examples from both labeling schemes are shown in Figure 6.
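The conversion script itself is not published; the following is a minimal sketch of how Labelme polygon annotations are typically rasterized into PNG masks, with a hypothetical label-to-index mapping for the binary scheme.

```python
import json
from pathlib import Path
from PIL import Image, ImageDraw

# Hypothetical label-to-index mapping for the two-class scheme;
# the three-class scheme would map linear and alligator cracks separately.
LABEL_TO_ID = {"crack": 1}  # background pixels stay 0

def labelme_json_to_mask(json_path, out_path):
    """Rasterize Labelme polygon annotations into a PNG segmentation mask."""
    data = json.loads(Path(json_path).read_text())
    mask = Image.new("L", (data["imageWidth"], data["imageHeight"]), 0)
    draw = ImageDraw.Draw(mask)
    for shape in data["shapes"]:
        class_id = LABEL_TO_ID.get(shape["label"], 0)
        polygon = [tuple(pt) for pt in shape["points"]]
        draw.polygon(polygon, fill=class_id)
    mask.save(out_path)
```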
To enhance dataset utility and improve the robustness and generalization capability of the trained models, we applied a set of essential image preprocessing procedures, including image enhancement and geometric correction. A comprehensive data augmentation pipeline was implemented, incorporating random angle rotation, brightness adjustment, contrast enhancement, sharpness optimization, and horizontal flipping. As a result, the dataset was expanded from 500 to a total of 2500 samples. These augmentation techniques not only increase data diversity but also emulate complex real-world engineering conditions. For instance, random rotation (within ±15°) allows the model to recognize cracks from multiple viewing angles; brightness and contrast adjustments enhance texture visibility under variable lighting; and sharpness optimization amplifies edge contrast, thereby improving the distinction between crack regions and the background. An illustration of these effects is provided in Figure 7. Additionally, horizontal flipping augments data volume while mitigating directional bias, encouraging the model to generalize across diverse crack orientations and morphologies.
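A joint image-and-mask version of this augmentation pipeline might look as follows; apart from the stated ±15° rotation range, the parameter ranges are illustrative assumptions.

```python
import random
import torchvision.transforms.functional as TF

def augment_pair(image, mask):
    """Jointly augment an image and its mask. Geometric ops are applied to
    both; photometric ops to the image only. Apart from the stated +/-15
    degree rotation, the parameter ranges below are illustrative."""
    angle = random.uniform(-15, 15)                       # random rotation
    image, mask = TF.rotate(image, angle), TF.rotate(mask, angle)
    if random.random() < 0.5:                             # horizontal flip
        image, mask = TF.hflip(image), TF.hflip(mask)
    image = TF.adjust_brightness(image, random.uniform(0.8, 1.2))
    image = TF.adjust_contrast(image, random.uniform(0.8, 1.2))
    image = TF.adjust_sharpness(image, random.uniform(0.8, 2.0))
    return image, mask
```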

4.2. Training Parameters and Methods

To ensure the accuracy and reproducibility of the experiments, the detailed configuration of the experimental environment is summarized in Table 1.
During the experimental procedure, the dataset was divided into a training set and a test set at a fixed ratio of 8:2. Model parameters were iteratively optimized using the training set, whereas the test set was reserved for assessing generalization capability. To efficiently manage GPU memory limitations, the batch size was configured to 8, enabling effective utilization of the available computational resources. Based on prior experimental experience, the number of training epochs was uniformly set to 120, as the loss function consistently converged near this point across multiple configurations. The convergence behavior is visualized in Figure 8, which illustrates the decline and stabilization of the loss function across epochs. The model achieves a stable convergence state by approximately the 120th epoch, validating the effectiveness of the selected training schedule.
In image segmentation tasks, accurate delineation of object contours often relies on spatially consistent feature distributions. Building upon this observation, we incorporated a transfer learning strategy to enhance model learning efficiency. The core principle of transfer learning lies in reusing knowledge—specifically, pre-trained model weights—from one task to accelerate learning in a related but distinct target task, much like how humans transfer prior experience to new problems. In this work, the model parameters were initially pre-trained on a large-scale crack detection dataset, and subsequently fine-tuned on the target dataset to adapt to the specific task requirements.
The SDNET2018 dataset [52] was selected as the source for pre-training. Comprising over 56,000 annotated concrete crack images, SDNET2018 is widely recognized in the field for its utility in training, validation, and benchmarking of crack detection algorithms. Experimental results demonstrate that this transfer learning approach [53] not only accelerates convergence but also yields significant improvements in segmentation accuracy. Owing to the visual feature similarities shared across diverse real-world objects, this strategy closely aligns with human perceptual learning processes.
To fully exploit the benefits of pre-trained knowledge, we adopted a freeze-thaw training strategy [54], as opposed to random weight initialization. Given that the network backbone is responsible for extracting generalizable low-level features, its parameters were initially frozen, while the remaining layers were fine-tuned on the target data. During the mid-to-late training phases, the backbone was gradually unfrozen to allow full network optimization and better task adaptation.
The initial learning rate was set to 0.0001 and dynamically adjusted using a cosine annealing schedule [55] to improve convergence stability and efficiency. To further stabilize training, we set the momentum parameter to 0.975 and employed the Adam optimizer [56], which adaptively adjusts learning rates by incorporating both first- and second-order moment estimates of the gradients.
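A compact sketch of this training configuration is given below; interpreting the stated momentum of 0.975 as Adam's β1, the presence of a `model.backbone` attribute, and the exact thaw epoch are all assumptions.

```python
import torch

def configure_training(model, epochs=120):
    """Freeze the backbone, then set up Adam with cosine annealing.
    Interpreting the stated momentum of 0.975 as Adam's beta1 and the
    existence of a `model.backbone` attribute are assumptions."""
    for p in model.backbone.parameters():     # backbone holds generic low-level features
        p.requires_grad = False
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.975, 0.999))
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

def unfreeze_backbone(model, epoch, thaw_epoch=60):
    """Thaw the backbone in the mid-to-late phase (thaw point assumed)."""
    if epoch == thaw_epoch:
        for p in model.backbone.parameters():
            p.requires_grad = True
```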

4.3. Methods for Evaluation

Evaluations were carried out on the QD-Crack dataset as well as other publicly available crack segmentation datasets to comprehensively assess both the performance gains and the generalization ability of our model. To assess the effectiveness of the proposed ETAFHrNet model, we performed comparative analyses against multiple state-of-the-art segmentation approaches documented in existing studies. For quantitative assessment, seven evaluation metrics were adopted: Intersection over Union (IoU), mean IoU (mIoU), Precision, Recall, F1-score, Frames Per Second (FPS), and parameter count (Params). These metrics collectively capture the model’s segmentation accuracy, robustness, and inference efficiency. Specifically, IoU measures the spatial correspondence between the predicted segmentation and the ground truth, calculated as the ratio of the area of overlap to the area of union between the predicted and actual regions; a higher IoU value reflects better segmentation performance. The metric is mathematically defined as:
$$\text{IoU} = \frac{|A \cap B|}{|A \cup B|}$$
mIoU refers to the mean IoU across all classes and provides a comprehensive assessment of model performance.
Precision quantifies the proportion of true positive predictions among all samples predicted as positive, reflecting the reliability of positive classifications. It is formally expressed as:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall is the proportion of true positive samples that are correctly identified by the model, defined as:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Here, $TP$ denotes the number of correctly identified crack pixels (true positives), while $FP$ corresponds to background pixels erroneously classified as cracks (false positives). Conversely, $FN$ represents crack pixels that the model failed to detect, incorrectly labeling them as background (false negatives).
Relying solely on individual metrics such as Precision or Recall can lead to a skewed evaluation, particularly when class imbalance is present. For instance, a model may achieve high Precision yet still perform poorly overall if Recall is substantially low. To mitigate this issue, we utilize the F1-score, which computes the harmonic mean of Precision and Recall, offering a more balanced and informative measure of performance in imbalanced scenarios. The F1-score is defined as:
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
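All of the pixel-level metrics above follow directly from the confusion counts; a minimal implementation for binary masks:

```python
import numpy as np

def binary_metrics(pred, gt):
    """IoU, Precision, Recall, and F1 for binary crack masks, following the
    definitions above. `pred` and `gt` are boolean arrays of equal shape."""
    tp = np.logical_and(pred, gt).sum()        # crack pixels correctly detected
    fp = np.logical_and(pred, ~gt).sum()       # background predicted as crack
    fn = np.logical_and(~pred, gt).sum()       # missed crack pixels
    eps = 1e-9                                 # guards against empty masks
    iou = tp / (tp + fp + fn + eps)            # |A ∩ B| / |A ∪ B|
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {"IoU": iou, "Precision": precision, "Recall": recall, "F1": f1}
```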
In addition, we introduce FPS (Frames Per Second) as a measure of inference efficiency. FPS quantifies how many input images the model can process and output per second. A higher FPS reflects a more efficient network capable of faster real-time crack detection, which is particularly valuable for practical deployment in infrastructure monitoring systems.

5. Results and Discussion

5.1. Influence of Semantic Labels and Transfer Learning on Model Performance

This section investigates the impact of semantic labeling granularity and transfer learning strategies on the performance of segmentation models. We evaluated four mainstream architectures—U-Net, DeepLabv3+, HRNet, and the proposed ETAFHrNet—across datasets annotated using both two-class and three-class schemes (see Table 2). The two-class scheme included only background and crack categories, while the three-class variant further distinguished between linear cracks and alligator (reticular) cracks. The findings indicate that models trained using the two-class scheme consistently surpass those trained with the three-class approach. For example, U-Net exhibited improvements of around 3.2% in mIoU and 4.13% in F1-score when utilizing the two-class dataset. DeepLabv3+ displayed greater robustness to label granularity, with metric fluctuations remaining within a 3% margin. Notably, ETAFHrNet achieved the best results under the two-class setting, reaching an mIoU of 74.41% and an F1-score of 83.11%. This superior performance can be attributed to ETAFHrNet’s architectural design, which emphasizes long-range dependency modeling and directional feature enhancement. These mechanisms are especially effective when the task is simplified to binary segmentation, allowing the model to focus entirely on the structural continuity and contextual consistency of cracks without being distracted by inter-class ambiguity. In contrast, the three-class setting introduces greater intra-class variability and semantic overlap, which can interfere with feature representation and degrade performance. These findings suggest that the simplified two-class labeling strategy is more appropriate for practical crack detection tasks. Finer-grained class distinctions often introduce class imbalance and inter-class confusion [57,58], while offering limited added value in most real-world engineering applications.
We further evaluated the effect of transfer learning, specifically pre-training on a large-scale dataset followed by fine-tuning on a smaller, task-specific dataset, to assess its influence on model performance (see Table 3). The results demonstrate that this strategy leads to consistent and significant improvements across all evaluated models. Notably, ETAFHrNet achieved an 8.09% increase in mean Intersection over Union (mIoU), reaching a peak value of 74.4%, thereby outperforming all other models by a substantial margin. U-Net and SegFormer also benefited from transfer learning, with respective gains of 3.73% and 5.63% in mIoU, further validating the effectiveness of this approach [59]. It is worth noting that Transformer-based architectures, particularly CNN-Transformer hybrid models, typically lack inherent spatial inductive biases and often require large-scale training data to achieve optimal performance. Consequently, pre-training plays a critical role in enabling these models to generalize effectively, especially when applied to smaller, domain-specific datasets, such as those used for pavement crack segmentation.
Based on the experimental findings, it is evident that both simplified semantic labeling and transfer learning substantially enhance segmentation performance, with ETAFHrNet consistently demonstrating the strongest results across evaluation metrics. Notably, the transfer learning setup involved pre-training on a composite dataset that included publicly available sources (e.g., CRACK500, GAPs384), before fine-tuning on QD-Crack. This configuration simulates cross-domain adaptation and indirectly reflects the model’s ability to generalize beyond a localized dataset. To further investigate the critical factors influencing feature extraction and multi-scale fusion, and to conduct in-depth comparisons with alternative network architectures, we adopt the two-class labeling scheme and apply transfer learning as the default training strategy in all subsequent ablation and comparative experiments. This experimental setup is designed to ensure consistency and provide more reliable technical guidance for the deployment of segmentation models in practical pavement crack detection applications.

5.2. Ablation Experiment

To verify the synergistic contribution of the proposed CSHAM and EHAT modules to pavement crack segmentation performance, we conducted a series of systematic ablation experiments on a benchmark crack detection dataset. By progressively removing or replacing key architectural components, we quantitatively assessed the impact of each module on both segmentation accuracy and inference efficiency, measured in Frames Per Second (FPS).
As shown in Table 4 and Figure 9, the baseline model—comprising solely the original HRNet without integration of the CSHAM or EHAT modules—achieves an mIoU of 63.49%, with corresponding mPrecision and mRecall scores of 72.98% and 70.34%, respectively. The F1-score is 71.58%, and the inference speed is recorded at 14.51 FPS. These results indicate that conventional multi-scale fusion mechanisms, as employed in HRNet, are inadequate for capturing the elongated, fine-grained, and morphologically diverse structures characteristic of pavement cracks. Moreover, the visual outputs shown in Figure 10 reveal pronounced discontinuities and susceptibility to background noise in the baseline predictions, resulting in coarse segmentation contours and inconsistent structural delineation. These findings further underscore the importance of enhancing both feature fusion and directional awareness for high-precision crack segmentation.
From a resource perspective, the baseline HRNet contains 45.0 M parameters. Adding EHAT or CSHAM alone keeps the footprint below 48.5 M while lifting mIoU by 6.7 to 9.2 percentage points and increasing throughput by nearly 60%. Activating both modules brings the total to only 50.6 M parameters (+12%) yet almost doubles FPS (14.51 to 28.56) and raises mIoU by 10.9 percentage points, delivering the best accuracy–efficiency balance.
When the EHAT module is introduced independently, the model achieves an mIoU of 70.18%, with mPrecision and mRecall reaching 81.08% and 77.05%, respectively. The F1-score rises to 78.93%, and the inference speed improves to 22.83 FPS. As shown in Figure 10, the inclusion of axial positional encoding and local window attention enhances the model’s ability to capture directionally oriented crack features, resulting in more continuous and clearly delineated crack contours.
In contrast, when only the CSHAM module is incorporated, the performance improves further, with an mIoU of 72.69%, mPrecision of 79.95%, and mRecall of 80.96%, yielding an F1-score of 81.45%. The inference speed remains comparably high at 22.64 FPS. As illustrated in the corresponding visualizations in Figure 10, the cross-scale attention and adaptive weighting mechanisms in CSHAM facilitate more effective integration of multi-resolution features. This enables improved detection of fine-grained crack structures, while preserving smooth and coherent segmentation boundaries.
When both the CSHAM and EHAT modules are enabled, the model achieves its best overall performance: an mIoU of 74.41%, mPrecision of 83.84%, mRecall of 84.51%, an F1-score of 83.11%, and an inference speed of 28.56 FPS. As shown in the rightmost column of Figure 10, crack patterns in examples (a)–(d) are accurately detected across multiple scales and orientations, with improved line continuity and significantly reduced background interference. In the regions highlighted by red boxes, the baseline and single-module variants exhibit noticeable segmentation gaps and discontinuities. In contrast, the combined use of CSHAM and EHAT yields contours that closely match the ground truth, demonstrating superior recognition of multi-directional and multi-scale cracks.
In conclusion, the collective findings in Table 4 and Figure 10 underscore the synergistic contributions of the two proposed modules. The EHAT module primarily enhances the model’s sensitivity to directional crack features, mitigating fragmentation and misclassification, while the CSHAM module improves feature expressiveness and background suppression through adaptive multi-scale fusion. Their integration leads to substantial gains in segmentation accuracy, structural consistency, and robustness under complex conditions. Additionally, the model achieves fast inference, rendering it highly applicable to real-time pavement crack detection scenarios.

5.3. Comparison with Existing Advanced Methods

To validate the performance advantages of the proposed ETAFHrNet model in pavement crack detection, we conducted comparative experiments on the self-constructed QD-Crack dataset against several state-of-the-art segmentation models, including U-Net, DeepLabv3+, SegFormer, PSPNet, and HRNet. All models were trained and fine-tuned under identical experimental conditions to ensure fair comparison. As shown in Figure 11, each model was evaluated using three primary metrics: mean Intersection over Union (mIoU), F1-score, and mean Recall (mRecall). The results demonstrate that ETAFHrNet achieves an mIoU of 74.41%, representing a 10.92-percentage-point improvement over HRNet. In addition, it attains an mPrecision of 83.84%, an mRecall of 84.51%, and an F1-score of 83.11%, highlighting the model’s significant advantage in segmentation accuracy and overall performance [60].
Besides accuracy, ETAFHrNet also offers a balanced hardware footprint. With 50.6 M parameters, it is only 12% larger than the HRNet baseline yet 21% smaller than the Transformer-based SegFormer (64.0 M). Despite this mid-range size, ETAFHrNet delivers an mIoU that is 10.9 percentage points higher than HRNet and 5.2 percentage points higher than SegFormer, while achieving the highest throughput (28.56 FPS). This accuracy-capacity-speed triad makes it attractive for edge GPUs and NPUs that typically provide 8–16 GB of RAM.
Furthermore, as presented in Table 5 and Figure 12, ETAFHrNet outperforms U-Net by 13.08 percentage points in mean Precision (mPrecision) and 13.28 percentage points in mean Recall (mRecall), indicating its superior overall segmentation performance. While DeepLabv3+ leverages atrous convolution to expand the receptive field, it exhibits limitations in modeling the continuity of slender cracks, resulting in a relatively low mIoU of 65.72%. SegFormer, constrained by its window partitioning strategy, tends to generate fragmented crack predictions during high-resolution detection tasks. Similarly, PSPNet, due to its coarse-grained context modeling, demonstrates a higher omission rate in fine crack detection, achieving an mRecall of only 72.64%.
Parameter-wise, all CNN baselines cluster between 31 M and 45 M, whereas SegFormer scales up to 64 M. ETAFHrNet falls between the two groups, indicating that its performance boost stems from architectural design rather than brute-force model scaling.
Given this footprint, we further estimate the real-time capacity of ETAFHrNet over a typical pavement section. Assuming one image covers roughly 1.5 m of pavement, analyzing a 1 km segment requires about 667 images. At 28.56 FPS, the model can process these images in 23.4 s (excluding I/O), confirming its suitability for near-real-time mobile inspection.
To provide a clearer illustration of the proposed model’s performance benefits, we present a visual comparison using representative test samples, as shown in Figure 13. The figure presents original pavement images, ground-truth segmentation masks, and prediction results from ETAFHrNet, U-Net, and PSPNet, with red boxes marking key areas of discrepancy. The visual comparisons clearly show that ETAFHrNet offers superior performance in capturing directional and continuous crack features. In particular, for sample groups 1 and 4, ETAFHrNet accurately preserves crack continuity at junctions, where other models tend to produce fragmented outputs. In group 3, the model successfully detects faint, low-contrast cracks through adaptive multi-scale fusion, effectively mitigating the information loss seen in DeepLabv3+, which relies on a single-path fusion mechanism. In more visually complex backgrounds, such as those in groups 2 and 3, the integration of EHAT and CSHAM enhances both crack-to-background contrast and edge localization precision. By comparison, U-Net frequently exhibits crack discontinuities, attributed to its limited receptive field, while PSPNet often generates over-smoothed or mis-clustered predictions due to its coarse context modeling during feature fusion. These qualitative results further reinforce the quantitative superiority of ETAFHrNet in accurately segmenting diverse and challenging crack patterns.
Nevertheless, Figure 13 also reveals that ETAFHrNet is not flawless. (1) Sample a: the predicted transverse (horizontal) crack appears noticeably blurred compared with the sharper boundary produced by DeepLabv3+, indicating a tendency towards over-smoothing along horizontal orientations. (2) Sample c: the forked crack at the bottom is segmented with exaggerated width, resulting in an over-emphasized branch. These failure cases highlight the remaining optimization space for edge-preservation and scale-aware refinement.
Although the predicted segmentation maps from different models may appear visually similar in some cases, high-precision crack pattern identification plays a critical role in pavement management. Distinguishing between transverse and alligator cracks, for instance, informs whether surface sealing or full-depth patching is required. Precise segmentation also improves damage quantification, enabling more accurate cost estimation, lifecycle prediction, and prioritization of maintenance resources.
In summary, ETAFHrNet, empowered by its innovative hybrid attention mechanisms and adaptive multi-scale fusion strategy, delivers substantial improvements in segmentation accuracy, robustness, and computational efficiency for pavement crack detection. The model exhibits clear advantages in crack-structure preservation, background-noise suppression, and fine-detail restoration, underscoring its strong potential for deployment in real-world road-inspection and maintenance scenarios.

6. Conclusions

This study proposed ETAFHrNet, a Transformer-enhanced segmentation network specifically designed to tackle the challenges of detecting complex and irregular crack patterns in high-resolution pavement imagery. By integrating the Efficient Hybrid Attention Transformer (EHAT) and the Cross-Scale Hybrid Attention Module (CSHAM) into the HRNet backbone, our model effectively captures both long-range contextual dependencies and fine-grained structural features that are critical for accurate object segmentation and classification.
Comprehensive experiments on the self-constructed QD-Crack dataset confirm that ETAFHrNet surpasses state-of-the-art approaches, including U-Net, DeepLabv3+, and HRNet, in terms of segmentation accuracy, precision, recall, and inference speed. Ablation studies demonstrate that the two proposed attention modules provide complementary benefits, particularly in enhancing the representation of directionality, scale variation, and discontinuity, which are typical characteristics of asymmetric visual objects.
The proposed framework contributes to the development of interpretable and efficient AI models for infrastructure monitoring, with extensibility to a wide range of applications such as bridge inspection, tunnel lining analysis, and remote sensing-based structural assessment. Moreover, the model’s architecture aligns with the broader goals of object detection and image classification, especially under challenging conditions where traditional models struggle.
Although ETAFHrNet shows promising segmentation accuracy and inference speed, several practical constraints remain: (1) Data diversity: the QD-Crack dataset mainly contains dry asphalt surfaces captured in daylight; performance under concrete pavements, wet conditions, night-time illumination, and extreme weather has not yet been validated. (2) Micro-crack sensitivity: hairline cracks narrower than two pixels are occasionally missed, revealing insufficient fine-scale feature capture. (3) Pavement-material dependence: preliminary trials on concrete surfaces reveal false positives where aggregate texture is confused with cracks, indicating the need for material-aware domain adaptation. (4) Edge deployment: the current model still relies on an NVIDIA RTX 3070 Ti GPU; additional pruning and quantization are required for real-time inference on low-power edge devices. (5) Continuous video streams: experiments were conducted on discrete images; real-time tracking of cracks in on-board video sequences demands further pipeline optimization. (6) Domain generalization: transferability to geographically distinct road networks or other infrastructure (e.g., bridges, airport runways) remains to be verified through cross-domain testing. Addressing these issues constitutes our immediate future work.
Future research will therefore focus on optimizing ETAFHrNet for lightweight deployment on edge devices, enhancing its generalizability across diverse environmental scenarios, and improving its ability to identify micro-scale defects across varying pavement materials. More broadly, our findings underscore the importance of modeling asymmetry and multi-scale variation in visual data, a principle that is critical for building robust, generalizable, and explainable object-recognition systems across real-world domains.

Author Contributions

Conceptualization, C.T. and R.L.; methodology, R.L. and Z.Z.; software, J.L.; validation, P.T., A.Y. and C.T.; formal analysis, C.T. and P.T.; investigation, S.P. and Z.Z.; resources, R.L. and Z.Z.; data curation, J.D.; writing–original draft, C.T., J.L. and Z.Z.; writing–review and editing, R.L. and P.T.; visualization, S.P.; supervision, R.L.; project administration, R.L.; funding acquisition, R.L. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation of China (Grant No. 42001414) and the “Elite Program” Research Support Foundation (Grant No. 0104060541613).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

For access to the data from this study, please contact the corresponding author.

Acknowledgments

The authors thank the anonymous reviewers for their constructive feedback, which greatly improved the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huyan, J.; Li, W.; Tighe, S.; Xu, Z.; Zhai, J. CrackU-net: A Novel Deep Convolutional Neural Network for Pixelwise Pavement Crack Detection. Struct. Control Health Monit. 2020, 27, e2551. [Google Scholar] [CrossRef]
  2. Ragnoli, A.; De Blasiis, M.R.; Di Benedetto, A. Pavement Distress Detection Methods: A Review. Infrastructures 2018, 3, 58. [Google Scholar] [CrossRef]
  3. Oliveira, H.; Correia, P.L. Automatic Road Crack Detection and Characterization. IEEE Trans. Intell. Transp. Syst. 2013, 14, 155–168. [Google Scholar] [CrossRef]
  4. Nafaa, S.; Essam, H.; Ashour, K.; Emad, D.; Mohamed, R.; Elhenawy, M.; Ashqar, H.I.; Hassan, A.A.; Alhadidi, T.I. Automated Pavement Cracks Detection and Classification Using Deep Learning. arXiv 2024, arXiv:2406.07674. [Google Scholar] [CrossRef]
  5. Mukherjee, R.; Iqbal, H.; Marzban, S.; Badar, A.; Brouns, T.; Gowda, S.; Arani, E.; Zonooz, B. AI Driven Road Maintenance Inspection. arXiv 2021, arXiv:2106.02567. [Google Scholar] [CrossRef]
  6. Li, Y.; Ma, R.; Liu, H.; Cheng, G. Real-Time High-Resolution Neural Network with Semantic Guidance for Crack Segmentation. Autom. Constr. 2023, 156, 105112. [Google Scholar] [CrossRef]
  7. Liu, Y.; Yao, J.; Lu, X.; Xie, R.; Li, L. DeepCrack: A Deep Hierarchical Feature Learning Architecture for Crack Segmentation. Neurocomputing 2019, 338, 139–153. [Google Scholar] [CrossRef]
  8. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5686–5696. [Google Scholar] [CrossRef]
  9. Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature Pyramid and Hierarchical Boosting Network for Pavement Crack Detection. arXiv 2019, arXiv:1901.06340. [Google Scholar] [CrossRef]
  10. Fan, R.; Bocus, M.J.; Zhu, Y.; Jiao, J.; Wang, L.; Ma, F.; Cheng, S.; Liu, M. Road Crack Detection Using Deep Convolutional Neural Network and Adaptive Thresholding. arXiv 2019, arXiv:1904.08582. [Google Scholar] [CrossRef]
  11. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; Volume 11211, pp. 833–851. [Google Scholar] [CrossRef]
  12. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-Local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar] [CrossRef]
  13. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  14. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar] [CrossRef]
  15. Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid Attention Network for Semantic Segmentation. arXiv 2018, arXiv:1805.10180. [Google Scholar] [CrossRef]
  16. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
  17. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  18. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv 2021, arXiv:2105.15203. [Google Scholar] [CrossRef]
  19. Liu, H.; Miao, X.; Mertz, C.; Xu, C.; Kong, H. CrackFormer: Transformer Network for Fine-Grained Crack Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3763–3772. [Google Scholar] [CrossRef]
  20. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar] [CrossRef]
  21. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6877–6886. [Google Scholar] [CrossRef]
  22. Ding, F. Crack Detection in Infrastructure Using Transfer Learning, Spatial Attention, and Genetic Algorithm Optimization. arXiv 2024, arXiv:2411.17140. [Google Scholar] [CrossRef]
  23. Huang, Y.; Shi, Z.; Wang, Z.; Wang, Z. Improved U-Net Based on Mixed Loss Function for Liver Medical Image Segmentation. Laser Optoelectron. Prog. 2020, 57, 221003. [Google Scholar] [CrossRef]
  24. Zhang, L.; Yang, F.; Daniel Zhang, Y.; Zhu, Y.J. Road Crack Detection Using Deep Convolutional Neural Network. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3708–3712. [Google Scholar] [CrossRef]
  25. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient Transformer for High-Resolution Image Restoration. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5718–5729. [Google Scholar] [CrossRef]
  26. Chen, C.F.R.; Fan, Q.; Panda, R. CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 347–356. [Google Scholar] [CrossRef]
  27. Tian, D.; Han, Y.; Liu, Y.; Li, J.; Zhang, P.; Liu, M. Hybrid Cross-Feature Interaction Attention Module for Object Detection in Intelligent Mobile Scenes. Remote Sens. 2023, 15, 4991. [Google Scholar] [CrossRef]
  28. Wang, H.; Zhu, Y.; Green, B.; Adam, H.; Yuille, A.; Chen, L.C. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12349, pp. 108–126. [Google Scholar] [CrossRef]
  29. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar] [CrossRef]
  30. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar] [CrossRef]
  31. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar] [CrossRef]
  32. Sun, K.; Zhao, Y.; Jiang, B.; Cheng, T.; Xiao, B.; Liu, D.; Mu, Y.; Wang, X.; Liu, W.; Wang, J. High-Resolution Representations for Labeling Pixels and Regions. arXiv 2019, arXiv:1904.04514. [Google Scholar] [CrossRef]
  33. Yin, Z.; Liang, K.; Ma, Z.; Guo, J. Duplex Contextual Relation Network For Polyp Segmentation. In Proceedings of the 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI), Kolkata, India, 28–31 March 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–5. [Google Scholar] [CrossRef]
  34. Xu, C.; Zhang, Q.; Mei, L.; Chang, X.; Ye, Z.; Wang, J.; Ye, L.; Yang, W. Cross-Attention-Guided Feature Alignment Network for Road Crack Detection. ISPRS Int. J. Geo-Inf. 2023, 12, 382. [Google Scholar] [CrossRef]
  35. Lin, H.; Cheng, X.; Wu, X.; Yang, F.; Shen, D.; Wang, Z.; Song, Q.; Yuan, W. CAT: Cross Attention in Vision Transformer. arXiv 2021, arXiv:2106.05786. [Google Scholar] [CrossRef]
  36. Guo, F.; Liu, J.; Lv, C.; Yu, H. A Novel Transformer-Based Network with Attention Mechanism for Automatic Pavement Crack Detection. Constr. Build. Mater. 2023, 391, 131852. [Google Scholar] [CrossRef]
  37. Li, C.; Fan, Z.; Chen, Y.; Sheng, W.; Wang, K.C.P. CrackCLF: Automatic Pavement Crack Detection Based on Closed-Loop Feedback. IEEE Trans. Intell. Transp. Syst. 2024, 25, 5965–5980. [Google Scholar] [CrossRef]
  38. Chen, L.C.; Yang, Y.; Wang, J.; Xu, W.; Yuille, A.L. Attention to Scale: Scale-Aware Semantic Image Segmentation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3640–3649. [Google Scholar] [CrossRef]
  39. Ho, J.; Kalchbrenner, N.; Weissenborn, D.; Salimans, T. Axial Attention in Multidimensional Transformers. arXiv 2019, arXiv:1912.12180. [Google Scholar] [CrossRef]
  40. Chen, Y.; Cheng, H.; Wang, H.; Liu, X.; Chen, F.; Li, F.; Zhang, X.; Wang, M. EAN: Edge-Aware Network for Image Manipulation Localization. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 1591–1601. [Google Scholar] [CrossRef]
  41. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  42. Guo, J.; Han, K.; Wu, H.; Tang, Y.; Chen, X.; Wang, Y.; Xu, C. CMT: Convolutional Neural Networks Meet Vision Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12165–12175. [Google Scholar] [CrossRef]
  43. Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Dong, X.; Yuan, L.; Liu, Z. Mobile-Former: Bridging MobileNet and Transformer. arXiv 2021, arXiv:2108.05895. [Google Scholar] [CrossRef]
  44. Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Dong, X.; Yuan, L.; Liu, Z. Mobile-Former: Bridging MobileNet and Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5260–5269. [Google Scholar] [CrossRef]
  45. Wu, Z.; Liu, Z.; Lin, J.; Lin, Y.; Han, S. Lite Transformer with Long-Short Range Attention. arXiv 2020, arXiv:2004.11886. [Google Scholar] [CrossRef]
  46. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar] [CrossRef]
  47. Zou, R.; Song, C.; Zhang, Z. The Devil Is in the Details: Window-Based Attention for Image Compression. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17471–17480. [Google Scholar] [CrossRef]
  48. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
  49. Tolstikhin, I.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. MLP-Mixer: An All-MLP Architecture for Vision. arXiv 2021, arXiv:2105.01601. [Google Scholar] [CrossRef]
  50. Shao, D.; Ren, L.; Ma, L. MSF-Net: A Lightweight Multi-Scale Feature Fusion Network for Skin Lesion Segmentation. Biomedicines 2023, 11, 1733. [Google Scholar] [CrossRef]
  51. Jia, G.; Song, W.; Jia, D.; Zhu, H. Sample Generation of Semi-automatic Pavement Crack Labelling and Robustness in Detection of Pavement Diseases. Electron. Lett. 2019, 55, 1235–1238. [Google Scholar] [CrossRef]
  52. Maguire, M.; Dorafshan, S.; Thomas, R.J. SDNET2018: A Concrete Crack Image Dataset for Machine Learning Applications; Utah State University: Logan, UT, USA, 2018. [Google Scholar] [CrossRef]
  53. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A Comprehensive Survey on Transfer Learning. arXiv 2019, arXiv:1911.02685. [Google Scholar] [CrossRef]
  54. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. arXiv 2014, arXiv:1409.0575. [Google Scholar] [CrossRef]
  55. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar] [CrossRef]
  56. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar] [CrossRef]
  57. Jamal, M.A.; Brown, M.; Yang, M.H.; Wang, L.; Gong, B. Rethinking Class-Balanced Methods for Long-Tailed Visual Recognition From a Domain Adaptation Perspective. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7607–7616. [Google Scholar] [CrossRef]
  58. Cui, Y.; Jia, M.; Lin, T.Y.; Song, Y.; Belongie, S. Class-Balanced Loss Based on Effective Number of Samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9268–9277. [Google Scholar] [CrossRef]
  59. Wu, B.; Xu, C.; Dai, X.; Wan, A.; Zhang, P.; Yan, Z.; Tomizuka, M.; Gonzalez, J.; Keutzer, K.; Vajda, P. Visual Transformers: Token-Based Image Representation and Processing for Computer Vision. arXiv 2020, arXiv:2006.03677. [Google Scholar] [CrossRef]
  60. Wang, Y.; Liu, C.; Fan, Y.; Niu, C.; Huang, W.; Pan, Y.; Li, J.; Wang, Y.; Li, J. A Multi-Modal Deep Learning Solution for Precise Pneumonia Diagnosis: The PneumoFusion-Net Model. Front. Physiol. 2025, 16, 1512835. [Google Scholar] [CrossRef]
Figure 1. Workflow of the proposed pavement crack detection framework.
Figure 2. The ETAFHrNet architecture.
Figure 3. The Efficient Hybrid Attention Transformer (EHAT) module.
Figure 4. The Cross-Scale Hybrid Attention Module (CSHAM).
Figure 5. Dataset construction process.
Figure 6. Examples of labeling schemes for different crack types. Each column from (a–d) represents a complete sample group, consisting of the following: top row, original pavement image; middle row, binary annotation; bottom row, overlay mask. Specifically, (a) transverse crack, (b) longitudinal crack, (c) alligator crack on asphalt pavement, and (d) intersecting crack on concrete pavement.
Figure 7. Image preprocessing visualization. The images from top-left to bottom-right are as follows: (a) original image, (b) contrast-enhanced, (c) brightness-adjusted, (d) horizontally flipped, (e) sharpness-optimized, and (f) randomly rotated.
Figure 8. Trend of the loss function with the number of training epochs.
Figure 9. Confusion matrix visualization of ablation experiments.
Figure 10. Ablation experiment comparison. (a–d) Randomly sampled images from the dataset. The red boxes indicate regions where differences occur.
Figure 11. Performance comparison (mIoU, F1-score, etc.) of models on the dataset.
Figure 12. Comparison of classification performance across different semantic segmentation models based on their confusion matrices.
Figure 13. Segmentation outcomes comparison among different models. (a–d) Randomly sampled images from the dataset. The red boxes indicate regions where differences occur.
Table 1. Model training configuration.

Configuration Item      | Configuration
------------------------|-----------------------------------
Operating System        | Windows 11
Deep Learning Framework | PyTorch 1.10.0
Processor               | Intel Core i7-12700K
RAM                     | 32 GB
GPU                     | NVIDIA GeForce RTX 3070 Ti (8 GB)
GPU Memory              | 8 GB
CUDA Version            | 11.3
Table 2. Comparison of semantic segmentation model performance across multiple datasets.

Model      | Classes | mIoU (%) | mRecall (%) | mPrecision (%) | F1-Score (%)
-----------|---------|----------|-------------|----------------|-------------
U-Net      | two     | 62.85    | 70.76       | 71.23          | 71.90
U-Net      | three   | 59.62    | 67.48       | 68.07          | 67.77
DeepLabv3+ | two     | 65.72    | 73.25       | 73.86          | 73.92
DeepLabv3+ | three   | 62.47    | 70.09       | 70.54          | 70.31
HRNet      | two     | 63.49    | 72.98       | 70.34          | 71.58
HRNet      | three   | 60.15    | 69.11       | 67.32          | 68.20
ETAFHrNet  | two     | 74.41    | 83.84       | 84.51          | 83.11
ETAFHrNet  | three   | 71.08    | 79.21       | 76.64          | 79.42
Table 3. Results of transfer learning experiments.

Model      | Transfer Learning | mIoU (%) | mRecall (%) | mPrecision (%) | F1-Score (%)
-----------|-------------------|----------|-------------|----------------|-------------
U-Net      | No                | 59.12    | 66.32       | 67.28          | 67.45
U-Net      | Yes               | 62.85    | 70.76       | 71.23          | 71.90
DeepLabv3+ | No                | 61.47    | 69.15       | 69.87          | 69.90
DeepLabv3+ | Yes               | 65.72    | 73.25       | 73.86          | 73.92
HRNet      | No                | 60.08    | 68.44       | 68.93          | 68.71
HRNet      | Yes               | 63.49    | 72.98       | 70.34          | 71.58
SegFormer  | No                | 63.58    | 71.72       | 72.13          | 72.25
SegFormer  | Yes               | 69.21    | 76.52       | 76.88          | 76.73
ETAFHrNet  | No                | 66.32    | 74.65       | 75.15          | 75.38
ETAFHrNet  | Yes               | 74.41    | 83.84       | 84.51          | 83.11
Table 4. Ablation experiment performance comparison.

Model | EHAT | CSHAM | mIoU (%) | mPrecision (%) | mRecall (%) | F1-Score (%) | FPS   | Params (M)
------|------|-------|----------|----------------|-------------|--------------|-------|-----------
HRNet | No   | No    | 63.49    | 72.98          | 70.34       | 71.58        | 14.51 | 45.0
HRNet | Yes  | No    | 70.18    | 81.08          | 77.05       | 78.93        | 22.83 | 47.5
HRNet | No   | Yes   | 72.69    | 79.95          | 80.96       | 81.45        | 22.64 | 48.2
HRNet | Yes  | Yes   | 74.41    | 83.84          | 84.51       | 83.11        | 28.56 | 50.6
Table 5. Performance comparison of various semantic segmentation models on the QD-Crack dataset. (Bold text in the original highlights the better model parameters.)

Model      | mIoU (%) | mPrecision (%) | mRecall (%) | F1-Score (%) | FPS   | Params (M)
-----------|----------|----------------|-------------|--------------|-------|-----------
U-Net      | 62.85    | 70.76          | 71.23       | 71.90        | 15.37 | 31.0
HRNet      | 63.49    | 72.98          | 70.34       | 71.58        | 14.51 | 45.0
PSPNet     | 64.32    | 71.85          | 72.64       | 72.24        | 17.22 | 42.6
DeepLabv3+ | 65.72    | 73.25          | 73.86       | 73.92        | 19.84 | 42.0
SegFormer  | 69.21    | 76.52          | 76.88       | 76.73        | 21.34 | 64.0
ETAFHrNet  | 74.41    | 83.84          | 84.51       | 83.11        | 28.56 | 50.6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
