1. Introduction
Cracks are a prevalent form of road damage and serve as a crucial reference indicator in the monitoring of road health [1]. The majority of pavement defects initially manifest as surface cracks, which not only impact the service life and traffic efficiency of roads but also pose threats to human life and property safety. Therefore, regular crack detection plays a crucial role in maintaining infrastructure and ensuring its stable operation. Traditional crack detection relies primarily on manual visual interpretation, which is time-consuming, labor-intensive, and subjective. In contrast, computer vision-based crack detection methods have gained widespread recognition and application due to their efficiency and accuracy [2].
Early image-based crack extraction methods were typically rule-driven. A threshold segmentation method was proposed based on the assumption that cracks are darker than the background [3]. Variations in light, color, shadow, and texture provide edge information in digital images; leveraging the changes in pixel intensity along edges as a basis for feature extraction, edge operator methods were proposed and applied to crack extraction tasks, yielding favorable results [4]. Transforming image signals from the spatial domain to the frequency domain facilitates the effective extraction of edge information, and a crack extraction method based on the wavelet transform was proposed on this basis [5]. These methods can rapidly extract cracks from images, but they are significantly influenced by the background: interference from external factors such as changes in lighting and shadow occlusion may render them ineffective. To enhance adaptability in real-world environments, researchers have explored machine learning techniques for crack extraction [6,7,8], including artificial neural networks, random forests, and support vector machines. These methods have achieved satisfactory extraction results; however, their heavy reliance on manual feature selection limits them in practical applications.
In recent years, deep learning has developed rapidly in the field of computer vision, achieving significant breakthroughs in areas such as object detection, image classification, and semantic segmentation [9,10,11,12]. Unlike traditional rule-driven methods, deep learning approaches are data-driven, offering powerful feature extraction, high accuracy, strong adaptability, and scalability, which has made them a focal point of current computer vision research [13]. Deep learning has also been successfully applied to crack extraction tasks, providing accurate pixel-level annotations for crack images; high-precision crack maps serve as a powerful basis for road quality assessment [14,15]. Zhang et al. proposed an efficient road crack detection network, CrackNet [16]. The network contains no pooling layers that would reduce the output of the previous layer, keeping the width and height of all layers constant to achieve high-precision pixel-level extraction. Ronneberger et al. proposed the classic semantic segmentation network UNet [17]. Designed with an encoder–decoder structure, it integrates high-resolution features into the decoding process, allowing better recovery of detailed target information, and has achieved success in biomedical image segmentation. Owing to the superior performance of the network and the similarity of the scenarios, David Jenkins et al. successfully applied UNet to road crack extraction tasks [18]. Badrinarayanan et al. modified VGG-16 to create SegNet [19], whose decoding layers receive the max-pooling indices from the corresponding encoding layers and use them in the upsampling process, reducing the number of training parameters and running faster than UNet. Chen et al. applied SegNet to crack extraction and demonstrated its effectiveness [20]. Chen et al. enhanced the traditional encoder–decoder network by adding a simple yet effective decoding module to construct DeepLabV3+, which notably improves the segmentation of object boundaries [21]. To further enhance the accuracy of crack extraction with Convolutional Neural Networks (CNNs), researchers conducted in-depth analyses of crack features and designed networks specifically tailored to crack scenes [22]. Zou et al. built the DeepCrack network on the SegNet architecture [23], effectively capturing the linear structure of cracks through multi-scale feature fusion. Zhang et al. pointed out that in complex backgrounds, crack extraction is susceptible to elements such as shadows and speckles because the network lacks sufficient contextual information to aid scene perception; in response, they constructed a context feature enhancement module and introduced it into their network [24]. The same work notes that the fixed geometric structure of convolution makes it difficult to adapt to the diverse morphologies of cracks, so deformable convolutions are introduced to extract more crack features through offsets of the sampling points. Zhou et al. pointed out that for tasks such as crack extraction, which are heavily influenced by the background, the network must effectively capture long-range dependencies in the feature information, and introduced a hybrid attention mechanism to address this issue [25]. Geng et al. pointed out that traditional networks struggle to detect thin cracks because they cannot effectively extract the corresponding features, and introduced a wavelet transform method that effectively supplements the overlooked frequency information [26]. In recent years, the Transformer has demonstrated superior feature extraction performance in computer vision tasks thanks to its global modeling ability and has gradually been applied to crack extraction. Liu et al. proposed CrackFormer for fine-grained crack detection, which achieves cross-channel context extraction and long-range modeling by embedding a self-attention module, capturing contextual information with a large receptive field [27]. Guo et al. used the Swin Transformer as an encoder to capture global and long-range semantic features, and experimental results proved its effectiveness for crack extraction tasks with diverse distributions and morphologies [28]. However, Transformer-based modeling requires large datasets and has not shown significant performance superiority on small-sample datasets.
These networks have been successfully applied and proven effective in practice. However, despite the achievements mentioned above, image-based crack extraction tasks still face two major challenges. Firstly, in complex scenes, crack extraction is prone to interference from background pixels (such as pitting and shadows), whose characteristics closely resemble those of cracks. This similarity often results in the inclusion of extraneous noise outside the actual crack regions, ultimately leading to false detection in the extraction results. Secondly, due to the inherently slender and tubular structure of cracks, along with their diverse morphologies and irregular intensity variations, conventional convolutional operations struggle to adapt sampling points in a morphology-aware manner. Consequently, the network fails to accurately and comprehensively capture crack features, ultimately leading to missed detection in the extraction results. This demands that the network, while making full use of multi-level feature information, should also possess accurate contextual awareness.
In response to the aforementioned challenges, this paper introduces a refined segmentation network (ANF-Net) designed for road scenarios with diverse morphologies and significant noise in crack features. The paper systematically constructs three modules to enhance the network’s focus on crack regions and improve its capability to extract crack features.
The main contributions of this paper can be summarized as follows:
- (1)
To address the issue of excessive noise in the crack extraction results in complex noise scenarios, we introduce the coordinate attention module [29] to selectively enhance the network’s ability to capture features in crack regions. This effectively mitigates the problem of including too many noisy pixels in the crack extraction results.
- (2)
To address the challenge of diverse morphological variations in cracks, leading to incomplete or missing crack features in the extraction results due to the network’s inability to accurately capture them, we devise the multi-scale discrete wavelet transform enhancement module. This module supplements the frequency domain information containing crack edge features, which is overlooked during the network’s downsampling process. Additionally, the constrained multi-morphological convolution structure is designed to impose constrained shifts on convolution sampling points, thereby enhancing the network’s perceptual capabilities and improving the precision of crack extraction.
- (3)
This paper introduces an end-to-end, high-precision network, ANF-Net, designed for the challenges posed by multi-morphology and noisy crack scenarios. ANF-Net achieves automatic and accurate pixel-level crack extraction, outperforming classic crack extraction algorithms such as SegNet, DeepCrack, and DeepLabV3+.
The remaining sections of this paper are organized as follows: Section 2 provides a detailed introduction to several key technologies addressed in this paper. Section 3 elaborates on the network architecture of the constructed model and the targeted improvement modules. Section 4 presents the simulation details and results, along with a precision evaluation and in-depth analysis of the simulation outcomes. Finally, Section 5 summarizes the work presented in this paper and outlines potential directions for future development.
3. Methodology
3.1. Overview of the Proposed Method
This paper proposes ANF-Net, a refined segmentation network designed for road scenes with diverse crack morphologies and various noise levels. The network incorporates targeted improvements in both the encoder and decoder. To address false detections caused by background noise interference during crack extraction, the Coordinate Attention Module is introduced. By learning pixel-wise feature weights, this module enables the network to focus on crack regions, reducing the impact of crack-like background features. To tackle omissions caused by insufficient feature extraction under the morphological diversity of cracks, the Constrained Multi-Morphological Convolution Structure is designed. By imposing learnable continuous constraints on the deformation offsets of convolutional kernels, this structure allows the network to adaptively fit crack shapes, facilitating accurate and comprehensive crack feature extraction. These strategies significantly enhance the network’s capability to extract morphologically diverse cracks from high-noise images. The overall architecture is illustrated in Figure 1. U-Net was originally designed for medical image segmentation and achieved excellent results on small medical image datasets. Because road crack images are topologically similar to medical images and both tasks have relatively small datasets, ANF-Net is constructed on the U-Net backbone. The network consists of three main parts: an encoder, a decoder, and lateral connection structures. The encoder and decoder are symmetric four-layer structures. The encoder is built by stacking four consecutive feature extraction layers, each consisting of two successive blocks of convolution, normalization, and activation. Information transfer between layers is accomplished by 2 × 2 max-pooling layers. The channel numbers of the four layers are 64, 128, 256, and 512, respectively. The decoder aggregates information from the corresponding encoder levels through the lateral connections and performs upsampling using bilinear interpolation.
In both the encoder and decoder, the Coordinate Attention Module is embedded after the two consecutive blocks of convolution, normalization, and activation. The attention module alters neither the size nor the channel count of the feature maps. The attended feature maps are used both in the encoder’s downsampling path and in the lateral connections. In the encoder, the feature map obtained after downsampling is concatenated with the high-frequency channel features obtained from the discrete wavelet transform; after feature aggregation, the combined input is fed into the subsequent convolutional process. In the last two layers of the encoder, the Constrained Multi-Morphological Convolution Structure replaces the traditional convolutional structure, likewise without altering the size or channel count of the feature maps. A sketch of one encoder stage is given below.
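As a concrete illustration of the stage layout just described, the following is a minimal PyTorch sketch of one encoder stage: two convolution–normalization–activation blocks, an attention module, then 2 × 2 max pooling. The class and parameter names are ours, not the authors’ released code; the attention module is passed in as an argument so that a Coordinate Attention implementation (see the sketch in Section 3.4) can be swapped in.

```python
import torch.nn as nn

class EncoderStage(nn.Module):
    """Sketch of one ANF-Net encoder stage: two conv-BN-ReLU blocks,
    an attention module whose output feeds both the lateral connection
    and the 2x2 max-pooling downsampling. Illustrative, not the
    authors' exact implementation."""

    def __init__(self, in_ch, out_ch, attention: nn.Module):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.attn = attention  # e.g. CoordinateAttention from Section 3.4
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feat = self.attn(self.convs(x))  # lateral-connection features
        return feat, self.pool(feat)     # skip tensor + downsampled tensor

# Usage with a placeholder attention module:
stage1 = EncoderStage(3, 64, attention=nn.Identity())
```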
3.2. Multi-Scale Discrete Wavelet Transform Enhancement Module
Traditional encoder–decoder networks are mainly constructed through the stacking of convolutional layers and pooling layers. This structure exhibits two issues: (1) It processes the image only in the spatial domain, lacking frequency domain information; (2) Multiple downsampling processes lead to information loss. To address the aforementioned issues, this paper constructs a multi-scale discrete wavelet transform enhancement module. The introduction of discrete wavelet transform effectively supplements the missing frequency information in traditional neural networks. Additionally, the multi-scale input strategy addresses the issue of information loss during the downsampling process.
During the discrete wavelet transform of an image, four filters, $f_{LL}$, $f_{LH}$, $f_{HL}$, and $f_{HH}$, are used to filter the original image $x$, resulting in a low-frequency component $x_{LL}$ containing the main image information and three high-frequency components, $x_{LH}$, $x_{HL}$, and $x_{HH}$, containing image detail information. A single discrete wavelet transform simultaneously performs a two-fold downsampling of the image, consistent with the scale of the max-pooling downsampling in the proposed network. The four components obtained through a single discrete wavelet transform are as follows:

$$x_{LL} = (f_{LL} \odot x)\downarrow_{2}, \quad x_{LH} = (f_{LH} \odot x)\downarrow_{2}, \quad x_{HL} = (f_{HL} \odot x)\downarrow_{2}, \quad x_{HH} = (f_{HH} \odot x)\downarrow_{2}$$
where $x_{LL}$ represents the low-frequency component, $x_{LH}$ represents the horizontal high-frequency component, $x_{HL}$ represents the vertical high-frequency component, $x_{HH}$ represents the diagonal high-frequency component, $\odot$ represents element-wise multiplication, and $\downarrow_{2}$ represents two-fold downsampling. Taking into account both computational performance and algorithm efficiency, this paper employs the widely used Haar wavelet filters for the filtering operations, defined as follows:

$$f_{LL} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad f_{LH} = \begin{bmatrix} -1 & -1 \\ 1 & 1 \end{bmatrix}, \quad f_{HL} = \begin{bmatrix} -1 & 1 \\ -1 & 1 \end{bmatrix}, \quad f_{HH} = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$$
Similarly, the next level of the discrete wavelet transform is applied to the low-frequency component obtained from the previous level and likewise yields three high-frequency components containing image detail information. To better compensate for the information loss during network downsampling, the high-frequency components must be concatenated with the downsampling results. Because the channel dimension of the feature map composed of high-frequency components is much smaller than that of the downsampled feature map, direct concatenation has a limited impact on the downsampling results and is not sufficient to supplement the lost detail information. Therefore, in this paper, the dimensions of the feature map constructed from the high-frequency components are expanded through a 3 × 3 convolution to match the dimensions of the downsampled feature map before concatenation. Subsequently, convolution operations refine the concatenated feature map, ensuring that the dimensions of the reconstructed features match those of the downsampled features. Finally, the refined features are reintroduced into the network for subsequent processing.
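To make the mechanism concrete, below is a minimal PyTorch sketch of a single-level Haar DWT enhancement step under the description above: the Haar sub-bands are computed by a stride-2 depthwise convolution, the three high-frequency bands are expanded by a 3 × 3 convolution, concatenated with the max-pooled encoder features, and refined. Module and parameter names are illustrative assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarDWTEnhance(nn.Module):
    """Sketch of the multi-scale DWT enhancement idea: one Haar DWT
    level yields LL/LH/HL/HH bands at half resolution; the three
    high-frequency bands are projected to the encoder's channel width
    and fused with the max-pooled features."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        # 2x2 Haar analysis filters (LL, LH, HL, HH), scaled by 1/2.
        ll = torch.tensor([[1., 1.], [1., 1.]])
        lh = torch.tensor([[-1., -1.], [1., 1.]])
        hl = torch.tensor([[-1., 1.], [-1., 1.]])
        hh = torch.tensor([[1., -1.], [-1., 1.]])
        filt = torch.stack([ll, lh, hl, hh]) * 0.5        # (4, 2, 2)
        self.register_buffer("filt", filt.unsqueeze(1))   # (4, 1, 2, 2)
        # 3x3 conv expanding the high-frequency bands to out_ch channels.
        self.expand = nn.Conv2d(3 * in_ch, out_ch, 3, padding=1)
        # Refinement conv applied after concatenation with pooled features.
        self.refine = nn.Conv2d(2 * out_ch, out_ch, 3, padding=1)

    def forward(self, x, pooled):
        # x: pre-pooling features (B, in_ch, H, W);
        # pooled: features after 2x2 max pooling (B, out_ch, H/2, W/2).
        b, c, h, w = x.shape
        # Depthwise Haar DWT with stride 2: 4 sub-bands per channel.
        bands = F.conv2d(x.reshape(b * c, 1, h, w), self.filt, stride=2)
        bands = bands.reshape(b, c, 4, h // 2, w // 2)
        high = bands[:, :, 1:].reshape(b, 3 * c, h // 2, w // 2)  # LH, HL, HH
        return self.refine(torch.cat([pooled, self.expand(high)], dim=1))
```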
3.3. Constrained Multi-Morphological Convolution Structure
CNNs extract image features through convolutional layers with fixed-scale convolution operations, so the receptive field within a given layer remains fixed. In crack extraction, cracks across different scenes exhibit diverse morphological variations, and their orientations vary significantly. Owing to the fixed geometric structure of convolution, the sampling points cannot be effectively constrained to crack pixels, leaving the network unable to learn diverse morphological features; issues such as interruptions in the extracted cracks and suboptimal extraction of narrow, small, and dense cracks may arise. Previous researchers have effectively addressed this issue with deformable convolutions, which introduce deformation offsets $\Delta p_{n}$ into regular convolutions. In deformable convolution, each position $p_{0}$ in the output feature map $y$ corresponds to a position in the input feature map $x$:

$$y(p_{0}) = \sum_{p_{n} \in R} w(p_{n}) \cdot x(p_{0} + p_{n} + \Delta p_{n})$$

where $w(p_{n})$ represents the convolutional kernel weight at point $p_{n}$, $R$ denotes the complete set of sampling points, $p_{n}$ iterates over all values in the point set $R$, and $\Delta p_{n}$ represents the learnable offset. Although the learned offset $\Delta p_{n}$ addresses to some extent the fixed receptive field within a CNN layer, without constraints it may cause the receptive field to drift away from the target area; especially in dense and narrow crack areas, problems such as interrupted and inaccurate crack extraction may persist. The specific structures of standard convolution, dilated convolution, and deformable convolution are shown in Figure 2a.
To address these issues and inspired by the work of Qi et al. [43], this paper introduces a constrained multi-morphological convolutional structure and successfully applies it to crack extraction. It is specifically designed to address the difficulty traditional convolution has in accurately extracting features from narrow, elongated, and morphologically diverse crack regions, targeting issues such as interruptions and omissions in the extracted cracks. The detailed structure is illustrated in Figure 2b. For a fixed-size convolutional kernel window, the kernel grids are distributed only along the horizontal and vertical directions, with the kernel center as the origin of the window. In the horizontal direction, taking the kernel center $K_{i} = (x_{i}, y_{i})$ as the reference, the grid $K_{i \pm c}$ at distance $c$ from the center requires a vertical offset $\Delta y$ applied relative to the previous grid position, where the current offset is accumulated from the offsets of the preceding positions so that the convolutional kernel conforms to a continuous linear structural pattern. The specific offset implementation is shown in Figure 3. In contrast to the freely learned deformation offsets of deformable convolution, this structure constrains the offset process, better matching the characteristics of crack-like structures, with sampling points more focused on crack pixels. The convolutional kernel along the horizontal direction is denoted as:

$$K_{i \pm c} = \begin{cases} (x_{i+c},\, y_{i+c}) = \left(x_{i} + c,\; y_{i} + \sum_{j=i}^{i+c} \Delta y_{j}\right) \\ (x_{i-c},\, y_{i-c}) = \left(x_{i} - c,\; y_{i} + \sum_{j=i-c}^{i} \Delta y_{j}\right) \end{cases}$$

where, for a horizontal convolutional kernel, $c$ represents the horizontal distance from each grid in the kernel to the central grid. Similarly, in the vertical direction, each grid relative to the central grid requires a horizontal offset accumulated from the previous grid position to fit the morphology of the target. The convolutional kernel along the vertical direction is denoted as:

$$K_{j \pm c} = \begin{cases} (x_{j+c},\, y_{j+c}) = \left(x_{j} + \sum_{k=j}^{j+c} \Delta x_{k},\; y_{j} + c\right) \\ (x_{j-c},\, y_{j-c}) = \left(x_{j} + \sum_{k=j-c}^{j} \Delta x_{k},\; y_{j} - c\right) \end{cases}$$
The learned offset is typically not an integer. To satisfy the sampling requirements on the image grid, bilinear interpolation is applied at the offset target position. The specific formula is as follows:

$$K = \sum_{K'} B(K', K) \cdot K'$$

where $K$ represents the fractional position along the horizontal and vertical directions, $K'$ enumerates all integral spatial positions, and $B$ is the bilinear interpolation kernel, which can be separated into two one-dimensional kernels:

$$B(K', K) = b(K'_{x}, K_{x}) \cdot b(K'_{y}, K_{y})$$
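The cumulative-offset constraint can be sketched as follows. This is a simplified illustration of the accumulation idea only (bounding raw offsets and summing them outward from the kernel center); the tanh bound, kernel size, and names are assumptions, and a full implementation would additionally gather features at the resulting fractional positions via the bilinear interpolation above (e.g., with `F.grid_sample`).

```python
import torch
import torch.nn as nn

class ConstrainedHorizontalOffsets(nn.Module):
    """Sketch: predict one vertical offset per grid of a horizontal
    kernel of size k, bound it to (-1, 1), and accumulate outward from
    the kernel center so neighboring sampling points stay connected."""

    def __init__(self, in_ch, k=9):
        super().__init__()
        self.k = k
        # One vertical offset per kernel grid per spatial position.
        self.offset_conv = nn.Conv2d(in_ch, k, 3, padding=1)

    def forward(self, feat):
        dy = torch.tanh(self.offset_conv(feat))  # (B, k, H, W), bounded
        c = self.k // 2
        cols = [None] * self.k
        cols[c] = torch.zeros_like(dy[:, c])     # center grid: no shift
        for i in range(c + 1, self.k):           # accumulate rightward
            cols[i] = cols[i - 1] + dy[:, i]
        for i in range(c - 1, -1, -1):           # accumulate leftward
            cols[i] = cols[i + 1] + dy[:, i]
        # (B, k, H, W): vertical displacement of each of the k sampling
        # points; features at these fractional positions would then be
        # gathered with bilinear interpolation.
        return torch.stack(cols, dim=1)
```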
3.4. Coordinate Attention Module
In crack recognition, various scene factors (such as pitting and shadows) produce interference whose characteristics are extremely similar to cracks, making it difficult for the network to focus on accurately perceiving the crack area. Building upon the traditional attention mechanism, and aiming to reduce computational resource consumption while addressing long-range dependencies in crack extraction, this paper introduces the Coordinate Attention Module (CAM) [29]. CAM has low computational complexity and provides sufficient contextual awareness. The specific structure is illustrated in Figure 4. CAM consists of two stages, Coordinate Information Embedding and Coordinate Attention Generation, which together encode precise positional information about channel relationships and long-range dependencies.
Coordinate Information Embedding: To capture long-range interactions with precise spatial information, CAM applies 1D pooling to the input feature map $X$ of size $C \times H \times W$: pooling is performed along the horizontal and vertical directions of the feature map using two pooling kernels of size $(H, 1)$ and $(1, W)$, respectively. The pooling process in the horizontal direction using the kernel $(H, 1)$ can be expressed as:

$$z_{c}^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} x_{c}(h, i)$$

Similarly, the pooling process in the vertical direction using the kernel $(1, W)$ can be expressed as:

$$z_{c}^{w}(w) = \frac{1}{H} \sum_{0 \le j < H} x_{c}(j, w)$$

Pooling along the two directions captures long-range dependencies while preserving precise positional information.
Coordinate Attention Generation: The feature maps $z^{h}$ and $z^{w}$ obtained through the pooling transformations along the two directions are concatenated along the spatial dimension to yield a feature map of size $C \times 1 \times (H + W)$. Simultaneously, to enhance computational efficiency, a scaling factor $r$ is introduced to reduce the channel dimension, producing a downsized feature map of size $C/r \times 1 \times (H + W)$ after a $1 \times 1$ convolutional transformation $F_{1}$ and a non-linear activation $\delta$:

$$f = \delta\left(F_{1}\left(\left[z^{h}, z^{w}\right]\right)\right)$$

Subsequently, the feature map $f$ is split along the spatial dimension into $f^{h}$ and $f^{w}$, and their channel numbers are restored to match the input feature map:

$$g^{h} = \sigma\left(F_{h}\left(f^{h}\right)\right), \quad g^{w} = \sigma\left(F_{w}\left(f^{w}\right)\right)$$

where $\sigma$ represents the sigmoid activation function, and $F_{h}$ and $F_{w}$ denote $1 \times 1$ convolutional operations. The resulting $g^{h}$ has dimensions $C \times H \times 1$, and $g^{w}$ has dimensions $C \times 1 \times W$. Finally, the original feature map is weighted by the horizontal and vertical attention components, producing the output feature map $y$:

$$y_{c}(i, j) = x_{c}(i, j) \times g_{c}^{h}(i) \times g_{c}^{w}(j)$$
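The following PyTorch sketch implements the two stages above following the original Coordinate Attention design [29]; the reduction ratio r = 16 and the BatchNorm/ReLU inside the shared transform are common choices from that work, not settings confirmed by this paper.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate Attention: directional pooling, a shared 1x1
    reduction, then per-direction 1x1 convs producing the two
    attention maps that re-weight the input."""

    def __init__(self, channels, r=16):
        super().__init__()
        mid = max(8, channels // r)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (B, C, 1, W)
        self.f1 = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.f_h = nn.Conv2d(mid, channels, 1)
        self.f_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        zh = self.pool_h(x)                        # (B, C, H, 1)
        zw = self.pool_w(x).permute(0, 1, 3, 2)    # (B, C, W, 1)
        # Concatenate along the spatial dimension, reduce channels.
        f = self.f1(torch.cat([zh, zw], dim=2))    # (B, C/r, H+W, 1)
        fh, fw = torch.split(f, [h, w], dim=2)
        gh = torch.sigmoid(self.f_h(fh))                       # (B, C, H, 1)
        gw = torch.sigmoid(self.f_w(fw.permute(0, 1, 3, 2)))   # (B, C, 1, W)
        # Re-weight the input with the two directional attention maps.
        return x * gh * gw
```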
4. Simulation Results and Analysis
In this section, we first present detailed information about the neural network designed in this paper, including network architecture and parameter settings. Subsequently, we introduce several crack datasets used for simulation. We then provide a detailed display of the crack extraction results and compare them with the simulation results of mainstream crack extraction methods. Additionally, we conduct ablation simulations and performance comparisons between different modules. Finally, we evaluate the accuracy using multiple metrics.
4.1. Implementation Details
The proposed method and various comparative ablation simulations in this paper are implemented using the PyTorch 1.10.0 framework. The simulations are conducted on a system with Windows 10, an NVIDIA Quadro P6000 GPU, and an Intel Core i9-7900X CPU. For the convolutional operations in both the encoder and decoder stages, batch normalization and ReLU activation are applied immediately afterward.
The software environment used for the simulations is Python 3.8. Stochastic gradient descent (SGD) with momentum is employed as the optimizer in the network, with the momentum set to 0.9 and weight decay set to 0.0001. In the simulations, the initial learning rate is set to 0.01, and warm-up training is employed, with the number of epochs set to 200. During the sequential training process, only the weights corresponding to the best performance are saved for subsequent prediction processes.
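For reference, the stated optimizer settings can be reproduced as below. The warm-up length and the post-warm-up decay schedule are assumptions (the paper specifies only that warm-up training is used), and the stand-in model is a placeholder for ANF-Net.

```python
import math
import torch
import torch.nn as nn

# Illustrative training setup matching the stated hyper-parameters:
# SGD with momentum 0.9, weight decay 1e-4, initial lr 0.01, 200 epochs.
model = nn.Conv2d(3, 1, 3, padding=1)  # placeholder for ANF-Net
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

WARMUP, EPOCHS = 5, 200  # warm-up length is an assumption

def lr_lambda(epoch):
    if epoch < WARMUP:  # linear warm-up from a small learning rate
        return (epoch + 1) / WARMUP
    t = (epoch - WARMUP) / (EPOCHS - WARMUP)
    return 0.5 * (1.0 + math.cos(math.pi * t))  # assumed cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```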
4.2. Datasets
To better conduct comparative and ablation simulations for the proposed ANF-Net, three representative crack datasets are selected: the DeepCrack dataset of cracks in multiple scenes [44], the YCD dataset of cracks on roads and concrete walls [45], and the CFD dataset of road cracks [46]. The scenes in these datasets are diverse, including images with background noise, narrow cracks, and densely cracked areas, allowing a comprehensive validation of the effectiveness and robustness of the proposed method. This paper randomly divides each crack dataset into training, validation, and test sets at a ratio of 7:2:1 for the training and prediction tasks. The specific details of the datasets are provided below:
- (1)
DeepCrack: The dataset comprises cracks from various scales and scenes, with pixel-level annotations already completed. The total number of samples is 537, all with dimensions of 544 × 384. The dataset has been randomly partitioned into a training set of 376 images, a validation set of 107 images, and a test set of 54 images, maintaining proportional distribution.
- (2)
YCD: The dataset includes crack images collected from web sources as well as captured in the field, with pixel-level annotations already completed. There are a total of 776 samples. Because the images were captured at varying distances, their dimensions differ; for simulation convenience, this paper resizes all images to 512 × 512. The dataset has been randomly partitioned into a training set of 542 images, a validation set of 156 images, and a test set of 78 images, maintaining proportional distribution.
- (3)
CFD: The dataset comprises images depicting road surface crack scenes, with pixel-level annotations already completed. There are a total of 118 samples, each with dimensions of 480 × 320. The dataset has been randomly divided into a training set of 82 images, a validation set of 24 images, and a test set of 12 images, maintaining proportional distribution.
The datasets used in this study have relatively small sample sizes for a neural network. Although a network suitable for small datasets has been chosen, random image augmentation techniques are applied during training to further improve predictive accuracy, including size adjustment, horizontal and vertical flipping, and random cropping; a sketch of such a pipeline is given below. Simulation results demonstrate the beneficial impact of these methods on predictive performance.
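The sketch below assembles the stated operations with torchvision; the probabilities and crop/resize sizes are assumptions. For segmentation, the same geometric transform must be applied to the image and its mask together (e.g., via torchvision’s functional API with shared random parameters).

```python
from torchvision import transforms

# Illustrative augmentation pipeline for the stated operations.
train_aug = transforms.Compose([
    transforms.Resize((544, 544)),           # size adjustment
    transforms.RandomHorizontalFlip(p=0.5),  # horizontal flip
    transforms.RandomVerticalFlip(p=0.5),    # vertical flip
    transforms.RandomCrop((512, 512)),       # random crop
    transforms.ToTensor(),
])
```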
4.3. Evaluation Metrics
For the task of crack extraction, the images are categorized into two classes: crack pixels and non-crack pixels. To better evaluate the accuracy of the extraction results, this paper employs five evaluation metrics commonly used in image segmentation: Accuracy, the ratio of correctly predicted pixels to the total number of pixels; Precision, the ratio of correctly predicted crack pixels to all pixels predicted as cracks; Recall, the ratio of correctly predicted crack pixels to all crack pixels in the ground truth; mIoU, the degree of overlap between the predicted pixels and the ground-truth pixels, averaged over the two classes; and F1 Score, the harmonic mean of Precision and Recall. The specific calculation formulas are as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{mIoU} = \frac{1}{2}\left(\frac{TP}{TP + FP + FN} + \frac{TN}{TN + FP + FN}\right), \quad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where true positive (TP) denotes the number of pixels that are actually true and predicted as true. False positive (FP) represents the number of pixels that are actually false but predicted as true. False negative (FN) represents the number of pixels that are actually true but predicted as false. True negative (TN) represents the number of pixels that are actually false and predicted as false.
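A direct implementation of these definitions from binary masks might look as follows; the epsilon guard against empty classes and the two-class mIoU averaging are standard practice, assumed rather than taken from the paper.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Compute the five metrics from binary prediction and ground-truth
    masks (numpy arrays of 0/1)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    eps = 1e-8  # guard against division by zero for empty classes
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        "precision": precision,
        "recall": recall,
        # mean of crack IoU and background IoU
        "miou": 0.5 * (tp / (tp + fp + fn + eps)
                       + tn / (tn + fp + fn + eps)),
        "f1": 2 * precision * recall / (precision + recall + eps),
    }
```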
4.4. Comparison and Discussion
To thoroughly validate the effectiveness and robustness of the proposed method, three publicly available crack datasets, namely, DeepCrack, YCD, and CFD, are utilized in this section for simulations. The method developed in this paper is compared with several commonly used and high-performance neural network methods in the field of crack extraction, including SegNet, DeepCrack, DeepLabV3+, UNet, UNet++, and others. Furthermore, the simulation results are rigorously evaluated for accuracy, and a comprehensive analysis and discussion are conducted from both qualitative and quantitative perspectives.
- (1)
Comparison on DeepCrack
The DeepCrack dataset is processed through the various networks for training, validation, and prediction. The final visualized prediction results are illustrated in Figure 5. From left to right, the images present the original image, the ground truth, and the predicted results of the proposed method, SegNet, DeepCrack, DeepLabV3+, UNet, and UNet++.
For ease of qualitative assessment of predictive performance, regions with incorrect predictions are highlighted with red bounding boxes in the figure. Qualitative analysis of several sets of results yields the following: for crack images with significant background noise, the predictions of conventional methods often include substantial noise, especially for DeepCrack and U-Net, so the extracted cracks are not sufficiently pure and accurate. Owing to the rational construction of CAM in the proposed method, the network’s ability to capture crucial spatial information is effectively enhanced, yielding predictions closer to the ground truth. For narrow and densely packed cracks, the CMC structure effectively constrains the sampling points to crack pixels, producing predictions that surpass conventional methods in both visual quality and detection accuracy. In addition, the image segmentation evaluation metrics are computed for the crack extraction results of the various networks on the DeepCrack dataset. The results are shown in Table 1, where the best outcomes are highlighted in bold. The proposed ANF-Net achieves a Precision of 87.2%, Recall of 88.7%, mIoU of 88.9%, and F1 score of 87.9% on the DeepCrack dataset. For mIoU and F1, the two crucial metrics of segmentation performance, the proposed method achieves the highest scores, improving on the suboptimal comparative results by 1.1% in mIoU and 1.1% in F1. It can be concluded that the proposed method is superior in visual quality and detection accuracy, and the detection results hold practical guidance value for real-world applications.
- (2)
Comparison on YCD
To validate the model’s generalization capability, this paper conducts cross-sectional comparative simulations on the public YCD crack dataset. The model is similarly subjected to training, validation, and prediction. The final visualized prediction results are illustrated in Figure 6. From left to right, the images present the original image, the ground truth, and the predicted results of the proposed method, SegNet, DeepCrack, DeepLabV3+, UNet, and UNet++.
The inadequately predicted regions in the result images are highlighted with red boxes, and a qualitative analysis yields the following: conventional methods exhibit noticeable over-segmentation, particularly U-Net and U-Net++, which tend to incorrectly predict the shadowed portions of crack edges as cracks. In contrast, the ANF-Net constructed in this paper demonstrates excellent noise resistance; its feature-capture modules perceive cracks more comprehensively, producing cleaner and more accurate predictions that closely approach the ground truth. Additionally, a quantitative analysis of the predictions of the various networks on the YCD dataset is conducted using the image segmentation accuracy metrics. The results are shown in Table 2. The proposed ANF-Net achieves a Precision of 84.7%, Recall of 80.5%, mIoU of 84.8%, and F1 score of 82.5% on the YCD dataset. Compared to the suboptimal results of the comparative methods, there is a significant improvement of 1.2% in mIoU and 1.5% in F1. This further illustrates the superior performance of the proposed ANF-Net in both the visual and precision aspects of crack extraction, and demonstrates a certain level of generalization capability.
- (3)
Comparison on CFD
To further validate the model’s generalization capability, this paper conducts cross-sectional comparative simulations on the public CFD crack dataset. After training, validation, and prediction, the final results are illustrated in Figure 7. From left to right, the images present the original image, the ground truth, and the predicted results of the proposed method, SegNet, DeepCrack, DeepLabV3+, UNet, and UNet++.
Similarly, the inadequately predicted regions in the result images are annotated with red boxes. A qualitative analysis reveals the following: conventional methods often suffer from insufficient extraction, particularly DeepLabV3+ and U-Net, whose predictions in regions with weak crack intensity are suboptimal. There is also an issue of interruptions in the extracted cracks, chiefly caused by the absence of long-range dependencies in the image and insufficient information extraction in the encoder. The constructed ANF-Net performs better on cracks in low-intensity areas, and its predicted cracks are less prone to interruptions and boundary fuzziness. Additionally, a quantitative analysis is conducted using the image segmentation accuracy metrics. The results are shown in Table 3. Compared to the other methods, the proposed approach achieves the highest overall extraction accuracy: Precision is 73.1%, mIoU is 77.1%, and F1 is 71.5%, all the highest among the comparative methods. Compared to the suboptimal results, there is a significant improvement of 0.9% in mIoU and 1.6% in F1. This further illustrates that the proposed method excels in crack extraction accuracy, especially in scenarios with low crack intensity and high background noise. The simulations on multiple datasets demonstrate that the proposed method has strong generalization capability, providing valuable guidance for practical production applications.
4.5. Ablation Simulations
In this paper, several issues in traditional crack extraction methods are identified in detail, and targeted modules, MWE, CMC, and CAM, are constructed on the UNet baseline to address them. In this section, dedicated ablation simulations are conducted on the DeepCrack dataset to investigate these three modules, covering the baseline without any modules, the baseline with individual modules, and the baseline with module combinations, for a total of six groups. A comprehensive qualitative analysis of the networks’ predictions is conducted, and Precision, Recall, mIoU, and F1 are chosen as the four metrics for quantitative accuracy analysis.
Figure 8 presents the comparative images of the network prediction results in the ablation simulations, the regional median Dice index, and the FPS-F1 score scatter plot. In the comparative prediction images, red marks correctly predicted crack pixels, green marks erroneously predicted crack pixels, and blue marks crack pixels that were not successfully predicted. Table 4 presents the results of the crack prediction accuracy evaluation. Combining the figures and tables, a comprehensive analysis yields the following: for the baseline network without any added modules, the accuracy of the crack predictions is at its lowest level, with an mIoU of only 83.8% and an F1 of only 81.6%. Adding CAM alone to the baseline increases F1 by 2.8%, because the attention mechanism enhances effective spatial information in the image and improves the network’s noise resistance. Adding the CMC module alone increases F1 by 2.2%, because the constrained learnable offsets of this convolutional structure adapt effectively to tubular, multi-morphological crack structures and extract more crack-pixel features. Adding the MWE module alone increases F1 by 0.5%, because the multi-scale wavelet enhancement structure effectively supplements the frequency information lost during downsampling. Combining the CAM and CMC modules increases F1 by 5%, and combining CAM, CMC, and MWE yields an effective improvement of 6.3% in F1, achieving the highest scores in Precision, mIoU, and F1. The ablation results show that each of the three modules benefits network performance, and their combination, especially all three together, helps the network achieve the best crack predictions. The visualized results also indicate that the final network presents the best visual quality, with cleaner and more accurate extraction results.
4.6. Comparison of Different Modules
For a thorough analysis of the performance superiority of each module, this section designs several sets of independent cross-sectional comparative simulations on the DeepCrack dataset, comparing different attention mechanisms and convolutional sampling structures. Finally, accuracy evaluation and detailed discussion are conducted on the simulation results.
- (1)
Comparison of Different Attention Modules
To thoroughly illustrate the superior performance of CAM in crack extraction tasks, this section designs several sets of comparative simulations on the DeepCrack dataset, embedding different attention modules in the network: the channel attention modules SEBlock and ECABlock, the hybrid attention module CBAM, and the CAM used in this paper. In addition to the original precision metrics, two performance metrics are introduced: Parameters, measuring the network’s parameter count, and FLOPs, measuring its computational load. The final visualized prediction results are shown in Figure 9a, and the quantitative evaluation results are presented in Table 5. A comprehensive analysis of the graphical and tabular results reveals the following: with each of the four attention modules embedded, the network exhibits minimal variation in Parameters and FLOPs; the fluctuation range for Parameters is approximately 0.89%, and the fluctuation for FLOPs is even smaller. This indicates that CAM neither introduces excessive parameters nor affects the operational efficiency of the network. In the crack prediction accuracy evaluation, SEBlock improves F1 by 0.1% over the original network, a minimal impact; ECABlock improves F1 by 2.2%; CBAM improves F1 by 2.3%; and CAM significantly improves F1 by 2.8%. Additionally, introducing CAM yields crack predictions that are cleaner and less affected by background noise, with fewer background-noise pixels in the extraction results. In summary, the CAM module used in this paper achieves the highest crack prediction accuracy without introducing excessive parameters, exhibiting clear superiority over the other attention modules.
- (2)
Comparison of Different Convolutional Structures
In this paper, the CMC structure is constructed to address inaccurate crack extraction caused by insufficient feature extraction in multi-morphological, dense crack scenes. Because the CMC module builds on the concepts of dilated convolution and deformable convolution, this section designs several sets of comparative simulations on the DeepCrack dataset to thoroughly illustrate its superiority: the base network, the base network with dilated convolution, the base network with deformable convolution, and the base network with the CMC module. Notably, to better demonstrate the modules’ performance, the base network here is UNet with CAM. The evaluation uses Parameters and FLOPs, which measure operational performance, and Precision, Recall, mIoU, and F1, which measure prediction accuracy. The final visualized crack extraction results are shown in Figure 9b, and the quantitative evaluation results are presented in Table 6.
From the data in the table, it can be observed that, compared to the base network, dilated convolution does not introduce excessive parameters but decreases the final prediction accuracy somewhat, with mIoU and F1 each dropping by 0.3%. Deformable convolution performs better, improving mIoU by 1.1% and F1 by 1.5% without introducing excessive parameters. Compared to these two convolutional structures, the network built with CMC achieves the best accuracy, with an effective improvement of 1.8% in mIoU and 2.2% in F1. Although this structure introduces somewhat more parameters, there is no perceptible performance difference in actual training and prediction. Additionally, the visual results show that the network using the CMC structure produces better predictions in narrow, dense crack regions, with fewer fragmented or missing areas. In summary, the CMC introduced in this paper achieves higher prediction accuracy than the other convolutional sampling structures and exhibits good overall performance superiority.
5. Conclusions
This paper proposes a fine-grained segmentation network suited to road scenes with cracks of diverse morphology and heavy noise. To enhance the network’s feature-capture capability, attention mechanisms are deliberately introduced in both the encoder and decoder, allowing the network to better handle complex, noisy scenes. A constrained multi-morphological convolutional structure is integrated into the encoder, giving the network improved predictive performance in scenes with diverse crack morphologies. Furthermore, a multi-scale discrete wavelet enhancement module is designed in the encoder to effectively supplement the frequency domain information containing crack features that may be overlooked during downsampling. These improvements effectively address inaccurate detection in road images with diverse crack morphologies and significant background noise.
This paper conducts comparative simulations on three publicly available crack datasets: DeepCrack, YCD, and CFD. The simulation results demonstrate the effectiveness of the proposed method: compared to the suboptimal methods, the F1 score improves by 1.1%, 1.5%, and 1.6%, respectively. A dedicated ablation study on the DeepCrack dataset, integrating the three modules into the baseline network, yields a remarkable 6.3% increase in the F1 score. In the module comparison simulations, the CAM and CMC individually lead to effective increases of 2.8% and 2.2% in the F1 score, respectively. The comprehensive simulation results affirm that the ANF-Net constructed in this paper exhibits outstanding crack segmentation performance compared to other classical deep learning methods.
While the constructed ANF-Net in this study achieves promising results in crack extraction tasks, it is noteworthy that deep learning methods require a substantial amount of annotated data for training to ensure the effectiveness and robustness of the network. However, annotating cracks in various scenarios is a challenging task. Thus, the focus of future research should be on improving the network’s predictive capabilities in scenarios with limited sample data. Furthermore, we aim to extend this method to more application scenarios, such as rock fracture extraction and tunnel cavity crack detection, to enhance the engineering practicality of the approach.