Article

MSP U-Net: Crack Segmentation for Low-Resolution Images Based on Multi-Scale Parallel Attention U-Net

Department of Computer Science, Chosun University, Gwangju 61452, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(24), 11541; https://doi.org/10.3390/app142411541
Submission received: 12 November 2024 / Revised: 28 November 2024 / Accepted: 30 November 2024 / Published: 11 December 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

As structures and paved roads worldwide approach their expected lifespans and the importance of road maintenance grows, safety inspections have emerged as a crucial task. Nonetheless, existing crack detection models suffer from multi-scale feature loss and degraded performance when learning the various types of cracks. We propose the Multi-Scale Parallel Attention U-Net (MSP U-Net), a network designed for low-resolution images that considers the irregular characteristics of cracks. MSP U-Net applies a bigger receptive field block to an attention U-Net, minimizing feature loss across multiple scales. Using the Crack500 dataset, our network achieved a mean intersection over union (mIoU) of 0.7752, outperforming existing methods on low-resolution images.

1. Introduction

As traffic structures and paved roads worldwide approach their expected lifespans, cracks tend to form and expand over time, compromising traffic safety. Consequently, safety inspections for road maintenance are conducted annually across the globe [1]. These inspections are crucial in ensuring safe transportation and traffic conditions. Research in this domain can be divided into two methodologies. The first is based on traditional image processing techniques [2]. Because this approach, exemplified by edge detectors, considers only low-level image features without modeling the characteristics of the target objects, it typically struggles to detect targets with irregular features under varying conditions (e.g., lighting, angles, and backgrounds). The second is based on deep learning, which generalizes better to diverse environments and irregular objects because it learns not only image features but also the characteristics of the detection targets. The core of this technology is the convolutional neural network (CNN), which has consistently outperformed traditional computer vision techniques; the increasing availability of graphics processing units (GPUs) has further accelerated CNN development. Since the emergence of AlexNet [3], which won the ImageNet Large Scale Visual Recognition Challenge in 2012, a wide range of CNN-based architectures have been researched and applied to tasks such as road crack detection, including VGGNet [4], introduced in 2014; ResNet [5], introduced in 2015; and DenseNet [6], introduced in 2017.
Although early neural-network-based models for crack detection established the research direction, they exhibited clear drawbacks: they adapted poorly to diverse data characteristics, such as brightness and illumination, and were highly sensitive to image resolution. More recently, researchers have developed deep-learning-based segmentation algorithms that perform well across such varied environments.
Lau et al. [7] employed a ResNet-34 encoder pre-trained on ImageNet and constructed a decoder using transposed convolutional layers, similar to the conventional U-Net, achieving an F-1 score of 0.7327. They demonstrated distinct advantages over other methods by applying spatial and channel feature extraction in the form of Spatial and Channel Squeeze and Excitation (SCSE) modules [8]. Nguyen et al. [9] applied preprocessing techniques suited to deep-learning-based networks rather than the digital image processing techniques used in traditional computer vision, showing good accuracy and model stability. Yu et al. [10] proposed RUC-Net, a semantic segmentation network that combines ResNet-18 with U-Net; they applied a focal loss to address class imbalance by down-weighting easy samples, demonstrating better detection performance than classical segmentation algorithms. Wang et al. [11] proposed a U-Net with a DenseNet121 encoder based on a Pyramid Attention Network (PAN). They applied feature pyramid attention modules between the encoder and decoder to extract high-level features for classification and segmentation, and their decoder achieved good performance by combining low- and high-level features using the Global Attention Upsample (GAU) module. Di Benedetto et al. [12] designed a U-Net-based architecture with a ResNet-50 encoder; through transfer learning, it outperformed existing methods on a common dataset. Choi et al. [13] proposed SDDNet, a purely deep-learning-based network that performs real-time crack detection while ignoring backgrounds that resemble crack surfaces. Despite an 88-fold reduction in parameters compared with the latest models at the time of publication, SDDNet achieved better performance and faster processing. Kang et al. [14] proposed STRNet, which excels at segmenting concrete cracks at the pixel level. Trained on a large-scale dataset and tested on 545 images, it recorded a precision, recall, F-1 score, and mIoU of 91.7%, 92.7%, 92.2%, and 92.6%, respectively, while processing relatively large input images (1280 × 720 and 1024 × 512) in real time. Chun et al. [15] proposed a road surface crack detection technique using fully convolutional neural networks (FCNNs). They segmented a total of 6756 images obtained while driving on actual roads and classified pavement damage into six classes. Because the dataset was initially insufficient, it was stabilized using K-fold cross-validation; they further applied a semi-supervised method based on pseudo-label datasets, outperforming conventional supervised networks.
Although the aforementioned studies have been somewhat successful in improving crack detection performance, they still exhibit suboptimal accuracy on low-resolution images. One of the main reasons is the loss of features in certain encoder layers of autoencoder-based structures. Motivated by this limitation, we propose a network that achieves improved detection performance on low-resolution images. The contributions of the proposed network are summarized as follows:
First, we identified the loss of features in certain network layers as a major cause of degraded detection performance. To mitigate this loss, we propose an attention gate with a bigger receptive field block that preserves the diversity of the receptive field. Second, we resized high-resolution images to a lower resolution during preprocessing and used them as training input, demonstrating satisfactory crack detection performance even in low-resolution settings. Here, the high-resolution images range from 512 × 512 to 1024 × 1024 pixels, while the low-resolution images are 128 × 128.
In the remainder of this paper, we briefly review related studies, describe the proposed method, and present experimental results. As those results demonstrate, our approach excels at detecting fine cracks, particularly in low-resolution images.

2. Related Work

In this section, we introduce the methods primarily used in our research and the related studies.

2.1. Residual Block

The residual block, which is widely used in neural networks, is based on the concept of residual learning introduced in ResNet [5]. It employs a skip connection, wherein the output of the convolutional layers and their activation function is added to the output of the previous layer. Figure 1 illustrates the skip connection of a residual block.
This alleviates the vanishing gradient problem that occurs when a backpropagation algorithm is used in deep networks.
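For illustration, below is a minimal Keras sketch of such a block. The layer ordering follows the standard ResNet pattern; the 1 × 1 shortcut projection is an assumption for the case where channel counts differ, not a detail specified in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """y = ReLU(F(x) + x): two 3x3 convolutions plus a skip connection."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut with a 1x1 convolution if channel counts differ.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    # Skip connection: add the shortcut before the final activation.
    return layers.ReLU()(layers.Add()([y, shortcut]))
```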

2.2. Attention Network

Attention [16] was introduced to natural language processing with the transformer architecture. The input sequence is projected into a Query, Key, and Value to calculate the relevance of each element with respect to every other element, a procedure referred to as self-attention. The following formula defines this operation over the Query Q, Key K, and Value V, where d_k is the key dimension:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
As an extension of this concept, Multi-Head Attention (MHA) is designed to extract various features and compute their relevance to the input sequence. Through this process, the relationship between the input and output can be determined. Finally, positional encoding is performed to restore the order of the previously divided input sequence, allowing the network to determine the sequence information of the input and relative positions of the feature elements. The formula for MHA calculation is as follows:
$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V})$$
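As a concrete reference, the sketch below implements scaled dot-product attention directly from the formula above for a single head; it is an illustration, not code from the paper.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    """Computes softmax(Q K^T / sqrt(d_k)) V for batched sequences."""
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)  # relevance of each element to every other
    return tf.matmul(weights, v)

# Multi-head attention applies this per head and concatenates the results;
# tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64) encapsulates
# the projections W_i^Q, W_i^K, W_i^V and the output projection W^O.
```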
In recent years, attention mechanisms have demonstrated excellent performance in various fields, including segmentation. By mimicking the human perception of visual information, the attention mechanism selects important and useful information from large amounts of data. This is a reasonable approach for detecting cracks involving unspecified and diverse variables.
Lin et al. [17] introduced a full-attention U-Net, which utilizes an attention mechanism and skip connections. The distinguishing feature of this network is the application of attention blocks to all layers by incorporating attention networks in the decoder as well as the encoder. Although this approach exhibits slightly lower performance compared to the case where an attention network is only used in the decoder, it achieves higher performance when noise removal strategies are applied.
Wang et al. [18] proposed the SegCrack model, which adopts a hierarchical transformer architecture to output multistage features and alleviates class imbalance through an ascending path, lateral connections, and the online hard example mining (OHEM) strategy. By sampling hard pixels for the loss, it enhances crack detection performance, achieving an F-1 score and mIoU of 96.05% and 92.63%, respectively, outperforming CNN-based networks. Sun et al. [19] introduced DMA-Net, which incorporates a multi-scale attention module into the decoder of DeepLabv3+, thereby improving the integration of the multi-scale features present in irregularly shaped cracks. They conducted extensive experiments on three datasets, demonstrating the practicality of the method even with smartphone-captured pavement images.
Research utilizing vision transformers (ViTs), an extended form of the attention mechanism, has also been actively pursued. Tao et al. [20] proposed a convolution-transformer network for crack segmentation. This network—which included extended residual blocks, boundary-aware modules, and MobileViT blocks—was experimentally demonstrated to outperform previously developed methods.
Dong et al. [21] proposed a patch-based weakly supervised semantic segmentation method that trains on image-level labels instead of pixel-level annotations: images are divided into patches, synthetic labels are created through identification/localization algorithms, and conditional random field (CRF) processing is applied to reduce complexity. Moon et al. [22] proposed PCTC-Net, a parallel dual-encoder network fusing pre-conv-based transformers and CNNs. It addresses the data-hungry nature of transformers using relatively small datasets, demonstrating higher generalizability than the DTC-Net state-of-the-art (SOTA) model.
Oktay et al. [23] proposed the integration of an attention mechanism into a U-Net architecture. In pancreatic segmentation tasks conducted in medical imaging, the segmentation performance was improved compared to the original U-Net architecture, demonstrating the network’s ability to focus on image features.

3. Proposed Method

3.1. Proposed Architecture

The proposed model is a deep learning architecture that accurately detects and segments cracks of various sizes in images. Figure 2 illustrates the convolution block used throughout the architecture: in the encoder, the input passes through convolutional layers, each followed by batch normalization and the ReLU activation function.
Figure 3 presents an overview of the proposed network. In the encoder, 2 × 2 max pooling reduces the spatial dimensions at each stage, and the feature maps pass sequentially through 3 × 3 convolutional layers with increasing numbers of filters. A bottleneck of 3 × 3 convolutional layers with 1024 filters is then applied. In the decoder, upsampling restores the reduced spatial dimensions, and weights are applied through the attention block. After the decoder's 3 × 3 convolutional layers, the final output is generated through a convolutional layer and a sigmoid activation function.
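A minimal Keras sketch of this convolution block and one encoder stage is shown below; using two 3 × 3 convolutions per block is our assumption, following the usual U-Net convention rather than a count stated in the text.

```python
from tensorflow.keras import layers

def conv_block(x, filters):
    """Two 3x3 convolutions, each followed by batch normalization and ReLU."""
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

# One encoder stage: a conv block followed by 2x2 max pooling,
# which halves the spatial dimensions before the next stage.
# e1 = conv_block(inputs, 64)
# p1 = layers.MaxPooling2D(pool_size=2)(e1)
```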

3.2. Attention Gate with a Bigger Receptive Field Block

This layer is integrated into the decoder component of the standard U-Net architecture. By applying the attention mechanism through an attention gate, greater focus is placed on important information via weighting operations at each encoder stage. This is applied along with upsampling to combine the feature maps of the higher levels, ensuring more accurate and detailed segmentation. Figure 4 depicts the attention gate of the proposed network.
We designed the attention gate as the core module for crack detection in low-resolution images. The module generates attention weights by passing the incoming features through two 1 × 1 convolutional layers and a bigger receptive field block (BRFB). The f_prev layer takes the feature map from the immediately preceding layer as input, while g_skip and f_skip take features from the corresponding encoder layer through skip connections; the two differ in the receptive field sizes of the features they extract. To account for irregular crack features, we minimize feature loss by bringing features from both small and large receptive fields out of the encoder layers and summing them, which focuses the network on the features that require attention. Within the module, all three layers generate feature maps whose channel counts match that of the corresponding encoder layer. Details of the bigger receptive field block are depicted in Figure 5.
Features introduced via the skip connection in the f_skip layer have a limited receptive field compared with those of the previous layer. In the BRFB, we therefore adopt parallel 7 × 7, 5 × 5, and 3 × 3 convolutional layers to extract wide contextual information, allowing the block to recognize more complex patterns and extract features from a wider receptive field. By accounting for the irregularity and varying sizes of cracks, the proposed architecture can learn deeper features. The computed feature maps are summed, followed by the ReLU activation function. The objective functions of the proposed attention gate can be expressed as follows:
$$f_{prev}(x) = r(W_x \cdot x + b_x)$$

$$f_{skip}(y) = r(W_y \cdot y + b_y)$$

$$g_{skip}(z) = r(W_z \cdot z + b_z)$$

$$\alpha = \sigma\big(W_{\psi} \cdot r\big(f_{prev}(x) + f_{skip}(y) + g_{skip}(z)\big) + b_{\psi}\big)$$
where x represents the features of the previous layer; y and z represent the features taken from the corresponding encoder layer; r denotes the ReLU activation; σ denotes the sigmoid function; and α is the resulting attention weight map. The ReLU activation ensures that only positive values remain, with negative values set to 0. After a 1 × 1 convolution and batch normalization, the sigmoid activation is applied to obtain the final attention weights. These weights, ranging from 0 to 1, indicate the importance of each pixel. Finally, the attention weights are multiplied by the output of the skip connection layer. Our experimental results demonstrate improved detection of edges and fine cracks in the decoder layers. Furthermore, the proposed attention block minimizes feature loss, allowing more accurate target localization. Overall, we achieve multi-feature fusion by combining receptive field features of diverse sizes across layers, thereby enhancing crack detection performance.
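The following sketch shows one way to realize this gate in Keras. The exact wiring of f_prev, f_skip, and g_skip is our reading of Figures 4 and 5 and the equations above, so it should be treated as an illustration rather than the authors' reference implementation.

```python
from tensorflow.keras import layers

def brfb(x, filters):
    """Bigger receptive field block: parallel 7x7, 5x5, and 3x3 convolutions, summed."""
    branches = [layers.Conv2D(filters, k, padding="same")(x) for k in (7, 5, 3)]
    return layers.ReLU()(layers.Add()(branches))

def attention_gate(f_prev, skip, filters):
    """Weights the encoder skip features by attention computed from three branches.

    f_prev is assumed to be upsampled to the skip's spatial resolution beforehand.
    """
    x = layers.Conv2D(filters, 1, padding="same")(f_prev)  # f_prev: previous decoder layer
    y = layers.Conv2D(filters, 1, padding="same")(skip)    # f_skip: small receptive field
    z = brfb(skip, filters)                                # g_skip: large receptive field
    s = layers.ReLU()(layers.Add()([x, y, z]))             # sum, then ReLU
    a = layers.Conv2D(1, 1, padding="same")(s)             # 1x1 convolution
    a = layers.BatchNormalization()(a)
    a = layers.Activation("sigmoid")(a)                    # attention weights in [0, 1]
    return layers.Multiply()([skip, a])                    # re-weight the skip features
```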

3.3. Loss Function

We adopted Binary Cross-Entropy (BCE) as the loss function for training. This function selects the class that best describes the input and adjusts the parameters to values closer to the ground-truth probabilities. The formula for BCE is as follows:
$$\mathrm{BCE}(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\,\right]$$
Here, y represents the actual binary labels of the input data, ŷ denotes the model's predicted probabilities between 0 and 1, and N indicates the number of samples; the total loss sums the terms for labels of 1 and 0. This loss penalizes confident misclassifications in both directions (false positives and false negatives), encouraging the model to produce calibrated probabilities. Moreover, the function is differentiable, making it suitable for training neural networks with gradient-based optimizers such as stochastic gradient descent, and it is specialized for binary classification, which fits crack detection, where each pixel belongs to one of two classes.
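A direct sketch of this loss is given below; the clipping constant is a standard numerical-stability assumption, and the function is equivalent to tf.keras.losses.BinaryCrossentropy for probability inputs.

```python
import tensorflow as tf

def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; y_pred holds sigmoid probabilities in (0, 1)."""
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)  # avoid log(0)
    per_pixel = -(y_true * tf.math.log(y_pred)
                  + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
    return tf.reduce_mean(per_pixel)  # average over all N samples (pixels)
```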

4. Analysis of Experimental Results

The Crack500 [24] dataset consists of 3368 crack images divided into subsets of 1896, 348, and 1124 images for training, validation, and testing, respectively. Among these, 352 images were duplicates; nonetheless, all 3368 images, including the duplicates, were used. The dataset was captured with a mobile phone camera on the main campus of Temple University. Each image was divided into 16 non-overlapping patches, and only regions containing more than 1000 crack pixels were retained. The input images were resized using nearest-neighbor interpolation, and training was conducted at each resized resolution. Detailed information on the dataset is presented in Figure 6.
The following settings were adopted throughout the experiments. Training was conducted for 100 epochs with an initial learning rate of 2 × 10−3, and a dynamic learning-rate schedule was applied during training. The batch size was set to 16, and Adam [24] was used as the optimizer. Table 1 lists the hardware used in the experiments.
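A sketch of this setup is shown below. The paper does not name the exact dynamic learning-rate scheme, so ReduceLROnPlateau is shown as one plausible choice, and the dataset-pipeline names (train_ds, val_ds) are illustrative.

```python
import tensorflow as tf

def preprocess(image, mask, size=(128, 128)):
    """Nearest-neighbor resize for both image and mask, keeping masks binary."""
    image = tf.image.resize(image, size, method="nearest")
    mask = tf.image.resize(mask, size, method="nearest")
    return image, mask

optimizer = tf.keras.optimizers.Adam(learning_rate=2e-3)  # initial LR from Section 4
lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=5)           # assumed dynamic LR scheme

# model.compile(optimizer=optimizer, loss="binary_crossentropy")
# model.fit(train_ds.map(preprocess).batch(16),
#           validation_data=val_ds.map(preprocess).batch(16),
#           epochs=100, callbacks=[lr_schedule])
```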
The evaluation metrics were based on a confusion matrix. We adopted the F-1 score, the harmonic mean of precision and recall, along with the mean intersection over union (mIoU). The F-1 score balances precision and recall, mitigating the issues that arise when either metric is considered alone.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{F\text{-}1\ Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The mIoU represents the proportion of overlap between the predicted and ground-truth regions, evaluating how accurately objects are detected. It is obtained by averaging the IoU over N samples:

$$\mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{IoU}_i$$
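For reference, these metrics can be computed from pixel-wise counts as in the sketch below. Binary masks are assumed, and averaging the IoU of the crack and background classes is one common mIoU convention, not necessarily the exact one used in the paper.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Pixel-wise TP/FP/FN/TN for binary masks (1 = crack, 0 = background)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp, fp, fn, tn

def f1_and_miou(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou_crack = tp / (tp + fp + fn)      # IoU of the crack class
    iou_bg = tn / (tn + fp + fn)         # IoU of the background class
    return f1, (iou_crack + iou_bg) / 2.0
```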
The confusion matrix for the training results, shown in Figure 7, indicates that while the segmentation of areas without cracks was highly accurate, there was room for improvement in accurately segmenting the crack areas. In particular, incorrect classification was observed in fine cracks, as well as in distinguishing cracks from dark areas such as shadows.
As shown in Table 2, the proposed network performed better when low-resolution images from the Crack500 dataset were used as input. The mIoU of the model trained with 128 × 128 inputs exceeded that of the model trained with 512 × 512 inputs by approximately 0.18, and overall, models with lower-resolution inputs exhibited higher mIoU values. Figure 8 presents prediction results by input image size: both test images were predicted close to the ground truth at the 64 × 64 and 128 × 128 resolutions, whereas at higher resolutions the model failed to accurately delineate crack boundaries and tended to misidentify non-crack areas as cracks. To obtain objective results, comparative experiments were conducted with several models. First, we compared the prediction results of FCN-8s [25], U-Net [26], and our network, as shown in Figure 9 and Figure 10.
The comparison of the prediction results among the three models demonstrated the proposed network’s relative strength in detailed detection. Although the network did not perfectly capture the connectivity present within fine cracks, it demonstrated a more sensitive response than the other models. Table 3 summarizes the comparative results between our network and those proposed by Lau et al. [7], Nguyen et al. [9], Yu et al. [10], Di et al. [12], and Katsamenis et al. [27].
From these results, we can observe that the proposed method significantly outperformed the other networks in terms of mIoU. Specifically, it achieved an F-1 score of 0.7436 and an mIoU of 0.7752. Notably, performance was higher when the BRFB module was used, demonstrating the effectiveness of the BRFB in conjunction with the attention gate designed to mitigate feature loss. The substantial difference in mIoU, exceeding 0.2 in some cases, confirms a considerable overlap between the predicted areas and the actual regions of interest across the test images. Table 4 compares the results with and without the BRFB module, showing that the module yields consistent improvements across the metrics and aids crack detection.
Table 5 shows the results of training with different batch sizes, with the best performance observed for a batch size of 16. The network incorporating the BRFB module ultimately achieved the highest mIoU, demonstrating its ability to effectively learn and segment key features even for low-resolution images.

5. Conclusions

In this study, we designed a module that enhances the attention block to minimize feature loss in existing autoencoder-style architectures. By combining a skip connection, which reduces loss in both the encoder and decoder, with the BRFB, which learns features from diverse receptive fields, we effectively extracted features from low-resolution images while minimizing loss and accurately localizing them. We conducted a comparative analysis against existing models on the Crack500 dataset, where the proposed architecture achieved the highest mIoU among methods using the same dataset, demonstrating the superior performance of our approach. Furthermore, in comparative experiments with high-resolution images, our method performed better when low-resolution images were used as input. However, the proposed method has limitations when handling very fine or extensive cracks: although the combination reduces feature loss, it may not entirely prevent the loss of some crucial feature points. In future research, the following directions should be pursued:
First, we intend to evaluate our model’s generalizability by training it on larger and more diverse datasets. We plan to collect publicly available datasets from various sources and merge them to construct a more diverse overall dataset of crack images. This will allow the model to be trained on a variety of crack patterns and environments.
Second, we intend to optimize the network architecture and training process to further enhance performance. Specifically, we aim to find ways to reduce the number of training parameters without compromising performance.
Furthermore, recent advances in semantic segmentation have led to the introduction of a series of new loss functions. It may be possible to incorporate these factors into the proposed network to further improve its performance.
This work is based on the author's thesis submitted to Chosun University [28].

Author Contributions

Conceptualization, J.-H.K. and H.-D.Y.; methodology, J.-H.K., J.-H.N. and H.-D.Y.; software, J.-H.K., J.-H.N. and J.-Y.J.; validation, J.-H.K. and J.-Y.J.; formal analysis, J.-H.K.; data curation, J.-Y.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work is the result of research supported by the Ministry of Education and the National Research Foundation of Korea under the 3rd Stage Leading University Project for Industry-Academia Collaboration (LINC 3.0).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data can be accessed at https://github.com/fyangneil/pavement-crack-detection, accessed on 25 November 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tighe, S.; Li, N.Y.; Falls, L.C.; Haas, R. Incorporating road safety into pavement management. Transp. Res. Rec. 2000, 1699, 1–10. [Google Scholar] [CrossRef]
  2. Tang, F.; Ma, T.; Guan, Y.; Zhang, Z. Quantitative analysis and visual presentation of segregation in asphalt mixture based on image processing and BIM. Autom. Constr. 2021, 121, 103461. [Google Scholar] [CrossRef]
  3. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  4. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  6. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  7. Lau, S.L.H.; Chong, E.K.P.; Yang, X.; Wang, X. Automated pavement crack segmentation using u-net-based convolutional neural network. IEEE Access 2020, 8, 114892–114899. [Google Scholar] [CrossRef]
  8. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  9. Nguyen, N.T.H.; Tran, Q.L.; Do, H.N.; Choi, H.J. Pavement crack detection using convolutional neural network. In Proceedings of the 9th International Symposium on Information and Communication Technology, Da Nang, Vietnam, 6–7 December 2018; pp. 251–256. [Google Scholar]
  10. Yu, G.; Li, W.; Chen, X.; Zhang, Y. RUC-Net: A residual-unet-based convolutional neural network for pixel-level pavement crack segmentation. Sensors 2022, 23, 53. [Google Scholar] [CrossRef] [PubMed]
  11. Wang, W.; Su, C. Convolutional neural network-based pavement crack segmentation using pyramid attention network. IEEE Access 2020, 8, 206548–206558. [Google Scholar] [CrossRef]
  12. Di Benedetto, A.; Fiani, M.; Gujski, L.M. U-Net-based CNN architecture for road crack segmentation. Infrastructures 2023, 8, 90. [Google Scholar] [CrossRef]
  13. Choi, W.; Cha, Y.-J. SDDNet: Real-time crack segmentation. IEEE Trans. Ind. Electron. 2019, 67, 8016–8025. [Google Scholar] [CrossRef]
  14. Kang, D.H.; Cha, Y.-J. Efficient attention-based deep encoder and decoder for automatic crack segmentation. Struct. Health Monit. 2022, 21, 2190–2205. [Google Scholar] [CrossRef] [PubMed]
  15. Chun, C.; Ryu, S.-K. Road surface damage detection using fully convolutional neural networks and semi-supervised learning. Sensors 2019, 19, 5501. [Google Scholar] [CrossRef] [PubMed]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  17. Lin, F.; Zhang, C.; Yang, Y.; Wang, X. Crack semantic segmentation using the U-Net with full attention strategy. arXiv 2021, arXiv:2104.14586. [Google Scholar]
  18. Wang, W.; Su, C. Automatic concrete crack segmentation model based on transformer. Autom. Constr. 2022, 139, 104275. [Google Scholar] [CrossRef]
  19. Sun, X.; Zhang, J.; Yan, X.; Zhang, H. DMA-Net: DeepLab with multi-scale attention for pavement crack segmentation. IEEE Trans. Intell. Transp. Syst. 2022, 23, 18392–18403. [Google Scholar] [CrossRef]
  20. Tao, H.; Li, Q.; Zhou, W.; Li, H. A convolutional-transformer network for crack segmentation with boundary awareness. In Proceedings of the IEEE International Conference on Image Processing, Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 86–90. [Google Scholar]
  21. Dong, Z.; Mao, Q.; Wang, Y.; Chen, B.; Tang, L. Patch-based weakly supervised semantic segmentation network for crack detection. Constr. Build. Mater. 2020, 258, 120291. [Google Scholar] [CrossRef]
  22. Moon, J.-H.; Lee, S.-H.; Kim, J.-H.; Kim, J.-T. PCTC-Net: A Crack Segmentation Network with Parallel Dual Encoder Network Fusing Pre-Conv-Based Transformers and Convolutional Neural Networks. Sensors 2024, 24, 1467. [Google Scholar] [CrossRef] [PubMed]
  23. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  24. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  25. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  26. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  27. Katsamenis, I.; Protopapadakis, E.; Doulamis, A.; Doulamis, N. A Few-Shot Attention Recurrent Residual U-Net for Crack Segmentation. In Proceedings of the International Symposium on Visual Computing, Lake Tahoe, NV, USA, 16–18 October 2023; Springer: Cham, Switzerland, 2023; pp. 199–209. [Google Scholar]
  28. Kim, J. Crack Segmentation for Low Resolution Image Based on Attention U-Net. Master’s Thesis, Chosun University, Gwangju, Republic of Korea, 2024. [Google Scholar]
Figure 1. Residual block example.
Figure 2. Convolution block.
Figure 3. Illustration of MSP U-Net.
Figure 4. Details of proposed attention gate.
Figure 5. Bigger receptive field block.
Figure 6. Crack500 dataset samples.
Figure 7. Confusion matrix of the training results.
Figure 8. Segmentation results by input image size.
Figure 9. Segmentation result 1. (a) input image, (b) ground truth, (c) ours, (d) U-Net, (e) FCN-8s.
Figure 10. Segmentation result 2. (a) input image, (b) ground truth, (c) ours, (d) U-Net, (e) FCN-8s.
Table 1. Experimental hardware.

CPU: AMD EPYC 2713 | Memory: 512 GB | GPU: RTX A6000
OS: Ubuntu 20.04 | Python: 3.8 | TensorFlow: 2.5.0
Table 2. Training results by input size.

Input Image Size | Precision | Recall | F-1 Score | mIoU
64 × 64 | 0.7092 | 0.6832 | 0.6959 | 0.7495
128 × 128 | 0.7561 | 0.7315 | 0.7436 | 0.7752
256 × 256 | 0.7352 | 0.6926 | 0.7133 | 0.7170
512 × 512 | 0.7310 | 0.6883 | 0.7090 | 0.5962
Table 3. Comparison with other studies (Crack500).

Method | Precision | Recall | F-1 Score | mIoU
U-Net by Lau [7] | 0.7426 | 0.7285 | 0.7327 | 0.5782
U-Net by Nguyen [9] | 0.6954 | 0.6744 | 0.6895 | 0.5261
U-Net by Yu [10] | 0.6988 | 0.7619 | 0.7290 | 0.5736
U-Net by Di Benedetto [12] | 0.8534 | 0.6813 | 0.7327 | 0.6248
FCN-8s [25] | 0.7031 | 0.6172 | 0.6574 | 0.6898
U-Net [26] | 0.7207 | 0.6401 | 0.6779 | 0.6852
Attention U-Net [27] | - | - | 0.7706 | 0.6311
Ours | 0.7561 | 0.7315 | 0.7436 | 0.7752
Table 4. Comparison using the BRFB module (Crack500).

Method | Precision | Recall | F-1 Score | mIoU
Without BRFB (Attention U-Net) | 0.7195 | 0.7033 | 0.7113 | 0.7183
With BRFB | 0.7561 | 0.7315 | 0.7436 | 0.7752
Table 5. Comparison of training results by batch size (input size 128 × 128).

Batch Size | Precision | Recall | F-1 Score | mIoU
4 | 0.6884 | 0.6746 | 0.6814 | 0.7398
8 | 0.7218 | 0.6448 | 0.6811 | 0.7404
16 | 0.7561 | 0.7315 | 0.7436 | 0.7752
32 | 0.7128 | 0.7087 | 0.7107 | 0.7585
64 | 0.7199 | 0.6472 | 0.6816 | 0.7406
