This study developed its algorithms in the Python 3.8.20 environment. It used PaddlePaddle-GPU 2.4.2 as the deep learning core framework and combined it with the PaddleSeg 2.7.0 semantic segmentation development kit to complete model construction and training.
In data processing and experiments, the following libraries were used: OpenCV-python (4.12.0.88) and Pillow (10.4.0) for image reading, enhancement, and preprocessing; NumPy (1.24.4) for matrix operations; Pandas (2.0.3) for label and data management; VisualDL (2.5.3) for real-time monitoring of the loss and evaluation metrics during training; Matplotlib (3.7.5) for drawing result comparison charts; Scikit-learn (1.3.2) and SciPy (1.10.1) for statistical analysis of the evaluation metrics; and tqdm (4.67.1) for displaying training progress.
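To reproduce this environment, the reported versions can be compared against what is installed. The following is a minimal sketch using only the standard library; the distribution names are the usual PyPI names and are an assumption here, as are the helper names `installed_version` and `find_mismatches`.

```python
from importlib import metadata

# Versions reported in the text. The distribution names are the usual PyPI
# names and are an assumption of this sketch.
PINNED = {
    "paddlepaddle-gpu": "2.4.2",
    "paddleseg": "2.7.0",
    "opencv-python": "4.12.0.88",
    "Pillow": "10.4.0",
    "numpy": "1.24.4",
    "pandas": "2.0.3",
    "visualdl": "2.5.3",
    "matplotlib": "3.7.5",
    "scikit-learn": "1.3.2",
    "scipy": "1.10.1",
    "tqdm": "4.67.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result from `find_mismatches(PINNED)` would indicate that the local environment matches the versions listed above.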
3.1. Model Establishment
- (1) Model Selection
To select the most suitable semantic segmentation model for this study, the classic U-Net, based on convolutional neural networks, and SegFormer, based on the Transformer architecture, were used as benchmark models in comparative experiments. To ensure the fairness and reliability of the results, all experiments were conducted with the same dataset, training strategy, and hardware environment. The quantitative evaluation results of each model are shown in Table 1.
The experimental results show that DeepLabv3 achieved the best performance on all three evaluation metrics. Compared with the classic U-Net model, the mean intersection over union (mIoU) and pixel accuracy (Acc) of DeepLabv3 increased by 0.0499 and 0.0568, respectively, and the Kappa coefficient showed a relative improvement of approximately 17.88%. Compared with the more recent SegFormer model, DeepLabv3 remained ahead: its mIoU, Acc, and Kappa were 0.0085, 0.0258, and 0.0222 higher, respectively.
The comparison results indicate that DeepLabv3 demonstrates superior performance in feature extraction and in capturing multi-scale contextual information. DeepLabv3 achieves higher segmentation accuracy and overall reliability compared to U-Net and SegFormer. Therefore, it was reasonable and beneficial to choose DeepLabv3 as the core model for this study.
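The comparison reports absolute gains for mIoU and Acc but a relative gain for the Kappa coefficient. As a hedged illustration of how these quantities are usually obtained, the sketch below computes Cohen's kappa from a confusion matrix and a relative gain between two scores. The function names and the toy matrix are assumptions; the values in Table 1 are not reproduced here.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / total
    # Chance agreement: product of row and column marginals.
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    return (observed - expected) / (1.0 - expected)


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old


# Toy two-class example (hypothetical counts).
toy = [[45, 5],
       [10, 40]]
kappa = cohen_kappa(toy)
```

A relative gain is sensitive to the size of the baseline, so it is not directly comparable to the absolute differences reported for mIoU and Acc.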
Based on the DeepLabv3 model [16], this study develops a Convolutional Neural Network (CNN)-based semantic segmentation model [17].
Figure 6 shows the logic flow of this model. This model exhibits high-precision segmentation capability and can effectively identify damage ranging from large-scale peeling to fine cracks.
Note: Feature extraction layer (DCNN with atrous convolution): The input mural image first enters the deep convolutional neural network (DCNN). To expand the receptive field without losing spatial resolution, the model uses atrous (dilated) convolution. This allows the model to obtain denser feature responses, which is crucial for identifying subtle cracks or small areas of peeling in the mural.
The Atrous Spatial Pyramid Pooling (ASPP) module is crucial for recognizing multi-scale damage. This module contains five parallel branches: a 1 × 1 standard convolution layer; three 3 × 3 atrous convolution layers with sampling rates of r = 6, 12, and 18; and an image-level pooling layer to obtain global statistical information to enhance the model’s understanding of the complex mural background.
Feature fusion and dimensionality reduction (Concat and 1 × 1 Conv): The feature maps produced by each branch of the ASPP are concatenated along the channel dimension (Concat). Subsequently, a 1 × 1 convolution is used to fuse features and adjust the number of channels, reducing computational complexity and integrating multi-scale information.
Upsampling and output (Upsample by 4): The fused feature layer is upsampled by a factor of four using bilinear interpolation to restore the feature map to the original image size.
Damage prediction result: The final output layer assigns a category label to each pixel, generating a color damage identification mask consistent with the original image size.
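To make the role of the ASPP sampling rates concrete, the following sketch computes how far a 3 × 3 atrous kernel reaches for r = 6, 12, and 18. It uses the standard relation k_eff = k + (k − 1)(r − 1); the function names are illustrative, and the spans are measured in feature-map pixels rather than input-image pixels.

```python
def effective_kernel_size(kernel_size, rate):
    """Span covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def tap_offsets(kernel_size, rate):
    """Offsets of the kernel taps along one axis, relative to the center."""
    half = kernel_size // 2
    return [(i - half) * rate for i in range(kernel_size)]


# Sampling rates of the three atrous branches described above.
for rate in (6, 12, 18):
    span = effective_kernel_size(3, rate)
    print(f"rate={rate}: taps at {tap_offsets(3, rate)}, span {span} x {span}")
```

Each branch still uses nine weights, so the wider context comes without additional parameters per kernel; this is why the parallel branches can cover both fine cracks and larger peeled regions.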
- (2) Dataset Creation
To maintain high resolution, enable effective feature learning under limited video memory constraints, and expand the sample size simultaneously to enhance training stability and accuracy, we used a custom script to segment the original images into 768 × 768 non-overlapping patches. After completing the division, a total of 1344 local images were obtained. Each local image is distinct and non-overlapping. The reasons for selecting the segmentation size are as follows: if the size is too small (e.g., 512 × 512), the contextual semantic information would be insufficient, hindering the model’s understanding of the macroscopic damage distribution; if the size is too large (e.g., 1024 × 1024), it would increase the memory load and decrease the training efficiency.
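The custom tiling script itself is not given in the text. The following is a minimal sketch of how non-overlapping 768 × 768 crop boxes could be generated. It assumes that partial tiles at the right and bottom edges are discarded, which the text does not state; `tile_boxes` is a hypothetical name.

```python
def tile_boxes(width, height, tile=768):
    """Yield (left, upper, right, lower) boxes of non-overlapping tiles.

    Assumption: partial tiles at the right and bottom edges are skipped.
    """
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            yield (left, top, left + tile, top + tile)


# Example with a small hypothetical image of 1600 x 800 pixels.
boxes = list(tile_boxes(1600, 800))
```

The boxes use the (left, upper, right, lower) order, so they could be passed to a crop routine such as Pillow's `Image.crop`. Padding the edges instead of discarding them would be an equally plausible design and would change the tile count.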
Following general experience [18], which indicates that over 70% of the data should be dedicated to training, the dataset was split at an 8:1:1 ratio. This splitting yielded a training set of 1075 images, a validation set of 134 images, and a test set of 135 images. The training, validation, and test sets were derived from different original images.
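The reported subset sizes are consistent with rounding the 80% and 10% shares of 1344 down and assigning the remainder to the test set. The sketch below shows that arithmetic with a seeded shuffle. The patch-level shuffle is an assumption made for illustration; because the text states that the subsets came from different original images, the actual split may have grouped patches by source image instead.

```python
import random


def split_indices(n_items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split item indices into train/val/test lists.

    Train and validation sizes are rounded down; the test set takes the rest.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(n_items * ratios[0])
    n_val = int(n_items * ratios[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(1344)
print(len(train), len(val), len(test))  # 1075 134 135
```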
- (3) Data Labeling
Precise labeling is essential to ensure that the model learns visual features accurately. The damage boundaries were manually annotated point by point using the “Polygon” tool in “Labelme”. Four categories were labeled: crack, brick_damage, peel_off, and the interfering category brick_joint, as shown in Figure 7.
To ensure the reproducibility of scientific experiments and dataset standardization, this study formulated strict annotation protocols.
First, all annotation work was carried out directly on the 768 × 768-pixel local images after segmentation, rather than on the ultra-high-resolution original images. This ensured that the annotation coordinates were strictly aligned with the resolution of the model training input, thereby avoiding boundary deformation caused by scaling.
Second, the annotation principles for each category follow the damage types and characteristics described in Section 2.1.
Third, when categories overlapped, the principle of prioritizing damage over structure was applied. If cracks or peeling extended into the brick-joint area, the overlapping part was uniformly annotated as the corresponding damage category (crack or peel_off) rather than as brick_joint. In transitional regions with extremely blurred boundaries, annotators delineated only the core areas where damage had been visually verified, thereby avoiding subjective over-extension. In the collected dataset, the vast majority of mural damage displayed distinct, sharp boundaries, which makes it well suited to pixel-level semantic segmentation. A small number of cases with blurred boundaries exist, but their proportion is negligible and has minimal impact on the overall training and evaluation of the segmentation model.
Fourth, segmenting the images into local parts significantly clarifies the range of fine cracks, facilitating more accurate annotation.
Finally, the annotation process was completed by one person and then cross-checked by another. The cross-check focused on detecting missed annotations of fine cracks and label-conflict areas. For disputed boundaries, the final mask outline was determined through joint discussions.
To ensure the objectivity and reproducibility of the dataset, a quantitative assessment of the annotation consistency was carried out. Specifically, 5% of the dataset samples were randomly selected, and another researcher was asked to re-label them independently in accordance with the same annotation criteria. Subsequently, the Dice Similarity Coefficient (DSC) was employed to measure the consistency between the two sets of annotations. The analysis showed an average DSC of 0.8039, which suggested a high level of annotation consistency. This finding further verified that the proposed annotation guidelines could successfully mitigate subjective bias in complex situations.
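For reference, the Dice Similarity Coefficient between two label masks can be computed per class as 2|A ∩ B| / (|A| + |B|). The sketch below is a plain-Python illustration on nested lists; how the reported average was aggregated over classes and samples is not stated in the text, so only the per-class calculation is shown. The function name and toy masks are assumptions.

```python
def dice_coefficient(mask_a, mask_b, label):
    """Dice similarity for one class between two label masks of equal shape."""
    count_a = count_b = count_both = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            in_a = pixel_a == label
            in_b = pixel_b == label
            count_a += in_a
            count_b += in_b
            count_both += in_a and in_b
    if count_a + count_b == 0:
        return 1.0  # class absent from both masks: treated as full agreement
    return 2.0 * count_both / (count_a + count_b)


# Two toy annotations of the same 2 x 3 patch (hypothetical).
first = [[1, 1, 0],
         [0, 1, 0]]
second = [[1, 0, 0],
          [0, 1, 1]]
score = dice_coefficient(first, second, label=1)
```

In a real pipeline the masks would typically be NumPy arrays and the comparison would be vectorized; the nested-list form is used here only to keep the example dependency-free.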
After labeling, the JSON files generated by Labelme were converted into semantic segmentation mask images using a conversion script, thereby creating a dataset ready for training.
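The conversion script is not included in the text. As a rough sketch of what such a conversion involves, the code below rasterizes polygon shapes from a Labelme-style JSON record (fields `imageHeight`, `imageWidth`, and `shapes` with `label`, `shape_type`, and `points`) into a label mask. Painting brick_joint before the damage classes is one way to implement the damage-over-structure rule. The class ids, the paint order argument, and the function names are placeholders, not the study's actual settings.

```python
import json
import math


def point_in_polygon(x, y, points):
    """Even-odd ray casting test for a point against a polygon."""
    inside = False
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def labelme_to_mask(annotation, label_ids, paint_order):
    """Rasterize polygons into a mask; later labels overwrite earlier ones."""
    height = annotation["imageHeight"]
    width = annotation["imageWidth"]
    mask = [[0] * width for _ in range(height)]  # 0 = background
    for label in paint_order:
        for shape in annotation["shapes"]:
            if shape["label"] != label or shape.get("shape_type") != "polygon":
                continue
            points = shape["points"]
            xs = [p[0] for p in points]
            ys = [p[1] for p in points]
            # Only visit pixels inside the polygon's bounding box.
            for row in range(max(0, math.floor(min(ys))), min(height, math.ceil(max(ys)))):
                for col in range(max(0, math.floor(min(xs))), min(width, math.ceil(max(xs)))):
                    if point_in_polygon(col + 0.5, row + 0.5, points):
                        mask[row][col] = label_ids[label]
    return mask


example = json.loads("""
{"imageHeight": 4, "imageWidth": 4, "shapes": [
  {"label": "crack", "shape_type": "polygon", "points": [[1, 0], [2, 0], [2, 4], [1, 4]]},
  {"label": "brick_joint", "shape_type": "polygon", "points": [[0, 1], [4, 1], [4, 3], [0, 3]]}
]}
""")
ids = {"crack": 1, "brick_joint": 2}  # placeholder ids, not the study's values
mask = labelme_to_mask(example, ids, paint_order=["brick_joint", "crack"])
```

In the toy record, the crack polygon crosses the brick joint, and the overlapping pixels end up labeled as crack because crack is painted last.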
- (4) Model Settings and Training
The model was configured with five classes (four target categories plus background). In this study, the background category is defined as the undamaged, intact areas of the mural. The class values are assigned as follows: “back_ground” is 0, “crack” is 2, “brick_joint” is 3, “peel_off” is 4, and “brick_damage” is 5. To determine how much the “brick_joint” interference category contributes to the overall performance of the model, an ablation experiment was conducted. The performance indicators with and without this category are shown in Table 2.
The experimental results show that removing the “brick_joint” category significantly decreased the model’s overall performance, which demonstrates the necessity of including it. Retaining this category helps the model comprehend the overall image structure and maintain favorable overall indicators.
The network is based on the DeepLabv3 architecture, with ResNet50_vd as the backbone. ReLU was used as the activation function throughout the network. Spatial features were primarily extracted using 3 × 3 convolution kernels (including dilated convolutions), supplemented by 1 × 1 convolutions for channel projection. To enhance the detection of fine damage (e.g., linear cracks), the output stride was set to 8, thereby maintaining a higher spatial resolution of the feature maps.
During training, empirical hyperparameters were adopted instead of automatic optimization. The training parameters included a batch size of 2, a total of 54,000 iterations, and a learning rate of 0.01. Optimization was carried out using Stochastic Gradient Descent (SGD) with a momentum of 0.9. CrossEntropyLoss was selected as the loss function to quantify the difference between the predicted probability distribution and the true labels. For each pixel in the image, the loss function is as follows:
$$L = -\sum_{c=1}^{M} y_c \log(p_c)$$
where $M$ represents the total number of damage categories, $y_c$ is a binary indicator variable (1 if the pixel belongs to category $c$, 0 otherwise), and $p_c$ is the probability that the model predicts the pixel to belong to category $c$. This loss function encourages the model to assign a higher probability to the correct damage type.
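As a small worked example of the per-pixel cross-entropy described here, the sketch below sums the indicator-weighted log-probabilities over the categories. With a one-hot indicator, the sum reduces to the negative log-probability of the true class. The function name, the clipping constant, and the toy probabilities are assumptions.

```python
import math


def pixel_cross_entropy(probs, true_class, eps=1e-12):
    """Cross-entropy for one pixel given class probabilities.

    `probs[c]` plays the role of p_c; the indicator y_c is 1 only for
    `true_class`. Probabilities are clipped at `eps` to avoid log(0).
    """
    loss = 0.0
    for c, p in enumerate(probs):
        y = 1.0 if c == true_class else 0.0
        loss -= y * math.log(max(p, eps))
    return loss


confident = pixel_cross_entropy([0.7, 0.2, 0.1], true_class=0)
uncertain = pixel_cross_entropy([0.4, 0.4, 0.2], true_class=0)
```

Here `confident` is smaller than `uncertain`, which matches the statement that the loss rewards higher probability on the correct category.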
Furthermore, data augmentation techniques, such as random scaling, cropping, horizontal flipping, and adjustments to brightness, contrast, and saturation, were applied. After completing the above configuration, the model training was initiated.
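The iteration budget can be related to epochs with simple arithmetic. Assuming a single device and that every iteration consumes one full batch from the 1075-image training set (both assumptions), 54,000 iterations at a batch size of 2 correspond to roughly 100 passes over the training data, which is in line with the epoch scale mentioned for the training curves.

```python
def iterations_to_epochs(iterations, batch_size, n_train):
    """Approximate number of passes over the training set."""
    return iterations * batch_size / n_train


epochs = iterations_to_epochs(iterations=54_000, batch_size=2, n_train=1075)
print(f"{epochs:.1f} epochs")  # 100.5 epochs
```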
3.2. Model Prediction Results
To objectively evaluate the DeepLabv3 model’s recognition accuracy for mural damage, this study introduced two key semantic segmentation evaluation indicators: pixel accuracy (Acc) and mean intersection over union (mIoU).
Acc represents the ratio of correctly classified pixels to the total number of pixels. This metric reflects the model’s overall pixel-level classification accuracy. The formula is as follows:
$$\mathrm{Acc} = \frac{\sum_{i=1}^{k} n_{ii}}{\sum_{i=1}^{k} t_i}$$
The mIoU is the most commonly used metric for evaluating semantic segmentation performance. It is the average of the intersection-over-union ratios of all classes. Compared with Acc, mIoU better measures the model’s ability to capture the boundaries of damage, particularly in murals where the damaged areas account for a small proportion of the pixels. The formula is as follows:
$$\mathrm{mIoU} = \frac{1}{k}\sum_{i=1}^{k} \frac{n_{ii}}{t_i + \sum_{j=1}^{k} n_{ji} - n_{ii}}$$
Here, $k$ represents the number of categories; $n_{ji}$ denotes the number of pixels of category $j$ that are predicted as category $i$, so that $n_{ii}$ is the number of correctly classified pixels of category $i$; and $t_i = \sum_{j=1}^{k} n_{ij}$ represents the total number of pixels in category $i$.
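Both metrics can be computed from a confusion matrix of pixel counts. The following is a minimal sketch in plain Python, where entry `n[i][j]` counts pixels of true class i predicted as class j. Skipping classes that are absent from both the labels and the predictions is an assumption of this sketch, and the function names are illustrative.

```python
def confusion_matrix(truth, pred, k):
    """Count matrix n where n[i][j] = pixels of class i predicted as class j."""
    n = [[0] * k for _ in range(k)]
    for t, p in zip(truth, pred):
        n[t][p] += 1
    return n


def pixel_accuracy(n):
    """Correctly classified pixels divided by all pixels."""
    correct = sum(n[i][i] for i in range(len(n)))
    total = sum(sum(row) for row in n)
    return correct / total


def mean_iou(n):
    """Average of per-class intersection over union."""
    k = len(n)
    ious = []
    for i in range(k):
        t_i = sum(n[i])                                # pixels labeled i
        predicted_i = sum(n[j][i] for j in range(k))   # pixels predicted i
        union = t_i + predicted_i - n[i][i]
        if union > 0:  # skip classes absent from labels and predictions
            ious.append(n[i][i] / union)
    return sum(ious) / len(ious)


# Toy flattened masks with three classes (hypothetical values).
truth = [0, 0, 0, 1, 1, 2]
pred = [0, 0, 1, 1, 1, 0]
n = confusion_matrix(truth, pred, k=3)
acc, miou = pixel_accuracy(n), mean_iou(n)
```

In this toy case the accuracy is higher than the mIoU because the single pixel of the rare class is missed entirely, which is the same effect that makes mIoU the stricter metric for thin cracks.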
After training, the model attained an mIoU of 47.8% and an Acc of 77.97% on the test set. The prediction results are presented in Figure 8.
The results indicate that the model exhibits effective detection capabilities for all three types of damage. It successfully generates segmentation masks that match the specific morphology of each damage type. In classification, the model accurately captures crack paths of different scales and effectively distinguishes them from brick joints, which look visually similar. Regarding peeling damage, although identifying areas with blurred edges is challenging, the overall localization of these areas is accurate.
Despite the relatively low mIoU value, which is constrained by the extremely small proportion of crack-occupied pixels, the visualization in Figure 8 shows that the generated masks fully cover the damaged areas and exhibit good edge alignment. Therefore, the model meets the basic requirements for localizing mural damage. Consequently, the resulting segmentation maps serve as suitable masks for the subsequent virtual restoration module, validating the efficacy of the DeepLabv3-based semantic segmentation approach in mural damage detection.
Figure 9 shows the mIoU curve, Acc curve, loss curve, and learning rate (lr) curve obtained from training. The model’s mIoU increased rapidly during the early training phase, accompanied by a significant decrease in loss. After approximately 100 epochs, the mIoU of the validation set stabilized, and the losses of both the validation and training sets converged synchronously. The typical sign of overfitting, in which the training loss continues to decline while the validation loss rebounds, was not observed.
Table 3 shows the IoUs for various damage types in the training and validation sets.
Although the value of 0.5202 may seem low for standard semantic segmentation, this is mainly due to the severe imbalance of classes in the damaged mural dataset. Specifically, cracks are represented as extremely fine linear structures, typically occupying less than 1% of the image’s total pixels. This phenomenon significantly affects the calculation of the intersection-over-union (IoU), leading to a relatively low overall mIoU score. To address the model’s poor performance on imbalanced data, a combined loss function integrating dice loss and cross-entropy loss was adopted instead of the standard cross-entropy loss.
The results are shown in Table 4. After incorporating dice loss, the model’s IoU for the crack category increased by 2.87%, indicating that the combined loss helps mitigate the imbalance of the crack category.
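To illustrate why adding a dice term can help with rare classes, the sketch below combines a mean cross-entropy with a soft dice loss averaged over classes. The equal weighting, the smoothing constant, and the function names are assumptions; the text does not specify how the two terms were weighted.

```python
import math


def cross_entropy(probs, labels, eps=1e-12):
    """Mean per-pixel cross-entropy; probs[i][c] is the probability of class c."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)


def dice_loss(probs, labels, n_classes, smooth=1e-6):
    """Soft dice loss averaged over classes."""
    total = 0.0
    for c in range(n_classes):
        intersection = sum(p[c] for p, y in zip(probs, labels) if y == c)
        predicted = sum(p[c] for p in probs)
        actual = sum(1 for y in labels if y == c)
        total += 1.0 - (2.0 * intersection + smooth) / (predicted + actual + smooth)
    return total / n_classes


def combined_loss(probs, labels, n_classes, ce_weight=1.0, dice_weight=1.0):
    """Weighted sum of cross-entropy and dice loss (weights are assumptions)."""
    return (ce_weight * cross_entropy(probs, labels)
            + dice_weight * dice_loss(probs, labels, n_classes))


# Imbalanced toy case: 99 background pixels and one crack pixel, with the
# model predicting background almost everywhere (hypothetical values).
labels = [0] * 99 + [1]
probs = [[0.99, 0.01] for _ in labels]
ce_term = cross_entropy(probs, labels)
dice_term = dice_loss(probs, labels, n_classes=2)
```

In this toy case the missed crack pixel barely moves the mean cross-entropy, while the dice term stays large because each class contributes equally regardless of its pixel count.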
The steady convergence of the loss curve indicates that this model has achieved stable feature-extraction capabilities. The research results show that the model can separate the damage features from the complex background. This satisfies the accuracy requirement for using the detected damage areas as a mask for the repair module.
However, the accuracy of the model remains subject to certain limitations. The primary factors influencing the accuracy are as follows.
- (1) The diversity of the dataset is severely limited because it is derived from only four original mural images. Although the block-based segmentation method avoids direct data leakage and increases the sample size to 1344 local images, the dataset is still highly specific to a single case study. Limited exposure to different damage conditions and an unbalanced sample distribution limit the model’s robustness.
- (2) Substantial visual differences exist between damage types. For example, cracks and peeling exhibit significant variations in morphology, texture, and color. These differences pose challenges to the model’s ability to generalize its recognition capabilities.
- (3) Constrained by the 8 GB video-memory capacity of the RTX A4000, the experiment could not directly process the high-resolution original images during training of the recognition model. To make training feasible, the input images were divided into 768 × 768-pixel blocks. Quantitative analysis indicated that a single image block covers merely about 0.3% of the total area of the entire mural (13,012 × 13,822). This high cropping ratio prevents the model from obtaining macroscopic structural information across regions, and it is the primary hardware constraint that impedes further improvement of recognition accuracy.
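The stated coverage can be checked directly from the block size and the mural dimensions given in the last item above. A short sketch of the arithmetic, with an illustrative function name:

```python
def patch_area_fraction(patch_side, width, height):
    """Fraction of the full image area covered by one square patch."""
    return patch_side * patch_side / (width * height)


fraction = patch_area_fraction(768, 13012, 13822)
print(f"{fraction:.2%}")  # 0.33%
```

A single block therefore sees roughly one three-hundredth of the mural, which is consistent with the figure of about 0.3% quoted in the text.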
To rigorously evaluate the model’s robustness, a four-fold, leave-one-region-out cross-validation strategy was adopted. Specifically, the 1344 slices were derived from four distinct spatial regions of the mural: the bottom, top, east, and west. In each fold, slices from three regions were combined to form the training set, while slices from the remaining, entirely unseen region were held out as the validation set.
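The fold construction described here can be sketched as follows. Each slice is assumed to carry a region tag; the tuple layout, the slice identifiers, and the function name are hypothetical.

```python
def leave_one_region_out(samples):
    """Yield (held_out_region, train_ids, val_ids) for each region.

    `samples` is a list of (slice_id, region) pairs.
    """
    regions = sorted({region for _, region in samples})
    for held_out in regions:
        train_ids = [sid for sid, region in samples if region != held_out]
        val_ids = [sid for sid, region in samples if region == held_out]
        yield held_out, train_ids, val_ids


# Toy example with two slices per region (hypothetical identifiers).
samples = [
    ("b0", "bottom"), ("b1", "bottom"),
    ("t0", "top"), ("t1", "top"),
    ("e0", "east"), ("e1", "east"),
    ("w0", "west"), ("w1", "west"),
]
folds = list(leave_one_region_out(samples))
```

Because the held-out region never contributes slices to training, each fold measures how well the model transfers to a wall area it has not seen.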
The cross-validation results are presented in Table 5. The model achieved an mIoU of 42.82% across the four distinct mural areas, and its Acc reached 73.69%. Although the model’s performance fluctuates slightly depending on the specific test area, these fluctuations reflect the inherent differences in the damage characteristics of distinct mural walls. The low standard deviation attests to the stability of the DeepLabv3-based architecture, suggesting that the model has learned general damage features instead of overfitting to specific structural areas.