Article

SmaAt-UNet Optimized by Particle Swarm Optimization (PSO): A Study on the Identification of Detachment Diseases in Ming Dynasty Temple Mural Paintings in North China

by Chuanwen Luo 1,*, Zikun Shang 1, Yan Zhang 2, Hao Pan 1, Abdusalam Nuermaimaiti 1, Chenlong Wang 1, Ning Li 2 and Bo Zhang 1,*

1 Department of Architecture, School of Architecture and Art, North China University of Technology, Jinyuanzhuang Road 5, Shijingshan District, Beijing 100144, China
2 Beijing Historical Building Protection Engineering Technology Research Center, Beijing University of Technology, Beijing 100124, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(22), 12295; https://doi.org/10.3390/app152212295
Submission received: 26 September 2025 / Revised: 17 November 2025 / Accepted: 17 November 2025 / Published: 19 November 2025

Abstract

The temple mural paintings of the Ming Dynasty in China are highly valuable cultural heritage. However, murals in North China have long faced deterioration such as pigment-layer detachment, which seriously threatens their preservation and study, gradually eroding their cultural integrity and impeding conservation decisions. This study proposes a coherent deep-learning technical paradigm: it constructs a mural dataset, compares the performance of multiple models, and optimizes the selected model to enable automatic identification of mural detachment. Five segmentation models (UNet, U2-NetP, SegNet, NestedUNet, and SmaAt-UNet) were systematically compared under the same conditions on 37,685 image slices and evaluated using four metrics: IoU, Dice, MAE, and mPA. Owing to its lightweight structure and attention-enhanced feature-extraction module, SmaAt-UNet effectively preserves mural edge details and performs best at identifying pigment-layer detachment. After introducing Particle Swarm Optimization (PSO), the IoU of the SmaAt-UNet model on the test set increased to 73.52%, the Dice increased to 79.36%, and the mPA increased to 97.02%, while the MAE decreased from 0.0592 to 0.0455, an absolute reduction of 0.0137; the model's generalization ability and edge-recognition accuracy were significantly enhanced. This study constructs a systematic identification framework for pigment-layer detachment in Ming Dynasty (1368–1644 AD) temple murals, closely combining deep-learning technology with cultural heritage protection. It not only realizes the automatic identification of deteriorated areas but also provides technical support for preventive conservation and the construction of digital archives.

1. Introduction

As precious carriers that integrate religious art and secular culture, ancient Chinese temple mural paintings possess not only very high artistic and esthetic value but also serve as crucial material evidence for studying social forms, the development of religions, and the evolution of craftsmanship techniques [1]. However, these murals, which have endured hundreds of years of wind and rain erosion, are generally afflicted by various types of deterioration resulting from long-term exposure to natural environments and human disturbances; among these, mural detachment is the most common and damaging type. Multiple factors collectively induce this deterioration, specifically natural weathering, periodic temperature–humidity variations, microbial corrosion, and human-related disturbances over time [2,3,4,5]. Concurrently, it is often accompanied by other degradation manifestations, including crack growth, water penetration, and surface pollution [6,7,8]. Currently, mural-deterioration detection mainly relies on visual observation and manual recording by cultural-heritage conservation experts. This method is not only inefficient and time-consuming, but also highly susceptible to the subjective factors of observers, making it difficult to achieve objective quantification and early warning of damage characteristics [9]. More critically, pigment layer detachment is usually concealed in its early stages and barely detectable by the naked eye. Once large-scale visible loss occurs, the optimal restoration opportunity has often been missed [10]. Therefore, there is an urgent need to introduce advanced technical means to establish a more scientific, accurate, and efficient method for the detection and evaluation of pigment layer detachment, thereby providing reliable data support and decision-making basis for the preventive conservation of ancient Chinese temple mural paintings.
As computer technology has evolved, traditional image processing methods have found extensive application in the conservation and restoration of ancient murals. The body of research related to mural-damage detection and restoration centers on the innovation of image-processing techniques and the optimization of their performance—an emphasis that has given rise to a variety of distinct research directions, each rooted in a unique technical pathway.
During the early twenty-first century, mainstream work leveraged traditional image-processing algorithms to locate damaged mural areas and restore textures. Centered on core technologies including color-space analysis, region growing, and multispectral matching, researchers developed detection and restoration models tailored to the prevalent mural damage modalities of spalling, cracks, and scratches. Notable advances improved calibration precision for these defects, either by combining multi-dimensional gradient detection with guided filtering in the HSV color space or by refining the region-growing algorithm through spalling feature analysis in the YCbCr and HSV spaces [11,12]. Furthermore, automatic color-space matching was implemented using pigment databases built on multispectral technology, mural textures were recovered via a multi-scale edge-reconstruction-based restoration method, and crack-repair efficiency was markedly boosted with an improved self-organizing map algorithm [13,14,15]. Such traditional methods generally required substantial manual intervention and lacked robustness, making it hard for them to cope with complex and variable mural-damage scenarios.
With the advancement of electronic-information technology, traditional image processing and AI have been increasingly merged. Based on U-Net’s encoder–decoder structure and skip connections, deep-learning networks have been optimized in terms of parameters, feature capture, and scene adaptability, and extended to mural-damage tasks and related fields, producing efficient, task-tailored models. Mainstream studies showed that improved U-Net variants achieve performance breakthroughs in complex scenes: Mini-Unet balances accuracy and efficiency in tunnel crack detection via depthwise separable convolution and hybrid loss [16]. U-Net-FML boosts concrete crack segmentation (MIoU of 76.4%) with feature map segmentation and multi-path propagation. U-Net-AFFU fusion can handle faded, noisy murals [17,18]. For large-area missing murals, CNN + k-nearest-neighbor matching enables controllable restoration; for large-area irregular damage, a two-stage conditional GAN-PConv framework can restore defects without manual masks [19,20].
In recent years, mainstream studies have used deep learning's automatic feature-extraction capability to improve damage-identification accuracy and restoration efficiency in mural conservation by optimizing networks or integrating GANs. In the damage-detection field, lightweight semantic feature-based detection enhances performance in damaged or blurred scenes through adaptive random erasure and a semantic module [21]. A modified Res-UNet achieved 98.19% accuracy in crack/paint-peeling segmentation of Forbidden City color paintings [22]. A Ghost-C3SE-based lightweight YOLOv5 boosted mural-detection speed and accuracy in the Yungang Grottoes [23]. In restoration technology, innovations have focused on balancing fidelity and quality: high-fidelity restoration has been achieved via a segmented virtual method combined with U-Net-MobileNet, or via CAAT-GAN-based coordinate attention aggregation [24,25]. Optimized crack detection was achieved by embedding an attention mechanism into U-Net and combining it with optical pulse thermography [26].
Based on the aforementioned research status and existing issues, this study took the mural paintings from twelve Ming Dynasty temples in North China as the research objects. Focusing on mural detachment—the most common and damaging type of deterioration—a standardized high-precision annotated dataset was constructed (Figure 1).

2. Materials and Methods

Twelve representative Ming Dynasty temple murals in North China were chosen as the research subjects in this study. Firstly, mural images were collected to establish the data foundation for the image segmentation model for detecting mural detachment deterioration. Secondly, five image segmentation algorithms were selected and applied to the research objects and their performances were evaluated using four metrics: IoU, Dice, MAE, and mPA. Thirdly, based on the actual detected features of image elements, hyperparameters of the optimal algorithm, selected from the five tested algorithms, were optimized. This optimization aimed to further enhance the generalization capability and edge detection precision of the deep learning-based image segmentation algorithm.

2.1. Research Objects and Data Collection

The target temples and monasteries in this study are distributed across Beijing, Hebei Province, and Shanxi Province, China (Figure 2). These regions cover areas with murals that exhibit relatively good preservation, typical artistic styles, and clear manifestations of deterioration (Table 1). Among them, the temples and monasteries in Beijing and Hebei are mostly imperially commissioned structures, such as Fahai Temple in Shijingshan District, Beijing, and Zhaohua Temple in Huai’an, Hebei Province, whose murals are exquisitely decorated and hold high conservation value [27]. In contrast, Shanxi Province is dominated by folk-built temples and monasteries, including Yong’an Temple in Hunyuan, Zishou Temple in Lingshi, and Jiyi Temple in Xinjiang. These murals are more widely distributed and diverse in type, providing excellent sample representativeness [28].
In terms of the temporal distribution of the mural creation periods, the selected temple/monastery murals are mainly concentrated in the mid-Ming Dynasty, with a small number from the early and late Ming Dynasty. This distribution reflects the evolution of mural styles and craftsmanship across different historical stages of the Ming Dynasty. The selected sites fall into two categories: imperially commissioned and local folk-built, representing distinct mural painting techniques and cultural backgrounds. All twelve selected temples and monasteries are China National Key Cultural Relics Protection Units, covering the 1st to 8th batches of such units, and thus possess high historical, artistic, and scientific research value.
This study used existing digital image resources from authoritative institutions to obtain high-quality image data without interfering with the cultural relics, replacing destructive on-site photography for data collection. The mural images were primarily sourced from three categories: public digital archives of cultural relics protection units at various levels, high-definition image resources published by museums and cultural heritage research institutions, and outcomes of selected mural research projects that have finalized digital acquisition. All these images were captured and processed by qualified professional teams and have high image clarity and color restoration accuracy, which can meet the precision requirements for the training and validation of deep learning models.
Moreover, using existing digital resources not only adheres to the minimum intervention principle in cultural heritage protection but also improves the consistency and standardization of image collection, laying a solid foundation for subsequent data annotation and model evaluation.

2.2. Image Annotation and Preprocessing

This study focused on mural detachment as the primary identification target. According to the national standard Diseases and Illustrations of Ancient Murals published in China, this type of deterioration is the most prevalent in Ming Dynasty temple and monastery murals and directly affects the readability of the mural scenes and the integrity of their artistic value. As a type of surface exfoliation deterioration, mural detachment is characterized by partial or extensive peeling of the pigment layer from the background layer or ground layer. Its distinct visual boundaries enable precise identification using image segmentation techniques. According to statistics on the deterioration of Ming Dynasty murals, mural detachment accounts for over 40% of all cases, making it both highly significant for identification and the top conservation priority among deterioration types [29,30].
A total of 655 large-scale, ultra-high-definition mural images were collected from the 12 selected temples for the present study. Pixel-level fine annotation of regions with pigment layer detachment was conducted, strictly following the terminological definitions and graphical standards outlined in Diseases and Illustrations of Ancient Murals. The LabelMe annotation tool (https://labelme.io/) was employed to delineate deterioration boundaries via polygon rendering (Figure 3), with the final output being JSON annotation files that include precise boundary details.
The entire annotation process used a quality control mechanism involving three steps: initial annotation, three rounds of iterative revision, and expert final review. Specifically, the initial annotation was completed by trained researchers, followed by two rounds of cross-validation and one round of consistency check, which was finally reviewed by experts in the cultural relic restoration field to ensure the accuracy of the annotation boundaries and the consistency of the deterioration identification. During the quality control procedure, 192 samples were removed because of severely uneven lighting, image blurriness, or because there was no expert consensus on the annotation boundaries. Ultimately, a high-quality detachment annotation dataset for 463 Ming Dynasty temple mural images was completed, providing a solid foundation for subsequent model training.
Following annotation, systematic preprocessing was conducted on the original images and annotation data in this study, aiming to build a high-quality dataset for deep learning training. Firstly, the image and label information extracted from the JSON-format annotation files was converted into a grayscale image format, and a mask layer for the corresponding channel was generated.
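For illustration, the following minimal sketch shows how such a conversion can be implemented for LabelMe JSON files; the class label name ("detachment") and the file paths are hypothetical assumptions, not the authors' exact pipeline.

```python
import json
from PIL import Image, ImageDraw

def json_to_mask(json_path: str, out_path: str) -> None:
    """Rasterize LabelMe polygon annotations into a grayscale mask image."""
    with open(json_path, encoding="utf-8") as f:
        ann = json.load(f)
    w, h = ann["imageWidth"], ann["imageHeight"]
    mask = Image.new("L", (w, h), 0)           # single channel, 0 = background
    draw = ImageDraw.Draw(mask)
    for shape in ann["shapes"]:
        if shape["label"] == "detachment":     # assumed label name
            draw.polygon([tuple(p) for p in shape["points"]], fill=255)
    mask.save(out_path)
```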
Specifically, we re-partitioned the data by temple: murals from Guangsheng Temple (early Ming Dynasty), Zhaohua Temple (mid-Ming Dynasty), and Yunlin Temple (late Ming Dynasty) were selected as the independent test set, while mural data from the other nine temples were used for the training and validation sets. To avoid slice-level data leakage, we completed temple-level data partitioning before conducting sliding window slicing and data augmentation. Murals from each temple were assigned exclusively to one subset (training, validation, or test set), and the resulting 256 × 256 tiles were generated solely within that subset. This strategy ensured complete independence between different data subsets, effectively eliminating potential overlap issues across temples or slices.
Secondly, a sliding window cropping algorithm was adopted to slice the original-sized images and mask images into fixed windows of 256 × 256 pixels with a step size of 128 pixels, and a 50% overlapping area was set to ensure the continuity of deterioration features. This overlapping sampling strategy can maximize the coverage of all potential deterioration areas and effectively avoid the risk of key features being truncated at the slice boundaries. In the process of slice generation, an automated filtering mechanism was employed to remove blank areas consisting solely of the mural background. Such rigorous screening standards substantially increased the effective information density of the training dataset, reducing the overfitting to non-deterioration areas during training. Consequently, the training efficiency and label quality of the model were notably enhanced.
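A minimal sketch of this slicing step follows, assuming NumPy arrays for image and mask and treating "blank" as a tile whose mask contains no annotated pixels (an interpretation of the filtering rule, not the authors' exact criterion):

```python
import numpy as np

def slice_pairs(image: np.ndarray, mask: np.ndarray,
                win: int = 256, stride: int = 128):
    """Yield aligned (image, mask) tiles with 50% overlap; skip blank tiles."""
    h, w = mask.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            m = mask[y:y + win, x:x + win]
            if m.max() == 0:                   # tile contains no annotated pixels
                continue
            yield image[y:y + win, x:x + win], m
```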
Thirdly, to enhance the model’s robustness and alleviate the issue of insufficient data volume, a variety of image augmentation strategies were further adopted to expand the sample size. The augmentation operations rotation (90°/180°/270°), additive Gaussian noise, and Gaussian blur processing were performed while ensuring that the label data changed synchronously with the images. These augmented images preserved the original structural characteristics and the distribution morphology of deteriorations, which significantly enhances the generalization capability of the model.
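A hedged sketch of this augmentation step is given below; the noise standard deviation and blur sigma are illustrative assumptions. Geometric transforms are applied to image and mask in lockstep, while photometric ones leave the mask unchanged:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def augment(image: np.ndarray, mask: np.ndarray):
    """Return (image, mask) variants of one training tile."""
    out = []
    for k in (1, 2, 3):                                    # 90/180/270-degree rotations
        out.append((np.rot90(image, k).copy(), np.rot90(mask, k).copy()))
    noisy = image.astype(float) + np.random.normal(0.0, 10.0, image.shape)
    out.append((np.clip(noisy, 0, 255).astype(image.dtype), mask))  # additive Gaussian noise
    sigma = (1.0, 1.0, 0.0) if image.ndim == 3 else 1.0    # do not blur across channels
    blurred = gaussian_filter(image.astype(float), sigma=sigma)
    out.append((blurred.astype(image.dtype), mask))        # photometric only: mask unchanged
    return out
```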
In the end, 37,685 image slices along with their corresponding labels were acquired.

2.3. Segment Models

To achieve high-precision identification of pigment layer detachment areas in Ming Dynasty temple murals, five representative image segmentation models were selected: UNet, U2-NetP, SegNet, NestedUNet, and SmaAt-UNet (Table 2). A comparative analysis of their segmentation performance on mural images was conducted. These models have a solid application foundation in fields such as medical image processing and natural scene segmentation. Their network structures possess strong edge-preserving and multi-scale feature-extraction capabilities, making them suitable for mural deterioration image recognition tasks characterized by complex textures and blurred edges [31,32].
UNet has a classic symmetric encoder–decoder architecture with a skip-connection mechanism, which effectively preserves image edges and detailed features, making it suitable for scenarios with moderate data volumes and clear target contours [33]. U2-NetP is a lightweight variant of U2-Net that adopts nested residual U-blocks (RSUs) to enhance multi-scale feature extraction while maintaining a low parameter count, and is applicable to images with blurred boundaries and irregularly shaped detachment areas [34]. SegNet utilizes max-pooling indices for upsampling in the decoding stage, which reduces the number of model parameters while improving the recovery of spatial detail, making it suitable for inference deployment in embedded systems or environments with limited computing power [35]. NestedUNet (UNet++) introduces multi-layer nested skip connections and dense feature fusion on top of UNet, which enhances the expressive ability for small-scale targets and complex structures, and is suitable for processing mural areas with highly variable detachment morphologies [36]. SmaAt-UNet, on the other hand, integrates an attention mechanism and lightweight convolution units, which enhances the modeling of semantic edges while maintaining low computational complexity; it is particularly suitable for mural images with sparse information distribution but important structures [37]. These five models have distinct characteristics in terms of structural complexity, parameter count, edge-recognition ability, and computational efficiency; specific structural comparisons are shown in Table 2.
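To make the two SmaAt-UNet ingredients concrete, the sketch below shows a depthwise-separable convolution (spatial and channel convolutions decomposed) and a generic squeeze-and-excitation-style channel gate standing in for the attention module; this is an illustrative reconstruction, not the reference implementation.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Spatial (depthwise) and channel (pointwise) convolutions, decomposed."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class ChannelGate(nn.Module):
    """Squeeze-and-excitation-style gate: reweight channels by global context."""
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)                  # attention weights in [0, 1]
```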

2.4. Hyperparameter Optimization via Particle Swarm Optimization (PSO) Algorithm

To further enhance the performance of the models in the task of identifying detachment in Ming Dynasty murals, this study introduces the Particle Swarm Optimization (PSO) algorithm for the automatic search and tuning of key hyperparameters of neural networks. PSO, proposed by Kennedy and Eberhart in 1995, is a global optimization algorithm that simulates the foraging behavior of bird flocks [38]. It exhibits advantages such as high search efficiency, simple implementation, and easy parallelization, and has been widely applied in hyperparameter optimization of deep learning models. The basic idea of PSO is to regard each possible parameter combination as a particle in the search space. By simulating the collaborative update mechanism of particles in the search space based on individual experience and group experience, the optimal solution is gradually approached. In each iteration, each particle updates its velocity and position according to its historical optimal position and the global optimal position, thereby achieving dynamic optimization of the objective function.
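The canonical velocity/position update described above can be written compactly as follows (a generic sketch; the study's concrete swarm settings are given in Section 3.4):

```python
import numpy as np

def pso_step(pos, vel, pbest, gbest, w=0.5, c1=1.5, c2=1.5):
    """One PSO iteration over a (n_particles, n_dims) swarm."""
    r1 = np.random.rand(*pos.shape)            # stochastic cognitive component
    r2 = np.random.rand(*pos.shape)            # stochastic social component
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    return pos + vel, vel
```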
Within this study, the PSO algorithm was employed to optimize the key hyperparameters of the image segmentation models. PSO’s ability to search in high-dimensional continuous spaces renders it well-suited for addressing nonlinear couplings between parameters that arise during the training of complex models. In comparison to conventional Grid Search or Random Search, PSO achieves higher convergence rates and superior performance when allocated the same computational budget [39]. In the subsequent experimental part, the PSO process will be integrated into the SmaAt-UNet model, and its performance enhancement effect will be validated using the constructed dataset (Figure 4).

3. Experiments and Validation

All training procedures in this study were executed with NVIDIA GPU support. The hardware setup—comprising an NVIDIA Quadro P5000 GPU and an Intel Xeon Gold 6128 CPU—was selected to meet the computational demands of training on large-scale mural image datasets. For software, image processing, model training, and performance assessment were implemented using Python 3.9.13 and the PyTorch 1.13.1 deep learning framework. To guarantee the convergence stability and performance of image segmentation models for Ming Dynasty temple mural detachment identification, the five candidate models mentioned earlier were trained and validated within a consistent training framework, with further optimization performed on the model that achieved the highest performance.
The AdamW optimizer was employed during model training, combined with the ReduceLROnPlateau learning-rate scheduling strategy, to achieve dynamic learning-rate adjustment. The initial learning rate was set to 1 × 10−4, and the weight decay was 1 × 10−4. The batch size was set to 16, with a maximum of 80 training epochs. An early stopping mechanism was also implemented: training was terminated early if the performance on the validation set showed no significant improvement for 15 consecutive epochs, so as to prevent overfitting. All dataset partitioning was performed at the temple level before slicing (Section 2.2), ensuring that slices from the same original mural image could not appear in different subsets simultaneously, thereby avoiding data leakage.
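A condensed sketch of this shared training loop is shown below; `model`, `train_loader`, `criterion`, and `validate` are placeholders for the study's own components rather than a published API.

```python
import torch

def train(model, train_loader, criterion, validate,
          max_epochs=80, patience=15):
    """Training skeleton: AdamW + ReduceLROnPlateau + early stopping."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", patience=10)    # stepped on validation IoU
    best_iou, stale = 0.0, 0
    for epoch in range(max_epochs):
        model.train()
        for images, masks in train_loader:     # batch size 16 in this study
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
        val_iou = validate(model)              # mean IoU on the validation set
        scheduler.step(val_iou)
        if val_iou > best_iou:
            best_iou, stale = val_iou, 0
        else:
            stale += 1
            if stale >= patience:              # 15 stagnant epochs -> stop
                break
    return model
```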

3.1. Definition and Explanation of Evaluation Metrics

To comprehensively evaluate the performance of image segmentation models in the task of identifying pigment layer detachment in Ming Dynasty temple murals, this study selected four metrics: Intersection over Union (IoU), Dice Coefficient, Mean Absolute Error (MAE), and Mean Pixel Accuracy (mPA). These metrics can measure the models’ segmentation effectiveness and error characteristics from different perspectives, forming a complementary evaluation system.
IoU measures the overlap between the predicted mask and the ground truth annotation region, and is one of the most commonly used evaluation metrics in the field of semantic segmentation. Here, True Positive (TP) refers to the number of pixels predicted as “detachment” that are also actual detachment pixels; False Positive (FP) denotes the number of pixels predicted as “detachment” that are actually non-detachment pixels; and False Negative (FN) represents the number of actual detachment pixels that are not predicted as such. A higher IoU value indicates better consistency between the predicted results and the ground truth annotations.
Similar to IoU, the Dice Coefficient emphasizes the overlap between the segmentation result and the ground truth region, but is more sensitive to small-area targets. Widely used in medical image segmentation and digital identification of cultural heritage, the Dice Coefficient can effectively reflect the model’s ability to extract the boundaries of mural deterioration regions.
MAE measures the pixel-level difference between the predicted probability map and the ground-truth mask, where P_i and G_i are the predicted value and ground-truth label value of the i-th pixel, respectively, and N is the total number of pixels. A lower MAE indicates that the probability map output by the model is closer to the actual distribution of deterioration.
mPA evaluates the mean per-class accuracy of the model in classifying all pixels. Here, TN refers to the number of pixels predicted as background that are also actual background pixels. Suitable for assessing overall pixel-level classification performance, mPA can reflect the model’s overall ability to identify both background and deterioration regions.
$$\text{IoU} = \frac{TP}{TP + FP + FN}$$

$$\text{Dice} = \frac{2TP}{2TP + FP + FN}$$

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert P_i - G_i \rvert$$

$$\text{mPA} = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)$$
Together, the four metrics described above constitute an integrated evaluation framework for model performance. Specifically, IoU and Dice focus on quantifying the spatial overlap of segmented areas, MAE assesses pixel-wise errors in predicted probabilities, and mPA serves to validate the overall classification precision. By leveraging the synergy of these metrics, it becomes possible to conduct a thorough and unbiased quantitative assessment of how well the model performs in identifying mural detachment.
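As a worked illustration of the four formulas above, the following sketch computes all metrics from a predicted probability map and a binary ground-truth mask; the 0.5 binarization threshold is an assumption.

```python
import numpy as np

def evaluate(prob: np.ndarray, gt: np.ndarray, thr: float = 0.5):
    """Compute IoU, Dice, MAE, and mPA for one predicted probability map."""
    pred = (prob >= thr).astype(np.uint8)      # binarize the prediction
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    mae = np.mean(np.abs(prob - gt))           # computed on the probability map
    mpa = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
    return iou, dice, mae, mpa
```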

3.2. Loss Function

To optimize the segmentation performance of the models in mural detachment areas, we utilized the Stable IoU-BCE Loss function [40] as the primary loss function. It integrates the optimization objectives of regional overlap accuracy and pixel-level error, maintaining training stability in small lesion areas and complex boundary conditions.
The definition of the loss function is as follows:
$$\text{Stable IoU-BCE Loss} = \alpha \cdot \text{Stable IoU Loss} + \beta \cdot \text{BCE Loss}$$

in which

$$\text{Stable IoU Loss} = 1 - \frac{TP + \epsilon}{TP + FP + FN + \epsilon}, \qquad \epsilon = 10^{-6}$$

$$\text{BCE Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$
In the formula, TP, FP, and FN represent the counts of true-positive, false-positive, and false-negative pixels, respectively; y_i and p_i denote the true label and predicted probability of the i-th pixel, and N is the total number of pixels. The setting ε = 10^−6 follows the relevant literature [41,42] and is commonly used to ensure numerical stability. The weights were set differently depending on the training phase: during the model-comparison phase, all models used α = 10 and β = 1 to ensure fairness of evaluation; during the PSO phase, α and β were dynamically optimized by PSO. This composite loss function shows superior convergence and stability during training, significantly reducing the overfitting risk relative to using the BCE loss alone, particularly in cases where lesion regions are sparsely distributed or boundaries are unclear.
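A minimal PyTorch sketch of this composite loss is given below. Computing TP/FP/FN as soft sums over predicted probabilities is a common differentiable relaxation and is an assumption here, not necessarily the implementation of the cited work.

```python
import torch
import torch.nn.functional as F

def stable_iou_bce_loss(logits: torch.Tensor, target: torch.Tensor,
                        alpha: float = 10.0, beta: float = 1.0,
                        eps: float = 1e-6) -> torch.Tensor:
    """Composite loss: alpha * Stable IoU Loss + beta * BCE Loss."""
    p = torch.sigmoid(logits)                  # logits -> probabilities
    tp = (p * target).sum()                    # soft true positives
    fp = (p * (1 - target)).sum()              # soft false positives
    fn = ((1 - p) * target).sum()              # soft false negatives
    stable_iou = 1 - (tp + eps) / (tp + fp + fn + eps)
    bce = F.binary_cross_entropy_with_logits(logits, target)
    return alpha * stable_iou + beta * bce
```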

3.3. Design of Comparative Experiments

As previously stated, performance comparison experiments were performed on the five chosen image segmentation models to assess how different network architectures perform and adapt to the specific task of detecting pigment layer detachment in Ming Dynasty temple murals. Encompassing classical, modified, and lightweight deep segmentation networks, these five models enable a comprehensive assessment of how diverse structural designs handle mural deterioration imagery. By systematically comparing these models within a consistent experimental framework, the research sought to determine the best-performing candidate model—specifically in terms of detachment region detection precision, boundary retention, and overall recognition stability—to establish a foundation for subsequent optimization trials.
To ensure the fairness of the comparison, all models were trained and validated using the same dataset and unified training settings. This included identical preprocessing and augmentation strategies, as well as the same loss function (Stable IoU-BCE Loss = α · Stable IoU Loss + β · BCE Loss, with α = 10 and β = 1). Each model was trained for a maximum of 80 epochs, with an early stopping mechanism in place. This approach ensured that the experimental results reflect the inherent performance differences of the model structures rather than varying training strategies.
The experimental procedures were as follows:
(1)
Data Preparation: The preprocessed dataset, consisting of 37,685 image slices, was loaded. The consistent augmentation pipeline described in Section 2.2 (rotation, additive Gaussian noise, and Gaussian blur) was applied identically for all models to mitigate overfitting and ensure data-distribution consistency.
(2)
Model Initialization: All models were implemented using the PyTorch deep learning framework. Their predefined architectures were strictly followed—e.g., UNet’s symmetric encoder–decoder with skip connections, and SmaAt-UNet’s integration of spatial attention modules—and model weights were initialized using the same random seed to eliminate initialization bias.
(3)
Training was executed as a cyclic process on the training set, with each epoch encompassing the processing of all mini-batches. The key training configurations were standardized across the models:
  • Batch size: 16 (optimized to balance computational efficiency and gradient stability);
  • Parameter update: Model parameters were updated iteratively after each epoch using the composite loss function (Stable IoU-BCE Loss = α · Stable IoU Loss + β · BCE Loss, with α = 10 and β = 1 for the comparison phase) to adapt to the evolving data distribution during training;
  • Maximum epochs: 80 (enforced for all models to eliminate performance bias caused by variable training durations, ensuring that each model reaches a fully converged state).
Performance monitoring and optimization were implemented consistently as follows:
  • Validation frequency: Every 5 epochs, to track training progress and prevent overfitting;
  • Learning rate scheduling: ReduceLROnPlateau strategy (patience = 10), which decreases the learning rate when the validation IoU plateaus, facilitating fine-grained parameter tuning;
  • Early stopping safeguard: Triggered if validation IoU stagnates for 15 consecutive epochs (to capture the optimal model state), though the full 80-epoch training was completed to maintain experimental consistency.
(4)
To minimize the impact of random initialization on the experimental results, this study conducted repeated experiments under multiple random-seed conditions. Five different random seeds (42, 123, 456, 789, and 1011) were set, and each model was independently trained five times. All performance metrics (IoU, Dice, MAE, and mPA) were averaged over the five experimental results to obtain the final reported values, and 95% confidence intervals were calculated to quantify the statistical stability of the model performance (a minimal sketch of this mean/CI computation follows this list).
(5)
Post-training evaluation was conducted on the independent test set (not involved in training/validation) to ensure unbiased results: the models performed inference to generate pixel-level prediction masks, and four quantitative metrics were computed for comprehensive assessment—IoU, Dice, MAE, and mPA. Strict experimental controls were applied to guarantee comparative fairness: all models shared the same dataset (training/validation/test splits), unified training protocols, identical data preprocessing and augmentation pipelines, and the same loss function.
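As referenced in step (4), a minimal sketch of the seed-averaged reporting (mean plus a 95% confidence interval via the t-distribution) is:

```python
import numpy as np
from scipy import stats

def mean_ci(values, confidence: float = 0.95):
    """Return (mean, CI half-width) for a metric across repeated runs."""
    v = np.asarray(values, dtype=float)
    sem = stats.sem(v)                         # standard error of the mean
    half = sem * stats.t.ppf((1 + confidence) / 2, len(v) - 1)
    return v.mean(), half

# e.g., five per-seed IoU values (illustrative numbers only)
print(mean_ci([0.661, 0.668, 0.659, 0.666, 0.663]))
```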

3.4. PSO Experiment Design

After completing the performance comparison of the five segmentation models, this study adopted the PSO algorithm to optimize the key hyperparameters of the SmaAt-UNet model, which exhibited the best performance. Unlike traditional optimization that targets only a single loss weight, this study jointly optimized across three dimensions: the learning rate, the weight decay, and the weights (α and β) of the Stable IoU-BCE loss. During the specific implementation, about 15% of the samples were set aside as a tuning set when dividing the training set. This tuning set was used exclusively for hyperparameter search and validation in the PSO phase and was not used in the final training and testing of the model.
The search range of the learning rate was from 1 × 10−5 to 1 × 10−3, the search range of the weight decay was from 1 × 10−6 to 1 × 10−3, and the search range of the weights α and β was from 1 to 20. PSO treated these hyperparameters as search vectors in the solution space, where each particle represents a set of candidate hyperparameter combinations, and its fitness function was defined as the mean IoU value of the validation set.
The PSO process is illustrated in Figure 5. Six particles were initialized with randomly generated positions and velocities. The inertia weight was set to 0.5, and both the cognitive and social factors were set to 1.5. In each iteration, each particle fine-tunes the model on the tuning subset (for 2 rounds) and computes the IoU on the validation set. The individual best (pbest) and global best (gbest) positions are updated accordingly, and the particles' velocities and positions are then adjusted via the PSO update formulas. The maximum number of iterations was set to 5; if the improvement in gbest is less than 0.001 for two consecutive iterations, the process terminates early, and the optimal hyperparameter combination is output at the end [43].
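Putting the stated settings together, a hedged sketch of the search loop is shown below; `fitness` is a placeholder for the two-round fine-tuning plus validation-IoU evaluation and here merely returns a stand-in value to keep the sketch runnable.

```python
import numpy as np

rng = np.random.default_rng(42)

def fitness(params: np.ndarray) -> float:
    # Placeholder: fine-tune SmaAt-UNet for 2 rounds with these hyperparameters
    # and return the mean validation IoU. A random stand-in keeps this runnable.
    lr, weight_decay, alpha, beta = params
    return float(rng.random())

low = np.array([1e-5, 1e-6, 1.0, 1.0])        # lr, weight decay, alpha, beta
high = np.array([1e-3, 1e-3, 20.0, 20.0])
pos = rng.uniform(low, high, size=(6, 4))     # six particles
vel = np.zeros_like(pos)
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest, gbest_fit, stall = pbest[pbest_fit.argmax()].copy(), pbest_fit.max(), 0

for _ in range(5):                            # at most five iterations
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = 0.5 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, low, high)       # keep particles inside the ranges
    fit = np.array([fitness(p) for p in pos])
    better = fit > pbest_fit
    pbest[better], pbest_fit[better] = pos[better], fit[better]
    if pbest_fit.max() > gbest_fit + 0.001:   # meaningful gbest improvement
        gbest, gbest_fit, stall = pbest[pbest_fit.argmax()].copy(), pbest_fit.max(), 0
    else:
        stall += 1
        if stall >= 2:                        # < 0.001 gain twice in a row
            break
print("best hyperparameters:", gbest)
```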
Subsequently, the SmaAt-UNet model was trained with the optimized hyperparameters for a maximum of 80 epochs, with an early stopping mechanism in place (triggered when there was no improvement in validation performance for 15 consecutive epochs). To further enhance the robustness and reproducibility of the results, each training experiment was conducted five times independently using five different random seeds (42, 123, 456, 789, and 1011). The model was evaluated on the reserved test set: all performance metrics (IoU, Dice, MAE, and mPA) were averaged over the five experiments, and 95% confidence intervals (CIs) were calculated to quantify the statistical stability of the model performance. This design, which maintains an independent test set and involves multiple repeated training runs and separate hyperparameter tuning, effectively mitigates the risk of overfitting to the validation set and ensures that the reported performance metrics are statistically reliable. Prediction masks were ultimately generated on the test set.

4. Results

To systematically evaluate different deep segmentation models for Ming Dynasty temple mural pigment layer detachment identification, five models (UNet, NestedUNet, SegNet, U2-NetP, and SmaAt-UNet) were compared using the same training settings, with the final evaluation performed on an independent test set. Since the early stopping criterion was not triggered, all models were trained for the full 80 epochs. They were comprehensively assessed using four metrics: IoU, Dice, MAE, and mPA.

4.1. Comparison Results of the Segmentation Models

4.1.1. Segmentation Accuracy

Table 3 summarizes the average test results of the five image segmentation models, with 95% confidence intervals (CIs), across five different random seeds. The evaluation metrics were Intersection over Union (IoU), Dice, Mean Absolute Error (MAE), and Mean Pixel Accuracy (mPA). The results show that SmaAt-UNet performed the best overall. This model achieved the highest values for IoU (0.6633) and Dice (0.7246), indicating superior pixel overlap and boundary consistency in lesion regions. Additionally, the mPA of SmaAt-UNet reached 0.9325 and its MAE was only 0.0592, reflecting excellent global classification accuracy and error control. In contrast, although UNet was close to or slightly better than SmaAt-UNet in some metrics (e.g., mPA = 0.9287), it remained slightly inferior overall. NestedUNet exhibited moderate performance, with an IoU of 0.4795 and an mPA of 0.8258; although not as good as SmaAt-UNet or UNet, it outperformed SegNet and U2-NetP. SegNet and U2-NetP had the weakest results, with IoU and Dice values below 0.5 (SegNet: 0.3672/0.4551; U2-NetP: 0.3527/0.4392) and MAE values close to 0.46–0.48, indicating significant deficiencies in edge delineation and robustness.

4.1.2. Model Comparison Training and Validation Curves

To visually present the convergence characteristics and validation performance changes of each model during training, this study independently trained each model five times with five different random seeds (42, 123, 456, 789, and 1011). The average values of the results from each training iteration were used as the final metrics. The study then plotted the average performance curves over 80 training epochs (IoU, Dice, MAE, and mPA) based on the validation set (Figure 6a) and the training/validation loss curves (Figure 6b).
SmaAt-UNet performed the best among all the models. Its training loss decreased rapidly to near zero, and the validation loss remained at a low level. The IoU and Dice metrics on the validation set quickly increased after training began and entered a stable performance phase after 60 epochs, eventually reaching 0.8133 and 0.8934, respectively, while the mPA increased to 0.9456 and the MAE decreased to 0.0558, indicating significant advantages in edge preservation and global consistency.
In contrast, UNet showed a similar overall trend to SmaAt-UNet but with a slower convergence speed and greater fluctuations during validation, suggesting slightly insufficient adaptability to complex textures and stability. NestedUNet gradually improved in performance after 40 epochs but exhibited noticeable oscillations in validation loss, indicating insufficient stability in handling complex details. SegNet and U2-NetP had the weakest overall performance, with slow training convergence and severe fluctuations during validation. Their IoU remained below 0.5 for an extended period, revealing significant deficiencies in precision and generalization ability for lesion region segmentation.

4.1.3. Model Comparison Segmentation Result Visualization

To ensure the representativeness and generalizability of the segmentation model’s test results, this study selected samples from the murals of three Ming Dynasty temples from different periods to construct the test set. Figure 7 illustrates the sources of the mural samples used for the segmentation experiments and their corresponding regions. Nine typical mural images were selected as the test set, where the a, b, and c slices are from Guangsheng Temple from the early Ming Dynasty; the d, e, and f slices are from Yunlin Temple from the late Ming Dynasty; and the g, h, and i slices are from Zhaohua Temple from the middle Ming Dynasty. The murals of these three temples vary in painting date, pigment characteristics, and preservation condition, which can comprehensively reflect the diverse forms of pigment layer detachment in murals from different periods and provide a representative set for cross-scene validation of the model’s segmentation performance.
To intuitively verify the segmentation performance of the different models on mural detachment regions, this study selected typical mural images covering diverse detachment morphologies from the test set, and conducted a visual comparison of the prediction results of the five models (UNet, U2-NetP, SegNet, NestedUNet, and SmaAt-UNet) (Figure 8). Each model was independently trained five times using five different random seeds (42, 123, 456, 789, and 1011), and the experimental results with evaluation metrics closest to the average were selected to generate the corresponding prediction masks. In the comparison figure, the original mural images and the binarized prediction masks output by each model are presented in sequence along the column dimension, clearly demonstrating the differences of the models’ performance in capturing fine edges and handling complex background interference.
From the visualization results, SmaAt-UNet demonstrated significant advantages: its predicted mask outlines were highly consistent with the ground-truth masks, and it achieved high accuracy in restoring fine detachment edges. Even when facing complex mural textures, it still clearly distinguished the boundaries of detachment regions and effectively suppressed background interference.
In contrast, although UNet could capture the general outline, it was slightly vague in handling extremely fine edges, such as narrow fractures in detachment regions. U2-NetP exhibited local over-segmentation, with some non-detachment regions misclassified as targets. The predicted mask edges of SegNet and NestedUNet were rough, with obvious burrs and holes, and these models lacked the ability to identify tiny detachments under complex backgrounds, resulting in large deviations between their segmentation results and the ground-truth masks.
In summary, through the systematic comparison of five classical image segmentation models, this study verified the significant advantages of SmaAt-UNet in the task of identifying pigment layer detachment in Ming Dynasty temple murals. Compared with the other models, SmaAt-UNet performs best in terms of segmentation accuracy, training convergence speed, performance stability, and edge preservation capability. It can effectively address challenges such as complex mural textures and variable morphological characteristics of deterioration regions. The high IoU, Dice, and mPA values, as well as the low MAE value, achieved by SmaAt-UNet on the test set further confirm its application potential in practical mural restoration and preservation. This finding lays a solid foundation for the subsequent experiments aimed at improving model performance using the PSO algorithm.

4.2. Performance Results of the Optimized Model

Based on the comparative experiments of the five segmentation models, this study selected the best-performing SmaAt-UNet as the baseline model and used the PSO algorithm to optimize its key hyperparameters, including the weights (α and β) of the Stable IoU-BCE loss, aiming to further improve the model's performance in identifying detachment in Ming Dynasty temple murals. The optimized model, PSO-SmaAt-UNet, was fully trained on the training and validation sets with the optimized weights α = 10 and β = 2, and its final performance was evaluated on the independent test set.

4.2.1. Performance Metrics

Table 4 summarizes the segmentation performance comparison of SmaAt-UNet on the test set before and after optimization. All metrics are the averages obtained from independent training and testing conducted under five different random seeds.
After PSO, the overall performance of the model improved significantly. Specifically, the IoU increased from 0.6633 to 0.7352, and the Dice Coefficient rose from 0.7246 to 0.7936, indicating that the optimized model delineates the boundaries of deterioration regions more accurately and achieves greater pixel overlap. Meanwhile, the MAE decreased from 0.0592 to 0.0455, meaning that the pixel-level error between the predicted results and the ground-truth annotations was significantly reduced and that the model performs better in detail restoration and error control. In addition, the mPA increased from 0.9325 to 0.9702, further confirming the advantage of the optimized model in global pixel-classification accuracy. Overall, PSO effectively improved the segmentation accuracy and robustness of SmaAt-UNet.

4.2.2. Model Optimization Training and Validation Curves

To evaluate the training convergence and performance improvement of the SmaAt-UNet model after PSO, this study independently trained the model five times using five different random seeds (42, 123, 456, 789, and 1011). The average values of the results over 60 training epochs were used to plot the training/validation loss curves (Figure 9a) and the variation curves of the four performance metrics (IoU, Dice, MAE, and mPA) based on the validation set (Figure 9b). The loss curves indicate that the model’s loss quickly dropped in the initial stages of training, suggesting that the network rapidly acquired effective features. Subsequently, the loss gradually converged, with the training loss stabilizing at around 0.05. The validation loss exhibited slight fluctuations between epochs 25 and 35, reflecting differences between the training and validation sets, but remained generally stable. In terms of performance metrics on the validation set (Figure 9b), the IoU metric steadily increased to 0.8363, the Dice score rose to 0.9220, the mPA metric reached 0.9698, and the MAE metric decreased to approximately 0.0468. These trends highlight the gradual improvement in segmentation performance, particularly in pixel-level prediction accuracy and model stability. PSO effectively enhanced the model’s convergence speed and generalization ability, providing a robust technical foundation for high-precision segmentation of mural pigment layer detachment areas.

4.2.3. Model Optimization Segmentation Result Visualization

To intuitively compare the segmentation performance of the models before and after optimization, representative test set samples were selected, including detachment regions with varying sizes, shapes, and boundary sharpness. Figure 10 illustrates the segmentation results of the original SmaAt-UNet model and the PSO-optimized SmaAt-UNet model on the same test samples. Both models were independently trained five times using five different random seeds (42, 123, 456, 789, and 1011), and the experimental results with composite evaluation metrics (IoU, Dice, MAE, and mPA) closest to the overall mean were selected for visualization to ensure the representativeness and fairness of the comparison.
The visualization results reveal the advantages of the optimized model: it achieved higher precision in identifying edge regions and separating complex backgrounds, resulting in prediction masks that align more closely with the ground truth. While the overall segmentation quality does not differ substantially from the unoptimized model, the PSO-optimized version exhibits subtle but meaningful improvements in two critical areas: the restoration of fine detachment edges (e.g., narrow, blurred edges) and the suppression of interference from complex mural textures. These improvements confirm that the optimization effectively enhances both the model's generalization capability and its performance in fine-grained segmentation tasks.
In summary, the SmaAt-UNet model optimized by PSO demonstrates higher accuracy and stability in the task of segmenting mural detachment regions. The performance metrics show significant improvements in IoU, Dice, and mPA, along with a notable decrease in MAE. The training and validation curves indicate that the model converges rapidly and exhibits enhanced generalization ability. The visualization results further verify the model’s advantages in restoring edge details and separating complex backgrounds. Overall, PSO effectively improves the segmentation performance and robustness of SmaAt-UNet, providing a reliable foundational model for subsequent mural deterioration identification tasks.

5. Discussion

5.1. Model Performance Analysis

Among the five segmentation models (UNet, NestedUNet, SegNet, U2-NetP, and SmaAt-UNet) compared in this study, SmaAt-UNet stood out as the top performer in detecting pigment layer detachment in Ming Dynasty temple murals. The quantitative results on the test set confirm its superiority, with an IoU of 0.6633, Dice of 0.7246, mPA of 0.9325, and MAE of 0.0592—metrics that indicate that SmaAt-UNet outperformed the other four models by a significant margin.
The technical innovations of SmaAt-UNet underpin this performance advantage, with two key design improvements:
  • Lightweight convolutional units: Unlike the standard convolutions in UNet, this unit decomposes spatial convolution and channel convolution, drastically reducing the parameter count and computational cost. Crucially, it retains the model’s multi-scale feature extraction capability—essential for handling mural images with intricate textures (e.g., overlapping brushstrokes) and variable deterioration morphologies (e.g., dot-like vs. sheet-like detachment);
  • Attention module integration: During feature extraction, the module dynamically assigns higher weights to pixels corresponding to detachment regions and lower weights to background pixels. This selective attention enhances the model’s discriminative ability at edge regions, addressing the common challenge of blurry boundaries between detachment areas and mural backgrounds, and enabling precise boundary localization.
Relative to competitors, SmaAt-UNet strikes a unique balance: it avoids the excessive complexity of NestedUNet (which leads to overfitting with limited samples) and overcomes the insufficient feature expression of SegNet (due to its simplified encoder–decoder structure). This balance allows SmaAt-UNet to achieve higher learning efficiency and robustness under limited training data, directly contributing to its exceptional performance in this study.
Taking SmaAt-UNet as the baseline model, this study integrated the Particle Swarm Optimization (PSO) algorithm for the automatic optimization of key hyperparameters. The quantitative results on the test set confirm the effectiveness of this optimization: its IoU was enhanced to 0.7352, its Dice reached 0.7936, its mPA was elevated to 0.9702, and its MAE was reduced to 0.0455.
A core advantage of PSO lies in its global search capability, which addresses a critical limitation of manual tuning—i.e., the risk of falling into local optima due to subjective experience. This mechanism not only accelerates the model’s convergence under constrained computational resources but also stabilizes the validation process (evidenced by the smoother validation curves) and strengthens generalization, ensuring robust performance on unseen mural samples.
Beyond the current task, this PSO-based optimization method offers strong adaptability: it can be readily applied to other deep learning architectures (e.g., U-Net++, ResUNet) and extended to the identification of other cultural heritage deterioration types, including cracks, paint flaking, and color fading. As such, it serves as a valuable technical reference for advancing the automation of deterioration detection and hyperparameter optimization across diverse cultural heritage preservation scenarios.

5.2. Model Limitations

While the deep learning approach developed in this study delivers promising performance for mural detachment detection, several limitations warrant attention. Their root causes and potential solutions are as follows:
  • Limited data diversity: The study primarily utilized digital resources from specific institutions, leading to data concentration in two key dimensions—geographical coverage (e.g., focusing on murals from a few regions) and temporal scope (e.g., spanning a narrow historical period). This lack of diversity may restrict the model’s generalization to murals with distinct stylistic or deteriorative characteristics from underrepresented areas/eras;
  • High cost and subjectivity of manual annotation: Detachment regions were entirely manually annotated. Even with strict adherence to the China national standard, the workflow was labor-intensive. Moreover, subjective biases (e.g., differences in annotators’ judgment of blurred detachment boundaries) inevitably introduce annotation errors, which may propagate to the model training;
  • Challenges in complex scenarios: For murals featuring intricate decorative patterns (e.g., overlapping motifs) and rich color gradients, the models struggled with two common issues: false detection (misclassifying decorative elements as detachment) and missed detection (failing to identify small-scale or edge-blurred detachments). This reflects the models’ insufficient ability to distinguish between detachment and complex background textures.
To address these limitations, future work should focus on three actionable directions: (1) expanding data sources by collaborating with more cultural heritage institutions or integrating open-access mural datasets; (2) adopting semi-automatic or weakly supervised annotation techniques to reduce manual labor and minimize subjective errors; and (3) optimizing the network architecture—for example, integrating multi-scale feature fusion modules or advanced attention mechanisms—to improve the model’s adaptability to complex background environments.
The deep learning-based detachment identification method proposed in this study holds significant application value in cultural heritage preservation practice. The optimized PSO-SmaAt-UNet model can automatically and accurately identify and quantify mural detachment regions, greatly improving the efficiency and consistency of deterioration detection. Compared with traditional manual visual inspection or manual delineation, this method provides rapid and objective reference data for cultural heritage conservators, which can assist in deterioration assessment, restoration planning, and long-term monitoring.

5.3. Model Application and Protection Practices

Building upon the validation of model performance, this study further explored the potential value of the PSO-optimized SmaAt-UNet segmentation model in cultural heritage conservation practice. The optimized segmentation model generates high-precision damage masks, providing not only data support for the quantitative assessment of mural deterioration but also laying the foundation for establishing a digital conservation system.
  • These segmentation results can be integrated into digital archives and visualization databases for cultural heritage, enabling long-term monitoring, evolutionary analysis, and multi-temporal comparisons of mural deterioration to achieve dynamic tracking of damage progression. When integrated with Historical Building Information Modeling (HBIM) systems, this approach enables precise localization and visualization of mural detachment areas in three-dimensional space. By correlating structural information with damage distribution, it enhances the efficiency and accuracy of interdisciplinary collaboration [44,45].
  • The model’s lightweight architecture and rapid inference speed confer it excellent portability and field application potential. The research findings can be deployed on mobile terminals or on-site inspection systems to enable real-time identification of mural deterioration and synchronous database updates, establishing an intelligent closed-loop process spanning data collection, automated recognition, and restoration decision-making [46]. This integrated application model significantly enhances the efficiency and responsiveness of mural conservation efforts, providing an intelligent supplement to traditional manual inspections. It transforms heritage monitoring from static documentation to dynamic management.
This study integrates deep learning algorithms into cultural heritage conservation practice. The proposed method not only validates the segmentation model’s effectiveness at the technical level but also demonstrates its application potential in mural restoration, risk assessment, and digital management, providing new technical pathways and methodological support for more precise and scientifically rigorous cultural heritage conservation.

6. Conclusions

This study addresses pigment-layer detachment in Ming Dynasty temple murals by systematically introducing deep-learning segmentation into cultural heritage preservation. The proposed technical route comprises mural image digitization, dataset construction, multi-model performance comparison, PSO-based hyperparameter optimization, and model validation.
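For readers reproducing this route, the PSO hyperparameter search can be sketched as follows. This is a minimal illustration, not the study's exact implementation: the search space (log10 learning rate and log10 weight decay), the swarm settings, and the placeholder `evaluate` objective, which in practice would run a short SmaAt-UNet training and return validation loss, are all assumptions.

```python
# Minimal PSO sketch over (log10 learning rate, log10 weight decay).
import numpy as np

rng = np.random.default_rng(42)
LOW, HIGH = np.array([-5.0, -6.0]), np.array([-2.0, -2.0])  # log10 bounds

def evaluate(params: np.ndarray) -> float:
    # Placeholder objective: in practice, train briefly with
    # lr, wd = 10 ** params and return the validation loss.
    return float(((params - np.array([-3.5, -4.0])) ** 2).sum())

n_particles, n_iters, w, c1, c2 = 10, 20, 0.7, 1.5, 1.5
pos = rng.uniform(LOW, HIGH, size=(n_particles, 2))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([evaluate(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(n_iters):
    r1, r2 = rng.random((n_particles, 1)), rng.random((n_particles, 1))
    # Standard velocity update: inertia + cognitive + social terms.
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, LOW, HIGH)
    vals = np.array([evaluate(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("best (lr, weight_decay):", 10 ** gbest)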
The research results show that SmaAt-UNet exhibited excellent performance on the test set, with an overall performance superior to that of UNet, NestedUNet, SegNet, and U2-NetP. After further optimizing its hyperparameters using the PSO algorithm, the PSO-SmaAt-UNet model displayed significant advantages in segmentation accuracy, edge preservation capability, and generalization performance on the test set.
This study constructed a systematic Ming Dynasty temple mural detachment dataset and developed a high-precision model for digital heritage management and conservation assessment, providing rapid, objective, and refined references for deterioration evaluation. It also demonstrates the potential for deep integration of deep learning with cultural heritage conservation, offering a template for future automated, intelligent mural conservation workflows.

Author Contributions

Conceptualization, C.L. and N.L.; methodology, C.L. and B.Z.; software, Z.S.; validation, Z.S., H.P. and Y.Z.; formal analysis, Z.S., Y.Z., C.L. and A.N.; investigation, Z.S., Y.Z., H.P. and A.N.; resources, C.L. and Z.S.; data curation, C.L., Z.S. and C.W.; writing—original draft preparation, Z.S. and C.W.; writing—review and editing, C.L., Y.Z. and H.P.; visualization, Z.S. and Y.Z.; supervision, C.L. and B.Z.; project administration, C.L.; funding acquisition, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was sponsored by the Beijing Municipal Social Science Fund Project (Grant Number: 24SRC021).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, W.; Xie, Q.; Shi, W.; Lin, H.; He, J.; Ao, J. Cultural rituality and heritage revitalization values of ancestral temple architecture painting art from the perspective of relational sociology theory. Herit. Sci. 2024, 12, 340.
  2. Kosel, J.; Kavčič, M.; Legan, L.; Retko, K.; Ropret, P. Evaluating the xerophilic potential of moulds on selected egg tempera paints on glass and wooden supports using fluorescent microscopy. J. Cult. Herit. 2021, 52, 44–54.
  3. Sun, P.; Hou, M.; Lyu, S.; Wang, W.; Li, S.; Mao, J.; Li, S. Enhancement and restoration of scratched murals based on hyperspectral imaging—A case study of murals in the Baoguang Hall of Qutan Temple, Qinghai, China. Sensors 2022, 22, 9780.
  4. Li, L.; Zou, Q.; Zhang, F.; Yu, H.; Chen, L.; Song, C.; Huang, X.; Wang, X.; Li, Q. Line Drawing-Guided Progressive Inpainting for Mural Damage. ACM J. Comput. Cult. Herit. 2025, 18, 1–20.
  5. Wei, X.; Fan, B.; Wang, Y.; Feng, Y.; Fu, L. Progressive enhancement and restoration for mural images under low-light and defective conditions based on multi-receptive field strategy. NPJ Herit. Sci. 2025, 13, 63.
  6. Wang, Y.; Wu, X. Current progress on murals: Distribution, conservation and utilization. Herit. Sci. 2023, 11, 61.
  7. Zhang, H.; Tang, C.A.; Guo, Q.; Wang, Y.; Xia, Y.; Tang, S.; Zhao, L. Analysis of cracking behavior of murals in Mogao Grottoes under environmental humidity change. J. Cult. Herit. 2024, 67, 183–193.
  8. Abd El-Tawab, N.; Mahran, A.; Gad, K. Conservation of the Mural Paintings of the Greek Orthodox Church Dome of Saint George, Old Cairo, Egypt. Eur. Sci. J. 2014, 10, 324–354.
  9. Rivas, T.; Alonso-Villar, E.M.; Pozo-Antonio, J.S. Forms and factors of deterioration of urban art murals under humid temperate climate; influence of environment and material properties. Eur. Phys. J. Plus 2022, 137, 1257.
  10. Du, P. A comprehensive automatic labeling and repair strategy for cracks and peeling conditions of literary murals in ancient buildings. J. Intell. Fuzzy Syst. 2024, 46, 3557–3568.
  11. Deng, X.; Yu, Y. Automatic calibration of crack and flaking diseases in ancient temple murals. Herit. Sci. 2022, 10, 163.
  12. Cao, J.; Li, Y.; Cui, H.; Zhang, Q. Improved region growing algorithm for the calibration of flaking deterioration in ancient temple murals. Herit. Sci. 2018, 6, 67.
  13. Chai, B.; Xiao, D.; Su, B.; Feng, W.; Yu, Z. Development and Application of a Multispectral Digital Recognition System for Mogao Caves’ Paint Colors. Dunhuang Stud. 2018, 3, 123–130.
  14. Zhang, H.; Xu, D.; Luo, H.; Yang, B. Multi-scale mural restoration method based on edge reconstruction. J. Graph. 2021, 42, 590.
  15. Yang, T.; Wang, S.; Pen, H.; Wang, Z. Automatic Recognition and Repair of Cracks in Mural Images Based on Improved SOM. J. Tianjin Univ. Sci. Technol. 2020, 53, 932–938.
  16. Li, B.; Chu, X.; Lin, F.; Wu, F.; Jin, S.; Zhang, K. A highly efficient tunnel lining crack detection model based on Mini-Unet. Sci. Rep. 2024, 14, 28234.
  17. Zhang, Z.; He, Y.; Hu, D.; Jin, Q.; Zhou, M.; Liu, Z.; Chen, H.; Wang, H.; Xiang, X. Algorithm for pixel-level concrete pavement crack segmentation based on an improved U-Net model. Sci. Rep. 2025, 15, 6553.
  18. Chen, Y.; Xia, R.; Yang, K.; Zou, K. Dual degradation image inpainting method via adaptive feature fusion and U-net network. Appl. Soft Comput. 2025, 174, 113010.
  19. Zeng, Y.; Gong, Y.; Zeng, X. Controllable digital restoration of ancient paintings using convolutional neural network and nearest neighbor. Pattern Recognit. Lett. 2020, 133, 158–164.
  20. Rakhimol, V.; Maheswari, P.U. Restoration of ancient temple murals using cGAN and PConv networks. Comput. Graph. 2022, 109, 100–110.
  21. Shen, J.; Liu, N.; Sun, H.; Li, D.; Zhang, Y.; Han, L. An algorithm based on lightweight semantic features for ancient mural element object detection. NPJ Herit. Sci. 2025, 13, 70.
  22. Yuan, Q.; He, X.; Han, X.; Guo, H. Automatic recognition of craquelure and paint loss on polychrome paintings of the Palace Museum using improved U-Net. Herit. Sci. 2023, 11, 65.
  23. Wu, L.; Zhang, L.; Shi, J.; Zhang, Y.; Wan, J. Damage detection of grotto murals based on lightweight neural network. Comput. Electr. Eng. 2022, 102, 108237.
  24. Zou, Z.; Zhao, P.; Zhao, X. Virtual restoration of the colored paintings on weathered beams in the Forbidden City using multiple deep learning algorithms. Adv. Eng. Inform. 2021, 50, 101421.
  25. Zhang, J.; Bai, S.; Zeng, X.; Liu, K.; Yuan, H. Supporting historic mural image inpainting by using coordinate attention aggregated transformations with U-Net-based discriminator. NPJ Herit. Sci. 2025, 13, 1–12.
  26. Cui, J.; Tao, N.; Omer, A.M.; Zhang, C.; Zhang, Q.; Ma, Y.; Zhang, Z.; Yang, D.; Zhang, H.; Duan, Y.; et al. Attention-enhanced U-Net for automatic crack detection in ancient murals using optical pulsed thermography. J. Cult. Herit. 2024, 70, 111–119.
  27. Zhang, Y.; Xu, Y.; Li, S.; Deng, F. Study on Digital Chromatography of Fahai Temple Frescoes in Ming Dynasty Based on Visualization. In Proceedings of the 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC), Hangzhou, China, 23–25 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 581–586.
  28. Li, T.T. Exploration on the Origin of Architectures Murals—Analysis on Geographical Features of Song, Liao and Jin Dynasty in Shanxi. Adv. Mater. Res. 2014, 838, 2870–2874.
  29. Pen, H.; Wang, S.; Zhang, Z. Mural image shedding diseases inpainting algorithm based on structure priority. In Proceedings of the Third International Conference on Artificial Intelligence and Computer Engineering (ICAICE 2022), Wuhan, China, 4–6 November 2022; SPIE: Bellingham, WA, USA, 2023; Volume 12610, pp. 347–352.
  30. Li, J.; Zhang, H.; Fan, Z.; He, X.; He, S.; Sun, M.; Ma, Y.; Fang, S.; Zhang, H.; Zhang, B. Investigation of the renewed diseases on murals at Mogao Grottoes. Herit. Sci. 2013, 1, 31.
  31. Xu, Z.; Zhang, X.; Chen, W.; Liu, J.; Xu, T.; Wang, Z. Muraldiff: Diffusion for ancient murals restoration on large-scale pre-training. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 2169–2181.
  32. Wu, M.; Jia, M.; Wang, J. TMCrack-Net: A U-shaped network with a feature pyramid and transformer for mural crack segmentation. Appl. Sci. 2022, 12, 10940.
  33. Hu, X.; Naiel, M.A.; Wong, A.; Lamm, M.; Fieguth, P. RUNet: A robust UNet architecture for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019.
  34. Li, X.; Fang, X.; Yang, G.; Su, S.; Zhu, L.; Yu, Z. TransU2-Net: An effective medical image segmentation framework based on transformer and U2-Net. IEEE J. Transl. Eng. Health Med. 2023, 11, 441–450.
  35. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
  36. Li, K.; Li, Z.; Fang, S. Siamese NestedUNet networks for change detection of high-resolution satellite image. In Proceedings of the 2020 1st International Conference on Control, Robotics and Intelligent System, Xiamen, China, 27–29 October 2020; pp. 42–48.
  37. Trebing, K.; Stańczyk, T.; Mehrkanoon, S. SmaAt-UNet: Precipitation nowcasting using a small attention-UNet architecture. Pattern Recognit. Lett. 2021, 145, 178–186.
  38. De Oca, M.A.M.; Stützle, T.; Birattari, M.; Dorigo, M. Frankenstein’s PSO: A composite particle swarm optimization algorithm. IEEE Trans. Evol. Comput. 2009, 13, 1120–1132.
  39. Jain, M.; Saihjpal, V.; Singh, N.; Singh, S.B. An overview of variants and advancements of PSO algorithm. Appl. Sci. 2022, 12, 8392.
  40. Yeghiazaryan, V.; Voiculescu, I. Family of boundary overlap metrics for the evaluation of medical image segmentation. J. Med. Imaging 2018, 5, 015006.
  41. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Jorge Cardoso, M. Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. In International Workshop on Deep Learning in Medical Image Analysis; Springer International Publishing: Cham, Switzerland, 2017; pp. 240–248.
  42. Huang, F.; Tan, E.L.; Yang, P.; Huang, S.; Ou-Yang, L.; Cao, J.; Wang, T.; Lei, B. Self-weighted adaptive structure learning for ASD diagnosis via multi-template multi-center representation. Med. Image Anal. 2020, 63, 101662.
  43. Saifullah, S.; Dreżewski, R. PSO-UNet: Particle Swarm-Optimized U-Net Framework for Precise Multimodal Brain Tumor Segmentation. arXiv 2025, arXiv:2503.19152.
  44. Xu, Z.; Yang, Y.; Fang, Q.; Chen, W.; Xu, T.; Liu, J.; Wang, Z. A comprehensive dataset for digital restoration of Dunhuang murals. Sci. Data 2024, 11, 955.
  45. Zhan, J.; Meng, Y.; Zhang, L.; Li, K.; Yan, F. Research on computer vision in intelligent damage monitoring of heritage conservation: The case of Yungang Cave Paintings. NPJ Herit. Sci. 2025, 13, 50.
  46. Mazzetto, S. Integrating emerging technologies with digital twins for heritage building conservation: An interdisciplinary approach with expert insights and bibliometric analysis. Heritage 2024, 7, 6432–6479.
Figure 1. Technical flowchart of the PSO-optimized SmaAt-UNet model for identifying pigment layer detachment in Ming Dynasty temple murals. The process is divided into three main stages. (1) Data Preparation: Collect high-resolution mural images from digital data; annotate, augment, and split data into 37,685 image slices and corresponding labels. (2) Model Selection and Training: Compare 5 segmentation models (with edge preservation, attention, edge detection, lightweight, and small-scale features); evaluate via IoU, Dice, MAE, and mPA. (3) Optimization and Evaluation: Optimize the selected SmaAt-UNet with PSO (learning rate, weight decay, and Stable IoU-BCE Loss) to realize automatic detachment recognition and preventive conservation.
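For readers implementing the pipeline in Figure 1, a combined IoU-BCE loss of the kind named in the caption can be sketched as below. This is a hedged reconstruction, not the study's exact formulation: the weighting `alpha`, the epsilon `eps`, and the soft-IoU form are assumptions.

```python
# Minimal sketch of a combined IoU-BCE loss: binary cross-entropy for
# pixel-wise stability plus a soft-IoU term for region overlap, with a
# small epsilon guarding against empty masks. Expects logits and float
# targets of shape (N, 1, H, W).
import torch
import torch.nn.functional as F

def iou_bce_loss(logits: torch.Tensor, target: torch.Tensor,
                 alpha: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(1, 2, 3))
    union = (probs + target - probs * target).sum(dim=(1, 2, 3))
    soft_iou = (inter + eps) / (union + eps)
    return alpha * bce + (1 - alpha) * (1 - soft_iou.mean())
```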
Figure 2. This map shows the geographical distribution of the 12 research temples involved in this study, which are located across three regions: Beijing, Hebei Province, and Shanxi Province. The symbols represent different temple construction backgrounds: squares indicate imperially commissioned temples, and triangles denote local civilian-built temples. Colors correspond to different periods of the Ming Dynasty: light yellow represents the early Ming Dynasty, orange stands for the middle Ming Dynasty, and red indicates the late Ming Dynasty.
Figure 3. Image annotation of mural detachment regions. The LabelMe tool was used for pixel-level fine-grained annotation of pigment layer detachment regions.
Figure 4. Architecture and hyperparameter optimization workflow of PSO-SmaAt-UNet.
Figure 5. PSO flowchart.
Figure 6. The average training-validation curves for the five segmentation models, obtained after five independent training sessions using different random seeds (42, 123, 456, 789, and 1011), covering the trends of loss, IoU, Dice Coefficient, MAE, and mPA over 80 epochs. (a) Training and validation loss of five segmentation models, reflecting model convergence. (b) Performance metrics (IoU, Dice, MAE, and mPA) are visualized as different colored curves and reflect the segmentation accuracy and robustness of each model during training. SmaAt-UNet showed the best performance.
Figure 7. Nine typical mural images were selected as the test set. The a, b, and c slices are from Guangsheng Temple from the early Ming Dynasty; the d, e, and f slices are from Yunlin Temple from the late Ming Dynasty; and the g, h, and i slices are from Zhaohua Temple from the middle Ming Dynasty. The positions of the slices within the original works are clearly indicated.
Figure 8. Visual comparison of mural detachment segmentation results. To ensure the fairness of the model performance comparison and the representativeness of the results, each segmentation model was independently trained five times using five different random seeds (42, 123, 456, 789, and 1011). The experimental results with composite evaluation metrics (IoU, Dice, MAE, and mPA) closest to the overall mean were selected for visualization. The image slices are arranged from left to right as follows: the original mural image and the segmentation results of UNet, U2-NetP, SegNet, NestedUNet, and SmaAt-UNet on the test set, each illustrating the model’s predicted segmentation mask against the ground-truth label to intuitively reflect the differences in performance of the different models in identifying mural detachment areas.
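The run-selection rule described in the Figure 8 caption can be sketched as follows; the metric values are random placeholders, and the Euclidean-distance criterion is an assumed reading of "closest to the overall mean".

```python
# Minimal sketch: among five seed runs, pick the one whose metric vector
# is nearest (L2 distance) to the mean over all runs, for visualization.
import numpy as np

rng = np.random.default_rng(0)
# One row per seed run; columns: IoU, Dice, 1 - MAE, mPA (placeholders,
# sign-flipped MAE so that larger is better in every column).
runs = rng.uniform(0.6, 0.7, size=(5, 4))
mean_vec = runs.mean(axis=0)
representative = int(np.argmin(np.linalg.norm(runs - mean_vec, axis=1)))
print("representative run index:", representative)
```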
Figure 9. The performance curves of the PSO-optimized SmaAt-UNet model that was trained independently five times using five random seeds (42, 123, 456, 789, and 1011). Curves are based on the average results over 60 training epochs. (a) Training and validation loss curves, reflecting the model’s good convergence characteristics and generalization ability. (b) Color-coded curves for the four performance metrics (IoU, Dice, MAE, and mPA), which demonstrate the high accuracy and stability of the optimized model in identifying mural detachment regions.
Figure 10. Comparison of segmentation results for mural pigment layer detachment of the original SmaAt-UNet and the PSO-optimized SmaAt-UNet. The models were independently trained five times using five different random seeds (42, 123, 456, 789, and 1011), and the experimental results with composite evaluation metrics (IoU, Dice, MAE, and mPA) closest to the overall mean were selected for visualization. The image slices are arranged from top to bottom as follows: the original image, the prediction mask of the original SmaAt-UNet, and the prediction mask of the PSO-optimized model.
Table 1. Survey of basic information of 12 selected temples.
Name | Location | Period | Type
Fahai Temple | Shijingshan District, Beijing | Mid-Ming Dynasty | Imperially Commissioned
Zhaohua Temple | Huaian, Hebei Province | Mid-Ming Dynasty | Imperially Commissioned
Pilu Temple | Shijiazhuang, Hebei Province | Mid-Ming Dynasty | Local Folk-Built
Yong’an Temple | Hunyuan, Shanxi Province | Late Ming Dynasty | Local Folk-Built
Yunlin Temple | Yanggao, Shanxi Province | Late Ming Dynasty | Imperially Commissioned
Gongzhu Temple | Fanshi, Shanxi Province | Mid-Ming Dynasty | Local Folk-Built
Duofu Temple | Taiyuan, Shanxi Province | Mid-Ming Dynasty | Local Folk-Built
Foguang Temple | Wutai, Shanxi Province | Early Ming Dynasty | Local Folk-Built
Shengmu Temple | Fenyang, Shanxi Province | Mid-Ming Dynasty | Local Folk-Built
Zishou Temple | Lingshi, Shanxi Province | Mid-Ming Dynasty | Local Folk-Built
Guangsheng Temple | Hongtong, Shanxi Province | Early Ming Dynasty | Local Folk-Built
Jiyi Temple | Xinjiang, Shanxi Province | Mid-Ming Dynasty | Local Folk-Built
Table 2. Features of five representative image segmentation models.
Model | Structure Features | Parameter Size | Key Advantages | Applicable Scenarios
UNet | Symmetric encoder–decoder structure with skip connections | Medium | Good edge-information preservation and clear structure | Identifying medium-scale pigment detachment regions in murals
U2-NetP | Nested residual blocks (RSU) and lightweight design | Small | Strong multi-scale capability and good edge-detection performance | Regions with blurred detachment boundaries and fine textures
SegNet | Max-pooling-index upsampling with a compact structure | Medium | Fewer parameters and good spatial-detail recovery | Lightweight deployment or fast-inference environments
NestedUNet | Multi-layer nested skip connections and dense feature fusion | Large | Strong feature-representation ability; handles complex structures well | Small-scale or multi-morphological pigment detachment regions
SmaAt-UNet | Attention mechanism and depthwise separable convolution | Tiny | Few parameters yet excellent performance, with strong edge preservation | Detachment regions that are semantically sparse but structurally critical
Table 3. Five image segmentation models’ test set performance.
Model | IoU (Mean ± 95% CI) | Dice (Mean ± 95% CI) | MAE (Mean ± 95% CI) | mPA (Mean ± 95% CI)
UNet | 0.6418 ± 0.0125 | 0.7263 ± 0.0151 | 0.0608 ± 0.0032 | 0.9287 ± 0.0084
U2-NetP | 0.3527 ± 0.0213 | 0.4392 ± 0.0287 | 0.4980 ± 0.0315 | 0.6315 ± 0.0246
SegNet | 0.3672 ± 0.0198 | 0.4551 ± 0.0264 | 0.4796 ± 0.0293 | 0.6683 ± 0.0221
NestedUNet | 0.4368 ± 0.0176 | 0.4795 ± 0.0239 | 0.4403 ± 0.0278 | 0.8258 ± 0.0195
SmaAt-UNet | 0.6633 ± 0.0281 | 0.7246 ± 0.0253 | 0.0592 ± 0.0047 | 0.9325 ± 0.0168
Table 4. Performance comparison of SmaAt-UNet on the test set before and after optimization.
Model | IoU (Mean ± 95% CI) | Dice (Mean ± 95% CI) | MAE (Mean ± 95% CI) | mPA (Mean ± 95% CI)
SmaAt-UNet | 0.6633 ± 0.0281 | 0.7246 ± 0.0253 | 0.0592 ± 0.0047 | 0.9325 ± 0.0168
PSO-SmaAt-UNet | 0.7352 ± 0.0295 | 0.7936 ± 0.0238 | 0.0455 ± 0.0039 | 0.9702 ± 0.0125
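For reference, the four metrics reported in Tables 3 and 4 can be computed for a single binary prediction/ground-truth pair as below; the averaging over test images and random seeds used in the study is not reproduced, and treating mPA as the mean of foreground and background pixel accuracy is an assumption consistent with common usage.

```python
# Minimal sketch: IoU, Dice, MAE, and mPA for one binary mask pair.
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = (inter + eps) / (union + eps)
    dice = (2 * inter + eps) / (pred.sum() + gt.sum() + eps)
    mae = np.abs(pred.astype(float) - gt.astype(float)).mean()
    # mPA: mean of per-class (detachment, background) pixel accuracy.
    acc_fg = (inter + eps) / (gt.sum() + eps)
    acc_bg = (np.logical_and(~pred, ~gt).sum() + eps) / ((~gt).sum() + eps)
    mpa = (acc_fg + acc_bg) / 2
    return iou, dice, mae, mpa
```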