Article

Paint Loss Detection and Segmentation Based on YOLO: An Improved Model for Ancient Murals and Color Paintings

1 Key Laboratory of 3D Information Acquisition and Application, Ministry of Education, Capital Normal University, Beijing 100048, China
2 College of Resource Environment and Tourism, Capital Normal University, Beijing 100048, China
3 China Academy of Cultural Heritage, Beijing 100029, China
* Authors to whom correspondence should be addressed.
Heritage 2025, 8(4), 136; https://doi.org/10.3390/heritage8040136
Submission received: 19 February 2025 / Revised: 31 March 2025 / Accepted: 8 April 2025 / Published: 11 April 2025

Abstract

Paint loss is one of the major forms of deterioration in ancient murals and color paintings, and its detection and segmentation are critical for subsequent restoration efforts. However, existing methods still suffer from issues such as incomplete segmentation, patch noise, and missed detections during paint loss extraction, limiting the automation of paint loss detection and restoration. To tackle these challenges, this paper proposes PLDS-YOLO, an improved model based on YOLOv8s-seg, specifically designed for the detection and segmentation of paint loss in ancient murals and color paintings. First, the PA-FPN network is optimized by integrating residual connections to enhance the fusion of shallow high-resolution features with deep semantic features, thereby improving the accuracy of edge extraction in deteriorated areas. Second, a dual-backbone network combining CSPDarkNet and ShuffleNet V2 is introduced to improve multi-scale feature extraction and enhance the discrimination of deteriorated areas. Third, SPD-Conv replaces traditional pooling layers, utilizing space-to-depth transformation to improve the model’s ability to perceive deteriorated areas of varying sizes. Experimental results on a self-constructed dataset demonstrate that PLDS-YOLO achieves a segmentation accuracy of 86.2%, outperforming existing methods in segmentation completeness, multi-scale deterioration detection, and small target recognition. Moreover, the model maintains a favorable balance between computational complexity and inference speed, providing reliable technical support for intelligent paint loss monitoring and digital restoration.

1. Introduction

Murals refer to paintings created on natural or artificial wall surfaces and include forms such as temple murals, tomb murals, and cave murals [1,2]. Color paintings on ancient Chinese architecture, commonly used to decorate the surfaces of building columns, beams, walls, and other structural elements, primarily consist of pigment and a ground layer [3]. As an important form of cultural heritage, these paintings contain extensive social, artistic, religious, and historical information, making them invaluable for cultural research and heritage preservation. However, due to prolonged exposure to weathering, temperature and humidity fluctuations, pollution, microbial erosion, and human-induced damage, the pigment layers of murals and color paintings experience various forms of deterioration, including paint loss, cracking, flaking, and soot deposition [4]. Among these, paint loss is one of the most severe forms of deterioration, significantly compromising the artistic integrity of murals and color paintings, and potentially causing large-scale, irreversible damage [5]. Therefore, accurate identification and segmentation of deteriorated areas have become key challenges in the conservation and restoration of murals and color paintings.
Traditional detection of paint loss primarily relies on visual inspection, which is highly dependent on expert experience, time-consuming, labor-intensive, and inherently subjective, making it difficult to ensure consistency in detection results. To improve detection efficiency, researchers have applied various image processing techniques, including region-growing [6,7], morphological segmentation [8], and clustering-based segmentation [9,10], to the detection of paint loss. Cao et al. [11] proposed a region-growing algorithm based on threshold segmentation, utilizing features such as color, saturation, chromaticity, and brightness of the deteriorated areas. However, determining the optimal threshold requires multiple experiments, and the method is prone to creating voids or over-segmentation. Deng et al. [12] extracted initial disease masks using automatic threshold segmentation, followed by tensor voting and morphological hole filling to generate a complete mask. However, due to the similarity in color between deteriorated areas and the background, this method suffers from false negatives and false positives. Yu et al. [13] combined hyperspectral imaging with simple linear iterative clustering segmentation to identify deteriorated areas; however, background noise reduced the accuracy of disease edge extraction. Although these methods achieve relatively high detection accuracy in specific contexts, they typically rely on manually extracting disease features (such as color, shape, and texture), leading to limited generalization ability and vulnerability to environmental noise.
In recent years, deep learning technologies have achieved remarkable advancements in computer vision, particularly in semantic segmentation and in object detection and segmentation tasks. These tasks require that models not only identify objects or diseased areas within images but also perform precise pixel-level segmentation of these areas, providing more detailed information for subsequent analysis and processing. Semantic segmentation methods such as U-Net [14] have been widely applied to the detection of paint loss in murals. However, traditional semantic segmentation methods often struggle with mural images because of complex backgrounds and the blurred boundaries of deteriorated areas, so the original models usually need to be modified to meet the requirements of paint loss detection. To exploit multi-scale features and strengthen the connection between encoder and decoder features, Wu et al. rebuilt U-Net around a ConvNeXt module [15], a Transformer-based Channel Cross Fusion (CCT) module [16], and a Bidirectional Feature Pyramid Network (BiFPN) [17], enabling multi-scale feature extraction and fusion and achieving accurate crack segmentation in murals. To fully extract disease features and effectively integrate low-level and high-level semantic information, Zhao et al. [18] proposed an ancient mural paint layer delamination detection model based on a residual dual-channel attention U-Net, enhancing feature extraction by incorporating residual connections, channel attention, and spatial attention. To overcome U-Net's limitations in edge localization and detail preservation, multi-scale module construction [19] and detail feature injection mechanisms [20] have been used to recover the edges and texture details of delaminated mural regions. Although these semantic segmentation methods have shown favorable results in specific study areas, they still suffer from small patch noise and holes within the damaged areas, which affect detection accuracy and stability.
Unlike semantic segmentation methods, object detection methods can precisely locate target areas and effectively alleviate the issues encountered in semantic segmentation. Wang et al. [21] proposed an automatic damage detection model built on the ResNet [22] framework, using Faster R-CNN to rapidly detect weathering and spalling damage in historical masonry structures. Mishra et al. [23] optimized the YOLOv5 model structure to achieve rapid and automated detection of discoloration, exposed bricks, cracks, and delamination, addressing the inefficiency of traditional manual inspection. Wu et al. [24] enhanced the YOLOv5 model by incorporating the Ghost Conv module [25] and the Squeeze-and-Excitation (SE) channel attention module [26], and tested the improved model on the Yungang Grottoes mural crack and delamination dataset; compared to the original model, training time and model size were reduced by 36.21% and 46.04%, respectively, while accuracy increased by 1.29%. These studies primarily focus on object detection for weathering, spalling, cracks, and other damage, but they do not achieve pixel-level segmentation of damaged areas, which limits their ability to support subsequent virtual repair of the damage.
In recent years, instance segmentation models have been developed on top of object detection methods to perform pixel-level segmentation of target objects. Instance segmentation models can be categorized into two-stage models, such as Mask R-CNN [27], and one-stage models, such as You Only Look at CoefficienTs (YOLACT) [28]. These methods combine object detection and segmentation techniques to achieve the precise localization and segmentation of target objects, and are now widely applied in fields such as medicine, agriculture, and forestry. Zhang et al. [29] combined F-YOLOv8n-seg with a connected domain analysis (CDA) algorithm to obtain the F-YOLOv8n-seg-CDA model, which effectively lowers the computational cost of the convolution process and achieves a weed segmentation accuracy of 97.2%. Kang et al. [30] developed the ASF-YOLO model on the YOLOv5l segmentation framework, adding a scale sequence feature fusion module to strengthen multi-scale feature extraction; in experiments on a cell dataset, the model achieved a segmentation accuracy of 0.887 and a detection accuracy of 0.91. Khalili et al. [31] combined the YOLOv5 framework with the Segment Anything Model (SAM) to build a lung segmentation model for chest X-ray images, which showed significant robustness, generalization capability, and high computational efficiency. Balasubramani et al. [32] proposed an improved YOLOv8n-seg model that outperformed semantic segmentation models such as SegNet, DeepLabv3, and EchoNet in left ventricle segmentation, producing more accurate and complete masks without small patch noise or abnormal voids. Khan et al. [33] proposed a YOLOv8-based crown segmentation method for fruit trees that incorporates improvements such as dilated convolution (DilConv) [34] and the GELU activation function [35]; in various complex environments, their method achieved more complete crown segmentation than other instance segmentation models.
Although segmentation methods based on YOLO series models have been successfully applied in fields such as agriculture and forestry, the significant variation in shape and size of paint loss areas, as well as their distinct characteristics compared to natural objects, impose higher demands on the adaptability of existing models.
In these damage extraction tasks, YOLO and other deep learning models continue to face several challenges. First, high-quality public datasets are extremely scarce, and segmentation tasks rely on precise annotations, which are costly and heavily dependent on manual input, making it difficult to construct large-scale training datasets and impacting the optimization of model performance. Second, the large variation in the shape and scale of the damage areas often results in incomplete segmentation and missed detection of small targets. Additionally, the damage areas frequently blend with complex backgrounds, which hinders the model’s ability to accurately determine boundaries and recognize regions, thereby decreasing segmentation accuracy. Finally, image segmentation tasks are computationally intensive, and although lightweight models can improve inference efficiency, they typically suffer from a trade-off in accuracy. As such, effectively balancing segmentation accuracy with computational resource consumption has become a critical challenge in improving the practical application of paint loss detection.
To tackle the problems of incomplete boundary segmentation, missed small target detection, and insufficient overall segmentation accuracy in mural paint loss segmentation tasks, this paper proposes PLDS-YOLO, a paint loss detection and segmentation algorithm for murals based on YOLOv8s-seg. The aim is to improve segmentation accuracy, optimize boundary extraction, and reduce missed detections. The main contributions of this paper are as follows:
(1)
An improvement to the PA-FPN network is proposed, incorporating residual connections to enhance segmentation accuracy at paint loss boundaries by fusing shallow high-resolution features with deep semantic features.
(2)
A dual-backbone network architecture is designed, combining CSPDarkNet and ShuffleNet V2, to optimize multi-scale feature extraction capability and enhance the model’s ability to detect paint loss in complex backgrounds.
(3)
The SPD-Conv module is introduced to enhance feature representation, thereby improving the detection rate of small targets and reducing missed detections.
(4)
On a self-constructed paint loss dataset, PLDS-YOLO achieves a segmentation accuracy of 86.2%, which is 2.9% higher than YOLOv8s-seg, while striking a good balance between inference speed and computational complexity.
The paper is organized as follows: Section 2 describes the overall architecture of the PLDS-YOLO model and the improvement strategies implemented; Section 3 covers the construction of the experimental dataset, experimental parameter settings, evaluation metrics, comparative experimental results, and ablation experiments; the discussion and conclusions are presented in Section 4 and Section 5, respectively.

2. Material and Methods

2.1. Standard YOLOv8 Model

YOLOv8 is a single-stage multi-object detection network and an upgraded version of YOLOv5 [36]. Its architecture consists of four main parts: Input, Backbone, Neck, and Head. Key improvements include a new feature extraction module, a better detection head structure, and an anchor-free detection strategy to enhance detection performance and robustness. The Input section receives target images and performs the preprocessing required for model training and inference. The Backbone consists of the C2f module, CBS module, and SPPF module, which are used to extract multi-scale and multi-class target features. Unlike the C3 structure in YOLOv5, the C2f module splits the feature map into two parts, processes them separately, and then merges the results, making full use of both low-resolution and high-resolution information and thereby improving semantic understanding and detail capture.
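To illustrate the split-process-merge idea behind C2f, the following is a schematic PyTorch sketch; the channel bookkeeping and submodule definitions are simplified and do not reproduce the official Ultralytics implementation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with a residual connection (simplified)."""
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(c, c, 3, 1, 1), nn.BatchNorm2d(c), nn.SiLU())
        self.conv2 = nn.Sequential(nn.Conv2d(c, c, 3, 1, 1), nn.BatchNorm2d(c), nn.SiLU())

    def forward(self, x):
        return x + self.conv2(self.conv1(x))

class C2fSketch(nn.Module):
    """Schematic C2f: split the features, refine one half with n bottlenecks,
    keep every intermediate output, and fuse everything with a 1x1 convolution."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        c = c_out // 2
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, 2 * c, 1), nn.BatchNorm2d(2 * c), nn.SiLU())
        self.blocks = nn.ModuleList(Bottleneck(c) for _ in range(n))
        self.cv2 = nn.Sequential(nn.Conv2d((n + 2) * c, c_out, 1), nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))    # split into two halves
        for block in self.blocks:
            y.append(block(y[-1]))               # each bottleneck refines the previous output
        return self.cv2(torch.cat(y, dim=1))     # concatenate all branches and fuse

print(C2fSketch(64, 128)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 80, 80])
```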
The Neck section utilizes the PAN-FPN structure [37] to enhance the deep neural network’s capability to detect multi-scale targets. In deep neural networks, features at different levels serve different roles: high-level features primarily contain semantic information, which helps in object classification, while low-level features are rich in spatial and edge information, facilitating precise target localization. To effectively fuse multi-level features, Lin et al. proposed the Feature Pyramid Network (FPN) [38], which upsamples high-level features and connects them laterally to lower-level features, facilitating top-down information propagation and enhancing the model’s ability to represent targets across different scales.
However, while FPN emphasizes semantic enhancement, the use of low-level localization information is relatively limited, particularly in scenarios involving small targets or blurred boundaries, which can lead to a decrease in detection accuracy. To address this challenge, Liu et al. introduced the Path Aggregation Network (PAN) [37]. PAN builds upon the FPN structure by introducing a bottom-up information path, enabling low-level features to be propagated upwards, complementing high-level semantic features. In summary, PAN creates a bidirectional feature fusion mechanism, enhancing both semantic representation and spatial localization capabilities, enabling the network to handle targets with large scale variations and complex structures more comprehensively and accurately.
The head structure has undergone significant modifications compared to YOLOv5, transitioning from a coupled to a decoupled head, with classification and detection tasks handled separately. YOLOv8 adopts an anchor-free detection method that directly regresses target locations; this reduces the number of candidate boxes, enhances the ability to detect small targets and separate instances, and improves both detection accuracy and boundary segmentation quality. For bounding-box regression, YOLOv8 builds on the CIoU (Complete Intersection over Union) loss and adds a Distribution Focal Loss (DFL) term, further improving regression quality.
The YOLOv8 architecture offers five progressively scaled variants—YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x—distinguished by increasing network depth and channel width. This hierarchical scaling enables a precision–complexity tradeoff, where larger models achieve enhanced detection accuracy at the expense of greater computational demands and extended training durations. Among these variants, YOLOv8s emerges as an optimal choice for edge deployment and real-time applications due to its judicious balance between computational efficiency and detection performance, maintaining high inference speeds while preserving a compact architecture. Building upon this foundation, the YOLOv8s-seg variant extends the framework’s capabilities through integration of an instance segmentation head alongside the existing detection head. This dual-head configuration enables simultaneous object localization and precise pixel-wise mask prediction, effectively unifying detection and segmentation tasks within a single efficient network. The current study adopts YOLOv8s-seg as its baseline architecture to capitalize on its dual-task capability while maintaining the computational advantages inherent in the s-variant design, thereby achieving optimized performance for comprehensive scene understanding tasks requiring both detection and segmentation outputs.

2.2. The Improved PLDS-YOLO Model

2.2.1. Overall Network Structure

This study improves the YOLOv8s-seg structure for the detection and segmentation of paint loss and proposes the PLDS-YOLO model. The model enhances segmentation performance through three key improvements: (1) improving the PA-FPN structure with additional residual skip connections to strengthen the flow of feature information and the fusion of features across scales; (2) creating a dual-backbone network by adding a ShuffleNet V2 branch alongside the original YOLOv8s-seg backbone, which improves feature extraction for paint loss areas; and (3) replacing some CBS modules in the neck network with the SPD-Conv module to enlarge the receptive field, enhance the local texture features of small target areas, and reduce the computational load. Figure 1 shows the overall architecture of PLDS-YOLO. Section 2.2.2 discusses the improved PA-FPN network structure, Section 2.2.3 presents the dual-backbone network, and Section 2.2.4 introduces the SPD-Conv module.

2.2.2. Improved PA-FPN Network Structure

Traditional Feature Pyramid Networks (FPNs) and Path Aggregation Networks (PANets) often lose defect target information during deep feature extraction. This is largely because high-level features are transmitted and fused at reduced resolution, so fine detail is lost, making it harder for the model to recognize small targets and subtle features and lowering detection and segmentation accuracy.
This study therefore proposes an improved Path Aggregation Feature Pyramid Network (PA-FPN) based on the YOLOv8s-seg model, as shown in Figure 2. The core of the improvement is a multi-scale feature pyramid with two additional skip connections that pass shallow feature information from the backbone network into the downsampling path of the neck network, allowing high-resolution, multi-scale features to be fused and richer detail to be retained. A convolution with a stride of 2 is used to unify the feature scales. The specific calculation process is as follows:
$F_{incode} = \mathrm{Concat}\big(\mathrm{Down}(\mathrm{Concat}(F_{in1}, F_{in2})),\ \mathrm{Concat}(F_{in3}, F_{in4})\big)$
where $\mathrm{Concat}(\cdot)$ denotes feature fusion by concatenation, $\mathrm{Down}(\cdot)$ denotes feature map downsampling, $F_{incode}$ denotes the encoded features, and $F_{in1}, F_{in2}, F_{in3}, F_{in4}$ denote the input feature maps.
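As an illustration, a minimal PyTorch sketch of this fusion step is given below, assuming the two shallow inputs share one resolution and the two deeper inputs share the next (halved) resolution; the tensor names and channel counts are ours and are not taken from the released model.

```python
import torch
import torch.nn as nn

class SkipFusion(nn.Module):
    """Fuse two shallow backbone features with two neck features:
    concatenate the shallow pair, downsample it with a stride-2 convolution
    to match the neck resolution, then concatenate with the neck pair."""
    def __init__(self, c_shallow):
        super().__init__()
        # stride-2 convolution used to unify the feature scales
        self.down = nn.Sequential(
            nn.Conv2d(2 * c_shallow, 2 * c_shallow, 3, stride=2, padding=1),
            nn.BatchNorm2d(2 * c_shallow),
            nn.SiLU(),
        )

    def forward(self, f_in1, f_in2, f_in3, f_in4):
        shallow = torch.cat([f_in1, f_in2], dim=1)            # Concat(F_in1, F_in2)
        deep = torch.cat([f_in3, f_in4], dim=1)               # Concat(F_in3, F_in4)
        return torch.cat([self.down(shallow), deep], dim=1)   # F_incode

# example: 80x80 shallow features fused into 40x40 neck features
fuse = SkipFusion(c_shallow=128)
out = fuse(torch.randn(1, 128, 80, 80), torch.randn(1, 128, 80, 80),
           torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40))
print(out.shape)  # torch.Size([1, 768, 40, 40])
```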

2.2.3. CSPDarkNet and ShuffleNet V2 Dual-Backbone Network Model

Although the YOLOv8s-seg model performs excellently in speed and overall accuracy, it still has shortcomings: it may fail to fully capture subtle boundaries and features when handling the complex edge shapes of diseases. This limitation becomes especially evident for mural and color painting images with complex structures and textures. Moreover, although certain fusion mechanisms are used to handle multi-scale features, the integration of global and local information remains insufficient, particularly against complex backgrounds, which makes it considerably harder for the baseline model to understand the surrounding context.
To address the above issues, this study proposes a dual-backbone network model that integrates the advantages of Cross Stage Partial Darknet (CSPDarkNet) and ShuffleNet v2 [39] to improve the feature extraction capability of the backbone network. CSPDarkNet serves as the backbone network in YOLO series models for feature extraction, primarily used to analyze the shape, edges, and texture information of objects in images. CSPDarkNet is widely used in versions like YOLOv5 and YOLOv8. Its structural design reduces computational complexity while maintaining detection accuracy, thus enhancing operational efficiency.
Specifically, a Shuffle_Block module is added on top of the original backbone to form a second backbone path, which processes the input image in parallel with the original backbone. In ShuffleNet V2, the Shuffle_Block splits the input feature map $X \in \mathbb{R}^{H \times W \times C}$ along the channel dimension into two parts, $X = [X_{left}, X_{right}]$, using a Channel Split operation, as shown in Figure 3. The left part $X_{left}$ is passed through directly as an identity mapping, while the right part $X_{right}$ undergoes a series of convolution operations:
$X_{right} = \mathrm{Conv}_{1\times 1}(\mathrm{DWConv}_{3\times 3}(\mathrm{Conv}_{1\times 1}(X_{right})))$
where $\mathrm{Conv}_{1\times 1}$ denotes a 1 × 1 convolution and $\mathrm{DWConv}_{3\times 3}$ denotes a 3 × 3 depthwise separable convolution (DWConv). The processed right part is then concatenated with the left part along the channel dimension:
$X_{concat} = \mathrm{Concat}(X_{left}, X_{right})$
Finally, the Channel Shuffle operation is used to enable information exchange between different channels and reduce computational complexity:
$X_{shuffled} = \mathrm{ChannelShuffle}(X_{concat})$
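The channel split, convolution branch, concatenation, and channel shuffle described above can be summarized in a small PyTorch sketch of a stride-1 Shuffle_Block; the layer widths, normalization, and activation choices here are illustrative assumptions rather than details copied from ShuffleNet V2 or the authors' model.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Reshape-transpose channel shuffle: mixes information across channel groups."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class ShuffleBlock(nn.Module):
    """Stride-1 ShuffleNet V2 unit: channel split, identity left branch,
    1x1 -> 3x3 depthwise -> 1x1 right branch, concatenation, channel shuffle."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 2
        self.right = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False), nn.BatchNorm2d(c),  # depthwise
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x_left, x_right = x.chunk(2, dim=1)                      # Channel Split
        out = torch.cat([x_left, self.right(x_right)], dim=1)    # Concat
        return channel_shuffle(out, groups=2)                    # Channel Shuffle

print(ShuffleBlock(128)(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```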
Because the secondary backbone processes the input image alongside the original backbone, the two backbones adopt different module structures. The original backbone focuses on extracting deep-level features and complex semantic information through its C2f modules. In contrast, the secondary backbone employs the Shuffle_Block module, which uses Channel Split and Channel Shuffle to avoid the computationally intensive group convolution used in ShuffleNet V1. This design enables information exchange between channels, reduces computational complexity, and compensates for the original backbone's shortcomings in fine-grained feature extraction. The dual-backbone design allows the model to attend to both global context and fine-grained local details when processing complex paint loss images, aiding the accurate identification and segmentation of paint loss of varying shapes and sizes and significantly improving the accuracy and efficiency of paint loss detection.

2.2.4. Integration of the SPD-Conv Module into the Neck Network Module

Standard CNN designs often use strided convolutions and pooling layers to reduce the spatial resolution of feature maps, lowering the computational cost and extracting more general features. For larger targets, strided convolutions and pooling discard redundant pixel information while preserving the model's ability to learn discriminative features; for smaller targets with little contextual information, however, the same operations discard a large amount of fine-grained detail [40]. The areas of paint loss exhibit complex shapes, varying sizes, and blurred boundaries. Downsampling reduces the spatial resolution of feature maps and thus loses some detailed information, decreasing the model's sensitivity to small targets and fine features and affecting the accuracy of paint loss region detection and segmentation. For paint loss detection, this means striking a balance between simplifying the representation and preserving fine-grained features, so that the model can detect paint loss areas of varying sizes.
To address this problem, this study introduces the Space-to-Depth Convolution (SPD-Conv) module into the YOLOv8s-seg model, replacing the CBS module used in the neck downsampling stage [38]. SPD-Conv consists of an SPD layer and a stride-free convolutional layer. The SPD layer converts spatial information into the depth dimension, preventing the loss of detailed information caused by traditional strided convolutions. After the SPD layer, a non-strided convolutional layer extracts features without reducing the feature map size, further preserving the fine-grained details of the image and improving the model's ability to capture complex details, especially for smaller paint loss areas.
The SPD-Conv operation consists of two steps: the input feature map first undergoes a space-to-depth preprocessing step, followed by a standard convolution operation. SPD downsamples the intermediate feature map $X$ of size $W \times W \times C_1$ by a factor of $S$, splitting the input feature map into $S^2$ sub-feature maps $f(x, y)$:
$f_{0,0} = X[0:W:S,\ 0:W:S],\quad f_{1,0} = X[1:W:S,\ 0:W:S],\quad \ldots,\quad f_{S-1,0} = X[S-1:W:S,\ 0:W:S]$
$f_{0,1} = X[0:W:S,\ 1:W:S],\quad f_{1,1} = X[1:W:S,\ 1:W:S],\quad \ldots,\quad f_{S-1,1} = X[S-1:W:S,\ 1:W:S]$
$\vdots$
$f_{0,S-1} = X[0:W:S,\ S-1:W:S],\quad f_{1,S-1} = X[1:W:S,\ S-1:W:S],\quad \ldots,\quad f_{S-1,S-1} = X[S-1:W:S,\ S-1:W:S]$
Then, these sub-feature maps are merged along the channel dimension, forming a new intermediate feature map $X'$ of size $\left(\frac{W}{S}, \frac{W}{S}, S^2 C_1\right)$: the spatial dimensions are downsampled by a factor of $S$, while the channel dimension is expanded by a factor of $S^2$, thereby achieving the space-to-depth transformation. When $S = 2$, SPD splits the input feature map into four sub-feature maps, $f_{0,0}, f_{0,1}, f_{1,0}, f_{1,1}$, each of size $\left(\frac{W}{2}, \frac{W}{2}, C_1\right)$, and the merged intermediate feature map has four times the original number of channels, as shown in Figure 4.
Following the SPD transformation, a convolutional layer with $C_2$ filters performs a stride-free convolution [41] on the merged feature map $X'$. The stride is set to 1 so that the convolution kernel covers every pixel of the feature map, retaining as much discriminative feature information as possible, and the final feature map $X''$ of size $\left(\frac{W}{S}, \frac{W}{S}, C_2\right)$ is obtained.
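A minimal PyTorch sketch of this space-to-depth step followed by a stride-free convolution is shown below; the kernel size, normalization, and activation are assumptions for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth with factor S followed by a stride-1 convolution:
    spatial size shrinks by S, channels grow by S^2, then C2 filters are applied."""
    def __init__(self, c1, c2, s=2):
        super().__init__()
        self.s = s
        self.conv = nn.Sequential(
            nn.Conv2d(c1 * s * s, c2, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(c2),
            nn.SiLU(),
        )

    def forward(self, x):
        s = self.s
        # gather the S^2 sub-feature maps f(i, j) = X[i::S, j::S] and stack them on the channel axis
        subs = [x[..., i::s, j::s] for i in range(s) for j in range(s)]
        return self.conv(torch.cat(subs, dim=1))

x = torch.randn(1, 64, 80, 80)
print(SPDConv(64, 128, s=2)(x).shape)  # torch.Size([1, 128, 40, 40])
```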

3. Experiments and Results

3.1. Dataset

Due to the limited availability of ancient mural and color painting image resources and the absence of publicly available standard datasets, this study has collected Dunhuang mural data from the China Academy of Cultural Heritage to build a high-quality damage dataset. These data span key historical periods, including the Northern Wei, Sui, Tang, Song, and Yuan dynasties, covering over a millennium of history. The Dunhuang murals are located inside caves, preventing direct external lighting, which minimizes the impact of lighting changes on the images. The images were captured by a professional photography team, ensuring high-quality data. Additionally, ancient architectural color painting images were collected at the historical site of Peking Union Medical College Research Institute using an iPhone 15, along with the NewTech S18 rail system, tripod, and lighting system to ensure stable image capture. All images were captured at a fixed distance of 1 m using a tripod to ensure consistency, with artificial lighting used to prevent external lighting changes from affecting the images. To ensure the image quality of the dataset, we excluded Dunhuang mural images with damage regions whose boundaries were difficult to distinguish from the background, as well as ancient architectural color painting images blurred by external interference, ultimately resulting in 856 high-quality pigment layer detachment damage images. Figure 5 shows a schematic of the photographic setup for capturing ancient architectural color paintings.
To ensure the broad applicability and credibility of the research findings, we selected ancient murals and ancient architectural color paintings, two representative art forms, as the experimental datasets. Murals are generally made up of three layers, relying mainly on environmental stability, while color paintings are supported by wood, with a more complex ground layer that emphasizes waterproofing and durability. The Dunhuang murals cover several historical periods and showcase significant stylistic differences, reflecting the artistic characteristics of different cultural backgrounds. In contrast, the color paintings from the historical site of Peking Union Medical College Research Institute, created during the Republican era, have a relatively homogeneous style and background.
Although both types of artworks are influenced by factors such as temperature, humidity changes, and material aging, murals are more susceptible to salt crystallization, leading to peeling and detachment of the pigment layer. On the other hand, color paintings, protected by tung oil, experience accelerated pigment detachment due to moisture infiltration and oil film cracking. Moreover, the shrinkage and deformation of the wood can also result in the detachment of the pigment layer in color paintings. By choosing these two datasets, we are able to comprehensively assess the model’s detection ability across different artistic forms and damage conditions, thereby ensuring the broad applicability and practical significance of the research findings.
In this study, 90 images (approximately 10% of the 856 total) were randomly selected to form the basis of the test set and were annotated using LabelMe version 3.16.2. The images were then uniformly cropped to 640 × 640 pixels and augmented with image flipping, rotation, color adjustment, and random noise addition, yielding a final test set of 500 images with an equal split between murals and color paintings.
Regarding the division of the training and validation sets, the remaining 766 images were processed in the same way as the test set, resulting in a total of 5070 images for model training. These images were split into training and validation sets at an 80:20 ratio. During the distribution process, this study ensured an equal representation of murals and color paintings, each constituting 50% of both the training and validation sets, thereby guaranteeing balanced data distribution.
The dataset division strategy employed in this study ensures a balanced distribution of data across the training, validation, and testing phases, thereby providing reliable data support for the subsequent pigment layer detachment damage detection and segmentation tasks.
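As a rough illustration of the augmentation operations listed above (flipping, rotation, color adjustment, and random noise), a torchvision-based pipeline might look like the following; the parameter values are placeholders, and in a segmentation setting the geometric transforms would have to be applied jointly to each image and its mask annotation.

```python
import torch
from torchvision import transforms

def add_gaussian_noise(img, std=0.02):
    """Add mild Gaussian noise to a tensor image scaled to [0, 1]."""
    return (img + torch.randn_like(img) * std).clamp(0.0, 1.0)

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),       # image flipping
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),        # rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # color adjustment
    transforms.ToTensor(),
    transforms.Lambda(add_gaussian_noise),        # random noise addition
])
```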

3.2. Evaluation Metrics

In this study, a variety of key metrics were employed to assess the performance of the PLDS-YOLO model in detecting paint loss and performing image segmentation. Classification outcomes were defined in terms of three states: True Positive (TP), False Positive (FP), and False Negative (FN). Precision (P), Recall (R), mAP@0.5, and mAP@0.5:0.95 were used to evaluate the model's detection and segmentation performance, measuring its accuracy, completeness, and overall effectiveness [24,42].
$P = \frac{TP}{TP + FP}$
$R = \frac{TP}{TP + FN}$
where $TP$ represents the number of disease pixels accurately identified by the model within the test set, $FP$ indicates the number of background or non-disease pixels incorrectly classified as disease pixels, and $FN$ represents the true disease pixels misclassified as background or non-disease pixels by the model.
$AP = \int_{0}^{1} P(R)\, dR$
where $P(R)$ denotes the precision-recall (P-R) curve, with recall on the x-axis and precision on the y-axis. Average Precision (AP) is the area under the P-R curve, calculated as the area enclosed by the curve and the axes in the P-R plot.
$mAP = \frac{\sum_{i=1}^{c} AP(i)}{c}$
mAP@0.5 represents the average precision computed at a fixed Intersection over Union (IoU) threshold of 0.5 and reflects the model's detection and segmentation performance at that threshold. mAP@0.5:0.95 averages the precision over IoU thresholds from 0.5 to 0.95 in steps of 0.05, giving a more complete picture of performance at different levels of overlap. A higher mAP@0.5 indicates higher detection and segmentation accuracy at the fixed threshold of 0.5, while a higher mAP@0.5:0.95 indicates accurate and stable detection across a range of thresholds, reflecting the robustness of the algorithm in different situations. In this study there is only one category, paint loss; thus, c = 1.
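To make the definitions of P, R, and AP above concrete, the short sketch below computes precision and recall from TP/FP/FN counts and approximates AP by numerically integrating a toy P-R curve; it is only an illustration and does not reproduce the interpolated AP computation used by the YOLO evaluation tooling.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from pixel- or instance-level TP/FP/FN counts."""
    p = tp / (tp + fp) if tp + fp > 0 else 0.0
    r = tp / (tp + fn) if tp + fn > 0 else 0.0
    return p, r

def average_precision(recalls, precisions):
    """Area under the P-R curve, AP = integral of P(R) dR, via trapezoidal integration."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order]))

# toy example: a short P-R curve for the single 'paint loss' class (c = 1, so mAP = AP)
recalls = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
precisions = [1.0, 0.95, 0.9, 0.85, 0.7, 0.5]
print(precision_recall(tp=90, fp=10, fn=20))   # (0.9, 0.818...)
print(average_precision(recalls, precisions))  # ~0.83
```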
In evaluating the model's efficiency and speed, this study uses floating point operations (FLOPs) to measure the computational complexity of the model; a higher FLOPs value indicates that more floating-point operations are required per inference, i.e., a greater computational cost. To evaluate inference speed, frames per second (FPS) is used; a higher FPS value indicates faster image processing and higher inference efficiency. Typically, a detection speed greater than 30 fps is required [24]. Considering these metrics together provides a comprehensive picture of the performance of the PLDS-YOLO model in detecting and segmenting paint loss in murals.
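The FPS figures reported in this study come from the authors' own benchmarking; as a generic illustration only, single-image inference speed in PyTorch can be estimated along the following lines (the function name and parameters are ours, and warm-up plus CUDA synchronization are needed for a fair timing).

```python
import time
import torch

def measure_fps(model, imgsz=640, n_warmup=20, n_runs=100, device="cuda"):
    """Estimate average frames per second for single-image inference."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, imgsz, imgsz, device=device)
    with torch.no_grad():
        for _ in range(n_warmup):        # warm-up so CUDA kernels and caches stabilize
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
        torch.cuda.synchronize()
    return n_runs / (time.perf_counter() - start)
```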

3.3. Experimental Setup

This study uses the Windows 10 operating system and an RTX 3090 GPU with 16 GB of memory for model training. The software environment includes CUDA 11.3, Python 3.11, and PyTorch 2.12. In this study, the proposed PLDS-YOLO model and all ablation and comparison models are trained and tested on a server GPU. In the ablation experiments and baseline models, the YOLOv8s-seg model is used as the base model for comparison and analysis. The algorithm’s stochastic gradient descent (SGD) momentum is set to 0.937, the initial learning rate is set to 0.01, weight decay is set to 0.0005, and a warm-up learning rate optimization strategy is used, with a warm-up period of 50 epochs. All experiments in this study are conducted for 500 training epochs.
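For reference, the reported hyperparameters map roughly onto an Ultralytics-style training call as sketched below; the dataset YAML name is hypothetical, and how the authors register their modified PLDS-YOLO architecture is not shown, so this reflects the baseline YOLOv8s-seg setup rather than their exact code.

```python
from ultralytics import YOLO

# baseline segmentation model; the PLDS-YOLO modifications would be defined
# in a custom model YAML, which is not reproduced here
model = YOLO("yolov8s-seg.pt")

model.train(
    data="paint_loss.yaml",   # hypothetical dataset config (train/val paths, class names)
    epochs=500,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,                 # initial learning rate
    momentum=0.937,           # SGD momentum
    weight_decay=0.0005,
    warmup_epochs=50,         # warm-up period reported in the paper
    device=0,
)
```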

3.4. Comparison of Loss and Segmentation Performance Between the Improved Model and the Baseline Model

The variation curves of the loss values for both the training and validation sets after 500 iterations are presented in Figure 6, comparing the baseline YOLOv8s-seg model with the enhanced PLDS-YOLO model. During the first 100 rounds of training, both the training and validation sets experience a rapid decrease in loss values. After reaching 300 iterations, the loss value decreases more gradually, and the curve converges, indicating that the model training process stabilizes. The PLDS-YOLO model has a lower loss compared to the baseline model.
Figure 7 presents a comparative analysis of the baseline model and PLDS-YOLO in the segmentation of paint loss. Figure 7a demonstrates that PLDS-YOLO enhances the completeness of paint loss region boundary segmentation, effectively reducing instances of incomplete segmentation. Figure 7b indicates that, in complex textured backgrounds, the improved model accurately identifies hard-to-distinguish paint loss regions, whereas the baseline model exhibits relatively weaker performance. Figure 7c,d illustrate large paint loss areas, characterized by complex boundary structures, where the baseline model fails to detect certain regions, while PLDS-YOLO achieves more comprehensive paint loss segmentation. Figure 7e,f further reveal that PLDS-YOLO detects a greater number of small-sized paint loss areas. In contrast, the baseline model demonstrates varying degrees of missed detections, indicating that the improved model exhibits higher sensitivity in detecting small paint loss regions. Overall, PLDS-YOLO outperforms the baseline model in both paint loss detection capability and segmentation accuracy. Although some instances of missed detections remain, the proposed model enables more precise extraction of paint loss areas and improves the completeness and stability of the segmentation results.

3.5. Comparison with Different Models

This study performed a series of comparison tests using different object detection and segmentation algorithms, such as Mask R-CNN, SOLOv2, and the YOLO series models (YOLOv5s-seg, YOLOv7-seg, YOLOv8-seg, and YOLOv9-seg). The goal was to evaluate the performance and benefits of PLDS-YOLO in detecting and segmenting paint loss areas. Specifically, Mask R-CNN, as a classic region proposal-based two-stage method, demonstrates excellent performance in terms of accuracy. However, its complex network architecture and high computational cost may become a bottleneck in large-scale data processing, particularly in terms of inference speed and computational efficiency. In contrast, YOLO series models and SOLOv2 adopt an end-to-end single-stage detection and segmentation framework, significantly improving inference speed while effectively optimizing accuracy. Especially in YOLOv5s-seg, YOLOv7-seg, YOLOv8s-seg, and YOLOv9-seg models, their simplified structure and faster inference speed meet the requirements for real-time detection and segmentation.
To ensure the validity of the method comparison, all models were trained on an identical dataset utilizing consistent data preprocessing techniques and standardized evaluation metrics. To verify the stability of the experimental results, each method was independently trained ten times, with the average of these results computed as the final overall score.
Table 1 shows the quantitative metrics of each model in the paint loss detection and segmentation tasks. In terms of overall segmentation performance, the PLDS-YOLO model outperforms the compared models in segmentation accuracy. Compared to Mask R-CNN, mAP@0.5 improved by 33.9%, mAP@0.5:0.95 increased by 21.6%, and FPS is approximately 2.6 times faster than Mask R-CNN. The core task of this study is the extraction of pigment layer detachment damage, with mAP@0.5 serving as the primary evaluation metric and the basis for model ranking. PLDS-YOLO achieved an 86.2% result on this metric, surpassing the second-ranked YOLOv7-seg, which scored 84.0%. Compared to YOLOv7-seg, PLDS-YOLO improved segmentation accuracy by 2.2 percentage points while reducing model complexity. The PLDS-YOLO model outperforms the baseline YOLOv8s-seg model in terms of accuracy, recall, mAP@0.5, and mAP@0.5:0.95. This indicates that PLDS-YOLO is more effective in detecting and segmenting objects.
Combining the comparison results of GFLOPs and FPS, PLDS-YOLO has a GFLOPs value of 49.5, which is much lower than Mask R-CNN (258.2), SOLOv2 (217), as well as YOLOv7-seg (141.9) and YOLOv9-seg (145.7). This indicates that the model has lower computational complexity, reducing its dependence on hardware resources and making it suitable for deployment in hardware-limited environments, such as mobile devices. The FPS of PLDS-YOLO reaches 75.6, significantly higher than that of Mask R-CNN (28.5 FPS) and SOLOv2 (35.8 FPS), and it also surpasses the more computationally complex YOLOv7-seg and YOLOv9-seg. This indicates that the proposed model can meet the real-time requirements of tasks while providing fast responses. Overall, PLDS-YOLO strikes a balance between segmentation accuracy and computational complexity, improving both inference speed and segmentation performance without significantly increasing computational cost.
Figure 8 presents the segmentation results of paint loss detected by various models. As shown in Figure 8a,b, the paint loss areas closely resemble the background. In these cases, PLDS-YOLO demonstrates a relatively high accuracy in segmenting the paint loss areas, while other models struggle to differentiate between the paint loss areas and the background, resulting in varying degrees of missed detections. As shown in Figure 8c,d, the paint loss areas are larger and have complex edges. All five comparison methods fail to fully detect the complete boundaries of the larger paint loss areas. In contrast, PLDS-YOLO not only detects the complete boundaries of the larger paint loss areas but also identifies more of the smaller ones, although some missed detections remain. As shown in Figure 8e, the sizes of the paint loss areas vary. Larger areas, with higher contrast against the background, are accurately detected by all models. However, smaller paint loss areas, due to their lower contrast and complex shapes, result in varying degrees of missed detection across all models. As shown in Figure 8f, the paint loss areas are small and have complex edges. In this case, different models exhibit varying degrees of missed detection, with some performing poorly in segmenting complex paint loss boundaries; PLDS-YOLO, however, detects more of the small paint loss areas and achieves better boundary segmentation accuracy. Overall, PLDS-YOLO demonstrates superior segmentation performance for paint loss areas with complex shapes, low contrast, and varying scales. Compared to other models, PLDS-YOLO effectively reduces missed detections of small paint loss areas and achieves complete segmentation of larger ones.

3.6. Ablation Experiments

A series of ablation experiments were conducted in this study to evaluate the impact of the improved PA-FPN, dual-backbone model, and SPD-Conv modules on the performance of the PLDS-YOLO model. These experiments sequentially integrated each module into the baseline model to quantify their individual contributions. This approach provides a systematic evaluation of each component’s effectiveness in detecting and segmenting paint loss, demonstrating the potential for further model enhancements.
Table 2 and Table 3 show the impact of different improved modules on the detection and segmentation tasks. The original YOLOv8s-seg model achieves an average detection accuracy (Box mAP@0.5) of 91.7% and an average segmentation accuracy (Mask mAP@0.5) of 83.3% in detecting and segmenting paint loss. By gradually incorporating the improved modules, the model’s performance has been significantly enhanced.
Modifying the PA-FPN structure in the base model improved the detection task by 0.8% in Box mAP@0.5 and 2.6% in Box mAP@0.5:0.95, and the segmentation task by 0.9% in Mask mAP@0.5 and 1.2% in Mask mAP@0.5:0.95. This indicates that, by adding residual skip connections, the improved PA-FPN module effectively combines low-level detail information with high-level semantic information, improving small-target detection and segmentation accuracy.
When the dual-backbone module was added on top of the improved PA-FPN, Box mAP@0.5:0.95 in the detection task rose further to 72.7% (a 2.8% improvement over the baseline), and Mask mAP@0.5:0.95 in the segmentation task increased from 51.8% to 52.2%. By combining the strengths of CSPDarkNet and ShuffleNet V2, this module demonstrated a better ability to capture both global context and fine detail in segmentation tasks involving defects with complicated textures.
Finally, after adding the SPD-Conv module, the model performance reached its highest level. In the detection task, box mAP@0.5 and Box mAP@0.5:0.95 increased to 93.3% and 74.7%, respectively; in the segmentation task, Mask mAP@0.5 and Mask mAP@0.5:0.95 increased to 86.2% and 53.1%, respectively. SPD-Conv expands the receptive field by converting spatial information into depth information, effectively capturing small targets and detail information that is easily lost in traditional downsampling under complex backgrounds.
Overall, models under different improvement strategies are all better than the original YOLOv8s-seg model. The final improved PLDS-YOLO model, compared to the baseline model, showed improvements of 1.6%, 4.8%, 2.9%, and 2.5% in Box mAP@0.5, Box mAP@0.5:0.95, Mask mAP@0.5, and Mask mAP@0.5:0.95, respectively. The results demonstrate that the proposed strategies, which incorporate multiple layers of information, effectively enhance the model’s ability to capture both global and local context. These improvements significantly boost the model’s performance in detecting and segmenting paint loss. Therefore, the suggested strategies prove to be valuable for intelligent detection in the protection of cultural heritage.
Figure 9 presents the P-R curves for paint loss detection and segmentation under different improvement strategies. As the model is optimized, the area under the P-R curve progressively increases (Figure 9a,b), indicating that the proposed strategies effectively enhance the model's ability to detect and segment objects. The PLDS-YOLO model encloses the largest area, indicating that it performs best in the target detection and segmentation tasks. As illustrated in Figure 9b, in the low-recall region (recall < 0.3), the precision of the improved model declines more slowly than that of the baseline model, suggesting that the improved model is more stable in high-confidence predictions and produces fewer false positives and missed detections. In the high-recall region, the P-R curve of the improved model generally lies above that of the baseline model. The PLDS-YOLO model has the largest P-R curve area, meaning it is the most accurate and reliable at detecting instances, especially with complicated backgrounds and small targets.

4. Discussion

Given the diversity of ancient mural and color painting styles across China, and the varying morphological characteristics of pigment layer detachment damage in different regions, this study chose Dunhuang murals and color paintings from the historical site of Peking Union Medical College Research Institute as the research subjects. While data augmentation methods have helped mitigate the limitation of the training dataset, the limited amount of collected mural data may introduce some redundancy in the augmented data, potentially affecting the model’s performance and generalization capability. Therefore, future studies will focus on further expanding mural and color painting datasets from diverse regions, increasing the dataset size and optimizing model training to improve the model’s adaptability and generalization capability.
In terms of model construction, the PLDS-YOLO model, which is an improvement based on YOLOv8, achieves a good balance between accuracy and efficiency, but there is still room for further improvement in accuracy. At present, the method proposed in this study has not been validated on newer versions of the YOLO series. However, previous research has applied defect detection, such as building damage detection, using YOLOv10 [43,44] and YOLOv11 [45], and has achieved promising results. In terms of methodological improvements, several innovative modules, including multiple aggregate trail attention mechanisms [46] and multiple-dimension attention [47], have been implemented in damage detection tasks and have effectively enhanced model performance. Future research will explore transferring the improvements proposed in this study to newer versions of the YOLO series and incorporating the latest innovative modules to optimize the model structure. This approach will improve accuracy while achieving model lightweighting, thus enhancing efficiency and performance in real-world applications.
The primary objective of this study is to develop a reliable model capable of accurately extracting pigment layer detachment damage areas, thereby providing a solid technological foundation for damage detection in murals, color paintings, and other cultural heritage artifacts of various artistic styles. However, we emphasize that the developed model should serve as an auxiliary tool for damage detection, not as a replacement for expert evaluation. Compared to expert-driven damage detection methods, automated technologies offer significant advantages in processing large-scale image data, improving detection consistency, and reducing subjective bias, making them particularly effective for tasks such as preliminary damage screening and monitoring changes. Nevertheless, manual evaluation continues to play an irreplaceable role in analyzing the semantic information of damage, understanding the historical and cultural context, and making complex repair decisions. Thus, automated detection technologies and expert evaluation will complement each other in the workflow of cultural heritage preservation.
In future image restoration tasks, alongside the powerful capabilities of the model, the expertise of specialists will also be essential to ensure the fine detail and accuracy of the restoration process. Particularly in the restoration of large-scale pigment layer detachment, the integration of the model with expert experience will enable efficient and precise cultural heritage restoration.

5. Conclusions

This paper introduces PLDS-YOLO, an improved detection and segmentation model based on YOLOv8s-seg, for extracting paint loss areas in murals and color paintings. The proposed model enhances segmentation completeness and recognition accuracy by optimizing the PA-FPN structure, integrating a dual-backbone network combining CSPDarkNet and ShuffleNet V2, and incorporating the SPD-Conv module.
Experimental results on a self-constructed dataset show that PLDS-YOLO achieves a segmentation accuracy of 86.2%, surpassing existing methods in paint loss detection, segmentation completeness, and multi-scale recognition. Notably, it significantly reduces missed detections in small targets. Additionally, the YOLO-based detection and segmentation strategy ensures more coherent and complete segmentation results, effectively mitigating common issues in traditional and semantic segmentation methods, such as patchy noise and abnormal voids, thereby enhancing model robustness.
Furthermore, PLDS-YOLO strikes a favorable balance between computational complexity and inference speed, making it highly suitable for real-world applications. Ablation experiments validate the effectiveness of each proposed improvement by incrementally integrating them into the baseline model, demonstrating their cumulative contribution to model performance. Overall, PLDS-YOLO demonstrates strong effectiveness in paint loss detection and segmentation, providing reliable technical support for intelligent monitoring and digital restoration of mural and color painting.

Author Contributions

Conceptualization, Y.C., A.Z. and F.G.; methodology, Y.C., A.Z. and F.G.; validation, Y.C., A.Z. and F.G.; formal analysis, Y.C.; software, Y.C.; data curation, Y.C., F.G., R.W. and J.G.; writing—original draft preparation, Y.C. and J.S.; writing—review and editing, Y.C., A.Z. and J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Key R&D Program of China (2023YFF0906700, 2023YFF0906704) and the National Natural Science Foundation of China (41571369).

Data Availability Statement

The data used in this study can be obtained from the corresponding author.

Acknowledgments

The authors thank the Mural Research Team from the China Cultural Heritage Research Institute for their assistance.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, J. A study on the display design of Chinese monastery murals-taking the frescoes of Yongle Palace as an example. China Natl. Exhib. 2020, 6, 151–153, (In Chinese with an English Abstract). [Google Scholar]
  2. Wang, W.; Ma, Z.; Li, Z.; Yang, T.; Fu, Y. Consolidating of detached murals through grouting techniques. Sci. Conserv. Archaeol. 2006, 18, 52–59, (In Chinese with an English Abstract). [Google Scholar] [CrossRef]
  3. Li, J. The composition of pigments of decorative paintings on ancient buildings of Qufu’s temple Confucius. China Cult. Herit. Sci. Res. 2014, 4, 86–89. [Google Scholar]
  4. Yue, Y. Condition surveys of deterioration and research of wall paintings in Maijishan cave-temple. Study Nat. Cult. Herit. 2019, 4, 127–131. [Google Scholar]
  5. Liu, X.; Hou, M.; Dong, Y.; Wang, W.; Lv, S. Extraction and evaluation of the disease of the mural paint loss. J. Spatiotemporal. Inf. 2019, 26, 22–28. [Google Scholar]
  6. Cao, J.; Li, Y.; Cui, H.; Zhang, Q. The application of improved region growing algorithm for the automatic calibration of shedding disease on temple murals. J. Xinjiang Univ. 2018, 35, 429–436, (In Chinese with an English Abstract). [Google Scholar] [CrossRef]
  7. De Rosa, A.; Bonacehi, A.; Cappellini, V.; Barni, M. Image segmentation and region filling for virtual restoration of artworks. In Proceedings of the 2001 International Conference on Image Processing (Cat. No. 01CH37205), Thessaloniki, Greece, 7–10 October 2001; pp. 562–565. [Google Scholar]
  8. Meng, W.; Hui, W.; Wen, L. Research on multi-scale detection and image inpainting of Tang dynasty tomb murals. Comput. Eng. Appl. 2016, 52, 169–174, (In Chinese with an English Abstract). [Google Scholar]
  9. Ren, X.; Deng, L. Mural inpainting based on color clustering image segmentation and the improved FMM algorithm. Comput. Eng. Sci. 2014, 36, 298–302, (In Chinese with an English Abstract). [Google Scholar]
  10. Sun, M.; Zhang, D.; Wang, Z.; Ren, J.; Chai, B.; Sun, J. What’s Wrong with the Murals at the Mogao Grottoes: A Near-Infrared Hyperspectral Imaging Method. Sci. Rep. 2015, 5, 14371. [Google Scholar] [CrossRef]
  11. Cao, J.; Li, Y.; Cui, H.; Zhang, Q. Improved region growing algorithm for the calibration of flaking deterioration in ancient temple murals. Herit. Sci. 2018, 6, 67. [Google Scholar] [CrossRef]
  12. Deng, X.; Yu, Y. Automatic calibration of crack and flaking diseases in ancient temple murals. Herit. Sci. 2022, 10, 163. [Google Scholar] [CrossRef]
  13. Yu, K.; Hou, Y.; Fu, Y.; Ni, W.; Zhang, Q.; Wang, J.; Peng, J. Automatic labeling framework for paint loss disease of ancient murals based on hyperspectral image classification and segmentation. Herit. Sci. 2024, 12, 192. [Google Scholar] [CrossRef]
  14. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; proceedings, part III 18. pp. 234–241. [Google Scholar]
  15. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 11976–11986. [Google Scholar]
  16. Wang, H.; Cao, P.; Wang, J.; Zaiane, O.R. Uctransnet: Rethinking the skip connections in u-net from a channel-wise perspective with transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 22 February–1 March 2022; pp. 2441–2449. [Google Scholar]
  17. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  18. Zhao, H.; Yu, Y.; Chen, A.; Ni, X.; Wang, X. Ancient Mural Disease Detection Based on Residual Dual Channel Attention U-Net. J. Comput.-Aided Des. Comput. Graph. 2024, 1–14. Available online: https://link.cnki.net/urlid/11.2925.TP.20240419.1159.002 (accessed on 1 April 2025). (In Chinese with an English Abstract).
  19. Lv, S.; Wang, S.; Hou, M.; Gu, M.; Wang, W. Extraction of paint loss disease of murals based on improved U-Net. Geomat. World 2022, 29, 69–74. [Google Scholar]
  20. Yu, K.; Li, Y.; Yan, J.; Xie, R.; Zhang, E.; Liu, C.; Wang, J. Intelligent labeling of areas of wall painting with paint loss disease based on multi-scale detail injection U-Net. In Proceedings of the Optics for Arts, Architecture, and Archaeology VIII, Online, 8 July 2021; pp. 37–44. [Google Scholar]
  21. Wang, N.; Zhao, X.; Zhao, P.; Zhang, Y.; Zou, Z.; Ou, J. Automatic damage detection of historic masonry buildings based on mobile deep learning. Autom. Constr. 2019, 103, 53–66. [Google Scholar] [CrossRef]
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 1 July 2016; pp. 770–778. [Google Scholar]
  23. Mishra, M.; Barman, T.; Ramana, G.V. Artificial intelligence-based visual inspection system for structural health monitoring of cultural heritage. J. Civ. Struct. Health Monit. 2024, 14, 103–120. [Google Scholar] [CrossRef]
  24. Wu, L.; Zhang, L.; Shi, J.; Zhang, Y.; Wan, J. Damage detection of grotto murals based on lightweight neural network. Comput. Electr. Eng. 2022, 102, 108237. [Google Scholar] [CrossRef]
  25. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1580–1589. [Google Scholar]
  26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  27. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  28. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9157–9166. [Google Scholar]
  29. Zhang, D.; Lu, R.; Guo, Z.; Yang, Z.; Wang, S.; Hu, X. Algorithm for Locating Apical Meristematic Tissue of Weeds Based on YOLO Instance Segmentation. Agronomy 2024, 14, 2121. [Google Scholar] [CrossRef]
  30. Kang, M.; Ting, C.-M.; Ting, F.F.; Phan, R.C.-W. ASF-YOLO: A novel YOLO model with attentional scale sequence fusion for cell instance segmentation. Image Vis. Comput. 2024, 147, 105057. [Google Scholar] [CrossRef]
  31. Khalili, E.; Priego-Torres, B.; Leon-Jimenez, A.; Sanchez-Morillo, D. Automatic lung segmentation in chest X-ray images using SAM with prompts from YOLO. IEEE Access 2024, 12, 122805–122819. [Google Scholar] [CrossRef]
  32. Balasubramani, M.; Sung, C.-W.; Hsieh, M.-Y.; Huang, E.P.-C.; Shieh, J.-S.; Abbod, M.F. Automated Left Ventricle Segmentation in Echocardiography Using YOLO: A Deep Learning Approach for Enhanced Cardiac Function Assessment. Electronics 2024, 13, 2587. [Google Scholar] [CrossRef]
  33. Khan, Z.; Liu, H.; Shen, Y.; Zeng, X. Deep learning improved YOLOv8 algorithm: Real-time precise instance segmentation of crown region orchard canopies in natural environment. Comput. Electron. Agric. 2024, 224, 109168. [Google Scholar] [CrossRef]
  34. Uchida, K.; Tanaka, M.; Okutomi, M. Coupled convolution layer for convolutional neural network. Neural Netw. 2018, 105, 197–205. [Google Scholar] [CrossRef] [PubMed]
  35. Deepika, S.; Arunachalam, V. Design of high performance and energy efficient convolution array for convolution neural network-based image inference engine. Eng. Appl. Artif. Intell. 2023, 126, 106953. [Google Scholar] [CrossRef]
  36. Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 1 April 2025).
  37. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  38. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  39. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  40. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Grenoble, France, 19–23 September 2022; pp. 443–459. [Google Scholar]
  41. Yu, Z.; Lei, Y.; Shen, F.; Zhou, S.; Yuan, Y. Research on identification and detection of transmission line insulator defects based on a lightweight YOLOv5 network. Remote Sens. 2023, 15, 4552. [Google Scholar] [CrossRef]
  42. Handelman, G.S.; Kok, H.K.; Chandra, R.V.; Razavi, A.H.; Huang, S.; Brooks, M.; Lee, M.J.; Asadi, H. Peering into the black box of artificial intelligence: Evaluation metrics of machine learning methods. Am. J. Roentgenol. 2019, 212, 38–43. [Google Scholar] [CrossRef]
  43. Liao, L.; Song, C.; Wu, S.; Fu, J. A Novel YOLOv10-Based Algorithm for Accurate Steel Surface Defect Detection. Sensors 2025, 25, 769. [Google Scholar] [CrossRef]
  44. Raushan, R.; Singhal, V.; Jha, R.K. Damage detection in concrete structures with multi-feature backgrounds using the YOLO network family. Autom. Constr. 2025, 170, 105887. [Google Scholar] [CrossRef]
  45. Liu, J.; Zhao, J.; Cao, Y.; Wang, Y.; Dong, C.; Guo, C. Road manhole cover defect detection via multi-scale edge enhancement and feature aggregation pyramid. Sci. Rep. 2025, 15, 10346. [Google Scholar] [CrossRef]
  46. Wang, X.; Gao, H.; Jia, Z.; Zhao, J. A road defect detection algorithm incorporating partially transformer and multiple aggregate trail attention mechanisms. Meas. Sci. Technol. 2024, 36, 026003. [Google Scholar] [CrossRef]
  47. Huang, H.; Ma, M.; Bai, S.; Yang, L.; Liu, Y. Automatic crack defect detection via multiscale feature aggregation and adaptive fusion. Autom. Constr. 2025, 170, 105934. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of PLDS-YOLO.
Figure 2. Improved PA-FPN network structure. Red arrows denote residual skip connections, fusing low-level features into deep layers.
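To make the residual skip connections in Figure 2 concrete, the snippet below is a minimal, illustrative PyTorch sketch of fusing a shallow high-resolution feature map into a deeper PA-FPN level. The module name ResidualSkipFusion, the channel sizes, and the nearest-neighbor resampling are assumptions for illustration only, not the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualSkipFusion(nn.Module):
    """Illustrative sketch: fuse a shallow, high-resolution feature map into a
    deeper PA-FPN level through a residual-style skip connection (red arrows in Figure 2)."""

    def __init__(self, shallow_ch: int, deep_ch: int):
        super().__init__()
        # 1x1 convolution aligns the shallow map's channel count with the deep level.
        self.align = nn.Conv2d(shallow_ch, deep_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(deep_ch)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # Resize the shallow map to the deep map's spatial size, then add it
        # element-wise so edge detail reaches the semantically rich layers.
        skip = F.interpolate(self.align(shallow), size=deep.shape[-2:], mode="nearest")
        return deep + self.bn(skip)


if __name__ == "__main__":
    p2 = torch.randn(1, 128, 80, 80)   # shallow, high-resolution level
    p4 = torch.randn(1, 512, 20, 20)   # deep, low-resolution level
    fused = ResidualSkipFusion(128, 512)(p2, p4)
    print(fused.shape)                 # torch.Size([1, 512, 20, 20])
```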
Figure 3. The basic module of ShuffleNet V2.
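The ShuffleNet V2 basic module in Figure 3 combines a channel split, a lightweight convolution branch, concatenation, and a channel shuffle. The following PyTorch sketch illustrates the stride-1 unit; the class name ShuffleV2Unit and the layer hyperparameters are assumptions for illustration, not the exact configuration used in the dual backbone.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    # Interleave channels so information mixes across the two branches.
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class ShuffleV2Unit(nn.Module):
    """Sketch of the stride-1 ShuffleNet V2 basic unit: channel split, a lightweight
    1x1 -> 3x3 depthwise -> 1x1 branch, concatenation, then channel shuffle."""

    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),  # depthwise conv
            nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)   # channel split: one identity branch, one conv branch
        return channel_shuffle(torch.cat((x1, self.branch(x2)), dim=1))


# Example: y = ShuffleV2Unit(256)(torch.randn(1, 256, 40, 40))  # -> (1, 256, 40, 40)
```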
Figure 4. Structure diagram of SPD-Conv with S = 2. ∗ represents the convolution operator with a stride of 1 on X.
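As depicted in Figure 4, SPD-Conv replaces strided convolution or pooling with a space-to-depth rearrangement followed by a stride-1 convolution, so that no pixel information is discarded when the resolution is halved. The PyTorch sketch below illustrates the idea for S = 2; the module name SPDConv and the choice of a 3 × 3 convolution with SiLU activation are illustrative assumptions rather than the exact layer used in PLDS-YOLO.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Sketch of SPD-Conv: a space-to-depth rearrangement with scale S followed by a
    stride-1 convolution, so small deteriorated areas are not lost to downsampling."""

    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # Stride-1 convolution on the depth-expanded feature map (the "*" in Figure 4).
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch * scale * scale, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.scale
        # Space-to-depth: every S x S spatial block becomes S*S extra channels,
        # halving H and W while preserving all pixel information.
        x = torch.cat([x[..., i::s, j::s] for i in range(s) for j in range(s)], dim=1)
        return self.conv(x)


# Example: a 64-channel 160x160 map becomes a 128-channel 80x80 map.
# y = SPDConv(64, 128)(torch.randn(1, 64, 160, 160))  # -> (1, 128, 80, 80)
```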
Figure 5. Schematic of the photographic setup for capturing ancient architectural color paintings.
Figure 6. Comparison of loss curves.
Figure 7. Comparison of results between YOLOv8s-seg and PLDS-YOLO. (c,e) show color paintings, while (a,b,d,f) show Dunhuang murals. Yellow boxes highlight regions missed by YOLOv8s-seg but detected by PLDS-YOLO.
Figure 8. Comparison results of different models. (d,e) show color paintings, while (a,b,c,f) display Dunhuang murals.
Figure 9. Effect diagram of different improvement strategies. (a) Detection P-R curve; (b) Segmentation P-R curve.
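Figure 9 reports detection and segmentation precision-recall (P-R) curves. As a hedged illustration of how such curves are typically accumulated from confidence-ranked predictions, the helper below is a minimal sketch; the function pr_curve and its inputs are hypothetical and do not reproduce the authors’ evaluation code.

```python
import numpy as np

def pr_curve(confidences: np.ndarray, is_tp: np.ndarray, n_gt: int):
    """Accumulate a precision-recall curve from confidence-ranked predictions.
    `is_tp` flags predictions whose box/mask IoU with an unmatched ground-truth
    region exceeds the threshold (e.g., 0.5); `n_gt` is the number of ground-truth regions."""
    order = np.argsort(-confidences)       # highest confidence first
    tp_flags = is_tp.astype(bool)[order]
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(~tp_flags)
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / max(n_gt, 1)
    return precision, recall


# Example with three predictions, two of which are correct:
# p, r = pr_curve(np.array([0.9, 0.8, 0.3]), np.array([1, 0, 1]), n_gt=2)
```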
Table 1. Comparison results of different models.

Model          mAP@0.5/%   mAP@0.5:0.95/%   FLOPs/G   FPS
Mask R-CNN     52.3        31.5             258.2     28.5
SOLOv2         63.6        40.3             217       35.8
YOLOv5s-seg    79.2        44.6             25.7      108.6
YOLOv7-seg     84          50.1             141.9     56.8
YOLOv8s-seg    83.3        50.6             42.4      82.6
YOLOv9-seg     81.6        48.5             145.7     51.4
PLDS-YOLO      86.2        53.1             49.5      75.6
Note: Bold numbers represent the best performance, achieved by PLDS-YOLO.
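The FPS values in Table 1 depend on how inference is timed. A common way such throughput figures are measured is to average the single-image forward-pass time over repeated runs, as in the rough sketch below; the function name measure_fps, the 640-pixel input size, and the warm-up/run counts are illustrative assumptions, not the protocol used in this study.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, imgsz: int = 640, n_runs: int = 100) -> float:
    """Rough sketch of a common FPS measurement: average the forward-pass time
    of a single image over repeated runs after a short warm-up."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(1, 3, imgsz, imgsz, device=device)
    for _ in range(10):                  # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return n_runs / (time.perf_counter() - start)
```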
Table 2. Detection results under different improvement strategies.

Improved PA-FPN   Dual-Backbone Model   SPD-Conv   Box (P)   Box (R)   Box (mAP@0.5)   Box (mAP@0.5:0.95)
-                 -                     -          0.909     0.824     0.917           0.699
✓                 -                     -          0.915     0.835     0.925           0.725
✓                 ✓                     -          0.924     0.841     0.929           0.727
✓                 ✓                     ✓          0.927     0.853     0.933           0.747
Note: ✓ indicates the strategy is applied, while - indicates it is not. Bold numbers indicate the best result for each metric.
Table 3. Segmentation results under different improvement strategies.

Improved PA-FPN   Dual-Backbone Model   SPD-Conv   Mask (P)   Mask (R)   Mask (mAP@0.5)   Mask (mAP@0.5:0.95)
-                 -                     -          0.868      0.748      0.833            0.506
✓                 -                     -          0.877      0.766      0.842            0.518
✓                 ✓                     -          0.885      0.775      0.854            0.522
✓                 ✓                     ✓          0.892      0.784      0.862            0.531
Note: ✓ indicates the strategy is applied, while - indicates it is not. Bold numbers indicate the best result for each metric.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
