Article

AI-Driven Recognition and Sustainable Preservation of Ancient Murals: The DKR-YOLO Framework

1 Asia-Europe Institute, Universiti Malaya, Kuala Lumpur 50603, Malaysia
2 School of Resources and Environmental Engineering, Ludong University, Yantai 264025, China
3 School of Finance and Tourism, Chongqing Vocational Institute of Engineering, Chongqing 402260, China
* Author to whom correspondence should be addressed.
Heritage 2025, 8(10), 402; https://doi.org/10.3390/heritage8100402
Submission received: 3 August 2025 / Revised: 5 September 2025 / Accepted: 17 September 2025 / Published: 25 September 2025
(This article belongs to the Special Issue AI and the Future of Cultural Heritage)

Abstract

This paper introduces DKR-YOLO, an advanced deep learning framework designed to empower the digital preservation and sustainable management of ancient mural heritage. Building upon YOLOv8, DKR-YOLO integrates innovative components—including the DySnake Conv layer for refined feature extraction and an Adaptive Convolutional Kernel Warehouse to optimize representation—addressing challenges posed by intricate details, diverse artistic styles, and mural degradation. The network’s architecture further incorporates a Residual Feature Augmentation (RFA)-enhanced FPN (RE-FPN), prioritizing the most critical visual features and enhancing interpretability. Extensive experiments on mural datasets demonstrate that DKR-YOLO achieves a 43.6% reduction in FLOPs, a 3.7% increase in accuracy, and a 5.1% improvement in mAP compared to baseline models. This performance, combined with an emphasis on robustness and interpretability, supports more inclusive and accessible applications of AI for cultural institutions, thereby fostering broader participation and equity in digital heritage preservation.

1. Introduction

Across the globe, the conservation and study of wall paintings have evolved within an established international framework that treats such works as part of humanity’s shared heritage [1]. The UNESCO World Heritage Convention codified principles for safeguarding cultural properties of “outstanding universal value,” stimulating cross-border coordination and standards for documentation, condition assessment, and intervention. Building on that policy foundation, major conservation programs have promoted interdisciplinary methods that combine materials science, art history, and preventive conservation [2]. In parallel, research infrastructures have lowered barriers to advanced instrumentation, shared datasets, and expert services, accelerating comparative studies and reproducible workflows across institutions and countries [3].
Multi-sensor capture pipelines now enable in situ, non-destructive imaging and multi-layer modeling, supplying rich inputs for downstream analysis [4]. At the same time, computer vision has matured from proof-of-concepts to operational tools in galleries, libraries, archives, and museums—supporting classification, retrieval, and monitoring at scale [5]. Recent surveys and policy briefs highlight how deep learning advances improve access to visual archives, automate cataloging, and assist decision-making in conservation, while also underscoring domain-specific challenges such as style variance, aging-related degradation, and data imbalance [6]. Against this international landscape, Chinese mural heritage—especially the Dunhuang corpus—offers a uniquely demanding testbed: millennia-spanning styles, vast iconographic vocabularies, and extensive digitization efforts converge to create both unprecedented opportunities and acute information-retrieval bottlenecks [7]. It is within this context that we turn to Dunhuang, where improving recognition, retrieval, and organization of mural elements is pivotal for scholarship, conservation planning, and public engagement [8].
The Dunhuang grottoes, first excavated in the second year of the Jian Yuan period during the Former Qin dynasty, encapsulate over a millennium of historical and cultural integration among various ethnic groups, ultimately forming a distinct Chinese Buddhist art system [9]. The Dunhuang murals are indispensable for comprehensively understanding the history of Chinese art and play a crucial role in fostering modern artistic innovation [10]. With the rapid development of information technology and the accelerated digitization of cultural heritage, restoration techniques and digital collection methods for Dunhuang murals have significantly advanced [11]. Libraries, archives, and museums now house extensive collections of Dunhuang mural images. However, the abstract visual forms and obscure semantic content of these images often present challenges for users attempting to construct precise search queries, resulting in difficulties in retrieving relevant images and low resource utilization [12]. These challenges impede researchers’ ability to study Dunhuang murals effectively and dampen public enthusiasm for exploring Dunhuang culture. The integration of computer image recognition technology into the search domain for Dunhuang murals presents a promising solution to enhance resource acquisition efficiency [13]. By improving the restoration and protection processes, as well as optimizing the digital collection and storage systems, this technology can contribute significantly to the safeguarding and transmission of Dunhuang’s cultural heritage. Specifically, it supports the preservation of these murals by improving resource utilization and advancing information dissemination, ultimately ensuring the longevity and accessibility of this invaluable cultural legacy.

1.1. Research on Visual Classification and Detection

In recent years, research on object detection and scene recognition has mainly advanced along three lines: multi-scale feature modeling, lightweight design, and deployment feasibility. A representative direction is the introduction of adaptive or weighting mechanisms into feature fusion structures to enhance multi-scale interaction. For example, ASFF has been widely adopted in top conferences, proving its effectiveness in alleviating the mismatch between semantic and detail information on complex backgrounds with uneven scale distribution [14]. The second type of work emphasizes the combination of dynamic convolution and attention to improve channel and spatial selectivity without significantly increasing the amount of computation [15]. The representative trend is to embed deformable/dynamic kernels and lightweight attention into top-down/bottom-up pyramids [16]. The third class of work focuses on lightweight detectors suitable for end-to-end deployment and provides reproducible empirical baselines in parameter count and latency [17].
Suppose that a 640 × 640 mural image contains a large-scale semantic target at the center, with a diameter of about 300 pixels. Around it are 4–8 pixel dot patterns and 2–4 pixel-wide fine-grained small targets. In addition, pigment aging further reduces the contrast between these details and the background color [18]. In the multi-scale features of YOLOv8, P5 (stride 32) captures the strong semantic pattern of the main figure, while P3 (stride 8) retains the high-frequency edges [19]. When a fixed-weight PAN/FPN performs top-down upsampling and element-wise addition with P3, the strong semantic response from P5 often overwhelms the weak edge activation of P3. Moreover, bilinear or nearest-neighbor upsampling introduces phase misalignment and high-frequency aliasing. Together, these effects further weaken the small-target signal. At the same time, mural details exhibit a dense and repetitive texture distribution, and the indistinct aging noise in P3 is transmitted to the upper layers without distinction, so that dot patterns are treated as background or merged into coarse block responses [20]. Localization of the main figure and large targets such as halos is stable, but spot patterns and fine cracks are frequently missed, or their confidence decreases after fusion [21]. The common symptoms are that the AP of small targets is much lower than that of large targets, confusion between similar patterns increases, and heat-map attention leaks from details to large areas of background color [22]. This example illustrates that mural image detection amplifies the conflict between details and semantics under fixed cross-stage fusion, making feature fusion a performance bottleneck for detection.
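The fusion behavior described above can be made concrete with a minimal PyTorch sketch of a fixed-weight top-down step (nearest-neighbor upsampling of P5 followed by element-wise addition with P3), contrasted with a learnable per-source weighting in the style of ASFF/BiFPN; the tensor shapes and random inputs are illustrative placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

# Illustrative multi-scale features for a 640 x 640 input:
# P3 (stride 8) keeps high-frequency edges; P5 (stride 32) carries strong semantics.
p3 = torch.randn(1, 256, 80, 80)
p5 = torch.randn(1, 256, 20, 20)

# Fixed-weight top-down fusion, as in a plain FPN/PAN:
# upsample P5 and add it element-wise to P3.
p5_up = F.interpolate(p5, size=p3.shape[-2:], mode="nearest")
p3_fixed = p3 + p5_up          # strong P5 responses can swamp weak small-target activations in P3

# Learnable weighting (ASFF/BiFPN style) instead balances the two sources.
w = torch.softmax(torch.randn(2, requires_grad=True), dim=0)   # trained jointly with the network in practice
p3_weighted = w[0] * p3 + w[1] * p5_up
```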
Cross-domain evidence also supports the above direction: Dahri et al. used a customized CNN to achieve high accuracy in indoor scene recognition under occlusion and similar-texture environments, suggesting that stronger multi-scale discrimination is needed in tasks with similar textures and large scale changes [23]. Sun et al.’s MFE-YOLOv8 couples a multi-feature replacement of the C2f module with EMA attention and Focal-Modulation noise suppression, and implements real-time detection with DCSlideLoss and ByteTrack [24], reflecting the effective path of “fusion + attention + training loss” collaborative optimization. Ullah et al. significantly improved the perception of fine-grained texture through multi-scale feature collaboration and attention guidance in wheat disease recognition, which matches the requirement of enhancing detail features such as cracks and wear in murals [25]. Iqbal et al. provide a reference for the deployment architecture from the perspective of multi-sensor fusion and “end-cloud” collaboration in the IoT [26].
The YOLO series has presented a pioneering and influential approach to object detection, fundamentally transforming how visual information is perceived and processed [27]. YOLOv8 pushes the boundaries of object detection and classification further [28]. In recent years, a number of leading research works have focused on feature fusion and structural evolution based on YOLO.
YOLO-World integrates vision-language pre-training into YOLO and proposes a reparameterizable RepVL-PAN and a region-based text contrastive loss [29]. It achieves 35.4 AP at 52 FPS (V100) on LVIS, demonstrating that efficiency and accuracy can still be balanced under open-vocabulary detection; this makes it a valuable reference for cross-modal feature aggregation paradigms. RTMO seamlessly integrates coordinate classification into the YOLO architecture in the form of dual one-dimensional heatmaps and achieves 74.8 AP at 141 FPS (V100) on COCO val2017, indicating that the speed–accuracy trade-off can also be improved within a one-stage framework by choosing feature representations and aggregation paths better suited to the task [30]. The earlier Assisted Excitation work starts from the training strategy: a localization prior is explicitly injected into the detection head at the early stage and gradually annealed, yielding +3.8/+2.2 mAP for YOLO on MS-COCO and suggesting that improved fusion can also be amplified by guided activation during learning [31]. The Infrared Adversarial Car Stickers work generates infrared adversarial stickers through 3D modeling, achieves a 91.49% physical attack success rate on real vehicles, and transfers strongly to a variety of detectors, including YOLOv3 [32]; it reminds us that robustness must be evaluated alongside fusion structures in weak-contrast, multi-scale scenes such as murals. Compared with fixed-weight PAN/FPN, these studies adopt strategies such as learnable weighting, multi-directional fusion, task-adaptive representation, and training-time guidance mechanisms. Together, these approaches improve multi-scale feature alignment and information transfer without significantly increasing overhead. This provides a direct basis for this paper to supplement related studies and to clarify the necessity of improving fusion, taking the multi-scale characteristics of murals as an example.

1.2. Mural Detection Application Research

Murals, as one of the most fundamental forms of painting, have become an integral part of human cultural history [33]. The study of traditional artworks has increasingly benefited from the application of digital technologies. Machine learning techniques have been applied in the categorization of traditional paintings. For example, input vectors have been used to classify landscape paintings, while attention-based long short-term memory (LSTM) networks have been adopted for the classification of Chinese paintings [34]. Currently, most automated mural categorization techniques rely on computer vision. Common methods include contour-based similarity measurements, multi-instance grouping, and semantic retrieval models, which establish connections between the content of ancient murals and their underlying meanings. Traditional methods for classifying wall paintings have shown some success; however, they are limited to extracting only the fundamental attributes of the images [35]. The subjectivity and diversity inherent in wall paintings pose significant challenges in adequately capturing high-level features such as texture, color properties, and other intricate details [36].
Existing methods often fail to capture the hierarchical structure of ancient murals, which is crucial for understanding the spatial relationships between different elements [37]. This limitation leads to poor localization and classification performance, particularly for small or partially occluded elements [38]. Furthermore, the absence of a robust feature pyramid network (FPN) in many current techniques hampers the extraction of multi-scale features, resulting in the loss of critical information across different levels of the image [39]. Another major challenge is the inability of current algorithms to effectively address the unique issues posed by ancient murals, such as low-resolution images, uneven lighting, and the presence of artifacts. These factors significantly degrade the performance of object detection and classification models, leading to increased rates of false positives and false negatives. Moreover, the reliance on handcrafted features in traditional methods limits their adaptability and generalizability to diverse types of murals. Without dynamic feature extraction mechanisms, these models struggle to effectively learn and represent the complex patterns and textures inherent in ancient murals [40].
Dunhuang murals represent a key focus in the field of digital humanities, a discipline that merges computer science with the humanities and social sciences, facilitating the integration of digital technology into cultural heritage preservation and dissemination [41]. This interdisciplinary approach has reshaped traditional research paradigms and significantly advanced the study of cultural history. Wang et al. developed a semantic framework for Dunhuang murals, creating a domain-specific vocabulary to bridge the semantic gap in image retrieval [42]. Zeng et al. employed the bag-of-visual-words method to extract features from mural images and used support vector machines (SVMs) for classification, exploring the thematic distribution and dynastic evolution of the murals [43]. Ren et al. focused on mural restoration using generative adversarial networks (GANs), automating the process by learning relationships between degraded and restored mural textures [44]. Another study proposed a restoration algorithm based on sparse coding of line drawings. Fei et al. further enhanced the curvature diffusion algorithm with adaptive strategies for improved restoration [45]. Mu et al. designed the “Restore VR” system, enabling users to experience the restoration of Dunhuang murals through virtual reality (VR)-based digital tours of the caves [46]. Recent research in medical image analysis has also contributed valuable methodologies applicable to mural recognition. Similarly, Nazir et al. employed an embedded clustering sliced U-Net with a fusing strategy for intervertebral disk segmentation and classification, achieving high precision in medical image segmentation [47]. These models highlight the effectiveness of adaptive convolutional techniques in handling complex image structures.
General object detectors like YOLOv8 have achieved excellent performance on natural images. However, they still encounter multiple bottlenecks when applied to the specialized domain of ancient murals. Ancient murals typically feature slender brushstrokes, continuous curves, and high-frequency patterns. Standard convolution is insufficiently sensitive to anisotropy and nonlinear boundaries, which can lead to missed detections and boundary adhesion [48]. Different periods and techniques result in significant drift in color, brushstroke, and texture statistics, causing the fixed convolution kernels and static feature spaces of general models to overfit to mainstream styles and to misclassify rare or heterogeneous styles [49]. Ancient murals often have dense elements and large scale variations, and conventional feature pyramids are insufficient for depicting very small targets and long-range dependencies, resulting in limited recall [50]. Local and collection scenarios are limited by low-power, low-memory devices, and existing high-precision models struggle to meet the requirements of real-time processing and affordability in terms of parameters, FLOPs, and memory usage. Existing object detection research seeks a better compromise between parameters and mAP by reconstructing the neck and introducing attention. On the basis of YOLOv8, Brg-YOLO uses a BiFPN + Ghost + RSE neck and an RSE backbone to suppress aliasing features and reports mAP@0.5 = 84.6% and mAP@0.5:0.95 = 47.8% in mural detection [51]. A YOLOv10-based “YOLO Mural” variant combining Efficient RepGFPN and SimAM achieved 289 FPS and mAP@0.5 = 63.7% with only 2.65 M parameters, a parameter scale lower than that of general models such as YOLOv8/7 [52]. PLDS-YOLO, targeting the segmentation of “dropped/broken” (paint-loss) regions, uses a PA-FPN with residual connections, a CSPDarkNet + ShuffleNetV2 dual backbone, and SPD-Conv-enhanced multi-scale representation, achieving 86.2% segmentation accuracy on a self-built dataset [53]. The fixed PAN/FPN of native YOLOv8 suffers from insufficient information transfer and poor scale alignment on mural images characterized by weak contrast, small targets, and fine textures. This limitation leads to mAP disadvantages and parameter redundancy, even under the same or a smaller computational budget. It can be seen that the standard YOLOv8 incurs higher computational overhead and low cross-stage feature-fusion efficiency in mural detection. Collectively, these problems make it difficult for general detectors to balance within-class recall, inter-class separability, and reasoning efficiency. This limitation is particularly evident in the identification, element-level retrieval, and knowledge organization of mural elements.

1.3. This Paper’s Research

Based on the above pain points, the DKR-YOLOv8 proposed in this paper aims to achieve a deployable balance among accuracy, robustness, and efficiency and is specifically designed for the mural scene. YOLOv8 is the latest iteration in the YOLO series, representing an end-to-end, compact neural network built on deep learning principles [54]. DKR-YOLOv8, the refinement of YOLOv8 introduced in this study, enhances the feature extraction network of YOLOv8 and serves as a mural image target recognition algorithm designed to address the unique challenges posed by ancient murals. A specialized dataset for Dunhuang grotto murals was compiled, encompassing 30 distinct classes of common mural art features. Rigorous evaluation procedures ensured the inclusion of only high-quality images, which were subsequently annotated manually to identify key elements within the mural artworks. Building upon the original YOLOv8 architecture, we incorporated DySnake Conv, a module known for its sensitivity to elongated topologies. This modification significantly improved detection accuracy, particularly for mural features with irregular shapes and elongated structures. Our method integrates three key components: an active target detection module, a three-channel spatial attention mechanism, and the dynamic convolution technique from Kernel Warehouse. These innovations reduce model parameters, mitigate overfitting, and alleviate computational and memory demands. As a result, the model achieves more accurate and reliable detection of mural art elements. The FPN structure has been substituted with the RE-FPN, a residual feature fusion pyramid structure that is more effective and lightweight. Furthermore, a small-object detection (SOD) layer is added to improve the model’s ability to detect objects of various sizes, especially small ones.
From the perspectives of both academic value and application value, this research is necessary. DKR-YOLOv8 provides a verifiable structured modeling approach for non-rigid, curve-dominated, stylistically heterogeneous, and deteriorated artistic images, and it establishes a reusable systematic design for the detection and retrieval of cultural heritage images such as murals. In multi-modal resource platforms, robust and efficient element-level detection is the fundamental capability for content retrieval, semantic cataloging, reference matching for restoration, and inspection and monitoring; it directly determines the efficiency of resource invocation and public accessibility. Therefore, DKR-YOLOv8 not only addresses the key shortcomings of general detectors in mural scenarios but also offers a feasible technical path for sustainable digital protection under low-computing-power conditions. At the same time, it supports the development of intelligent public services.

2. Materials and Procedures

2.1. The YOLOv8 Model’s Architecture

YOLOv8 is an iteration of Ultralytics’ publicly available single-stage object detection algorithm, offered in five editions: n, s, m, l, and x, each featuring an increased number of model parameters [55]. Like YOLOv5, YOLOv8 is organized into feature extraction, feature fusion, and prediction modules. The cross-stage partial network (C3) from YOLOv5 is replaced by a more gradient-rich C2f module, which enhances information acquisition capabilities. The convolution structure in the upsampling stage of the PAN-FPN is removed, which increases computational speed. Moreover, the anchor-free detection head in YOLOv8 outperforms traditional anchor-based methods in terms of accuracy, enabling faster object detection and identification. The loss function in YOLOv8 consists of classification losses and regression losses: classification losses are computed using binary cross-entropy (BCE) loss, while regression losses are calculated using the distribution focal loss (DFL) and the CIoU loss. Both categories of losses are weighted by predefined coefficients, and the total loss is computed through a weighted combination, as shown in Figure 1.
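As a minimal sketch of this weighted combination, the following PyTorch snippet combines a BCE classification term with precomputed box and DFL regression terms; the gain values and the helper function are illustrative assumptions rather than the exact Ultralytics implementation.

```python
import torch
import torch.nn.functional as F

def yolo_total_loss(cls_logits, cls_targets, box_loss, dfl_loss,
                    w_box=7.5, w_cls=0.5, w_dfl=1.5):
    """Weighted sum of YOLOv8-style loss terms (gain values here are illustrative)."""
    # Classification: binary cross-entropy over per-class logits.
    cls_loss = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)
    # Regression: CIoU box loss and distribution focal loss are assumed precomputed.
    return w_box * box_loss + w_cls * cls_loss + w_dfl * dfl_loss

# Dummy example: 8 predictions over 20 mural classes.
cls_logits = torch.randn(8, 20)
cls_targets = torch.randint(0, 2, (8, 20)).float()
total = yolo_total_loss(cls_logits, cls_targets,
                        box_loss=torch.tensor(0.8), dfl_loss=torch.tensor(0.4))
```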
Despite these advancements, there remains a pressing need for a comprehensive algorithm capable of effectively integrating these techniques to address the unique challenges posed by ancient mural images. The Kernel Warehouse, a repository of customizable kernel functions, presents a promising pathway for the development of such an algorithm. The proposed approach aims to enhance accuracy and robustness in recognizing and classifying elements within ancient mural images by combining the strengths of Kernel Warehouse, YOLOv8, feature pyramid networks, and Snake Convolution.

2.2. Dynamic Snake Convolution

DySnake Conv, proposed in 2023, has predominantly been applied in image segmentation and recognition [56]. Its advantages over conventional convolutional kernels stem from its enhanced feature extraction capabilities and its flexibility in handling non-rigid object shapes. In the construction of image-driven knowledge graphs, researchers have embedded DySnake convolution into the ResNet backbone to improve the detection of curved patterns and fine boundaries, verifying its effectiveness on a self-built dataset through comparison and ablation experiments; this indicates that curve-adaptive receptive fields aid the structural extraction of complex patterns [57]. In the industrial vision setting of power insulator defects, with complex textures and thin shapes, ORAL-YOLOv8 combines DySnakeConv with C2f; together with a lightweight detection head and an optimized fusion neck, the parameters and computation are decreased by about 27% and 38%, and the mAP is increased by 1.7% [58]. This shows that this kind of convolution can enhance the representation of thin, narrow boundaries and complex textures while maintaining real-time performance.
Building on this foundation, we have customized the DySnake Conv layer specifically for the complex task of identifying and predicting the intricate lines and patterns characteristic of ancient mural images. The flexible nature of the DySnake Conv kernel enables it to effectively learn the complex geometric features of the targets. By utilizing a restricted set of offsets as learning parameters, the model can dynamically identify the most optimal locations for feature extraction. This ensures that, even in the presence of substantial deformations, the detection area remains stable. The topology, as depicted in Figure 2, has been fine-tuned to capture the subtle lines and artistic details inherent in mural images. This refinement leads to a more accurate and comprehensive analysis of ancient murals, allowing the deep learning network to extract local information with enhanced precision. Blue arrows indicate the data flow between stages/branches. Blue squares mark the reference element (the kernel center E) whose sampling location is tracked before and after the learned offsets.
In the 2D image plane, we use continuous coordinates $(\alpha, \beta)$ to denote horizontal and vertical positions, with integer values corresponding to pixel centers; the $t$-th sampling location on a convolution kernel or sampling path is denoted $K_t = (\alpha_t, \beta_t)$. Discrete step indices along the horizontal and vertical directions are written as $i$ and $j$, with step size $c > 0$. In DySnake Conv, the deformation offsets at each step are $(\Delta\alpha_t, \Delta\beta_t)$, which are learnable parameters accumulated in temporal order during training. When a sampling position does not lie on an integer grid point, we use bilinear interpolation: let the value of the input feature map at an integer grid point $K' = (\alpha', \beta')$ be $X(\alpha', \beta')$; then the interpolated value at the continuous position $K = (\alpha, \beta)$ is
$$X(\alpha, \beta) = \sum_{(\alpha', \beta') \in N(\alpha, \beta)} B\big((\alpha', \beta'), (\alpha, \beta)\big)\, X(\alpha', \beta'),$$
where $N(\alpha, \beta)$ is the set of integer neighbors used for interpolation (four neighbors in the standard case), optionally constrained in implementation to a $(2r+1) \times (2r+1)$ window centered at $K$. The interpolation kernel $B$ factorizes into two one-dimensional kernels, $B = b(\alpha', \alpha)\, b(\beta', \beta)$, with $b(u', u) = \max(0, 1 - |u' - u|)$. To obtain sufficient tangential context while controlling normal bandwidth, we set the 2D deformable coverage radius to $r = 4$, corresponding to a $9 \times 9$ local window. When describing a conventional rigid convolution kernel, we assume a $3 \times 3$ grid whose relative offset set is $\{(-1, -1), \ldots, (+1, +1)\}$.
The expression for a conventional $3 \times 3$ 2D convolution kernel $K$ is as follows:
$$K = \{(\alpha - 1, \beta - 1), (\alpha - 1, \beta), \ldots, (\alpha + 1, \beta + 1)\}$$
Motivated by deformable convolution, the deformation offset Δ is introduced to enhance the flexibility of the convolution kernel, enabling it to focus on the intricate geometric features of the target. However, if the model is allowed to learn the deformation offset freely, the receptive field may drift away from the target, particularly when dealing with thin and elongated structures. To address this issue and maintain continuity in the attention mechanism, we employ an iterative approach to determine the subsequent position of each target to be processed. This iterative process ensures that the perception field remains focused on the target, preventing excessive expansion due to large deformation shifts.
$$K_{i+c} = (\alpha_{i+c}, \beta_{i+c}) = \Big(\alpha_i + c,\; \beta_i + \sum_{t=i+1}^{i+c} \Delta\beta_t\Big)$$
When advancing c steps horizontally, α is updated in integer steps to α i + c , ensuring a stable, controllable “forward speed” along the x -axis; β is refined at subpixel level by accumulating Δ β t , allowing the path to bend gradually along the true curve. The summation range t = i + 1 to i + c indicates that this is not a single jump but the stepwise accumulation of c small normal offsets, which preserves continuity and makes it easy to impose magnitude constraints at each substep.
The deformation rule of DSConv along the x-axis is as follows:
$$K_{i \pm c} = \begin{cases} (\alpha_{i+c}, \beta_{i+c}) = \Big(\alpha_i + c,\; \beta_i + \sum_{t=i+1}^{i+c} \Delta\beta_t\Big) \\[4pt] (\alpha_{i-c}, \beta_{i-c}) = \Big(\alpha_i - c,\; \beta_i + \sum_{t=i-c}^{i-1} \Delta\beta_t\Big) \end{cases}$$
The deformation rule along the y-axis is as follows:
$$K_{j \pm c} = \begin{cases} (\alpha_{j+c}, \beta_{j+c}) = \Big(\alpha_j + \sum_{t=j+1}^{j+c} \Delta\alpha_t,\; \beta_j + c\Big) \\[4pt] (\alpha_{j-c}, \beta_{j-c}) = \Big(\alpha_j + \sum_{t=j-c}^{j-1} \Delta\alpha_t,\; \beta_j - c\Big) \end{cases}$$
When advancing along the y -axis, β is updated in integer steps while α is adjusted only by accumulated fine offsets; this ensures that, whether the target’s principal direction is horizontal, vertical, or locally oblique, the kernel aligns using a strategy of tangential integer stepping with normal fine-tuning. If the local curve is closer to a vertical trajectory, Δ α t carries the main geometric alignment burden; as the curve gradually changes direction, the relative magnitudes of Δ α and Δ β transition smoothly between adjacent steps, avoiding a jagged (staircase) path.
Since the offset $\Delta$ is typically fractional, bilinear interpolation is used to evaluate the kernel at non-integer positions:
$$K = \sum_{K'} B(K', K) \cdot K'$$
The bilinear interpolation kernel is represented by $B$, and $K'$ enumerates the integral spatial coordinates in the neighborhood of the fractional location $K$. The bilinear interpolation kernel factorizes into two one-dimensional kernels:
$$B(K', K) = b(K'_{\alpha}, K_{\alpha}) \cdot b(K'_{\beta}, K_{\beta})$$
DSConv is a deformation-based two-dimensional transformation process that spans a 9 × 9 area, offering enhanced perception of key features. This is especially advantageous for mural detection, which involves long-distance, small, and elongated targets. By covering a 9 × 9 range during the deformation process, DSConv expands the model’s receptive field, improving the detection of crucial features and laying the foundation for accurate recognition. The introduction of dynamic serpentine convolution allows adaptive adjustments to the shape of the convolution kernel, facilitating the precise capture of local features and structural information in images. This approach humanizes the processing of complex images, resulting in better feature extraction capabilities and enhanced robustness.
Through this method, the next position of each target to be processed is selected sequentially, ensuring continuity of focus. Following this, conventional convolution is applied. Dynamic serpentine convolution ensures continuity in the convolution kernel’s modifications by accumulating offset values. This design allows for flexible selection of the receptive field without excessive dispersion, maintaining more accurate and stable focus on elongated targets. Additionally, bilinear interpolation guarantees the smoothness and precision of the convolution process, further improving the effectiveness of convolution operations. In the DSConv module, the combination of conventional convolution and dynamic serpentine convolution retains the stability and efficiency of traditional convolution while introducing the flexibility and adaptability of dynamic serpentine convolution.
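To make the sampling mechanism above concrete, the following PyTorch sketch builds an x-axis “snake” path by accumulating normal offsets $\Delta\beta$ and reads feature values at the resulting non-integer positions with bilinear interpolation (via grid_sample); the feature map and offsets are random placeholders, whereas in DySnake Conv the offsets are learned and constrained.

```python
import torch
import torch.nn.functional as F

def snake_path_x(alpha0, beta0, delta_beta):
    """x-axis snake path: integer tangential steps in alpha, cumulative
    sub-pixel normal offsets in beta, mirroring K_{i+c} above."""
    c = torch.arange(1, delta_beta.numel() + 1, dtype=torch.float32)
    alphas = alpha0 + c                                # tangential integer stepping
    betas = beta0 + torch.cumsum(delta_beta, dim=0)    # normal offsets accumulated step by step
    return torch.stack([alphas, betas], dim=-1)        # (c, 2) continuous (x, y) coordinates

def bilinear_sample(feat, coords):
    """Bilinear interpolation of feat (1, C, H, W) at continuous (x, y) coordinates."""
    H, W = feat.shape[-2:]
    grid = torch.empty_like(coords)
    grid[:, 0] = 2 * coords[:, 0] / (W - 1) - 1        # grid_sample expects [-1, 1] normalized coords
    grid[:, 1] = 2 * coords[:, 1] / (H - 1) - 1
    return F.grid_sample(feat, grid.view(1, 1, -1, 2), mode="bilinear", align_corners=True)

feat = torch.randn(1, 16, 64, 64)                      # toy feature map
delta_beta = 0.3 * torch.randn(8)                      # placeholders for restricted, learnable offsets
coords = snake_path_x(torch.tensor(20.0), torch.tensor(32.0), delta_beta)
samples = bilinear_sample(feat, coords)                # (1, 16, 1, 8): values along the curved band
```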
The linear patterns in the murals are mostly thin and long, with continuous curvature and smooth local direction but significant global deformation. They are also overlaid with degradation factors such as cracking, powder loss, stains, and non-uniform lighting caused by their long-standing age. DySnake Conv generates “snake-like” sampling paths using restricted and iterative displacement parameters. Essentially, it transforms the convolution kernel from a rigid grid to a sampling band that adapts to the curve, with the continuous accumulation of displacements ensuring that the kernel elements have approximately the same tangential direction and limited normal oscillation at adjacent positions, thereby stretching the receptive field into a narrow strip-shaped area along the pattern’s direction. This significantly reduces the receptive field drift and energy leakage that occur in conventional deformable convolutions on thin and long targets, avoiding the absorption of damaged cracks, random particles, or brushstroke noise as target structures. The offset of each step of DySnake Conv is a function of the previous position and the current local geometry, ensuring that the sampling points still move along the true structural main direction under long-distance deformation. The amplitude limit and continuity penalty inhibit the obsession with strong textures in the background, fundamentally alleviating the risk of thin structures being “dragged away by the background”. Combined with a 9 × 9 two-dimensional deformation coverage area, the network can obtain sufficient tangential context while ensuring a narrow normal band, using it to fill the small gaps caused by powder loss and fractures, achieving a robust connection for “missing strokes and broken lines”; this is particularly crucial for long-distance, intermittent lines commonly found in murals. In the hierarchical organization of the network, DySnake Conv and conventional convolutions are assembled in a multi-layer serial/parallel manner, forming a “texture-geometry” dual-channel complementarity: conventional convolutions retain texture clues such as paint particles, brushstroke texture, and color gradients, while DySnake Conv focuses on extracting the curve skeleton and local direction.

2.3. Kernel Warehouse Dynamic Convolution

Dynamic convolution learns a mixed kernel formed as a linear combination of n static convolution kernels, weighted by sample-dependent attention, and demonstrates excellent performance compared with ordinary convolution [59]. However, previous designs suffer from poor parameter efficiency: they increase the number of convolutional parameters by a factor of n. This problem, together with optimization difficulties, has held back progress in dynamic convolution and prevented the use of larger n values (e.g., n > 100 rather than the conventional n < 10) to push the performance bound. Kernel Warehouse is a broader formulation of dynamic convolution that strikes a beneficial balance between parameter efficiency and representation power. Its main idea is to reinterpret the fundamental notions of “kernel” and “assembled kernel” in dynamic convolution, with an emphasis on greatly expanding the number of kernels while decreasing their dimensions. Kernel Warehouse exploits the dependencies of convolutional parameters within the same layer as well as across successive layers through explicit kernel partition and warehouse sharing.
In particular, at any convolutional layer of a ConvNet, Kernel Warehouse splits the static kernel into m disjoint sub-kernels with the same dimensions. It then computes a linear mixture for each sub-kernel from a predefined “warehouse” containing n kernels (e.g., n = 108) that is also shared by several adjacent convolutional layers. Finally, it assembles the m mixtures sequentially, offering a higher degree of freedom while adhering to the required parameter budget. The authors also developed a new attention function to help the learning process allocate attention across the kernel cells. ConvBERT replaces part of the global attention heads with span-based dynamic convolution, which reduces the training cost and parameter scale while outperforming models of the same size on GLUE, proving that adapting kernels to the input can effectively capture local dependencies [60]. MAGIC reconstructs dynamic convolution through multi-dimensional aggregation and kernel recomputation, systematically outperforming conventional convolution and existing dynamic convolution on multiple benchmarks and demonstrating stronger adaptability to deformation and multi-scale structures [61].
Dynamic convolution is a method that uses input dependent attentions to learn a linear combination of n static kernels. This approach has been shown to outperform standard convolution in terms of performance. However, it results in an increase in the convolutional parameters by a factor of n, making it inefficient in terms of parameters. This lack of research advancement hinders researchers from exploring settings where n > 100, which is far bigger than the common setting of n < 10. Such exploration is crucial for advancing the performance boundary of dynamic convolution while maintaining parameter efficiency. In this paper, we introduce Kernel Warehouse as a solution to address this gap. Kernel Warehouse is a more comprehensive version of dynamic convolution that redefines the fundamental concepts of “kernels”, “assembling kernels”, and “attention function”. It does so by leveraging the convolutional parameter dependencies within the same layer and across adjacent layers of a ConvNet. We validate the efficacy of Kernel Warehouse on ImageNet and MS-COCO datasets by employing diverse ConvNet topologies. Kernel Warehouse can be applied to Vision Transformers, leading to a decrease in the size of the model’s core while also enhancing the model’s precision. For example, Kernel Warehouse (n = 4) delivers 5.61%|3.90%|4.38% absolute top-1 accuracy gain on the ResNet18|MobileNetV2|DeiT-Tiny backbone, and Kernel Warehouse (n = 1/4) achieves 2.29% gain on the ResNet18 backbone while reducing the model size by 65.10%.
Kernel Warehouse provides a user-friendly interface that allows users to easily search, filter, and compare different kernels according to their specific requirements and constraints. This covers factors such as kernel size, stride, padding, and dilation, enabling users to rapidly identify the kernel best suited to their application. The schematic diagram describing the Kernel Warehouse is illustrated in Figure 3.
Another key feature of Kernel Warehouse is its efficiency in managing large-scale image datasets [62]. It incorporates an advanced data management system that ensures effective storage and retrieval of kernel functions, significantly reducing the computational cost associated with feature extraction [63]. This capability allows our system to efficiently process vast volumes of ancient mural images without compromising the accuracy of recognition and classification. Kernel Warehouse seamlessly integrates with other components of our method, such as the Feature Pyramid Network and Snake Convolution. This synergistic interaction between the components enhances the overall performance of our system, enabling improved accuracy and efficiency in recognizing and classifying elements of ancient mural images.
The design logic of the Adaptive Kernel Warehouse (KW) corresponds closely to the practical difficulties of mural target detection, which explains why it is well suited to this task. Mural images often present intricate and continuous linear patterns, extremely large-scale decorative units, severe degradation noise (powder loss, cracks, stains, reflections, and non-uniform lighting), as well as domain-specific differences in color, material, and texture. This requires feature extraction to balance spatial anisotropy, cross-scale robustness, and parameter controllability. KW essentially provides an “assembly base of low-dimensional parameters and high-dimensional shapes”: without nonlinear expansion of the parameter count, the model can adaptively select and assemble the most suitable kernel template for each spatial position and scale level from a large, diverse family of base kernels (including sub-kernels with different orientations, frequency bands, dilation rates, and aspect ratios). For fine, elongated mural lines and scroll patterns, the direction selectivity and controllable dilation rate of KW make the receptive field more sensitive in tangential integration and normal suppression, reducing the drift and energy leakage that common dynamic convolutions are prone to. For weakly contrasting, broken, and faded areas, the kernel warehouse shared across layers enables adjacent layers to reuse the same base kernel combination in similar structures, forming stable “tracking” along the curve, thereby maintaining consistent responses against noisy backgrounds and achieving cross-scale bridging of broken brushstrokes. Unlike traditional “n-fold parameter” dynamic convolution, KW significantly improves parameter efficiency through kernel decomposition and a shared warehouse, allowing n to be expanded substantially within the budget, improving the coverage and prior diversity of the model from a statistical learning perspective, and reducing the risk of overfitting due to limited data size. If combined with structural tensors or direction-consistency regularization, the custom attention function can suppress incorrect focusing on high-frequency noise and mottled textures, enhancing the preference for true patterns with smooth curvature and continuous direction.
In all experiments, we adopt a predefined Kernel Warehouse (KW) as the dynamic convolution backend. The default size is N = 64 , comprising a mix of 3 × 3 / 5 × 5 kernels and separable 1 × k / k × 1 kernels ( k = 3 , 5 ), with dilation rates 1 , 2 , 3 ; depthwise separable design and low-rank decomposition are used to control parameters and FLOPs. A single warehouse is shared across three adjacent convolutional layers, and each layer partitions channels into m = 4 groups for group-wise selection. The KW attention branch uses a lightweight context encoder followed by softmax/Top-k normalization to linearly select and combine several sub-kernels from a shared kernel warehouse for each channel group. The forward pass then proceeds via either “mix-then-convolve” or “convolve-then-weight.” In this way, it preserves the content adaptivity of dynamic convolution while, through sharing and factorization, significantly improving parameter efficiency and deployability.
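A minimal sketch of the “mix-then-convolve” path described above is given below, assuming a shared warehouse of base kernels, channel groups of m = 4, and softmax attention produced by a lightweight context encoder; the class, its parameters, and the toy shapes are illustrative and are not the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelWarehouseConv(nn.Module):
    """Toy 'mix-then-convolve' dynamic convolution drawing kernels from a shared warehouse."""
    def __init__(self, warehouse, in_ch, groups=4):
        super().__init__()
        # warehouse: (N, out_per_group, in_per_group, k, k), shared by several adjacent layers.
        self.warehouse = warehouse
        self.groups = groups
        n = warehouse.shape[0]
        # Lightweight context encoder producing softmax attention over the N base kernels per group.
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(in_ch, groups * n))

    def forward(self, x):
        b, c, h, w = x.shape
        n, oc_g, ic_g, k, _ = self.warehouse.shape
        scores = self.attn(x).view(b, self.groups, n).softmax(dim=-1)   # per-sample, per-group weights
        outs = []
        for g in range(self.groups):
            x_g = x[:, g * ic_g:(g + 1) * ic_g]                         # (b, ic_g, h, w)
            # Mix-then-convolve: linearly combine base kernels, then apply one convolution per sample.
            kernels = torch.einsum("bn,noihw->boihw", scores[:, g], self.warehouse)
            out = F.conv2d(x_g.reshape(1, b * ic_g, h, w),
                           kernels.reshape(b * oc_g, ic_g, k, k),
                           padding=k // 2, groups=b)
            outs.append(out.view(b, oc_g, h, w))
        return torch.cat(outs, dim=1)

# Example: a shared warehouse of N = 64 base 3x3 kernels, channels split into m = 4 groups.
warehouse = nn.Parameter(torch.randn(64, 16, 16, 3, 3))
layer = KernelWarehouseConv(warehouse, in_ch=64, groups=4)
y = layer(torch.randn(2, 64, 40, 40))   # -> (2, 64, 40, 40)
```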

2.4. Lightweight Residual Feature Pyramid Network

The YOLOv8-based recognition and classification method for ancient mural image elements relies significantly on the Feature Pyramid Network (FPN) architecture. FPN addresses the limitations of traditional object detection models, which often struggle to accurately detect objects of varying scales within an image [64]. The FPN architecture utilizes a top-down approach with lateral connections, facilitating the extraction of multi-scale features from the input image. This structure is crucial in enhancing the YOLOv8 model’s effectiveness in analyzing ancient mural paintings. FPN enables the model to capture both high-level semantic information and fine-grained details in mural images by constructing a pyramid of feature maps at varying resolutions. This multi-scale feature representation is essential for the accurate detection and classification of complex elements within the intricate patterns of ancient murals.
The FPN architecture consists of multiple stages, each responsible for extracting features at different scales. The initial stages focus on capturing low-level features, such as edges and textures, while later stages capture higher-level features, including object shapes and patterns. The lateral connections between these stages facilitate the fusion of information across different scales, resulting in a comprehensive feature representation well-suited for the recognition and classification of mural image elements. Additionally, the FPN architecture is highly customizable and can be seamlessly integrated with other components of the YOLOv8 model, such as the Snake Convolution module. This integration ensures efficient and effective feature extraction from mural images, significantly enhancing the overall performance of the recognition and classification algorithm. By leveraging the advantages of the FPN architecture, the YOLOv8-based method can reliably identify and classify a wide range of elements within ancient mural images, thereby contributing to a deeper understanding of these invaluable cultural artifacts.
$$L_u^t = \sum_{u=1}^{U} \big\| p(C_u) - p(C_u^t) \big\|_2^2$$
$$L_b^t = \sum_{b=1}^{B} \big\| p(C_b) - p(C_b^t) \big\|_2^2$$
Shallow features in deep neural networks contain rich spatial information and fine-grained details but are often accompanied by noise. As the network deepens, the semantic content in the features increases, while the spatial and minor detail information gradually diminishes, and the noise decreases. This study focuses on enhancing the efficiency of the neck architecture in the YOLOv8 model by leveraging the Feature Pyramid Network (FPN) to facilitate the integration of features. The goal is to enable the flow and interaction between features at different depths. Furthermore, we introduce a novel approach to address the core issue of spatial information loss caused by channel transformations in higher-level features within the FPN framework. This is achieved by incorporating the Residual Feature Augmentation (RFA) unit into the model design. The RFA module utilizes residual branches to inject contextual information from various spatial positions, thereby improving the feature representation of higher-level features. Additionally, an ECA (Efficient Channel Attention) module, a lightweight attention mechanism, is integrated into each branch of the FPN to further minimize the loss of spatial information. A lightweight residual feature fusion pyramid structure, called RE-FPN, is developed by applying a 3 × 3 Depthwise Convolution (DW Conv) operation to each feature map, as illustrated in Figure 4.
The primary goal of this approach is to enhance feature interaction, reduce spatial information loss, and preserve the lightweight nature of the YOLOv8 algorithm’s FPN. By encouraging the movement and communication of features at different network levels, the RE-FPN structure strengthens high-level feature representation through the inclusion of contextual information via residual branches, tackling the issue of spatial information loss in higher-level feature channels. Furthermore, the integration of a simple attention mechanism in each FPN branch reduces spatial information loss. The application of 3 × 3 depthwise convolution on each feature map creates the RE-FPN structure, combining the advantages of feature interaction and spatial information retention while maintaining the network’s lightweight characteristics. The RE-FPN thus enhances feature interaction and representation with minimal spatial information loss, improving the overall performance of the YOLOv8 algorithm in object detection tasks.
To address the issue of ineffective feature fusion across stages in the algorithm, this paper introduces the FPN structure at the neck position to enhance feature circulation and information interaction at various stages. The outputs from layers 2, 4, 6, and 7 of the MobileNet V2 network are selected as the input features for the FPN structure, denoted as C2, C3, C4, C5, with input feature map sizes set to 38 × 38 × 32, 19 × 19 × 96, 10 × 10 × 320, and 10 × 10 × 1280, respectively. To further enhance the model’s lightweight characteristics, the number of channels in each stage of the FPN structure is reduced from 256 to 160 using 1 × 1 convolution, minimizing the model parameters and computational load. Additionally, the 3 × 3 standard convolution in the FPN structure is replaced by a 3 × 3 depthwise separable convolution, further reducing the model’s volume. At the feature level, the FPN structure propagates strong semantic features from higher to lower levels, improving object detection performance through feature fusion. However, in the process of feature fusion, low-level features benefit from the strong semantic information of high-level features, while the features at the highest pyramid level lose information due to channel reduction. To address this issue, adaptive pooling is utilized to extract varying contextual information, reducing the loss of information in the highest-level features within the feature pyramid. By introducing the residual feature enhancement (RFA) module, the residual branch injects contextual information from different spatial positions, thus enhancing the feature expression of high-level features in a more concise manner, with reduced computational requirements.
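The following PyTorch sketch illustrates one RE-FPN lateral branch as described above: a 1 × 1 compression to 160 channels, a 3 × 3 depthwise separable convolution, an ECA attention step, and an RFA-style residual that re-injects adaptively pooled context; the exact module composition and pooling size are assumptions for illustration, with the C5 input shape taken from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECA(nn.Module):
    """Efficient Channel Attention: 1D convolution over channel-wise global descriptors."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        w = F.adaptive_avg_pool2d(x, 1).squeeze(-1).transpose(1, 2)    # (b, 1, c)
        w = torch.sigmoid(self.conv(w)).transpose(1, 2).unsqueeze(-1)  # (b, c, 1, 1)
        return x * w

class REFPNBranch(nn.Module):
    """One RE-FPN lateral branch: 1x1 compression, 3x3 depthwise separable conv, ECA,
    plus an RFA-style residual that re-injects pooled context."""
    def __init__(self, in_ch, out_ch=160):
        super().__init__()
        self.compress = nn.Conv2d(in_ch, out_ch, 1)
        self.dw = nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch)  # depthwise
        self.pw = nn.Conv2d(out_ch, out_ch, 1)                            # pointwise
        self.eca = ECA()

    def forward(self, x):
        x = self.compress(x)
        # RFA-style residual: adaptive pooling gathers context, then re-injects it at full resolution.
        ctx = F.adaptive_avg_pool2d(x, (x.shape[-2] // 2, x.shape[-1] // 2))
        ctx = F.interpolate(ctx, size=x.shape[-2:], mode="nearest")
        x = self.pw(self.dw(x)) + ctx
        return self.eca(x)

# C5 from MobileNet V2 as described in the text: 10 x 10 x 1280.
branch = REFPNBranch(in_ch=1280)
p5 = branch(torch.randn(1, 1280, 10, 10))   # -> (1, 160, 10, 10)
```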
Ancient murals, on the one hand, present patterns and brushstrokes that are extremely small in scale, slender and continuous, and low in contrast; on the other hand, they come with high-resolution imaging, non-uniform lighting, and the mottled noise and cracks caused by material aging. This requires the detection network not only to retain tiny spatial details during cross-scale semantic aggregation, but also to support higher input resolution and denser sliding-window inference under a controllable computational budget. RE-FPN builds on the top-down and lateral fusion of FPN and replaces standard convolution with 1 × 1 channel compression and 3 × 3 depthwise separable convolution, achieving a significant reduction in parameters and FLOPs and making it possible to increase the input resolution without increasing memory usage and latency. This is crucial for distinguishing sub-pixel-level lines and tiny patterns in the murals. Regarding the loss of information in high-level features after channel transformation in FPN, RE-FPN introduces the Residual Feature Augmentation (RFA) unit to inject multi-scale context through a residual branch: adaptive pooling aggregates structural background and compositional semantics from different receptive fields and re-injects them into the compressed high-level features, restoring spatial cues while maintaining strong semantics. This improves the continuity and traceability of low-contrast, faded, and fractured linear targets and reduces the detail overwriting and gap omission caused by the over-smoothing of high-level semantics. The lightweight design of RE-FPN frees computing power and memory budget, allowing it to work in synergy with the preceding geometric alignment and kernel assembly mechanisms, DySnake Conv and Kernel Warehouse: the former performs along-line refined sampling within an anisotropic receptive field, and the latter provides content-adaptive direction/dilation/aspect-ratio base kernels on multi-scale branches. RE-FPN ensures that these geometric and template priors are propagated and fused in a low-loss manner across layers and scales, ultimately balancing detection recall, localization accuracy, and inference efficiency under the real conditions of high resolution, long-range context, and diverse material noise in the murals.

3. Experiment and Results

3.1. Hardware and Software Configuration

The experimental platform for this study is shown in Table 1. All models use unified hyperparameters in the training phase: the optimizer is stochastic gradient descent (SGD), the initial learning rate is 0.01, the weight decay is 0.0005, and training runs for 200 epochs.
Table 1. Experimental platform.
Item | Setting
Optimizer | SGD (Nesterov = True), momentum 0.937, weight decay 5 × 10−4
Initial LR | 0.01 (SGD), cosine decay to 1% of LR; warm-up 3 epochs (linear)
Epochs | 200
Batch size | 32 images/GPU (RTX 3090 Ti, 24 GB)
Input size | 640 × 640 (train & val), letterbox resize, keep aspect ratio
Normalization | Pixel range [0, 1]; no per-channel mean/std shifting
Augmentation (train) | Horizontal flip 0.5; scale [0.5, 1.5]; translate 0.1; shear 0.0; degrees 0; HSV (h = 0.015, s = 0.7, v = 0.4); no mosaic/mixup for the baseline row (Model1)
Model precision | FP16 mixed precision (AMP)
Losses | Box: CIoU + DFL (Ultralytics defaults); Cls/Obj: BCE (label smoothing 0.0)
Anchor setting | Anchor-free (YOLOv8 head)
NMS | Class-agnostic NMS, IoU 0.60; confidence threshold 0.25
Evaluation metric | Primary: mAP@0.5 (IoU = 0.5); values reported in Table 2, Table 3 and Table 4
Dataset split | Stratified 50/50 train/val (per-class proportions preserved)
Classes | 20 (per Table 2)
Hardware/Env | Windows 11 (64-bit); Python 3.8; PyTorch 1.10; i9-14900K; RTX 3090 Ti
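For reproducibility, the settings in Table 1 map onto a standard Ultralytics YOLOv8 training call roughly as sketched below; the dataset YAML and the base weights file are placeholders, and the argument names assume the public Ultralytics training interface rather than the authors’ exact scripts.

```python
from ultralytics import YOLO

# Placeholders: "mural.yaml" and "yolov8n.pt" stand in for the dataset config and base weights.
model = YOLO("yolov8n.pt")
model.train(
    data="mural.yaml",
    epochs=200, batch=32, imgsz=640,
    optimizer="SGD", lr0=0.01, momentum=0.937, weight_decay=5e-4,
    cos_lr=True, warmup_epochs=3,
    fliplr=0.5, scale=0.5, translate=0.1, degrees=0.0, shear=0.0,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,
    mosaic=0.0, mixup=0.0,          # disabled for the baseline row (Model1)
    amp=True,                       # FP16 mixed precision
)
```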
Table 2. Mural recognition types.
Category | Name | Introduction | Sets
Human landscape | Fans | These fans not only have practical uses but also carry rich cultural meanings, embodying the artistic achievements of ancient craftsmen. | 85
 | Honeysuckle | Along the edges of Dunhuang grotto elements such as caissons, flat ceiling tiles, wall layers, arches, niches, and canopies, honeysuckle patterns are used as border decorations. | 40
 | Flame | Flames in Dunhuang murals often appear as decorative patterns such as back light and halo, symbolizing light, holiness, and power. Around religious figures like Buddhas and Bodhisattvas, the use of flame patterns enhances their holiness and grandeur. | 35
 | Bird | Birds are common natural elements in Dunhuang murals, adding vivid life and natural beauty to the murals. | 28
 | Pipa | As an important ancient plucked string instrument, the pipa frequently appears in Dunhuang murals, especially in musical and dance scenes. These pipa images not only showcase the form of ancient musical instruments but also reflect the music culture and lifestyle of the time. | 62
 | Konghou | The konghou is also an ancient plucked string instrument and is a significant part of musical and dance scenes in Dunhuang murals. | 34
 | Tree | Trees in Dunhuang murals often serve as backgrounds or decorative elements, such as pine and cypress trees. They not only add natural beauty to the mural but also symbolize longevity, resilience, and other virtuous qualities. | 38
Productive labor | Pavilion | Pavilions are common architectural images in Dunhuang murals. These architectural images not only display the artistic style and technical level of ancient architecture but also reflect the cultural life and esthetic pursuits of the time. | 76
 | Horses | Horses in Dunhuang murals often appear as transportation or symbolic objects, such as warhorses and horse-drawn carriages. These horse images are vigorous and powerful, reflecting the military strength and lifestyle of ancient society. | 72
 | Vehicle | Vehicles, including horse-drawn carriages and ox-drawn carriages, are also common transportation images in Dunhuang murals. These vehicles not only showcase the transportation conditions and technical level of ancient society but also reflect people’s lifestyles and cultural habits. | 49
 | Boat | While boats are not as common as land transportation in Dunhuang murals, they do appear in scenes reflecting water-based life. These boats reflect the water transportation conditions and water culture of ancient society. | 22
 | Cattle | Cattle in Dunhuang murals often appear as farming or transportation images, such as working cows and ox-drawn carriages. These cattle images are simple and honest, closely connected to the farming life of ancient society. | 32
Religious activities | Deer | Deer in Dunhuang murals often symbolize goodness and beauty. In some story paintings or decorative patterns, deer images add a sense of vivacity and harmony to the mural. | 52
 | Clouds | Clouds in Dunhuang murals often serve as background elements. They may be light and graceful or thick and steady, creating different atmospheres and emotional tones in the mural. The use of clouds also symbolizes good wishes such as good fortune and fulfillment. | 72
 | Algae wells | Algae wells (caisson ceilings) are important architectural decorations. Located at the center of the ceiling, they are adorned with exquisite patterns and colors. They not only serve a decorative purpose but also symbolize the suppression of evil spirits and the protection of the building. | 126
 | Baldachin | Canopies or halos in Dunhuang murals may appear as head lights or back lights, covering religious figures such as Buddhas and Bodhisattvas, symbolizing holiness and nobility. | 43
 | Lotus | The lotus is a common floral pattern in Dunhuang murals, symbolizing purity, elegance, and good fortune. It often appears below or around religious figures such as Buddhas and Bodhisattvas. | 24
 | Niche lintel | Niche lintels are the decorative parts above the niches in Dunhuang murals, often painted with exquisite patterns and colors. These niche lintel images not only serve a decorative purpose but also reflect the artistic achievements and esthetic pursuits of ancient craftsmen. | 10
 | Pagoda | Pagodas are important religious architectural images in Dunhuang murals. These pagoda images not only showcase the artistic style and technical level of ancient architecture but also reflect the spread and influence of Buddhist culture. | 66
 | Monk staff | The monastic staff is an implement commonly used by Buddhist monks and may appear as an accessory to monk figures in Dunhuang murals. As an important symbol of Buddhist culture, it adds a strong religious atmosphere to the mural. | 29
Table 3. Comparison of network models in terms of recognition accuracy and performance.
Models | P/% | R/% | mAP@0.5 | F1/% | FPS
YOLOv3-tiny | 79.2 | 79.6 | 81.4 | 78.8 | 557
YOLOv4-tiny | 81.4 | 74.8 | 82.6 | 78.1 | 229
YOLOv5n | 80.1 | 75.2 | 82.3 | 77.2 | 326
YOLOv7-tiny | 81.3 | 73.8 | 81.2 | 76.9 | 354
YOLOv8 | 78.3 | 75.4 | 80.6 | 77.2 | 526
DKR-YOLOv8 | 82.0 | 80.9 | 85.7 | 80.5 | 592
Table 4. Ablation experiments on the network models.
Models | Base Model | DSC | KW | RE-FPN | P/% | R/% | mAP@0.5 | F1/% | FPS | FLOPs (G)
Model1 | YOLOv8 |  |  |  | 78.3 | 75.4 | 80.6 | 77.2 | 526 | 28.41
Model2 | YOLOv8 | ✓ |  |  | 81.6 | 73.9 | 80.8 | 78.8 | 868 | 27.74
Model3 | YOLOv8 |  | ✓ |  | 77.3 | 74.8 | 81.3 | 78.7 | 640 | 26.93
Model4 | YOLOv8 |  |  | ✓ | 84.0 | 83.2 | 82.1 | 84.5 | 474 | 14.17
Model5 | YOLOv8 | ✓ | ✓ |  | 80.9 | 74.8 | 82.0 | 75.9 | 669 | 27.84
Model6 | YOLOv8 | ✓ |  | ✓ | 80.6 | 79.8 | 86.2 | 76.5 | 539 | 13.43
Model7 | YOLOv8 |  | ✓ | ✓ | 80.6 | 81.4 | 78.9 | 84.3 | 524 | 15.08
Model8 | YOLOv8 | ✓ | ✓ | ✓ | 82.0 | 80.9 | 85.7 | 80.5 | 592 | 16.01
The experimental setup for the Kernel Warehouse, Feature Pyramid Network (FPN), and Snake Convolution YOLO (DKR-YOLO) detection and classification method for ancient mural image elements based on YOLOv8 was carefully designed to ensure accurate and reliable results. Image preprocessing was conducted to remove noise, artifacts, and distortions that could hinder algorithm performance.
The Kernel Warehouse module was constructed using a combination of predefined and custom-designed kernel functions, optimized to capture the unique characteristics of ancient mural elements. The Feature Pyramid Network utilized a series of dilated convolutional layers, enabling the extraction of multiscale features. The Snake Convolution module was integrated into the YOLOv8 architecture to enhance the detection and classification of irregularly shaped objects, such as the intricate patterns and themes found in historical murals.
Data augmentation was applied based on the distribution of images in the dataset. This technique helps the network learn a broader range of features, improving its performance on unseen data and enhancing generalization while reducing the risk of overfitting. Data augmentation also acts as a regularization method, promoting more stable convergence during training, particularly for complex model structures or uneven data distributions. To enhance the diversity and robustness of the dataset, ten different random combinations of operations were generated to augment the image set, with bounding boxes adjusted accordingly. Each image was augmented once by randomly selecting one of these ten combinations. Figure 5 shows examples of the augmented images from each process.
Various image processing operations are applied with specific parameters to enhance image quality. Histogram Equalization (HE) improves image contrast by equalizing the brightness histogram. Adaptive Histogram Equalization (AHE) further enhances local contrast, while Contrast-Limited Adaptive Histogram Equalization (CLAHE) improves local contrast while capping amplification to prevent noise build-up. Erosion and dilation operations use a disk-shaped structuring element with a radius of 5 to eliminate small noise or fill minor gaps. Similarly, opening and closing operations employ disk-shaped structuring elements to remove small objects and close small holes. Mean filtering applies a 5 × 5 filter to smooth the image and reduce noise. Median filtering, using a 5 × 5 median filter, effectively removes impulse noise. Gaussian smoothing, using a 5 × 5 Gaussian filter, reduces high-frequency noise. Gradient filtering is employed for edge detection and detail enhancement. Shift filtering translates the image by one pixel to the left to simulate image translation. Sharpening enhances image edges and fine details. The Gaussian pyramid method performs down-sampling, while the Laplacian pyramid captures band-pass detail by subtracting the up-sampled version of the down-sampled image from the original. The Fourier Transform converts the image to the frequency domain, using a logarithmic function and a normalized spectrum. Fourier low-pass and high-pass filtering, implemented with Gaussian filters, retain the low-frequency and high-frequency components of the image, respectively.
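As an illustration, the snippet below shows how a few of these operations (CLAHE, median and Gaussian filtering, sharpening, and erosion with a disk-shaped structuring element) could be wired into a pick-one-operation-per-image augmentation step. It is a minimal sketch assuming OpenCV and NumPy, not the exact implementation used in our pipeline:

```python
import cv2
import numpy as np

def augment_once(img_bgr: np.ndarray) -> np.ndarray:
    """Apply one randomly chosen enhancement, mirroring operations described above."""
    op = np.random.choice(["clahe", "median", "gaussian", "sharpen", "erode"])
    if op == "clahe":
        # Contrast-Limited Adaptive Histogram Equalization on the luminance channel
        lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        lab[:, :, 0] = clahe.apply(lab[:, :, 0])
        return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
    if op == "median":
        return cv2.medianBlur(img_bgr, 5)            # 5 x 5 median filter for impulse noise
    if op == "gaussian":
        return cv2.GaussianBlur(img_bgr, (5, 5), 0)  # 5 x 5 Gaussian smoothing
    if op == "sharpen":
        kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
        return cv2.filter2D(img_bgr, -1, kernel)     # edge and detail sharpening
    # Erosion with a disk-shaped (elliptical) structuring element of radius 5
    se = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (11, 11))
    return cv2.erode(img_bgr, se)
```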

3.2. Recognition Results

The Dunhuang murals, as representative examples of ancient Chinese grotto art, originated during the Sixteen Kingdoms period and have a history spanning over 1500 years. These murals exhibit diverse styles across different historical periods, making them one of the world’s most significant artistic treasures. The patterns and designs in Dunhuang murals are highly varied, commonly seen in architectural elements such as herringbone patterns, flat beams, and caissons, as well as in decorative features of Buddhist artifacts like niche lintels, canopies, lotus thrones, and halos. Additionally, strip borders are frequently used to divide and embellish architectural and mural spaces. The designs encompass a broad array of motifs, including flowers, plants, clouds, birds, beasts, flames, geometric shapes, and gold lamps. These elements are rich in artistic atmosphere and imbued with profound symbolic meanings. They not only reflect refined esthetic tastes but also create unique spatial perceptions, contributing significant artistic, esthetic, and scholarly value.
This study utilizes imagery from The Complete Collection of Chinese Dunhuang Murals [10]. These murals feature a variety of images and styles. The dataset is available under DOI 10.57967/hf/4516 (https://huggingface.co/datasets/jinmuxige/dunhuang, accessed on 16 February 2025). The dataset is divided evenly, with 50% allocated to the validation set and 50% to the training set. A selection of these motifs was processed for the experiment, and the results are presented in Table 2. To ensure the dataset's representativeness and reproducibility, we draw imagery from the Mogao Caves at Dunhuang (Gansu, China), a corpus spanning the Sixteen Kingdoms through the Yuan period, and use the public "The Complete Collection of Chinese Dunhuang Murals". The corpus covers major stylistic phases and iconographic genres (architectural ornaments, Buddhist ritual implements, flora–fauna motifs, and scenes of productive labor), with 20 object categories and 995 annotated instances in total (long-tailed distribution: 10–126 instances per class, as summarized in Table 2; e.g., Niche Lintel: 10, Boat: 22, Algae Wells: 126). We adopt a stratified 50/50 train–validation split that preserves per-class proportions and maintains the balance across the three thematic groups reported in Table 2. All labels underwent double review with consensus resolution under guidance from a mural-conservation specialist to harmonize class boundaries and ensure period-agnostic consistency. This protocol makes explicit our provenance (site and period coverage), scale (20 classes; 995 instances), and labeling standards (definitions of complex detail and degradation), thereby supporting claims about generalization across Dunhuang's stylistic diversity.
The corpus in this study has a pronounced long-tail distribution. If a common 70/30 or 80/20 split were adopted, the number of positive instances for the smallest category (Niche Lintel, with 10 instances in total) in the validation set would drop to roughly 3 or 2, respectively, which makes it difficult to form stable, interpretable class-wise PR curves and AP estimates and prevents reliable estimation of detection thresholds and uncertainty calibration. The 50/50 allocation at least ensures that each minority class retains several positive examples in the validation set, allowing the model to be tested more realistically and reducing the evaluation variance and chance conclusions caused by too few validation samples.
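A simplified sketch of such a per-class 50/50 split is shown below. It assumes each image is assigned a single dominant category label, which is a simplification of our multi-instance annotations:

```python
import random
from collections import defaultdict

def stratified_split(image_classes, seed=0):
    """image_classes: dict mapping image_id -> dominant category label
    (a hypothetical, simplified input format). Returns (train_ids, val_ids)
    with an approximately 50/50 split inside every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for img_id, label in image_classes.items():
        by_class[label].append(img_id)
    train_ids, val_ids = [], []
    for ids in by_class.values():
        rng.shuffle(ids)
        half = len(ids) // 2
        val_ids.extend(ids[:half])    # every class keeps at least floor(n/2) validation positives
        train_ids.extend(ids[half:])
    return train_ids, val_ids

# Toy example: a 10-image minority class keeps 5 validation positives.
example = {f"img_{i:03d}": ("Niche Lintel" if i < 10 else "Clouds") for i in range(82)}
train, val = stratified_split(example)
```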
The enhanced DKR-YOLO model was employed to identify 20 classes of symbols in photographs of the Mogao Grotto murals at Dunhuang. Figure 6 shows examples of the recognized mural content. As depicted in Figure 7, the upgraded DKR-YOLO model effectively recognizes murals featuring multiple targets. It can also detect murals that have been partially hidden or modified using Gaussian blur.
The recognition results in Figure 6 and Figure 7 show that the improved DKR-YOLO model can not only achieve multi-object detection in complex scenes but also maintain high robustness when images are distorted or blurred. In large-scale digitization of mural heritage, automated recognition can significantly reduce the workload of manual annotation and classification and improve the efficiency of data management and academic research. The model can still effectively detect images that are partially obscured or damaged, providing feasible technical support for monitoring the deterioration of cultural relics and restoring incomplete images. Through systematic recognition of elements such as architectural patterns, Buddhist ornaments, animals and plants, and geometric symbols in the murals, the model helps reveal the stylistic evolution and esthetic orientation of different historical periods, thereby providing quantifiable data support for archeological, art-historical, and conservation studies.

3.3. Test Results on Mural Dataset

To effectively highlight the accuracy and performance of the improved DKR-YOLOv8 model, Table 3 presents a comparative experiment of recognition accuracy and performance among the DKR-YOLOv8 model and other models.
Table 3 shows that DKR-YOLOv8 achieves the best results across all five metrics: precision, recall, mAP, F1, and inference speed (FPS), improving accuracy, recall, and speed simultaneously. Compared with the baseline YOLOv8 under the same data conditions, precision increases from 78.3% to 82.0%, recall from 75.4% to 80.9%, mAP from 80.6% to 85.7%, F1 from 77.2% to 80.5%, and the frame rate from 526 to 592 FPS. Compared with the best lightweight baseline on each metric, DKR-YOLOv8 still leads: precision is 0.6 percentage points higher than YOLOv4-tiny, recall is 1.3 percentage points higher than YOLOv3-tiny, mAP is 3.1 percentage points higher than YOLOv4-tiny, F1 is 1.7 percentage points higher than YOLOv3-tiny, and the speed is 35 frames per second faster than YOLOv3-tiny. In practical terms, the higher recall and F1 indicate fewer missed detections of slender, low-contrast, easily fragmented targets and more stable thresholds; the marked improvement in mAP reflects better ranking and localization quality across confidence thresholds; and the increase in precision suppresses false alarms caused by unstructured noise. More importantly, DKR-YOLOv8 achieves these accuracy gains while maintaining, and even improving, FPS, indicating that the information backfilling and lightweight design of RE-FPN, the anisotropic refined sampling of DySnake, and the content-adaptive kernel assembly of Kernel Warehouse are complementary: high-level semantics are transmitted robustly without excessive smoothing, and low-level details are preserved without amplifying noise. As a result, DKR-YOLOv8 sits on the Pareto frontier of the compared methods under the same computing budget, balancing high-throughput scanning with high-quality recognition and better meeting the dual requirements of real-time performance and accuracy in mural scenarios.

3.4. Ablation Experiment

To evaluate the efficiency of the upgraded DKR-YOLOv8 model, a series of comparative ablation studies was conducted. Eight ablation experiments were performed under the same dataset, training configuration, and methodology. The validation analyses included precision, recall, the F1 score, and mean Average Precision (mAP).
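For completeness, the reported metrics follow their standard definitions, with detections matched to ground truth at an IoU threshold of 0.5 and C denoting the number of classes:

```latex
P = \frac{TP}{TP+FP}, \qquad
R = \frac{TP}{TP+FN}, \qquad
F_1 = \frac{2PR}{P+R}, \qquad
\mathrm{mAP@0.5} = \frac{1}{C}\sum_{c=1}^{C} AP_c,
\quad AP_c = \int_0^1 P_c(R_c)\,\mathrm{d}R_c
```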
The first row of Table 4 presents the training results for the original YOLOv8 model without any of the proposed modules. Incorporating Dynamic Snake Convolution (DSC), Kernel Warehouse (KW) dynamic convolution, or the lightweight Residual Feature Pyramid Network (RE-FPN) into the baseline YOLOv8 individually improves detection rates, accuracy, or computational efficiency. Notably, the lightweight RE-FPN accounts for most of the computational savings, reducing FLOPs by roughly 50% (from 28.41 G to 14.17 G) on its own, whereas DSC and Kernel Warehouse leave FLOPs almost unchanged. mAP improves by 0.2% with DSC, 0.7% with Kernel Warehouse, and 1.5% with RE-FPN, although DSC alone lowers recall by 1.5%. Combining the three enhancements yields the model designated DKR-YOLOv8. It demonstrates significant improvements, with a 3.7% increase in precision, a 5.5% increase in recall, a 3.3% increase in the F1 score, and a 5.1% increase in mAP compared with the original YOLOv8. It also requires 16.01 G FLOPs (a 43.6% reduction versus the baseline) and runs at 592 FPS. In summary, the enhanced DKR-YOLOv8 model attains an 80.9% recall rate and an 85.7% mAP. The increase in FPS, together with the reductions in model parameters and FLOPs, substantially boosts detection speed, enabling deployment on both mobile and stationary observation platforms.
The 43.6% reduction in FLOPs and the inference speed of 592 FPS mean that, under the same budget and power conditions, ultra-high-resolution mural mosaics can be processed at higher input resolutions and with denser sliding-window strategies on edge devices. This allows inspection frequency to rise from project-based to routine, and supports real-time quality feedback and re-composition correction at the scaffolding site. At the same time, the lower computing-power threshold allows small museums and local cultural heritage institutions with limited resources to run the model offline on CPUs or entry-level GPUs, avoiding reliance on external networks and the cloud, meeting security and privacy requirements for cultural relic images, and reducing long-term operating costs. The combined effect of a 5.1% increase in mAP, a 3.7% increase in precision, a 5.5% increase in recall, and a 3.3% increase in F1 score is felt throughout the restoration and digital cataloging chain: fewer missed detections make the spatial distribution maps of subtle signs more complete and regional quantification more stable, helping to formulate minimum-intervention materials and processes; fewer false detections reduce the burden of manual review, improve the consistency of batch processing, and make the mapping from image to semantic label to catalog record more traceable and verifiable, allowing seamless connection with standardized documentation frameworks such as CIDOC. In addition, the spatial cues and saliency responses retained by RE-FPN and ECA can serve as visual audit evidence, helping non-technical personnel understand the basis of model decisions and enhancing cross-team collaboration and decision-making transparency.

3.5. Grad-CAM Module Analysis

To visually present the fine-grained texture enhancement contributed by each component, we used the Grad-CAM family of methods to visualize key layers of the detection network and overlaid heat maps on typical Dunhuang samples. Colors from blue through green and yellow to red indicate increasing contribution. Figure 8 compares the original YOLOv8 with DKR-YOLOv8, while Figure 9 provides a module-based visualization showing how the activation distribution changes after introducing DySnake Conv, Kernel Warehouse, and RE-FPN. All samples use the same input size and color mapping to ensure comparability. The model shows significantly higher responses on the target subject and its discriminative structure, while responses in the background are weaker and broadly consistent with the predicted boxes, suggesting that the model bases its decisions mainly on semantically relevant areas. The heat map is obtained by computing the gradients of the target-category score with respect to the last convolutional feature layer, performing a weighted summation of the feature maps, and upsampling the result to the input resolution. It is used to explain the basis of the model's discrimination and the plausibility of its localization.
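A minimal, generic sketch of this computation is given below, assuming a PyTorch model; class_score_fn and the choice of target_layer are placeholders that depend on how the detector exposes class scores, and this is not the exact instrumentation behind Figures 8 and 9:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_score_fn):
    """Minimal Grad-CAM: weight the target layer's activations by the spatially
    averaged gradients of a class score, then ReLU and upsample to input size."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output

    def bwd_hook(_, grad_in, grad_out):
        gradients["value"] = grad_out[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        model.zero_grad()
        output = model(image)                   # image: (1, 3, H, W)
        score = class_score_fn(output)          # scalar score for the class of interest
        score.backward()
        acts = activations["value"]             # (1, C, h, w)
        grads = gradients["value"]              # (1, C, h, w)
        weights = grads.mean(dim=(2, 3), keepdim=True)        # channel-wise weights
        cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:],
                            mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam[0, 0]                        # (H, W) heat map in [0, 1]
    finally:
        h1.remove()
        h2.remove()
```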
The heat maps of the original YOLOv8 tend to show large blocky highlights, favoring high-contrast large structures such as back lights and eave contours; on complex candelabra, interlocking patterns, and hair-thin lines, the responses are scattered and the boundaries blunt, easily overwhelmed by strong responses from high-level semantic downsampling. In contrast, the heat maps of DKR-YOLOv8 in Figure 9 are noticeably thinner and more precise: the highlighted areas follow slender curves such as scrolls, clothing folds, and eave angles in continuous bands, the bandwidth in the normal direction narrows, and background mottling and weathering noise are suppressed. On low-contrast fine details such as horses and figures' hands, activation no longer spreads into large areas but concentrates closely along the main axis of the structure, demonstrating stable capture of small targets and fine boundaries. This shift from surface highlights to line highlights is consistent with our quantitative results on small-target recall and the overall improvement in mAP.
The heat map of DySnake Conv forms thin, continuous highlights along the tangential direction of curves, passing smoothly through minor defects and broken strokes; energy in the normal direction remains concentrated, avoiding being pulled toward wall cracks or random noise. This indicates that the curve-aligned sampling bands effectively reduce drift and energy leakage in the receptive field. Kernel Warehouse exhibits clear directional selectivity: within the same image block, kernels with different orientations and dilation rates are sparsely activated, forming segmented responses consistent with the principal direction of the pattern, enhancing cross-scale consistency while suppressing random background textures and demonstrating adaptive template assembly. RE-FPN (with residual feature augmentation) retains the edges and point ornaments of P3 after fusion and provides surrounding structural semantics for P4/P5 through multi-scale context; high-level semantics no longer smooth out boundaries, and phase mismatch and high-frequency aliasing at upsampling are alleviated, making cross-scale alignment more robust. In summary, this set of visualizations reveals the division of labor and complementarity of the modules: DySnake Conv handles line alignment and continuous tracking, Kernel Warehouse selects appropriate kernel templates according to direction and scale, and RE-FPN preserves details and stabilizes fusion under a lightweight budget. It also corroborates our quantitative improvements in FLOPs, latency, and pixel throughput, supporting the conclusion that DKR-YOLOv8 achieves an accurate, stable, and efficient balance for deployment in mural scenarios.

3.6. Model Method Comparison Experiment

To strengthen the claim that DKR-YOLOv8 is not only superior to the YOLOv8 baseline but also competitive within cultural-heritage–oriented settings, we design a supplementary evaluation encompassing lightweight detectors and pixel-level models commonly adopted in conservation pipelines. On the detection track, we add NanoDet-Plus (1.5×), PP-PicoDet-LCNet (1.5×), EfficientDet-D0, and MobileNetV3-SSD as parameter-efficient baselines matched to the same input size (640 × 640) and training recipe (SGD, 200 epochs, cosine LR, batch 32). This establishes a like-for-like comparison at similar FLOPs/latency. On the segmentation track—often required for condition mapping—we introduce U-Net (ResNet-34 encoder), DeepLabV3+ (MobileNetV3), and SegFormer-B0 for semantic masks, plus Mask R-CNN (R50-FPN) for instance masks. The results of the comparative experiments are shown in Table 5 below.
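Before turning to those results, the shared training recipe can be summarized as a compact configuration sketch; the field names below are illustrative, and unlisted hyperparameters (momentum, weight decay, augmentation settings) are not taken from reported values:

```python
# Shared like-for-like training recipe for all detection baselines (values from the text).
TRAIN_CFG = {
    "input_size": (640, 640),
    "optimizer": "SGD",
    "epochs": 200,
    "lr_schedule": "cosine",
    "batch_size": 32,
}
```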
The comparison in the table shows that DKR-YOLOv8 achieves P/R/F1/mAP@0.5 of 82.0/80.9/80.5/85.7 on the detection task, the highest values among all detection models on all four core accuracy indicators, while its computational cost is 43.6% lower than the baseline YOLOv8 (28.41 G), demonstrating a favorable high-accuracy, medium-compute trade-off. Compared with NanoDet-Plus (3.9 G, mAP 78.7), PP-PicoDet-LCNet (2.7 G, 77.9), and MobileNetV3-SSD (2.2 G, 75.1), DKR-YOLOv8 leads by 7.0–10.6 points in mAP and by clear margins in recall and F1. This indicates that the geometric alignment and multi-scale enhancement introduced by DySnake, RE-FPN, and Kernel Warehouse for capturing slender, low-contrast small targets yield substantial recognition gains over extremely low-compute solutions. Compared with EfficientDet-D0 (8.5 G, mAP 79.3), DKR-YOLOv8 achieves +6.4 points in mAP and higher P/R/F1 at a cost of only about 7.5 G additional FLOPs, closer to the accuracy–stability balance required in practice. For the semantic segmentation baselines (U-Net, DeepLabV3+, SegFormer-B0), P, R, and mAP@0.5 are measured indirectly from bounding boxes enclosing the predicted masks, so their F1 is not directly comparable. Even so, their best mAP is 78.3 (SegFormer-B0), still below DKR-YOLOv8's 85.7, indicating that in a workflow centered on object detection this method is more usable under the same input and training protocol. The instance-segmentation baseline Mask R-CNN (44.7 G) reaches only mAP 80.1 and F1 78.6 at markedly higher computational overhead and is also surpassed by DKR-YOLOv8. This further shows that, at the Pareto frontier of accuracy and computational efficiency, the proposed method provides the highest detection accuracy and better recall/threshold stability within a medium-to-low compute budget, making it well suited to resource-constrained but high-reliability scenarios such as the digital protection and cataloging of murals.
With the same input (640 × 640, batch = 1), DKR-YOLOv8 reduces FLOPs from 28.41 G to 16.01 G (−43.6%) and increases FPS from 526 to 592 (+12.55%), bringing end-to-end latency down from 1.90 ms to 1.69 ms (−11.15%). In terms of pixel throughput, the baseline processes about 2.154 × 10⁸ px/s (526 × 640 × 640) versus 2.425 × 10⁸ px/s for DKR-YOLOv8, a net gain of 12.55%. This means that, under the same time budget, the total number of sliding windows can increase by 12.55%. In common edge settings with fixed power, per-image energy is roughly proportional to latency, so the energy per image is reduced by about 11.15%; under a fixed-FPS target, DKR-YOLOv8 offers 12.55% compute headroom that can be traded for lower frequency/voltage to improve battery life or thermals. In short, the model's lightweight design not only cuts FLOPs by 43.6% but also translates on edge devices into −11.15% latency, +12.55% pixel throughput, roughly +33% feasible input side length at an equal FLOPs budget (since 28.41/16.01 ≈ 1.33²), and −11.15% energy per image: clear, verifiable deployment gains.
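These figures follow from simple ratios, which the snippet below reproduces (pixel throughput is FPS multiplied by pixels per frame, and the resolution headroom assumes FLOPs scale linearly with pixel count):

```python
# Reproduce the deployment arithmetic quoted above (640 x 640 input, batch = 1).
base_fps, new_fps = 526, 592
base_flops, new_flops = 28.41e9, 16.01e9
pixels = 640 * 640

throughput_gain = new_fps / base_fps - 1       # +12.55% pixel throughput
latency_drop = 1 - base_fps / new_fps          # -11.15% end-to-end latency (and energy per image)
flops_drop = 1 - new_flops / base_flops        # -43.6% FLOPs
side_scale = (base_flops / new_flops) ** 0.5   # ~1.33x input side length at an equal FLOPs budget

print(f"px/s: {base_fps * pixels:.3e} -> {new_fps * pixels:.3e}")
print(f"throughput +{throughput_gain:.2%}, latency {latency_drop:.2%} lower, "
      f"FLOPs -{flops_drop:.1%}, feasible side length x{side_scale:.2f}")
```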

3.7. Comparison Experiment Before and After Improvement

After 200 epochs, we retrieved the position (localization) loss values for the DKR-YOLOv8 and YOLOv8 models in order to analyze model performance before and after the upgrade in more detail. By examining these loss values, we can gain insight into the efficiency and accuracy improvements brought by the upgraded DKR-YOLOv8 model. The data illustrate how the enhancements in DKR-YOLOv8 contribute to more precise object localization, ultimately leading to better overall detection performance.
As shown in Figure 10, the improved DKR-YOLOv8 model converges faster and at a significantly higher rate than the original YOLOv8 model. This improvement can be attributed to the integration of the Kernel Warehouse and Dynamic Snake Convolution techniques. The recall curves show a steady increase in recall as the number of training iterations grows, and by the 200th epoch a clear difference is observed between the revised DKR-YOLOv8 model and the original YOLOv8 model.
In terms of clustering performance, the DKR-YOLOv8 model outperforms YOLOv8: the clustering density is significantly higher, with 82.6% of the variables clustered within the range of −0.4 to 0.3. Figure 11 further illustrates a more robust clustering behavior, which enhances the model's ability to group variables effectively. This tighter clustering contributes to more accurate object detection, as the model is better able to distinguish between different objects within an image.
The 2D histogram of data extraction frequency for DKR-YOLOv8 exhibits a more cohesive and uniform distribution. The frequency counts fall predominantly within the range of 20–60, suggesting a balanced and consistent extraction process. This uniformity indicates that the model extracts data more evenly across different frames, reducing the likelihood of bias or overfitting to specific data segments. DKR-YOLOv8 also shows a closer relationship across angular directions, reflected in the near-circular distribution of the 2D histogram. This close-to-circular pattern signifies a more isotropic clustering of data points, meaning that the model's performance and accuracy are less dependent on the orientation of objects within the frames. Such isotropy is advantageous, as it ensures the model's robustness and consistency in detecting objects regardless of their orientation.
A confusion matrix is a two-dimensional matrix in which rows represent the classes predicted by the model and columns represent the actual classes. The entries in a row denote the instances predicted as that class, with the row total giving the number of instances predicted to belong to it; likewise, the entries in a column correspond to an actual class, with the column total giving the number of instances that truly belong to it.
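A row-normalized matrix of the kind shown in Figures 12 and 13 can be produced as follows; the toy counts are illustrative only:

```python
import numpy as np

def normalized_confusion(conf: np.ndarray) -> np.ndarray:
    """Row-normalize a confusion matrix whose rows are predicted classes
    and columns are actual classes (the convention described above)."""
    row_sums = conf.sum(axis=1, keepdims=True)
    return np.divide(conf, row_sums,
                     out=np.zeros_like(conf, dtype=float),
                     where=row_sums != 0)

# Toy 3-class example: entry [i, j] counts instances predicted as class i
# whose true class is j.
toy = np.array([[8, 1, 0],
                [2, 9, 1],
                [0, 0, 5]])
print(normalized_confusion(toy).round(2))
```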
To diagnose class-wise recognition behavior and isolate the effect of our design choices, Figure 12 and Figure 13 juxtapose normalized confusion matrices for YOLOv8 and DKR-YOLOv8. Complementing this analysis, Figure 14 visualizes the co-occurrence structure of misdetections at the image level. The different colors in the figure represent different clustering categories; the colors themselves carry no specific meaning and are used only to distinguish categories.
Compared with YOLOv8, the confusion matrix of DKR-YOLOv8 is markedly more concentrated along the diagonal: its accuracy is about 94%, whereas YOLOv8 reaches only 85%. The improved model effectively reduces the probability of missed detections. Analysis of the confidence maps shows that the attention mechanism significantly improves the model's ability to recognize different mural categories, reducing false detections and missed detections and thereby increasing overall detection accuracy. These improvements demonstrate that the attention mechanism plays a crucial role in optimizing the performance of the DKR-YOLOv8 model, giving it higher reliability and accuracy in mural detection tasks with complex backgrounds.
In image recognition, object categories are often prone to confusion, either between different categories or within the same category. Below is an analysis of the factors contributing to such misclassification for each category:
Category 1: Konghou, Cattle, Clouds, Bird, Flame, Algae Wells. Natural elements such as clouds, flames, and birds exhibit highly variable shapes, making it challenging for image recognition systems to capture clear boundaries and forms. This variability can lead to confusion with objects that share similar morphological features, such as flowing water or smoke. Decorative or artistic forms, including algae wells and the konghou, often feature intricate textures or curves that resemble certain natural forms (e.g., plants or ripples), increasing the likelihood of misclassification. Similarly, the textures of cattle skin or bird feathers can be mistaken for those of other furred or feathered animals due to insufficient differentiation in surface details.
Category 2: Deer, Horses, Pavilion, Pipa. Deer and horses share morphological similarities, particularly when viewed from specific angles or in low-quality images, often resulting in misidentification. Architectural structures like pavilions may be confused with similar structures (e.g., temples or covered bridges), especially if the algorithm lacks sensitivity to finer details. Likewise, the pipa may be misclassified as other string instruments, such as the guqin or yueqin, when their distinctive features are not adequately captured during recognition.
Category 3: Baldachin, Pagoda, Boat. Baldachins and pagodas frequently include multi-tiered structures or ornamented spires, which may cause them to be misclassified as other religious buildings (e.g., stupas or temple roofs). Their repetitive decorative elements often bear high visual similarity to other royal or religious architectural designs, complicating the recognition of their specific functions or symbolic meanings. Similarly, the diverse forms of boats can lead to confusion with other objects of comparable shapes, such as bridge components or lower sections of buildings.
Category 4: Monk Staff, Honeysuckle, Niche Lintel, Fans, Tree. Trees are often difficult to distinguish from other plants or natural objects, such as shrubs or vines, particularly in images with blurred details. The monk staff may resemble plant forms like bamboo, especially when the material is not prominently displayed. Decorative and functional overlap further contributes to confusion; for example, fans may be misclassified as other similarly shaped objects, such as folding fans or decorative screens, when shared patterns or structures are present. Likewise, niche lintels can be mistaken for other architectural elements, such as door lintels or window frames, due to their similar shapes and designs.
In image recognition, the primary reasons for object confusion include morphological similarity, surface texture complexity, the interplay of functional and decorative elements, and the visual similarity between natural and artificial objects. These factors can lead to misrecognition or classification errors if the algorithm fails to adequately capture the key features of the objects.
As shown in Figure 15, the loss functions and accuracy metrics during the model training process exhibit their respective convergence trends and oscillatory behaviors. The loss functions measure the discrepancy between the model’s predictions and the actual labels. All three loss functions (Box Loss, Objectness Loss, and Classification Loss) demonstrate a clear convergence trend, eventually stabilizing. This indicates that the model progressively learns the features during training and effectively reduces the prediction error.
The Box Loss function is used to measure the difference between the predicted bounding boxes and the ground truth bounding boxes. The Objectness Loss function assesses the model’s accuracy in predicting the presence of an object. The Classification Loss function evaluates the model’s accuracy in predicting the class of the object. The model’s accuracy exhibits noticeable oscillations during the training process. This oscillatory behavior could be due to the model continuously adjusting its parameters to achieve better performance on a complex dataset. However, as the training progresses, the accuracy gradually stabilizes, and around 160 epochs, it stabilizes at approximately 0.8. This indicates that the model has achieved high accuracy in the classification task. All three loss functions converging to a stable state indicates that the model has achieved the expected performance in various tasks. Despite the oscillations observed in the accuracy during training, it eventually stabilizes and remains at a high value. The introduction of the attention mechanism has significantly enhanced the model’s detection accuracy, reduced uncertainty and false detection rates, and effectively improved the overall performance of the model.
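To make the three terms concrete, a schematic composite loss is sketched below. It uses a smooth-L1 stand-in for the IoU-based box term and binary cross-entropy for objectness and classification, with placeholder weights; it illustrates the structure only and is not the exact YOLOv8 loss:

```python
import torch
import torch.nn.functional as F

def composite_loss(pred_boxes, true_boxes, pred_obj, true_obj, pred_cls, true_cls,
                   w_box=7.5, w_obj=1.0, w_cls=0.5):
    """Schematic sum of box, objectness, and classification terms (weights are placeholders)."""
    box_loss = F.smooth_l1_loss(pred_boxes, true_boxes)            # stand-in for an IoU-based box loss
    obj_loss = F.binary_cross_entropy_with_logits(pred_obj, true_obj)
    cls_loss = F.binary_cross_entropy_with_logits(pred_cls, true_cls)
    return w_box * box_loss + w_obj * obj_loss + w_cls * cls_loss
```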
A 43.6% reduction in FLOPs directly lowers inference latency and power draw, enabling on-site deployment on modest edge hardware typically available to small museums and local conservation studios (e.g., compact GPUs or CPU-only workstations without datacenter connectivity). This, in turn, supports higher input resolutions, denser tiling of gigapixel murals, and real-time feedback during scaffold-side capture and rapid condition surveys, all within the same energy and budget envelope. Lower computational cost also reduces archival processing time and cloud dependence, facilitating privacy-preserving, offline workflows where sensitive cultural assets cannot leave the institution. In routine monitoring, the same device can run more frequent passes per wall section, improving temporal coverage without increasing staffing or electricity costs—concretely advancing “sustainable management” by making periodic, standardized assessments feasible for resource-limited custodians.
The 3.7% accuracy increase and 5.1% mAP improvement translate to fewer missed detections of small, elongated, and low-contrast elements and fewer false positives on noise sources. In digitization and cataloging, higher mAP improves iconographic and motif-level recognition consistency across heterogeneous mural styles and aging states, reducing the manual verification burden for curators and leading to cleaner, more complete metadata at scale. Because RE-FPN and ECA preserve spatial cues while emphasizing semantically critical channels, the resulting attention and saliency overlays are more interpretable for non-technical experts: conservators can audit why a region is flagged, trace decisions to visible pictorial evidence, and integrate those overlays into condition reports and CIDOC-compliant records.
As shown in Figure 16, the core function of the software is the automatic recognition of the elements in a mural. Through high-precision image recognition, the software can accurately capture and analyze the key information in the mural, providing strong technical support for scholars in the field of mural research: they can quickly obtain key data from murals and thus study their history, culture, and artistic style in greater depth. The software has a user-friendly interface, with functional modules for starting the experience, importing pictures, analyzing data, and viewing history, which makes it easy for users to get started and quickly complete recognition and analysis tasks. The development and application of the software involve several disciplines, including computer science, image processing, and cultural and art studies, and its successful application will promote cross-disciplinary integration and cooperation among these fields. The application of the software also helps to improve the level of mural protection and research: by accurately identifying the elements and features in murals, the preservation status of the murals can be assessed more precisely, providing strong support for developing scientific conservation programs and research plans.

4. Discussion

Computer-vision techniques centered on mural content classification are shifting from expensive, fragile, high-barrier laboratory capabilities into deployable, reusable, and governable digital infrastructure that directly reshapes workflows in museums, archives, and smaller cultural-heritage organizations with limited budgets and staff. While remaining compatible with the YOLOv8 industrial ecosystem, DKR-YOLO enhances the representation of intricate brushwork and edges through DySnake Conv and an Adaptive Convolutional Kernel Warehouse; it further raises discriminability and interpretability by combining an RFA-enhanced FPN (RE-FPN) with Efficient Channel Attention (ECA), achieving a better compute–accuracy trade-off. These gains are not merely incremental numbers; they tangibly expand the option space and affordability envelope for cultural institutions. In digital archiving, more robust fine-detail capture and resilience to stylistic variation enable the automatic segmentation and labeling of semantic units in murals. Compared with workflows reliant on manual description and keyword retrieval, the model’s ability to recognize motif variants, regional iconographic differences, and latent elements under fading or occlusion facilitates cross-museum and cross-regional ontology and vocabulary alignment, markedly reducing downstream costs for catalog consolidation and deduplication. The residual augmentation in RE-FPN and the channel prioritization of ECA stabilize suggested labels and confidence intervals, trimming curator verification time—freeing scarce staff in smaller organizations to shift effort from repetitive data entry toward semantic review and knowledge organization, which improves descriptive consistency and auditability.
Content-level classification benefits conservation decision-making at two ends: on one end, early detection and temporal monitoring of micro-pathologies; on the other, risk assessment and evidentiary preservation for restoration projects. DySnake Conv’s sensitivity to tortuous boundaries and hairline cracks helps stably segment morphological features across multi-temporal imagery; the Adaptive Convolutional Kernel Warehouse maintains feature generality under changes in materials and illumination, reducing the maintenance burden of having to retrain whenever equipment or lighting conditions change. Together with model-intrinsic attention heatmaps and class-activation (CAM) visualizations, conservation teams can incorporate evidence explaining why the model assigned a given classification directly into the record, forming an auditable interpretive chain. At the project-management level, the 43.6% reduction in FLOPs enables inference on portable terminals or on-site edge devices, decreasing reliance on climate-controlled server rooms and cloud bandwidth and lowering energy use and carbon emissions.
On access and public engagement, stable content classification supplies the semantic substrate for tiered presentation, customizable guides, and educational applications. When the system reliably distinguishes subject types, narrative segments, and pictorial roles—and links them to knowledge-graph nodes for time–place, materials, and technique—institutions can more easily build multilingual, layered interfaces: high-precision motif comparison and cross-collection retrieval for researchers; thematic, story-driven tours for general visitors; and tailored narration and tactile/audio alternatives for children or visitors with visual impairments. Because DKR-YOLO reduces inference cost, small institutions can embed model services in gallery kiosks, mobile mini-apps, or even offline devices for temporary exhibits, enriching interaction without materially increasing operating expenses. Mural content classification does not operate in isolation; it is mutually reinforcing with capture modalities (multispectral, macro, structured light), 3D reconstruction, textual comparison, and information extraction. By delivering higher mAP at lower inference load, DKR-YOLO lowers the barrier from pixels to semantics, making a pragmatic path viable for small and mid-sized institutions: first secure usable structured semantics, then pursue cross-modal linkage and knowledge computation. This path improves today’s archiving quality and conservation efficiency and lays a scalable foundation for cross-institutional integration and open science.

5. Conclusions

We have proposed a detection model for ancient mural imagery based on the YOLOv8 architecture. Extensive testing and careful analysis show that the proposed technique detects mural content accurately and efficiently under a variety of environmental conditions.

5.1. Research Results

The DKR-YOLOv8 proposed in this paper is designed for the identification and classification of elements in ancient murals. It provides systematic improvements along three core dimensions: generalization, efficiency, and interpretability. Through Kernel Warehouse dynamic convolution and Dynamic Snake Convolution, it enhances the representation of fine-grained boundaries and decorative details. By optimizing the top-down feature fusion with the lightweight RE-FPN (combined with channel attention), it achieves a dual improvement in computational cost and accuracy (FLOPs reduced by 43.6%, precision increased by 3.7%, and mAP increased by 5.1%) while remaining compatible with the YOLOv8 ecosystem. The accompanying diverse mural dataset covers themes, styles, degradation forms, and complex backgrounds, significantly enhancing the model's robustness in real-world scenarios and verifying its deployability and timeliness in different environmental monitoring contexts and on mobile/edge devices.
These technological and data-related contributions directly serve the practice of cultural heritage protection. The automated, granular identification and localization of elements provide a replicable basis for high-quality digital archiving and structured metadata production, reducing cataloging costs and improving the efficiency of cross-institutional comparison and academic search. For early detection of deterioration and process monitoring, the model's sensitivity to microscopic forms such as cracks, powdering, and chalking, together with its robustness across devices and lighting, provides quantifiable evidence for on-site inspection and long-term time-series assessment, facilitating the establishment of intervention priorities by zone and grade. The lightweight inference capability and interpretable visualizations (heat maps, activation regions) enhance usability, auditability, and trustworthiness in resource-constrained institutions, reducing reliance on energy-intensive computing and external cloud transmission while aligning with the requirements of sustainable digitalization and data sovereignty. DKR-YOLOv8 thus turns a cutting-edge detection framework into a practical toolchain for heritage protection, providing an efficient, reliable, and scalable technical path for the long-term preservation of murals, scientific research, and public services.

5.2. Research Prospects

Despite the notable gains in efficiency and accuracy achieved by DKR-YOLOv8 for mural element recognition and classification, several boundary conditions remain. First, the representativeness and consistency of data and annotations are not yet sufficient to guarantee generalization across sites, materials, and historical periods. Although our dataset covers canonical samples such as Dunhuang and multiple forms of degradation, substantial differences among collecting institutions—e.g., spectral responses, lighting setups, camera and lens characteristics, resolution, and compression policies—can introduce domain shifts. At the same time, restoration marks, overpainted inscriptions, secondary coloring, and dust shadows often share local statistical features with true ornamental boundaries, leading to confusion between “crack/line drawing” and “powdering/texture.” Severe scarcity of long-tail categories exacerbates class imbalance, whose dominant effect on learning has not been fully mitigated. Second, the method primarily targets object detection: while Kernel Warehouse dynamic convolution and Dynamic Snake Convolution improve the representation of tortuous edges and fine ornamentation, box-level supervision is insufficient for conservation tasks that require pixel-level interpretation. Current explainability relies largely on channel attention and class-activation maps—post hoc visualizations that do not map directly to causal explanations or evidentiary chains in conservation terminology. Third, the evaluation scheme still centers on general detection metrics, lacking processual and longitudinal quantification of practical benefits to conservation workflows. Finally, deployment and governance need strengthening: although lightweighting reduces compute barriers, inference latency and energy consumption may still become bottlenecks for ultra-high-resolution panoramic mosaics, low-illumination/high-reflectance scenes, and offline gallery devices. Issues of data sovereignty, tiered access to sensitive images, cross-institutional ontology alignment, and version provenance likewise call for more robust institutional and tooling support.
Looking ahead, we propose aligning and jointly learning visible light with UVF/IRR, multispectral/hyperspectral imaging, RTI, and shallow 3D, leveraging coupled spectral–material–geometric constraints to enhance interpretation of mural types and overlayer structures; geometry-aware FPNs or transformer architectures can improve robustness to distortions on curved surfaces and oblique mosaics. Using weak supervision grounded in hierarchical ontologies, together with metric learning and prototypical networks, can better support long-tail, few-shot categories. Active learning and uncertainty-driven sampling should concentrate limited annotation budgets on regions of maximal model confusion and disagreement, while cross-institutional continual learning and domain adaptation can mitigate performance drift induced by differences in equipment, lighting, and craft techniques. In task formulation, we advocate moving from “detection” to conservation-oriented semantic modeling—integrating instance/panoptic segmentation, crack-skeleton extraction, iconographic relation inference, and temporal change detection into an end-to-end multi-task framework that outputs structured evidence directly actionable for conservation decisions. For explainability and uncertainty management, we recommend proactive, auditable designs—concept bottlenecks, prototype/part-level attribution, and causal interventions to produce explanatory units aligned with conservation terminology—as well as confidence calibration and risk stratification to make the boundaries of “when not to trust the model” explicit within the workflow. In parallel, we also recommend establishing end-to-end lineage records and model cards—from data capture and annotation through training and release—to support cross-institutional review and reproducibility.

Author Contributions

Conceptualization, S.K., and J.L.; methodology, Z.G., and H.W.; resources, Z.G., and S.K.; data curation, Z.G., H.W., and J.L.; writing—original draft preparation, H.W., and Z.G.; writing—review and editing, S.K.; visualization, Z.G., S.K., and H.W.; supervision, S.K.; project administration, S.K.; funding acquisition, Z.G., and S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Innovation Training Program of the National Innovation and Entrepreneurship Project Fund of China (202410451009).

Data Availability Statement

Data are contained within the article.

Acknowledgments

We would like to express our sincere gratitude to the Asia-Europe Institute of Universiti Malaya and Ruilu Weijun (Yantai) Information Technology Co., Ltd. for the valuable support and assistance provided for this research. All individuals included in this section have consented to the acknowledgement.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RFA	Residual Feature Augmentation
LSTM	Long short-term memory
FPN	Feature pyramid network
SVMs	Support vector machines
GANs	Generative adversarial networks
BCE	Binary cross-entropy
KW	Kernel Warehouse
DWConv	Depthwise Convolution
SGD	Stochastic Gradient Descent
HE	Histogram Equalization
DSC	Dynamic Snake Convolution

References

  1. Brumann, C.; Gfeller, A.É. Cultural landscapes and the UNESCO World Heritage List: Perpetuating European dominance. Int. J. Herit. Stud. 2022, 28, 147–162. [Google Scholar] [CrossRef]
  2. Mazzetto, S. Integrating emerging technologies with digital twins for heritage building conservation: An interdisciplinary approach with expert insights and bibliometric analysis. Heritage 2024, 7, 6432–6479. [Google Scholar] [CrossRef]
  3. Poger, D.; Yen, L.; Braet, F. Big data in contemporary electron microscopy: Challenges and opportunities in data transfer, compute and management. Histochem. Cell Biol. 2023, 160, 169–192. [Google Scholar] [CrossRef] [PubMed]
  4. Vassilev, H.; Laska, M.; Blankenbach, J. Uncertainty-aware point cloud segmentation for infrastructure projects using Bayesian deep learning. Autom. Constr. 2024, 164, 105419. [Google Scholar] [CrossRef]
  5. Agrawal, K.; Aggarwal, M.; Tanwar, S.; Sharma, G.; Bokoro, P.N.; Sharma, R. An extensive blockchain based applications survey: Tools, frameworks, opportunities, challenges and solutions. IEEE Access 2022, 10, 116858–116906. [Google Scholar] [CrossRef]
  6. Jaillant, L.; Mitchell, O.; Ewoh-Opu, E.; Hidalgo Urbaneja, M. How can we improve the diversity of archival collections with AI? Opportunities, risks, and solutions. AI Soc. 2025, 40, 4447–4459. [Google Scholar] [CrossRef]
  7. Yu, T.; Lin, C.; Zhang, S.; Wang, C.; Ding, X.; An, H.; Liu, X.; Qu, T.; Wan, L.; You, S. Artificial intelligence for Dunhuang cultural heritage protection: The project and the dataset. Int. J. Comput. Vis. 2022, 130, 2646–2673. [Google Scholar] [CrossRef]
  8. Gao, Y.; Zhang, Q.; Wang, X.; Huang, Y.; Meng, F.; Tao, W. Multidimensional knowledge discovery of cultural relics resources in the Tang tomb mural category. Electron. Libr. 2024, 42, 1–22. [Google Scholar] [CrossRef]
  9. Zhang, X. The Dunhuang Caves: Showcasing the Artistic Development and Social Interactions of Chinese Buddhism between the 4th and the 14th Centuries. J. Educ. Humanit. Soc. Sci. 2023, 21, 266–279. [Google Scholar] [CrossRef]
  10. Chen, S.; Vermol, V.V.; Ahmad, H. Exploring the Evolution and Challenges of Digital Media in the Cultural Value of Dunhuang Frescoes. J. Ecohumanism 2024, 3, 1369–1376. [Google Scholar] [CrossRef]
  11. Zeng, Z.; Qiu, S.; Zhang, P.; Tang, X.; Li, S.; Liu, X.; Hu, B. Virtual restoration of ancient tomb murals based on hyperspectral imaging. Herit. Sci. 2024, 12, 410. [Google Scholar] [CrossRef]
  12. Tekli, J. An overview of cluster-based image search result organization: Background, techniques, and ongoing challenges. Knowl. Inf. Syst. 2022, 64, 589–642. [Google Scholar] [CrossRef]
  13. Zeng, Z.; Sun, S.; Li, T.; Yin, J.; Shen, Y. Mobile visual search model for Dunhuang murals in the smart library. Libr. Hi Tech 2022, 40, 1796–1818. [Google Scholar] [CrossRef]
  14. Gupta, N.; Jalal, A.S. Traditional to transfer learning progression on scene text detection and recognition: A survey. Artif. Intell. Rev. 2022, 55, 3457–3502. [Google Scholar] [CrossRef]
  15. Shin, J.; Miah, A.S.M.; Konnai, S.; Hoshitaka, S.; Kim, P. Electromyography-Based Gesture Recognition with Explainable AI (XAI): Hierarchical Feature Extraction for Enhanced Spatial-Temporal Dynamics. IEEE Access 2025, 13, 88930–88951. [Google Scholar] [CrossRef]
  16. Zhang, P.; Liu, J.; Zhang, J.; Liu, Y.; Shi, J. HAF-YOLO: Dynamic Feature Aggregation Network for Object Detection in Remote-Sensing Images. Remote Sens. 2025, 17, 2708. [Google Scholar] [CrossRef]
Figure 1. YOLOv8 model structure diagram.
Figure 2. Structure of the DSConv.
Figure 3. Schematic of the Kernel Warehouse dynamic convolution approach.
Figure 4. The pyramid structure for the lightweight residual feature fusion.
Figure 5. Processing results of the 10 enhancement methods on the sample photo.
Figure 6. Target objects for content recognition in a selection of the murals.
Figure 7. Recognition results of the proposed model on a selection of the murals.
Figure 8. Heat maps of the original YOLOv8 and the proposed DKR-YOLOv8.
Figure 9. Visualization of heat maps for different algorithm modules.
Figure 10. Curves of the position loss rate and recall rate.
Figure 11. Comparative analysis of DKR-YOLOv8 and YOLOv8. (a) YOLOv8 overlapping line plot of width and height distribution. (b) DKR-YOLOv8 overlapping line plot of width and height distribution. (c) YOLOv8 2D histogram of width and height. (d) DKR-YOLOv8 2D histogram of width and height.
Figure 12. Confusion matrix ratio information for YOLOv8.
Figure 13. Confusion matrix ratio information for DKR-YOLOv8.
Figure 14. Map of misdetected co-occurrences in image recognition.
Figure 15. Results on the dataset, including the loss function, precision, recall, and mAP evaluation metrics.
Figure 16. Mural recognition software page.
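For readers who wish to reproduce the ratio-style presentation of Figures 12 and 13, the short sketch below shows one conventional way to obtain a row-normalized confusion matrix with scikit-learn. The label arrays are hypothetical placeholders, not data from this study, and the snippet is independent of the authors' evaluation pipeline.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted class labels for detected mural
# elements (e.g., 0 = figure, 1 = animal, 2 = architecture); illustrative
# values only, not taken from the paper's dataset.
y_true = np.array([0, 0, 1, 2, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 1, 2, 2, 0, 1, 0, 2, 1])

# Row-normalized matrix: each row sums to 1, so cell (i, j) is the fraction
# of class-i ground truths predicted as class j, matching the "ratio
# information" style of Figures 12 and 13.
cm_ratio = confusion_matrix(y_true, y_pred, normalize="true")
print(np.round(cm_ratio, 2))
```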
Table 5. Comparative experimental results of different popular models.

| Model | Track | FLOPs (G) | P (%) | R (%) | F1 (%) | mAP@0.5 (%) |
|---|---|---|---|---|---|---|
| DKR-YOLOv8 | Detection | 16.01 | 82.0 | 80.9 | 80.5 | 85.7 |
| YOLOv8 (baseline) | Detection | 28.41 | 78.3 | 75.4 | 77.2 | 80.6 |
| NanoDet-Plus (1.5×) | Detection | 3.9 | 79.0 | 73.8 | 76.3 | 78.7 |
| PP-PicoDet-LCNet (1.5×) | Detection | 2.7 | 78.5 | 72.6 | 75.4 | 77.9 |
| EfficientDet-D0 | Detection | 8.5 | 80.0 | 74.5 | 77.1 | 79.3 |
| MobileNetV3-SSD | Detection | 2.2 | 76.4 | 69.7 | 72.9 | 75.1 |
| U-Net (R34 encoder) | Segmentation | 16.8 | 73.8 | 80.2 | N/A | 76.2 |
| DeepLabV3+ (MV3) | Segmentation | 5.4 | 74.9 | 79.0 | N/A | 76.5 |
| SegFormer-B0 | Segmentation | 8.9 | 77.3 | 80.5 | N/A | 78.3 |
| Mask R-CNN (R50-FPN) | Instance Seg. | 44.7 | 81.0 | 76.4 | 78.6 | 80.1 |
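The detection-track metrics in Table 5 (precision, recall, and mAP@0.5) are the standard outputs of a YOLO-style validation run. As an illustration only, the sketch below uses the public Ultralytics API with a hypothetical weights file (dkr_yolov8.pt) and dataset config (mural.yaml); it is not the authors' released evaluation code, and F1 is simply derived from P and R as their harmonic mean.

```python
from ultralytics import YOLO

# Hypothetical weights and dataset config; substitute your own paths.
model = YOLO("dkr_yolov8.pt")

# Validate on the mural test split; Ultralytics reports precision, recall,
# and mAP, from which F1 can be derived.
metrics = model.val(data="mural.yaml", split="test", imgsz=640)

p, r = metrics.box.mp, metrics.box.mr   # mean precision / mean recall
map50 = metrics.box.map50               # mAP@0.5
f1 = 2 * p * r / (p + r + 1e-9)         # harmonic mean of P and R
print(f"P={p:.3f}  R={r:.3f}  F1={f1:.3f}  mAP@0.5={map50:.3f}")
```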