Article

Semantic and Sketch-Guided Diffusion Model for Fine-Grained Restoration of Damaged Ancient Paintings

1 School of Artificial Intelligence and Computer Science, Shaanxi Normal University, Xi’an 710119, China
2 Key Laboratory of Intelligent Computing and Service Technology for Folk Song, Ministry of Culture and Tourism, Xi’an 710119, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(21), 4187; https://doi.org/10.3390/electronics14214187
Submission received: 25 September 2025 / Revised: 23 October 2025 / Accepted: 24 October 2025 / Published: 27 October 2025

Abstract

Ancient paintings, as invaluable cultural heritage, often suffer from damages like creases, mold, and missing regions. Current restoration methods, while effective for natural images, struggle with the fine-grained control required for ancient paintings’ artistic styles and brushstroke patterns. We propose the Semantic and Sketch-Guided Restoration (SSGR) framework, which uses pixel-level semantic maps to restore missing and mold-affected areas and depth-aware sketch maps to ensure texture continuity in creased regions. The sketch maps are automatically extracted using advanced methods that preserve original brushstroke styles while conveying geometry and semantics. SSGR employs a semantic segmentation network to categorize painting regions and depth-sensitive sketch extraction to guide a diffusion model. To enhance style controllability, we cluster diverse attributes of landscape paintings and incorporate a Semantic-Sketch-Attribute-Normalization (SSAN) block that explores consistent patterns across styles through spatial semantic and attribute-adaptive normalization modules. Evaluated on the CLP-2K dataset, SSGR achieves an mIoU of 53.30%, SSIM of 0.42, and PSNR of 13.11, outperforming state-of-the-art methods. This approach not only preserves historical aesthetics but also advances digital heritage preservation with a tailored, controllable technique for ancient paintings.

1. Introduction

Ancient paintings represent invaluable cultural heritage. Over centuries, exposure to environmental factors and human handling has led to various damages, including flaking, cracking, erosion, pigment loss, aging, microbial degradation, and scratches. Such deterioration diminishes their cultural and artistic value and may obscure the content of the artwork. Thus, conserving and restoring these paintings is a critical task in cultural heritage preservation.
Restoring ancient paintings is challenging due to variations in color, style, and techniques across dynasties and schools. Unlike oil paintings or frescoes, ancient paintings emphasize natural elements (mountains, rivers, trees) through unique methods like blank-leaving, texture strokes, and outline techniques, as illustrated in Figure 1. Traditional manual restoration is labor-intensive, time-consuming, and risks irreversible damage. Digital restoration, enabled by computer technology, allows virtual repair without affecting originals, serving as references for physical work and creating replicable databases.
The objective is to inpaint damaged areas using intact regions, ensuring realistic, detailed results. Early methods used geometric or patch-based approaches, propagating structure via partial differential equations, variational inpainting, or curvature-driven diffusion. These address cracks or missing areas but overlook subtle damages like fine creases or widespread surface issues common in ancient paintings, as shown in Figure 2 for “A Thousand Li of Rivers and Mountains”.
Recent AI advancements [1,2,3,4,5,6], including GANs and diffusion models, train on masked datasets to predict damaged regions. Mainstream techniques include text-driven masked inpainting [7], sketch-guided completion [8], and text-guided semantic editing [9]. Relying on pre-trained weights, they excel on natural images but falter on ancient paintings for three reasons: (1) the need for extremely fine-grained control over regions and brushstrokes; (2) artistic, non-photorealistic object depictions; and (3) emphasis on inheriting the original style rather than improvisation. Figure 3 illustrates these limitations with examples: GAN-based methods (e.g., SPADE and OASIS) often fail to preserve intricate brushstrokes, resulting in artifacts and loss of artistic style, underscoring the need for our multi-conditional approach in SSGR.
While semantic and depth-sensitive sketch maps are automatically generated for efficiency, our SSGR framework supports manual refinements by experts to enhance controllability. Using accessible tools like Photoshop or GIMP, non-technical users can easily adjust semantic categories (e.g., re-label damaged regions) or modify sketches (e.g., refine brushstroke lines), enabling precise, targeted restoration without coding expertise. This interactivity bridges automated processing with expert oversight, making SSGR practical for cultural heritage professionals.
Through consultations with manual restoration experts, we identified primary focuses: repairing missing parts, mold damage, and crease-induced texture breaks and noise, as illustrated in Figure 1 and  Figure 2. To tackle this, we use semantic maps for missing and mold areas, and sketches combined with depth information for crease texture continuity and denoising. Sketches are extracted automatically via methods like Informative drawing [10] and MixSA [11], incorporating depth and preserving brush styles. We employ semantic segmentation for pixel-level maps identifying regions, and depth-sensitive extraction for sketch maps. These guide a diffusion model’s denoising for restoration. Our contributions are:
1. A paired dataset for ancient paintings restoration (originals, semantic maps, sketch maps) for training restoration and related networks.
2. A diffusion-based restoration network using regional semantics and local structures from semantic and depth-sensitive sketch maps to guide denoising.
3. A Semantic-Sketch-Attribute-Normalization (SSAN) block with multi-layer spatially adaptive normalization, integrating semantic layouts and sketch structures for high-quality decoding.
4. Integration of sketch guidance and class space attention to capture intricate details and artistic essence, bridging AI art generation and traditional ancient paintings.
This work advances ancient paintings preservation and digital cultural heritage with techniques maintaining visual and cultural integrity.

2. Materials and Methods

The primary objective of this research is to develop a novel AI-based framework for the digital restoration of damaged ancient paintings, which often suffer from deteriorations such as creases, pigment loss, mold degradation, and surface damage. This framework leverages advanced deep learning techniques, including diffusion models conditioned on semantic segmentation maps and depth-sensitive sketch maps, to inpaint and restore damaged regions while preserving the original artistic integrity, stylistic elements, and unique brushwork techniques, such as the “Cunfa” (texture strokes) and outline methods characteristic of Chinese art. We demonstrate the efficacy of our approach on a dataset of traditional ancient paintings, serving as a robust benchmark for restoring fine-grained details and textures inherent to this art form.
Practically, this framework provides an efficient digital restoration tool that can act as a reference for physical conservation efforts and enable the creation of a permanent, replicable digital archive for these cultural heritage assets. By doing so, it not only advances the preservation of ancient paintings but also contributes to the broader domain of digital cultural heritage, offering a scalable solution applicable to diverse historical artworks.
We position our proposed task and methodology within the context of state-of-the-art advancements in semantic-guided image completion and deep learning-based restoration for cultural heritage.

2.1. State of the Art

Cultural Heritage Image Generation: Reproducing or simulating ancient paintings remains a formidable challenge, even for skilled artists. To enable ordinary users to generate ancient paintings mimicking ancient masters’ styles, researchers have developed various generative models. For instance, SAPGAN [12] represents the first end-to-end model for generating ancient paintings without conditional inputs. CLDiff [13] employs parameterized Markov chains to progressively transform Gaussian noise into super-resolution ancient paintings from low-resolution inputs, yielding clear ink textures. CLPdiffusion [14] decouples content and style generation, allowing flexible control over ancient paintings synthesis with specified content or styles. CCLAP [15] incorporates textual prompts for controlled generation of content and styles in ancient paintings. Recent advancements also include machine learning-based digital preservation techniques [16], which enhance the fidelity of generated ancient Chinese paintings, and deep learning approaches for their analysis and generation [17]. Additionally, computer vision-based methods have been applied to analyze traditional Chinese painting compositions [18], while graph neural networks have been used for structure-aware analysis of Chinese landscape paintings [1]. Although these methods exhibit strong generative capabilities, their limited user interactivity reduces controllability and engagement.
Generative Models Based on Conditional Expectation: Generative models that exploit conditional expectations have become prominent for learning and sampling from conditional distributions. The Variational Autoencoder (VAE) [19] encodes input images into a latent space, balancing compression and feature representation, though it often yields blurry outputs. Generative Adversarial Networks (GANs) [20] mitigate this by introducing a generator-discriminator adversarial framework, with applications in Chinese calligraphy synthesis [21]. Bayesian networks model conditional dependencies via directed acyclic graphs, while Energy-Based Models (EBMs) [22] and Normalizing Flows (NFs) [23] offer alternative approaches, albeit with scalability challenges. General generative models like VAEs, GANs, EBMs, and NFs face limitations in addressing the unique challenges of ancient painting restoration. VAEs often produce blurry outputs, failing to capture the intricate brushstrokes of ancient artworks. GANs suffer from mode collapse and instability, which hinder consistent style preservation across diverse artistic schools. EBMs and NFs, while theoretically robust, struggle with optimization complexity and scalability for high-resolution restoration tasks. Diffusion models, by contrast, excel in iterative denoising, enabling precise control over fine-grained details and artistic styles when conditioned on semantic maps and depth-sensitive sketches, as proposed in our framework.
Denoising Diffusion Probabilistic Models (DDPMs): DDPMs [2,3] have revolutionized image synthesis by estimating clean image distributions from noisy inputs, producing high-quality results. However, their high computational demands during training and inference have prompted optimizations [24]. Originating from [25], diffusion models have evolved through milestones like variational formulations [26] and beating GANs in quality [27]. Enhancements include Denoising Diffusion Implicit Models (DDIM) [24] and score-based methods [3]. For high-resolution tasks, cascaded [28,29] and latent-space [30] approaches have proven effective.
Semantic Image Synthesis: This task involves generating photorealistic images from semantic segmentation maps, enabling precise content control. Pioneering works like Pix2Pix [31] and SPADE [32] set foundations, with subsequent innovations incorporating instance boundaries [33], spatially-adaptive transformations [32], and class-specific generation [34,35,36,37]. Recent work also explores fine-grained text-driven generation, such as human motion synthesis [21,38], which shares similarities with semantic control in painting restoration.
Classifier-Free Diffusion Guidance: This technique conditions diffusion models without external classifiers, differing from GAN-based normalization like SEAN [39], SCGAN [36], and INADE [40]. In diffusion contexts, AdaGN [27] conditions on class embeddings and timesteps. Multi-layer spatially-adaptive normalization [41] injects segmentation masks into U-Net decoders, while spatial transformers [30,42] provide versatile conditioning.
In contrast, our work fuses semantic maps, attribute encodings, and sketch maps within a diffusion model framework. We integrate cross-attention in U-Net stages to condition on semantic masks, attributes, and depth-sensitive sketches, enabling multi-modal control for fine-grained ancient paintings restoration. Our approach emphasizes controllability, allowing experts to intervene by manually adjusting automatically extracted semantic and sketch maps. This feature, detailed in later sections, ensures fine-grained control over restoration, distinguishing SSGR from fully automated methods.

2.2. Dataset

We utilize the CLP-2K dataset [43], as no high-precision, pixel-level labeled benchmark for ancient paintings exists. Given the labor-intensive nature of manual annotation for high-resolution images, we employ the CLPPP segmentation model [43] to parse ancient paintings from sources such as the National Palace Museum of China, generating automatic pixel-level annotations. The resulting dataset focuses on ancient paintings and encompasses 14 semantic categories (e.g., mountains, trees, rivers) plus an “unknown” class. To assess the reliability of the automatic annotations, we conducted quality control by manually verifying 100 randomly selected samples, achieving a mean Intersection over Union (mIoU) of 89.3% against manual labels, confirming their high accuracy. To ensure diversity, large original paintings are resized and cropped before annotation. The dataset includes 2207 manually labeled and 7049 automatically labeled images, totaling 9256 entries, a balance that enhances diversity and supports robust model training across varied artistic styles. However, minor annotation errors may introduce noise, potentially affecting performance on edge cases, which we mitigate through robust training strategies such as classifier-free guidance. Each image is associated with attributes reflecting styles and artistic schools.
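For reproducibility of this check, the mIoU between automatic and manual label maps can be computed with a few lines of NumPy. The snippet below is an illustrative sketch only: the array names, the 15-class assumption (14 categories plus “unknown”), and the verification loop are ours, not part of the released dataset tooling.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int = 15) -> float:
    """Mean Intersection over Union between an automatic and a manual label map."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:            # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Hypothetical usage over the 100 manually verified samples (Section 2.2):
# scores = [mean_iou(auto_labels[i], manual_labels[i]) for i in range(100)]
# print(np.mean(scores))        # the paper reports 89.3%
```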

2.3. Problem Formalization

Let $M \in \mathbb{R}^{H \times W}$ denote the semantic segmentation map, where $H$ is the height and $W$ is the width of the image, with each pixel value indicating a semantic category. Our goal is to synthesize a restored ancient painting image $I \in \mathbb{R}^{H \times W \times 3}$ that aligns with $M$, an attribute vector $A \in \{0,1\}^{n}$ (where $n$ is the number of attributes, encoding binary stylistic features), and a depth-sensitive sketch map $S \in \mathbb{R}^{H \times W \times 1}$.
The diffusion model reverses a forward Markov chain that adds Gaussian noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ (where $\mathbf{I}$ is the identity matrix) to the clean data $X_0 \sim q(X_0)$ (the data distribution), progressively noising it to $X_T \sim \mathcal{N}(0, \mathbf{I})$. The reverse process denoises $X_T$ back to $X_0$ over $T$ timesteps, conditioned on $M$, $A$, and $S$.
The forward process for each timestep $t = 1, \dots, T$ is:
$$q(X_t \mid X_{t-1}) = \mathcal{N}\big(X_t;\ \sqrt{1-\beta_t}\,X_{t-1},\ \beta_t \mathbf{I}\big),$$
where $\beta_t \in (0, 1)$ is the variance schedule.
The joint reverse distribution is:
$$p(X_{0:T} \mid M, A, S) = p(X_T) \prod_{t=1}^{T} p_\theta(X_{t-1} \mid X_t, M, A, S),$$
with $p(X_T) = \mathcal{N}(X_T; 0, \mathbf{I})$, and each reverse transition
$$p_\theta(X_{t-1} \mid X_t, M, A, S) = \mathcal{N}\big(X_{t-1};\ \mu_\theta(X_t, t, M, A, S),\ \sigma_t^2 \mathbf{I}\big),$$
where $\mu_\theta$ and $\sigma_t$ are learned or predefined.
To train, sample $X_t \sim q(X_t \mid X_0)$ directly:
$$X_t = \sqrt{\bar{\alpha}_t}\,X_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,$$
where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, and train a noise predictor $\epsilon_\theta(X_t, t, M, A, S)$ via the mean squared error (MSE) loss:
$$L_s = \mathbb{E}_{t \sim [1,T],\, X_0 \sim q(X_0),\, \epsilon \sim \mathcal{N}(0,\mathbf{I})}\ \big\| \epsilon - \epsilon_\theta(X_t, t, M, A, S) \big\|_2^2.$$
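The closed-form sampling of $X_t$ in Equation (4) and the MSE objective in Equation (5) translate directly into a training step. The following PyTorch-style sketch is illustrative only: the linear variance schedule, the tensor shapes, and the interface of the conditional noise predictor `eps_theta` are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear variance schedule beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of alpha_t

def training_step(eps_theta, x0, M, A, S):
    """One conditional DDPM training step implementing Equations (4) and (5)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)           # t ~ U[1, T]
    eps = torch.randn_like(x0)                                 # eps ~ N(0, I)
    a_bar = alphas_bar.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps       # Eq. (4)
    eps_hat = eps_theta(x_t, t, M, A, S)                       # conditional U-Net prediction
    return F.mse_loss(eps_hat, eps)                            # L_s, Eq. (5)
```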

2.4. Proposed Framework

Our framework, illustrated in Figure 4, employs a U-Net for noise estimation ϵ θ . During training, the encoder processes noisy inputs X t , while the decoder integrates multi-layer normalization with semantic maps M, attributes A, and sketches S.
Encoder: Features are extracted using Semantic Diffusion Encoder Resblocks (SDE) and Attention Blocks (AB), as shown in Figure 5a. Each SDE includes convolutions, SiLU activation ($f(x) = x \cdot \sigma(x)$, where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid), and group normalization. Timestep conditioning uses learned weights $w(t) \in \mathbb{R}^{1 \times 1 \times C}$ and biases $b(t) \in \mathbb{R}^{1 \times 1 \times C}$ ($C$: channels):
$$F^{i+1} = w(t) \cdot F^{i} + b(t),$$
where $F^{i}$ is the input feature map. The AB applies self-attention with residuals.
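As a concrete illustration of the timestep conditioning in Equation (6), the scale-and-shift modulation can be written as a small module. This is a sketch under our own assumptions (embedding size, projection layers); the paper does not specify these details.

```python
import torch
import torch.nn as nn

class TimestepModulation(nn.Module):
    """Scale-and-shift timestep conditioning of Eq. (6): F^{i+1} = w(t) * F^i + b(t)."""
    def __init__(self, channels: int, t_dim: int = 256):
        super().__init__()
        # Project a timestep embedding to per-channel weights w(t) and biases b(t).
        self.proj = nn.Sequential(nn.SiLU(), nn.Linear(t_dim, 2 * channels))

    def forward(self, feat: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) feature map F^i; t_emb: (B, t_dim) timestep embedding.
        w, b = self.proj(t_emb).chunk(2, dim=-1)
        return w[:, :, None, None] * feat + b[:, :, None, None]
```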
Decoder: Integrates $M$, $A$, and $S$ via Semantic-Sketch-Attribute-Normalization (SSAN) blocks, depicted in Figure 5b,c. Unlike standard SPADE [32], SSAN generates modulation parameters $\gamma_1^{i}, \beta_1^{i}$ from semantics, then fuses attributes and sketches for $\gamma_2^{i}, \beta_2^{i}$:
$$F^{i+1} = \gamma_2^{i}(M, A, S) \cdot \left( \gamma_1^{i}(M) \cdot \frac{F^{i} - \mu^{i}}{\sigma^{i}} + \beta_1^{i}(M) \right) + \beta_2^{i}(M, A, S),$$
where $\mu^{i}$, $\sigma^{i}$ are the mean and standard deviation from group normalization.
The conditioned reverse process is:
$$p_\theta(X_{0:T} \mid M, A, S) = p(X_T) \prod_{t=1}^{T} p_\theta(X_{t-1} \mid X_t, M, A, S).$$
Optimization minimizes:
$$\arg\min_{\theta}\ D_{KL}\big( q(X_{t-1} \mid X_t, X_0) \,\|\, p_\theta(X_{t-1} \mid X_t, M, A, S) \big) \;\propto\; \mathbb{E}\,\big\| \epsilon - \epsilon_\theta(X_t, t, M, A, S) \big\|_2^2.$$
The loss is:
$$L_s = \mathbb{E}_{t, X_0, \epsilon}\,\big\| \epsilon - \epsilon_\theta(X_t, t, M, A, S) \big\|_2^2.$$
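For completeness, the conditioned reverse process in Equation (8) corresponds to a standard ancestral sampling loop. The sketch below is a plain DDPM sampler with the posterior variance set to $\beta_t$; clipping of the predicted $X_0$ and other practical refinements are omitted, and the `eps_theta` interface is assumed.

```python
import torch

@torch.no_grad()
def sample(eps_theta, M, A, S, shape, betas):
    """Ancestral sampling of the conditioned reverse process (Eq. 8)."""
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=betas.device)                 # X_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        tb = torch.full((shape[0],), t, device=betas.device, dtype=torch.long)
        eps = eps_theta(x, tb, M, A, S)                         # predicted noise
        # Posterior mean mu_theta expressed through the predicted noise (DDPM).
        mean = (x - betas[t] / (1.0 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise                      # sigma_t^2 = beta_t
    return x
```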
Semantic maps (M) and depth-sensitive sketch maps (S) are injected into the U-Net decoder via the Semantic-Sketch-Attribute-Normalization (SSAN) block (Figure 5c). The SSAN block employs multi-layer spatially adaptive normalization, where M generates initial modulation parameters ( γ 1 i , β 1 i ) via convolutional layers, followed by fusion with S and attributes (A) to produce final parameters ( γ 2 i , β 2 i ) for feature normalization (Equation (7)). This ensures that semantic and structural information guide the denoising process for precise restoration.
The inpainting process implicitly uses a mask derived from the semantic map (M), where damaged regions (e.g., missing or mold-affected areas) are labeled as the ‘unknown’ class or expert-estimated known classifications (Section 2.2). This mask guides the diffusion process by conditioning the reverse transitions ( p θ ( X t 1 | X t , M , A , S ) , Equation (8)) across all T steps, with the SSAN block modulating features to prioritize restoration in these regions, ensuring targeted inpainting while preserving intact areas.
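To make the two-stage modulation of Equation (7) concrete, the following is a simplified PyTorch sketch of an SSAN block. The channel sizes, the one-hot encoding of M, and the way A is broadcast and concatenated with S are our assumptions; Figure 5c should be consulted for the actual design.

```python
import torch
import torch.nn as nn

class SSANBlock(nn.Module):
    """Simplified Semantic-Sketch-Attribute-Normalization (Eq. 7).
    M: one-hot semantic map (B, K, H, W); S: sketch map (B, 1, H, W); A: attributes (B, n)."""
    def __init__(self, channels, sem_classes, attr_dim, hidden=128, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)        # (F - mu) / sigma
        # Stage 1: modulation parameters from the semantic map M only.
        self.sem = nn.Sequential(nn.Conv2d(sem_classes, hidden, 3, padding=1), nn.SiLU())
        self.gamma1 = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta1 = nn.Conv2d(hidden, channels, 3, padding=1)
        # Stage 2: modulation parameters fusing M, A, and the sketch S.
        self.fuse = nn.Sequential(
            nn.Conv2d(sem_classes + attr_dim + 1, hidden, 3, padding=1), nn.SiLU())
        self.gamma2 = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta2 = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, feat, M, A, S):
        A_map = A[:, :, None, None].expand(-1, -1, S.shape[-2], S.shape[-1])
        h1 = self.sem(M)
        h2 = self.fuse(torch.cat([M, A_map, S], dim=1))
        x = self.gamma1(h1) * self.norm(feat) + self.beta1(h1)   # gamma_1 (F-mu)/sigma + beta_1
        return self.gamma2(h2) * x + self.beta2(h2)              # outer modulation of Eq. (7)
```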

2.5. Classifier-Free Guidance

To strengthen conditioning on M and S, we adopt classifier-free guidance [49]. Standard DDPMs may lack photorealism; thus, we train for both conditioned and unconditioned samples.
We use a two-stage strategy: first, train with full conditions; then, fine-tune by dropping 30% of M or S to the null condition ∅ (e.g., a uniform gray map). The 30% dropout rate for conditions during fine-tuning was empirically determined through experiments testing rates from 10% to 50%. A 30% rate optimally balanced robustness to missing conditions and training stability, achieving an mIoU of 53.3% compared to 52.8% for 20% and 52.5% for 40%, as evaluated on the CLP-2K dataset. Higher rates risked underfitting due to insufficient conditional information, while lower rates reduced the model’s ability to generalize to incomplete inputs. During sampling, the predictions are combined with guidance scales $s_m, s_e > 1$:
$$\epsilon_\theta(X_t, t, M, S) = \epsilon_\theta(X_t, t, \varnothing, \varnothing) + s_m \big[ \epsilon_\theta(X_t, t, M, \varnothing) - \epsilon_\theta(X_t, t, \varnothing, \varnothing) \big] + s_e \big[ \epsilon_\theta(X_t, t, \varnothing, S) - \epsilon_\theta(X_t, t, \varnothing, \varnothing) \big].$$
We incorporate the improved DDPM [24] with a variational lower bound loss $L_{vlb}$, yielding the hybrid objective:
$$L_{hybrid} = L_s + \lambda L_{vlb},$$
where $\lambda$ balances the terms (typically 1), enhancing fidelity.
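The guided combination in Equation (10) and the condition dropout used during fine-tuning can be sketched as follows. The null condition is represented here as a uniform-gray tensor, following the description above; the attribute condition A is omitted for brevity (as in Equation (10)), and the scale values are placeholders rather than the paper's settings.

```python
import torch

def guided_eps(eps_theta, x_t, t, M, S, s_m=2.0, s_e=1.5):
    """Classifier-free guidance combination of Eq. (10) with scales s_m, s_e > 1."""
    null_M = torch.full_like(M, 0.5)          # null semantic condition (uniform gray)
    null_S = torch.full_like(S, 0.5)          # null sketch condition
    e_uncond = eps_theta(x_t, t, null_M, null_S)
    e_sem = eps_theta(x_t, t, M, null_S)
    e_sketch = eps_theta(x_t, t, null_M, S)
    return (e_uncond
            + s_m * (e_sem - e_uncond)
            + s_e * (e_sketch - e_uncond))

def drop_conditions(M, S, p=0.3):
    """Fine-tuning condition dropout: each of M and S is independently replaced
    by the null condition with probability p (30% in Section 2.5)."""
    if torch.rand(()) < p:
        M = torch.full_like(M, 0.5)
    if torch.rand(()) < p:
        S = torch.full_like(S, 0.5)
    return M, S
```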

2.6. Attributes Clustering

Attributes guide stylistic generation. Inspired by [44], we extract features using ResNet-50 [45] and cluster in latent space, as detailed in Algorithm 1.
Features are reduced via PCA, and the optimal number of clusters is selected by the Silhouette score; a library-based sketch of this procedure follows Algorithm 1. Results are visualized in Figure 6, capturing color and structural patterns for diverse synthesis.
Algorithm 1 Attributes Clustering for Ancient Paintings Dataset.
Require: ancient paintings dataset images $\{I_k\}_{k=1}^{K}$
Ensure: cluster assignments $\{c_k\}_{k=1}^{K}$
1: Extract features: $f_k = \text{ResNet-50}(I_k)$ for each $k$
2: Center data: $\bar{f} = \frac{1}{K}\sum_{k=1}^{K} f_k$; $f_k \leftarrow f_k - \bar{f}$
3: Compute covariance: $C = \frac{1}{K-1}\sum_{k=1}^{K} f_k f_k^{T}$
4: Eigen decomposition: eigenvalues $\lambda$, eigenvectors $V$ of $C$
5: Select top $d$ components: $P = V[:, 1{:}d]$
6: Project: $z_k = P^{T} f_k$
7: for $n_c = 2$ to $N_{\max}$ do
8:   Cluster $\{z_k\}$ into $n_c$ clusters using K-means
9:   Compute Silhouette score $s(n_c)$
10: end for
11: Optimal $n_c^{*} = \arg\max_{n_c} s(n_c)$
12: Final clusters with $n_c^{*}$
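Algorithm 1 maps directly onto standard library components. The sketch below uses torchvision's ResNet-50 and scikit-learn's PCA, K-means, and Silhouette score; the PCA dimension d, the maximum cluster count, and the preprocessing are our assumptions rather than the paper's exact settings.

```python
import torch
from torchvision import models, transforms
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_attributes(images, d=64, n_max=10):
    """Attributes clustering (Algorithm 1): ResNet-50 features -> PCA -> K-means,
    with the number of clusters chosen by the Silhouette score. `images` are PIL images."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()                 # keep the pooled 2048-d features
    backbone.eval()
    prep = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    with torch.no_grad():
        feats = torch.stack([backbone(prep(im).unsqueeze(0)).squeeze(0) for im in images])
    z = PCA(n_components=d).fit_transform(feats.numpy())          # steps 2-6
    best = (None, -1.0, None)
    for k in range(2, n_max + 1):                                  # steps 7-10
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(z)
        score = silhouette_score(z, labels)
        if score > best[1]:
            best = (k, score, labels)
    return best[0], best[2]                                        # optimal n_c* and assignments
```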

2.7. Depth-Sensitive Sketch Extraction

To generate sketch maps that capture both the geometric structure and semantic content of ancient paintings, we adapt the methodology from Informative Drawings [10]. Unlike traditional edge detection (e.g., Canny), this approach employs a neural network trained to produce sketch maps S R H × W × 1 that preserve artistic brushstroke styles while incorporating depth information. The process is detailed in Algorithm 2, visualized in Figure 7.
Algorithm 2 Depth-Sensitive Sketch Extraction.
Require: ancient painting image $I \in \mathbb{R}^{H \times W \times 3}$, depth map $D \in \mathbb{R}^{H \times W}$, pre-trained CLIP model, generator network $G$, discriminator network $D_{\text{sketch}}$
Ensure: sketch map $S \in \mathbb{R}^{H \times W \times 1}$
1: Convert to grayscale: $I_g = 0.299\,I_R + 0.587\,I_G + 0.114\,I_B$, where $I_R, I_G, I_B$ are the red, green, and blue channels
2: Normalize estimated depth: $D_{\text{norm}} = \frac{D_{\text{est}} - \bar{D}_{\text{est}}}{\sigma_{D_{\text{est}}}}$, where $D_{\text{est}}$ is the estimated scene depth, $\bar{D}_{\text{est}}$ is its mean, and $\sigma_{D_{\text{est}}}$ is its standard deviation
3: Concatenate inputs: $X = [I_g, D_{\text{norm}}] \in \mathbb{R}^{H \times W \times 2}$
4: Generate sketch: $S = G(X)$, where $G$ is the generator network
5: Compute CLIP semantic loss: $L_{\text{CLIP}} = 1 - \cos(E_{\text{CLIP}}(I), E_{\text{CLIP}}(S))$, where $E_{\text{CLIP}}$ is CLIP’s image encoder and $\cos$ is cosine similarity
6: Compute geometry loss: $L_{\text{geom}} = \mathbb{E}\,\| \nabla S - \nabla D_{\text{norm}} \|_2^2$, where $\nabla$ denotes the gradient
7: Compute adversarial loss: $L_{\text{adv}} = \mathbb{E}[\log D_{\text{sketch}}(S_{\text{real}})] + \mathbb{E}[\log(1 - D_{\text{sketch}}(S))]$, where $S_{\text{real}}$ is a reference sketch
8: Compute cycle-consistency loss: $L_{\text{cyc}} = \mathbb{E}\,\| G'(G(X)) - X \|_1$, where $G'$ is an inverse generator
9: Total loss: $L = \lambda_1 L_{\text{CLIP}} + \lambda_2 L_{\text{geom}} + \lambda_3 L_{\text{adv}} + \lambda_4 L_{\text{cyc}}$, where $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ are weights
10: Optimize: update $G$ and $D_{\text{sketch}}$ to minimize $L$
11: Output refined sketch: $S = \text{clamp}(S, 0, 1)$
This algorithm, inspired by [10], trains a generator G to produce sketch maps S that align with the semantic content of the input image (via CLIP) and its geometric structure (via depth gradients). The cycle-consistency loss ensures the generated sketch can be mapped back to the input, preserving artistic details. The resulting sketch map S, as shown in Figure 2, captures fine brushstrokes and depth-sensitive structures suitable for guiding painting restoration. To evaluate the generalizability of our depth-sensitive sketch extraction across artistic styles, we tested Algorithm 2 on a subset of 200 calligraphy-style images. The method achieved a geometry loss of 0.015 (comparable to 0.013 for landscapes) and a CLIP semantic loss of 0.22 (vs. 0.20 for landscapes), indicating robust performance. However, calligraphy’s sparse, linear strokes present challenges for depth sensitivity, which relies on gradient-based geometry. Future work could enhance depth modeling for such styles to improve generalizability.
The depth used in our sketch extraction refers to the estimated depth of the scene within the painting, capturing geometric structures like mountains or rivers, rather than the physical mesostructure of the artwork. This estimated depth is derived using a pre-trained monocular depth estimation model (e.g., MiDaS) and normalized as D norm to guide sketch generation for structural continuity.
Compared to the Informative Drawings method [10], our approach introduces two key modifications: (1) incorporation of estimated depth ( D norm ) into the input concatenation ( X = [ I g , D norm ] ) to capture scene geometry, and (2) a geometry loss ( L geom ) to align sketches with depth gradients, enhancing robustness to creases (Figure 2). We also explored using semantic maps alongside CLIP for sketch guidance, achieving a marginal improvement in CLIP semantic loss (0.19 vs. 0.20), but retained CLIP alone for its robustness across diverse artistic styles.
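A condensed training-loss sketch for Algorithm 2 is given below. It assumes a CLIP image encoder callable `clip_encode`, an explicit inverse generator `G_inv` for the cycle term, finite-difference gradients for the geometry term, and the weight setting (1.0, 0.5, 1.0, 1.0) reported in Section 3.5; the implementation details beyond the loss definitions are our assumptions, not the original code.

```python
import torch
import torch.nn.functional as F

def grads(x):
    """Finite-difference gradients along height and width (used in L_geom)."""
    return x[..., 1:, :] - x[..., :-1, :], x[..., :, 1:] - x[..., :, :-1]

def sketch_loss(G, G_inv, D_sketch, clip_encode, I, I_gray, D_norm, S_real,
                weights=(1.0, 0.5, 1.0, 1.0)):
    """Composite objective of Algorithm 2 (steps 3-9)."""
    X = torch.cat([I_gray, D_norm], dim=1)                        # X = [I_g, D_norm]
    S = G(X)                                                      # generated sketch
    # L_CLIP: cosine distance between CLIP embeddings of painting and sketch.
    e_img, e_sk = clip_encode(I), clip_encode(S.repeat(1, 3, 1, 1))
    l_clip = 1.0 - F.cosine_similarity(e_img, e_sk, dim=-1).mean()
    # L_geom: sketch gradients should follow the estimated depth gradients.
    (sy, sx), (dy, dx) = grads(S), grads(D_norm)
    l_geom = F.mse_loss(sy, dy) + F.mse_loss(sx, dx)
    # L_adv: GAN value function of step 7 (D_sketch maximizes it, G minimizes its term).
    l_adv = (torch.log(D_sketch(S_real) + 1e-8).mean()
             + torch.log(1.0 - D_sketch(S) + 1e-8).mean())
    # L_cyc: an inverse generator maps the sketch back to the concatenated input.
    l_cyc = F.l1_loss(G_inv(S), X)
    w1, w2, w3, w4 = weights
    return w1 * l_clip + w2 * l_geom + w3 * l_adv + w4 * l_cyc
```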

2.8. Attention-Based Fusion

We fuse the sketch $S$ and attributes $A$ via cross-attention [42] to enhance capacity with limited data. The fused features $F_{\text{fusion}}$ capture stylistic nuances:
$$Q = W_q * S, \quad K = W_k A, \quad V = W_v A,$$
$$\text{Attn} = \text{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V,$$
$$F_{\text{fusion}} = \sigma(W_f * \text{Attn}) \odot S,$$
where $W_q, W_k, W_v, W_f$ are learnable weights, $d_k$ is the key dimension, $*$ is convolution, $\odot$ is element-wise multiplication, and $\sigma$ is the sigmoid. This addresses unrealistic textures and improves adaptability to diverse ancient painting styles.
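A compact PyTorch sketch of Equations (12)-(14) follows. Treating the attribute code as a small set of learned tokens, the number of tokens and the channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SketchAttributeFusion(nn.Module):
    """Cross-attention fusion of sketch S and attributes A (Eqs. 12-14)."""
    def __init__(self, attr_dim, d_k=64, n_tok=8):
        super().__init__()
        self.d_k, self.n_tok = d_k, n_tok
        self.W_q = nn.Conv2d(1, d_k, 3, padding=1)          # Q = W_q * S (convolution)
        self.W_k = nn.Linear(attr_dim, n_tok * d_k)          # K = W_k A (attribute tokens)
        self.W_v = nn.Linear(attr_dim, n_tok * d_k)          # V = W_v A
        self.W_f = nn.Conv2d(d_k, 1, 3, padding=1)           # fusion convolution

    def forward(self, S, A):
        B, _, H, W = S.shape
        q = self.W_q(S).flatten(2).transpose(1, 2)           # (B, HW, d_k)
        k = self.W_k(A).view(B, self.n_tok, self.d_k)
        v = self.W_v(A).view(B, self.n_tok, self.d_k)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.d_k ** 0.5, dim=-1) @ v
        attn = attn.transpose(1, 2).reshape(B, self.d_k, H, W)
        gate = torch.sigmoid(self.W_f(attn))                 # sigma(W_f * Attn)
        return gate * S                                      # F_fusion = gate (element-wise) S
```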

3. Results

To evaluate our SSGR framework, we conducted comprehensive experiments on the CLP-2K dataset [43], focusing on restoring damaged ancient paintings. We assess performance through quantitative metrics, qualitative comparisons, and a user study, benchmarking against state-of-the-art methods. Results demonstrate SSGR’s superior ability to restore fine-grained details, maintain stylistic fidelity, and achieve semantic coherence, as visualized in Figure 8.

3.1. Evaluation Setup

Since ground-truth images for naturally damaged ancient paintings are unavailable, we simulate damage by applying synthetic masks to intact regions of CLP-2K images, allowing comparison between restored and original images. We use Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) to assess inpainting quality. Inspired by [46,47], SSIM evaluates structural, brightness, and contrast fidelity, while PSNR measures reconstruction accuracy, with higher values indicating better quality. Additionally, we compute Fréchet Inception Distance (FID) and Kernel Inception Distance (KID) for distributional similarity, and mean Intersection over Union (mIoU) for semantic alignment. A user study complements these metrics, capturing human perception of artistic quality.
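The pixel-level quality metrics can be computed with scikit-image; the snippet below is a minimal sketch of the simulated-damage protocol described above, assuming the restored and original images are available as HxWx3 uint8 arrays.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def quality_metrics(restored: np.ndarray, reference: np.ndarray):
    """PSNR and SSIM between a restored image and the intact original (Section 3.1)."""
    psnr = peak_signal_noise_ratio(reference, restored, data_range=255)
    ssim = structural_similarity(reference, restored, channel_axis=-1, data_range=255)
    return psnr, ssim
```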

3.2. Quantitative Results

Table 1 presents a quantitative comparison of SSGR against GAN-based methods (SPADE [32], CLADE [48], OASIS [37]) and the diffusion-based SDM [41], all conditioned on semantic maps for fair comparison. SSGR achieves a competitive FID of 108.93, balancing fidelity and diversity, and the highest mIoU of 53.30%, indicating strong semantic alignment. For diversity, SSGR’s KID of 3.24 × 10⁻³ reflects a close distributional match to real ancient paintings. In quality, SSGR excels with an SSIM of 0.42 and a PSNR of 13.11, demonstrating superior structural preservation and reconstruction fidelity. These results highlight SSGR’s ability to produce high-quality, semantically coherent restorations tailored to the artistic nuances of ancient paintings.

3.3. Qualitative Results

Figure 8 showcases qualitative comparisons on ancient painting restoration tasks. The semantic map (a) and depth-sensitive sketch map (b), generated via Algorithm 2 and visualized in Figure 7, guide SSGR to produce restorations (g) that outperform SPADE (c), CLADE (d), OASIS (e), and SDM (f). While SDM captures detailed textures, it struggles with the diverse artistic styles of ancient paintings, often producing inconsistent brushstrokes. In contrast, SSGR, leveraging the Semantic-Sketch-Attribute-Normalization (SSAN) block (Figure 5), accurately restores intricate elements like mountain textures and water surfaces, maintaining fidelity to the original style, as seen in Figure 1. The attribute clustering (Figure 6) further enhances stylistic consistency across varied ancient painting schools.
Figure 8. Qualitative results on ancient painting restoration. (a) Semantic map, (b) sketch map, (c) SPADE [32], (d) CLADE [48], (e) OASIS [37], (f) SDM [41], (g) SSGR (Ours).
While SSGR achieves a slightly higher FID than some baselines, this reflects a trade-off favoring restoration fidelity (e.g., higher PSNR and mIoU) and stylistic diversity over strict perceptual similarity. FID, designed for photorealistic images, may undervalue the artistic variations preserved by our attribute integration, which enhances realism in cultural heritage contexts.

3.4. Impact of Manual Map Adjustments

To demonstrate controllability, we present results from manual refinements of the semantic and sketch maps (Figure 9). Starting from the automatically generated maps, we manually re-labeled ambiguous mold areas as “unknown” in the semantic map (b, right) and refined the sketch map to emphasize crease continuity (c, right). The resulting restorations (d) show improved texture fidelity and reduced artifacts compared to the automatic output (e.g., PSNR increased from 12.8 to 13.5 in the adjusted case). These modifications are easy for non-technical experts, involving intuitive drag-and-draw operations in user-friendly software, and typically take 5–10 min per image.

3.5. Ablation Study

To dissect the contributions of sketch guidance and attribute encoding, we conducted an ablation study, summarized in Table 2 and visualized in Figure 10. The baseline (no sketch or attributes) yields lower performance across all metrics. Adding sketch guidance (SSGR with Ed.) improves PSNR (12.62) and mIoU (52.97%), enhancing texture details in creased regions. Incorporating attributes (SSGR with Att.) boosts diversity (IS: 3.17, KID: 33.45 × 10⁻³) by capturing style variations. The full SSGR model (with both) achieves the best results (IS: 3.24, PSNR: 13.11, mIoU: 53.30%, KID: 13.27 × 10⁻³), as it effectively integrates semantic maps, depth-sensitive sketches, and style attributes, ensuring detailed and stylistically faithful restorations.
The total loss in Algorithm 2 is defined as L = λ 1 L CLIP + λ 2 L geom + λ 3 L adv + λ 4 L cyc , where L CLIP ensures semantic alignment, L geom aligns sketches with estimated depth gradients, L adv enhances realism, and L cyc ensures input-output consistency. We conducted an ablation study to select the weights, testing combinations ( λ 1 , λ 2 , λ 3 , λ 4 ) = (1.0, 0.5, 1.0, 1.0), (0.5, 1.0, 1.0, 1.0), and (1.0, 0.5, 0.5, 2.0). The first set achieved the lowest combined CLIP and geometry loss (0.21 and 0.013, respectively), balancing semantic and structural fidelity.
Inspired by SDM [41] and CFG [49], we tested dropout rates ranging from 10% to 50%, achieving an optimized mIoU of 53.3% at a 30% rate. Table 3 details metrics (mIoU, PSNR, KID) across different dropout rates (10%, 20%, 30%, 40%, 50%) on the CLP-2K dataset, confirming the trade-off between robustness and stability.

4. Discussion

SSGR addresses key challenges in ancient painting restoration, including precise control over subtle damages, artistic representation, and style preservation. The integration of depth-sensitive sketches (Figure 2) and attribute clustering (Algorithm 1) enables SSGR to outperform baselines by capturing intricate brushwork and stylistic diversity. Compared to SDM, SSGR better handles the unique artistic elements of ancient paintings, such as texture strokes and semantic white spaces, as shown in Figure 8. These results validate SSGR’s potential as a robust tool for digital heritage preservation. A limitation of our approach is the potential for error propagation from the CLPPP segmentation model used to generate semantic maps for the CLP-2K dataset. Although CLPPP achieves high accuracy (mIoU of 89.3% as verified on 100 samples), minor errors in ambiguous or complex regions could affect restoration quality. To mitigate this, our framework integrates depth-sensitive sketch maps and attribute encodings, which provide complementary structural and stylistic guidance to enhance robustness against segmentation inaccuracies. The SSGR framework is also computationally intensive due to the iterative nature of diffusion models. Training on the CLP-2K dataset (9256 images) requires approximately 48 h on a single NVIDIA 3090 GPU (24GB) with a batch size of 8 and 1000 diffusion steps. Inference for a 512 × 512 image takes about 10 s but is reduced to 2 s using DDIM sampling with 50 steps. Future optimizations, such as model pruning or mixed-precision training, could enhance scalability for larger datasets or resource-constrained environments.
Despite SSGR’s strong performance, limitations exist in certain cases. For instance, in the second example of Figure 10, unintended color propagation (e.g., blue hues) occurs in complex regions due to challenges in balancing semantic guidance and style preservation. Similarly, in the first example, sketch details are less perceptible, as the model prioritizes semantic coherence over fine-grained sketch fidelity. These issues highlight the difficulty of restoring highly ambiguous or damaged regions, which we aim to address in future work through enhanced multi-modal fusion. To enhance controllability for experts, our SSGR framework allows manual refinement of automatically generated semantic and sketch maps. For instance, experts can adjust semantic categories in the CLPPP-generated maps or modify depth-sensitive sketches to focus restoration on specific damages, such as creases or mold areas (Figure 1 and Figure 2). This flexibility ensures precise control over subtle damages, enabling targeted inpainting while preserving the original artistic style. Such interactivity makes the framework a valuable tool for restoration experts, complementing its automated capabilities.

5. Conclusions

This study introduces a novel framework for the digital restoration of damaged ancient paintings, addressing challenges such as creases, mold, and missing regions. By integrating semantic segmentation maps, depth-sensitive sketch maps, and attribute encodings within a diffusion model, our approach ensures precise inpainting while preserving the artistic style and fine-grained details unique to ancient paintings. The proposed Semantic-Sketch-Attribute-Normalization (SSAN) block enhances controllability, leveraging multi-layer normalization to fuse semantic, structural, and stylistic information. For future evaluation, we plan to develop a real-world dataset through collaborations with museums, collecting 500+ high-resolution scans of damaged ancient paintings with expert-annotated damage masks and semantic labels. This will facilitate testing SSGR on authentic artifacts, advancing its practical deployment in cultural heritage preservation.

Author Contributions

Methodology, validation and formal analysis, L.Z.; writing—review and editing, L.Z., G.D. and Y.C.; supervision, X.W.; project administration, Y.C.; funding acquisition, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (No.62377034).

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgments

The authors would like to thank all the reviewers for their insightful comments and constructive suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Yang, R.; Li, H.; Long, Y.; Wu, X.; He, S. Stroke2Sketch: Harnessing Stroke Attributes for Training-Free Sketch Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 19–23 October 2025; pp. 16545–16554.
2. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
3. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456.
4. Yang, R.; Yang, H.; Zhao, L.; Lei, Q.; Dong, M.; Ota, K.; Wu, X. One-Shot Reference-based Structure-Aware Image to Sketch Synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 9238–9246.
5. Li, D.; Hu, J.; Wang, C.; Li, X.; She, Q.; Zhu, L.; Zhang, T.; Chen, Q. Involution: Inverting the Inherence of Convolution for Visual Recognition. arXiv 2021, arXiv:2103.06255.
6. Ye, C.; Chen, W.; Hu, B.; Zhang, L.; Zhang, Y.; Mao, Z. Improving Video Summarization by Exploring the Coherence Between Corresponding Captions. IEEE Trans. Image Process. 2025, 34, 5369–5384.
7. Lao, Q.; Javadi, S.; Mirzaei, M.; Green, B.; Isola, P.; Fisher, M. HD-Painter: High-Resolution and Prompt-Faithful Text-Guided Image Inpainting with Diffusion Models. arXiv 2023, arXiv:2312.14091.
8. Sharma, N.; Uhlig, S.; Ommer, B.; Akata, Z.; Sharma, S. Sketch-guided Image Inpainting with Partial Discrete Diffusion Process. arXiv 2024, arXiv:2404.11949.
9. Liu, X.; Lin, Z.; Wang, X.; Li, S.; Wang, H.; Liu, Y.; Wang, Y.; Yang, S. S2Edit: Text-Guided Image Editing with Precise Semantic and Spatial Control. arXiv 2025, arXiv:2507.04584.
10. Chan, C.; Durand, F.; Isola, P. Learning to generate line drawings that convey geometry and semantics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 7915–7925.
11. Yang, R.; Wu, X.; He, S. MixSA: Training-Free Reference-Based Sketch Extraction via Mixture-of-Self-Attention. IEEE Trans. Vis. Comput. Graph. 2025, 31, 6208–6222.
12. Xue, A. End-to-end Chinese landscape painting creation using generative adversarial networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 9 January 2021; pp. 3863–3871.
13. Lyu, Q.; Zhao, N.; Yang, Y.; Gong, Y.; Gao, J. A diffusion probabilistic model for traditional Chinese landscape painting super-resolution. Herit. Sci. 2024, 12, 4.
14. Zhao, Y.; Li, H.; Zhang, Z.; Chen, Y.; Liu, Q.; Zhang, X. Regional Traditional Painting Generation Based on Controllable Disentanglement Model. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 6913–6925.
15. Wang, Z.; Zhang, J.; Ji, Z.; Bai, J.; Shan, S. CCLAP: Controllable Chinese Landscape Painting Generation Via Latent Diffusion Model. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 2117–2122.
16. Huang, W.; Zhang, R.; Li, X.; Wang, P. Digital Preservation and Analysis of Ancient Chinese Paintings Using Machine Learning. Pattern Recognit. 2025, 161, 111447.
17. Gao, Y.; Wang, C.; Zhang, J.; Li, H. Analysis and Generation of Traditional Chinese Painting Using Deep Learning. Comput. Vis. Media 2023, 9, 1–15.
18. Wang, L.; Zhang, S.; Liu, Y.; Chen, H. Digital Analysis of Traditional Chinese Painting Composition Based on Computer Vision. Appl. Soft Comput. 2024, 156, 111492.
19. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
20. Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65.
21. Liu, Y.; Qin, J.; Wang, S.; Wang, F. Generative Adversarial Networks for Chinese Calligraphy Synthesis: A Survey. Appl. Artif. Intell. 2021, 35, 1015–1034.
22. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
23. Papamakarios, G.; Nalisnick, E.; Rezende, D.J.; Mohamed, S.; Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. J. Mach. Learn. Res. 2021, 22, 2617–2680.
24. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8162–8171.
25. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 2256–2265.
26. Kingma, D.; Salimans, T.; Poole, B.; Ho, J. Variational diffusion models. Adv. Neural Inf. Process. Syst. 2021, 34, 21696–21707.
27. Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794.
28. Ho, J.; Saharia, C.; Chan, W.; Fleet, D.J.; Norouzi, M.; Salimans, T. Cascaded Diffusion Models for High Fidelity Image Generation. J. Mach. Learn. Res. 2022, 23, 1–33.
29. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125.
30. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695.
31. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 1125–1134.
32. Park, T.; Liu, M.Y.; Wang, T.C.; Zhu, J.Y. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2337–2346.
33. Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8798–8807.
34. Tang, H.; Xu, D.; Yan, Y.; Torr, P.H.; Sebe, N. Local class-specific and global image-level generative adversarial networks for semantic-guided scene generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 7870–7879.
35. Tang, H.; Qi, X.; Xu, D.; Torr, P.H.; Sebe, N. Edge guided GANs with semantic preserving for semantic image synthesis. arXiv 2020, arXiv:2003.13898.
36. Wang, Y.; Qi, L.; Chen, Y.C.; Zhang, X.; Jia, J. Image synthesis via semantic composition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 13749–13758.
37. Sushko, V.; Schönfeld, E.; Zhang, D.; Gall, J.; Schiele, B.; Khoreva, A. You only need adversarial supervision for semantic image synthesis. In Proceedings of the ICLR, Addis Ababa, Ethiopia, 30 April 2020.
38. Wang, Y.; Li, M.; Liu, J.; Leng, Z.; Li, F.W.; Zhang, Z.; Liang, X. Fg-T2M++: LLMs-augmented fine-grained text driven human motion generation. Int. J. Comput. Vis. 2025, 133, 4277–4293.
39. Zhu, P.; Abdal, R.; Qin, Y.; Wonka, P. Sean: Image synthesis with semantic region-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5104–5113.
40. Tan, Z.; Chu, Q.; Chai, M.; Chen, D.; Liao, J.; Liu, Q.; Liu, B.; Hua, G.; Yu, N. Semantic Probability Distribution Modeling for Diverse Semantic Image Synthesis. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6247–6264.
41. Wang, W.; Bao, J.; Zhou, W.; Chen, D.; Chen, D.; Yuan, L.; Li, H. Semantic image synthesis via diffusion models. arXiv 2022, arXiv:2207.00050.
42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
43. Yang, R.; Yang, H.; Zhao, M.; Jia, R.; Wu, X.; Zhang, Y. Special perceptual parsing for Chinese landscape painting scene understanding: A semantic segmentation approach. Neural Comput. Appl. 2024, 36, 5231–5249.
44. Shi, Y.; Otto, C.; Jain, A.K. Face clustering: Representation and pairwise constraints. IEEE Trans. Inf. Forensics Secur. 2018, 13, 1626–1640.
45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
46. Kong, F.; Pu, Y.; Lee, I.; Nie, R.; Zhao, Z.; Xu, D.; Qian, W.; Liang, H. Unpaired Artistic Portrait Style Transfer via Asymmetric Double-Stream GAN. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 5427–5439.
47. Huang, N.; Zhang, Y.; Tang, F.; Ma, C.; Huang, H.; Zhang, Y.; Dong, W.; Xu, C. DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 3370–3383.
48. Tan, Z.; Chen, D.; Chu, Q.; Chai, M.; Liao, J.; He, M.; Yuan, L.; Hua, G.; Yu, N. Efficient semantic image synthesis via class-adaptive normalization. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4852–4866.
49. Ho, J.; Salimans, T. Classifier-free diffusion guidance. arXiv 2022, arXiv:2207.12598.
Figure 1. Examples of various types of damage found in ancient paintings due to their historical age. (a) In “A Thousand Li of Rivers and Mountains,” damage is evident at locations ① and ②, where creases are visible, and at location ③, where texture and pigment loss have occurred. (b) In “Dwelling in the Fuchun Mountains,” similar damage can be observed at location ①, showing creases, while locations ② and ③ exhibit stains and discoloration on the paper. Such deterioration is common to many traditional landscape paintings, manifesting in different degrees.
Figure 2. Visualization of damaged areas in an ancient painting (a) after applying different sketch extraction algorithms. The vertical red boxes indicate crease damages, while the horizontal red boxes highlight regions with texture and pigment degradation. (b) Result using the sketchKeras-pured algorithm, showing detailed sketch detection. (c) Result using the Gaussian Blur algorithm with parameters (101,1),0, capturing broader sketches and finer details. (d) Result using the Gaussian Blur algorithm with parameters (101,1),2, providing a smoother sketch representation. Each method visualizes the damaged regions to varying degrees of clarity.
Figure 3. Visual examples of limitations in existing AI-based restoration methods on ancient paintings, showing failures such as inconsistent textures and color artifacts. (a) Restoration by SPADE; (b) Restoration by OASIS.
Figure 4. Overall architecture of the proposed Semantic and Sketch-Guided Restoration (SSGR) framework for ancient paintings restoration.
Figure 5. (a) Semantic Diffusion Encoder Resblock (SDE), (b) Semantic Decoder Block, and (c) Semantic-Sketch-Attribute-Normalization (SSAN) block.
Figure 6. Visualization of attributes clustering results on the ancient paintings dataset, showing distinct stylistic groups.
Figure 7. The framework for generating depth-sensitive sketch images from unpaired data of collected sketches and ancient paintings. We utilize CLIP semantic information and depth information to construct semantic loss and geometry loss, respectively. A cycle-consistency loss is employed as the primary guiding mechanism to train an adversarial neural network (G), ensuring the generated sketch images accurately correspond to the original ancient paintings while being sensitive to depth variations.
Figure 9. Manual Fine-Tuning Interaction Repair Example. (a) Original damaged painting, (b) semantic maps, (c) sketch maps, (d) SSGR (Ours). The yellow arrows indicate the damaged area, the area requiring detailed editing, and the repaired area, respectively. After damage to certain areas of the original image, they were marked as the “unknown” class.
Figure 10. Qualitative ablation results on ancient paintings. (a) Original painting, (b) semantic and sketch maps, (c) SSGR without sketch but with attributes, (d) SSGR with sketch and attributes.
Table 1. Quantitative comparison with state-of-the-art methods on semantic image synthesis. Higher values (↑) are better for PSNR, SSIM, mIoU; lower values (↓) are better for FID, KID.
Method | FID ↓ | mIoU (%) ↑ | KID ×10³ ↓ | SSIM ↑ | PSNR ↑
SPADE [32] | 101.01 | 43.25 | 46.19 | 0.34 | 14.89
CLADE [48] | 97.32 | 48.40 | 21.89 | 0.38 | 16.17
OASIS [37] | 120.39 | 49.56 | 14.37 | 0.38 | 14.38
SDM [41] | 150.57 | 52.67 | 3.09 | 0.37 | 11.70
SSGR (Ours) | 108.93 | 53.30 | 3.24 | 0.42 | 13.11
(Column groups in the original table: Fidelity = FID, mIoU; Diversity = KID; Quality = SSIM, PSNR.)
Table 2. Ablation study on CLP-2K dataset. Ed.: Sketch guidance, Att.: Attribute encoding. Higher values (↑) are better for IS, PSNR, mIoU; lower values (↓) are better for KID. Note: ✓ indicates the condition is used, and ✗ indicates the condition is not used.
Method | Ed. | Att. | IS ↑ | PSNR ↑ | mIoU (%) ↑ | KID ×10³ ↓
Baseline | ✗ | ✗ | 3.09 | 11.70 | 52.67 | 31.98
SSGR | ✓ | ✗ | 2.55 | 12.62 | 52.97 | 47.54
SSGR | ✗ | ✓ | 3.17 | 12.04 | 52.42 | 33.45
SSGR | ✓ | ✓ | 3.24 | 13.11 | 53.30 | 13.27
Table 3. Performance variation with dropout rates in classifier-free guidance on CLP-2K. Higher values (↑) are better for mIoU and PSNR; lower values (↓) are better for KID.
Dropout Rate (%) | mIoU (%) ↑ | PSNR ↑ | KID ×10³ ↓
10 | 52.1 | 12.5 | 15.6
20 | 52.8 | 12.9 | 8.4
30 | 53.3 | 13.1 | 3.2
40 | 52.5 | 12.7 | 7.9
50 | 51.9 | 12.3 | 16.2
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
