Skeleton-Guided Diffusion for Font Generation

Zhao, Li; Dong, Shan; Liu, Jiayi; Zhang, Xijin; Gao, Xiaojiao; Wu, Xiaojun

doi:10.3390/electronics14193932

by Li Zhao ^1,2, Shan Dong ^1,2, Jiayi Liu ^1,2, Xijin Zhang ^1, Xiaojiao Gao ^1,2 and Xiaojun Wu ^1,2,*

^1 School of Artificial Intelligence and Computer Science, Shaanxi Normal University, Xi’an 710119, China

^2 Key Laboratory of Intelligent Computing and Service Technology for Folk Song, Ministry of Culture and Tourism, Xi’an 710119, China

^* Author to whom correspondence should be addressed.

Electronics 2025, 14(19), 3932; https://doi.org/10.3390/electronics14193932

Submission received: 10 September 2025 / Revised: 24 September 2025 / Accepted: 28 September 2025 / Published: 3 October 2025

(This article belongs to the Special Issue Artificial Intelligence for Smart Image Perception, Recognition and Understanding)

Abstract

Generating non-standard fonts, such as running script (e.g., XingShu), poses significant challenges due to their high stroke continuity, structural flexibility, and stylistic diversity, which traditional component-based prior knowledge methods struggle to model effectively. While diffusion models excel at capturing continuous feature spaces and stroke variations through iterative denoising, they face critical limitations: (1) style leakage, where large stylistic differences lead to inconsistent outputs due to noise interference; (2) structural distortion, caused by the absence of explicit structural guidance, resulting in broken strokes or deformed glyphs; and (3) style confusion, where similar font styles are inadequately distinguished, producing ambiguous results. To address these issues, we propose a novel skeleton-guided diffusion model with three key innovations: (1) a skeleton-constrained style rendering module that enforces semantic alignment and balanced energy constraints to amplify critical skeletal features, mitigating style leakage and ensuring stylistic consistency; (2) a cross-scale skeleton preservation module that integrates multi-scale glyph skeleton information through cross-dimensional interactions, effectively modeling macro-level layouts and micro-level stroke details to prevent structural distortions; (3) a contrastive style refinement module that leverages skeleton decomposition and recombination strategies, coupled with contrastive learning on positive and negative samples, to establish robust style representations and disambiguate similar styles. Extensive experiments on diverse font datasets demonstrate that our approach significantly improves the generation quality, achieving superior style fidelity, structural integrity, and style differentiation compared to state-of-the-art diffusion-based font generation methods.

Keywords: skeleton guide; font generation; style confusion; structural distortion; contrastive learning; style fidelity

1. Introduction

As one of the oldest writing systems, Chinese characters encompass a vast repertoire and intricate structures, making font design a labor-intensive endeavor that demands significant expertise [1]. According to standards like GB2312, a complete Chinese typeface requires at least 6763 characters, with complexities arising from stroke variations and compositional rules. Recent breakthroughs in deep learning have revolutionized Chinese Font Generation (CFG), transforming it into an image-to-image translation task, where models learn mappings between source and target styles while preserving the content [2,3]. This has enabled applications in font library creation, image restoration, editing, and optical character recognition enhancement [4,5,6,7].

The evolution of CFG has progressed through stages: early computer graphics methods focused on contour descriptions and radical combinations, followed by machine learning approaches using stroke fusion and style learning [8]. Since 2017, deep learning has dominated, with initial convolutional networks like Rewrite [9] and Pix2Pix-based Zi2Zi [10] addressing printed fonts, though struggling with artifacts in complex glyphs. Subsequent advancements incorporated GANs for style transfer [11], VAEs for latent space modeling, RNNs for sequential stroke generation, Transformers for attention-based feature extraction [12,13], and most recently, diffusion models for high-fidelity synthesis [14,15,16,17].

Particularly for calligraphy styles like regular script (KaiShu), clerical script (LiShu), and running script (XingShu), diffusion models have emerged as a promising paradigm due to their ability to map noise to target distributions via forward diffusion and reverse denoising processes [18,19,20]. Models such as DDPM [21] and their variants [22] excel in capturing continuous stroke variations and stylistic nuances, as evidenced in recent CFG works. For instance, FontDiffuser [17] aggregates multi-scale content and employs contrastive learning for one-shot generation, while QT-Font [15] leverages quadtree-based diffusion for efficient synthesis. FontStudio [16] introduces shape-adaptive diffusion for coherent effects, and other efforts integrate font transfer processes into multi-stage diffusion [14]. These models achieve stable training, robust detail recovery, and minimal mode collapse, outperforming GANs in metrics like FID and SSIM for raster fonts [1].

However, generating non-standard fonts like XingShu remains challenging due to the high stroke continuity, structural flexibility, and stylistic diversity [1]. Component-based methods, relying on glyph decomposition and stroke assembly [23,24], perform well for structured fonts like KaiShu but fail to generalize to fluid continuous strokes in XingShu, where rigid priors cannot capture dynamic deformations. Diffusion models, while adept at implicit learning, suffer from the following: (1) style leakage, where noise interference disrupts consistency amid large stylistic variances [17]; (2) structural distortion, lacking explicit guidance and leading to broken strokes or deformed glyphs, especially in compound characters [1]; and (3) style confusion, inadequately distinguishing similar styles like semi-cursive variants, resulting in ambiguous outputs [15]. Beyond standard calligraphy styles such as XingShu, KaiShu, and LiShu, our approach is designed to generalize to a wide variety of non-standard fonts, including artistic and decorative typefaces with high structural and stylistic diversity, as validated in additional experiments (Section 4.3).

As illustrated in Figure 1, the essence of Chinese calligraphy can be regarded as stylistic variations imposed upon a stable character skeleton, where the skeleton preserves semantic readability, while the styles determine visual expressiveness. Our raster-based skeleton guidance, derived from pixel-level features using morphological thinning, contrasts with vector-based approaches like SVG or Bezier curve decompositions [23]. While vector methods offer scalability and editability, they often fail to capture the fluid continuous stroke variations of non-standard fonts like XingShu due to rigid parametric constraints. By integrating raster skeletons into the diffusion process, our approach achieves superior stylistic fidelity and structural coherence, complementing vector methods for calligraphy generation.

Recent attempts to mitigate these issues, such as style embeddings [25] or multi-scale fusion [26], rely on implicit features or coarse priors, which are insufficient for the intricate skeletal and stylistic properties of non-standard fonts. The survey by Wang et al. [1] highlights the need for innovative networks integrating multimodal inputs and cross-style generation to address these gaps. Unlike FontDiffuser’s implicit multi-scale aggregation [17] and DP-Font’s physical priors [27], our approach leverages hierarchical skeleton clustering, semantic alignment with energy constraints, and decomposition-based contrastive learning, offering superior control over stroke continuity and style disambiguation for non-standard fonts.

To overcome these limitations, we propose a novel skeleton-guided diffusion model for robust non-standard font generation. Our approach incorporates explicit glyph skeleton priors—representing structural hierarchies from strokes to components [1]—into the diffusion process, bridging structural rigidity and stylistic flexibility. It features three innovations: (1) a skeleton-constrained style rendering module enforcing semantic alignment and energy constraints to mitigate style leakage; (2) a cross-scale skeleton preservation module integrating multi-scale skeletons for macro-layouts and micro-stroke details, preventing distortions; and (3) a contrastive style refinement module using decomposition, recombination, and contrastive learning to disambiguate styles.

This method advances beyond existing diffusion-based CFG by providing fine-grained control, aligning with future directions in multimodal integration and innovative designs [1,28]. Our contributions are as follows: first, a skeleton-guided framework addressing style leakage, distortion, and confusion for superior non-standard font quality; second, modules leveraging multi-scale skeletons and contrastive learning, enhancing computer vision and generative modeling; third, extensive validation on calligraphy datasets, outperforming SOTA in style fidelity, integrity, and differentiation, with implications for text-to-image and artistic tasks.

2. Related Work

Recent advancements in Chinese calligraphy font generation have leveraged deep learning to address the complexities of stroke continuity, structural flexibility, and stylistic diversity, particularly in non-standard styles like running script (XingShu). These methods can be categorized into paired-data-based, unpaired-data-based, and universal-feature-based approaches. Paired-data-based methods rely on aligned source–target pairs for direct mapping, unpaired-data-based methods enable flexible training without explicit pairings, and universal-feature-based methods incorporate generalizable features such as skeletons or physical constraints to enhance robustness. Our approach aligns with universal-feature-based methods, emphasizing the clustering of skeletal attributes in radicals (pianpang bushou) to capture the writing features of a specific style, enabling high-fidelity generation of given text content while mitigating the style leakage, structural distortion, and style confusion in diffusion models.

2.1. Paired-Data-Based Methods

Paired-data-based methods, such as those inspired by Pix2Pix [3], require aligned source and target font images for training, facilitating direct style transfer through supervised learning. For instance, the involution-based model [29] enhances Pix2Pix by replacing convolutions with involution operators and adding self-attention modules, improving the stroke accuracy and continuity in calligraphic styles, including running script. Similarly, FontRNN [30] treats characters as sequences of writing trajectories, using RNNs with monotonic attention to generate large-scale fonts from paired samples, addressing stroke continuity via point categories but requiring extensive paired data for cursive variations. SCFont [31] employs deep stacked networks for structure-guided generation, transferring trajectories from reference to target styles in a paired manner to ensure structural integrity. However, these methods often struggle with non-standard fonts like XingShu due to their reliance on fixed pairings, limiting flexibility in capturing dynamic stroke deformations and stylistic diversity without large diverse paired datasets.

2.2. Unpaired-Data-Based Methods

Unpaired-data-based methods, typically built on CycleGAN [32], allow training on unaligned datasets, making them suitable for scenarios where paired calligraphy samples are scarce. Building on this framework, DG-Font [33] introduces deformable convolutions to dynamically model spatial transformations between fonts, significantly reducing structural distortions without requiring paired data. Another end-to-end model, built from dense blocks and capsule networks, extends CycleGAN with a perceptual loss to minimize broken or deformed strokes, demonstrating improved similarity in running script styles like Wang Xizhi’s. These approaches enhance generalization to unseen styles by learning cycle-consistent mappings, but they can suffer from mode collapse or inconsistent outputs in highly variable calligraphy, where stroke continuity and stylistic nuances are not explicitly enforced, leading to artifacts in fluid non-standard fonts.

2.3. Universal-Feature-Based Methods

Universal-feature-based methods integrate generalizable priors, such as structural or physical features, to guide generation beyond data-specific mappings, offering robustness for diverse calligraphy styles. For example, DP-Font [27] employs a diffusion model with physics-informed neural networks (PINNs) and stroke order encoding, incorporating universal physical constraints like ink diffusion equations to enhance realism and plausibility in personalized calligraphy, effectively handling stroke continuity and diversity in running script through latent attribute conditioning. Unlike component-based decompositions that rely on rigid stroke assemblies [23,24], our skeleton clustering approach hierarchically encodes strokes, components, and radicals, enabling the flexible modeling of fluid styles like XingShu. In contrast to DP-Font’s physical priors (e.g., ink diffusion simulations), our structural skeleton priors reduce the computational complexity while achieving superior style fidelity and structural integrity. Component-based approaches, as surveyed in [1], decompose glyphs into radicals and strokes for recombination, providing a universal framework for structural preservation. Our skeleton-guided diffusion model advances this paradigm by explicitly clustering skeletal attributes of radicals within a specific style, viewing calligraphy generation as the aggregation of writing features (e.g., stroke shapes and layouts) to produce consistent high-fidelity glyphs for any text content. This addresses limitations in prior diffusion-based methods, such as style leakage from noise interference, structural distortions without guidance, and confusion among similar styles, achieving superior fidelity, integrity, and differentiation in non-standard fonts like XingShu, as validated experimentally.

3. Method

3.1. Preliminaries

Our skeleton-guided diffusion model builds upon latent diffusion models (LDMs) [34], which encode images into a compact latent space using a Variational Autoencoder (VAE) comprising an encoder $E$ and decoder $D$. The denoising process occurs in this latent space via a U-Net-based network $ϵ_{θ}$. To address the challenges of non-standard font generation, such as style leakage, structural distortion, and style confusion in fonts like running script (XingShu), we integrate explicit glyph skeleton priors into the diffusion framework. These skeletons represent hierarchical structures from strokes to radicals, enabling the model to balance structural rigidity with stylistic flexibility.

The forward diffusion process gradually adds Gaussian noise to the latent representation $z_{0} = E(x_{0})$ of the target glyph image $x_{0}$ over $T$ timesteps:

$q(z_{t} \mid z_{t-1}) = \mathcal{N}(z_{t}; \sqrt{1 - β_{t}}\, z_{t-1}, β_{t} I),$

(1)

where $β_{t}$ is a variance schedule, and the cumulative noise addition is

$z_{t} = \sqrt{\bar{α}_{t}}\, z_{0} + \sqrt{1 - \bar{α}_{t}}\, ϵ, \quad ϵ \sim \mathcal{N}(0, I),$

(2)

where $α_{t} = 1 - β_{t}$ and $\bar{α}_{t} = \prod_{s=1}^{t} α_{s}$.
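The closed form in Equation (2) can be illustrated with a short standard-library sketch. The linear schedule, its endpoints (`beta_start`, `beta_end`), the value `T = 1000`, and the toy latent `z0` are all assumptions chosen for illustration; the paper does not state which variance schedule it uses.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (an assumed schedule, not the paper's)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cumulative_alpha_bar(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), as defined under Equation (2)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(z0, t, alpha_bars, rng):
    """Sample z_t from z_0 in one shot via Equation (2); t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e for z, e in zip(z0, eps)]
    return zt, eps

betas = linear_beta_schedule(T=1000)
alpha_bars = cumulative_alpha_bar(betas)
rng = random.Random(0)
z0 = [0.5, -1.0, 0.25, 2.0]  # toy stand-in for a flattened latent E(x_0)
zt, eps = q_sample(z0, t=500, alpha_bars=alpha_bars, rng=rng)
```

Because $\bar{α}_{t}$ shrinks monotonically, later timesteps retain less of $z_{0}$ and more of the injected noise.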

The reverse process approximates the denoising using $ϵ_{θ}(z_{t}, t, S, R)$, conditioned on skeleton priors $S$ and reference style $R$:

$p_{θ}(z_{t-1} \mid z_{t}) = \mathcal{N}(z_{t-1}; μ_{θ}(z_{t}, t, S, R), σ_{t}^{2} I),$

(3)

where $μ_{θ}(z_{t}, t, S, R) = \frac{1}{\sqrt{α_{t}}} \left( z_{t} - \frac{1 - α_{t}}{\sqrt{1 - \bar{α}_{t}}}\, ϵ_{θ}(z_{t}, t, S, R) \right)$ and $σ_{t}^{2} = β_{t}$.
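A single reverse step under Equation (3) might look like the sketch below. The noise prediction `eps_pred` stands in for the output of $ϵ_{θ}(z_{t}, t, S, R)$; no network is modeled here, and the function names are hypothetical.

```python
import math
import random

def reverse_mean(zt, eps_pred, alpha_t, alpha_bar_t):
    """Posterior mean mu_theta from Equation (3), given a noise prediction."""
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    return [(z - coef * e) / math.sqrt(alpha_t) for z, e in zip(zt, eps_pred)]

def reverse_step(zt, eps_pred, beta_t, alpha_bar_t, rng):
    """Draw z_{t-1} ~ N(mu_theta, beta_t * I), using sigma_t^2 = beta_t."""
    mu = reverse_mean(zt, eps_pred, 1.0 - beta_t, alpha_bar_t)
    sigma = math.sqrt(beta_t)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mu]

# Sanity check at t = 1, where alpha_bar_1 == alpha_1: if the predicted noise
# equals the true noise, the mean recovers z_0.
beta_1 = 1e-4
alpha_1 = 1.0 - beta_1
z0 = [0.5, -1.0]
eps = [0.3, -0.7]
z1 = [math.sqrt(alpha_1) * z + math.sqrt(1.0 - alpha_1) * e for z, e in zip(z0, eps)]
mu = reverse_mean(z1, eps, alpha_1, alpha_1)
z_prev = reverse_step(z1, eps, beta_1, alpha_1, random.Random(0))
```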

As depicted in Figure 2, the overall architecture incorporates content and reference inputs, featuring the Skeleton-Constrained Style Rendering (SCSR), Cross-Scale Skeleton Preservation (CSSP), and Contrastive Style Refinement (CSR) modules within a U-Net backbone with residual blocks and up/downsampling.

3.2. Skeleton-Constrained Style Rendering (SCSR)

To mitigate the style leakage caused by noise interference in diffusion models, the SCSR module enforces semantic alignment between the generated glyph and its skeleton prior while applying balanced energy constraints to amplify critical skeletal features. This ensures stylistic consistency across large variances in non-standard fonts.

As illustrated in Figure 3, the input latent $z_{t}$ is reshaped and projected before entering the U-Net. The skeleton prior $S$ is encoded into features $F_{s}(S)$ using a skeleton encoder. The self-attention within the module is

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T} + M}{\sqrt{d}} \right) V,$

(4)

where $Q$, $K$, and $V$ are derived from the intermediate U-Net features, $d$ is the dimension, and $M$ is a mask incorporating skeleton constraints for semantic focus.

Cross attention integrates skeletal features:

$Q = W_{q} \cdot ϕ(z_{t}), \quad K = W_{k} \cdot F_{s}(S), \quad V = W_{v} \cdot F_{s}(S),$

(5)

with $W_{q}$, $W_{k}$, and $W_{v}$ as projection matrices and $ϕ(\cdot)$ extracting features from U-Net layers.
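The masked attention of Equation (4) can be sketched on small nested lists as follows. This is an illustration only: the projections $W_{q}$, $W_{k}$, $W_{v}$ are taken to be identities, the mask is assumed to be additive with a large negative value marking suppressed positions, and the toy matrices are invented. For the cross-attention case of Equation (5), `k` and `v` would come from skeleton features rather than U-Net features.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def masked_attention(q, k, v, mask):
    """softmax((Q K^T + M) / sqrt(d)) V, following the form of Equation (4)."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [
        softmax([(s + m) / math.sqrt(d) for s, m in zip(s_row, m_row)])
        for s_row, m_row in zip(scores, mask)
    ]
    return matmul(weights, v), weights

q = [[1.0, 0.0], [0.0, 1.0]]              # two query tokens
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three key tokens
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mask = [[0.0, 0.0, -1e9], [0.0, 0.0, 0.0]]  # hide key 2 from query 0 (assumed convention)
out, weights = masked_attention(q, k, v, mask)
```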

Energy constraints balance the feature amplification:

$L_{\mathrm{render}} = \lVert ϕ(z_{t}) - ϕ(F_{s}(S)) \rVert_{2}^{2} + λ_{e} \sum_{i} \exp\left( - \frac{E_{i}}{τ_{e}} \right),$

(6)

where $E_{i}$ is the energy of skeletal feature $i$, $τ_{e}$ is a temperature parameter, and $λ_{e}$ weights the constraint. This loss minimizes the discrepancies while preventing over-amplification of non-critical features, ensuring consistent style rendering.
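As a rough numerical reading of Equation (6), the sketch below evaluates the two terms on flat feature vectors. The paper does not specify how the energies $E_{i}$ are computed or what values $λ_{e}$ and $τ_{e}$ take, so the energies are passed in directly and the default weights are placeholders.

```python
import math

def render_loss(latent_feat, skeleton_feat, energies, lambda_e=0.1, tau_e=1.0):
    """Squared L2 alignment term plus the exponential energy penalty of Equation (6).

    latent_feat and skeleton_feat play the roles of phi(z_t) and phi(F_s(S));
    lambda_e and tau_e defaults are illustrative, not the paper's values.
    """
    alignment = sum((a - b) ** 2 for a, b in zip(latent_feat, skeleton_feat))
    energy_penalty = sum(math.exp(-e / tau_e) for e in energies)
    return alignment + lambda_e * energy_penalty

# With perfectly aligned features, only the energy penalty remains.
low_energy = render_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], lambda_e=0.5, tau_e=1.0)
high_energy = render_loss([1.0, 2.0], [1.0, 2.0], [5.0, 5.0], lambda_e=0.5, tau_e=1.0)
```

The penalty term decays as the energies grow, so minimizing it pushes skeletal features toward higher energy, while the alignment term keeps the latent features close to the skeleton encoding.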

3.3. Cross-Scale Skeleton Preservation (CSSP)

Structural distortions in diffusion-based generation arise from a lack of explicit multi-scale guidance, leading to broken strokes or deformed glyphs. The CSSP module addresses this by integrating hierarchical skeleton information at three scales, namely macro-scale $S^{M}$ (radical layouts), meso-scale $S^{m}$ (component arrangements), and micro-scale $S^{μ}$ (stroke details), through cross-dimensional interactions.

These scales are encoded into features $F_{t}^{M}$, $F_{t}^{m}$, $F_{t}^{μ}$ and fused via

$F_{t} = G(F_{t}^{M}, F_{t}^{m}, F_{t}^{μ}) = W_{g} \cdot \mathrm{Concat}\left( \mathrm{Attn}^{M}(F_{t}^{M}), \mathrm{Attn}^{m}(F_{t}^{m}), \mathrm{Attn}^{μ}(F_{t}^{μ}) \right),$

(7)

where $W_{g}$ is a fusion weight, and $\mathrm{Attn}^{s}$ denotes the scale-specific attention:

$\mathrm{Attn}^{s}(Q^{s}, K^{s}, V^{s}) = \mathrm{softmax}\left( \frac{Q^{s} (K^{s})^{T}}{\sqrt{d_{s}}} + B^{s} \right) V^{s},$

(8)

with $Q^{s}$, $K^{s}$, and $V^{s}$ from the respective scale features, $d_{s}$ as the scale dimension, and $B^{s}$ as a bias incorporating cross-scale dependencies.

The preservation loss enforces integrity:

$L_{\mathrm{preserve}} = \sum_{s \in \{M, m, μ\}} \lVert ψ^{s}(z_{t}) - F_{s}(S^{s}) \rVert_{1} + γ_{p}\, \mathrm{KL}\left( p^{s}(z_{t}) \,\|\, q^{s}(S) \right),$

(9)

where $ψ^{s}(\cdot)$ extracts scale-specific features from the latent, $F_{s}(S^{s})$ is the encoded skeleton at scale $s$, KL is the Kullback–Leibler divergence between distributions $p^{s}$ and $q^{s}$, and $γ_{p}$ balances the terms. This multi-scale fusion prevents distortions by modeling both global layouts and local details effectively.
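The structure of Equation (9) can be sketched as an L1 term plus a KL term accumulated over the three scales. Two assumptions are made here and should be read as such: the KL term is taken to sit inside the sum over scales (since $p^{s}$ and $q^{s}$ depend on $s$), and the distributions are treated as discrete probability vectors. The scale names, toy values, and the default `gamma_p` are hypothetical.

```python
import math

SCALES = ("macro", "meso", "micro")  # stand-ins for M, m, and mu

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def preserve_loss(latent_feats, skeleton_feats, p_dists, q_dists, gamma_p=0.1):
    """Sum over scales of the L1 feature gap plus gamma_p * KL (assumed reading of Eq. 9)."""
    total = 0.0
    for s in SCALES:
        total += l1_distance(latent_feats[s], skeleton_feats[s])
        total += gamma_p * kl_divergence(p_dists[s], q_dists[s])
    return total

feats = {"macro": [0.2, 0.4], "meso": [0.1, 0.9], "micro": [0.5, 0.5]}
dists = {s: [0.5, 0.5] for s in SCALES}
matched = preserve_loss(feats, feats, dists, dists)

shifted_feats = dict(feats, micro=[0.0, 1.0])
shifted_dists = dict(dists, micro=[0.9, 0.1])
mismatched = preserve_loss(shifted_feats, feats, shifted_dists, dists)
```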

3.4. Contrastive Style Refinement (CSR)

Style confusion occurs when similar font styles are not adequately distinguished, leading to ambiguous outputs. The CSR module tackles this via skeleton decomposition into pairs $(S, R^{+})$ for positive (same style) and $(S, R^{-})$ for negative (different style) samples, followed by recombination and contrastive learning to build robust style representations.

Decomposition separates skeleton $S$ and style components $R$:

$(S, R) = D_{\mathrm{dec}}(x), \quad x' = R_{\mathrm{rec}}(S, R'),$

(10)

where $D_{\mathrm{dec}}$ decomposes the input glyph $x$, and $R_{\mathrm{rec}}$ recombines with potentially swapped styles $R'$.

Embeddings $h_{s}$, $h^{+}$, $h^{-}$ are obtained via a style projector:

$h = \mathrm{Proj}(\mathrm{Concat}(\mathrm{AvgPool}(f), \mathrm{MaxPool}(f))),$

(11)

where $f$ represents multi-layer features from a backbone network.
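Equation (11) can be illustrated on a tiny feature map. In this sketch the pooling is assumed to be global (one value per channel) and `Proj` is assumed to be a single linear map; the paper does not specify either choice, and the toy feature map and projection matrix are invented.

```python
def global_avg_pool(fmap):
    """Mean over the spatial grid of each channel; fmap is [channel][row][col]."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def global_max_pool(fmap):
    """Maximum over the spatial grid of each channel."""
    return [max(max(row) for row in ch) for ch in fmap]

def style_embedding(fmap, proj):
    """h = Proj(Concat(AvgPool(f), MaxPool(f))) with Proj as a linear map (assumption)."""
    pooled = global_avg_pool(fmap) + global_max_pool(fmap)  # list concatenation
    return [sum(w * p for w, p in zip(row, pooled)) for row in proj]

fmap = [
    [[1.0, 2.0], [3.0, 4.0]],      # channel 0
    [[0.0, -1.0], [-2.0, -3.0]],   # channel 1
]
# Hypothetical 2x4 projection that picks avg(channel 0) and max(channel 0).
proj = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
h = style_embedding(fmap, proj)
```

Concatenating average and maximum statistics gives the projector both the overall intensity of a feature channel and its strongest local response.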

The contrastive loss is

$L_{\mathrm{ctr}} = - \sum \log \frac{\exp(\mathrm{sim}(h_{s}, h^{+}) / τ)}{\exp(\mathrm{sim}(h_{s}, h^{+}) / τ) + \sum \exp(\mathrm{sim}(h_{s}, h^{-}) / τ)},$

(12)

with $\mathrm{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \lVert b \rVert}$ as the cosine similarity and $τ$ as a temperature (set to 0.07). Positive/negative samples are augmented with random transformations to enhance the robustness, disambiguating similar styles like semi-cursive variants.
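For a single anchor, Equation (12) reduces to the familiar InfoNCE form, sketched below with the stated temperature of 0.07. The embeddings are toy vectors; the outer sum over anchors in the batch is omitted for brevity.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(anchor, positive, negatives, tau=0.07):
    """Per-anchor term of Equation (12): -log(pos / (pos + sum of negatives))."""
    pos = math.exp(cosine_similarity(anchor, positive) / tau)
    neg = sum(math.exp(cosine_similarity(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

anchor = [1.0, 0.0]
same_style = [2.0, 0.0]    # parallel to the anchor, cosine similarity 1
other_style = [0.0, 3.0]   # orthogonal to the anchor, cosine similarity 0

well_separated = contrastive_loss(anchor, same_style, [other_style])
confused = contrastive_loss(anchor, other_style, [same_style])
```

With such a small temperature, a similarity gap of 1 already drives the loss close to zero when the positive is correctly ranked and to roughly $1/τ$ when the ranking is inverted.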

3.5. Attention Mechanisms in the Diffusion Process

To align strokes with skeletons during denoising, we employ spatial transformers and energy-augmented cross attention:

$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d}} + E \right) V,$

(13)

where $Q$ is from the current latent, $K$ and $V$ are from the reference skeletons, and $E$ is an energy map derived from skeletal features to prioritize important regions. This preserves the global layout and local fidelity, with energy transformations:

$E = σ(W_{e} \cdot F_{s}(S)),$

(14)

where $σ$ is a sigmoid, and $W_{e}$ represents the learned weights.
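Equations (13) and (14) can be read as ordinary cross attention with an additive, sigmoid-bounded bias. The sketch below assumes the energy map holds one scalar per skeleton (key) token that is shared by all queries; the paper does not state the shape of $E$, and `w_e` and the toy features are invented.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def energy_map(skeleton_feats, w_e):
    """E = sigmoid(W_e . F_s(S)), one scalar per skeleton token (assumed shape)."""
    return [sigmoid(sum(w * f for w, f in zip(w_e, tok))) for tok in skeleton_feats]

def energy_attention(q, k, v, energy):
    """softmax(Q K^T / sqrt(d) + E) V, with E added as a per-key bias."""
    d = len(q[0])
    outputs, all_weights = [], []
    for q_row in q:
        logits = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d) + e
            for k_row, e in zip(k, energy)
        ]
        w = softmax(logits)
        all_weights.append(w)
        outputs.append([sum(wi * v_row[j] for wi, v_row in zip(w, v)) for j in range(len(v[0]))])
    return outputs, all_weights

skeleton_feats = [[2.0, 0.0], [-2.0, 0.0]]  # two skeleton tokens
energy = energy_map(skeleton_feats, w_e=[1.0, 0.0])
# A zero query isolates the effect of the energy bias on the attention weights.
out, weights = energy_attention([[0.0, 0.0]], skeleton_feats, [[1.0], [0.0]], energy)
```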

3.6. Training Objective

Training optimizes a composite loss that integrates the mean squared error (MSE) diffusion loss with the rendering, preservation, and contrastive losses to ensure robust font generation:

$L_{\mathrm{total}} = L_{\mathrm{MSE}} + α L_{\mathrm{render}} + β L_{\mathrm{preserve}} + γ L_{\mathrm{ctr}},$

(15)

where $L_{\mathrm{MSE}} = \mathbb{E}_{z_{0}, ϵ, t} \left[ \lVert ϵ - ϵ_{θ}(z_{t}, t, S, R) \rVert_{2}^{2} \right]$ is the MSE diffusion loss, measuring the discrepancy between predicted and actual noise in the latent space, conditioned on skeleton priors $S$ and reference style $R$. The hyperparameters $α = 0.1$, $β = 0.2$, and $γ = 0.05$ (loss weights, distinct from the noise-schedule terms $α_{t}$ and $β_{t}$ in Section 3.1) balance the contributions of structural alignment, preservation, and style refinement, respectively, ensuring high-fidelity generation with preserved structural integrity and stylistic accuracy.
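The weighting in Equation (15) amounts to a simple linear combination, shown below with the stated values $α = 0.1$, $β = 0.2$, and $γ = 0.05$. The individual loss values in the example are made-up numbers used only to show the arithmetic.

```python
def noise_prediction_loss(eps_true, eps_pred):
    """Squared L2 distance between true and predicted noise for one sample."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred))

def total_loss(l_mse, l_render, l_preserve, l_ctr, alpha=0.1, beta=0.2, gamma=0.05):
    """Composite objective of Equation (15) with the paper's loss weights as defaults."""
    return l_mse + alpha * l_render + beta * l_preserve + gamma * l_ctr

l_mse = noise_prediction_loss([0.5, -0.5], [0.0, 0.0])  # 0.25 + 0.25 = 0.5
# Hypothetical auxiliary loss values, for illustration only.
combined = total_loss(l_mse, l_render=1.0, l_preserve=2.0, l_ctr=4.0)
# 0.5 + 0.1 * 1.0 + 0.2 * 2.0 + 0.05 * 4.0 = 1.2
```

The small weights keep the noise-prediction term dominant, with the three auxiliary losses acting as regularizers.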

4. Experiments

In this section, we describe the extensive experiments conducted to evaluate the effectiveness of our proposed skeleton-guided diffusion model, referred to as SGD-Font. We first describe the experimental setup, datasets, and evaluation metrics. We then present qualitative and quantitative comparisons with state-of-the-art (SOTA) methods on multiple calligraphy styles, followed by in-depth analyses of robustness, ablation studies, and case-specific evaluations. Our results consistently demonstrate that the incorporation of skeleton guidance and contrastive refinement leads to significant improvements in style fidelity, structural preservation, and style disambiguation.

4.1. Experimental Setup

We implement our skeleton-guided diffusion model using PyTorch 1.13 and conduct all experiments on an NVIDIA GeForce RTX 3090 GPU with 24 GB memory. The training corpus comprises ten categories: five standard printed fonts (SimHei, SimSun, KaiTi, FangSong, Microsoft YaHei) from GB2312 and five historical stele inscription styles (Yan Zhenqing’s, Liu Gongquan’s, and Ouyang Xun’s regular scripts, Wei Bei, and Han Dynasty clerical script) from public datasets like the Chinese Font Dataset (CFD) and GitHub calligraphy repositories. Each training font includes approximately 7000 characters, encompassing common radicals and glyph variants to promote generalization across diverse structural complexities. To ensure efficient training while maintaining high-quality results, the batch size is set to 8, with a maximum of 440,000 iterations. We employ the Adam optimizer with an initial learning rate of 1 × 10⁻⁴ and a dropout ratio of 0.1 to prevent overfitting. To limit memorization, we also apply data augmentation (rotations, Gaussian noise) and use an 80/20 train/validation split; the validation loss remains stable, indicating no overfitting. Unless otherwise specified, the number of diffusion sampling steps is fixed to 50, striking a balance between generation quality and computational efficiency. All models are trained until convergence, and we present both qualitative visualizations and quantitative metrics in the subsequent sections to demonstrate the superior performance of our approach.

4.2. Quantitative Evaluation Metrics

To objectively assess the visual quality of the generated calligraphy relative to the target fonts, we adopt standard metrics commonly used in prior research. Given that even subtle differences in generated font images can be highly perceptible to humans, our evaluation encompasses pixel-level, structural, and perceptual features for a comprehensive reflection of generation quality.

Pixel-level evaluation compares generated and target images at the pixel level to quantify basic similarity. Structural metrics account for luminance, contrast, and overall organization, while perceptual metrics align more closely with human visual judgment. The key indicators include the following:

(1): L1 Loss (Mean Absolute Error, MAE): The L1 loss measures the average absolute difference between corresponding pixels in the generated and target images, providing a straightforward assessment of pixel-wise accuracy. Because errors enter linearly, it is less sensitive to outliers than squared-error metrics. It is defined as

$L_{1}(x, y) = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} | x_{ij} - y_{ij} |,$

(16)

where $H$ and $W$ are the height and width of the images, and $x_{ij}$ and $y_{ij}$ are the pixel values at position $(i, j)$.

(2): Mean Squared Error (MSE): The MSE is one of the most commonly used metrics for quantifying image differences, emphasizing larger errors through squaring. Its formula is

$\mathrm{MSE}(x, y) = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} (x_{ij} - y_{ij})^{2}.$

(17)

(3): Structural Similarity Index Measure (SSIM): The SSIM evaluates similarity in terms of luminance, contrast, and structure, making it widely applicable in tasks like super-resolution. It is calculated as

$\mathrm{SSIM}(x, y) = \frac{(2 μ_{x} μ_{y} + c_{1}) (2 σ_{xy} + c_{2})}{(μ_{x}^{2} + μ_{y}^{2} + c_{1}) (σ_{x}^{2} + σ_{y}^{2} + c_{2})},$

(18)

where $μ_{x}$ and $μ_{y}$ are the mean intensities, $σ_{x}^{2}$ and $σ_{y}^{2}$ are the variances, $σ_{xy}$ is the covariance, and $c_{1}$ and $c_{2}$ are stabilization constants.

(4): Learned Perceptual Image Patch Similarity (LPIPS): LPIPS leverages deep network features to better align with human perception, computing weighted differences across feature layers:

$\mathrm{LPIPS}(x, y) = \sum_{l} \frac{1}{H_{l} W_{l}} \sum_{h, w} \lVert w_{l} ⊙ (f_{l}^{x}(h, w) - f_{l}^{y}(h, w)) \rVert_{2}^{2},$

(19)

where $l$ denotes the feature layers, $H_{l}$ and $W_{l}$ are the dimensions, $f_{l}^{x}$ and $f_{l}^{y}$ are the features from images $x$ and $y$, and $w_{l}$ are the learned channel weights.

These metrics collectively provide a robust evaluation framework, with lower values for L1, MSE, and LPIPS, and higher values for SSIM indicating superior performance.
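The first three metrics can be computed directly from Equations (16)–(18), as in the sketch below. Two simplifications are assumed: SSIM is evaluated once over the whole image rather than over local sliding windows as in common library implementations, and the constants follow the usual convention $c_{1} = (0.01 L)^{2}$, $c_{2} = (0.03 L)^{2}$ for dynamic range $L$, which the paper does not state. LPIPS is omitted because it requires a pretrained network.

```python
def flatten(img):
    return [p for row in img for p in row]

def l1_metric(x, y):
    """Mean absolute pixel difference, Equation (16)."""
    xs, ys = flatten(x), flatten(y)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

def mse_metric(x, y):
    """Mean squared pixel difference, Equation (17)."""
    xs, ys = flatten(x), flatten(y)
    return sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)

def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Equation (18) with statistics taken over the whole image (a simplification)."""
    xs, ys = flatten(x), flatten(y)
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((a - mu_x) ** 2 for a in xs) / n
    var_y = sum((b - mu_y) ** 2 for b in ys) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

target = [[0.0, 1.0], [1.0, 0.0]]     # toy 2x2 binary glyph
generated = [[0.0, 1.0], [1.0, 1.0]]  # one wrong pixel

l1_value = l1_metric(generated, target)
mse_value = mse_metric(generated, target)
ssim_value = global_ssim(generated, target)
```

For binary glyph images, L1 and MSE coincide because every pixel error has magnitude 0 or 1; the two metrics diverge only on grayscale outputs.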

4.3. Standard-to-Calligraphy Conversion

A primary application of SGD-Font is converting standard printed fonts to diverse calligraphy styles, where our model’s innovations shine in handling non-standard fonts with high stroke continuity and stylistic variability. We evaluate conversions to three representative styles: running script (XingShu), regular script (KaiShu), and clerical script (LiShu). For a fair comparison, all baseline models (e.g., Zi2Zi, DG-Font) were re-implemented and re-trained under identical conditions (including datasets, hardware, and batch size). Through qualitative and quantitative analyses, we demonstrate how SGD-Font outperforms state-of-the-art (SOTA) methods by mitigating style leakage, structural distortion, and style confusion, thanks to its skeleton-constrained style rendering, cross-scale skeleton preservation, and contrastive style refinement modules.

4.3.1. Running Script

Running script (XingShu) is particularly challenging due to its fluid, connected strokes and dynamic deformations, where traditional methods often fail to maintain continuity without introducing artifacts. As illustrated in Figure 4, which compares SGD-Font against SOTA models like Zi2Zi [10], DG-Font [33], LF-Font [35,36], MX-Font [37], and FontDiffuser [17], our method excels in preserving coherent stroke connectivity and capturing diverse stylistic nuances, resulting in more authentic and visually pleasing glyphs. For instance, competitors like FontDiffuser [17] exhibit style leakage, leading to inconsistent stroke thicknesses, while Zi2Zi [10] and others suffer from broken strokes in complex characters. In contrast, SGD-Font’s skeleton-guided approach ensures structural integrity amid stylistic flexibility, highlighting our innovation in integrating explicit skeletal priors to amplify critical features and enforce semantic alignment.

Further demonstrations in Figure 8 showcase conversions from standard printed fonts to XingShu, where the generated characters faithfully replicate the cursive flowing style while preserving readability. Quantitatively, Table 1 reports metrics on a diverse test set. Although FontDiffuser [17] achieves the lowest L1 (0.1156), which is sensitive to absolute pixel differences, SGD-Font surpasses all baselines in MSE (0.0708, a 24.2% improvement over FontDiffuser [17]), SSIM (0.7903, 16.1% higher), and LPIPS (0.1901, 18.0% lower), underscoring superior perceptual and structural fidelity. These results emphasize our model’s advantages in balanced energy constraints and multi-scale interactions, enabling robust generation for non-standard fonts.

4.3.2. Regular Script

Regular script (KaiShu) demands precise upright structures with subtle brush variations, testing a model’s ability to avoid distortions in balanced layouts. Figure 5 presents qualitative comparisons, revealing that SGD-Font generates characters with exceptional alignment to specified styles, outperforming SOTA methods that often produce uneven stroke weights or deformed radicals. Our cross-scale skeleton preservation module effectively models macro-level layouts and micro-level details, preventing the structural issues seen in baselines like MX-Font [37] and LF-Font [35].

Figure 9 further illustrates successful conversions, with generated glyphs retaining KaiShu’s characteristic balance and readability. Table 2 quantifies this superiority: SGD-Font achieves the lowest MSE (0.0602, 28.6% better than FontDiffuser [17]) and LPIPS (0.1906, 8.3% lower) and the highest SSIM (0.7946, 11.2% higher), despite a slightly higher L1 than FontDiffuser [17]. This performance highlights our innovation in contrastive learning for style disambiguation, ensuring consistent and high-fidelity outputs that advance beyond implicit feature reliance in prior diffusion models.

4.3.3. Clerical Script

Clerical script (LiShu) features flat wide strokes and significant deviations from printed fonts, amplifying the risks of style confusion and distortion. In Figure 6, comparisons show SGD-Font achieving precise stroke structures and superior style uniformity, unlike SOTA methods that yield irregular or semantically collapsed glyphs. Our skeleton decomposition and recombination strategies enable robust adaptation, producing characters with excellent regularity.

Additional results in Figure 10 confirm the semantic correctness amid large style shifts. Table 3 supports this with SGD-Font leading in MSE (0.0663, 29.3% improvement over FontDiffuser [17]), SSIM (0.7977, 18.9% higher), and LPIPS (0.1882, 17.5% lower), affirming our model’s excellence in enforcing explicit guidance and energy constraints for non-standard font generation.

4.3.4. Generalization to Diverse Font Styles

To further validate the generalization capability of SGD-Font beyond traditional calligraphy styles, we evaluated its performance on a diverse set of 15 highly stylized fonts, including artistic and decorative typefaces. As illustrated in Figure 7, the generated characters exhibit consistent structural integrity and style fidelity across a wide spectrum of font designs, ranging from fluid cursive to rigid geometric forms. These results demonstrate that our skeleton-guided diffusion framework effectively adapts to non-standard fonts with varying stroke continuity, layout complexity, and aesthetic expression. The model’s ability to maintain readability while capturing distinctive stylistic nuances underscores the robustness of the proposed SCSR, CSSP, and CSR modules in mitigating style leakage, structural distortion, and style confusion even in highly heterogeneous font domains.

4.4. Analysis of Style Robustness

A critical challenge in font-to-calligraphy conversion is bridging the stylistic gap between rigid printed fonts and fluid handwritten scripts. As evidenced in Figure 8, Figure 9 and Figure 10, SGD-Font robustly disentangles structure from style: skeletal backbones are preserved for readability, while attributes like curvature, texture, and layout adapt seamlessly to targets. This is particularly evident in LiShu conversions, where drastic morphological changes do not induce collapse, unlike in the baselines. Our innovations—skeleton-constrained rendering for consistency, cross-scale preservation for integrity, and contrastive refinement for differentiation—confer unparalleled robustness, enabling high-fidelity generation across diverse styles and underscoring SGD-Font’s advancements in generative modeling for calligraphy.

Figure 8. Running script generation results. Conversion from standard printed font to Xingshu using SGD-Font. The generated characters faithfully capture the cursive and dynamic style while preserving structural readability.

Figure 9. Regular script generation results. Conversion from standard printed font to Kaishu using SGD-Font. The generated characters retain the upright and balanced structures characteristic of regular script.

Figure 10. Clerical script generation results. Conversion from standard printed font to Lishu using SGD-Font. The generated results demonstrate robustness to large style shifts while maintaining semantic correctness.

4.5. Ablation Studies

To validate the contributions of our key modules, we conducted ablation studies on the three styles, focusing on skeleton-constrained style rendering (SCSR), cross-scale skeleton preservation (CSSP), and contrastive style refinement (CSR). Starting from a baseline diffusion model without these components, we incrementally added modules and evaluated the impacts.

4.5.1. Skeleton-Constrained Style Rendering (SCSR)

Removing SCSR leads to structural inconsistencies, as shown in Figure 11, where generated characters exhibit broken strokes and deformations, especially in XingShu and LiShu. With SCSR enabled, as in Figure 8 and Figure 10, coherence is restored through semantic alignment and energy constraints. Table 4, Table 5 and Table 6 quantify this: adding SCSR to the baseline reduces the MSE by up to about 18% (0.0739 to 0.0606 in Table 5; the reduction on running script in Table 4, 0.0843 to 0.0802, is 4.9%) and boosts the SSIM by up to 5.1% (0.7389 to 0.7766 in Table 4), highlighting its role in mitigating style leakage.

4.5.2. Contrastive Refinement Module

Next, we investigate the effect of the contrastive refinement module, which addresses style confusion. Without this module, the model sometimes struggles to distinguish between similar calligraphy styles, as shown in Figure 12 and Figure 13. The generated results without contrastive learning exhibit ambiguous style transitions, particularly between regular and clerical scripts. Incorporating contrastive refinement improves the style differentiation, as evidenced by the sharper boundaries between similar styles in Figure 9 and Figure 10.

4.5.3. Cross-Scale Skeleton Preservation (CSSP)

Ablating CSSP results in the loss of multi-scale details, causing macro-layout misalignments and micro-stroke distortions. Integrating CSSP yields significant gains on running script (Table 4): a 12.3% MSE reduction relative to the baseline (0.0843 to 0.0739) and a 3.2% SSIM increase relative to Baseline + SCSR (0.7766 to 0.8016), underscoring the value of cross-dimensional interactions for preventing distortions.

4.5.4. Contrastive Style Refinement (CSR)

Without CSR, style confusion arises, as depicted in Figure 12 and Figure 13, with ambiguous transitions between similar styles like KaiShu and LiShu. Adding CSR sharpens distinctions, as seen in Figure 9 and Figure 10. Full integration (SCSR + CSSP + CSR) achieves the best or tied-best MSE in all three ablation tables and, for example, a 10.2% LPIPS improvement over the baseline in Table 5 (0.2123 to 0.1906), supporting CSR’s use of contrastive learning for robust representations.
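The way positive and negative samples separate similar styles can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch and not the CSR objective itself: the toy embedding vectors, the cosine similarity, the temperature of 0.1, and the function names are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: small when the positive is the closest candidate."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]

# Hypothetical 2-D style embeddings: the anchor and positive share a style,
# while the negative stands for a similar but different style.
anchor, same_style, other_style = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
print(info_nce(anchor, same_style, [other_style]))   # well separated: low loss
print(info_nce(anchor, other_style, [same_style]))   # confused: high loss
```

Minimizing such a loss pulls embeddings of the same style together and pushes embeddings of different styles apart, which is the general mechanism by which contrastive learning reduces confusion between neighboring styles.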

Overall, as summarized in Table 4, Table 5 and Table 6, the complete SGD-Font outperforms the baseline on every metric, by averages of about 15.5% in MSE and 7.3% in SSIM when computed from the first and last rows of those tables. Individual modules do not improve every metric monotonically (for example, Baseline + CSSP attains the best SSIM on running and regular script), but the ablations support the complementary effect of the modules on style fidelity, structural integrity, and differentiation.
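The average gains over the baseline can be checked directly against the ablation tables. The sketch below assumes that the first row of each table is the baseline and the last row is the complete model, and it uses the rounded values as printed; the function name `mean_relative_gain` is hypothetical.

```python
# (MSE, SSIM) for the baseline row and the final row of Tables 4, 5, and 6.
ablation = {
    "running":  {"baseline": (0.0843, 0.7389), "full": (0.0708, 0.7903)},
    "regular":  {"baseline": (0.0739, 0.7437), "full": (0.0602, 0.7946)},
    "clerical": {"baseline": (0.0754, 0.7387), "full": (0.0663, 0.7977)},
}

def mean_relative_gain(metric_index: int, lower_is_better: bool) -> float:
    """Average percentage improvement of the full model over the baseline."""
    gains = []
    for rows in ablation.values():
        base = rows["baseline"][metric_index]
        full = rows["full"][metric_index]
        change = (base - full) / base if lower_is_better else (full - base) / base
        gains.append(change * 100.0)
    return sum(gains) / len(gains)

print(f"mean MSE reduction:  {mean_relative_gain(0, True):.1f}%")
print(f"mean SSIM increase:  {mean_relative_gain(1, False):.1f}%")
```

On the printed table values this gives roughly 15.5% for MSE and 7.3% for SSIM; figures derived from unrounded measurements or a different averaging convention can differ.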

As shown in Figure 4, Figure 5 and Figure 6, SGD-Font consistently outperforms competing methods, achieving the highest preference scores for its preservation of artistic nuances and structural fidelity in calligraphy generation.

4.6. Efficiency Analysis

After establishing the qualitative advantages of SGD-Font, we now analyze its computational efficiency for a comprehensive evaluation. The analysis is conducted under a consistent configuration: an NVIDIA RTX 3090 GPU, a batch size of 8, and 50 sampling steps. As shown in Table 7, under these settings, our method demonstrates highly comparable resource consumption levels in terms of inference speed, GPU utilization, and GPU memory usage compared to FontDiffuser [17], with slight advantages observed in certain metrics. This indicates that the improvement in generative quality stems from the efficiency of our algorithmic design rather than increased computational overhead, achieving a superior balance between quality and efficiency.
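The per-character and full-set timings in Table 7 are related by simple arithmetic, sketched below. The function name `total_hours` is hypothetical, and the calculation assumes characters are generated one after another at the listed per-character time.

```python
GB2312_CHARACTERS = 6763  # character count listed in Table 7

def total_hours(seconds_per_character: float,
                n_characters: int = GB2312_CHARACTERS) -> float:
    """Wall-clock hours to generate the whole set at a fixed per-character time."""
    return seconds_per_character * n_characters / 3600.0

ours = total_hours(1.5)          # SGD-Font, 50 sampling steps
fontdiffuser = total_hours(1.6)  # FontDiffuser, 50 sampling steps
print(f"{ours:.1f} h vs {fontdiffuser:.1f} h")  # prints "2.8 h vs 3.0 h"
```

The 0.1 s difference per character therefore accounts for the roughly 0.2 h difference over the full GB2312 set.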

4.7. User Study

To validate the practical utility and visual appeal of our method from an end-user perspective, we performed a comprehensive user study. Twenty participants (10 font designers and 10 non-experts) rated 100 generated samples on a 5-point scale regarding style fidelity, structural integrity, and overall aesthetics. Our method, SGD-Font, consistently outperformed the FontDiffuser baseline across all criteria. The complete findings are summarized in Table 8.

4.8. Limitations and Failure Cases

Failures include redundant strokes, broken or closed strokes, missing strokes (for example, due to incorrect skeleton extraction), and stylistic ambiguities between similar clerical script variants. Figure 14 illustrates these cases.

4.9. Discussion

The experiments demonstrate the effectiveness of SGD-Font in generating high-quality calligraphy styles while maintaining structural integrity. The ablation studies show that both the skeleton-guided rendering and contrastive refinement modules are crucial for achieving robust style generation. Furthermore, the user study confirms that our method produces visually appealing results that align with user preferences across various calligraphy styles. The skeleton-guided approach could extend to scripts like Arabic or Devanagari by adapting skeleton extraction (e.g., medial axis transforms for cursive or conjunct structures), though retraining on script-specific datasets is required, marking a future research direction.
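To make the idea of adapting skeleton extraction concrete, the sketch below thins a binary glyph to a one-pixel-wide skeleton with Zhang-Suen thinning. This is a classical stand-in chosen for brevity: it is neither a medial axis transform nor necessarily the extractor used in SGD-Font, and the function name and the toy input are assumptions.

```python
def zhang_suen_thinning(grid):
    """Thin a binary glyph (1 = ink, 0 = background) to a one-pixel-wide skeleton."""
    width = len(grid[0])
    # Pad with a background border so every ink pixel has eight neighbours.
    img = [[0] * (width + 2)]
    img += [[0] + list(row) + [0] for row in grid]
    img += [[0] * (width + 2)]
    rows, cols = len(img), len(img[0])

    def neighbours(r, c):
        # P2..P9, clockwise, starting from the pixel directly above.
        return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
                img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] != 1:
                        continue
                    p = neighbours(r, c)
                    ink = sum(p)  # number of ink neighbours
                    # Number of 0 -> 1 transitions around the pixel.
                    turns = sum(1 for i in range(8)
                                if p[i] == 0 and p[(i + 1) % 8] == 1)
                    p2, p4, p6, p8 = p[0], p[2], p[4], p[6]
                    if step == 0:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= ink <= 6 and turns == 1 and ok:
                        to_clear.append((r, c))
            # Deletions are applied only after the whole pass has been scanned.
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return [row[1:-1] for row in img[1:-1]]

# A thick horizontal stroke, three pixels tall and seven wide.
stroke = [[1] * 7 for _ in range(3)]
for row in zhang_suen_thinning(stroke):
    print("".join("#" if v else "." for v in row))
```

For cursive or conjunct scripts, the extractor would likely need to preserve junctions and ligatures that simple thinning can distort, which is one reason script-specific adaptation and retraining are expected.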

5. Conclusions

Our skeleton-guided diffusion model represents a significant leap forward in generating non-standard fonts, such as the intricate and fluid XingShu, by effectively tackling the persistent challenges of style leakage, structural distortion, and style confusion inherent in diffusion-based approaches. Through the innovative integration of three core modules—Skeleton-Constrained Style Rendering (SCSR), Cross-Scale Skeleton Preservation (CSSP), and Contrastive Style Refinement (CSR)—our method achieves unparalleled style fidelity, structural integrity, and style differentiation. The SCSR module enforces semantic alignment and balanced energy constraints, ensuring consistent stylistic rendering even under large style variances. The CSSP module leverages multi-scale skeletal priors to preserve both macro-level layouts and micro-level stroke details, preventing distortions in complex glyphs. Meanwhile, the CSR module employs contrastive learning to disambiguate similar styles, enhancing the robustness of style representations. By incorporating advanced attention mechanisms, including spatial transformers and energy-augmented cross-attention, our approach seamlessly balances stylistic expressiveness with structural coherence throughout the denoising process. Extensive experiments on diverse calligraphy datasets validate our model’s superiority over state-of-the-art methods. Owing to its practical efficiency and strong generalization, SGD-Font is suitable for integration into design software as a plug-in to facilitate custom font generation and cultural content creation. This work thereby bridges academic research and practical applications, demonstrating potential to redefine font generation for digital typography, cultural preservation, and artistic design.

Author Contributions

Methodology, validation, and formal analysis, L.Z.; writing—review and editing, L.Z. and S.D.; Model experimentation, data curation, and visualization, S.D.; data analysis and interpretation, experiments, J.L.; method implementation, data collection and data analysis, X.Z.; literature search, study design and writing, X.G.; supervision, X.W.; project administration, X.W.; funding acquisition, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 62377034).

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgments

The authors would like to thank all the reviewers for their insightful comments and constructive suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, L.; Liu, Y.; Sharum, M.Y.; Yaakob, R.; Kasmiran, K.A.; Wang, C. Deep learning for Chinese font generation: A survey. Expert Syst. Appl. 2025, 276, 127105.
2. Cheng, R.R.; Zhao, X.l.; Zhou, H.j. Chinese font style transfer research based on font features and multi-scale patch generative adversarial network. J. Yunnan Univ. Nat. Sci. Ed. 2023, 45, 1228–1237.
3. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5967–5976.
4. Assael, Y.; Sommerschield, T.; Shillingford, B.; Bordbar, M.; Pavlopoulos, J.; Chatzipanagiotou, M.; Androutsopoulos, I.; Prag, J.; de Freitas, N. Restoring and attributing ancient texts using deep neural networks. Nature 2022, 603, 280–283.
5. Li, H.; Du, C.; Jiang, Z.; Du, Q.; Ye, C. Towards automated Chinese ancient character restoration: A diffusion-based method with a new dataset. Proc. AAAI Conf. Artif. Intell. 2024, 38, 3073–3081.
6. Chen, X.; Jin, L.; Zhu, Y.; Luo, C.; Wang, T. Text recognition in the wild: A survey. arXiv 2020, arXiv:2005.03492.
7. Tian, Y.C.; Xu, S.H.; Sylla, C. A novel three-staged generative model for skeletonizing Chinese characters with versatile styles. J. Comput. Sci. Technol. 2023, 38, 1250–1271.
8. Liu, G.; Zhong, Y.; Chen, Y.; Cao, Y.; Zhao, Y. Stroke-Based Few-Shot Chinese Character Style Transfer. In Proceedings of the International Conference on Intelligent Computing, Tianjin, China, 5–8 August 2024.
9. Tian, Y. Rewrite: Neural Style Transfer for Chinese Characters. 2016. Available online: https://github.com/kaonashi-tyc/Rewrite (accessed on 1 August 2023).
10. Tian, Y. Zi2zi: Master Chinese Calligraphy with Conditional Adversarial Networks. 2017. Available online: https://kaonashi-tyc.github.io/2017/04/06/zi2zi.html (accessed on 1 August 2023).
11. Lyu, P.; Bai, X.; Yao, C.; Zhu, Z.; Huang, T.; Liu, W. Auto-Encoder guided GAN for Chinese calligraphy synthesis. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–12 November 2017; pp. 1095–1100.
12. Liu, Y.; Lian, Z. FontTransformer: Few-shot high-resolution Chinese glyph image synthesis via stacked transformers. Pattern Recognit. 2023, 141, 109593.
13. Wen, C.; Pan, Y.; Chang, J.; Zhang, Y.; Chen, S.; Wang, Y.; Han, M.; Tian, Q. Handwritten Chinese Font Generation with Collaborative Stroke Refinement. In Proceedings of the Workshop on Applications of Computer Vision, Virtual, 5–9 January 2021.
14. Fu, B.; Yu, F.; Liu, A.; Wang, Z.; Wen, J.; He, J.; Qiao, Y. Generate Like Experts: Multi-Stage Font Generation by Incorporating Font Transfer Process into Diffusion Models. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024.
15. Liu, Y.; Lian, Z. QT-Font: High-efficiency Font Synthesis via quadtree-based diffusion models. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers ’24, Denver, CO, USA, 27 July–1 August 2024; pp. 1–11.
16. Mu, X.; Chen, L.; Chen, B.; Gu, S.; Bao, J.; Chen, D.; Li, J.; Yuan, Y. FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation. In Proceedings of the European Conference on Computer Vision, Dublin, Ireland, 22–23 October 2025.
17. Yang, Z.; Peng, D.; Kong, Y.; Zhang, Y.; Yao, C.; Jin, L. FontDiffuser: One-shot font generation via denoising diffusion with multi-scale content aggregation and style contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024.
18. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8162–8171.
19. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-Based Generative Modeling through Stochastic Differential Equations. arXiv 2020, arXiv:2011.13456.
20. Guo, X.; Zhao, L.; Wu, X.; Yang, H.; Man, Y. A Chinese Font Generation Method based on Dynamic Convolution Improved Generative Adversarial Network. In Proceedings of the 2024 International Conference on Culture-Oriented Science & Technology (CoST), Beijing, China, 25–28 August 2024; pp. 12–17.
21. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
22. Yang, R.; Yang, H.; Zhao, L.; Lei, Q.; Dong, M.; Ota, K.; Wu, X. One-Shot Reference-based Structure-Aware Image to Sketch Synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 9238–9246.
23. Chen, J.; Huang, Y.; Lv, T.; Cui, L.; Chen, Q.; Wei, F. TextDiffuser: Diffusion Models as Text Painters. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023.
24. Zeng, J.; Chen, Q.; Liu, Y.; Wang, M.; Yao, Y. StrokeGAN: Reducing mode collapse in Chinese font generation via stroke encoding. Proc. AAAI Conf. Artif. Intell. 2021, 35, 3270–3277.
25. Wang, Y.; Xiong, K.; Yuan, Y.; Zeng, J. EdgeFont: Enhancing style and content representations in few-shot font generation with multi-scale edge self-supervision. Expert Syst. Appl. 2025, 262, 125547.
26. Zhang, K.; Zhang, R.; Wu, Y.; Li, Y.; Ling, Y.; Wang, B.; Sun, L.; Li, Y. Few-shot font style transfer with multiple style encoders. Sci. China Inf. Sci. 2022, 65, 160109.
27. Zhang, L.; Zhu, Y.; Benarab, A.; Ma, Y.; Dong, Y.; Sun, J. DP-font: Chinese calligraphy font generation using diffusion model and physical information neural network. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024, IJCAI ’24, Jeju, Republic of Korea, 3–9 August 2024.
28. Ye, C.; Chen, W.; Hu, B.; Zhang, L.; Zhang, Y.; Mao, Z. Improving Video Summarization by Exploring the Coherence Between Corresponding Captions. IEEE Trans. Image Process. 2025, 34, 5369–5384.
29. Li, D.; Hu, J.; Wang, C.; Li, X.; She, Q.; Zhu, L.; Zhang, T.; Chen, Q. Involution: Inverting the Inherence of Convolution for Visual Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
30. Tang, S.; Xia, Z.; Lian, Z.; Tang, Y.; Xiao, J. FontRNN: Generating Large-scale Chinese Fonts via Recurrent Neural Network. Comput. Graph. Forum. 2019, 38, 567–577.
31. Jiang, Y.; Lian, Z.; Tang, Y.; Xiao, J. Scfont: Structure-guided chinese font generation via deep stacked networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 4015–4022.
32. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
33. Xie, Y.; Chen, X.; Sun, L.; Lu, Y. DG-Font: Deformable Generative Networks for Unsupervised Font Generation. arXiv 2021, arXiv:2104.03064.
34. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695.
35. Park, S.; Chun, S.; Cha, J.; Lee, B.; Shim, H. Few-shot Font Generation with Localized Style Representations and Factorization. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021.
36. Park, S.; Chun, S.; Cha, J.; Lee, B.; Shim, H. Few-shot Font Generation with Weakly Supervised Localized Representations. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 46, 1479–1495.
37. Park, S.; Chun, S.; Cha, J.; Lee, B.; Shim, H. Multiple Heads are Better than One: Few-shot Font Generation with Multiple Localized Experts. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021.

Figure 1. Illustration of the skeleton–style decomposition in Chinese calligraphy. The character is represented by a structural skeleton (colored polylines) overlaid on the glyph, highlighting that the essence of calligraphy lies in the stylistic variations built upon a stable structural backbone.

Figure 2. The overall architecture of our skeleton-constrained diffusion model for style rendering. The model incorporates both content and reference inputs, and consists of key components such as the SCSR module, ResBlocks, and up/downsampling operations.

Figure 3. The overall architecture of the Skeleton-Constrained Style Rendering (SCSR) module.

Figure 4. Comparison of our method with state-of-the-art font generation models on running script. Our approach better preserves the coherent stroke connectivity distinctive to Running Script and captures diverse stylistic stroke characteristics, while generating more authentic and visually pleasing glyphs.

Figure 5. Comparison of our method with state-of-the-art font generation models on regular script. Our method enables the generation of characters in a specified calligraphy style, while aligning the text content with the style characteristics.

Figure 6. Comparison of our method with state-of-the-art font generation models on clerical script. Our approach achieves precise stroke structures and significantly improves style uniformity, producing characters with excellent regularity and stylistic consistency.

Figure 7. Generation results of SGD-Font on 15 diverse font styles. The model demonstrates robust performance across artistic and non-standard fonts, preserving structural coherence and stylistic authenticity.

Figure 11. Ablation study on skeleton-guided style rendering. Removal of skeleton guidance leads to structural inconsistency in the generated characters. With skeleton guidance, the structure is preserved while adapting to the target style.

Figure 12. Ablation study on contrastive refinement. Without contrastive learning, similar calligraphy styles such as Kaishu and Lishu are poorly differentiated. With contrastive refinement, the model produces distinct style outputs with clearer boundaries.

Figure 13. Ablation study on contrastive refinement. Without contrastive learning, similar calligraphy styles such as Kaishu and Lishu are poorly differentiated. With contrastive refinement, the model produces distinct style outputs with clearer boundaries.

Figure 14. Analysis of typical failure cases of SGD-Font. Although our model performs well in most cases, it still fails in some challenging scenarios, such as incorrect skeleton extraction and insufficient separation of similar style contrasts. These cases point out the direction for future improvements.

Table 1. Quantitative comparison with state-of-the-art methods on running script.

Method	L1 ↓	MSE ↓	SSIM ↑	LPIPS ↓
Zi2zi [10]	0.2419	0.1028	0.6761	0.2403
DG-Font [33]	0.2301	0.0969	0.7732	0.2177
LF-Font [35]	0.2338	0.0981	0.7658	0.2181
MX-Font [37]	0.2289	0.0965	0.7754	0.2066
FontDiffuser [17]	0.1156	0.0934	0.6802	0.2318
SGD-Font (Ours)	0.1763	0.0708	0.7903	0.1901

Table 2. Quantitative comparison with state-of-the-art methods on regular script.

Method	L1 ↓	MSE ↓	SSIM ↑	LPIPS ↓
Zi2zi [10]	0.2149	0.0890	0.7776	0.2108
DG-Font [33]	0.2041	0.0829	0.7809	0.1914
LF-Font [35]	0.2055	0.0832	0.7799	0.1887
MX-Font [37]	0.1999	0.0810	0.7845	0.1981
FontDiffuser [17]	0.1051	0.0843	0.7142	0.2079
SGD-Font (Ours)	0.1654	0.0602	0.7946	0.1906

Table 3. Quantitative comparison with state-of-the-art methods on clerical script.

Method	L1 ↓	MSE ↓	SSIM ↑	LPIPS ↓
Zi2zi [10]	0.2098	0.0864	0.7830	0.2095
DG-Font [33]	0.1957	0.0791	0.7883	0.1891
LF-Font [35]	0.2008	0.0812	0.7831	0.1975
MX-Font [37]	0.1947	0.0785	0.7890	0.1853
FontDiffuser [17]	0.1169	0.0938	0.6711	0.2281
SGD-Font (Ours)	0.1688	0.0663	0.7977	0.1882

Table 4. Ablation study on running script.

Method	L1 ↓	MSE ↓	SSIM ↑	LPIPS ↓
Baseline	0.2349	0.0843	0.7389	0.2346
Baseline + SCSR	0.2113	0.0802	0.7766	0.2177
Baseline + CSSP	0.1564	0.0739	0.8016	0.1926
Baseline + SCSR + CSSP	0.1763	0.0708	0.7903	0.1901

Table 5. Ablation study on regular script.

Method	L1 ↓	MSE ↓	SSIM ↑	LPIPS ↓
Baseline	0.2127	0.0739	0.7437	0.2123
Baseline + SCSR	0.1970	0.0606	0.7513	0.2070
Baseline + CSSP	0.1775	0.0602	0.8022	0.1976
Baseline + SCSR + CSSP	0.1654	0.0602	0.7946	0.1906

Table 6. Ablation study on clerical script.

Method	L1 ↓	MSE ↓	SSIM ↑	LPIPS ↓
Baseline	0.2099	0.0754	0.7387	0.2145
Baseline + SCSR	0.1968	0.0714	0.7501	0.1985
Baseline + CSSP	0.1743	0.0705	0.7885	0.1843
Baseline + SCSR + CSSP	0.1688	0.0663	0.7977	0.1882

Table 7. Comparison of training and inference efficiency.

Indicator	Ours	FontDiffuser [17]
Generation time per character (50 steps)	1.5 s	1.6 s
Total time for GB2312 set (6763 characters)	2.8 h	3.0 h
Training time (440k iterations)	74 h	75 h
Peak GPU utilization	45%	43%
Peak GPU memory usage rate	8.11%	8.14%

Table 8. Subjective evaluation results from the user study (5-point scale).

Indicator	Ours	FontDiffuser [17]
Style fidelity	4.3	4.0
Structural integrity	4.2	3.9
Overall aesthetics	4.1	3.9

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
