1. Introduction
As one of the oldest writing systems, Chinese characters encompass a vast repertoire and intricate structures, making font design a labor-intensive endeavor that demands significant expertise [1]. According to standards like GB2312, a complete Chinese typeface requires at least 6763 characters, with complexities arising from stroke variations and compositional rules. Recent breakthroughs in deep learning have revolutionized Chinese Font Generation (CFG), transforming it into an image-to-image translation task, where models learn mappings between source and target styles while preserving the content [2,3]. This has enabled applications in font library creation, image restoration, editing, and optical character recognition enhancement [4,5,6,7].
The evolution of CFG has progressed through stages: early computer graphics methods focused on contour descriptions and radical combinations, followed by machine learning approaches using stroke fusion and style learning [8]. Since 2017, deep learning has dominated, with initial convolutional networks like Rewrite [9] and Pix2Pix-based Zi2Zi [10] addressing printed fonts, though struggling with artifacts in complex glyphs. Subsequent advancements incorporated GANs for style transfer [11], VAEs for latent space modeling, RNNs for sequential stroke generation, Transformers for attention-based feature extraction [12,13], and most recently, diffusion models for high-fidelity synthesis [14,15,16,17].
Particularly for calligraphy styles like regular script (KaiShu), clerical script (LiShu), and running script (XingShu), diffusion models have emerged as a promising paradigm due to their ability to map noise to target distributions via forward diffusion and reverse denoising processes [18,19,20]. Models such as DDPM [21] and their variants [22] excel in capturing continuous stroke variations and stylistic nuances, as evidenced in recent CFG works. For instance, FontDiffuser [17] aggregates multi-scale content and employs contrastive learning for one-shot generation, while QT-Font [15] leverages quadtree-based diffusion for efficient synthesis. FontStudio [16] introduces shape-adaptive diffusion for coherent effects, and other efforts integrate font transfer processes into multi-stage diffusion [14]. These models achieve stable training, robust detail recovery, and minimal mode collapse, outperforming GANs in metrics like FID and SSIM for raster fonts [1].
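To make the forward diffusion process mentioned above concrete, the following minimal Python sketch draws a noised sample with the closed-form expression used by DDPM [21], x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The linear schedule from 1e-4 to 0.02 over 1000 steps is the default commonly associated with DDPM and is an assumption here, since the schedules of the cited font models are not stated in this section. All function names are illustrative.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed DDPM-style defaults)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t is the running product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) for a flat list of pixel values (t is 0-indexed)."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * p + math.sqrt(1.0 - a) * e for p, e in zip(x0, noise)]
    return xt, noise


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
x0 = [1.0, -1.0, 1.0, -1.0]  # toy "glyph" with ink = 1 and background = -1
xt, eps = forward_diffuse(x0, 999, alpha_bars, random.Random(0))
```

At the last step ᾱ_t is close to zero, so the sample is almost pure Gaussian noise; the reverse process is the learned network that undoes this corruption step by step.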
However, generating non-standard fonts like XingShu remains challenging due to the high stroke continuity, structural flexibility, and stylistic diversity [1]. Component-based methods, relying on glyph decomposition and stroke assembly [23,24], perform well for structured fonts like KaiShu but fail to generalize to fluid continuous strokes in XingShu, where rigid priors cannot capture dynamic deformations. Diffusion models, while adept at implicit learning, suffer from the following: (1) style leakage, where noise interference disrupts consistency amid large stylistic variances [17]; (2) structural distortion, lacking explicit guidance and leading to broken strokes or deformed glyphs, especially in compound characters [1]; and (3) style confusion, inadequately distinguishing similar styles like semi-cursive variants, resulting in ambiguous outputs [15]. Beyond standard calligraphy styles such as XingShu, KaiShu, and LiShu, our approach is designed to generalize to a wide variety of non-standard fonts, including artistic and decorative typefaces with high structural and stylistic diversity, as validated in additional experiments (Section 4.3).
As illustrated in Figure 1, the essence of Chinese calligraphy can be regarded as stylistic variations imposed upon a stable character skeleton, where the skeleton preserves semantic readability, while the styles determine visual expressiveness. Our raster-based skeleton guidance, derived from pixel-level features using morphological thinning, contrasts with vector-based approaches like SVG or Bézier curve decompositions [23]. While vector methods offer scalability and editability, they often fail to capture the fluid continuous stroke variations of non-standard fonts like XingShu due to rigid parametric constraints. By integrating raster skeletons into the diffusion process, our approach achieves superior stylistic fidelity and structural coherence, complementing vector methods for calligraphy generation.
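As a rough illustration of how a raster skeleton can be obtained by morphological thinning, the sketch below implements the classical Zhang-Suen algorithm in plain Python. The choice of Zhang-Suen is an assumption made for illustration; the text above only states that morphological thinning is used, and the function name and the toy stroke are hypothetical. The image is assumed to be binary (1 = ink) with a one-pixel background border.

```python
def zhang_suen_thinning(image):
    """Thin a binary image (list of lists, 1 = ink) to a one-pixel-wide skeleton."""
    img = [row[:] for row in image]  # work on a copy, leave the input untouched
    height, width = len(img), len(img[0])

    def ring(y, x):
        # Neighbours P2..P9, clockwise, starting from the pixel directly above.
        return [img[y - 1][x], img[y - 1][x + 1], img[y][x + 1], img[y + 1][x + 1],
                img[y + 1][x], img[y + 1][x - 1], img[y][x - 1], img[y - 1][x - 1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):  # the two sub-iterations of Zhang-Suen
            to_clear = []
            for y in range(1, height - 1):
                for x in range(1, width - 1):
                    if img[y][x] != 1:
                        continue
                    p = ring(y, x)
                    ink = sum(p)  # number of ink neighbours
                    # Number of 0 -> 1 transitions when walking around the ring.
                    trans = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))
                    p2, p4, p6, p8 = p[0], p[2], p[4], p[6]
                    if step == 0:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= ink <= 6 and trans == 1 and ok:
                        to_clear.append((y, x))
            # Pixels are deleted only after the whole pass, as the algorithm requires.
            for y, x in to_clear:
                img[y][x] = 0
            changed = changed or bool(to_clear)
    return img


# Toy example: a three-pixel-thick horizontal stroke on a 7 x 11 canvas.
stroke = [[0] * 11 for _ in range(7)]
for row in (2, 3, 4):
    for col in range(1, 10):
        stroke[row][col] = 1
skeleton = zhang_suen_thinning(stroke)
```

On this toy stroke the result is a single-pixel line along the middle row, which is the kind of stable structural backbone that the skeleton guidance relies on.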
Recent attempts to mitigate these issues, such as style embeddings [25] or multi-scale fusion [26], rely on implicit features or coarse priors, which are insufficient for the intricate skeletal and stylistic properties of non-standard fonts. The survey by Wang et al. [1] highlights the need for innovative networks integrating multimodal inputs and cross-style generation to address these gaps. Unlike FontDiffuser’s implicit multi-scale aggregation [17] and DP-Font’s physical priors [27], our approach leverages hierarchical skeleton clustering, semantic alignment with energy constraints, and decomposition-based contrastive learning, offering superior control over stroke continuity and style disambiguation for non-standard fonts.
To overcome these limitations, we propose a novel skeleton-guided diffusion model for robust non-standard font generation. Our approach incorporates explicit glyph skeleton priors, which represent structural hierarchies from strokes to components [1], into the diffusion process, bridging structural rigidity and stylistic flexibility. It features three innovations: (1) a skeleton-constrained style rendering module enforcing semantic alignment and energy constraints to mitigate style leakage; (2) a cross-scale skeleton preservation module integrating multi-scale skeletons for macro-layouts and micro-stroke details, preventing distortions; and (3) a contrastive style refinement module using decomposition, recombination, and contrastive learning to disambiguate styles.
This method advances beyond existing diffusion-based CFG by providing fine-grained control, aligning with future directions in multimodal integration and innovative designs [1,28]. Our contributions are as follows: first, a skeleton-guided framework that addresses style leakage, structural distortion, and style confusion to improve non-standard font quality; second, modules that leverage multi-scale skeletons and contrastive learning, with relevance to computer vision and generative modeling more broadly; third, extensive validation on calligraphy datasets, outperforming state-of-the-art (SOTA) methods in style fidelity, structural integrity, and style differentiation, with implications for text-to-image and artistic tasks.
4. Experiments
In this section, we describe the extensive experiments conducted to evaluate the effectiveness of our proposed SGD-Font. We first describe the experimental setup, datasets, and evaluation metrics. We then present qualitative and quantitative comparisons with state-of-the-art (SOTA) methods on multiple calligraphy styles, followed by in-depth analyses of robustness, ablation studies, and case-specific evaluations. Our results consistently demonstrate that the incorporation of skeleton guidance and contrastive refinement leads to significant improvements in style fidelity, structural preservation, and style disambiguation.
4.1. Experimental Setup
We implement our skeleton-guided diffusion model in PyTorch 1.13 and conduct all experiments on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. The training corpus comprises ten categories, including standard printed fonts (SimHei, SimSun, KaiTi, FangSong, Microsoft YaHei) from GB2312 and historical stele inscriptions (Yan Zhenqing’s, Liu Gongquan’s, and Ouyang Xun’s regular scripts, Wei Bei, and Han Dynasty clerical script) from public datasets such as the Chinese Font Dataset (CFD) and GitHub calligraphy repositories. Each training font includes approximately 7000 characters, encompassing common radicals and glyph variants to promote generalization across diverse structural complexities. The batch size is set to 8, with a maximum of 440,000 iterations. We employ the Adam optimizer with an initial learning rate of $1 \times 10^{-4}$. To prevent overfitting and memorization, we apply a dropout ratio of 0.1, data augmentation (rotations, Gaussian noise), and an 80/20 train/validation split; the validation loss remained stable, indicating no overfitting. Unless otherwise specified, the number of diffusion sampling steps is fixed to 50, striking a balance between generation quality and computational efficiency. All models are trained until convergence, and we present both qualitative visualizations and quantitative metrics in the subsequent sections.
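For reference, the hyperparameters stated above can be collected in a small configuration object. This is only a sketch: the class, field, and function names are hypothetical, and splitting at the character level with a fixed seed is an assumption, since the text does not say how the 80/20 split is drawn.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values taken from the experimental setup; the field names are illustrative."""
    batch_size: int = 8
    max_iterations: int = 440_000
    learning_rate: float = 1e-4   # initial learning rate for Adam
    dropout: float = 0.1
    sampling_steps: int = 50      # diffusion sampling steps at inference
    train_fraction: float = 0.8   # 80/20 train/validation split


def split_characters(chars, train_fraction, seed=0):
    """Shuffle deterministically, then cut into train and validation subsets."""
    items = list(chars)
    random.Random(seed).shuffle(items)
    cut = int(round(len(items) * train_fraction))
    return items[:cut], items[cut:]


config = TrainConfig()
# Roughly 7000 characters per font, represented here by integer indices.
train_ids, val_ids = split_characters(range(7000), config.train_fraction)
```

With about 7000 characters per font, this split leaves 5600 characters for training and 1400 for validation.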
4.2. Quantitative Evaluation Metrics
To objectively assess the visual quality of the generated calligraphy relative to the target fonts, we adopt standard metrics commonly used in prior research. Given that even subtle differences in generated font images can be highly perceptible to humans, our evaluation encompasses pixel-level, structural, and perceptual features for a comprehensive reflection of generation quality.
Pixel-level evaluation compares generated and target images at the pixel level to quantify basic similarity. Structural metrics account for luminance, contrast, and overall organization, while perceptual metrics align more closely with human visual judgment. The key indicators include the following:
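- (1) L1 Loss (Mean Absolute Error, MAE): The L1 loss measures the average absolute difference between corresponding pixels in the generated and target images, providing a straightforward assessment of pixel-wise accuracy. Because errors enter linearly, it is less sensitive to outliers than squared-error metrics. It is defined as
$$\mathrm{L1}(x, y) = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left| x_{i,j} - y_{i,j} \right|,$$
where $H$ and $W$ are the height and width of the images, and $x_{i,j}$ and $y_{i,j}$ are the pixel values of the generated and target images at position $(i, j)$.
- (2) Mean Squared Error (MSE): The MSE is one of the most commonly used metrics for quantifying image differences, emphasizing larger errors through squaring. Its formula is
$$\mathrm{MSE}(x, y) = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left( x_{i,j} - y_{i,j} \right)^{2}.$$
- (3) Structural Similarity Index Measure (SSIM): The SSIM evaluates similarity in terms of luminance, contrast, and structure, making it widely applicable in tasks like super-resolution. It is calculated as
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^{2} + \mu_y^{2} + C_1)(\sigma_x^{2} + \sigma_y^{2} + C_2)},$$
where $\mu_x$ and $\mu_y$ are the mean intensities, $\sigma_x^{2}$ and $\sigma_y^{2}$ are the variances, $\sigma_{xy}$ is the covariance, and $C_1$ and $C_2$ are stabilization constants.
- (4) Learned Perceptual Image Patch Similarity (LPIPS): LPIPS leverages deep network features to better align with human perception, computing weighted differences across feature layers:
$$\mathrm{LPIPS}(x, y) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{f}^{\,l}_{x,hw} - \hat{f}^{\,l}_{y,hw} \right) \right\|_2^{2},$$
where $l$ denotes the feature layers, $H_l$ and $W_l$ are the dimensions, $\hat{f}^{\,l}_{x}$ and $\hat{f}^{\,l}_{y}$ are the features from images $x$ and $y$, and $w_l$ are the learned channel weights.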
These metrics collectively provide a robust evaluation framework, with lower values for L1, MSE, and LPIPS, and higher values for SSIM indicating superior performance.
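The pixel-level and structural metrics can be sketched in a few lines of plain Python. This is a minimal illustration, not the evaluation code behind the reported numbers: images are assumed to be nested lists with values in [0, 1], the SSIM here is computed once over the whole image rather than with the sliding window that standard implementations use, and the constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2 are the conventional defaults. LPIPS is omitted because it requires a pretrained feature network.

```python
def _flatten(image):
    """Turn a 2D list of pixel values into a flat list of floats."""
    return [float(v) for row in image for v in row]


def l1_loss(x, y):
    """Mean absolute pixel difference."""
    a, b = _flatten(x), _flatten(y)
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def mse(x, y):
    """Mean squared pixel difference."""
    a, b = _flatten(x), _flatten(y)
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM over the whole image (a simplification)."""
    a, b = _flatten(x), _flatten(y)
    n = len(a)
    mu_x, mu_y = sum(a) / n, sum(b) / n
    var_x = sum((p - mu_x) ** 2 for p in a) / n
    var_y = sum((q - mu_y) ** 2 for q in b) / n
    cov = sum((p - mu_x) * (q - mu_y) for p, q in zip(a, b)) / n
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


target = [[0.0, 1.0], [1.0, 0.0]]
generated = [[0.0, 1.0], [1.0, 1.0]]  # one wrong pixel out of four
scores = (l1_loss(generated, target), mse(generated, target),
          global_ssim(generated, target))
```

For binary glyph images L1 and MSE coincide (both are 0.25 in this toy case), which is one reason the structural and perceptual metrics carry more of the comparison.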
4.3. Standard-to-Calligraphy Conversion
A primary application of SGD-Font is converting standard printed fonts to diverse calligraphy styles, where our model’s innovations shine in handling non-standard fonts with high stroke continuity and stylistic variability. We evaluate conversions to three representative styles: running script (XingShu), regular script (KaiShu), and clerical script (LiShu). For a fair comparison, all baseline models (e.g., Zi2Zi, DG-Font) were re-implemented and re-trained under identical conditions (including datasets, hardware, and batch size). Through qualitative and quantitative analyses, we demonstrate how SGD-Font outperforms state-of-the-art (SOTA) methods by mitigating style leakage, structural distortion, and style confusion, thanks to its skeleton-constrained style rendering, cross-scale skeleton preservation, and contrastive style refinement modules.
4.3.1. Running Script
Running script (XingShu) is particularly challenging due to its fluid connected strokes and dynamic deformations, where traditional methods often fail to maintain continuity without introducing artifacts. As illustrated in Figure 4, which compares SGD-Font against SOTA models like Zi2Zi [10], DG-Font [33], LF-Font [35,36], MX-Font [37], and FontDiffuser [17], our method excels in preserving coherent stroke connectivity and capturing diverse stylistic nuances, resulting in more authentic and visually pleasing glyphs. For instance, competitors like FontDiffuser [17] exhibit style leakage, leading to inconsistent stroke thicknesses, while Zi2Zi [10] and others suffer from broken strokes in complex characters. In contrast, SGD-Font’s skeleton-guided approach ensures structural integrity amid stylistic flexibility, highlighting our innovation in integrating explicit skeletal priors to amplify critical features and enforce semantic alignment.
Further demonstrations in Figure 8 showcase conversions from standard printed fonts to XingShu, where the generated characters faithfully replicate the cursive flowing style while preserving readability. Quantitatively, Table 1 reports metrics on a diverse test set. Although FontDiffuser [17] achieves the lowest L1 (0.1156), which is sensitive to absolute pixel differences, SGD-Font surpasses all baselines in MSE (0.0708, a 24.2% improvement over FontDiffuser [17]), SSIM (0.7903, 16.1% higher), and LPIPS (0.1901, 18.0% lower), underscoring superior perceptual and structural fidelity. These results emphasize our model’s advantages in balanced energy constraints and multi-scale interactions, enabling robust generation for non-standard fonts.
4.3.2. Regular Script
Regular script (KaiShu) demands precise upright structures with subtle brush variations, testing a model’s ability to avoid distortions in balanced layouts. Figure 5 presents qualitative comparisons, revealing that SGD-Font generates characters with exceptional alignment to specified styles, outperforming SOTA methods that often produce uneven stroke weights or deformed radicals. Our cross-scale skeleton preservation module effectively models macro-level layouts and micro-level details, preventing the structural issues seen in baselines like MX-Font [37] and LF-Font [35].
Figure 9 further illustrates successful conversions, with generated glyphs retaining KaiShu’s characteristic balance and readability. Table 2 quantifies this superiority: SGD-Font achieves the lowest MSE (0.0602, 28.6% better than FontDiffuser [17]) and LPIPS (0.1906, 8.3% lower) and the highest SSIM (0.7946, 11.2% higher), despite a slightly higher L1 than FontDiffuser [17]. This performance highlights our innovation in contrastive learning for style disambiguation, ensuring consistent and high-fidelity outputs that advance beyond implicit feature reliance in prior diffusion models.
4.3.3. Clerical Script
Clerical script (LiShu) features flat wide strokes and significant deviations from printed fonts, amplifying the risks of style confusion and distortion. In Figure 6, comparisons show SGD-Font achieving precise stroke structures and superior style uniformity, unlike SOTA methods that yield irregular or semantically collapsed glyphs. Our skeleton decomposition and recombination strategies enable robust adaptation, producing characters with excellent regularity.
Additional results in Figure 10 confirm the semantic correctness amid large style shifts. Table 3 supports this with SGD-Font leading in MSE (0.0663, 29.3% improvement over FontDiffuser [17]), SSIM (0.7977, 18.9% higher), and LPIPS (0.1882, 17.5% lower), affirming our model’s excellence in enforcing explicit guidance and energy constraints for non-standard font generation.
4.3.4. Generalization to Diverse Font Styles
To further validate the generalization capability of SGD-Font beyond traditional calligraphy styles, we evaluated its performance on a diverse set of 15 highly stylized fonts, including artistic and decorative typefaces. As illustrated in Figure 7, the generated characters exhibit consistent structural integrity and style fidelity across a wide spectrum of font designs, ranging from fluid cursive to rigid geometric forms. These results demonstrate that our skeleton-guided diffusion framework effectively adapts to non-standard fonts with varying stroke continuity, layout complexity, and aesthetic expression. The model’s ability to maintain readability while capturing distinctive stylistic nuances underscores the robustness of the proposed SCSR, CSSP, and CSR modules in mitigating style leakage, structural distortion, and style confusion even in highly heterogeneous font domains.
4.4. Analysis of Style Robustness
A critical challenge in font-to-calligraphy conversion is bridging the stylistic gap between rigid printed fonts and fluid handwritten scripts. As evidenced in Figure 8, Figure 9 and Figure 10, SGD-Font robustly disentangles structure from style: skeletal backbones are preserved for readability, while attributes like curvature, texture, and layout adapt to the targets. This is particularly evident in LiShu conversions, where drastic morphological changes do not induce collapse, unlike in the baselines. Our three modules (skeleton-constrained rendering for consistency, cross-scale preservation for integrity, and contrastive refinement for differentiation) together confer this robustness, enabling high-fidelity generation across diverse styles.
Figure 8. Running script generation results. Conversion from standard printed font to XingShu using SGD-Font. The generated characters faithfully capture the cursive and dynamic style while preserving structural readability.
Figure 9. Regular script generation results. Conversion from standard printed font to KaiShu using SGD-Font. The generated characters retain the upright and balanced structures characteristic of regular script.
Figure 10. Clerical script generation results. Conversion from standard printed font to LiShu using SGD-Font. The generated results demonstrate robustness to large style shifts while maintaining semantic correctness.
4.5. Ablation Studies
To validate the contributions of our key modules, we conducted ablation studies on the three styles, focusing on skeleton-constrained style rendering (SCSR), cross-scale skeleton preservation (CSSP), and contrastive style refinement (CSR). Starting from a baseline diffusion model without these components, we incrementally added modules and evaluated the impacts.
4.5.1. Skeleton-Constrained Style Rendering (SCSR)
Removing SCSR leads to structural inconsistencies, as shown in Figure 11, where generated characters exhibit broken strokes and deformations, especially in XingShu and LiShu. With SCSR enabled, as in Figure 8 and Figure 10, coherence is restored through semantic alignment and energy constraints. Table 4, Table 5 and Table 6 quantify this: adding SCSR to the baseline reduces the MSE by up to 18.1% (e.g., 0.0843 to 0.0802 in Table 4) and boosts the SSIM by 5.1%, highlighting its role in mitigating style leakage.
4.5.2. Contrastive Refinement Module
Next, we investigate the effect of the contrastive refinement module, which addresses style confusion. Without this module, the model sometimes struggles to distinguish between similar calligraphy styles, as shown in Figure 12 and Figure 13. The generated results without contrastive learning exhibit ambiguous style transitions, particularly between regular and clerical scripts. Incorporating contrastive refinement improves the style differentiation, as evidenced by the sharper boundaries between similar styles in Figure 9 and Figure 10.
4.5.3. Cross-Scale Skeleton Preservation (CSSP)
Ablating CSSP results in the loss of multi-scale details, causing macro-layout misalignments and micro-stroke distortions. Adding CSSP on top of SCSR lowers the MSE in Table 4 from 0.0802 to 0.0739, a cumulative 12.3% reduction relative to the 0.0843 baseline, together with a 3.2% SSIM increase, emphasizing the role of its cross-dimensional interactions in preventing distortions.
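The MSE values quoted from Table 4 in Sections 4.5.1 and 4.5.3 can be turned into relative reductions with a one-line helper, which makes explicit which reference value each percentage is measured against. The function and variable names below are illustrative; only the three MSE values come from the text.

```python
def relative_reduction(before, after):
    """Fractional decrease of a lower-is-better metric."""
    return (before - after) / before


# MSE values quoted from Table 4 in the text.
baseline, with_scsr, with_scsr_cssp = 0.0843, 0.0802, 0.0739

scsr_gain = relative_reduction(baseline, with_scsr)          # about 4.9%
cssp_step = relative_reduction(with_scsr, with_scsr_cssp)    # about 7.9%
cumulative = relative_reduction(baseline, with_scsr_cssp)    # about 12.3%
```

The 12.3% figure therefore corresponds to the cumulative reduction measured from the 0.0843 baseline, while the step from 0.0802 to 0.0739 alone is roughly 7.9%.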
4.5.4. Contrastive Style Refinement (CSR)
Without CSR, style confusion arises, as depicted in Figure 12 and Figure 13, with ambiguous transitions between similar styles like KaiShu and LiShu. Adding CSR sharpens distinctions, as seen in Figure 9 and Figure 10. Full integration (SCSR + CSSP + CSR) achieves optimal metrics, e.g., 15.9% LPIPS improvement over baseline in Table 5, validating CSR’s contrastive learning for robust representations.
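To illustrate why a contrastive objective helps separate similar styles, the sketch below computes an InfoNCE-style loss on toy style embeddings. InfoNCE with cosine similarity and a temperature of 0.07 is an assumption chosen for illustration; the exact loss used by the CSR module is not given in this section, and the embeddings are made-up two-dimensional vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.07):
    """Negative log-probability of picking the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[0]


anchor = [1.0, 0.0]        # hypothetical embedding of a KaiShu sample
same_style = [0.9, 0.1]    # another sample of the same style
other_style = [0.0, 1.0]   # a clearly different style

separated = info_nce(anchor, same_style, [other_style])
confused = info_nce(anchor, same_style, [same_style])  # negative looks like the positive
```

When the negative is indistinguishable from the positive, the loss equals log 2 for a single negative, and it approaches zero as the styles separate, so minimizing it pushes similar styles such as KaiShu and LiShu apart in embedding space.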
Overall, as summarized in Table 4, Table 5 and Table 6, each module progressively enhances the performance, with the complete SGD-Font outperforming the baseline by averages of 22.4% in MSE and 7.4% in SSIM. These ablations affirm our innovations’ synergistic effects in delivering superior style fidelity, structural integrity, and differentiation.
As shown in Figure 4, Figure 5 and Figure 6, SGD-Font consistently outperforms competing methods, achieving the highest preference scores for its preservation of artistic nuances and structural fidelity in calligraphy generation.
4.6. Efficiency Analysis
After establishing the qualitative advantages of SGD-Font, we now analyze its computational efficiency for a comprehensive evaluation. The analysis is conducted under a consistent configuration: an NVIDIA RTX 3090 GPU, a batch size of 8, and 50 sampling steps. As shown in Table 7, under these settings, our method demonstrates highly comparable resource consumption levels in terms of inference speed, GPU utilization, and GPU memory usage compared to FontDiffuser [17], with slight advantages observed in certain metrics. This indicates that the improvement in generative quality stems from the efficiency of our algorithmic design rather than increased computational overhead, achieving a superior balance between quality and efficiency.
4.7. User Study
To validate the practical utility and visual appeal of our method from an end-user perspective, we performed a comprehensive user study. Twenty participants (10 font designers and 10 non-experts) rated 100 generated samples on a 5-point scale regarding style fidelity, structural integrity, and overall aesthetics. Our method, SGD-Font, consistently outperformed the FontDiffuser baseline across all criteria. The complete findings are summarized in Table 8.
4.8. Limitations and Failure Cases
Failure cases include redundant strokes, broken or erroneously closed strokes, missing strokes (for example, due to incorrect skeleton extraction), and stylistic ambiguities among similar clerical script variants. Figure 14 illustrates these situations.
4.9. Discussion
The experiments demonstrate the effectiveness of SGD-Font in generating high-quality calligraphy styles while maintaining structural integrity. The ablation studies show that both the skeleton-guided rendering and contrastive refinement modules are crucial for achieving robust style generation. Furthermore, the user study confirms that our method produces visually appealing results that align with user preferences across various calligraphy styles. The skeleton-guided approach could extend to scripts like Arabic or Devanagari by adapting skeleton extraction (e.g., medial axis transforms for cursive or conjunct structures), though retraining on script-specific datasets is required, marking a future research direction.
5. Conclusions
Our skeleton-guided diffusion model represents a significant leap forward in generating non-standard fonts, such as the intricate and fluid XingShu, by effectively tackling the persistent challenges of style leakage, structural distortion, and style confusion inherent in diffusion-based approaches. Through the innovative integration of three core modules—Skeleton-Constrained Style Rendering (SCSR), Cross-Scale Skeleton Preservation (CSSP), and Contrastive Style Refinement (CSR)—our method achieves unparalleled style fidelity, structural integrity, and style differentiation. The SCSR module enforces semantic alignment and balanced energy constraints, ensuring consistent stylistic rendering even under large style variances. The CSSP module leverages multi-scale skeletal priors to preserve both macro-level layouts and micro-level stroke details, preventing distortions in complex glyphs. Meanwhile, the CSR module employs contrastive learning to disambiguate similar styles, enhancing the robustness of style representations. By incorporating advanced attention mechanisms, including spatial transformers and energy-augmented cross-attention, our approach seamlessly balances stylistic expressiveness with structural coherence throughout the denoising process. Extensive experiments on diverse calligraphy datasets validate our model’s superiority over state-of-the-art methods. Owing to its practical efficiency and strong generalization, SGD-Font is suitable for integration into design software as a plug-in to facilitate custom font generation and cultural content creation. This work thereby bridges academic research and practical applications, demonstrating potential to redefine font generation for digital typography, cultural preservation, and artistic design.