Few-Shot Learning for Irregular Hangeul Typeface Expansion: A Comparative Study of GAN, VQGAN, and Diffusion Models

Hong, Jikyung; Kim, Sungkye

doi:10.3390/electronics15122633

by Jikyung Hong and Sungkye Kim *

Department of Design, Pusan National University, Busan 46241, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2026, 15(12), 2633; https://doi.org/10.3390/electronics15122633 (registering DOI)

Submission received: 15 May 2026 / Revised: 9 June 2026 / Accepted: 11 June 2026 / Published: 14 June 2026

(This article belongs to the Special Issue Efficient Learning for Computer Vision: Few-Shot, Weakly Supervised and Unsupervised Approaches)

Abstract

Irregular Hangeul typefaces present a challenging computer vision problem because complete font generation must generalize from a small number of reference glyphs while preserving both structural consistency and stylistic fidelity. This study investigates few-shot learning for the restoration and expansion of irregular and historical Hangeul typefaces through three experiments spanning relatively regular woodblock print, irregular contemporary type, and highly irregular royal calligraphy. We benchmark a GAN-based model (DM-Font), a VQGAN-based model (VQ-Font), and a diffusion-based model (Diff-Font) under limited supervision and evaluate them using pixel-level similarity, structure-sensitive indicators, OCR usability, and expert assessment. DM-Font established a feasible baseline for historical restoration (mean SSIM 0.77), whereas VQ-Font obtained the highest structural similarity for the irregular contemporary typeface when paired with a structurally designed 10-character pangram reference set (SSIM 0.97; OCR accuracy 99.5% on the evaluated glyph set). For highly irregular royal calligraphy, the two models performed comparably on global similarity (SSIM 0.78 vs. 0.80) and on expert ratings (4.2 vs. 4.3); VQ-Font showed more stable structure-sensitive indicators, whereas Diff-Font better preserved stylistic nuance. The findings suggest that reference-set composition substantially affects generation quality under fixed-budget few-shot conditions, and that model choice should be matched to source regularity and restoration objectives.

Keywords:

few-shot learning; computer vision; limited supervision; font generation; irregular Hangeul typeface; VQGAN; diffusion model; digital cultural heritage

1. Introduction

Recent advances in computer vision have been driven by deep neural networks trained on large, labeled datasets. However, the reliance of such models on abundant annotated data remains a major limitation in domains where labeled samples are scarce, heterogeneous, or costly to obtain [1,2].

Typeface generation is a representative example of this challenge. Unlike conventional image synthesis, font generation requires a model to infer a coherent visual system from a limited set of reference glyphs while preserving both structural consistency and stylistic identity across a large character inventory [3]. This challenge is especially pronounced in Hangeul, whose modern syllabic system comprises 11,172 characters generated through the combinatorial interaction of initial consonants, medial vowels, and optional final consonants. As a result, complete Hangeul typeface production remains one of the most labor-intensive tasks in typography, often requiring the design and refinement of thousands of glyphs over an extended period [4,5].
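
The figure of 11,172 follows from the standard Unicode arrangement of precomposed Hangul syllables: 19 initial consonants, 21 medial vowels, and 28 final positions (27 final consonants plus the empty final). The following minimal Python sketch reproduces that arithmetic; the function name `compose_syllable` is illustrative and does not come from the study.

```python
# Unicode Hangul Syllables block: U+AC00 .. U+D7A3.
SYLLABLE_BASE = 0xAC00
N_INITIAL = 19   # initial consonants
N_MEDIAL = 21    # medial vowels
N_FINAL = 28     # 27 final consonants + "no final"

def compose_syllable(initial: int, medial: int, final: int = 0) -> str:
    """Return the precomposed syllable for zero-based jamo indices."""
    if not (0 <= initial < N_INITIAL
            and 0 <= medial < N_MEDIAL
            and 0 <= final < N_FINAL):
        raise ValueError("jamo index out of range")
    offset = (initial * N_MEDIAL + medial) * N_FINAL + final
    return chr(SYLLABLE_BASE + offset)

inventory_size = N_INITIAL * N_MEDIAL * N_FINAL
first = compose_syllable(0, 0, 0)
last = compose_syllable(N_INITIAL - 1, N_MEDIAL - 1, N_FINAL - 1)
print(inventory_size, first, last)
```

Because every syllable is a point in this three-way index space, a model that places one component badly can repeat the error across every syllable that shares that component.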

The problem becomes substantially more difficult when the source typeface is irregular. In this study, irregular Hangeul typefaces refer to sources that exhibit unstable stroke weight, curvature, spacing, axis balance, or component composition beyond the predictable conventions of standardized print fonts. Such irregularity is common in historical woodblock documents, royal calligraphic manuscripts, hand-drawn contemporary typefaces, and decorative experimental letterforms. From a computer vision perspective, these sources are challenging because few-shot models must generalize from a small number of reference glyphs despite high intra-style variation and weak structural regularity. Accordingly, irregular Hangeul restoration is not merely a style-transfer problem; rather, it is a data-efficient visual generation problem in which the model must reconstruct a usable and internally coherent typographic system from sparse and structurally unstable evidence [6,7].

Historical Hangeul materials are particularly important in this context. The seventeenth and eighteenth centuries produced a rich typographic culture spanning woodblock print, movable-type documents, and royal handwriting, yet many of these materials survive only in fragmentary or archive-bound form [8]. Restoring them as usable digital fonts is valuable not only for design practice but also for cultural heritage preservation, scholarly access, and the reuse of historical visual resources. However, conventional restoration workflows depend heavily on expert interpretation and manual redrawing, making large-scale restoration slow, expensive, and difficult to standardize. The need for computational methods is therefore especially acute in irregular and historical Hangeul, where the number of surviving characters is limited and stylistic fidelity must be balanced against structural completeness [4,8].

Recent progress in few-shot font generation (FFG) has created a promising technical foundation for addressing these challenges. Early generative adversarial network (GAN)-based approaches, including GlyphGAN and Dual Memory Font (DM-Font), demonstrated that it is possible to synthesize a complete character set from a restricted number of reference glyphs by learning reusable content and style representations. More recently, vector-quantized generative adversarial network (VQGAN)-based and diffusion-based architectures have substantially advanced the field. VQ-Font introduced discrete codebook-based style encoding and stronger structural regularization, whereas diffusion-based models, such as Diff-Font and related variants, improved stylistic flexibility and visual realism under low-reference conditions [9,10,11]. Despite these advances, comparative evidence remains limited for irregular and historical Hangeul, particularly when the task requires controlled variation in source regularity, reference-set composition, and model family.

Another unresolved issue concerns the design of the reference set itself. In Hangeul, unseen syllables are generated through systematic recombination of recurring sub-syllabic elements, suggesting that the internal composition of a few-shot reference set may influence performance as strongly as, or more strongly than, raw set size. However, most prior studies have treated reference selection as a practical input choice rather than as a formal experimental variable. This is a significant limitation for Hangeul restoration because a poorly designed reference set may underrepresent consonant–vowel patterns, batchim structures (final consonants), or layout types essential for generalization. Likewise, evaluation has often emphasized global image similarity, even though practical font restoration requires broader criteria, including structural consistency, legibility, spacing stability, and historical authenticity [12,13,14]. A meaningful benchmark for irregular Hangeul typefaces therefore requires both architecture-level comparison and a multi-layer evaluation framework that reflects typographic function rather than visual resemblance alone.

Against this background, this study investigates artificial intelligence (AI) generative-model-based restoration and formative expansion of irregular Hangeul typefaces through three progressive experiments. The study addresses the following research questions:

RQ1. Can a GAN-based FFG model (DM-Font) effectively restore a relatively regular 17th-century woodblock Hangeul typeface into a complete, usable digital font?
RQ2. How does the composition of the few-shot reference character set affect the generation quality of a VQGAN-based model (VQ-Font) for irregular Hangeul typefaces?
RQ3. For highly irregular historical calligraphic typefaces, do VQGAN-based (VQ-Font) and diffusion-based (Diff-Font) architectures exhibit different performance characteristics?

To answer these questions, the first experiment examines whether DM-Font provides a restoration baseline for Jinbeopunhae (1693), a relatively regular seventeenth-century woodblock-printed Hangeul typeface. The second experiment evaluates how the composition of a fixed 10-character reference set influences generation quality for an irregular contemporary Hangeul typeface using VQ-Font. The third experiment compares VQ-Font and Diff-Font on the highly irregular royal calligraphic hands associated with King Jeongjo and Queen Hyoui, thereby testing whether VQGAN-based and diffusion-based models exhibit different performance characteristics under extreme irregularity. Across these experiments, generated outputs are evaluated using image-similarity metrics, structure-sensitive indicators, optical character recognition (OCR)-based usability testing, and expert qualitative assessment.

This study makes four contributions, the first three of which are methodological. (1) We introduce the Jamo Coverage Index (JCI), a formally defined metric that quantifies the structural representativeness of a few-shot reference set; although instantiated for Hangeul, JCI generalizes to any compositional writing system and provides a principled basis for designing and comparing few-shot inputs. (2) We formalize reference-set composition as a controlled experimental variable under a fixed sampling budget, isolating the effect of input design from model architecture—an analysis applicable to few-shot generation beyond typography. (3) We propose a multi-layer evaluation protocol for structured generative output that integrates pixel-level, structure-sensitive, OCR-based, and expert assessments with explicit inter-rater reliability and metric–human correlation, addressing the well-known insufficiency of single-metric evaluation. (4) Building on these instruments, we present a unified three-tier irregularity benchmark and a model-selection framework that provide an empirical, application-grounded comparison of GAN-, VQGAN-, and diffusion-based approaches. Together, the methodological instruments are transferable, while the benchmark grounds them in a demanding real-world restoration setting.

This paper substantially extends two prior conference studies by the same authors published in Korean-language design journals [4,8]. The first prior study [8] applied DM-Font to the Jinbeopunhae corpus and reported a single-model feasibility result; it constitutes the methodological precursor to Experiment 1 of the present study. The second prior study [4] applied VQ-Font to an irregular contemporary Hangeul typeface and compared reference-set composition conditions; it constitutes the methodological precursor to Experiment 2. The present paper extends these works in the following ways: (i) it introduces Experiment 3, a new and original comparison of VQ-Font and Diff-Font on two previously unpublished royal calligraphic sources (King Jeongjo and Queen Hyoui); (ii) it integrates all three experiments into a unified benchmarking framework with standardized preprocessing, evaluation metrics, and statistical analysis; (iii) it adds new analyses not reported in either prior work, including cross-experiment SSIM trajectory comparison, postprocessing error recovery analysis, inter-rater reliability assessment (ICC, Cohen’s kappa, Kendall’s W), and the model-selection framework; and (iv) all figures, tables, and quantitative results are either newly produced or substantially revised relative to the prior Korean-language publications. Any figures or tables derived or adapted from the prior studies are individually identified in their respective captions.

2. Related Work

2.1. Few-Shot Font Generation: Architectural Development

FFG is a prominent branch of efficient visual learning that aims to synthesize a complete character set from a small number of reference glyphs. Early work largely framed FFG as a style-consistent image generation problem, often relying on content–style disentanglement or weakly supervised learning of their interaction. Within this paradigm, GlyphGAN demonstrated that GANs can generate visually coherent character sets while preserving global style characteristics. DM-Font extended this line of work by arguing that the central difficulty is compositional, proposing a dual-memory structure that separates component composition from style representation to support large character inventories. LF-Font, named for its localized style representations and factorization, further refined the GAN-based approach by introducing weakly supervised localized representations to model finer sub-character structure under limited references [15].

Despite these advances, the main point of contention surrounding GAN-based FFG concerns robustness under structural instability. GAN-family models often perform well on regular printed fonts or moderately stylized sources, but tend to degrade when stroke thickness, component alignment, spacing, or axis balance varies substantially across characters. This limitation is especially consequential for Hangeul because each syllable block is formed by recombining multiple consonant and vowel components; small placement errors can therefore propagate across thousands of generated characters, degrading readability and typographic consistency. Consequently, prior studies suggest that GAN-based FFG is effective for establishing feasibility but may be insufficient for strongly irregular sources and historically contingent variation [7,11,16].

As shown in Figure 1, GAN-based few-shot font generation models disentangle content and style representations from a set of reference glyphs and recombine them through a generator to synthesize unseen characters; however, this architecture is susceptible to error propagation when stroke thickness, component alignment, or axis balance varies substantially across source characters.

Motivated by these limitations, subsequent work shifted toward stronger structural regularization. VQ-Font exemplifies this direction. By combining a VQGAN framework with structure-aware enhancement and discrete codebook quantization, VQ-Font constrains the style space to prototypical patterns and thereby stabilizes geometry during few-shot generation. Related system-oriented studies similarly emphasize the practical importance of maintaining stable compositional structure from extremely small reference sets. However, this line of work also introduces a competing concern: codebook quantization may reduce sensitivity to subtle stylistic variation, which can be problematic for calligraphic sources, degraded historical materials, or intentionally irregular typefaces in which fine-grained stylistic detail constitutes part of the visual identity rather than mere noise.

Diffusion-based approaches represent the most recent paradigm in FFG and are often positioned as an alternative response to this structure–style tension. Diff-Font applies denoising diffusion probabilistic modeling (DDPM) to one-shot or few-shot font generation and reports strong perceptual fidelity under sparse references. FontDiffuser extends this direction through multi-scale content aggregation and style-contrastive learning to better balance content preservation and style transfer. Subsequent studies introduce component-level fine-grained conditioning for diffusion-based few-shot generation, text-conditioned diffusion for high-fidelity Korean font generation, and unified diffusion frameworks for Chinese calligraphy generation and recognition. Collectively, these studies suggest that diffusion models excel at stylistic nuance and perceptual realism, but they may exhibit weaker structural stability than VQ-based approaches when the task demands precise component placement and consistent block composition across a large character inventory [2,17,18].

Figure 2 illustrates the architecture of Diff-Font, a diffusion model applied to few-shot font generation, in which a forward process progressively adds Gaussian noise to a target glyph while a learned reverse process iteratively denoises a noisy input conditioned on content and style features extracted from one or more reference glyphs. Unlike GAN-based methods that rely on a single-pass generator, the diffusion framework reconstructs typographic detail through multi-step denoising, enabling finer preservation of stylistic nuance at the cost of reduced structural rigidity compared to codebook-based approaches.
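
For orientation, the forward and reverse processes described above take the following form in the standard DDPM formulation. Here $x_0$ is the target glyph, and the variance schedule $\beta_t$, the content condition $c$, and the style condition $s$ are notational assumptions; Diff-Font's specific schedule and conditioning mechanism are not detailed in this section.

```latex
% Forward (noising) process with variance schedule \beta_t \in (0, 1)
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)

% Closed form for sampling x_t directly from the clean glyph x_0
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right),
\qquad \bar{\alpha}_t = \prod_{i=1}^{t} (1-\beta_i)

% Learned reverse (denoising) step, conditioned on content c and style s
p_\theta(x_{t-1} \mid x_t, c, s) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t, c, s),\ \sigma_t^2\,\mathbf{I}\right)
```

Generation applies the reverse step repeatedly from $t = T$ down to $t = 1$, which is the multi-step denoising contrasted with a single-pass GAN generator.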

Viewed collectively, the literature indicates a clear architectural diversification rather than a settled hierarchy: GAN-based methods established feasibility, VQGAN-based methods emphasized structural control, and diffusion-based methods emphasized stylistic fidelity. However, controlled comparisons across these families—under matched reference conditions and on strongly irregular sources—remain limited, leaving open how architecture, source regularity, and few-shot constraints interact in irregular Hangeul generation.

Table 1 provides an overview of representative FFG models by contrasting architectural families, typical reference-set scales, and the recurring strengths and limitations emphasized in prior studies.

2.2. Hangeul Typeface Research and Digital Heritage

The technical challenge of FFG intersects directly with Hangeul typeface preservation. Historical Hangeul materials—movable type, woodblock prints, vernacular publications, and royal manuscripts from the fifteenth to the nineteenth centuries—reflect multiple distinct formal systems rather than a single uniform script tradition [19,20]. This matters computationally because restoration difficulty is shaped not only by the number of surviving characters but also by the degree of regularity encoded in the surviving exemplars.

Within this landscape, Jinbeopunhae is a valuable baseline source because it retains characteristic seventeenth-century typographic features, including a relatively square character frame, comparatively stable stroke weight, and consistent inter-character spacing. These properties make it suitable for testing whether an AI model can restore a relatively regular legacy source into a complete digital font [8].

By contrast, royal calligraphic manuscripts constitute a substantially more challenging class. Scholarship on late Joseon royal writing emphasizes dynamic stroke variation, shifting internal balance, and highly individualized compositional rhythm. In particular, the documented corpus of King Jeongjo’s letters and the Deokon-related preservation of Queen Hyoui’s handwriting provide historically meaningful material for digital restoration, but their expressive irregularity makes computational expansion considerably more difficult than restoration from printed woodblock sources. In this setting, the primary difficulty is not data scarcity alone, but also structural instability in the available exemplars.

This has been recognized in restoration research. Conventional studies report that many seventeenth- and eighteenth-century materials survive only in fragmentary or archive-bound form, making large-scale manual restoration slow, expensive, and heavily dependent on expert interpretation. Design-oriented research on automated type design and Korean font generation likewise identifies irregularity and component-position uncertainty as major barriers for computational approaches. Accordingly, irregular Hangeul restoration is best understood as a data-scarce visual reconstruction problem with direct relevance to digital cultural heritage.

AI-based work remains limited and fragmented by source condition. In prior work by the present authors [8], DM-Font was applied to Jinbeopunhae to demonstrate GAN-based restoration feasibility. A subsequent study [4] examined VQGAN-based expansion for an irregular contemporary Hangeul typeface and showed that reference-set composition substantially affects generation quality. While these studies establish component-level feasibility for individual model–source pairings, each addressed a single model and a single source condition in isolation. Consequently, no prior work has provided a unified framework that systematically spans regular historical print, irregular contemporary type, and highly irregular royal calligraphy within a controlled experimental design—a gap that the present study directly addresses.

2.3. VQ-Font and Diff-Font: Architectural Comparison and Research Gap

Among recent FFG approaches, the contrast between VQ-Font and Diff-Font is particularly relevant because the two architectures operationalize competing assumptions about few-shot generation. VQ-Font prioritizes structural regularization: its discrete codebook compresses style into prototypical patterns and stabilizes global geometry under limited supervision. This makes it appealing when reliable syllable-block composition must be maintained across thousands of unseen characters. However, the same quantization mechanism may limit the preservation of subtle stroke energy, local deformation, and calligraphic irregularity.
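
The quantization step behind this trade-off can be illustrated with a generic nearest-neighbour codebook lookup. This is a minimal sketch of vector quantization in general, not VQ-Font's implementation; the function name `quantize`, the two-dimensional features, and the codebook values are all illustrative assumptions.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry nearest to `vector`
    in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)),
                key=lambda k: sq_dist(vector, codebook[k]))
    return index, codebook[index]

# Toy 2-D codebook with three "prototype" style vectors (illustrative values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Two slightly different continuous style features ...
idx_a, _ = quantize((0.9, 0.1), codebook)
idx_b, _ = quantize((1.1, -0.05), codebook)
# ... map to the same discrete code, so their small difference is discarded.
print(idx_a, idx_b)
```

Snapping features to a shared prototype is what stabilizes geometry, and it is also why small stroke-level differences between glyphs can be lost.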

Diff-Font, by contrast, maintains a continuous style space through iterative denoising, which enables preservation of finer stylistic detail and richer local variation. Diffusion-based extensions further support this direction via multi-scale aggregation, fine-grained component conditioning, and text-conditional control [2]. Nevertheless, this flexibility can become a liability when outputs must satisfy strong structural constraints, such as stable axis alignment, consistent spacing, and predictable component placement across a large Hangeul font set [8,9,10]. The literature therefore implies a structure–style trade-off, but it has not yet been verified through controlled comparisons on irregular Hangeul sources under matched preprocessing, reference constraints, and evaluation criteria.

A second unresolved issue concerns reference-set design. Most FFG studies report the number of reference glyphs but rarely treat the internal composition of the reference set as an independent experimental variable [1,2,3,6,8,9]. This omission is particularly consequential in Hangeul: because syllable blocks are compositional, generalization plausibly depends on structural coverage (e.g., Jamo combinations) as much as on reference-set size. Consequently, a 10-character reference set is not a neutral condition; its Jamo coverage may materially influence generation quality, yet this factor has seldom been isolated in controlled experiments for irregular Hangeul typefaces.
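
To make the notion of structural coverage concrete, the sketch below decomposes each reference syllable into its jamo indices and reports the fraction of distinct jamo it contains. The ratio `jamo_coverage` is a hypothetical simplification for illustration only; it is not the Jamo Coverage Index defined in this study, and the example strings are not the study's reference sets.

```python
SYLLABLE_BASE = 0xAC00
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28  # N_FINAL includes "no final"

def decompose(syllable):
    """Split a precomposed syllable into (initial, medial, final) indices."""
    offset = ord(syllable) - SYLLABLE_BASE
    if not 0 <= offset < N_INITIAL * N_MEDIAL * N_FINAL:
        raise ValueError(f"not a modern Hangul syllable: {syllable!r}")
    initial, rest = divmod(offset, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return initial, medial, final

def jamo_coverage(reference):
    """Hypothetical ratio: distinct jamo seen / all jamo (19 + 21 + 27).

    NOT the paper's JCI. The empty final (index 0) is not counted.
    """
    initials, medials, finals = set(), set(), set()
    for ch in reference:
        i, m, f = decompose(ch)
        initials.add(i)
        medials.add(m)
        if f:
            finals.add(f)
    covered = len(initials) + len(medials) + len(finals)
    total = N_INITIAL + N_MEDIAL + (N_FINAL - 1)
    return covered / total

# Two illustrative 3-character sets of equal size.
repetitive = "가나다"  # one shared vowel, no final consonants
varied = "한글꽃"      # distinct initials, vowels, and finals
print(jamo_coverage(repetitive), jamo_coverage(varied))
```

Both sets have the same size, yet the second exposes more than twice as many distinct components, which is the sense in which a fixed-size reference set is not a neutral condition.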

The present study addresses these limitations in four ways. First, it evaluates irregular Hangeul generation across a regularity spectrum—relatively regular historical print (Jinbeopunhae), irregular contemporary type, and highly irregular royal calligraphy [16,18]. Second, it treats reference-set composition as an explicit methodological variable by comparing pangram-type, word-type, and random-type reference sets under the same 10-character budget. Third, it benchmarks representative architectural families (GAN, VQGAN, diffusion) within a single experimental framework to test whether the structure–style trade-off remains stable across irregularity conditions. Fourth, it adopts a multi-layer evaluation strategy combining image-similarity metrics, structure-sensitive indicators, OCR-based usability testing, and expert qualitative assessment, addressing limitations of evaluations that rely primarily on global visual similarity [8,9,10,11,12].

For these reasons, this paper positions irregular Hangeul restoration and expansion not as a single-model application, but as a benchmark problem in efficient learning for computer vision, aimed at clarifying not only which models perform better, but also why different models succeed or fail under different forms of irregularity.

Figure 3 illustrates the end-to-end pipeline of AI-based irregular Hangeul typeface expansion, showing how a small set of reference glyphs (few-shot input) is processed through a font generation model.

3. Materials and Methods

3.1. Problem Definition

We formulate irregular Hangeul typeface restoration and expansion as a few-shot font generation task under limited supervision. Let $R = \{(x_i, y_i)\}_{i=1}^{|R|}$ denote a small set of reference glyph images $x_i$ with character labels $y_i$ sampled from a target source typeface, and let $C$ denote the modern Hangeul syllable inventory ($|C| = 11{,}172$). The objective is to learn a generation function $G_\theta$ that synthesizes a complete glyph set $\{\hat{x}_c\}_{c \in C}$ while preserving (i) the structural organization of Hangeul syllable-block composition and (ii) the stylistic identity of the source typeface [15].

Unlike regular printed fonts, irregular Hangeul sources exhibit unstable stroke thickness, curvature, connection behavior, axis balance, and internal spacing. The task is therefore not a simple image-to-image transfer problem; instead, the model must infer a coherent typographic system from sparse exemplars and generalize it across thousands of unseen syllable blocks. We operationalize this objective through three progressively more challenging experiments: (1) feasibility of restoration on a relatively regular historical print source, (2) the effect of reference-set composition under a fixed few-shot budget, and (3) architecture comparison under extreme irregularity using royal calligraphy [16,17,18]. Table 2 summarizes the source conditions, reference budgets, and target outputs across the three experiments.

Table 3 summarizes the mathematical notation used throughout the manuscript, grouped by sets, variables, operations, and statistical quantities.

3.2. Data Sources and Corpus

The corpus was designed as an irregularity spectrum rather than a single-source benchmark. All sources were processed into character-level glyph images, each associated with a unique label (Unicode syllable codepoint) to support training and one-to-one evaluation against ground truth [4,8,16].

Table 4 summarizes the corpus extraction and retention statistics for each source, reporting the number of initially detected glyphs ($N_{raw}$), the number retained after quality verification ($N_{usable}$), the resulting retention rate ($r$), and the fixed few-shot reference-set size ($|R|$).

For the royal calligraphy sources, the small number of surviving glyphs ($N_{usable} = 10$) means the paired evaluation set is necessarily small; the reported royal-source metrics are leave-the-reference-out, small-sample comparisons and should be interpreted as indicative rather than as large-scale test results. This limitation is noted in Section 6.4.

In all experiments, the model is tasked with generating the full modern inventory ($|C| = 11{,}172$). The ground-truth evaluation set is the intersection between the available source corpus labels and $C$, denoted $C_{gt} \subseteq C$. We evaluate only on $C_{gt}$, where a ground-truth glyph exists, while generation is produced for all $c \in C$.

Selection criteria for reference sets (training support):

Experiment 1: reference set size $|R| = 10$, sampled from the usable Jinbeopunhae glyph pool. Selection prioritized (i) clean segmentation, (ii) high contrast, (iii) minimal background noise, and (iv) coverage of diverse syllable-block structures (balanced, compound vowels, batchim variants).
Experiment 2: fixed few-shot budget $|R| = 10$ with three conditions: pangram-type (structure-maximizing), word-type (lexically grouped), and random-type (low-control baseline).
Experiment 3: $|R| = 10$ pangram-type references for each royal hand (King Jeongjo; Queen Hyoui) to hold reference-set composition constant under architecture comparison.

It is important to distinguish the model’s generation target from the set of labels on which quantitative metrics are computed. In every experiment, the model is required to synthesize the full modern Hangeul inventory $C$ ($|C| = 11{,}172$). The few-shot reference (support) set $R$ ($|R| = 10$) is provided during adaptation. To monitor convergence and early stopping, we held out a fixed pool of 1000 syllable labels from the generation inventory ($\Omega_{val}$), and the remaining 10,062 labels ($\Omega_{gen\text{-}test}$) constitute the labels synthesized for final inspection. These 1000 and 10,062 figures therefore describe the synthesis coverage of the output inventory, not a set of original ground-truth images.

Quantitative metrics (SSIM, LPIPS, PSNR, and the structure-sensitive indicators) require a real reference image and are therefore computed only on the paired evaluation set $C_{eval} = D_{gt} \cap D_{gen}$, i.e., the intersection between the verified ground-truth corpus $D_{gt}$ and the generated outputs $D_{gen}$. Because $D_{gt}$ is bounded by the number of usable original glyphs ($N_{usable}$), the size of $C_{eval}$ differs sharply across sources: it is large for the regular woodblock source of Experiment 1 ($N_{usable} = 997$) but only about 10 glyphs for each royal hand in Experiment 3 ($N_{usable} = 10$). For the royal sources, the reference set and the paired evaluation set are drawn from the same small pool of surviving glyphs; we therefore report the royal-source metrics as evaluations based on the limited set of surviving original glyphs available for comparisons. Accordingly, these results should be interpreted in the context of data-scarce historical materials and considered together with the complementary expert assessments and structure-sensitive analyses presented in this study.

Syllables in $C$ with no corresponding ground truth (the overwhelming majority for the royal hands) cannot be scored against a reference image. They were excluded from paired-image metrics, assessed instead through OCR-based usability testing and expert qualitative review, and retained for completeness inspection of the generated font system.

All corpus construction and preprocessing were conducted using a hybrid workflow combining programmatic image processing and design-based manual refinement. Initial glyph extraction, binarization, and normalization were implemented using Python-based libraries (e.g., OpenCV 4.8.0 and NumPy 1.24.3), enabling consistent large-scale processing across heterogeneous sources. Unicode labeling and dataset organization were performed through custom Python scripts to ensure one-to-one alignment between glyph images and syllable codepoints.

For the royal calligraphy datasets (King Jeongjo and Queen Hyoui), additional preprocessing was performed using Adobe Photoshop (contrast enhancement, noise removal, and background correction) and Adobe Illustrator 30.4 (character-level segmentation and boundary verification). The segmented glyphs were subsequently processed within the Python 3.10 pipeline for normalization and quality control. This combined workflow ensured both visual precision and reproducibility, particularly for structurally irregular handwritten sources.

3.3. Preprocessing

To minimize variance unrelated to model architecture or reference-set design, all source images were processed through a unified pipeline: (1) binarization, (2) character-level segmentation, (3) normalization, and (4) quality verification. All glyphs were normalized to 100 × 100 pixels. All preprocessing procedures were implemented in Python 3.10 using OpenCV 4.8.0, NumPy 1.24.3, and scikit-image 0.21.0, with identical parameter settings applied across all datasets to ensure procedural reproducibility [18,21].
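
The binarization and normalization steps can be sketched as follows. This is a dependency-free illustration rather than the study's pipeline, which used the libraries named above: the choice of Otsu thresholding, the crop-and-pad strategy, nearest-neighbour resampling, dark ink on a light background, and the helper names `otsu_threshold` and `normalize_glyph` are all assumptions.

```python
def otsu_threshold(gray):
    """Otsu's threshold for a 2-D list of 8-bit grayscale values."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var, w0, sum0 = 0, -1.0, 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no pixels at or below t yet
        w1 = total - w0
        if w1 == 0:
            break             # every pixel is at or below t
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def normalize_glyph(gray, size=100):
    """Binarize (ink = 1), crop to the ink bounding box, centre on a
    square canvas, and resample to size x size (nearest neighbour)."""
    t = otsu_threshold(gray)
    ink = [[1 if v <= t else 0 for v in row] for row in gray]
    rows = [r for r, row in enumerate(ink) if any(row)]
    if not rows:
        raise ValueError("no ink found")
    cols = [c for c in range(len(ink[0])) if any(row[c] for row in ink)]
    top, left = rows[0], cols[0]
    h, w = rows[-1] - top + 1, cols[-1] - left + 1
    side = max(h, w)
    off_r, off_c = (side - h) // 2, (side - w) // 2
    square = [[0] * side for _ in range(side)]
    for r in range(h):
        for c in range(w):
            square[off_r + r][off_c + c] = ink[top + r][left + c]
    return [[square[r * side // size][c * side // size]
             for c in range(size)] for r in range(size)]

# Toy 4 x 6 scan: a one-pixel-wide dark vertical stroke on white paper.
gray = [[255, 255, 0, 255, 255, 255] for _ in range(4)]
glyph = normalize_glyph(gray)
print(len(glyph), len(glyph[0]))
```

Padding to a square before resampling keeps the stroke's aspect ratio; whether the study preserved aspect ratio in this way is not stated here.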

A segmented glyph was discarded if it exhibited one or more of the following: (i) incomplete strokes due to segmentation loss, (ii) broken contours or holes inconsistent with the source writing instrument, (iii) severe background artifacts overlapping the foreground, (iv) duplicated instances for the same label where the instance quality was inferior, or (v) label inconsistencies, defined as a mismatch between the segmented region and the assigned Unicode syllable label [6,22,23].

For evaluation, each generated glyph $\hat{x}_{c}$ was paired with its corresponding ground-truth glyph $x_{c}$ from the historical or contemporary corpus. Label-based matching was performed using the Unicode syllable label $c$ as the matching key. First, a ground-truth dictionary $D_{gt} = \{c \mapsto x_{c}\}$ was constructed from the usable segmented corpus after verification. Second, a generated dictionary $D_{gen} = \{c \mapsto \hat{x}_{c}\}$ was constructed from model outputs for all $c \in C$. Third, the evaluation label set was defined as $C_{eval} = \{c \in C : c \in D_{gt} \cap D_{gen}\}$. Finally, quantitative metrics were computed only over paired samples $\{(x_{c}, \hat{x}_{c})\}_{c \in C_{eval}}$. Unmatched labels, including generated glyphs without ground truth and ground-truth glyphs without corresponding generated output, were excluded from paired-image metrics but retained for qualitative inspection of completeness [24,25,26,27].

All glyphs were normalized to 100 × 100 pixels prior to model input and metric computation. This resolution was selected to ensure cross-source comparability and to match the default input specification of the evaluated models (DM-Font, LF-Font, and Diff-Font), all of which were originally trained and validated at this resolution. Although higher resolutions may better preserve fine calligraphic stroke texture, particularly for the royal calligraphy sources, the 100 × 100 constraint was uniformly applied across all experiments to prevent resolution-induced confounds in cross-architecture comparison. The implications of this resolution trade-off are discussed further in Section 6.4.
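The label-based pairing described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation script: the function name `build_eval_pairs` and the toy dictionaries are hypothetical, and strings stand in for the 100 × 100 glyph images.

```python
def build_eval_pairs(ground_truth, generated, target_labels):
    """Pair glyphs by Unicode syllable label.

    ground_truth / generated: dict mapping a syllable (str) to a glyph image.
    target_labels: iterable of syllables in the target set C.
    Returns (pairs, unmatched), where pairs maps c -> (x_c, x_hat_c).
    """
    # C_eval: labels of C present in both dictionaries.
    eval_labels = [c for c in target_labels if c in ground_truth and c in generated]
    pairs = {c: (ground_truth[c], generated[c]) for c in eval_labels}
    # Labels present on one side only are kept for completeness inspection.
    unmatched = {
        "generated_only": sorted(set(generated) - set(ground_truth)),
        "ground_truth_only": sorted(set(ground_truth) - set(generated)),
    }
    return pairs, unmatched


# Toy example (hypothetical data): strings stand in for glyph images.
d_gt = {"가": "gt_ga", "나": "gt_na", "다": "gt_da"}
d_gen = {"가": "gen_ga", "나": "gen_na", "라": "gen_ra"}
pairs, unmatched = build_eval_pairs(d_gt, d_gen, ["가", "나", "다", "라"])
```

Only the two labels present on both sides enter the paired metrics here; the remaining two are reported separately rather than silently dropped.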

Table 5 reports structural coverage statistics, expressed as the Jamo Coverage Index, for each reference-set type. The table compares pangram-, word-, and random-type reference sets in terms of the number of covered initial consonant types, vowel-combination types, final consonant types, and the aggregated Jamo Coverage Index.

To enable inferential comparison across reference-set types, the Jamo Coverage Index was computed for each of the 10 individual reference glyphs within each condition, yielding 10 observations per group and a total of N = 30 observations across the three conditions (k = 3). Each observation represents the structural coverage contributed by a single reference glyph, defined by the presence or absence of its constituent initial consonant, vowel combination, and final consonant types within the reference set. A one-way ANOVA indicated a significant effect of reference-set type on the Jamo Coverage Index, F(2, 27) = 56.745, p < 0.001, η² = 0.81, with reference-set type accounting for approximately 81% of the variance in structural coverage. Tukey’s HSD post hoc tests showed that pangram-type sets achieved significantly higher coverage than word-type sets, MD = 0.27, SE = 0.029, p < 0.001, and random-type sets, MD = 0.44, SE = 0.029, p < 0.001. Word-type sets also outperformed random-type sets, MD = 0.17, SE = 0.029, p = 0.001. These findings indicate that the three reference-set types constitute statistically distinct structural coverage conditions.
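As a consistency check on the reported effect size, η² for a one-way ANOVA can be recovered from the F statistic and its degrees of freedom, since η² = SS_between / SS_total = (F · df_between) / (F · df_between + df_within). The short sketch below applies this identity to the values quoted above; the function name is illustrative.

```python
def eta_squared_from_f(f_value, df_between, df_within):
    """Recover eta squared (SS_between / SS_total) from a one-way ANOVA F."""
    # F = (SS_b / df_b) / (SS_w / df_w)  =>  SS_b / SS_w = F * df_b / df_w
    scaled_between = f_value * df_between
    return scaled_between / (scaled_between + df_within)


eta_sq = eta_squared_from_f(56.745, 2, 27)
print(round(eta_sq, 2))  # 0.81
```

The result agrees with the reported η² = 0.81.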

3.4. Compared Models

We compared four few-shot font generation models: DM-Font as the baseline model, LF-Font as an auxiliary GAN-based comparator, VQ-Font as the VQGAN-based model, and Diff-Font as the diffusion-based model. All models were implemented, trained, and evaluated in the same software environment using Python 3.10, PyTorch 2.1.0, CUDA 11.8, and cuDNN 8.9, with identical data splits, preprocessing outputs, random-seed settings, and evaluation scripts applied across model families to ensure reproducibility.

(1): DM-Font and LF-Font (GAN family). GAN-based font generation learns a generator G and discriminator D through an adversarial objective. In Wasserstein GAN with gradient penalty (WGAN-GP), the training objective is formulated as follows [2,10]:

$L_{WGAN-GP} = E_{\tilde{x} \sim p_{g}} [D (\tilde{x})] - E_{x \sim p_{r}} [D (x)] + λ E_{\hat{x} \sim p_{\hat{x}}} [{({‖\nabla_{\hat{x}} D (\hat{x})‖}_{2} - 1)}^{2}]$

(1)

where $p_{r}$ denotes the real data distribution, $p_{g}$ the generated distribution, $\tilde{x}$ a generated sample, $\hat{x}$ a randomly interpolated sample between real and generated data, and λ the gradient-penalty coefficient.
DM-Font and LF-Font extend this objective by conditioning both G and D on a content latent $z_{c}$ and a style latent $z_{s}$ extracted from the few-shot reference glyphs, such that the generator produces $\tilde{x} = G (z_{c}, z_{s})$ for each target character.
(2): VQ-Font (VQGAN family). VQGAN introduces a discrete codebook $\{e_{k}\}_{k = 1}^{K}$ and a quantizer $q (\cdot)$ that maps encoder latents $z_{e} (x)$ to the nearest codebook vector.
A standard VQ-style quantization objective includes reconstruction plus codebook/commitment terms [10,28,29]:

$L_{VQ} = ‖ x - \hat{x} ‖ + ‖ sg [z_{e} (x)] - z_{q} (x) ‖_{2}^{2} + β ‖ z_{e} (x) - sg [z_{q} (x)] ‖_{2}^{2},$

(2)

where $sg [\cdot]$ denotes the stop-gradient operator.
(3): Diff-Font (diffusion family; DDPM). In denoising diffusion probabilistic modeling (DDPM), the forward process adds Gaussian noise [2,20]:

$q (x_{t} ∣ x_{t - 1}) = N (\sqrt{1 - β_{t}} x_{t - 1}, β_{t} I),$

(3)

and the model learns a reverse denoising model, often trained via noise prediction:

$L_{simple} = E_{t, x_{0}, ϵ} [{‖ϵ - ϵ_{θ} (x_{t}, t, cond)‖}_{2}^{2}],$

(4)

where $x_{t}$ is a noisy version of $x_{0}$ , and “cond” denotes conditioning information (e.g., content/style features derived from the few-shot references).
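Two of the building blocks above are simple enough to illustrate directly: the nearest-codebook lookup performed by the VQ quantizer and a single forward noising step from Equation (3). The following is a toy sketch on plain Python lists under stated assumptions; the three-entry codebook, the latent vector, and the value β_t = 0.02 are hypothetical and are not taken from VQ-Font or Diff-Font.

```python
import math
import random


def quantize(z_e, codebook):
    """Return (index, vector) of the codebook entry nearest to z_e (squared L2)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    k = min(range(len(codebook)), key=lambda i: sq_dist(z_e, codebook[i]))
    return k, codebook[k]


def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I), as in Equation (3)."""
    scale = math.sqrt(1.0 - beta_t)
    sigma = math.sqrt(beta_t)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]


# Hypothetical 3-entry codebook and encoder latent.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
index, z_q = quantize([0.9, 1.2], codebook)  # nearest entry is [1.0, 1.0]

# One noising step on a toy 3-pixel "image" with an assumed beta_t.
rng = random.Random(0)
x_t = forward_step([1.0, -1.0, 0.5], beta_t=0.02, rng=rng)
```

In a real model the lookup runs over every spatial position of the encoder output and the noising step is applied to full image tensors; the logic per element is the same.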

3.5. Experimental Protocol

We employed a three-stage progressive design in which one principal variable was emphasized in each experiment while the overall goal, few-shot restoration and expansion to a complete Hangeul font, remained constant. All experimental runs, including model training, fine-tuning, metric computation, and statistical analysis, were conducted in a unified Python 3.10 environment using PyTorch 2.1.0, CUDA 11.8, NumPy 1.24.3, SciPy 1.10.1, and statsmodels 0.14.0, with fixed random seeds and identical configuration files applied within each experimental condition to ensure reproducibility.

Experiment 1: feasibility baseline, RQ1. DM-Font was trained using 10 reference glyphs with learning rate $2 \times 10^{-4}$, batch size 16, 200 epochs, Adam optimizer, and WGAN-GP loss. Evaluation was stratified into ten syllable-block composition categories (symmetric/simple through mixed feature) to analyze failure modes under increasing compositional complexity.

Experiment 2: reference-set composition, RQ2. VQ-Font was trained with a fixed few-shot budget of 10 reference glyphs under three reference-set types (pangram/word/random) with learning rate $2 \times 10^{-4}$, batch size 16, 50 epochs, and Adam with $(β_{1}, β_{2}) = (0.9, 0.999)$. A second-stage benchmark then compared DM-Font, LF-Font, Diff-Font, and VQ-Font under the optimized pangram-type reference condition to separate gains due to reference-set design from gains due to architecture.

Experiment 3: architecture under extreme irregularity, RQ3. VQ-Font and Diff-Font were fine-tuned on two royal calligraphic sources (King Jeongjo; Queen Hyoui) using matched preprocessing, a matched 10-character pangram reference set, and matched evaluation criteria. Postprocessing was applied to remove residual noise and correct structural defects before downstream font compilation where feasible.

Table 6 summarizes the experimental design by source, model(s), reference-set specification, primary variable, and research question.

Table 7 maps which model × source combinations were evaluated under matched conditions. The irregular-contemporary column constitutes a complete four-model common-condition benchmark (DM-Font, LF-Font, VQ-Font, Diff-Font under an identical pangram-type 10-glyph reference set, preprocessing, splits, seeds, and metrics), and the two royal columns are complete for the two architectures that remained trainable at extreme irregularity (VQ-Font, Diff-Font). Cells marked “—” were not evaluated and are reported as such; no results are imputed for unevaluated combinations.

The three experiments follow a progressive design, but architecture is isolated under common conditions within two of them: the second stage of Experiment 2 compares DM-Font, LF-Font, VQ-Font, and Diff-Font on the same irregular-contemporary source under an identical optimized pangram-type reference condition and protocol, and Experiment 3 compares VQ-Font and Diff-Font on the royal sources under matched preprocessing, reference set, and evaluation criteria. The design is progressive rather than fully factorial because the three sources differ greatly in usable-glyph availability (N usable ≈ 997 versus ≈10) and the compared models have different native reference budgets (Table 3); forcing every model onto every source under one tiny budget would introduce model–condition confounds rather than remove them. Table 7 summarizes which model × source cells are evaluated under matched conditions. A full model × source factorial under budget-normalized conditions is identified in Section 6.4 as a direction for future work.

3.6. Evaluation Metrics

The evaluation framework was designed as a multi-layered assessment system to capture pixel-level similarity, typography-specific structural stability, practical usability, and expert judgment. All evaluation metrics were computed in a unified Python 3.10 environment: scikit-image 0.21.0 was used for SSIM and PSNR, PyTorch 2.1.0 with the LPIPS 0.1.4 package for perceptual-distance estimation, and OpenCV 4.8.0 and NumPy 1.24.3 for geometry-based typographic measurements and heatmap analysis. OCR-based usability testing was conducted using Tesseract OCR 5.3.0 (pytesseract 0.3.10) configured with the Hangeul language pack. Each generated glyph image was rendered at 100 × 100 pixels, binarized using the same threshold applied during preprocessing (Otsu’s method, OpenCV 4.8.0), and submitted to the OCR engine in single-character recognition mode to obtain a predicted Unicode syllable label. The engine was configured with page segmentation mode 10 (--psm 10), which treats the image as a single character, and OCR engine mode 3 (--oem 3), the default mode, under which the LSTM-based recognition engine was used. OCR usability was selected as a downstream proxy for functional legibility because it operationalizes the question of whether a generated glyph is interpretable as the intended character under a standardized recognition system, complementing perceptual metrics that assess visual resemblance without measuring character identity. We acknowledge that OCR accuracy is a necessary but not sufficient condition for human legibility; however, high OCR accuracy (>95%) under a modern recognition engine provides an objective lower bound on deployability and is consistent with evaluation practices in prior font generation research [9,11,22]. To verify robustness, OCR accuracy was additionally confirmed using a secondary Naver Clova OCR pass on a 500-glyph random subsample, and rank-order agreement between the two engines was confirmed (Spearman’s ρ > 0.95).
Identical metric scripts, image-resolution settings, and threshold parameters were applied across all experimental conditions to ensure reproducibility.
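The single-character OCR configuration can be expressed through pytesseract roughly as follows. This is a hedged sketch rather than the authors' script: the helper names are hypothetical, and the language code `kor` is an assumption, since the text only mentions a Hangeul pack. On the Tesseract command line the two options are written with two ASCII hyphens (`--psm`, `--oem`).

```python
def tesseract_config(psm=10, oem=3):
    """Build the Tesseract option string for single-character recognition."""
    return f"--psm {psm} --oem {oem}"


def recognize_glyph(image, lang="kor"):
    """Return the predicted syllable for one binarized glyph image.

    Requires pytesseract and a Tesseract installation with the given language
    data; lang="kor" is an assumed language code, not stated in the text.
    """
    import pytesseract  # imported lazily so tesseract_config works without it

    text = pytesseract.image_to_string(image, lang=lang, config=tesseract_config())
    return text.strip()
```

Stripping whitespace matters because Tesseract typically appends a newline to its output, which would otherwise make every label comparison fail.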

OCR usability operationalizes a single, narrow question: whether a generated glyph is automatically recognizable as the intended Unicode character. It is a downstream proxy for functional, machine-level legibility only. It does not, and is not intended to, measure typographic quality, faithfulness of historical reconstruction, or stylistic fidelity to the calligraphic source. Critically, OCR accuracy is neither necessary nor sufficient for cultural or visual authenticity, which we assess through separate expert and structural measures.

Global image fidelity was assessed using the Structural Similarity Index Measure (SSIM) and the Peak Signal-to-Noise Ratio (PSNR). For a generated glyph image $\hat{x}$ and its ground-truth image $x$, SSIM is defined as Equation (5) [10,11]:

$\mathrm{SSIM} (x, \hat{x}) = \frac{(2 μ_{x} μ_{\hat{x}} + C_{1}) (2 σ_{x \hat{x}} + C_{2})}{(μ_{x}^{2} + μ_{\hat{x}}^{2} + C_{1}) (σ_{x}^{2} + σ_{\hat{x}}^{2} + C_{2})}$

(5)

where $μ_{x}$, $μ_{\hat{x}}$ are the mean luminances, $σ_{x}^{2}$, $σ_{\hat{x}}^{2}$ the variances, $σ_{x \hat{x}}$ the covariance, and $C_{1}$, $C_{2}$ small stabilizing constants. Higher SSIM indicates greater luminance, contrast, and structural correspondence. PSNR was additionally reported, in decibels, as the standard logarithmic transform of the Mean Squared Error between paired images; higher PSNR indicates higher pixel-level reconstruction fidelity. Because pixel-level agreement does not always reflect perceptual quality, the Learned Perceptual Image Patch Similarity (LPIPS) was also computed as a deep-feature perceptual distance using its standard implementation; lower LPIPS indicates better perceptual fidelity.
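To make the two pixel-level metrics concrete, the sketch below evaluates PSNR and Equation (5) on flattened 8-bit images using only the standard library. It is a simplified illustration: Equation (5) is computed once over the whole image, whereas the scikit-image implementation used in the study averages SSIM over local windows, so the values will generally differ. The constants `k1 = 0.01` and `k2 = 0.03` are the conventional SSIM defaults and are assumed here.

```python
import math


def psnr(x, x_hat, max_value=255.0):
    """PSNR in dB: 10 * log10(MAX^2 / MSE); infinite for identical images."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def global_ssim(x, x_hat, max_value=255.0, k1=0.01, k2=0.03):
    """Equation (5) evaluated once over the whole image (no sliding window)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(x_hat) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in x_hat) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, x_hat)) / n
    c1, c2 = (k1 * max_value) ** 2, (k2 * max_value) ** 2
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den


# Toy 2 x 2 binary glyphs, flattened row by row (hypothetical data).
x_true = [0, 255, 255, 0]
x_gen = [0, 255, 0, 0]
print(psnr(x_true, x_gen))         # one of four pixels differs: 10 * log10(4) dB
print(global_ssim(x_true, x_gen))  # below 1 because one pixel differs
```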

Because the object of evaluation is a structurally constrained typeface system rather than generic imagery, we further introduced structure-sensitive typographic indicators. In Experiment 2, the internal geometric stability of the generated syllable blocks was quantified by three measures (Equations (6)–(8)). Center-axis deviation is the mean absolute displacement between the vertical centerlines of the generated and ground-truth glyphs:

$D_{axis} = \frac{1}{N} \sum_{i = 1}^{N} | c (\hat{x}_{i}) - c (x_{i}) |$

(6)

where $c (\cdot)$ is the estimated vertical centerline of a glyph and $N$ is the number of evaluated glyphs; lower values indicate less axis drift. Stroke-thickness stability is the coefficient of variation of measured stroke widths:

$\mathrm{CV}_{stroke} = \frac{σ_{w}}{μ_{w}}$

(7)

where $μ_{w}$ and $σ_{w}$ are the mean and standard deviation of stroke widths; lower values indicate more uniform stroke thickness. Character-spacing deviation is the mean absolute difference between generated and ground-truth internal spacings:

$D_{spacing} = \frac{1}{N} \sum_{i = 1}^{N} | s (\hat{x}_{i}) - s (x_{i}) |$

(8)

where $s (\cdot)$ is the measured internal spacing of a glyph; lower values indicate spacing more consistent with the source typeface. Reference-set quality was quantified (Equation (9)) by the number of initial-consonant, vowel-combination, and final-consonant types covered, aggregated into the Jamo Coverage Index:

$\mathrm{JCI} (R) = \frac{1}{3} (\frac{| I_{R} |}{| I |} + \frac{| V_{R} |}{| V |} + \frac{| F_{R} |}{| F |})$

(9)

where $R$ is the reference set; $I_{R}$, $V_{R}$, $F_{R}$ are the initial-consonant, vowel-combination, and final-consonant types it covers; and $I$, $V$, $F$ are the corresponding target inventories. Higher JCI indicates a more structurally representative few-shot reference set.
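Equation (9) can be computed directly from Unicode code points, because precomposed Hangul syllables are laid out arithmetically from U+AC00 by initial, vowel, and final index. The sketch below is one possible reading of the index. It assumes inventories of 19 initials, 21 vowels, and 27 finals (the modern precomposed set) and does not count "no final" as a covered type; the inventories used in the study are not stated in this section and may differ.

```python
HANGUL_BASE = 0xAC00
N_INITIAL, N_VOWEL, N_FINAL = 19, 21, 27  # assumed inventories (modern Hangul)


def decompose(syllable):
    """Split a precomposed Hangul syllable into (initial, vowel, final) indices.

    A final index of 0 means the syllable has no final consonant.
    """
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < 19 * 21 * 28:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    return offset // (21 * 28), (offset % (21 * 28)) // 28, offset % 28


def jamo_coverage_index(reference_set):
    """Mean of the three coverage ratios in Equation (9)."""
    initials, vowels, finals = set(), set(), set()
    for syllable in reference_set:
        i, v, f = decompose(syllable)
        initials.add(i)
        vowels.add(v)
        if f:  # index 0 (no final) is not counted as a covered type
            finals.add(f)
    return (
        len(initials) / N_INITIAL + len(vowels) / N_VOWEL + len(finals) / N_FINAL
    ) / 3


# Two-glyph toy reference set: covers 2 initials, 2 vowels, and 2 finals.
jci = jamo_coverage_index(["한", "글"])
```

A reference set scores higher when its glyphs contribute distinct jamo in each position, which is the property the pangram-type sets are designed to maximize under a fixed 10-glyph budget.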

To localize recurrent spatial errors, heatmap structural deviation aggregates per-pixel absolute error over the evaluation set:

$H (p) = \frac{1}{N} \sum_{i = 1}^{N} | \hat{x}_{i} (p) - x_{i} (p) |$

(10)

where $H (p)$ is the mean error intensity at pixel location $p$; lower values indicate fewer spatially concentrated structural errors.
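Equation (10) amounts to averaging absolute difference images over the evaluation set. A minimal sketch on nested lists follows; the 2 × 2 toy images are hypothetical, and pixel values are assumed to lie in [0, 1]. How H(p) is reduced to a single percentage for reporting is not specified in this section, so no scalar summary is computed here.

```python
def error_heatmap(generated, ground_truth):
    """Mean absolute per-pixel error H(p) over N paired images (Equation (10)).

    Each image is a list of rows with values in [0, 1]; all share one shape.
    """
    n = len(generated)
    height, width = len(generated[0]), len(generated[0][0])
    heat = [[0.0] * width for _ in range(height)]
    for x_hat, x in zip(generated, ground_truth):
        for r in range(height):
            for c in range(width):
                heat[r][c] += abs(x_hat[r][c] - x[r][c]) / n
    return heat


# Two hypothetical 2 x 2 binary glyph pairs.
gen = [[[1, 0], [0, 1]], [[1, 1], [0, 1]]]
gt = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
heat = error_heatmap(gen, gt)  # [[0.0, 0.5], [0.0, 1.0]]
```

The bottom-right pixel is wrong in both pairs and therefore reaches the maximum value, which is the kind of recurrent, spatially concentrated error the heatmap is meant to expose.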

In Experiment 3, OCR-based usability assessed whether generated glyphs were operationally legible as deployable font assets. OCR usability is the recognition accuracy over the generated set:

$\mathrm{Acc}_{ocr} = \frac{1}{N} \sum_{i = 1}^{N} 1 [\hat{y}_{i} = y_{i}]$

(11)

where $\hat{y}_{i}$ is the OCR-predicted label, $y_{i}$ the ground-truth label, and $1 [\cdot]$ the indicator function; higher accuracy indicates greater practical legibility. Finally, expert qualitative judgment was summarized as the mean expert rating across criteria:

$\bar{R} = \frac{1}{E K} \sum_{e = 1}^{E} \sum_{k = 1}^{K} r_{e, k}$

(12)

where E is the number of evaluators, K is the number of criteria, and $r_{e, k}$ is the score from evaluator e on criterion k; higher values indicate more favorable expert judgment of typographic quality, stylistic fidelity, readability, and historical appropriateness.
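Equations (11) and (12) are both plain averages, as the following sketch shows. The labels and ratings are hypothetical toy data chosen only to make the arithmetic easy to follow.

```python
def ocr_accuracy(predicted, expected):
    """Fraction of glyphs whose OCR label equals the intended label (Eq. (11))."""
    hits = sum(1 for p, y in zip(predicted, expected) if p == y)
    return hits / len(expected)


def mean_expert_rating(ratings):
    """Mean over evaluators and criteria (Eq. (12)); ratings[e][k] in 1..5."""
    total = sum(sum(row) for row in ratings)
    return total / (len(ratings) * len(ratings[0]))


# Hypothetical OCR output: the third glyph is misread.
acc = ocr_accuracy(["가", "나", "디", "라"], ["가", "나", "다", "라"])  # 0.75

# Hypothetical ratings from E = 2 evaluators on K = 5 criteria.
r_bar = mean_expert_rating([[4, 5, 4, 4, 3], [5, 4, 4, 3, 4]])  # 4.0
```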

Table 8 summarizes all evaluation metrics used to assess generated Hangeul glyphs across image fidelity, perceptual similarity, typographic structure, OCR-based usability, and expert judgment. It also reports each metric’s definition or function, preferred direction of interpretation, and the experiment(s) in which it was applied. All outputs were reviewed and verified by the authors, who take full responsibility for the accuracy of reported values.

3.7. Expert Qualitative Evaluation

Expert qualitative evaluation was conducted by a five-member panel (three professional typeface designers with more than ten years of experience and two traditional calligraphers; mean professional experience of 12 years), all independent of the research team and blind to model identity. Glyph samples were drawn by stratified random sampling across the ten syllable-block composition categories defined in Section 3.5. In Experiment 1, each expert evaluated 40 glyphs (four per category × 10 categories) for DM-Font. In Experiment 3, each expert evaluated 60 glyphs per model (six per category × 10 categories), i.e., 120 glyphs in total for the VQ-Font vs. Diff-Font comparison. Each glyph was rated on five criteria (style consistency, similarity to the original, readability, typographic quality, historical atmosphere reproduction) using a 5-point Likert scale (1 = very low quality; 5 = very high quality), and open-ended feedback was collected. Reported category and model means are averages over the five experts and the sampled glyphs in each cell. Table 9 summarizes the categories and definitions of the expert evaluation items, as well as relevant prior research.

Table 10 reports the expert-evaluation sampling budget per experiment: the number of glyphs drawn per category, the resulting glyphs rated per expert per model, and the panel size (five experts throughout).

To assess whether the quantitative metrics reflect human perception, we correlated the per-category expert ratings with the corresponding quantitative scores across the ten syllable-block categories of Experiment 1 (n = 10 paired observations). Expert rating was strongly and significantly correlated with SSIM (Pearson r = 0.96, p < 0.001; Spearman ρ = 0.98), and moderately to strongly correlated with PSNR (Pearson r = 0.82, p = 0.004; Spearman ρ = 0.87). SSIM and PSNR were themselves correlated (r = 0.84). The strong rank agreement (ρ = 0.98) indicates that the ordering of categories by expert preference closely matches their ordering by structural similarity: categories the experts rated lowest (e.g., Category 8, expert 3.1) were also those with the lowest SSIM (0.55), while the highest-rated category (Category 3, expert 4.7) had the highest SSIM (0.95).

These results support the convergent validity of SSIM as a proxy for perceived structural quality in this benchmark, while the somewhat weaker PSNR association is consistent with PSNR being more sensitive to pixel-level noise than to perceived typographic quality. The correlation does not, however, eliminate the need for the structure-sensitive indicators and expert review: as noted in Section 5, global SSIM can miss spacing and stroke-stability failures that experts and the structural metrics detect, which is why the multi-layer framework is retained. Table 11 reports the correlations between expert ratings and quantitative metrics across the ten syllable-block categories of Experiment 1 (n = 10).
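For readers who wish to reproduce this kind of check on their own category-level data, Pearson and Spearman coefficients can be computed as sketched below; Spearman's ρ is simply Pearson's r applied to average ranks. The four data points are illustrative only: the first and last pairs echo the Category 8 and Category 3 values quoted above, while the two middle pairs are hypothetical, so the resulting coefficients are not the reported ones.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(ranks(x), ranks(y))


ssim_scores = [0.55, 0.65, 0.80, 0.95]  # middle two values are hypothetical
expert_scores = [3.1, 3.6, 4.2, 4.7]    # middle two values are hypothetical
r = pearson(ssim_scores, expert_scores)
rho = spearman(ssim_scores, expert_scores)  # identical ordering gives 1.0
```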

4. Results

4.1. Main Benchmark Results

A three-experiment benchmark was conducted across (i) relatively regular historical woodblock print (Jinbeopunhae), (ii) an irregular contemporary typeface, and (iii) highly irregular royal calligraphy (King Jeongjo and Queen Hyoui).

Experiment 1 (historical print; DM-Font). Table 12 reports a mean SSIM of 0.77, mean PSNR of 36.6 dB, and mean expert rating of 4.04/5.0 for DM-Font using 10 reference glyphs. Category-wise SSIM values ranged from 0.55 (Category 8) to 0.95 (Category 3). Category 1, Category 3, and Category 9 recorded SSIM values of 0.92, 0.95, and 0.89, respectively, with expert ratings ≥ 4.4. Figure 4 shows a per-category reconstruction-error map (DM-Font, Experiment 1) of structural fidelity for the ten Hangeul syllable-block categories (C1–C10) on regular historical print. Each cell shows the raw metric value from Table 12 (SSIM; PSNR in dB; mean expert rating on a 5-point scale); cell shading encodes the within-metric normalized deviation from the best-performing category (light = best, dark = worst). Compound-batchim glyphs (C8: SSIM 0.55, PSNR 26 dB, expert 3.10) and the horizontal/compound-vowel clusters (C5, C6) are the dominant failure modes, whereas balanced-composition glyphs (C1, C3, C9) are reconstructed most faithfully.

Table 12 summarizes the evaluation results by category and analyzes them by comparing quantitative and qualitative indicators.

Characters with balanced composition (Categories 1, 3, 9) achieved SSIM 0.89–0.95, while complex batchim clusters (Categories 4, 6, 8) showed SSIM 0.55–0.65. These results establish a positive baseline but motivate the transition to more advanced architectures.

Figure 5 illustrates the category-wise generation outputs of DM-Font trained on 10 reference glyphs extracted from Jinbeopunhae, a 17th-century Hangeul woodblock-printed typeface.

Experiment 2 (irregular contemporary; multi-model comparison under pangram-type references). Table 13 reports that VQ-Font achieved SSIM 0.97 and LPIPS 0.41, and Diff-Font achieved SSIM 0.95 and LPIPS 0.57. The GAN-based baselines reported lower similarity and higher LPIPS (DM-Font: SSIM 0.74, LPIPS 0.94; LF-Font: SSIM 0.66, LPIPS 0.89).

Figure 6 presents the generation outputs of VQ-Font on an irregular contemporary Hangeul typeface under the pangram-type 10-character reference condition, illustrating how output quality evolves as training progresses across epochs. VQ-Font achieved the highest quantitative performance in the multi-model comparison.

Experiment 3 (royal calligraphy; VQ-Font vs. Diff-Font). Averaged across the two royal sources, Diff-Font recorded SSIM 0.80 and VQ-Font recorded SSIM 0.78. VQ-Font thus recorded slightly lower structural similarity (0.78 vs. 0.80) but lower LPIPS (0.41 vs. 1.28), together with lower character-spacing deviation (2.5 vs. 3.8 px), stroke-thickness variation (0.14 vs. 0.18), and center-axis deviation (3.2 vs. 4.1 px); for these four indicators, lower values are better. Figure 7 presents side-by-side generation outputs of VQ-Font and Diff-Font on the highly irregular royal calligraphic hands of King Jeongjo and Queen Hyoui. Across experiments, the SSIM value increased from 0.77 (Experiment 1, DM-Font) to 0.97 (Experiment 2, VQ-Font), and decreased to 0.78 (Experiment 3, VQ-Font on royal calligraphy).

4.2. Ablation on Reference Design

Under a fixed 10-character reference budget, Table 14 reports that the pangram-type reference set achieved SSIM 0.97, PSNR 44 dB, and LPIPS 0.41. The word-type reference set achieved SSIM 0.92, PSNR 38 dB, and LPIPS 0.95. The random-type reference set achieved SSIM 0.76, PSNR 31 dB, and LPIPS 1.28. Table 14 also reports center-axis deviation of 0.89 px (pangram-type), 1.97 px (word-type), and 2.14 px (random-type), as well as stroke-thickness CV of 0.28, 0.86, and 1.35, respectively.

The Jamo Coverage Indices were 0.89 (pangram-type), 0.62 (word-type), and 0.45 (random-type). One-way ANOVA reported significant differences among reference-set types for structural coverage (F(2, 27) = 56.745, p < 0.001) and for downstream performance (F = 112.850, p < 0.05). OCR accuracy under the pangram-type condition was 99.5%, and heatmap structural deviation was 2.1% (random-type: 7.3%). Table 14 shows the differences in quantitative metrics across the training reference sets.

Figure 8 compares structural similarity (SSIM) across the three benchmark settings. Mean SSIM was 0.77 in Experiment 1 (DM-Font on the historical woodblock print Jinbeopunhae), increased to 0.97 in Experiment 2 (VQ-Font on an irregular contemporary typeface under pangram-type references), and decreased to 0.78 in Experiment 3 (VQ-Font on highly irregular royal calligraphy from King Jeongjo and Queen Hyoui). The non-monotonic trajectory indicates that SSIM is strongly affected by source regularity, with royal calligraphy remaining the most challenging condition.

4.3. Qualitative Comparison

Table 15 summarizes the mean expert ratings for VQ-Font and Diff-Font across the five evaluation criteria. VQ-Font received higher scores for style consistency (4.3 vs. 4.1) and typographic quality (4.2 vs. 4.0). Diff-Font received higher scores for similarity to the original (4.4 vs. 4.1), readability (4.7 vs. 4.6), and historical atmosphere reproduction (4.3 vs. 3.8). Overall mean scores were 4.2 (VQ-Font) and 4.3 (Diff-Font).

Overall, both models received high ratings, indicating that each model produced generally legible and visually plausible Hangeul glyphs. However, the two models exhibited distinct performance tendencies. VQ-Font received higher ratings for style consistency and typographic quality, whereas Diff-Font received higher ratings for similarity to the original, readability, and historical atmosphere reproduction.

A comparison of the mean scores indicates that VQ-Font showed stronger performance in stylistic consistency. VQ-Font received a mean score of 4.3, compared with 4.1 for Diff-Font. The experts noted that VQ-Font tended to maintain uniform stroke thickness, consistent curvature, and stable compositional proportions across the generated character set. This suggests that the discrete representation mechanism of VQ-Font may contribute to typographic regularization and inter-glyph cohesion. By contrast, Diff-Font showed slightly lower consistency, largely because fine strokes were not always reproduced with the same degree of uniformity across characters.

The qualitative feedback supported this interpretation. Participant 1 stated that “the balance and thickness of the strokes are well aligned, giving the font a strong sense of unity.” Similarly, Participant 2 noted that “the stroke structure is stable, and the overall quality is high.” These comments suggest that VQ-Font was perceived as particularly effective in producing coherent glyph structures and maintaining stable typographic form across the generated font set.

In terms of similarity to the original, Diff-Font received a higher mean score of 4.4, compared with 4.1 for VQ-Font. This result indicates that Diff-Font more effectively reproduced the stylistic characteristics of the reference glyphs, particularly fine curvilinear flow, decorative details, and textural variations in the original writing. The experts emphasized that Diff-Font was better able to preserve fine stylistic features that are central to historically irregular calligraphic sources. Participant 3 commented that “the feel of the strokes and even the fine details of the old characters are well preserved.” Participant 4 further observed that “Diff-Font conveys the naturalness and variation found in human handwriting.” These responses indicate that Diff-Font was more sensitive to source-specific stylistic subtleties.

Readability was rated highly for both models, with Diff-Font receiving a mean score of 4.7 and VQ-Font receiving 4.6. This result indicates that both models generated glyphs that were generally easy to identify and operationally legible. Diff-Font’s slightly higher readability score was attributed to its ability to preserve complete stroke forms without severe omissions or distortions in most evaluated glyphs. However, some experts noted that VQ-Font occasionally produced minor spacing irregularities in complex syllable blocks, which could slightly reduce legibility in specific cases.

For typographic quality, VQ-Font received a higher mean score of 4.2, whereas Diff-Font received 4.0. VQ-Font was evaluated as producing more refined and compositionally stable glyphs, with natural stroke connections and balanced internal spacing. In contrast, Diff-Font occasionally showed local structural instability, including uneven spacing, awkward stroke connections, or slight positional skew in complex characters. Participant 5 commented that “in some characters produced by Diff-Font, the spacing between strokes is uneven, which weakens the balance.” This observation is consistent with the quantitative results, in which Diff-Font showed occasional deviations in character positioning and stroke-level stability.

The largest qualitative difference between the two models was observed in historical atmosphere reproduction. Diff-Font received a mean score of 4.3, substantially higher than VQ-Font’s score of 3.8. Experts noted that Diff-Font more effectively captured the expressive qualities of historical brush writing, including stroke flow, texture, pressure variation, and irregular material traces. These characteristics are particularly important for royal calligraphic and historically degraded sources, where stylistic irregularity is not merely noise but a defining component of the source identity. Participant 3 described Diff-Font outputs as having “a rich atmosphere in each individual character,” while Participant 4 noted that the model better preserved “the flow and texture of historical brush writing.”

By contrast, VQ-Font was viewed as more structurally stable but less expressive in terms of historical materiality. Participant 2 stated that “VQ-Font reproduces the characters accurately and cleanly, but the rustic quality of old typefaces and the texture of brush strokes are less apparent.” Participant 5 similarly noted that “the detailed stylistic expression appears somewhat standardized, which slightly reduces the uniqueness of the original.” These comments indicate that VQ-Font’s strength in structural regularization may also reduce sensitivity to fine-grained stylistic variation.

Overall, the expert evaluation indicates that VQ-Font and Diff-Font exhibit complementary strengths. VQ-Font demonstrated stronger typographic cohesion, structural stability, and design-level refinement, making it advantageous when consistency and font-system regularity are prioritized. Diff-Font, by contrast, showed stronger stylistic fidelity, perceptual richness, and historical atmosphere reproduction, making it more suitable for sources in which irregular brush texture, stroke individuality, and historical expressiveness are central to the target style. These findings are consistent with the quantitative results and further suggest that model architecture influences not only reconstruction accuracy but also the balance between structural regularity and expressive fidelity in few-shot Hangeul font generation.

To ensure the statistical reliability of the expert-based qualitative evaluation, inter-rater reliability and internal consistency were additionally examined. Since all experts evaluated the same set of generated samples using an ordinal five-point Likert scale, inter-rater reliability was assessed using a two-way random-effects intraclass correlation coefficient with absolute agreement, ICC (2, k), together with the single-measure ICC (2,1). Internal consistency across the five qualitative criteria was measured using Cronbach’s alpha. In addition, pairwise quadratic-weighted Cohen’s kappa was computed to account for the ordinal nature of the Likert-scale ratings, and the average weighted kappa was reported across all expert pairs. Table 16 reports the reliability analysis of experts’ evaluations.

The reliability analysis indicated good agreement among the expert raters. The average-measure ICC was 0.87, 95% CI [0.78, 0.94], p < 0.001, while the single-measure ICC was 0.58, 95% CI [0.42, 0.74], p < 0.001. Cronbach’s alpha across the five evaluation criteria was 0.88, indicating high internal consistency of the qualitative evaluation instrument. The mean pairwise quadratic-weighted Cohen’s kappa was 0.74, 95% CI [0.65, 0.82], suggesting substantial ordinal agreement among experts. These results support the statistical reliability of the expert ratings reported in Table 15. Figure 9 presents a radar chart and bar comparison of expert qualitative ratings for VQ-Font and Diff-Font across five evaluation criteria—structural consistency, style reproducibility, character completeness, readability, and historical authenticity—assessed on a five-point Likert scale by a panel of typography experts.
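As a consistency check on the two reported ICC values, the average-measure coefficient can be recovered from the single-measure coefficient with the Spearman-Brown step-up formula. The sketch below assumes k = 5 raters, matching the five-member expert panel. The function name is illustrative.

```python
def average_measure_icc(single_icc, k):
    """Spearman-Brown step-up: reliability of the mean rating of k raters."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * single_icc / (1.0 + (k - 1) * single_icc)


# Reported single-measure ICC(2,1) = 0.58 with a panel of five experts.
icc_k = average_measure_icc(0.58, 5)
print(round(icc_k, 2))  # 0.87
```

The result agrees with the reported ICC(2,k) of 0.87, so the gap between 0.58 and 0.87 reflects averaging over five raters rather than any inconsistency between the two statistics.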

4.4. Failure Cases

Table 12 reports error incidence for VQ-Font vs. Diff-Font: stroke errors (3.2% vs. 6.5%), component errors (4.1% vs. 5.8%), connection errors (3.8% vs. 5.5%), and spacing errors (3.9% vs. 4.3%), with total raw error rates of 15.0% and 22.0%, respectively. Postprocessing improvement rates were 75.0% vs. 81.5% (stroke), 75.6% vs. 81.0% (component), 68.4% vs. 76.4% (connection), and 74.4% vs. 78.6% (spacing). After postprocessing, residual error rates were approximately 4.0% (VQ-Font) and 4.5% (Diff-Font).
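The residual rates follow from the per-category figures if each improvement rate is read as the share of raw errors removed by postprocessing. That reading is an assumption, and the sketch below checks it against the reported totals. The variable and function names are illustrative.

```python
# Raw error incidence (%) and postprocessing improvement rates (%) as reported.
RAW = {
    "VQ-Font": {"stroke": 3.2, "component": 4.1, "connection": 3.8, "spacing": 3.9},
    "Diff-Font": {"stroke": 6.5, "component": 5.8, "connection": 5.5, "spacing": 4.3},
}
IMPROVEMENT = {
    "VQ-Font": {"stroke": 75.0, "component": 75.6, "connection": 68.4, "spacing": 74.4},
    "Diff-Font": {"stroke": 81.5, "component": 81.0, "connection": 76.4, "spacing": 78.6},
}


def raw_total(model):
    """Sum of the four per-category raw error rates, in percent."""
    return sum(RAW[model].values())


def residual_total(model):
    """Errors left after postprocessing, assuming residual = raw * (1 - improvement)."""
    return sum(
        RAW[model][kind] * (1.0 - IMPROVEMENT[model][kind] / 100.0)
        for kind in RAW[model]
    )


for model in RAW:
    print(model, round(raw_total(model), 1), round(residual_total(model), 1))
```

Under this reading the residual totals come out at about 4.0% for VQ-Font and about 4.5% for Diff-Font, matching the reported values. The Diff-Font category rates sum to 22.1%, which is within rounding of the reported 22.0% total.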

Figure 10 illustrates the geometric deformation and feature-map representation.

Table 17 summarizes the error types and the postprocessing error-reduction rates for each model.

5. Discussion

This study investigated few-shot learning for irregular Hangeul typeface restoration and expansion across three source conditions: relatively regular historical woodblock print, irregular contemporary type, and highly irregular royal calligraphy. By comparing GAN-, VQGAN-, and diffusion-based models under controlled reference-set and evaluation conditions, we aimed to clarify how model architecture, source regularity, and reference-set composition jointly influence generation quality [23,30,31]. The results demonstrate that few-shot Hangeul typeface generation is not determined by model architecture alone. Rather, performance depends on the interaction between the structural complexity of the source typeface, the compositional representativeness of the reference glyphs, and the intended restoration goals, such as structural stability, stylistic fidelity, readability, or historical authenticity [24,32,33].

In Experiment 1, DM-Font established a feasible baseline for restoring a relatively regular seventeenth-century Hangeul woodblock source. The mean SSIM of 0.77, mean PSNR of 36.6 dB, and expert rating of 4.04/5.0 indicate that a GAN-based few-shot font generation model can restore a usable glyph set when the source typeface has comparatively stable stroke weight, spacing, and syllable-block composition. This finding supports the premise that compositional font generation models can be effective when the source domain provides sufficient structural regularity [34,35]. However, the category-wise results also show that average performance values may conceal substantial variation among syllable-block types. Categories with balanced or simple compositions achieved high SSIM values, whereas categories containing compound batchim, asymmetric layouts, or complex vowel structures showed lower similarity and weaker expert ratings [36,37]. This pattern indicates that the difficulty of Hangeul restoration is not merely a matter of global visual appearance but is strongly affected by internal syllable-block structure [28,38]. Therefore, even when a model produces acceptable overall similarity, complex structural subclasses may still require targeted correction, additional training support, or more carefully selected reference glyphs [1,39].

Experiment 2 further demonstrated that advanced architectures can substantially improve performance under irregular contemporary conditions. Under the pangram-type reference condition, VQ-Font obtained the best similarity scores among the four models (SSIM 0.97, LPIPS 0.41), with Diff-Font close behind (SSIM 0.95, LPIPS 0.57); the GAN-based models, DM-Font and LF-Font, showed clearly lower SSIM and higher LPIPS values. These results suggest that VQGAN- and diffusion-based approaches are more robust than earlier GAN-based approaches when the target typeface contains irregular stroke behavior, non-standard spacing, and unstable local structure.

Figure 11 shows that structural similarity tracks perceived quality (DM-Font, Experiment 1). It plots the per-category mean expert quality rating against SSIM for the ten syllable-block categories (C1–C10); marker area is proportional to PSNR, and marker color denotes the SSIM fidelity tier (teal: SSIM ≥ 0.85; gold: 0.70 ≤ SSIM < 0.85; rust: SSIM < 0.70). The dashed line is the least-squares fit. The Pearson correlation between SSIM and expert rating across the ten categories is r = 0.96, indicating that, for this source, structural similarity is a strong proxy for human-perceived quality. OCR accuracy is reported only at the condition level (pangram-type, 99.5%) and is therefore not plotted per category.
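The quantities behind this figure (fidelity tier, Pearson correlation, and least-squares line) can be computed as in the sketch below. The tier thresholds are taken from the caption. The function names are illustrative, and the three data points are made-up placeholders used only to exercise the functions, not the paper's per-category values.

```python
import math


def ssim_tier(ssim):
    """Map an SSIM value to the color tier used in the figure caption."""
    if ssim >= 0.85:
        return "teal"
    if ssim >= 0.70:
        return "gold"
    return "rust"


def _centered_sums(xs, ys):
    """Return means and centered sums of squares/cross-products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return mx, my, sxx, syy, sxy


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    _, _, sxx, syy, sxy = _centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares_fit(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = slope * x + intercept."""
    mx, my, sxx, _, sxy = _centered_sums(xs, ys)
    slope = sxy / sxx
    return slope, my - slope * mx


# Placeholder values (NOT from the paper): SSIM per category and mean expert rating.
placeholder_ssim = [0.60, 0.75, 0.90]
placeholder_rating = [3.0, 3.6, 4.2]
print([ssim_tier(s) for s in placeholder_ssim])
print(pearson_r(placeholder_ssim, placeholder_rating))
print(least_squares_fit(placeholder_ssim, placeholder_rating))
```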

VQ-Font’s discrete codebook representation and structure-aware enhancement appear to provide strong regularization for syllable-block geometry [7,16,30]. This is important for Hangeul because the generation task requires not only visually plausible individual glyphs but also consistency across a large combinatorial inventory of 11,172 syllables. The results therefore indicate that VQ-Font is especially effective when the primary objective is to produce a coherent and operationally stable font system from a small number of reference characters.

The reference-set ablation in Experiment 2 is one of the central findings of this study. Under an identical 10-character budget, the pangram-type reference set produced measurably better outcomes than the word-type and random-type reference sets across every reported indicator (e.g., SSIM 0.97 vs. 0.92 vs. 0.76; LPIPS 0.41 vs. 0.95 vs. 1.28), and the difference was statistically significant by one-way ANOVA (F(2,27) = 56.75, p < 0.001). The pangram-type condition achieved higher SSIM and PSNR, lower LPIPS, lower center-axis deviation, lower stroke-thickness variation, and lower character-spacing deviation. It also achieved higher OCR usability and lower heatmap structural deviation. These results indicate that reference-set composition functions as an experimentally meaningful variable in few-shot Hangeul typeface generation. In other words, the number of reference glyphs alone is insufficient to define the few-shot condition [23,39,40]. Because Hangeul syllables are formed through combinations of initial consonants, medial vowels, and final consonants, the structural coverage of the reference set directly affects the model’s ability to generalize to unseen syllables. The high Jamo Coverage Index of the pangram-type set explains its superior performance: it provided broader compositional evidence within the same reference budget. This finding contributes to efficient-learning research by showing that data efficiency can be improved not only through architectural innovation, but also through principled support-set design [41,42,43].

OCR accuracy must be interpreted with care in the context of historical calligraphic reconstruction. OCR engines are optimized to recognize the identity of a character, not its calligraphic manner. A glyph that has been normalized toward modern, regular stroke forms may be recognized more reliably than a faithful rendering that preserves the irregular brushwork, dry-brush texture, and ductus of the royal source—yet the normalized glyph is, by construction, less historically authentic. Our expert evaluation (Table 10) bears this out: VQ-Font attains higher typographic quality and comparable readability while scoring lower on historical atmosphere than Diff-Font. We therefore treat OCR strictly as a measure of operational legibility and deployability, and we base all claims about visual quality, stylistic fidelity, and cultural/historical authenticity on the expert panel (three type designers and two calligraphers) and on structure-aware measures (SSIM, LPIPS, and the Joint Component Inspection heatmap), never on OCR [10,22,42].

Two mechanisms could, in principle, explain the pangram-set advantage: broader combinatorial coverage of the reference set, a set-level property (H1), or greater structural representativeness of the individual reference glyphs, a glyph-level property (H2). The decomposition in Table 3 favors the former. The pangram-type set covers all eight initial-consonant types and all four vowel-combination types, whereas the random-type set covers only four and two, respectively; the aggregate Jamo Coverage Index rises monotonically with this component coverage (0.89 vs. 0.62 vs. 0.45 for the pangram-, word-, and random-type sets). Crucially, the budget is held constant at |R| = 10 across conditions, so the difference is not the number of reference glyphs but which structural components those glyphs collectively expose. Because every one of the 11,172 modern syllables is generated by composing initial, medial, and final Jamo, a reference set that instantiates more of these primitives provides the model with direct evidence for a larger fraction of the compositions it must synthesize, reducing the need to extrapolate unseen component shapes [12,13,16].
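The compositional structure referred to here is visible in the Unicode encoding of Hangeul syllables, which enumerates 19 initials × 21 medials × 28 finals (including "no final") = 11,172 blocks. The sketch below decomposes a syllable arithmetically and computes a simple per-position coverage fraction for a reference set. That coverage measure is a hypothetical illustration over the Unicode Jamo inventories. It is not the paper's Jamo Coverage Index, whose component categories (for example, eight initial-consonant types) are defined differently.

```python
HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable, "가"
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28  # 28 includes the "no final" slot
N_SYLLABLES = N_INITIAL * N_MEDIAL * N_FINAL  # 11,172


def decompose(syllable):
    """Return (initial, medial, final) indices of a precomposed Hangeul syllable."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < N_SYLLABLES:
        raise ValueError(f"{syllable!r} is not a precomposed Hangeul syllable")
    initial, rest = divmod(index, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return initial, medial, final


def component_coverage(reference_set):
    """Fraction of distinct initials, medials, and finals seen in the reference set.

    Hypothetical illustration only; not the paper's Jamo Coverage Index.
    """
    initials, medials, finals = set(), set(), set()
    for ch in reference_set:
        i, m, f = decompose(ch)
        initials.add(i)
        medials.add(m)
        finals.add(f)
    return {
        "initial": len(initials) / N_INITIAL,
        "medial": len(medials) / N_MEDIAL,
        "final": len(finals) / N_FINAL,
    }


print(decompose("한"))  # (18, 0, 4)
print(component_coverage("가힣"))
```

Two reference sets of identical size can score very differently on such a measure, which is the sense in which the budget |R| = 10 does not by itself define the few-shot condition.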

Three observations argue against per-glyph representativeness (H2) as the primary cause. First, all reference glyphs in every condition passed the same quality-verification criteria (Section 3.3), so the conditions do not differ systematically in individual glyph cleanliness. Second, the per-glyph ANOVA treats each reference character as one observation of structural coverage and still yields a large, significant between-conditions effect (F(2,27) = 56.75, p < 0.001, η² = 0.81), indicating that the conditions differ in the coverage that glyphs contribute to the set, not merely in isolated glyph traits. Third, the category-wise results of Experiment 1 show that generation quality drops specifically for compound-batchim, compound-vowel, and asymmetric syllable blocks—i.e., for compositions whose constituent Jamo are least likely to be represented by a low-coverage reference set—which is the failure pattern predicted by the coverage account rather than by a uniform glyph-quality account.
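The reported effect size can be checked directly from the F statistic, since for a one-way ANOVA η² = (F · df_between) / (F · df_between + df_within). The sketch below applies this identity to the reported values. The function name is illustrative.

```python
def eta_squared_from_f(f_value, df_between, df_within):
    """Eta squared for a one-way ANOVA, recovered from F and its degrees of freedom."""
    if f_value < 0 or df_between <= 0 or df_within <= 0:
        raise ValueError("F must be non-negative and degrees of freedom positive")
    return f_value * df_between / (f_value * df_between + df_within)


# Reported: F(2,27) = 56.75
print(round(eta_squared_from_f(56.75, 2, 27), 2))  # 0.81
```

The degrees of freedom (2, 27) correspond to three reference-set conditions with ten reference glyphs each, and the recovered value matches the reported η² = 0.81.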

Mechanistically, broader component coverage lets the model factorize a syllable into Jamo-level parts it has already observed and recombine them, rather than memorizing whole-glyph appearance [9,16]. This is consistent with the architectural behavior we observe: the discrete codebook of VQ-Font, which regularizes syllable-block geometry, benefits most when the reference set supplies a structurally diverse set of component exemplars to quantize against. In short, within the present design, the pangram advantage is most consistently explained as compositional generalization driven by component coverage, a property of how the support set is designed, rather than as an effect of individually superior reference characters.

Table 18 decomposes the Jamo Coverage Index into its component dimensions, making explicit that the pangram set’s advantage stems from broader component coverage, not from a larger reference budget, which is held constant at |R| = 10 across all conditions.

Although the few-shot budget is fixed at only ten reference glyphs, Experiment 2 directly probes the sensitivity of the results to the choice of those glyphs by comparing three alternative compositions under the same budget. The induced variation is substantial (SSIM 0.76–0.97; LPIPS 0.41–1.28) but systematic: downstream quality is monotonically ordered by the Jamo Coverage Index, which accounts for ~81% of the variance in structural coverage (one-way ANOVA, p < 0.001). These results indicate that performance is stable under reference selections that provide adequate structural coverage and degrades predictably as coverage falls, so the controlling factor is the structural representativeness of the support set rather than its raw size.

Experiment 3 examined the more difficult case of royal calligraphic Hangeul, where stylistic irregularity is not simply noise but a defining feature of the source identity. The comparison between VQ-Font and Diff-Font revealed a clear structure–style trade-off. In the royal calligraphy condition, the two models were close on global SSIM (Diff-Font 0.80 vs. VQ-Font 0.78) and on expert ratings (4.3 vs. 4.2). The more pronounced differences appeared on structure-sensitive indicators, where VQ-Font recorded lower LPIPS (0.41 vs. 1.28), smaller character-spacing deviation (2.5 vs. 3.8 px), lower stroke-thickness variation (coefficient of variation 0.14 vs. 0.18), and smaller center-axis deviation (3.2 vs. 4.1 px). These results indicate that the two models are comparable on global similarity and differ mainly in structural stability, rather than one being uniformly superior: VQ-Font is more effective in maintaining geometric stability, while Diff-Font is more effective in preserving expressive stylistic variation. This distinction is theoretically important because it shows that global similarity metrics, such as SSIM, may not fully represent the functional quality of generated typefaces. A glyph may appear visually close to the source image but still contain structural instability that affects readability, spacing behavior, or font-system consistency [1,2,18,38].

The expert evaluation supports this interpretation. VQ-Font received higher scores for style consistency and typographic quality, whereas Diff-Font received higher scores for similarity to the original, readability, and historical atmosphere reproduction. These results indicate that the two model families have complementary, task-dependent strengths within this single royal calligraphy source: VQ-Font tended to be preferable when the goal was stable layout and consistent stroke organization, whereas Diff-Font tended to be preferable when the goal was to preserve historical atmosphere and expressive irregularity. Because this comparison rests on one source and a limited expert panel, these tendencies should be viewed as condition-specific rather than general rankings. This distinction is particularly relevant for digital heritage applications because historical typeface restoration often involves competing priorities. A restoration model must produce glyphs that are legible and usable as a modern font, but it must also preserve the visual evidence of historical writing practices. The findings of this study suggest that no single model is universally optimal across these objectives. Instead, model selection should be guided by the restoration priority and the regularity of the source material [41,42].

The error analysis further clarifies the practical implications of the benchmark. VQ-Font showed lower raw error incidence than Diff-Font across stroke, component, connection, and spacing errors [1,34]. However, Diff-Font showed higher postprocessing improvement rates, and the residual error rates of the two models converged after correction. This result indicates that raw generation accuracy and final production usability are related but not identical. In practical restoration workflows, the cost and recoverability of errors may be as important as their initial frequency. For example, a model with a higher raw error rate may still be useful if its errors are systematic and easily correctable during postprocessing. Conversely, a model with fewer errors may impose greater production costs if those errors are structurally complex or difficult to repair. Therefore, evaluation of AI-generated fonts should include workflow-level criteria, including postprocessing burden, vectorization stability, and deployability as a functional font asset [41,42].

The multi-layer evaluation framework used in this study provides a more appropriate assessment model for functional glyph generation. SSIM and PSNR were useful for measuring pixel-level similarity, and LPIPS provided perceptual-distance information that was not fully captured by pixel-level metrics. However, structure-sensitive indicators, OCR-based usability, expert evaluation, heatmap structural deviation, and postprocessing analysis were necessary to identify failure modes that global metrics alone could not explain. In this respect, this study supports the broader argument that computer vision benchmarks for structured outputs should incorporate task-specific functional metrics. For font generation, such metrics must reflect not only whether an image resembles the source but also whether the generated glyph can operate as part of a complete typeface system [44,45].

The findings also have methodological implications for few-shot learning. We show that the support set is not a passive input but an active design variable. The superiority of the pangram-type reference condition demonstrates that a small number of characters can support high-quality generation when they are selected to maximize structural coverage. This result is particularly relevant for domains where additional data collection is difficult or impossible, such as historical manuscripts, fragmentary archives, endangered scripts, and culturally significant handwritten materials. Rather than increasing the number of references indiscriminately, future few-shot systems may benefit from optimizing reference selection according to compositional diversity, component coverage, and structural representativeness. Table 19 translates the experimental findings into a practical model-selection framework by aligning source regularity, restoration priority, reference-set design, and expected reconstruction performance. The expected SSIM ranges should be interpreted as condition-specific guidance derived from the present benchmark rather than universal performance guarantees, and the proposed VQ-Font–Diff-Font ensemble condition remains a future validation direction for cases requiring both structural stability and historical stylistic fidelity.

6. Conclusions

6.1. Summary

This study examined irregular Hangeul typeface restoration and expansion as a few-shot learning problem under limited supervision, benchmarking GAN-, VQGAN-, and diffusion-based models across a spectrum of source regularity. The results demonstrate that generation performance is jointly determined by model architecture, source typeface regularity, and reference-set composition. A GAN-based baseline (DM-Font) provided feasible restoration for a relatively regular historical source. For the irregular contemporary typefaces in Experiment 2, VQ-Font attained the highest structural similarity among the evaluated models (SSIM 0.97) when paired with a structurally optimized pangram-type reference set, and the other metrics followed the same condition-specific pattern. Under the highly irregular royal calligraphy conditions of Experiment 3, VQ-Font and Diff-Font showed similar global similarity and expert ratings (SSIM 0.78 vs. 0.80, 4.2 vs. 4.3), and their differences appeared mainly in structure-sensitive indicators: VQ-Font tended to provide more stable geometric structure and readability, whereas Diff-Font tended to preserve stylistic nuance more faithfully. These findings indicate that successful few-shot Hangeul generation depends on aligning model selection and reference design with the structural properties of the source and the intended restoration objective.

6.2. Theoretical Contribution

The proposed benchmark advances evaluation theory for structured generative models. By integrating architectural comparison, reference-set ablation, pixel-level similarity metrics, perceptual-distance measurement, typography-specific structural indicators, OCR-based usability testing, expert qualitative assessment, and postprocessing recoverability into a single experimental framework, this study offers a more comprehensive methodological model for assessing generative performance in complex writing systems, such as Hangeul.

The theoretical significance of this work lies not only in identifying which model performs best under a given source condition, but also in demonstrating why performance changes across regularity levels, how compositional reference coverage affects generalization under a fixed few-shot budget, and how model selection should be interpreted relative to restoration objectives—including structural consistency, stylistic authenticity, operational readability, and digital-heritage preservation. Collectively, these contributions broaden the scope of computer vision research from image-level resemblance toward functionally constrained visual-system reconstruction [4,8,19,20].

Although this study does not propose a new generative backbone, its methodological contributions are not specific to Hangeul. The Jamo Coverage Index operationalizes few-shot reference-set quality for any compositional script, turning an implicit design choice into a measurable, optimizable quantity; the controlled treatment of reference-set composition provides a template for studying input design as a first-class factor in few-shot generation; and the multi-layer, reliability-validated evaluation protocol offers a reusable methodology for assessing structured generative output where global pixel metrics are insufficient. These instruments are independent of the particular architectures evaluated here and can be applied to other low-resource, structure-critical generation problems.

We therefore position the paper as a methodological and empirical contribution: it provides transferable evaluation and input-design instruments (JCI, the composition-as-variable protocol, and the multi-layer assessment) validated within a rigorous comparative benchmark, rather than a new architecture.

6.3. Practical Contribution

This study provides practical guidance for stakeholders involved in digital heritage preservation, Hangeul typography, archival documentation, museum curation, font production, and AI-assisted design. Rather than treating generated glyphs as isolated images, the proposed framework evaluates them as functional typographic assets that must satisfy visual fidelity, structural consistency, legibility, and deployment readiness. By combining SSIM, PSNR, LPIPS, structure-sensitive indicators, OCR usability, expert review, and postprocessing analysis, this study offers a workflow that enables practitioners to judge whether AI-generated Hangeul fonts are suitable for restoration, publication, education, exhibition, or design application.

The results also provide a stakeholder-oriented model-selection guideline. For relatively regular historical print sources, GAN-based models, such as DM-Font, can serve as a feasible baseline when sufficient reference glyphs are available and minimal setup is required. For irregular contemporary or historically complex sources where readability, spacing stability, and syllable-block consistency are critical, VQ-Font may be the more suitable starting point because it showed more stable structural control under limited reference conditions in our experiments (lower spacing and center-axis deviation; lower LPIPS). When the restoration objective prioritizes historical atmosphere and brush texture, Diff-Font is a reasonable alternative, although its outputs required additional structural verification before deployment.

Finally, this study indicates that reference-set design and production workflow are important for practical success: under the same 10-character budget, a structurally curated pangram-type reference set improved every reported quality indicator. The proposed pipeline, including generation, evaluation, postprocessing, vectorization, and conversion into deployable font formats, can support institutions seeking to restore incomplete historical typefaces, expand irregular contemporary fonts, or develop culturally grounded digital resources. Thus, this study offers an actionable workflow through which researchers, designers, and heritage professionals can integrate few-shot font generation into real-world restoration and production environments while maintaining accountability through metric-based validation and expert judgment.

6.4. Limitations and Future Research Directions

Several limitations of this study suggest directions for future research. First, all glyphs were normalized to 100 × 100 resolution to ensure cross-source comparability and to match the native input specification of the evaluated models. This constrained the representation of fine calligraphic detail, an effect strongest for the royal calligraphy sources: at 100 × 100, principal strokes span roughly 4–7 px, whereas the hairline entries, tapered terminals, and pressure-modulated thin strokes of historical brush writing span only about 1–2 px—at or below the raster’s effective sampling limit—so such features tend to merge before any model is applied. This is consistent with the expert evaluation, in which reviewers described the royal-source brush texture as standardized, and with the larger LPIPS than SSIM differences observed on these sources. Future work should evaluate higher-resolution generation (e.g., 256 × 256), retraining the architectures at the higher resolution, to recover stroke texture for fine-detail-critical restoration. Second, the source corpus was restricted to a small number of typefaces, and for the royal calligraphy sources, only about 10 usable original glyphs survive per hand; the corresponding metrics are therefore reported as few-sample, leave-the-reference-out comparisons and interpreted as indicative rather than population-level estimates. Expanding the dataset across a broader range of historical periods and styles would strengthen robustness. Third, the qualitative evaluation relied on a limited expert panel (n = 5); although it achieved good-to-substantial inter-rater reliability (ICC(2,k) = 0.87; Cronbach’s α = 0.88; weighted κ = 0.74), the panel size remains modest. Future work will expand to a larger, institutionally diverse panel, recruit an independent external panel to replicate the ranking under a pre-registered protocol, and add a larger-scale crowdsourced legibility study. 
Fourth, the study used a single preprocessing and evaluation pipeline without standardized computational-cost analysis; future benchmarks should examine pipeline variability and include reproducible efficiency metrics. Fifth, although the component decomposition of the Jamo Coverage Index points to compositional coverage as the principal driver of the pangram-set advantage, coverage and per-glyph structural quality are not fully orthogonal in the current design, and a single representative reference set was used per composition type. A decisive test would hold Jamo coverage constant while varying the structural representativeness of individual glyphs (and vice versa)—sampling multiple coverage-matched reference sets and several independent 10-glyph draws at fixed JCI—to report support-set confidence intervals and separate variability due to coverage level from variability due to glyph identity. We identify this coverage-matched robustness ablation as a priority for future work. Sixth, reference-based metrics require ground-truth glyphs, so quality on the portion of the 11,172-syllable inventory lacking ground truth is currently supported by structural-coverage evidence rather than direct measurement. Future work will close this gap through a stratified human spot-check audit over a statistically powered random sample across all Jamo-combination strata, reference-free quality estimators computed over the entire generated set, and a component-level consistency check. These analyses will be reported with full statistics; no projected values are claimed here. 
Finally, this study contributes evaluation and input-design methodology (JCI, reference-set composition as a controlled variable, and a multi-layer assessment protocol) together with a comparative benchmark rather than a new generative architecture; designing an architecture or objective tailored to highly irregular calligraphic structure—and the observed VQ-Font/diffusion complementarity that motivates hybrid or ensemble approaches—is a natural direction for future research on highly irregular and historically complex typefaces.
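The resolution argument in the first limitation can be made concrete with a small calculation. The sketch below assumes the same glyph is rescaled uniformly from a 100 × 100 to a 256 × 256 raster. The function name is illustrative, and the stroke widths are the approximate ranges quoted above.

```python
def scaled_stroke_width(width_px, src_size=100, dst_size=256):
    """Stroke width in pixels after uniformly rescaling a square glyph raster."""
    if src_size <= 0 or dst_size <= 0:
        raise ValueError("raster sizes must be positive")
    return width_px * dst_size / src_size


# Approximate stroke widths at 100 x 100, as quoted in the limitation above.
for label, widths in {"thin strokes": (1, 2), "principal strokes": (4, 7)}.items():
    lo, hi = (scaled_stroke_width(w) for w in widths)
    print(f"{label}: {lo:.2f} to {hi:.2f} px at 256 x 256")
```

Under this assumption, thin strokes that occupy about 1–2 px at 100 × 100 would occupy roughly 2.6–5.1 px at 256 × 256, which gives hairlines and tapered terminals several pixels of width instead of one or two.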

Author Contributions

Conceptualization, J.H. and S.K.; methodology, J.H.; software, J.H.; validation, J.H. and S.K.; formal analysis, J.H.; investigation, J.H.; resources, J.H.; data curation, J.H.; writing—original draft preparation, J.H.; writing—review and editing, S.K.; visualization, J.H.; supervision, S.K.; project administration, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Some or all data, trained models, code, and font files that support the findings of this study are available from the authors upon reasonable request.

Acknowledgments

During the preparation of this study, the authors used Claude (Anthropic, version Opus 4.7) to support two clearly bounded verification tasks related to the accuracy of the Hangeul corpus analysis: (i) verifying the consistency of the Hangeul corpus labels—that is, cross-checking that each extracted glyph image was correctly paired with its Unicode syllable codepoint—so as to assess and improve the accuracy of the corpus annotation, and (ii) cross-checking that the reported statistical outputs (e.g., descriptive statistics and ANOVA results) were consistent with their underlying computations. The AI tool was not used to design the experiments, generate experimental data or glyph images, write analysis code, or draw scientific conclusions. All AI-assisted outputs were subsequently reviewed, corrected where necessary, and validated against the original corpus and source data by the authors, who take full responsibility for the accuracy and integrity of the reported content. All analytical decisions, interpretations, and conclusions were made independently by the authors. The authors have reviewed and edited all outputs and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CV	Computer Vision
DDPM	Denoising Diffusion Probabilistic Model
Diff-Font	Diffusion Font
DM-Font	Dual Memory Font
FFG	Few-Shot Font Generation
GAN	Generative Adversarial Network
LF-Font	Localized Style Representations and Factorization Font
LPIPS	Learned Perceptual Image Patch Similarity
OCR	Optical Character Recognition
PSNR	Peak Signal-to-Noise Ratio
SSIM	Structural Similarity Index Measure
VQ-Font	Vector Quantized Font
VQGAN	Vector Quantized Generative Adversarial Network

References

Liu, Y.; Ding, Y.; Khalid, F.B.; Wang, C.; Wang, L. Few-shot font generation via denoising diffusion and component-level fine-grained style. Expert Syst. Appl. 2026, 296, 128987.
He, H.; Chen, X.; Wang, C.; Liu, J.; Du, B.; Tao, D.; Qiao, Y. Diff-Font: Diffusion model for robust one-shot font generation. Int. J. Comput. Vis. 2024, 132, 5372–5386.
Li, Z.; Chen, S.; Liang, D. SGD-font: Style and glyph decoupling for one-shot font generation. Knowl.-Based Syst. 2025, 330, 114600.
Hong, J.K.; Kim, S.K. Research on expanding the irregular Hangeul typeface using a VQGAN-based font generation model. J. Basic Des. Art 2026, 27, 805–822.
Jo, Y.J.; Kang, S.J.; Seo, B.J.; Kim, S.Y. Font generation system development based on few-shot font generation model. J. KIISE 2025, 52, 77–87.
Kristianto, Y.; Soewito, B. Beyond OCR: GAN-driven restoration of severely degrading document. Int. J. Comput. Theory Eng. 2025, 17, 189–201.
Hayashi, H.; Abe, K.; Uchida, S. GlyphGAN: Style-consistent font generation based on generative adversarial networks. Knowl.-Based Syst. 2019, 186, 104927.
Hong, J.K.; Kim, S.K. GAN (Generative Adversarial Network)-based restoration research of a 17th century Hangeul typeface focused on Jinbeopunhae. J. Basic Des. Art 2025, 26, 947–960.
Sami, A.; Kumar, A.; Memon, I.; Jo, Y.J.; Rizwan, M.; Choi, J.Y. Text-conditioned diffusion model for high-fidelity Korean font generation. In Proceedings of the 2025 International Conference on Information Networking (ICOIN), Chiang Mai, Thailand, 15–17 January 2025; pp. 660–665.
Yao, M.; Zhang, Y.; Lin, X.; Li, X.; Zuo, W. VQ-Font: Few-shot font generation with structure-aware enhancement and quantization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 16407–16415.
Hassan, A.U.; Memon, I.; Choi, J.Y. Real-time high quality font generation with conditional font GAN. Expert Syst. Appl. 2023, 213, 118907.
Jung, J.E.; Byun, H. A study of the creation of Korean fonts using DM-Font and an image style transfer model. J. Korean Soc. Media Arts 2022, 20, 63–72.
Park, S.; Chun, S.H.; Cha, J.B.; Lee, B.D.; Shim, H.J. Few-shot font generation with weakly supervised localized representations. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022.
Cha, J.B.; Chun, S.H.; Lee, G.Y.; Lee, B.D.; Kim, S.H.; Lee, H.S. Few-shot compositional font generation with dual memory. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 735–751.
Park, S.; Chun, S.H.; Cha, J.B.; Lee, B.D.; Shim, H.J. Few-shot font generation with weakly supervised localized representations. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 1479–1495.
Memon, I. A Component Position-Aware GAN Framework Using Cross-Attention Content Encoder for Precise Korean Font Generation. Doctoral Dissertation, Soongsil University, Seoul, Republic of Korea, 2025.
Xu, T.; Wang, K.; Chen, Z.; Wu, L.; Wen, T.; Chao, F.; Chen, Y. UniCalli: A unified diffusion framework for column-level generation and recognition of Chinese calligraphy. In Proceedings of the ICLR, Rio de Janeiro, Brazil, 23–27 April 2026.
Shakhovska, N.; Petrovskyi, O.; Fedusko, S.; Gregus, M. Automated typographic font generation using artificial intelligence. In Proceedings of the International Conference on Artificial Intelligence and Soft Computing (ICAISC 2024), Zakopane, Poland, 16–20 June 2024; pp. 165–176.
Zeng, J.; Yuan, Y.; Wang, X.; Zhang, Y.; Wang, Y. Cross-lingual font generation via patch-level style contrastive learning and relative position awareness. Pattern Recognit. 2025, 169, 111937.
Yang, Z.; Peng, D.; Kong, Y.; Zhang, Y.; Chao, F.; Jin, L. FontDiffuser: One-shot font generation via denoising diffusion with multi-scale content aggregation and style contrastive learning. In Proceedings of the 38th AAAI 2024 Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 6603–6611.
Park, J.S. The characteristics of calligraphic style and significance of calligraphy history of published Hangeul document in the 17th century of Joseon dynasty. Study Calligr. 2022, 40, 5–50.
Kumar, A.; Kang, K.; Hassan, A.; Choi, J. FontFusionGAN: Refinement of handwritten fonts by font fusion. Electronics 2023, 12, 4246.
Filip, P. Regularized font data for ML font generation. Lett. Seed 2025, 29, 113–124.
Park, J.H. An analysis of Korean characters frequency survey from 21st century Sejong plan for type design in Hangeul. Treatise Plast. Media 2018, 21, 292–299.
Park, J.S. A calligraphic study of royal Hangeul manuscript materials from the Joseon dynasty (18th century). Natl. Hangeul Mus. 2014, 1, 119–139.
Wang, Y.; Lian, Z. DeepVecFont: Synthesizing high-quality vector fonts via dual-modality learning. ACM Trans. Graph. 2021, 40, 1–15.
Hong, Y.P. Hangeul Calligraphy and Hangeul Typefaces, 3rd ed.; Taehaksa: Seoul, Republic of Korea, 2023; pp. 154–196.
Fu, B.; Yu, F.; Liu, A.; Wang, Z.; Wen, J.; He, J.; Qiao, Y. Generate like experts: Multi-stage font generation by incorporating font transfer process into diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 6892–6901.
Fernandes, F.J. Generative Modeling for Automated Type Design. Master’s Thesis, University of Coimbra, Coimbra, Portugal, 2023.
Yang, H.J. A Foundation Study on Archaic Hangeul Character Font Design. Master’s Thesis, Hongik University, Seoul, Republic of Korea, 2023.
Ko, K.; Yeom, T.; Lee, M. SuperstarGAN: Generative adversarial networks for image-to-image translation in large-scale domains. Neural Netw. 2023, 162, 330–339.
Wang, L.; Wang, C.; Liu, Y. EHW-font: A handwriting enhancement approach mimicking human writing process. Expert Syst. Appl. 2025, 278, 127278.
Liu, Y.; Lian, Z. DeepCalliFont: Few-shot Chinese calligraphy font synthesis by integrating dual-modality generative models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 3774–3782.
Wang, L.; Liu, Y.; Sharum, M.Y.; Yaakob, R.; Kasmiran, K.A.; Wang, C. Deep learning for Chinese font generation: A survey. Expert Syst. Appl. 2025, 276, 127105.
Wang, X.; Wang, Y.; Ai, C.; Zeng, J. One-shot font generation with masked diffusion transformers. In Proceedings of the 2024 3rd International Conference on Artificial Intelligence, Human-Computer Interaction and Robotics (AIHCIR), Hong Kong, China, 15 November 2024; pp. 152–159.
Azadi, S.; Fisher, M.; Kim, V.; Wang, Z.; Shechtman, E.; Darrell, T. Multi-content GAN for few-shot font style transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7564–7573.
Wang, C.; Zhou, M.; Ge, T.; Jiang, Y.; Bao, H.; Xu, W. CF-Font: Content fusion for few-shot font generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–23 June 2023; pp. 1858–1867.
Liu, Y.; Lian, Z. FontTransformer: Few-shot high-resolution Chinese glyph image synthesis via stacked transformers. Pattern Recognit. 2023, 141, 109593.
Pan, W.; Zhu, A.; Zhou, X.; Iwana, B.K.; Li, S. Few shot font generation via transferring similarity guided global style and quantization local style. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 19449–19459.
Hassan, A.U.; Memon, I.; Choi, J. Learning font-style space using style-guided discriminator for few-shot font generation. Expert Syst. Appl. 2024, 242, 122817.
Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
Li, W.; He, Y.; Qi, Y.; Li, Z.; Tang, Y. FET-GAN: Font and effect transfer via K-shot adaptive instance normalization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 1717–1724.
Pae, H.K. Orthographic and phonological representations in Hangeul. In Analyzing the Korean Alphabet: The Science of Hangeul; Springer: Cham, Switzerland, 2024; pp. 121–152.
Huang, X.; Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1501–1510.
National Hangeul Museum. Available online: https://www.hangeul.go.kr (accessed on 24 March 2026).

Figure 1. Architecture of a structure-aware few-shot Hangeul font generation model. A style encoder ε_S extracts style features from the source (reference) image I_s—a small set of reference glyphs that defines the target typeface style—while a content encoder ε_C extracts the structural content of the content image I_c—the standard-form glyph of the character to be generated. The style and content features interact through Key–Query–Value attention and a Structure-level Style Enhancement Module (A_patch, A_reweight), are quantized against a learned font codebook C via a Transformer, and are decoded by D to synthesize the output glyph I_q.

Figure 2. Architecture of Diff-Font, a denoising diffusion model for few-shot font generation. A character-attributes encoder combines a content code with a style code produced by a style encoder into a conditioning vector $z$, which is injected through MLP layers into the diffusion backbone. In the forward diffusion process $q(x_{t} \mid x_{t-1})$, Gaussian noise is progressively added to the target glyph from $x_{0}$ to $x_{T}$; in the learned reverse process $p(x_{t-1} \mid x_{t}, z)$, the noise predictor $\hat{\epsilon}_{\theta}$ iteratively denoises the noisy input conditioned on the content and style features, reconstructing typographic detail through multi-step denoising rather than the single-pass generation used by GAN-based models.

Figure 3. Overview of the AI-based irregular Hangeul typeface expansion pipeline. The characters in the pangram-style image serve as the reference character set for training, the characters in the content image and the input are the input characters, and the characters in the target and output are the generated characters. The reference image was additionally trained using standard Hangeul font files.

Figure 4. Per-category reconstruction-error map for DM-Font in Experiment 1.

Figure 5. Category-wise generated glyph outputs of DM-Font for the AI-based restoration of Jinbeopunhae, a 17th-century Hangeul woodblock-printed typeface, trained under a fixed few-shot budget of 10 reference glyphs: (a) original Jinbeopunhae from Hangeul woodblock print; (b) generated Hangeul typefaces by DM-Font. Results are organized by the ten syllable-block composition categories defined in Section 3.5, ranging from symmetric/simple (Category 1) to mixed feature (Category 10).

Figure 6. Training process of VQ-Font for irregular modern Hangeul fonts under the 10-glyph pangram condition. The flowchart shows the stages through which Hangeul characters are generated from the reference characters over the training sequence.

Figure 7. Generation results of VQ-Font and Diff-Font for the royal calligraphic Hangeul typefaces of King Jeongjo and Queen Hyoui.

Figure 8. Cross-experiment SSIM trajectory across three Hangeul typeface benchmarks.

Figure 9. Experts’ qualitative evaluation ratings for VQ-Font and Diff-Font on the royal calligraphic Hangeul benchmark, scored on a five-point Likert scale across five criteria. (The star rating indicates which model is the better choice.)

Figure 10. Geometric deformation and feature-map representation underlying few-shot Hangeul style transfer, illustrated by changes in the style of the Korean character for ‘flower’. (a) Geometric deformation between two font styles: a source glyph in style A is transformed into the target style B, and the overlaid contours visualize the per-point geometric displacement between the two styles. (b) Feature-map representation of the same transfer: corresponding sampling locations in the content font (point p) and the target font (point p_r) are linked across the feature volume, and the magnified detail view shows the predicted sampling offsets ΔP used to align local stroke geometry between the content and target domains.

Figure 11. Structural fidelity tracks perceived quality: per-category SSIM vs. expert rating.

Table 1. Overview of key FFG models and architectures.

Model	Architecture	Year	Reference Set	Strength	Limitation
GlyphGAN	GAN	2019	Large	Style consistency	Regular fonts only
DM-Font	Dual Memory GAN	2020	~100	Component separation	Irregular fonts
LF-Font	Localized GAN	2022	~6–28	Local feature control	Complex structure
VQ-Font	VQGAN (Codebook)	2024	~10	Structural consistency	Style nuance
Diff-Font	Diffusion (DDPM)	2024	~1–10	Style reproducibility	Structural instability
FontDiffuser	Diffusion (Contrastive)	2024	~1	Cross-lingual transfer	Korean-specific
SGD-Font	Style–Glyph Decoupling	2025	~10	Decoupled generation	Limited coverage

Table 2. Source typeface characteristics across the three experiments.

Property	Exp. 1: Jinbeopunhae	Exp. 2: Irregular Contemporary	Exp. 3: Royal Calligraphy
Period	17th century	Contemporary	18th century
Production	Woodblock printing	Digital hand-drawn	Brush calligraphy
Regularity	Relatively regular	Irregular	Highly irregular
Reference chars	10	10 (×3 conditions)	10 (pangram-type)
Target output	11,172 characters	11,172 characters	11,172 chars (×2 typefaces)

Table 3. Summary of mathematical notations used in this study.

Symbol	Definition
$R$	Few-shot reference set $R$ of glyph images sampled from the source typeface
$x_{i}$, $y_{i}$	$i$-th reference glyph image and its character label, $(x_{i}, y_{i}) \in R$
$|R|$	Number of reference glyphs (reference budget); $|R|$ = 10 in this study
$C$	Modern Hangeul syllable inventory; $|C|$ = 11,172
G_θ	Generation function (model) with parameters θ
x	Ground-truth target glyph image
$\hat{x}$	Generated (synthesized) glyph image
μ_x, σ_x	Mean and standard deviation of pixel intensities of image x (SSIM)
$\sigma_{x\hat{x}}$	Covariance between target x and generated $\hat{x}$ (SSIM)
ε_θ	Noise predicted by the diffusion model (denoising network)
σ_w, μ_w	Standard deviation and mean of stroke width w (stroke-thickness CV)
c(·), s(·)	Center-axis and inter-glyph spacing operators used in geometric metrics
N	Number of evaluated glyphs in a metric average
JCI($R$)	Jamo Coverage Index of reference set $R$ (initial/medial/final coverage)
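
For orientation, the SSIM-related symbols in Table 3 combine in the standard single-window form of the index, sketched below. The stabilizing constants $C_{1}$ and $C_{2}$ come from the usual SSIM definition; they are not listed in Table 3, so their presence and values here are an assumption rather than a setting reported by the authors.

```latex
\mathrm{SSIM}(x, \hat{x}) =
  \frac{\left(2\mu_{x}\mu_{\hat{x}} + C_{1}\right)\left(2\sigma_{x\hat{x}} + C_{2}\right)}
       {\left(\mu_{x}^{2} + \mu_{\hat{x}}^{2} + C_{1}\right)\left(\sigma_{x}^{2} + \sigma_{\hat{x}}^{2} + C_{2}\right)}
```

The index equals 1 only when the generated glyph $\hat{x}$ matches the target $x$ in mean, variance, and covariance structure.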

Table 4. Corpus extraction and retention statistics per source.

Source/Experiment	$N_{raw}$	$N_{usable}$	$|R|$ (Reference)	Paired Eval Set $|C_{eval}|$
Exp. 1—Jinbeopunhae (1693)	1386	997	10	≈987 (997 usable minus 10 held as reference)
Exp. 2—Irregular contemporary	18	10	10	Evaluated on the generated modern inventory vs. the available reference glyphs; OCR + structural indicators
Exp. 3—King Jeongjo	753	10	10	≈10 (leave-reference-out, few-sample)
Exp. 3—Queen Hyoui	2568	10	10	≈10 (leave-reference-out, few-sample)
Generation inventory $C$ (all exp.)				11,172 synthesized (1000 Ω_val + 10,062 inspected + reference)

Table 5. Jamo Coverage Index by reference-set type.

Metric	Pangram-Type	Word-Type	Random-Type
Initial consonant types (of 8)	8	5	4
Vowel combination types (of 4)	4	3	2
Final consonant types	2	2	1
Jamo coverage index	0.89	0.62	0.45
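
The coverage counts in Table 5 rest on splitting each reference syllable into its initial, medial, and final jamo. A minimal sketch of that decomposition follows, using the standard Unicode arithmetic for precomposed Hangul syllables. The helper `raw_jamo_coverage` is hypothetical and is not the paper's Jamo Coverage Index: the index in Table 5 aggregates over coarser type categories (8 initial-consonant types, 4 vowel-combination types) whose definitions are not given in this section, so the sketch reports plain per-position coverage instead.

```python
# Standard Unicode layout of precomposed Hangul syllables (U+AC00..U+D7A3).
S_BASE = 0xAC00
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28  # 28 includes the "no final" slot


def decompose(syllable: str) -> tuple[int, int, int]:
    """Return (initial, medial, final) jamo indices; final == 0 means no batchim."""
    offset = ord(syllable) - S_BASE
    if not 0 <= offset < N_INITIAL * N_MEDIAL * N_FINAL:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(offset, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return initial, medial, final


def raw_jamo_coverage(reference: str) -> dict[str, float]:
    """Fraction of distinct jamo per position seen in a reference set.

    Hypothetical helper for illustration only; NOT the paper's JCI.
    """
    initials, medials, finals = set(), set(), set()
    for ch in reference:
        i, m, f = decompose(ch)
        initials.add(i)
        medials.add(m)
        if f:  # index 0 is the empty final, so it is not counted as a jamo
            finals.add(f)
    return {
        "initial": len(initials) / N_INITIAL,
        "medial": len(medials) / N_MEDIAL,
        "final": len(finals) / (N_FINAL - 1),
    }


if __name__ == "__main__":
    print(N_INITIAL * N_MEDIAL * N_FINAL)  # size of the modern syllable inventory
    print(decompose("한"))
    print(raw_jamo_coverage("한글"))  # illustrative two-glyph reference set
```

The product 19 × 21 × 28 gives the 11,172-syllable inventory used as the generation target throughout the study.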

Table 6. Experimental design summary.

Experiment	Source	Model(s)	Ref. Set	Primary Variable	RQ
Exp. 1	Jinbeopunhae (regular historical)	DM-Font	10 chars	Feasibility baseline	RQ1
Exp. 2	Irregular contemporary	VQ-Font	10 chars × 3 types	Reference-set composition	RQ2
Exp. 3	King Jeongjo & Queen Hyoui	VQ-Font vs. Diff-Font	10 pangram chars	Model architecture	RQ3

Table 7. Models × source evaluation coverage under matched conditions.

Model	Regular Historical (Jinbeopunhae)	Irregular Contemporary	Royal: King Jeongjo	Royal: Queen Hyoui
DM-Font (GAN)	√ (Exp. 1)	√ matched	—	—
LF-Font (GAN)	—	√ matched	—	—
VQ-Font (VQGAN)	—	√ matched	√ matched	√ matched
Diff-Font (Diffusion)	—	√ matched	√ matched	√ matched

Table 8. Summary of evaluation metrics used in this study.

Metric	Definition/Function	What it Measures	Direction	Used in
SSIM	Structural similarity between a generated glyph and its ground truth via luminance, contrast, and structure.	Structural similarity (luminance/contrast/structure)	Higher Better	Exp. 1–3
PSNR (dB)	Pixel-level reconstruction fidelity, as a log transform of the MSE between paired images.	Pixel-level reconstruction fidelity	Higher Better	Exp. 1–3
LPIPS	Perceptual distance from deep, layer-weighted feature representation.	Perceptual dissimilarity not captured by SSIM/PSNR	Lower Better	Exp. 2–3
Center-axis deviation (px)	Mean absolute displacement between generated and ground-truth glyph centerlines.	Layout stability/axis drift	Lower Better	Exp. 2–3
Stroke-thickness CV	Coefficient of variation (SD/Mean) in stroke widths across the generated glyph set.	Stroke stability	Lower Better	Exp. 2–3
Character-spacing deviation	Mean absolute difference in internal spacing between generated and ground-truth glyphs.	Spacing consistency	Lower Better	Exp. 2–3
Jamo coverage index	Aggregated coverage of initial-consonant, vowel-combination, and final-consonant types in the reference set relative to the target inventories.	Structural representativeness of reference set	Higher Better	Exp. 2
Heatmap structural deviation	Mean intensity of absolute pixel-error maps aggregated across all evaluated glyph pairs.	Where errors concentrate	Lower Better	Exp. 2–3
OCR usability (%)	Proportion of generated glyphs correctly identified by an OCR engine against their ground-truth labels.	Practical legibility/deployability	Higher Better	Exp. 2–3
Expert Rating (ER)	Mean expert score across all criteria on a 5-point Likert scale.	Human typographic/historical judgment	Higher Better	Exp. 1–3
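
Two of the definitions in Table 8 reduce to short formulas: PSNR as a log transform of the mean squared error, and stroke-thickness CV as SD divided by mean. The sketch below illustrates both under stated assumptions: images are flat lists of 8-bit grayscale values (peak value 255), and the CV uses the population standard deviation. Neither choice is confirmed by the source, and the function names are illustrative.

```python
import math
import statistics


def psnr(target, generated, max_value=255.0):
    """PSNR in dB between two equally sized images given as flat pixel lists."""
    if len(target) != len(generated) or not target:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((t - g) ** 2 for t, g in zip(target, generated)) / len(target)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


def stroke_thickness_cv(widths):
    """Coefficient of variation (SD / mean) of measured stroke widths."""
    mean = statistics.fmean(widths)
    if mean == 0:
        raise ValueError("mean stroke width must be non-zero")
    # Population SD is an assumption; the source only states SD/Mean.
    return statistics.pstdev(widths) / mean


if __name__ == "__main__":
    print(psnr([0, 128, 255, 64], [2, 126, 250, 70]))
    print(stroke_thickness_cv([4.0, 4.5, 3.5, 4.0]))
```

Higher PSNR indicates a closer pixel-level match, while a lower CV indicates more uniform stroke widths across the generated glyph set.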

Table 9. Expert evaluation criteria and supporting references.

Criterion	Definition	References
Style Consistency	Uniformity of stylistic characteristics across the font	[7,20]
Similarity to the Original	Fidelity to the style of the given reference characters	[19]
Readability	Ease of character identification and legibility	[9,19]
Typographic Quality	Structural integrity and esthetic completeness (connections, stability)	[3,9]
Historical Atmosphere Reproduction	Expression of historical era and source-specific atmosphere	[3,22]

Table 10. Expert-evaluation sampling budget per experiment and model.

Experiment	Per Category	Glyphs per Expert per Model	Experts
Exp. 1 (DM-Font)	4	40 (4 × 10 categories)	5
Exp. 3 (VQ-Font; Diff-Font)	6	60 (6 × 10 categories); 120 total	5

Table 11. Correlation between expert ratings and quantitative metrics.

Pair (n = 10 Categories)	Pearson r	Spearman ρ	p (Pearson)
Expert rating ↔ SSIM	0.96	0.98	<0.001
Expert rating ↔ PSNR	0.82	0.87	0.004
SSIM ↔ PSNR	0.84	0.87	0.002

Table 12. Quantitative similarity evaluation (Experiment 1).

Character Category	SSIM	PSNR (dB)	Expert Rating (5-pt)
Category 1 (symmetric, simple)	0.92	51	4.5
Category 2 (symmetric, complex)	0.84	33	4.1
Category 3 (high regularity)	0.95	44	4.7
Category 4 (asymmetric initial)	0.65	31	3.8
Category 5 (compound vowel)	0.71	40	3.8
Category 6 (horizontal vowel)	0.61	28	3.7
Category 7 (single batchim)	0.76	36	4.1
Category 8 (compound batchim)	0.55	26	3.1
Category 9 (balanced composition)	0.89	42	4.4
Category 10 (mixed feature)	0.82	35	4.2
Mean	0.77	36.6	4.04
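
The per-category values in Table 12 can be used to recompute the Pearson coefficients reported in Table 11. This is a minimal sketch, assuming that the Table 11 correlations were computed over exactly these ten category rows; the source states n = 10 categories but does not name the table. The function and variable names are illustrative.

```python
import math

# Per-category values transcribed from Table 12 (Categories 1-10, in order).
ssim = [0.92, 0.84, 0.95, 0.65, 0.71, 0.61, 0.76, 0.55, 0.89, 0.82]
psnr_db = [51, 33, 44, 31, 40, 28, 36, 26, 42, 35]
expert = [4.5, 4.1, 4.7, 3.8, 3.8, 3.7, 4.1, 3.1, 4.4, 4.2]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    print("expert vs SSIM:", round(pearson_r(expert, ssim), 2))
    print("expert vs PSNR:", round(pearson_r(expert, psnr_db), 2))
    print("SSIM vs PSNR:  ", round(pearson_r(ssim, psnr_db), 2))
```

Under that assumption the three coefficients round to the Pearson column of Table 11 (0.96, 0.82, and 0.84).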

Table 13. Multi-model benchmark comparison (pangram-type set).

Model	Architecture	SSIM (↑)	LPIPS (↓)
DM-Font	Dual-Memory GAN	0.74	0.94
LF-Font	Localized GAN	0.66	0.89
Diff-Font	Diffusion (DDPM)	0.95	0.57
VQ-Font	VQGAN (Codebook)	0.97	0.41

Table 14. Quantitative similarity evaluation by reference-set type.

Metric	Pangram-Type	Word-Type	Random-Type
SSIM (↑)	0.97	0.92	0.76
PSNR (dB, ↑)	44	38	31
LPIPS (↓)	0.41	0.95	1.28
Center-axis deviation (px, ↓)	0.89	1.97	2.14
Stroke-thickness CV (↓)	0.28	0.86	1.35
Character-spacing deviation (↓)	0.22	0.45	0.87

Table 15. Expert evaluation results (5-Point Likert scale).

Model	Style Consistency	Similarity to the Original	Readability	Typographic Quality	Historical Atmosphere	Overall Mean
VQ-Font	4.3	4.1	4.6	4.2	3.8	4.2
Diff-Font	4.1	4.4	4.7	4.0	4.3	4.3
Better Model	VQ-Font	Diff-Font	Diff-Font	VQ-Font	Diff-Font	Diff-Font

Table 16. Reliability analysis of experts’ evaluations.

Reliability Measure	Value	95% Confidence Interval (CI)	Interpretation
ICC (2, 1), single-measure absolute agreement	0.58	[0.42, 0.74]	Moderate reliability
ICC (2, k), average-measure absolute agreement	0.87	[0.78, 0.94]	Good reliability
Cronbach’s α	0.88	-	High internal consistency
Mean quadratic-weighted Cohen’s κ	0.74	[0.65, 0.82]	Substantial agreement
Kendall’s W	0.81	-	Strong rank concordance
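
The two ICC rows in Table 16 are linked by the Spearman-Brown step-up relation, which converts a single-rater reliability into the reliability of the mean of k raters. The sketch below checks that link, assuming k = 5 (the number of experts listed in Table 10); the function name is illustrative.

```python
def average_measure_icc(single_icc: float, k: int) -> float:
    """Spearman-Brown step-up: reliability of the mean rating of k raters."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * single_icc / (1.0 + (k - 1) * single_icc)


if __name__ == "__main__":
    # ICC(2, 1) = 0.58 from Table 16; k = 5 experts is taken from Table 10.
    print(round(average_measure_icc(0.58, 5), 2))  # 0.87
```

The result agrees with the reported ICC (2, k) of 0.87, which is why a moderate single-rater agreement can still yield good reliability for the pooled expert score.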

Table 17. Error incidence and postprocessing improvement rates.

Error Type	VQ-Font Incidence (%)	Diff-Font Incidence (%)	VQ-Font Improvement (%)	Diff-Font Improvement (%)
Stroke errors	3.2	6.5	75.0	81.5
Component errors	4.1	5.8	75.6	81.0
Connection errors	3.8	5.5	68.4	76.4
Spacing errors	3.9	4.3	74.4	78.6
Total	15.0	22.0	73.3	79.5

Table 18. Component decomposition of the Jamo Coverage Index by reference-set condition.

Coverage Dimension	Pangram	Word	Random
Initial-consonant types (of 8)	8	5	4
Vowel-combination types (of 4)	4	3	2
Final-consonant types	2	2	1
Jamo Coverage Index (aggregate)	0.89	0.62	0.45
Reference budget $|R|$ (held constant)	10	10	10

Table 19. Recommended model-selection framework.

Source Condition	Recommended Model	Reference Set	Expected SSIM	Key Advantage
Regular historical print	DM-Font	10 glyphs	0.70–0.95	Minimal setup
Irregular contemporary	VQ-Font	10 pangram-type	0.90–0.97	Highest structural similarity (SSIM 0.97) in this study
Irregular historical (structure priority)	VQ-Font	10 pangram-type	0.75–0.85	Lower error rate; readability
Irregular historical (style priority)	Diff-Font	10 pangram-type	0.75–0.85	Historical authenticity
Irregular historical (balanced)	VQ-Font + Diff-Font ensemble	10 pangram-type	To be verified	Complementary strengths

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hong, J.; Kim, S. Few-Shot Learning for Irregular Hangeul Typeface Expansion: A Comparative Study of GAN, VQGAN, and Diffusion Models. Electronics 2026, 15, 2633. https://doi.org/10.3390/electronics15122633
