1. Introduction
Recent advances in computer vision have been driven by deep neural networks trained on large, labeled datasets. However, the reliance of such models on abundant annotated data remains a major limitation in domains where labeled samples are scarce, heterogeneous, or costly to obtain [1,2].
Typeface generation is a representative example of this challenge. Unlike conventional image synthesis, font generation requires a model to infer a coherent visual system from a limited set of reference glyphs while preserving both structural consistency and stylistic identity across a large character inventory [3]. This challenge is especially pronounced in Hangeul, whose modern syllabic system comprises 11,172 characters (19 initial consonants × 21 medial vowels × 28 final-consonant options, including the absence of a final consonant) generated through the combinatorial interaction of initial consonants, medial vowels, and optional final consonants. As a result, complete Hangeul typeface production remains one of the most labor-intensive tasks in typography, often requiring the design and refinement of thousands of glyphs over an extended period [4,5].
The problem becomes substantially more difficult when the source typeface is irregular. In this study, irregular Hangeul typefaces refer to sources that exhibit unstable stroke weight, curvature, spacing, axis balance, or component composition beyond the predictable conventions of standardized print fonts. Such irregularity is common in historical woodblock documents, royal calligraphic manuscripts, hand-drawn contemporary typefaces, and decorative experimental letterforms. From a computer vision perspective, these sources are challenging because few-shot models must generalize from a small number of reference glyphs despite high intra-style variation and weak structural regularity. Accordingly, irregular Hangeul restoration is not merely a style-transfer problem; rather, it is a data-efficient visual generation problem in which the model must reconstruct a usable and internally coherent typographic system from sparse and structurally unstable evidence [6,7].
Historical Hangeul materials are particularly important in this context. The seventeenth and eighteenth centuries produced a rich typographic culture spanning woodblock print, movable-type documents, and royal handwriting, yet many of these materials survive only in fragmentary or archive-bound form [8]. Restoring them as usable digital fonts is valuable not only for design practice but also for cultural heritage preservation, scholarly access, and the reuse of historical visual resources. However, conventional restoration workflows depend heavily on expert interpretation and manual redrawing, making large-scale restoration slow, expensive, and difficult to standardize. The need for computational methods is therefore especially acute in irregular and historical Hangeul, where the number of surviving characters is limited and stylistic fidelity must be balanced against structural completeness [4,8].
Recent progress in few-shot font generation (FFG) has created a promising technical foundation for addressing these challenges. Early generative adversarial network (GAN)-based approaches, including GlyphGAN and Dual Memory Font (DM-Font), demonstrated that it is possible to synthesize a complete character set from a restricted number of reference glyphs by learning reusable content and style representations. More recently, vector-quantized generative adversarial network (VQGAN)-based and diffusion-based architectures have substantially advanced the field. VQ-Font introduced discrete codebook-based style encoding and stronger structural regularization, whereas diffusion-based models, such as Diff-Font and related variants, improved stylistic flexibility and visual realism under low-reference conditions [9,10,11]. Despite these advances, comparative evidence remains limited for irregular and historical Hangeul, particularly when the task requires controlled variation in source regularity, reference-set composition, and model family.
Another unresolved issue concerns the design of the reference set itself. In Hangeul, unseen syllables are generated through systematic recombination of recurring sub-syllabic elements, suggesting that the internal composition of a few-shot reference set may influence performance as strongly as, or more strongly than, raw set size. However, most prior studies have treated reference selection as a practical input choice rather than as a formal experimental variable. This is a significant limitation for Hangeul restoration because a poorly designed reference set may underrepresent consonant–vowel patterns, batchim structures (final consonants), or layout types essential for generalization. Likewise, evaluation has often emphasized global image similarity, even though practical font restoration requires broader criteria, including structural consistency, legibility, spacing stability, and historical authenticity [12,13,14]. A meaningful benchmark for irregular Hangeul typefaces therefore requires both architecture-level comparison and a multi-layer evaluation framework that reflects typographic function rather than visual resemblance alone.
Against this background, this study investigates artificial intelligence (AI) generative-model-based restoration and formative expansion of irregular Hangeul typefaces through three progressive experiments. The study addresses the following research questions:
RQ1. Can a GAN-based FFG model (DM-Font) effectively restore a relatively regular 17th-century woodblock Hangeul typeface into a complete, usable digital font?
RQ2. How does the composition of the few-shot reference character set affect the generation quality of a VQGAN-based model (VQ-Font) for irregular Hangeul typefaces?
RQ3. For highly irregular historical calligraphic typefaces, do VQGAN-based (VQ-Font) and diffusion-based (Diff-Font) architectures exhibit different performance characteristics?
To answer these questions, the first experiment examines whether DM-Font provides a restoration baseline for Jinbeopunhae (1693), a relatively regular seventeenth-century woodblock-printed Hangeul typeface. The second experiment evaluates how the composition of a fixed 10-character reference set influences generation quality for an irregular contemporary Hangeul typeface using VQ-Font. The third experiment compares VQ-Font and Diff-Font on the highly irregular royal calligraphic hands associated with King Jeongjo and Queen Hyoui, thereby testing whether VQGAN-based and diffusion-based models exhibit different performance characteristics under extreme irregularity. Across these experiments, generated outputs are evaluated using image-similarity metrics, structure-sensitive indicators, optical character recognition (OCR)-based usability testing, and expert qualitative assessment.
This study makes four contributions, the first three of which are methodological. (1) We introduce the Jamo Coverage Index (JCI), a formally defined metric that quantifies the structural representativeness of a few-shot reference set; although instantiated for Hangeul, JCI generalizes to any compositional writing system and provides a principled basis for designing and comparing few-shot inputs. (2) We formalize reference-set composition as a controlled experimental variable under a fixed sampling budget, isolating the effect of input design from model architecture—an analysis applicable to few-shot generation beyond typography. (3) We propose a multi-layer evaluation protocol for structured generative output that integrates pixel-level, structure-sensitive, OCR-based, and expert assessments with explicit inter-rater reliability and metric–human correlation, addressing the well-known insufficiency of single-metric evaluation. (4) Building on these instruments, we present a unified three-tier irregularity benchmark and a model-selection framework that provide an empirical, application-grounded comparison of GAN-, VQGAN-, and diffusion-based approaches. Together, the methodological instruments are transferable, while the benchmark grounds them in a demanding real-world restoration setting.
This paper substantially extends two prior conference studies by the same authors published in Korean-language design journals [4,8]. The first prior study [8] applied DM-Font to the Jinbeopunhae corpus and reported a single-model feasibility result; it constitutes the methodological precursor to Experiment 1 of the present study. The second prior study [4] applied VQ-Font to an irregular contemporary Hangeul typeface and compared reference-set composition conditions; it constitutes the methodological precursor to Experiment 2. The present paper extends these works in the following ways: (i) it introduces Experiment 3, a new and original comparison of VQ-Font and Diff-Font on two previously unpublished royal calligraphic sources (King Jeongjo and Queen Hyoui); (ii) it integrates all three experiments into a unified benchmarking framework with standardized preprocessing, evaluation metrics, and statistical analysis; (iii) it adds new analyses not reported in either prior work, including cross-experiment SSIM trajectory comparison, postprocessing error recovery analysis, inter-rater reliability assessment (ICC, Cohen’s kappa, Kendall’s W), and the model-selection framework; and (iv) it presents figures, tables, and quantitative results that are either newly produced or substantially revised relative to the prior Korean-language publications. Any figures or tables derived or adapted from the prior studies are individually identified in their respective captions.
3. Materials and Methods
3.1. Problem Definition
We formulate irregular Hangeul typeface restoration and expansion as a few-shot font generation task under limited supervision. Let $\mathcal{R} = \{(x_i, y_i)\}_{i=1}^{|\mathcal{R}|}$ denote a small set of reference glyph images $x_i$ with character labels $y_i$ sampled from a target source typeface, and let $\Omega$ denote the modern Hangeul syllable inventory ($|\Omega| = 11{,}172$). The objective is to learn a generation function $G$ that synthesizes a complete glyph set $\{\hat{x}_y = G(y, \mathcal{R}) : y \in \Omega\}$ while preserving (i) the structural organization of Hangeul syllable-block composition and (ii) the stylistic identity of the source typeface [15].
Unlike regular printed fonts, irregular Hangeul sources exhibit unstable stroke thickness, curvature, connection behavior, axis balance, and internal spacing. The task is therefore not a simple image-to-image transfer problem; instead, the model must infer a coherent typographic system from sparse exemplars and generalize it across thousands of unseen syllable blocks. We operationalize this objective through three progressively more challenging experiments: (1) feasibility of restoration on a relatively regular historical print source, (2) the effect of reference-set composition under a fixed few-shot budget, and (3) architecture comparison under extreme irregularity using royal calligraphy [16,17,18].
Table 2 summarizes the source conditions, reference budgets, and target outputs across the three experiments.
Table 3 summarizes the mathematical notation used consistently throughout the manuscript, grouped by sets, variables, operations, and statistical quantities.
3.2. Data Sources and Corpus
The corpus was designed as an irregularity spectrum rather than a single-source benchmark. All sources were processed into character-level glyph images, each associated with a unique label (Unicode syllable codepoint) to support training and one-to-one evaluation against ground truth [4,8,16].
Table 4 summarizes the corpus extraction and retention statistics for each source, reporting the number of initially detected glyphs, the number retained after quality verification (N_usable), the resulting retention rate (r), and the fixed few-shot reference-set size.
For the royal calligraphy sources, the small number of surviving glyphs (N_usable = 10) means the paired evaluation set is necessarily small; the reported royal-source metrics are leave-the-reference-out, small-sample comparisons and should be interpreted as indicative rather than as large-scale test results. This limitation is noted in Section 6.4.
In all experiments, the model is tasked with generating the full modern inventory (Ω). The ground-truth evaluation set is the intersection between the available source corpus labels and Ω, denoted eval. We evaluate only on eval, where a ground-truth glyph exists, while generation is produced for all labels in Ω.
Selection criteria for reference sets (training support):
Experiment 1: a fixed-size reference set sampled from the usable Jinbeopunhae glyph pool. Selection prioritized (i) clean segmentation, (ii) high contrast, (iii) minimal background noise, and (iv) coverage of diverse syllable-block structures (balanced, compound vowels, batchim variants).
Experiment 2: fixed few-shot budget of 10 reference glyphs with three conditions: pangram-type (structure-maximizing), word-type (lexically grouped), and random-type (low-control baseline).
Experiment 3: pangram-type references for each royal hand (King Jeongjo; Queen Hyoui) to hold reference-set composition constant under architecture comparison.
It is important to distinguish the model’s generation target from the set of labels on which quantitative metrics are computed. In every experiment, the model is required to synthesize the full modern Hangeul inventory Ω. The few-shot reference (support) set is provided during adaptation. To monitor convergence and early stopping, we held out a fixed pool of 1000 syllable labels from the generation inventory (Ω_val), and the remaining 10,062 labels (Ω_gen-test) constitute the labels synthesized for final inspection. These 1000 and 10,062 figures therefore describe the synthesis coverage of the output inventory, not a set of original ground-truth images.
Quantitative metrics (SSIM, LPIPS, PSNR, and the structure-sensitive indicators) require a real reference image and are therefore computed only on the paired evaluation set eval = D_gt ∩ D_gen, i.e., the intersection between the verified ground-truth corpus D_gt and the generated outputs D_gen. Because D_gt is bounded by the number of usable original glyphs (N_usable), the size of eval differs sharply across sources: it is large for the regular woodblock source of Experiment 1 (N_usable = 997) but only about 10 glyphs for each royal hand in Experiment 3 (N_usable = 10). For the royal sources, the reference set and the paired evaluation set are drawn from the same small pool of surviving glyphs; we therefore report the royal-source metrics as evaluations based on the limited set of surviving original glyphs available for comparison. Accordingly, these results should be interpreted in the context of data-scarce historical materials and considered together with the complementary expert assessments and structure-sensitive analyses presented in this study.
Syllables in Ω with no corresponding ground truth (the overwhelming majority for the royal hands) cannot be scored against a reference image. These were excluded from paired-image metrics; they were instead assessed through OCR-based usability and expert qualitative review and were retained for completeness inspection of the generated font system.
All corpus construction and preprocessing were conducted using a hybrid workflow combining programmatic image processing and design-based manual refinement. Initial glyph extraction, binarization, and normalization were implemented using Python-based libraries (e.g., OpenCV 4.8.0 and NumPy 1.24.3), enabling consistent large-scale processing across heterogeneous sources. Unicode labeling and dataset organization were performed through custom Python scripts to ensure one-to-one alignment between glyph images and syllable codepoints.
For the royal calligraphy datasets (King Jeongjo and Queen Hyoui), additional analysis was performed using Adobe Photoshop (contrast enhancement, noise removal, and background correction) and Adobe Illustrator 30.4 (character-level segmentation and boundary verification). The segmented glyphs were subsequently processed within the Python 3.10 pipeline for normalization and quality control. This combined workflow ensured both visual precision and reproducibility, particularly for structurally irregular handwritten sources.
3.3. Preprocessing
To minimize variance unrelated to model architecture or reference-set design, all source images were processed through a unified pipeline: (1) binarization, (2) character-level segmentation, (3) normalization, and (4) quality verification. All glyphs were normalized to 100 × 100 pixels. All preprocessing procedures were implemented in Python 3.10 using OpenCV 4.8.0, NumPy 1.24.3, and scikit-image 0.21.0, with identical parameter settings applied across all datasets to ensure procedural reproducibility [18,21].
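The binarization step can be illustrated with a small, dependency-free sketch of Otsu thresholding, the method named later for the shared binarization threshold. This is only an illustration under stated assumptions: the study used OpenCV, whereas the functions `otsu_threshold` and `binarize` below are hypothetical helpers written in plain Python, and the convention that dark pixels are ink is assumed.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a flat list of 8-bit gray levels (0-255)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]                  # pixels with gray level <= t
        if w_bg == 0:
            continue
        w_fg = total - w_bg              # pixels with gray level > t
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Assumed convention: dark pixels (<= threshold) are ink (1), the rest background (0)."""
    return [1 if p <= threshold else 0 for p in pixels]


# Toy example: three dark ink pixels and three light background pixels.
toy = [20, 25, 30, 200, 210, 220]
t = otsu_threshold(toy)
mask = binarize(toy, t)
```

In the toy example the threshold falls between the dark and light clusters, so the mask separates ink from background.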
A segmented glyph was discarded if it exhibited one or more of the following: (i) incomplete strokes due to segmentation loss, (ii) broken contours or holes inconsistent with the source writing instrument, (iii) severe background artifacts overlapping the foreground, (iv) duplicated instances for the same label where the instance quality was inferior, or (v) label inconsistencies, defined as a mismatch between the segmented region and the assigned Unicode syllable label [6,22,23].
For evaluation, each generated glyph $\hat{x}_y$ was paired with its corresponding ground-truth glyph $x_y$ from the historical or contemporary corpus. Label-based matching was performed using the Unicode syllable label $y$ as the matching key. First, a ground-truth dictionary D_gt was constructed from the usable segmented corpus after verification. Second, a generated dictionary D_gen was constructed from model outputs for all $y \in \Omega$. Third, the evaluation label set was defined as eval = D_gt ∩ D_gen. Finally, quantitative metrics were computed only over paired samples $\{(\hat{x}_y, x_y) : y \in \mathrm{eval}\}$. Unmatched labels, including generated glyphs without ground truth and ground-truth glyphs without corresponding generated output, were excluded from paired-image metrics but retained for qualitative inspection of completeness [24,25,26,27].
All glyphs were normalized to 100 × 100 pixels prior to model input and metric computation. This resolution was selected to ensure cross-source comparability and to match the default input specification of the evaluated models (DM-Font, LF-Font, and Diff-Font), all of which were originally trained and validated at this resolution. Although higher resolutions may better preserve fine calligraphic stroke texture, particularly for the royal calligraphy sources, the 100 × 100 constraint was uniformly applied across all experiments to prevent resolution-induced confounds in cross-architecture comparison. The implications of this resolution trade-off are discussed further in Section 6.4.
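The label-based matching procedure amounts to a key intersection between two dictionaries. The following is a minimal sketch, assuming each dictionary maps a Unicode syllable to its glyph image; the function name `pair_by_label` and the placeholder string "images" are illustrative and not taken from the study's scripts.

```python
def pair_by_label(d_gt, d_gen):
    """Pair generated and ground-truth glyphs that share a Unicode syllable label.

    d_gt  : dict mapping label -> verified ground-truth glyph
    d_gen : dict mapping label -> generated glyph
    Returns (pairs, unmatched_gen, unmatched_gt).
    """
    eval_labels = sorted(d_gt.keys() & d_gen.keys())        # eval = D_gt ∩ D_gen
    pairs = [(label, d_gen[label], d_gt[label]) for label in eval_labels]
    unmatched_gen = sorted(d_gen.keys() - d_gt.keys())      # generated, no ground truth
    unmatched_gt = sorted(d_gt.keys() - d_gen.keys())       # ground truth, not generated
    return pairs, unmatched_gen, unmatched_gt


# Placeholder strings stand in for 100 x 100 glyph images.
d_gt = {"가": "gt_ga", "나": "gt_na"}
d_gen = {"가": "gen_ga", "다": "gen_da"}
pairs, unmatched_gen, unmatched_gt = pair_by_label(d_gt, d_gen)
```

Only the entries in `pairs` would feed the paired-image metrics; the two unmatched lists correspond to the labels retained for qualitative completeness inspection.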
Table 5 reports structural coverage statistics, expressed as the Jamo Coverage Index, for each reference-set type. The table compares pangram-, word-, and random-type reference sets in terms of the number of covered initial consonant types, vowel-combination types, final consonant types, and the aggregated Jamo Coverage Index.
To enable inferential comparison across reference-set types, the Jamo Coverage Index was computed for each of the 10 individual reference glyphs within each condition, yielding 10 observations per group and a total of N = 30 observations across the three conditions (k = 3). Each observation represents the structural coverage contributed by a single reference glyph, defined by the presence or absence of its constituent initial consonant, vowel combination, and final consonant types within the reference set. A one-way ANOVA indicated a significant effect of reference-set type on the Jamo Coverage Index, F(2, 27) = 56.745, p < 0.001, η² = 0.81, with reference-set type accounting for approximately 81% of the variance in structural coverage. Tukey’s HSD post hoc tests showed that pangram-type sets achieved significantly higher coverage than word-type sets, MD = 0.27, SE = 0.029, p < 0.001, and random-type sets, MD = 0.44, SE = 0.029, p < 0.001. Word-type sets also outperformed random-type sets, MD = 0.17, SE = 0.029, p = 0.001. These findings indicate that the three reference-set types constitute statistically distinct structural coverage conditions.
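As a consistency check, the reported effect size can be recovered from the F statistic and its degrees of freedom alone. For a one-way ANOVA, the standard identity relating the two gives:

$$
\eta^2 = \frac{SS_{\mathrm{between}}}{SS_{\mathrm{between}} + SS_{\mathrm{within}}}
       = \frac{F \cdot df_{\mathrm{between}}}{F \cdot df_{\mathrm{between}} + df_{\mathrm{within}}}
       = \frac{56.745 \times 2}{56.745 \times 2 + 27}
       = \frac{113.49}{140.49} \approx 0.81
$$

This agrees with the reported η² = 0.81 for k = 3 groups and N = 30 observations (df = 2 and 27).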
3.4. Compared Models
We compared four few-shot font generation models: DM-Font as the baseline model, LF-Font as an auxiliary GAN-based comparator, VQ-Font as the VQGAN-based model, and Diff-Font as the diffusion-based model. All models were implemented, trained, and evaluated in the same software environment using Python 3.10, PyTorch 2.1.0, CUDA 11.8, and cuDNN 8.9, with identical data splits, preprocessing outputs, random-seed settings, and evaluation scripts applied across model families to ensure reproducibility.
- (1) DM-Font and LF-Font (GAN family). GAN-based font generation learns a generator G and discriminator D through an adversarial objective. In Wasserstein GAN with gradient penalty (WGAN-GP), the training objective is formulated as follows [2,10]:

$$L_{\mathrm{WGAN\text{-}GP}} = \mathbb{E}_{\hat{x} \sim p_g}\left[D(\hat{x})\right] - \mathbb{E}_{x \sim p_r}\left[D(x)\right] + \lambda\, \mathbb{E}_{\bar{x} \sim p_{\bar{x}}}\left[\left(\lVert \nabla_{\bar{x}} D(\bar{x}) \rVert_2 - 1\right)^2\right] \quad (1)$$

where $p_r$ denotes the real data distribution, $p_g$ the generated distribution, $\bar{x}$ a randomly interpolated sample between real and generated data, and $\lambda$ the gradient-penalty coefficient.
DM-Font and LF-Font extend this objective by conditioning both G and D on a content latent $z_c$ and a style latent $z_s$ extracted from the few-shot reference glyphs, such that the generator produces $\hat{x} = G(z_c, z_s)$ for each target character.
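- (2) VQ-Font (VQGAN family). VQGAN introduces a discrete codebook and a quantizer that maps encoder latents to the nearest codebook vector.
A standard VQ-style quantization objective includes reconstruction plus codebook/commitment terms [10,28,29]:

$$L_{\mathrm{VQ}} = \lVert x - \hat{x} \rVert_2^2 + \lVert \mathrm{sg}[z_e(x)] - e \rVert_2^2 + \beta\, \lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2 \quad (2)$$

where $z_e(x)$ is the encoder latent, $e$ the nearest codebook vector, $\beta$ the commitment weight, and $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator.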
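The quantizer described above, which maps an encoder latent to its nearest codebook vector, can be sketched in a few lines. This is a minimal illustration in plain Python, not VQ-Font's implementation: the function names are hypothetical, the toy codebook is invented, and the commitment weight of 0.25 is an assumed value (a common default in the VQ-VAE literature, not a setting reported in this study).

```python
def quantize(latent, codebook):
    """Return (index, vector) of the codebook entry nearest to `latent` in squared L2."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    index = min(range(len(codebook)), key=lambda k: sq_dist(latent, codebook[k]))
    return index, codebook[index]


def commitment_term(latent, code, beta=0.25):
    """Numeric value of beta * ||z_e - sg[e]||^2 (no gradients are tracked here)."""
    return beta * sum((z - e) ** 2 for z, e in zip(latent, code))


# Toy 2-D codebook with three entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
latent = [0.9, 1.2]
index, code = quantize(latent, codebook)
penalty = commitment_term(latent, code)
```

In an actual training loop the stop-gradient operator determines which of the encoder or the codebook receives the gradient of each term; the sketch only evaluates the quantities.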
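- (3) Diff-Font (diffusion family; DDPM). In denoising diffusion probabilistic modeling (DDPM), the forward process adds Gaussian noise [2,20]:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right) \quad (3)$$

and the model learns a reverse denoising model, often trained via noise prediction:

$$L_{\mathrm{simple}} = \mathbb{E}_{x_0, \epsilon, t}\left[\lVert \epsilon - \epsilon_\theta(x_t, t, \mathrm{cond}) \rVert_2^2\right] \quad (4)$$

where $\beta_t$ is the noise-schedule variance at step $t$, $\epsilon \sim \mathcal{N}(0, I)$ is the injected noise, $\epsilon_\theta$ is the denoising network, $x_t$ is a noisy version of $x_0$, and “cond” denotes conditioning information (e.g., content/style features derived from the few-shot references).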
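One step of the DDPM forward process can be sketched as follows. This is an illustration only: the helper names are hypothetical, the glyph is treated as a flat list of pixel values, and the linear schedule from 1e-4 to 0.02 is the default of the original DDPM formulation, assumed here rather than taken from Diff-Font's configuration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * t for t in range(num_steps)]


def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I) for a flat pixel list."""
    scale = math.sqrt(1.0 - beta_t)
    sigma = math.sqrt(beta_t)
    return [scale * x + sigma * rng.gauss(0.0, 1.0) for x in x_prev]


rng = random.Random(0)                  # fixed seed for reproducibility
betas = linear_beta_schedule(1000)
x0 = [1.0, -1.0, 1.0, -1.0]             # toy "glyph" scaled to [-1, 1]
x1 = forward_step(x0, betas[0], rng)    # slightly noised copy of x0
```

Repeating `forward_step` over the whole schedule drives the sample toward pure Gaussian noise; the reverse model is trained to undo these steps given the conditioning features.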
3.5. Experimental Protocol
We employed a three-stage progressive design in which one principal variable was emphasized in each experiment while the overall goal (few-shot restoration and expansion to a complete Hangeul font) remained constant. All experimental runs, including model training, fine-tuning, metric computation, and statistical analysis, were conducted in a unified Python 3.10 environment using PyTorch 2.1.0, CUDA 11.8, NumPy 1.24.3, SciPy 1.10.1, and statsmodels 0.14.0, with fixed random seeds and identical configuration files applied within each experimental condition to ensure reproducibility.
Experiment 1: feasibility baseline, RQ1. DM-Font was trained using 10 reference glyphs with learning rate , batch size 16, 200 epochs, Adam optimizer, and WGAN-GP loss. Evaluation was stratified into ten syllable-block composition categories (symmetric/simple through mixed feature) to analyze failure modes under increasing compositional complexity.
Experiment 2: reference-set composition, RQ2. VQ-Font was trained with a fixed few-shot budget of 10 reference glyphs under three reference-set types (pangram/word/random) with learning rate , batch size 16, 50 epochs, and Adam . A second-stage benchmark then compared DM-Font, LF-Font, Diff-Font, and VQ-Font under the optimized pangram-type reference condition to separate gains due to reference-set design from gains due to architecture.
Experiment 3: architecture under extreme irregularity, RQ3. VQ-Font and Diff-Font were fine-tuned on two royal calligraphic sources (King Jeongjo; Queen Hyoui) using matched preprocessing, a matched 10-character pangram reference set, and matched evaluation criteria. Postprocessing was applied to remove residual noise and correct structural defects before downstream font compilation where feasible.
Table 6 summarizes the experimental design by source, model(s), reference-set specification, primary variable, and research question.
Table 7 maps which model × source combinations were evaluated under matched conditions. The irregular-contemporary column constitutes a complete four-model common-condition benchmark (DM-Font, LF-Font, VQ-Font, Diff-Font under an identical pangram-type 10-glyph reference set, preprocessing, splits, seeds, and metrics), and the two royal columns are complete for the two architectures that remained trainable at extreme irregularity (VQ-Font, Diff-Font). Cells marked “—” were not evaluated and are reported as such; no results are imputed for unevaluated combinations.
The three experiments follow a progressive design, but architecture is isolated under common conditions within two of them: the second stage of Experiment 2 compares DM-Font, LF-Font, VQ-Font, and Diff-Font on the same irregular-contemporary source under an identical optimized pangram-type reference condition and protocol, and Experiment 3 compares VQ-Font and Diff-Font on the royal sources under matched preprocessing, reference set, and evaluation criteria. The design is progressive rather than fully factorial because the three sources differ greatly in usable-glyph availability (N_usable ≈ 997 versus ≈ 10) and the compared models have different native reference budgets (Table 3); forcing every model onto every source under one tiny budget would introduce model–condition confounds rather than remove them.
Table 7 summarizes which model × source cells are evaluated under matched conditions. A full model × source factorial under budget-normalized conditions is identified in Section 6.4 as a direction for future work.
3.6. Evaluation Metrics
The evaluation framework was designed as a multi-layered assessment system to capture pixel-level similarity, typography-specific structural stability, practical usability, and expert judgment. All evaluation metrics were computed in a unified Python 3.10 environment: scikit-image 0.21.0 was used for SSIM and PSNR, PyTorch 2.1.0 with the LPIPS 0.1.4 package for perceptual-distance estimation, and OpenCV 4.8.0 with NumPy 1.24.3 for geometry-based typographic measurements and heatmap analysis. OCR-based usability testing was conducted using Tesseract OCR 5.3.0 (pytesseract 0.3.10) configured with the Hangeul pack. Each generated glyph image was rendered at 100 × 100 pixels, binarized using the same threshold applied during preprocessing (Otsu’s method, OpenCV 4.8.0), and submitted to the OCR engine in single-character recognition mode to obtain a predicted Unicode syllable label. The engine was configured with page segmentation mode 10 (--psm 10), which treats the image as a single character, and OCR engine mode 3 (--oem 3), the default engine mode, which in this configuration selects the LSTM-based recognizer.
OCR usability was selected as a downstream proxy for functional legibility because it operationalizes the question of whether a generated glyph is interpretable as the intended character under a standardized recognition system, complementing perceptual metrics that assess visual resemblance without measuring character identity. Although OCR accuracy alone does not guarantee human legibility, high OCR accuracy (>95%) under a modern recognition engine provides an objective lower bound on deployability and is consistent with evaluation practices in prior font generation research [9,11,22]. To verify robustness, OCR accuracy was additionally checked with a secondary Naver Clova OCR pass on a 500-glyph random subsample, and rank-order agreement between the two engines was confirmed (Spearman’s ρ > 0.95). Identical metric scripts, image-resolution settings, and threshold parameters were applied across all experimental conditions to ensure reproducibility.
OCR usability operationalizes a single, narrow question: whether a generated glyph is automatically recognizable as the intended Unicode character. It is a downstream proxy for functional, machine-level legibility only. It does not, and is not intended to, measure typographic quality, faithfulness of historical reconstruction, or stylistic fidelity to the calligraphic source. We acknowledge that OCR accuracy is a necessary but not sufficient condition for human legibility, and—critically—that it is neither necessary nor sufficient for cultural or visual authenticity, which we assess through separate expert and structural measures.
Global image fidelity was assessed using the Structural Similarity Index Measure (SSIM) and the Peak Signal-to-Noise Ratio (PSNR). For a generated glyph image $\hat{x}$ and its ground-truth image $x$, SSIM is defined as Equation (5) [10,11]:

$$\mathrm{SSIM}(\hat{x}, x) = \frac{(2\mu_{\hat{x}}\mu_{x} + c_1)(2\sigma_{\hat{x}x} + c_2)}{(\mu_{\hat{x}}^2 + \mu_{x}^2 + c_1)(\sigma_{\hat{x}}^2 + \sigma_{x}^2 + c_2)} \quad (5)$$

where $\mu_{\hat{x}}$, $\mu_{x}$ are the mean luminances, $\sigma_{\hat{x}}^2$, $\sigma_{x}^2$ the variances, $\sigma_{\hat{x}x}$ the covariance, and $c_1$, $c_2$ small stabilizing constants. Higher SSIM indicates greater luminance, contrast, and structural correspondence. PSNR was additionally reported, in decibels, as the standard logarithmic transform of the Mean Squared Error between paired images; higher PSNR indicates higher pixel-level reconstruction fidelity. Because pixel-level agreement does not always reflect perceptual quality, the Learned Perceptual Image Patch Similarity (LPIPS) was also computed as a deep-feature perceptual distance using its standard implementation; lower LPIPS indicates better perceptual fidelity.
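To make the two fidelity measures concrete, the sketch below evaluates PSNR and a single-window SSIM on flat pixel lists. It is an illustration under stated assumptions, not the study's metric script: scikit-image computes SSIM over local sliding windows and averages the result, whereas `global_ssim` applies the formula once to the whole image, and the constants use the conventional choice c1 = (0.01·L)² and c2 = (0.03·L)² with L the dynamic range.

```python
import math


def psnr(gen, ref, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((g - r) ** 2 for g, r in zip(gen, ref)) / len(ref)
    if mse == 0:
        return math.inf                      # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def global_ssim(gen, ref, max_val=255.0):
    """SSIM evaluated once over the whole image (single-window simplification)."""
    n = len(ref)
    mu_g, mu_r = sum(gen) / n, sum(ref) / n
    var_g = sum((g - mu_g) ** 2 for g in gen) / n
    var_r = sum((r - mu_r) ** 2 for r in ref) / n
    cov = sum((g - mu_g) * (r - mu_r) for g, r in zip(gen, ref)) / n
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    numerator = (2 * mu_g * mu_r + c1) * (2 * cov + c2)
    denominator = (mu_g ** 2 + mu_r ** 2 + c1) * (var_g + var_r + c2)
    return numerator / denominator


ref = [0, 255, 255, 0, 0, 255]               # toy ground-truth glyph
gen = [0, 255, 200, 0, 30, 255]              # toy generated glyph
score_ssim = global_ssim(gen, ref)
score_psnr = psnr(gen, ref)
```

An identical pair gives an SSIM of 1 and an infinite PSNR, which is why both metrics are reported with "higher is better" direction.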
Because the object of evaluation is a structurally constrained typeface system rather than generic imagery, we further introduced structure-sensitive typographic indicators. In Experiment 2, the internal geometric stability of the generated syllable blocks was quantified by three measures (Equations (6)–(8)). Center-axis deviation is the mean absolute displacement between the vertical centerlines of the generated and ground-truth glyphs:

$$\Delta_{\mathrm{axis}} = \frac{1}{N}\sum_{i=1}^{N}\left| c(\hat{x}_i) - c(x_i) \right| \quad (6)$$

where $c(\cdot)$ is the estimated vertical centerline of a glyph and $N$ is the number of evaluated glyphs; lower values indicate less axis drift. Stroke-thickness stability is the coefficient of variation in measured stroke widths:

$$\mathrm{CV}_{\mathrm{stroke}} = \frac{\sigma_w}{\mu_w} \quad (7)$$

where $\mu_w$ and $\sigma_w$ are the mean and standard deviation of stroke widths; lower values indicate more uniform stroke thickness. Character-spacing deviation is the mean difference between generated and ground-truth internal spacings:

$$\Delta_{\mathrm{space}} = \frac{1}{N}\sum_{i=1}^{N}\left| s(\hat{x}_i) - s(x_i) \right| \quad (8)$$

where $s(\cdot)$ is the measured internal spacing of a glyph; lower values indicate spacing more consistent with the source typeface. Reference-set quality was quantified (Equation (9)) by the number of initial-consonant, vowel-combination, and final-consonant types that the reference set covers, each taken relative to its corresponding target inventory and aggregated into the Jamo Coverage Index (JCI). Higher JCI indicates a more structurally representative few-shot reference set.
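A coverage index of this kind can be sketched by decomposing each reference syllable into its jamo with the standard Unicode arithmetic for precomposed Hangeul syllables (U+AC00 to U+D7A3). The aggregation below is an assumption made for illustration: it takes the unweighted mean of the three coverage ratios and uses the full Unicode inventories (19 initials, 21 medials, 27 finals) as targets. The study's own definition, including its notion of vowel-combination types and any weighting, may differ.

```python
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28    # 28 includes "no final consonant"
HANGUL_BASE = 0xAC00                          # code point of the first syllable


def decompose(syllable):
    """Return (initial, medial, final) jamo indices; final index 0 means no batchim."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < N_INITIAL * N_MEDIAL * N_FINAL:
        raise ValueError(f"{syllable!r} is not a precomposed Hangeul syllable")
    initial, rest = divmod(index, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return initial, medial, final


def jamo_coverage_index(reference):
    """Unweighted mean of initial, medial, and final coverage ratios (assumed form)."""
    parts = [decompose(s) for s in reference]
    initials = {p[0] for p in parts}
    medials = {p[1] for p in parts}
    finals = {p[2] for p in parts if p[2] != 0}   # count only real final consonants
    return (len(initials) / N_INITIAL
            + len(medials) / N_MEDIAL
            + len(finals) / (N_FINAL - 1)) / 3


small = jamo_coverage_index(["가"])
larger = jamo_coverage_index(["가", "한", "글"])
```

Under this assumed form, adding syllables that introduce new initials, vowels, or final consonants raises the index, which is the behavior a pangram-type reference set is designed to exploit.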
To localize recurrent spatial errors, heatmap structural deviation aggregates per-pixel absolute error over the evaluation set:

$$H(u, v) = \frac{1}{N}\sum_{i=1}^{N}\left| \hat{x}_i(u, v) - x_i(u, v) \right| \quad (10)$$

where $H(u, v)$ is the mean error intensity at pixel location $(u, v)$; lower values indicate fewer spatially concentrated structural errors.
In Experiment 3, OCR-based usability assessed whether generated glyphs were operationally legible as deployable font assets. OCR usability is the recognition accuracy over the generated set:

$$\mathrm{Acc}_{\mathrm{OCR}} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[\hat{y}_i = y_i\right] \quad (11)$$

where $\hat{y}_i$ is the OCR-predicted label, $y_i$ the ground-truth label, and $\mathbb{1}[\cdot]$ the indicator function; higher accuracy indicates greater practical legibility. Finally, expert qualitative judgment was summarized as the mean expert rating across criteria:

$$\bar{S} = \frac{1}{E K}\sum_{e=1}^{E}\sum_{k=1}^{K} s_{e,k} \quad (12)$$

where $E$ is the number of evaluators, $K$ is the number of criteria, and $s_{e,k}$ the score from evaluator $e$ on criterion $k$; higher values indicate stronger expert judgment of typographic quality, stylistic fidelity, readability, and historical appropriateness.
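Both summary statistics are simple averages, as the following sketch shows. The function names and the toy labels and ratings are illustrative only and do not reproduce any value from the study.

```python
def ocr_accuracy(predicted, truth):
    """Fraction of glyphs whose OCR-predicted label equals the intended label."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def mean_expert_score(scores):
    """Mean rating over evaluators and criteria; scores[e][k] is evaluator e, criterion k."""
    num_evaluators = len(scores)
    num_criteria = len(scores[0])
    return sum(sum(row) for row in scores) / (num_evaluators * num_criteria)


# Toy data: four glyphs, one misrecognized; two evaluators rating two criteria.
acc = ocr_accuracy(["가", "나", "더", "라"], ["가", "나", "다", "라"])
mean_score = mean_expert_score([[5, 4], [3, 4]])
```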
Table 8 summarizes all evaluation metrics used to assess generated Hangeul glyphs across image fidelity, perceptual similarity, typographic structure, OCR-based usability, and expert judgment. It also reports each metric’s definition or function, preferred direction of interpretation, and the experiment(s) in which it was applied. All outputs were reviewed and verified by the authors, who take full responsibility for the accuracy of reported values.
3.7. Expert Qualitative Evaluation
Expert qualitative evaluation was conducted by a five-member panel (three professional typeface designers with more than ten years of experience and two traditional calligraphers; mean professional experience of 12 years), all independent of the research team and blind to model identity. Glyph samples were drawn by stratified random sampling across the ten syllable-block composition categories defined in Section 3.5. In Experiment 1, each expert evaluated 40 glyphs (four per category × 10 categories) for DM-Font. In Experiment 3, each expert evaluated 60 glyphs per model (six per category × 10 categories), i.e., 120 glyphs in total for the VQ-Font vs. Diff-Font comparison. Each glyph was rated on five criteria (style consistency, similarity to the original, readability, typographic quality, and historical atmosphere reproduction) using a 5-point Likert scale (1 = very low quality; 5 = very high quality), and open-ended feedback was collected. Reported category and model means are averages over the five experts and the sampled glyphs in each cell.
Table 9 summarizes the categories and definitions of the expert evaluation items, as well as relevant prior research.
Table 10 reports the expert-evaluation sampling budget per experiment: the number of glyphs drawn per category, the resulting glyphs rated per expert per model, and the panel size (five experts throughout).
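The stratified sampling budget described above (a fixed number of glyphs per syllable-block category) can be sketched as follows. This is an illustrative example, not the sampling script used in the study; the glyph pools, the seed, and the function name `stratified_sample` are hypothetical.

```python
import random


def stratified_sample(pools, per_category, seed=0):
    """Draw `per_category` glyphs without replacement from each category pool."""
    rng = random.Random(seed)  # fixed seed so the drawn sample is reproducible
    return {cat: rng.sample(glyphs, per_category) for cat, glyphs in pools.items()}


# Hypothetical pools: 20 candidate glyph identifiers in each of ten categories.
pools = {c: [f"C{c}-{i}" for i in range(20)] for c in range(1, 11)}
experiment1 = stratified_sample(pools, per_category=4)  # 4 x 10 = 40 glyphs
experiment3 = stratified_sample(pools, per_category=6)  # 6 x 10 = 60 glyphs per model
```

Sampling within each category, rather than from the pooled set, guarantees that rare compositions such as compound-batchim blocks are represented in every expert's workload.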
To assess whether the quantitative metrics reflect human perception, we correlated the per-category expert ratings with the corresponding quantitative scores across the ten syllable-block categories of Experiment 1 (n = 10 paired observations). Expert rating was strongly and significantly correlated with SSIM (Pearson r = 0.96, p < 0.001; Spearman ρ = 0.98), and moderately to strongly correlated with PSNR (Pearson r = 0.82, p = 0.004; Spearman ρ = 0.87). SSIM and PSNR were themselves correlated (r = 0.84). The strong rank agreement (ρ = 0.98) indicates that the ordering of categories by expert preference closely matches their ordering by structural similarity: categories the experts rated lowest (e.g., Category 8, expert 3.1) were also those with the lowest SSIM (0.55), while the highest-rated category (Category 3, expert 4.7) had the highest SSIM (0.95).
These results support the convergent validity of SSIM as a proxy for perceived structural quality in this benchmark, while the somewhat weaker PSNR association is consistent with PSNR being more sensitive to pixel-level noise than to perceived typographic quality. The correlation does not, however, eliminate the need for the structure-sensitive indicators and expert review: as noted in Section 5, global SSIM can miss spacing and stroke-stability failures that experts and the structural metrics detect, which is why the multi-layer framework is retained.
Table 11 reports correlation between expert ratings and quantitative metrics across the ten syllable-block categories of Experiment 1 (n = 10). The strong SSIM–expert agreement (r = 0.96, ρ = 0.98) indicates that SSIM tracks perceived structural quality in this benchmark.
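For readers who wish to reproduce this kind of agreement analysis on their own per-category scores, the following minimal sketch computes Pearson and Spearman coefficients using only the standard library. It is illustrative only: the per-category values of this study are not reproduced here, and the function names are hypothetical. Spearman's coefficient is computed as the Pearson correlation of average ranks, which handles ties.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))
```

With only ten paired observations, a rank-based coefficient is a useful complement to Pearson's r because it is insensitive to a single extreme category.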
4. Results
4.1. Main Benchmark Results
A three-experiment benchmark was conducted across (i) relatively regular historical woodblock print (Jinbeopunhae), (ii) an irregular contemporary typeface, and (iii) highly irregular royal calligraphy (King Jeongjo and Queen Hyoui).
Experiment 1 (historical print; DM-Font).
Table 12 reports a mean SSIM of 0.77, mean PSNR of 36.6 dB, and mean expert rating of 4.04/5.0 for DM-Font using 10 reference glyphs. Category-wise SSIM values ranged from 0.55 (Category 8) to 0.95 (Category 3). Category 1, Category 3, and Category 9 recorded SSIM values of 0.92, 0.95, and 0.89, respectively, with expert ratings ≥ 4.4.
Figure 4 shows a per-category reconstruction-error map (DM-Font, Experiment 1) of structural fidelity for the ten Hangeul syllable-block categories (C1–C10) on regular historical print. Each cell shows the raw metric value from Table 12 (SSIM; PSNR in dB; mean expert rating on a 5-point scale); cell shading encodes the within-metric normalized deviation from the best-performing category (light = best, dark = worst). Compound-batchim glyphs (C8: SSIM 0.55, PSNR 26 dB, expert 3.10) and the horizontal/compound–vowel clusters (C5, C6) are the dominant failure modes, whereas balanced-composition glyphs (C1, C3, C9) are reconstructed most faithfully.
Table 12 summarizes the evaluation results by category and analyzes them by comparing quantitative and qualitative indicators.
Characters with balanced composition (Categories 1, 3, 9) achieved SSIM 0.89–0.95, while complex batchim clusters (Categories 4, 6, 8) showed SSIM 0.55–0.65. These results establish a positive baseline but motivate the transition to more advanced architectures.
Figure 5 illustrates the category-wise generation outputs of DM-Font trained on 10 reference glyphs extracted from Jinbeopunhae, a 17th-century Hangeul woodblock-printed typeface.
Experiment 2 (irregular contemporary; multi-model comparison under pangram-type references).
Table 13 reports that VQ-Font achieved SSIM 0.97 and LPIPS 0.41, and Diff-Font achieved SSIM 0.95 and LPIPS 0.57. The GAN-based baselines reported lower similarity and higher LPIPS (DM-Font: SSIM 0.74, LPIPS 0.94; LF-Font: SSIM 0.66, LPIPS 0.89).
Figure 6 presents the generation outputs of VQ-Font on an irregular contemporary Hangeul typeface under the pangram-type 10-character reference condition, illustrating how output quality evolves as training progresses across epochs. VQ-Font achieved the highest quantitative performance in the multi-model comparison.
Experiment 3 (royal calligraphy; VQ-Font vs. Diff-Font). Averaged across the two royal sources, Diff-Font recorded a slightly higher SSIM than VQ-Font (0.80 vs. 0.78). VQ-Font recorded lower LPIPS (0.41 vs. 1.28), character-spacing deviation (2.5 vs. 3.8 px), stroke-thickness variation (0.14 vs. 0.18), and center-axis deviation (3.2 vs. 4.1 px).
Figure 7 presents side-by-side generation outputs of VQ-Font and Diff-Font on the highly irregular royal calligraphic hands of King Jeongjo and Queen Hyoui. Across experiments, the SSIM value increased from 0.77 (Experiment 1, DM-Font) to 0.97 (Experiment 2, VQ-Font), and decreased to 0.78 (Experiment 3, VQ-Font on royal calligraphy).
4.2. Ablation on Reference Design
Under a fixed 10-character reference budget, Table 14 reports that the pangram-type reference set achieved SSIM 0.97, PSNR 44 dB, and LPIPS 0.41. The word-type reference set achieved SSIM 0.92, PSNR 38 dB, and LPIPS 0.95. The random-type reference set achieved SSIM 0.76, PSNR 31 dB, and LPIPS 1.28. Table 14 also reports center-axis deviation of 0.89 px (pangram-type), 1.97 px (word-type), and 2.14 px (random-type), as well as stroke-thickness CV of 0.28, 0.86, and 1.35, respectively.
The Jamo Coverage Indices were 0.89 (pangram-type), 0.62 (word-type), and 0.45 (random-type). One-way ANOVA reported significant differences among reference-set types for structural coverage (F = 56.745, p < 0.05) and for downstream performance (F = 112.850, p < 0.05). OCR accuracy under the pangram-type condition was 99.5%, and heatmap structural deviation was 2.1% (random-type: 7.3%).
Table 14 shows the differences in quantitative metrics across the training reference sets.
Figure 8 compares structural similarity (SSIM) across the three benchmark settings. Mean SSIM was 0.77 in Experiment 1 (DM-Font on the historical woodblock print Jinbeopunhae), increased to 0.97 in Experiment 2 (VQ-Font on an irregular contemporary typeface under pangram-type references), and decreased to 0.78 in Experiment 3 (VQ-Font on highly irregular royal calligraphy from King Jeongjo and Queen Hyoui). The non-monotonic trajectory indicates that SSIM is strongly affected by source regularity, with royal calligraphy remaining the most challenging condition. For the reference-set comparison, one-way ANOVA on downstream performance gave F = 112.850, p < 0.05, and the pangram-type set achieved OCR accuracy of 99.5% and heatmap structural deviation of 2.1% (vs. 7.3% for random-type).
4.3. Qualitative Comparison
Table 15 summarizes the mean expert ratings for VQ-Font and Diff-Font across the five evaluation criteria. VQ-Font received higher scores for style consistency (4.3 vs. 4.1) and typographic quality (4.2 vs. 4.0). Diff-Font received higher scores for similarity to the original (4.4 vs. 4.1), readability (4.7 vs. 4.6), and historical atmosphere reproduction (4.3 vs. 3.8). Overall mean scores were 4.2 (VQ-Font) and 4.3 (Diff-Font).
Overall, both models received high ratings, indicating that each model produced generally legible and visually plausible Hangeul glyphs. However, the two models exhibited distinct performance tendencies. VQ-Font received higher ratings for style consistency and typographic quality, whereas Diff-Font received higher ratings for similarity to the original, readability, and historical atmosphere reproduction.
A comparison of the mean scores indicates that VQ-Font showed stronger performance in stylistic consistency. VQ-Font received a mean score of 4.3, compared with 4.1 for Diff-Font. The experts noted that VQ-Font tended to maintain uniform stroke thickness, consistent curvature, and stable compositional proportions across the generated character set. This suggests that the discrete representation mechanism of VQ-Font may contribute to typographic regularization and inter-glyph cohesion. By contrast, Diff-Font showed slightly lower consistency, largely because fine strokes were not always reproduced with the same degree of uniformity across characters.
The qualitative feedback supported this interpretation. Participant 1 stated that “the balance and thickness of the strokes are well aligned, giving the font a strong sense of unity.” Similarly, Participant 2 noted that “the stroke structure is stable, and the overall quality is high.” These comments suggest that VQ-Font was perceived as particularly effective in producing coherent glyph structures and maintaining stable typographic form across the generated font set.
In terms of similarity to the original, Diff-Font received a higher mean score of 4.4, compared with 4.1 for VQ-Font. This result indicates that Diff-Font more effectively reproduced the stylistic characteristics of the reference glyphs, particularly fine curvilinear flow, decorative details, and textural variations in the original writing. The experts emphasized that Diff-Font was better able to preserve fine stylistic features that are central to historically irregular calligraphic sources. Participant 3 commented that “the feel of the strokes and even the fine details of the old characters are well preserved.” Participant 4 further observed that “Diff-Font conveys the naturalness and variation found in human handwriting.” These responses indicate that Diff-Font was more sensitive to source-specific stylistic subtleties.
Readability was rated highly for both models, with Diff-Font receiving a mean score of 4.7 and VQ-Font receiving 4.6. This result indicates that both models generated glyphs that were generally easy to identify and operationally legible. Diff-Font’s slightly higher readability score was attributed to its ability to preserve complete stroke forms without severe omissions or distortions in most evaluated glyphs. However, some experts noted that VQ-Font occasionally produced minor spacing irregularities in complex syllable blocks, which could slightly reduce legibility in specific cases.
For typographic quality, VQ-Font received a higher mean score of 4.2, whereas Diff-Font received 4.0. VQ-Font was evaluated as producing more refined and compositionally stable glyphs, with natural stroke connections and balanced internal spacing. In contrast, Diff-Font occasionally showed local structural instability, including uneven spacing, awkward stroke connections, or slight positional skew in complex characters. Participant 5 commented that “in some characters produced by Diff-Font, the spacing between strokes is uneven, which weakens the balance.” This observation is consistent with the quantitative results, in which Diff-Font showed occasional deviations in character positioning and stroke-level stability.
The largest qualitative difference between the two models was observed in historical atmosphere reproduction. Diff-Font received a mean score of 4.3, substantially higher than VQ-Font’s score of 3.8. Experts noted that Diff-Font more effectively captured the expressive qualities of historical brush writing, including stroke flow, texture, pressure variation, and irregular material traces. These characteristics are particularly important for royal calligraphic and historically degraded sources, where stylistic irregularity is not merely noise but a defining component of the source identity. Participant 3 described Diff-Font outputs as having “a rich atmosphere in each individual character,” while Participant 4 noted that the model better preserved “the flow and texture of historical brush writing.”
By contrast, VQ-Font was viewed as more structurally stable but less expressive in terms of historical materiality. Participant 2 stated that “VQ-Font reproduces the characters accurately and cleanly, but the rustic quality of old typefaces and the texture of brush strokes are less apparent.” Participant 5 similarly noted that “the detailed stylistic expression appears somewhat standardized, which slightly reduces the uniqueness of the original.” These comments indicate that VQ-Font’s strength in structural regularization may also reduce sensitivity to fine-grained stylistic variation.
Overall, the expert evaluation indicates that VQ-Font and Diff-Font exhibit complementary strengths. VQ-Font demonstrated stronger typographic cohesion, structural stability, and design-level refinement, making it advantageous when consistency and font-system regularity are prioritized. Diff-Font, by contrast, showed stronger stylistic fidelity, perceptual richness, and historical atmosphere reproduction, making it more suitable for sources in which irregular brush texture, stroke individuality, and historical expressiveness are central to the target style. These findings are consistent with the quantitative results and further suggest that model architecture influences not only reconstruction accuracy but also the balance between structural regularity and expressive fidelity in few-shot Hangeul font generation.
To ensure the statistical reliability of the expert-based qualitative evaluation, inter-rater reliability and internal consistency were additionally examined. Since all experts evaluated the same set of generated samples using an ordinal five-point Likert scale, inter-rater reliability was assessed using a two-way random-effects intraclass correlation coefficient with absolute agreement, ICC (2, k), together with the single-measure ICC (2,1). Internal consistency across the five qualitative criteria was measured using Cronbach’s alpha. In addition, pairwise quadratic-weighted Cohen’s kappa was computed to account for the ordinal nature of the Likert-scale ratings, and the average weighted kappa was reported across all expert pairs.
Table 16 reports the reliability analysis of experts’ evaluations.
The reliability analysis indicated good agreement among the expert raters. The average-measure ICC was 0.87 (95% CI [0.78, 0.94], p < 0.001), while the single-measure ICC was 0.58 (95% CI [0.42, 0.74], p < 0.001). Cronbach’s alpha across the five evaluation criteria was 0.88, indicating high internal consistency of the qualitative evaluation instrument. The mean pairwise quadratic-weighted Cohen’s kappa was 0.74 (95% CI [0.65, 0.82]), suggesting substantial ordinal agreement among experts. These results support the statistical reliability of the expert ratings reported in Table 15.
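Two of the reliability quantities above follow from standard formulas and can be sketched briefly. The example below is illustrative only and does not use the study's raw ratings: `spearman_brown` applies the standard relation between the single-measure and average-measure coefficients for k raters, and `cronbach_alpha` uses the usual variance-ratio form, where `items[k][j]` is assumed to hold the score on criterion k for rated sample j. Both function names are hypothetical.

```python
from statistics import variance


def spearman_brown(single_measure, k):
    """Average-measure reliability for k raters from the single-measure value."""
    return k * single_measure / (1 + (k - 1) * single_measure)


def cronbach_alpha(items):
    """Cronbach's alpha; items[k][j] is the score on criterion k for sample j."""
    k = len(items)
    totals = [sum(column) for column in zip(*items)]  # per-sample total score
    item_variance = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


# Consistency check on the reported coefficients: ICC(2,1) = 0.58 with five raters.
average_measure = spearman_brown(0.58, 5)  # about 0.873
```

The single-measure value of 0.58 with five raters implies an average-measure value of about 0.87, which agrees with the reported ICC (2, k).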
Figure 9 presents a radar chart and bar comparison of expert qualitative ratings for VQ-Font and Diff-Font across five evaluation criteria—structural consistency, style reproducibility, character completeness, readability, and historical authenticity—assessed on a five-point Likert scale by a panel of typography experts.
4.4. Failure Cases
Table 17 reports error incidence for VQ-Font vs. Diff-Font: stroke errors (3.2% vs. 6.5%), component errors (4.1% vs. 5.8%), connection errors (3.8% vs. 5.5%), and spacing errors (3.9% vs. 4.3%), with total raw error rates of 15.0% and 22.0%, respectively. Postprocessing improvement rates were 75.0% vs. 81.5% (stroke), 75.6% vs. 81.0% (component), 68.4% vs. 76.4% (connection), and 74.4% vs. 78.6% (spacing). After postprocessing, residual error rates were approximately 4.0% (VQ-Font) and 4.5% (Diff-Font).
Figure 10 illustrates the geometric deformation and feature-map representation.
Table 17 shows the types of errors and the error reduction rates for each model. After postprocessing, residual error rates converged: VQ-Font ≈ 4.0%, Diff-Font ≈ 4.5%.
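The residual rates follow arithmetically from the raw error incidence and the postprocessing improvement rates reported above. The short sketch below reproduces that calculation from the reported figures, assuming that each residual is the raw rate multiplied by one minus the improvement rate; the function name `residual_rate` is hypothetical.

```python
# Raw error incidence (%) and postprocessing improvement rates (%) as reported.
RAW = {
    "VQ-Font": {"stroke": 3.2, "component": 4.1, "connection": 3.8, "spacing": 3.9},
    "Diff-Font": {"stroke": 6.5, "component": 5.8, "connection": 5.5, "spacing": 4.3},
}
IMPROVEMENT = {
    "VQ-Font": {"stroke": 75.0, "component": 75.6, "connection": 68.4, "spacing": 74.4},
    "Diff-Font": {"stroke": 81.5, "component": 81.0, "connection": 76.4, "spacing": 78.6},
}


def residual_rate(raw, improvement):
    """Total error (%) left after removing the corrected share of each error type."""
    return sum(raw[kind] * (1 - improvement[kind] / 100) for kind in raw)


residuals = {model: residual_rate(RAW[model], IMPROVEMENT[model]) for model in RAW}
```

The computed residuals round to 4.0% for VQ-Font and 4.5% for Diff-Font, matching the reported values. Summing the listed per-type raw rates gives 15.0% and 22.1%; the small gap from the reported 22.0% Diff-Font total is presumably due to rounding of the per-type values.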
5. Discussion
This study investigated few-shot learning for irregular Hangeul typeface restoration and expansion across three source conditions: relatively regular historical woodblock print, irregular contemporary type, and highly irregular royal calligraphy. By comparing GAN-, VQGAN-, and diffusion-based models under controlled reference-set and evaluation conditions, we aimed to clarify how model architecture, source regularity, and reference-set composition jointly influence generation quality [
23,
30,
31]. The results demonstrate that few-shot Hangeul typeface generation is not determined by model architecture alone. Rather, performance depends on the interaction between the structural complexity of the source typeface, the compositional representativeness of the reference glyphs, and the intended restoration goals, such as structural stability, stylistic fidelity, readability, or historical authenticity [
24,
32,
33].
In Experiment 1, DM-Font established a feasible baseline for restoring a relatively regular seventeenth-century Hangeul woodblock source. The mean SSIM of 0.77, mean PSNR of 36.6 dB, and expert rating of 4.04/5.0 indicate that a GAN-based few-shot font generation model can restore a usable glyph set when the source typeface has comparatively stable stroke weight, spacing, and syllable-block composition. This finding supports the premise that compositional font generation models can be effective when the source domain provides sufficient structural regularity [
34,
35]. However, the category-wise results also show that average performance values may conceal substantial variation among syllable-block types. Categories with balanced or simple compositions achieved high SSIM values, whereas categories containing compound batchim, asymmetric layouts, or complex vowel structures showed lower similarity and weaker expert ratings [
36,
37]. This pattern indicates that the difficulty of Hangeul restoration is not merely a matter of global visual appearance but is strongly affected by internal syllable-block structure [
28,
38]. Therefore, even when a model produces acceptable overall similarity, complex structural subclasses may still require targeted correction, additional training support, or more carefully selected reference glyphs [
1,
39].
Experiment 2 further demonstrated that advanced architectures can substantially improve performance under irregular contemporary conditions. Under the pangram-type reference condition, VQ-Font obtained the highest similarity scores among the four models (SSIM 0.97, LPIPS 0.41), with Diff-Font close behind (SSIM 0.95, LPIPS 0.57); both clearly exceeded the GAN-based DM-Font (SSIM 0.74, LPIPS 0.94) and LF-Font (SSIM 0.66, LPIPS 0.89). These results suggest that VQGAN- and diffusion-based approaches are more robust than earlier GAN-based approaches when the target typeface contains irregular stroke behavior, non-standard spacing, and unstable local structure.
Figure 11 shows that structural similarity tracks perceived quality (DM-Font, Experiment 1). It plots the per-category mean expert quality rating against SSIM for the ten syllable-block categories (C1–C10); marker area is proportional to PSNR and marker color denotes the SSIM fidelity tier (teal: SSIM ≥ 0.85; gold: 0.70 ≤ SSIM < 0.85; rust: SSIM < 0.70). The dashed line is the least-squares fit; the Pearson correlation between SSIM and expert rating across the ten categories is r = 0.96, indicating that structural similarity is a strong proxy for human-perceived quality in this benchmark. OCR accuracy is reported only at the condition level (pangram-type, 99.5%) and is therefore not plotted per category.
VQ-Font’s discrete codebook representation and structure-aware enhancement appear to provide strong regularization for syllable-block geometry [
7,
16,
30]. This is important for Hangeul because the generation task requires not only visually plausible individual glyphs but also consistency across a large combinatorial inventory of 11,172 syllables. The results therefore indicate that VQ-Font is especially effective when the primary objective is to produce a coherent and operationally stable font system from a small number of reference characters.
The reference-set ablation in Experiment 2 is one of the central findings of this study. Under an identical 10-character budget, the pangram-type reference set produced measurably better outcomes than the word-type and random-type reference sets across every reported indicator (e.g., SSIM 0.97 vs. 0.92 vs. 0.76; LPIPS 0.41 vs. 0.95 vs. 1.28), and the difference was statistically significant by one-way ANOVA (F(2,27) = 56.75,
p < 0.001). The pangram-type condition achieved higher SSIM and PSNR, lower LPIPS, lower center-axis deviation, lower stroke-thickness variation, and lower character-spacing deviation. It also achieved higher OCR usability and lower heatmap structural deviation. These results indicate that reference-set composition functions as an experimentally meaningful variable in few-shot Hangeul typeface generation. In other words, the number of reference glyphs alone is insufficient to define the few-shot condition [
23,
39,
40]. Because Hangeul syllables are formed through combinations of initial consonants, medial vowels, and final consonants, the structural coverage of the reference set directly affects the model’s ability to generalize to unseen syllables. The high Jamo Coverage Index of the pangram-type set explains its superior performance: it provided broader compositional evidence within the same reference budget. This finding contributes to efficient-learning research by showing that data efficiency can be improved not only through architectural innovation, but also through principled support-set design [
41,
42,
43].
OCR accuracy must be interpreted with care in the context of historical calligraphic reconstruction. OCR engines are optimized to recognize the identity of a character, not its calligraphic manner. A glyph that has been normalized toward modern, regular stroke forms may be recognized more reliably than a faithful rendering that preserves the irregular brushwork, dry-brush texture, and ductus of the royal source, yet the normalized glyph is, by construction, less historically authentic. Our expert evaluation (Table 15) bears this out: VQ-Font attains higher typographic quality and comparable readability while scoring lower on historical atmosphere than Diff-Font. We therefore treat OCR strictly as a measure of operational legibility and deployability, and we base all claims about visual quality, stylistic fidelity, and cultural/historical authenticity on the expert panel (three type designers and two calligraphers) and on structure-aware measures (SSIM, LPIPS, and heatmap structural deviation), never on OCR [
10,
22,
42].
Two mechanisms could, in principle, explain the pangram-set advantage: broader combinatorial coverage of the reference set (a set-level property), or greater structural representativeness of the individual reference glyphs (a glyph-level property). The decomposition in Table 3 favors the former. The pangram-type set covers all eight initial-consonant types and all four vowel-combination types, whereas the random-type set covers only four and two, respectively; the aggregate Jamo Coverage Index rises monotonically with this component coverage (0.89 vs. 0.62 vs. 0.45). Crucially, the budget is held constant at ten reference glyphs ($|\mathcal{R}| = 10$) across conditions, so the difference is not the number of reference glyphs but which structural components those glyphs collectively expose. Because every one of the 11,172 modern syllables is generated by composing initial, medial, and final Jamo, a reference set that instantiates more of these primitives provides the model with direct evidence for a larger fraction of the compositions it must synthesize, reducing the need to extrapolate unseen component shapes [
12,
13,
16].
Three observations argue against per-glyph representativeness, the glyph-level explanation, as the primary cause. First, all reference glyphs in every condition passed the same quality-verification criteria (Section 3.3), so the conditions do not differ systematically in individual glyph cleanliness. Second, the per-glyph ANOVA treats each reference character as one observation of structural coverage and still yields a large, significant between-conditions effect (F(2,27) = 56.75, p < 0.001, η² = 0.81), indicating that the conditions differ in the coverage that glyphs contribute to the set, not merely in isolated glyph traits. Third, the category-wise results of Experiment 1 show that generation quality drops specifically for compound-batchim, compound-vowel, and asymmetric syllable blocks, that is, for compositions whose constituent Jamo are least likely to be represented by a low-coverage reference set. This is the failure pattern predicted by the coverage account rather than by a uniform glyph-quality account.
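The relation between the reported F statistic and the effect size can be checked directly. The following minimal sketch implements a one-way ANOVA and the identity η² = F·df_between / (F·df_between + df_within); it is illustrative only, the toy groups are hypothetical, and the function names are not taken from the study's analysis code.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a list of samples."""
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared


def eta_squared_from_f(f_stat, df_between, df_within):
    """Effect size implied by an F statistic and its degrees of freedom."""
    return f_stat * df_between / (f_stat * df_between + df_within)


# Reported values: F(2, 27) = 56.75, i.e., three conditions with ten glyphs each.
reported_eta_squared = eta_squared_from_f(56.75, 2, 27)  # about 0.808
```

The reported F(2,27) = 56.75 implies η² ≈ 0.81, consistent with the effect size stated above; the degrees of freedom also match three conditions of ten reference glyphs each.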
Mechanistically, broader component coverage lets the model factorize a syllable into Jamo-level parts it has already observed and recombine them, rather than memorizing whole-glyph appearance [
9,
16]. This is consistent with the architectural behavior we observe: the discrete codebook of VQ-Font, which regularizes syllable-block geometry, benefits most when the reference set supplies a structurally diverse set of component exemplars to quantize against. In short, within the present design, the pangram advantage is most consistently explained as compositional generalization driven by component coverage, a property of how the support set is designed, rather than as an effect of individually superior reference characters.
Table 18 decomposes the Jamo Coverage Index into its component dimensions, making explicit that the pangram set’s advantage stems from broader component coverage, not from a larger reference budget, which is held constant at ten reference glyphs ($|\mathcal{R}| = 10$) across all conditions.
Although the few-shot budget is fixed at only ten reference glyphs, Experiment 2 directly probes the sensitivity of the results to the choice of those glyphs by comparing three alternative compositions under the same budget. The induced variation is substantial (SSIM 0.76–0.97; LPIPS 0.41–1.28) but systematic: downstream quality is monotonically ordered by the Jamo Coverage Index, and reference-set type accounts for about 81% of the variance in structural coverage (one-way ANOVA, η² = 0.81, p < 0.001). These results indicate that performance is stable under reference selections that provide adequate structural coverage and degrades predictably as coverage falls, so the controlling factor is the structural representativeness of the support set rather than its raw size.
Experiment 3 examined the more difficult case of royal calligraphic Hangeul, where stylistic irregularity is not simply noise but a defining feature of the source identity. The comparison between VQ-Font and Diff-Font revealed a clear structure–style trade-off. The two models were close on global SSIM (Diff-Font 0.80 vs. VQ-Font 0.78) and on expert ratings (4.3 vs. 4.2). The more pronounced differences appeared on structure-sensitive indicators, where VQ-Font recorded lower LPIPS (0.41 vs. 1.28), smaller character-spacing deviation (2.5 vs. 3.8 px), lower stroke-thickness variation (CV 0.14 vs. 0.18), and smaller center-axis deviation (3.2 vs. 4.1 px). Neither model is therefore uniformly superior: the two are comparable on global similarity, with VQ-Font more effective in maintaining geometric stability and Diff-Font more effective in preserving expressive stylistic variation. This distinction is theoretically important because it shows that global similarity metrics, such as SSIM, may not fully represent the functional quality of generated typefaces. A glyph may appear visually close to the source image but still contain structural instability that affects readability, spacing behavior, or font-system consistency [
1,
2,
18,
38].
The expert evaluation supports this interpretation. VQ-Font received higher scores for style consistency and typographic quality, whereas Diff-Font received higher scores for similarity to the original, readability, and historical atmosphere reproduction. These results indicate that the two model families have complementary, task-dependent strengths under the royal calligraphy condition: VQ-Font tended to be preferable when restoration prioritized stable layout and consistent stroke organization, whereas Diff-Font tended to be preferable when the goal was to preserve historical atmosphere and expressive irregularity. Because this comparison rests on two royal sources and a limited expert panel, these tendencies should be viewed as condition-specific rather than general rankings. This distinction is particularly relevant for digital heritage applications because historical typeface restoration often involves competing priorities. A restoration model must produce glyphs that are legible and usable as a modern font, but it must also preserve the visual evidence of historical writing practices. The findings of this study suggest that no single model is universally optimal across these objectives. Instead, model selection should be guided by the restoration priority and the regularity of the source material [
41,
42].
The error analysis further clarifies the practical implications of the benchmark. VQ-Font showed lower raw error incidence than Diff-Font across stroke, component, connection, and spacing errors [
1,
34]. However, Diff-Font showed higher postprocessing improvement rates, and the residual error rates of the two models converged after correction. This result indicates that raw generation accuracy and final production usability are related but not identical. In practical restoration workflows, the cost and recoverability of errors may be as important as their initial frequency. For example, a model with a higher raw error rate may still be useful if its errors are systematic and easily correctable during postprocessing. Conversely, a model with fewer errors may impose greater production costs if those errors are structurally complex or difficult to repair. Therefore, evaluation of AI-generated fonts should include workflow-level criteria, including postprocessing burden, vectorization stability, and deployability as a functional font asset [
41,
42].
The multi-layer evaluation framework used in this study provides a more appropriate assessment model for functional glyph generation. SSIM and PSNR were useful for measuring pixel-level similarity, and LPIPS provided perceptual-distance information that was not fully captured by pixel-level metrics. However, structure-sensitive indicators, OCR-based usability, expert evaluation, heatmap structural deviation, and postprocessing analysis were necessary to identify failure modes that global metrics alone could not explain. In this respect, this study supports the broader argument that computer vision benchmarks for structured outputs should incorporate task-specific functional metrics. For font generation, such metrics must reflect not only whether an image resembles the source but also whether the generated glyph can operate as part of a complete typeface system [44,45].
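A toy example may help show why a global pixel metric cannot separate structurally different failures. The sketch below implements the standard PSNR definition in plain Python and applies it to a hypothetical 5 × 5 "glyph" with two different single-pixel errors. The image, the error types, and the function name are illustrative assumptions and do not reproduce the evaluation pipeline used in this study.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio between two equally sized grayscale images.

    Images are nested lists of pixel intensities in [0, max_value].
    """
    squared_diffs = [
        (r - g) ** 2
        for ref_row, gen_row in zip(reference, generated)
        for r, g in zip(ref_row, gen_row)
    ]
    mse = sum(squared_diffs) / len(squared_diffs)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 5x5 "glyph": one vertical stroke (255) on an empty background (0).
reference = [[255 if col == 2 else 0 for col in range(5)] for _ in range(5)]

# Error 1: a stray speckle appears in a corner, far from the stroke.
speckle = [row[:] for row in reference]
speckle[0][0] = 255

# Error 2: the middle stroke pixel is erased, which breaks the stroke in two.
broken = [row[:] for row in reference]
broken[2][2] = 0

print(psnr(reference, speckle), psnr(reference, broken))
```

Both corrupted images differ from the reference by exactly one pixel of the same magnitude, so PSNR assigns them the same score, although only the second one disconnects the stroke. Structure-sensitive indicators are intended to capture that difference.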
The findings also have methodological implications for few-shot learning. We show that the support set is not a passive input but an active design variable. The superiority of the pangram-type reference condition demonstrates that a small number of characters can support high-quality generation when they are selected to maximize structural coverage. This result is particularly relevant for domains where additional data collection is difficult or impossible, such as historical manuscripts, fragmentary archives, endangered scripts, and culturally significant handwritten materials. Rather than increasing the number of references indiscriminately, future few-shot systems may benefit from optimizing reference selection according to compositional diversity, component coverage, and structural representativeness.
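Coverage-oriented reference selection can be sketched as follows. The decomposition arithmetic follows the Unicode layout of precomposed Hangul syllables (19 initials × 21 medials × 28 final slots, including "no final", giving 11,172 syllables). The greedy selection rule and the simple component count are illustrative assumptions: they are neither the Jamo Coverage Index as defined in this paper nor the procedure used to build the pangram-type reference set.

```python
from typing import Iterable, List, Set, Tuple

HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable
NUM_INITIALS, NUM_MEDIALS, NUM_FINALS = 19, 21, 28  # finals include "no final"

Component = Tuple[str, int]


def decompose(syllable: str) -> Tuple[int, int, int]:
    """Return (initial, medial, final) indices of a precomposed syllable."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < NUM_INITIALS * NUM_MEDIALS * NUM_FINALS:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(index, NUM_MEDIALS * NUM_FINALS)
    medial, final = divmod(rest, NUM_FINALS)
    return initial, medial, final


def components(syllable: str) -> Set[Component]:
    """Position-tagged jamo of a syllable; final index 0 means 'no final'."""
    initial, medial, final = decompose(syllable)
    parts = {("initial", initial), ("medial", medial)}
    if final != 0:
        parts.add(("final", final))
    return parts


def greedy_reference_set(
    candidates: Iterable[str], budget: int
) -> Tuple[List[str], Set[Component]]:
    """Pick up to `budget` syllables, each adding the most uncovered jamo."""
    pool = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    chosen: List[str] = []
    covered: Set[Component] = set()
    while pool and len(chosen) < budget:
        best = max(pool, key=lambda s: len(components(s) - covered))
        if not components(best) - covered:
            break  # no remaining candidate adds new coverage
        chosen.append(best)
        covered |= components(best)
        pool.remove(best)
    return chosen, covered


chosen, covered = greedy_reference_set("가각한글", budget=2)
print(chosen, len(covered))
```

A usage note: with a fixed budget, the same loop can be run over the glyphs that actually survive in a source, so that the selected references cover as many distinct initial, medial, and final components as the available material allows.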
Table 19 translates the experimental findings into a practical model-selection framework by aligning source regularity, restoration priority, reference-set design, and expected reconstruction performance. The expected SSIM ranges should be interpreted as condition-specific guidance derived from the present benchmark rather than universal performance guarantees, and the proposed VQ-Font–Diff-Font ensemble condition remains a future validation direction for cases requiring both structural stability and historical stylistic fidelity.
6. Conclusions
6.1. Summary
This study examined irregular Hangeul typeface restoration and expansion as a few-shot learning problem under limited supervision, benchmarking GAN-, VQGAN-, and diffusion-based models across a spectrum of source regularity. The results demonstrate that generation performance is jointly determined by model architecture, source typeface regularity, and reference-set composition. A GAN-based baseline (DM-Font) provided feasible restoration for a relatively regular historical source. For the irregular contemporary typefaces in Experiment 2, VQ-Font attained the highest structural similarity among the evaluated models (SSIM 0.97) when paired with a structurally optimized pangram-type reference set, and the other metrics followed the same condition-specific pattern. Under the highly irregular royal calligraphy conditions of Experiment 3, VQ-Font and Diff-Font showed similar global similarity and expert ratings (SSIM 0.78 vs. 0.80; expert rating 4.2 vs. 4.3), and their differences appeared mainly in structure-sensitive indicators: VQ-Font tended to provide more stable geometric structure and readability, whereas Diff-Font tended to preserve stylistic nuance more faithfully. These findings indicate that successful few-shot Hangeul generation depends on aligning model selection and reference design with the structural properties of the source and the intended restoration objective.
6.2. Theoretical Contribution
The proposed benchmark advances evaluation theory for structured generative models. By integrating architectural comparison, reference-set ablation, pixel-level similarity metrics, perceptual-distance measurement, typography-specific structural indicators, OCR-based usability testing, expert qualitative assessment, and postprocessing recoverability into a single experimental framework, this study offers a more comprehensive methodological model for assessing generative performance in complex writing systems such as Hangeul.
The theoretical significance of this work lies not only in identifying which model performs best under a given source condition, but also in demonstrating why performance changes across regularity levels, how compositional reference coverage affects generalization under a fixed few-shot budget, and how model selection should be interpreted relative to restoration objectives, including structural consistency, stylistic authenticity, operational readability, and digital-heritage preservation. Collectively, these contributions broaden the scope of computer vision research from image-level resemblance toward functionally constrained visual-system reconstruction [4,8,19,20].
Although this study does not propose a new generative backbone, its methodological contributions are not specific to Hangeul. The Jamo Coverage Index operationalizes few-shot reference-set quality for any compositional script, turning an implicit design choice into a measurable, optimizable quantity; the controlled treatment of reference-set composition provides a template for studying input design as a first-class factor in few-shot generation; and the multi-layer, reliability-validated evaluation protocol offers a reusable methodology for assessing structured generative output where global pixel metrics are insufficient. These instruments are independent of the particular architectures evaluated here and can be applied to other low-resource, structure-critical generation problems.
We therefore position the paper as a methodological and empirical contribution: it provides transferable evaluation and input-design instruments (JCI, the composition-as-variable protocol, and the multi-layer assessment) validated within a rigorous comparative benchmark, rather than a new architecture. We have clarified this scope in the Introduction and Conclusion so that the intended contribution is not mistaken for an architecture proposal.
6.3. Practical Contribution
This study provides practical guidance for stakeholders involved in digital heritage preservation, Hangeul typography, archival documentation, museum curation, font production, and AI-assisted design. Rather than treating generated glyphs as isolated images, the proposed framework evaluates them as functional typographic assets that must satisfy visual fidelity, structural consistency, legibility, and deployment readiness. By combining SSIM, PSNR, LPIPS, structure-sensitive indicators, OCR usability, expert review, and postprocessing analysis, this study offers a workflow that enables practitioners to judge whether AI-generated Hangeul fonts are suitable for restoration, publication, education, exhibition, or design application.
The results also provide a stakeholder-oriented model-selection guideline. For relatively regular historical print sources, GAN-based models, such as DM-Font, can serve as a feasible baseline when sufficient reference glyphs are available and minimal setup is required. For irregular contemporary or historically complex sources where readability, spacing stability, and syllable-block consistency are critical, VQ-Font may be the more suitable starting point because it showed more stable structural control under limited reference conditions in our experiments (lower spacing and center-axis deviation; lower LPIPS). When the restoration objective prioritizes historical atmosphere and brush texture, Diff-Font is a reasonable alternative, although its outputs required additional structural verification before deployment.
Finally, this study indicates that reference-set design and production workflow are important for practical success: under the same 10-character budget, a structurally curated pangram-type reference set improved every reported quality indicator. The proposed pipeline, including generation, evaluation, postprocessing, vectorization, and conversion into deployable font formats, can support institutions seeking to restore incomplete historical typefaces, expand irregular contemporary fonts, or develop culturally grounded digital resources. Thus, this study offers an actionable workflow through which researchers, designers, and heritage professionals can integrate few-shot font generation into real-world restoration and production environments while maintaining accountability through metric-based validation and expert judgment.
6.4. Limitations and Future Research Directions
Several limitations of this study suggest directions for future research. First, all glyphs were normalized to 100 × 100 resolution to ensure cross-source comparability and to match the native input specification of the evaluated models. This constrained the representation of fine calligraphic detail, an effect strongest for the royal calligraphy sources: at 100 × 100, principal strokes span roughly 4–7 px, whereas the hairline entries, tapered terminals, and pressure-modulated thin strokes of historical brush writing span only about 1–2 px, which is at or below the raster’s effective sampling limit, so such features tend to merge before any model is applied. This is consistent with the expert evaluation, in which reviewers described the royal-source brush texture as standardized, and with the observation that LPIPS differences on these sources were larger than SSIM differences. Future work should evaluate higher-resolution generation (e.g., 256 × 256), with the architectures retrained at that resolution, to recover stroke texture for fine-detail-critical restoration. Second, the source corpus was restricted to a small number of typefaces, and for the royal calligraphy sources, only about 10 usable original glyphs survive per hand; the corresponding metrics are therefore reported as few-sample, leave-the-reference-out comparisons and interpreted as indicative rather than population-level estimates. Expanding the dataset across a broader range of historical periods and styles would strengthen robustness. Third, the qualitative evaluation relied on a limited expert panel (n = 5); although it achieved good-to-substantial inter-rater reliability (ICC(2,k) = 0.87; Cronbach’s α = 0.88; weighted κ = 0.74), the panel size remains modest. Future work will expand to a larger, institutionally diverse panel, recruit an independent external panel to replicate the ranking under a pre-registered protocol, and add a larger-scale crowdsourced legibility study.
Fourth, the study used a single preprocessing and evaluation pipeline without standardized computational-cost analysis; future benchmarks should examine pipeline variability and include reproducible efficiency metrics. Fifth, although the component decomposition of the Jamo Coverage Index points to compositional coverage as the principal driver of the pangram-set advantage, coverage and per-glyph structural quality are not fully orthogonal in the current design, and a single representative reference set was used per composition type. A decisive test would hold Jamo coverage constant while varying the structural representativeness of individual glyphs (and vice versa)—sampling multiple coverage-matched reference sets and several independent 10-glyph draws at fixed JCI—to report support-set confidence intervals and separate variability due to coverage level from variability due to glyph identity. We identify this coverage-matched robustness ablation as a priority for future work. Sixth, reference-based metrics require ground-truth glyphs, so quality on the portion of the 11,172-syllable inventory lacking ground truth is currently supported by structural-coverage evidence rather than direct measurement. Future work will close this gap through a stratified human spot-check audit over a statistically powered random sample across all Jamo-combination strata, reference-free quality estimators computed over the entire generated set, and a component-level consistency check. These analyses will be reported with full statistics; no projected values are claimed here. 
Finally, this study contributes evaluation and input-design methodology (JCI, reference-set composition as a controlled variable, and a multi-layer assessment protocol) together with a comparative benchmark rather than a new generative architecture. Designing an architecture or objective tailored to highly irregular calligraphic structure, potentially through hybrid or ensemble approaches motivated by the observed VQ-Font/Diff-Font complementarity, is a natural direction for future research on historically complex typefaces.
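The resolution limitation noted first in this subsection can be illustrated with simple scaling arithmetic. The sketch below assumes that stroke width scales linearly with raster size when the same source is rasterized at a higher resolution, and it ignores anti-aliasing, binarization, and scan quality. The function name is hypothetical; the pixel ranges are the approximate values stated above.

```python
def scale_stroke_width(width_px: float, source_size: int = 100,
                       target_size: int = 256) -> float:
    """Stroke width after re-rasterizing at target_size.

    Assumes purely linear scaling (an idealization for illustration).
    """
    return width_px * target_size / source_size


# Approximate stroke-width ranges at 100 x 100, as stated in the text.
stroke_ranges = {
    "principal strokes": (4, 7),
    "hairline and tapered features": (1, 2),
}

for label, (low, high) in stroke_ranges.items():
    scaled_low = scale_stroke_width(low)
    scaled_high = scale_stroke_width(high)
    print(f"{label}: {low}-{high} px at 100 x 100 -> "
          f"{scaled_low:.2f}-{scaled_high:.2f} px at 256 x 256")
```

Under this idealized assumption, features that occupy about 1–2 px at 100 × 100 would occupy roughly 2.6–5.1 px at 256 × 256, which is closer to the width range that principal strokes occupy at the current resolution.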