A Study on the Generation and Evaluation of Illustrations for Chinese Idiom Allusions Based on AIGC

Li, Jingxue; Teng, Youping; Wang, Weijia

doi:10.3390/info17050495

Open AccessArticle

A Study on the Generation and Evaluation of Illustrations for Chinese Idiom Allusions Based on AIGC

by

Jingxue Li

^1,*

,

Youping Teng

¹ and

Weijia Wang

²

¹

School of Art and Archaeology, Hangzhou City University, Hangzhou 310015, China

²

School of Digital Technology and Innovation Design, Jiangnan University, Wuxi 214122, China

^*

Author to whom correspondence should be addressed.

Information 2026, 17(5), 495; https://doi.org/10.3390/info17050495

Submission received: 9 April 2026 / Revised: 7 May 2026 / Accepted: 13 May 2026 / Published: 18 May 2026

(This article belongs to the Section Information Applications)

Download

Browse Figures

Versions Notes

Abstract

As carriers of traditional culture, Chinese idiom allusions contain rich semantic and emotional content. High-quality illustrations of these idioms hold significant potential for applications in cultural communication and education. Although generative artificial intelligence has achieved substantial progress in general image synthesis, it remains challenging to produce idiom illustrations in culture-intensive scenarios that simultaneously preserve cultural symbols, maintain affective ontology, and exhibit high visual aesthetic quality. To address this gap, we propose a three-dimensional evaluation framework—Zhen-Shan-Mei (Truth-Goodness-Beauty)—for idiom illustrations. The ‘Truth’ module uses Chinese vision–language models to quantify cultural symbols; the ‘Goodness’ module applies cross-modal affective analysis to assess affective ontology; and the ‘Beauty’ module computes quantitative aesthetic metrics (composition balance, color harmony, and line expressiveness). Based on this system, an AI-idiom prototype system is constructed to realize closed-loop iteration of generation-evaluation-regeneration and threshold screening. Experiments show that the proportion of illustrations selected by subjects after the “Truth-Goodness-Beauty” screening reaches 78.1%. The results suggest that the proposed method is effective in maintaining cultural symbols, strengthening affective ontology, and improving visual aesthetics and offers a potentially interpretable and reproducible evaluation and optimization framework for culture-intensive image generation tasks.

Keywords:

AIGC; idiom illustration; cultural symbols; affective ontology; visual aesthetics

1. Introduction

Generative Artificial Intelligence (AIGC) has developed rapidly in recent years. From early Generative Adversarial Networks (GANs) [1] to today’s high-quality diffusion models (such as Stable Diffusion) [2] and the efficient alignment of text and visual features by CLIP-like multimodal pre-trained models [3], the quality and controllability of text-driven image generation have been significantly improved. These advancements have improved text-driven image generation, enabling automated creation in illustration, education, and cultural communication.

As important carriers of traditional Chinese culture, Chinese idioms and allusions have unique value when visualized for cultural education and communication. Visualizing idiom allusions not only requires accurately restoring the core scenes and elements in the allusions but also conveying their deep emotions and cultural connotations while meeting the integration needs of modern visual aesthetics. However, when the generation task focuses on such culture-intensive themes, existing methods reveal three main shortcomings. First, mainstream models lack an understanding of culture-specific terms and allusion backgrounds, which can easily lead to semantic deviation [4,5]. Second, common evaluation indicators (such as FID and Inception Score) find it difficult to quantify the consistency of cultural semantics [6] and emotions. Third, existing aesthetic evaluations are mostly based on photography or general visual styles, lacking specific scales for the compositional balance and colors of Chinese idioms. Meanwhile, recent studies have begun to explicitly incorporate cultural relevance into the automated evaluation of generated images. Examples include the Cultural Relevance Index (CRI) [7], proposed for cross-cultural settings, and community-informed assessment rubrics grounded in participatory design [8]. However, these approaches have primarily been developed for non-Chinese cultural contexts, such as Arab culture and South Asian folk artifacts, and have not yet yielded a dedicated metric system or empirical validation specifically for Chinese idiom illustrations. Therefore, relying solely on the existing generation and evaluation methods makes it difficult to ensure the balanced performance of idiom illustrations in terms of cultural symbols, affective ontology, and visual aesthetics.

To address these challenges, we propose a three-dimensional evaluation framework—Zhen–Shan–Mei (Truth–Goodness–Beauty)—and present AI-Idiom, a prototype system for the automatic generation and rigorous selection of the illustrations of Chinese idioms and allusions. The Truth dimension leverages a Chinese vision-language pretrained model to evaluate macro-level semantic consistency and the completeness of fine-grained cultural symbols. The Goodness dimension employs cross-modal sentiment analysis together with fine-grained sub-emotion modeling to quantify the alignment between images and the text’s affective ontology. The Beauty dimension operationalizes illustration aesthetics through quantitative indicators—compositional balance, color harmony, and line expression—that are used to assess visual appeal. Built on this framework, AI-Idiom implements a generate–evaluate–regenerate closed-loop pipeline: prompts are automatically refined and candidate images are retained only if they exceed the predefined threshold criteria, yielding illustrations that are culturally faithful, affectively aligned, and aesthetically compelling (Figure 1).

2. Related Work

2.1. Generative Models and Semantic Alignment

Diffusion-based text-to-image models (e.g., Stable Diffusion) have largely superseded earlier GAN-based approaches in stability and fidelity. However, these models are typically trained on English or lingua-franca corpora and are therefore prone to semantic misalignment when applied within Chinese cultural contexts. Conditional-control architectures such as ControlNet—which employ a frozen pretrained backbone together with trainable auxiliary branches—have been proposed to strengthen the spatial and structural constraints [9], yet they do not explicitly address the grounding of cultural symbols. Research on cultural adaptation has followed two complementary directions. One direction is model and data localization (e.g., Chinese-language pretraining [10] or bilingual fusion [11]) to improve comprehension of metaphors and classical styles; the other is retrieval- or prompt-augmentation [12], which injects cultural context through external knowledge or iterative prompting to improve the rendering of niche cultural elements. Localization affords fundamental improvements in cultural understanding but incurs substantial annotation and compute costs; retrieval-based methods are inexpensive and flexible but provide limited support for complex, multi-agent or temporally sequenced narratives. Because idioms and classical allusions are semantically dense and often episodic, a high-fidelity, culture-sensitive joint semantic-alignment solution remains lacking.

2.2. Sentiment Analysis

Visual affect research has progressed from static classification [13] to methods that embed affective constraints within the generative pipeline—e.g., by constructing affect spaces [14], controlling affective intensity [15], and employing emotion adapters [16], while also supporting composite affective states [17]. Such vectorized controls enable manipulation of emotional expression but remain constrained by three issues: the subjectivity of emotion annotation, cross-cultural disparities in affective semantics, and the absence of standardized mappings between culture-specific affective states and visual symbols. For idioms and classical narratives, affective modeling therefore requires a two-level strategy that jointly accounts for overall polarity and fine-grained sub-emotions, together with targeted fine-tuning on Chinese cultural contexts.

2.3. Aesthetic Evaluation

Automated aesthetic assessment (e.g., AVA, AADB, and models such as NIMA) [18,19,20] has established quantitative frameworks for photographic quality and fine-grained attributes (contrast, composition, color, etc.). These datasets and models, however, are biased toward photography and Western visual styles and do not adequately capture the aesthetic priorities of illustration, traditional Chinese painting, or contemporary “guo feng” (national-style) art. Interpretable indicators—composition metrics, color-difference measures, stroke/texture features, and skeletal edge analysis [21]—have been integrated as training signals or post-processing criteria to enhance visual quality, but the absence of large-scale, standardized annotations for Chinese traditional aesthetics limits the efficacy of such modules in idiom-illustration tasks.

Collectively, diffusion models with cross-modal alignment provide a technical foundation for high-quality generation; Chinese-centric pretraining and retrieval-augmented prompting can partially mitigate cultural gaps; and affective and aesthetic controls offer quantifiable constraints on emotion consistency and visual appeal. Nevertheless, idiom illustration, which is characterized by its multiple roles, temporal structure, and culture-specific affect, exposes three principal gaps: (i) the lack of a joint semantic-alignment strategy that faithfully reconstructs episodic narratives and cultural symbols; (ii) the absence of cross-modal affective mechanisms that reconcile overall polarity with fine-grained sub-emotions; and (iii) the shortage of interpretable, illustration-oriented aesthetic standards tailored to “guo feng” and traditional Chinese art. To address these gaps, this paper proposes a three-dimensional “Truth–Goodness–Beauty” evaluation and generation loop that integrates Chinese visual–language fine-tuning, a two-level affective model, and customized aesthetic metrics for illustration, thereby seeking a balance between interpretability and practical utility.

3. Methods

The proposed method comprises three core modules. In the Truth module, cultural symbols are assessed via macro-level semantic alignment using the Chinese CLIP and micro-level matching of key visual symbols. In the Goodness module, an affective ontology is constructed from both the polarity and sub-emotion signals. In the Beauty module, rule-based aesthetic features are combined with NIMA to quantify visual aesthetics. The overall pipeline proceeds as follows: structured prompts are first generated from extant idiom narratives and meanings to guide image synthesis. The generated illustrations are then scored along the three dimensions—Truth (cultural symbols), Goodness (affective ontology), and Beauty (visual aesthetics). Finally, the highest-scoring illustrations are selected.

3.1. Prompt Generation

To enable the generative model to accurately comprehend the narrative context of idiomatic expressions, we first construct a three-level idiom cultural knowledge graph—comprising the allusion layer, semantic layer, and visual feature layer—to establish, in advance, a deep cultural-semantic constraint framework. On this basis, a standardized prompt-generation procedure is implemented through a two-stage workflow: (i) the idiom’s background story and core moral are extracted from Baidu Baike to guarantee cultural fidelity; (ii) a large language model (e.g., GPT-4) is used to produce painting prompts that explicitly contain four core elements—time, setting, concrete events (including agent behavior of humans/animals), and affect—where the depicted events are constrained by historical accounts or folklore and the affect must match the idiom’s core meaning.

For example, for “shui zhong lao yue” (to scoop the moon from the water), the allusion layer anchors the fable of “the monkeys fishing for the moon” and its core moral of futility and illusion. The semantic layer explicitly identifies four central entities—the ancient well, the reflected moon, the monkey group, and the act of reaching for the moon—as well as their spatial relations, such as the monkey group, tree branches, hanging in a linked chain, and the moon’s reflection in the well. The visual feature layer further specifies visual rendering requirements, including a nighttime forest setting, the form of the old well, and the detailed depiction of monkeys linked head-to-tail in a chain. Under these structured constraints, prompt generation can simultaneously satisfy the requirements of the complete narrative reconstruction, comprehensive coverage of core cultural entities, internal consistency of action logic, and alignment with the intended emotional orientation. Accordingly, the story and moral extracted from Baidu Baike, together with the structural constraints encoded in the knowledge graph, are jointly input into GPT-4, along with the following system prompt: “You are an expert prompt writer for idiom illustrations. Produce a Chinese description that contains the four elements: time + setting + concrete event + affect. The event must accord with historic/folkloric accounts, and the affect must match the idiom’s meaning”. GPT-4 generates a structured description (e.g., “At night, in a rural forest, there is an ancient well reflecting the full moon. A troop of monkeys hangs from the branches around the well in a chain; the lowest monkey reaches into the rippling water to grasp the moon’s reflection. This scene metaphorically denotes the pursuit of vain illusions and the inevitable failure thereof”). As shown in Figure 2, the illustration generated with the structured prompt reproduces core elements such as the “ancient well” and the “chain of monkeys” more faithfully than the illustration produced without structured prompting.

3.2. “Truth” Dimension: Cultural Symbols

The Truth dimension decomposes the computation of idiom cultural-symbol fidelity into two sub-tasks: (1) macro-level semantic similarity, which evaluates the global semantic consistency between the idiom description and the generated illustration in a joint embedding space; and (2) micro-level symbol matching, which evaluates the presence and quality of the specific visual elements referenced by the idiom. This decomposition aligns with Peirce’s triadic semiotic relation (sign–object–interpretant) [22]: the idiom text serves as the sign, the image instantiation as the object, and the matching processes (macro and micro) function as interpretants that validate the cultural meaning.

Macro-level semantic consistency: Macro semantic consistency is computed using a large-scale vision–language pretrained model, Chinese-CLIP (ViT-L/14). Image features and text embeddings are extracted by the respective encoders and are normalized. A cosine-similarity measure is then computed and converted to a standardized semantic-consistency score in the range [0, 1]. Concretely, the CLIPScore formulation is adopted with a weighting coefficient ω = 2.5 applied as described in the experimental Appendix A [23]. This macro score reflects the overall semantic alignment of the generated image with the idiom description.

Micro-level symbol similarity: At the micro level, local regions or object-level features are detected and compared with the list of idiom-specific visual elements (e.g., for “hua long dian jing”, they extract “dragon”, “brush”, and “eye”). A similarity score is computed for each element; the micro-symbol score is obtained as the arithmetic mean of these per-element similarities. The final cultural-symbol score is the weighted fusion of macro and micro scores, with equal significance (50% macro, 50% micro) in the present study.

3.3. “Goodness” Dimension: Affective Ontology

The affective ontology is designed to achieve a hierarchical affective alignment between idiom text and illustration. Because idiom affect often exhibits a two-layer structure—global polarity (macro-polarity) plus fine-grained sub-emotions (micro-emotions)—a single polarity metric is insufficient to distinguish cases that share overall valence but differ in the sub-emotional nuance (e.g., “hua she tian zu” is broadly negative but entails “absurdity/irony”; “jing di zhi wa” is negative but conveys “narrow-mindedness/ignorance”). We therefore adopt a dimensional-categorical hybrid model where the first layer captures the overall polarity and intensity; the second layer captures discrete sub-emotion types and strengths, enabling hierarchically structured, cross-modal alignment and interpretable quantification.

First-level affective polarity: Primary polarity modeling uses the Dalian University of Technology Sentiment Lexicon (DUT; 27,466 sentiment words with raw intensity scale 1–9) [24]. Let

S_{p o s_d e s c}

and

S_{n e g_d e s c}

denote the normalized weighted sums of positive and negative sentiment strengths in the idiom text. Let

{s i m}_{p o s}

and

{s i m}_{n e g}

denote the cosine similarities between the generated image and the positive/negative reference-description sets, respectively. Define the text polarity ratio

P_{t e x t} = S_{p o s_d e s c} / S_{p o s_d e s c} + S_{n e g_d e s c} + ε

and the image polarity ratio

P_{i m g} = {s i m}_{p o s} / {s i m}_{p o s} + {s i m}_{n e g} + ε

, where ε is a small constant to avoid division by zero [25]. The polarity agreement is measured as

M_{p} = 1 - |P_{t e x t} - P_{i m g}|

. To introduce an intensity-reliability calibration, a calibration factor

P_{c a l i b} \in [0.6, 0.9]

[26] (set empirically) is mapped to [0, 1] using a clipped linear mapping [27]; the calibrated polarity match is then computed as

P o l a r i t y M a t c h = M_{p} \times c l i p (P_{c a l i b} - 0.6 / 0.3, 0,1)

.

Second-level sub-emotions: To enhance the theoretical rigor and adaptivity of sub-emotion weighting, we develop a sub-emotion weighted fusion mechanism that integrates the Weber–Fechner psychophysical law with the dataset-conditioned priors. Specifically, based on the “emotion category” field in the DUT, we first select the sub-emotion candidates most relevant to the target idiom and retain the top K = 5 candidates according to the normalized intensity. For each sub-emotion e, let

I_{e}

denote its normalized intensity. To capture the non-linear nature of the human perception of emotional intensity, we define its weight as

w_{e} = {(I_{e} / 100)}^{1.5}

[28], where the exponent 1.5 is motivated by Weber–Fechner–type perceptual modeling. On this basis, we further incorporate a class-balanced prior and a dataset-conditioned prior to calibrate the weights. In small-sample generalization settings, we adopt an unbiased uniform prior, in which case the adaptive weight degenerates to

{w'}_{e} = w_{e}

. Under standard conditions, we first compute the empirical frequency of each sub-emotion within the target idiom class in the training set and normalize these frequencies to obtain a class prior distribution. This prior is then jointly modeled with the sub-emotion intensity term to construct adaptive weights, thereby maintaining an approximately uniform allocation when the sample distribution is balanced while increasing the contribution of highly relevant sub-emotions and suppressing noisy terms when the sub-emotion distribution is skewed. Let

s_{e} \in [0, 1]

denote the cosine similarity between the image and sub-emotion e. The contribution of each sub-emotion is

{s c o r e}_{e} = s_{e} \times w_{e}

. Accordingly, the aggregated sub-emotion score is formulated as

S_{s u b} = \sum_{e = 1}^{K} {s c o r e}_{e} / \sum_{e = 1}^{K} w_{e}

[29,30], which is guaranteed to lie in the interval [0, 1]. The final emotion-ontology score is the balanced average of the primary and secondary components:

E m o t i o n S c o r e = 0.5 \times P o l a r i t y M a t c h + 0.5 \times S_{s u b}

.

3.4. “Beauty” Dimension: Visual Aesthetics

Aesthetic assessment of visual artworks constitutes a central component of illustration analysis. This study integrates rule-based illustration aesthetics (composition, color, and line) with deep learning to produce an aesthetic scoring framework that is both interpretable in artistic terms and generalizable across data. Classical rules provide explainable evaluations of compositional stability and visual harmony, while the data-driven models capture statistical regularities of human subjective taste at scale; together they complement one another and enhance the robustness and credibility of the aesthetic assessment.

3.4.1. Rule-Based Illustration Aesthetics

Composition balance: Guided by the Gestalt principles and the visual balance theory [31,32], a multi-metric quantitative framework was constructed. First, Learned Perceptual Image Patch Similarity (LPIPS) [33] features were used to measure perceptual similarity and analyze symmetry under vertical, horizontal, and rotational transformations. Second, Canny edge detection [34], combined with saliency maps, was employed to compute centroid displacements [35] and, thereby, evaluate dynamic balance. Third, a lever-balance index and saliency sampling analysis based on classical compositional rules (rule of thirds, golden ratio, etc.) [36,37] were introduced to assess compositional stability. Scores from these subcomponents were averaged with the weights and normalized to the range [0, 1].

Color harmony: A compounded model of color harmony was developed by integrating the Munsell color system, the Moon–Spencer luminance-balance principle [38], and the information-theoretic measures. Dominant colors were extracted via K-means clustering [39] and hue coordination was quantified using the CIEDE2000 color-difference formula [40,41]. Luminance balance was assessed in the CIELAB space by analyzing L-channel entropy and gradient statistics. The orderliness of color distribution was further evaluated through the entropy of the dominant-color proportions in relation to contrast [42]. The three component scores were fused to yield an overall harmony measure.

Line expressiveness: A multi-scale, multi-dimensional evaluation scheme was proposed by combining traditional aesthetic concepts with modern deep learning methods. Line features were extracted using Holistically-Nested Edge Detection (HED) [43] and the Canny operator, then quantified along four axes: continuity (endpoint density), density (distance transformed with Gaussian weighting) [44], directional entropy (uncertainty of orientation distribution), and curvature (rate of contour angle variation) [45]. A scene-adaptive mechanism based on DeepLabv3 was incorporated to distinguish dynamic from static regions and adjust subcomponent weights accordingly. The resulting measure was normalized by an S-shaped function and aggregated into a single line-expressiveness score.

3.4.2. Deep-Learning-Based Assessment

We employed a Neural Image Assessment (NIMA) model with a VGG16 backbone, pretrained on the AVA dataset, to predict the aesthetic quality. Input images were preprocessed (resizing, tensor conversion, and normalization), and the model produced a probabilistic score distribution over the 1–10 range. An augmentation-informed reweighting strategy was applied to emphasize high-score probabilities, followed by a nonlinear transformation to optimize the score distribution; the output was then normalized to [0, 1]. The NIMA score was fused with the rule-based aesthetic score via weighted averaging to produce the final aesthetic evaluation.

3.5. System Implementation

To validate the method, 300 illustrations were generated for a selected set of idioms (a total of 150 with relatively high quantitative scores and 150 with relatively low quantitative scores), and a preference study was conducted with 80 design-major students—approximately 70% of the participants preferred the higher-scoring images. The study was conducted in accordance with general ethical guidelines: all participants provided informed consent, and data were de-identified and used solely for academic evaluation.

An AI-idiom prototype was implemented (see Figure 3). Given an idiom, structured prompts are generated, and multiple candidate images are synthesized. Each image is independently scored by the Truth, Goodness, and Beauty modules and retained only if it satisfies the predefined threshold criteria across all three dimensions. The system implements a generate–evaluate–regenerate closed loop where the failing candidates trigger an automatic prompt refinement and iterative regeneration. The example optimization strategies include concretizing missing symbols (e.g., for “hua she tian zu”, add the explicit phrasing “clearly depict superfluous feet on the snake’s body”), intensifying affective cues for low-affect images (e.g., for “dui niu tan qin”, add “the musician frowns slightly, mouth corners turned down to emphasize helplessness”), and providing compositional guidance (e.g., “place Chang’e on the right third of the frame (rule of thirds)”). If threshold criteria remain unmet after three successive iterations, the system issues an instruction recommending manual prompt adjustment.

4. Experiments

4.1. Illustration Generation

This study adopted a culturally themed stratified sampling strategy by selecting four categories—fable, historical anecdote, myth/legend, and literary classic—with five idioms from each category, yielding a total of 20 canonical idioms for the experimental dataset, as shown in Table 1. This experiment selected three mainstream AI image generation models—Doubao(Seedream 4.0), GPT-4(GPT-4o), and Midjourney(V7.1)—for comparative analysis. These models respectively represent a Chinese localized generation model, a general-purpose multimodal foundation model, and an internationally leading commercial-grade image generation model, thereby covering three typical technical paradigms in the current text-to-image generation and ensuring the fairness and comprehensiveness in the cross-model comparison. Images were generated following a standardized prompt paradigm: for each prompt, each platform produced five candidate images, after which only the clearly invalid samples were removed through a standardized screening procedure. Such exclusions were limited to the cases exhibiting missing core cultural symbols, severe visual distortion, illegible text, or complete semantic deviation; no subjective aesthetic preference was used for ranking or selecting the outputs. The screening process was independently conducted by two researchers, and any disagreements were adjudicated by a third researcher. To verify that this filtering strategy did not affect cross-platform comparisons, we further report results based on the full sample and compare them with the filtered valid-sample results to assess consistency. Spearman rank correlation tests showed that the model-performance rankings across the two datasets were highly consistent on the three evaluation dimensions—cultural symbols, emotional ontology, and visual aesthetics—with correlations exceeding 0.95 in all cases (p < 0.001). These results confirm that the exclusion procedure introduced no subjective bias and did not affect fair cross-model comparison or the main conclusions of this study. Representative illustrations generated by different models following the standardized workflow are shown in Figure 4.

4.2. Determination of Optimal Thresholds

To enable measurable, automated filtering of high-quality idiom illustrations, the proposed Truth–Goodness–Beauty multidimensional evaluation framework was applied. Quantitative metrics were combined with expert qualitative assessments to derive optimal decision thresholds for each dimension. These thresholds provide an interpretable basis for the AI-Idiom system’s automated image selection.

4.2.1. Quantitative Computation

For each generated image, scores were computed for the three dimensions—cultural symbols, affective ontology, and visual aesthetics. Overall, Doubao and GPT-4 outperformed Midjourney on the cultural-symbol fidelity and emotional alignment, whereas Midjourney exhibited a marginal advantage on the aesthetic metric but lacked overall balance. For example, for the idiom “hua she tian zu” (drawing legs on a snake), the cultural-symbol scores were 77.7 (Doubao), 76.1 (GPT-4), and 50.7 (Midjourney); the emotional-core scores were 70.8, 70.4, and 63.4, respectively; and the aesthetic scores were 71.1, 75.9, and 73.5. Midjourney’s low cultural and emotional scores resulted from failures to render critical visual elements (the snake and the added legs) and to convey the idiom’s ironic tone, which validates the sensitivity of the proposed quantitative metrics to the low-quality outputs (see Figure 5).

4.2.2. Qualitative Analysis

This study strictly adhered to academic ethics and expert recruitment standards. Twelve independent experts were publicly recruited through university academic databases and design-industry associations. None of the experts had any collaborative, supervisory, or financial relationship with the authors. The expert panel comprised six specialists in illustration design (including university professors in visual communication design and senior designers) and six experts in Chinese cultural studies (including PhD holders and lecturers specializing in idiom culture and classical literature). The inclusion criteria were as follows: (i) at least five years of research or professional experience in the relevant field; (ii) appropriate professional titles and qualifications; and (iii) voluntary participation with signed informed consent. All experts conducted double-blind independent ratings to minimize selection bias and enhance the objectivity of the evaluation results. The evaluation dimensions included cultural symbols, emotional ontology, and visual aesthetics, each measured on a 1–5 Likert scale. The questionnaire items were adapted from the multidimensional Aesthetic Experience Scale (AES) proposed by Hager et al. [46], with modifications to suit the scene characteristics of idiom illustrations. Each dimension contained four measurement items, and Cronbach’s α exceeded 0.98 for all dimensions, indicating excellent internal consistency and stable rating reliability (Table 2).

Prior to intergroup difference analysis, the Shapiro–Wilk test was first used to assess normality, and Levene’s test was employed to examine homogeneity of variance. The results showed that the distributions of the three groups were generally non-normal. In particular, variance heterogeneity was observed in the cultural-symbol dimension (p < 0.0001) and the emotional-ontology dimension (p = 0.0001), whereas only the visual-aesthetics dimension satisfied the homogeneity assumption (p = 0.3632 > 0.05). Accordingly, a robust Welch ANOVA strategy was adopted, with Games–Howell post hoc correction automatically applied for multiple comparisons to control family-wise Type I error inflation. Corrected p-values, global effect size η², and Cohen’s d were also reported. The statistical results are as follows: for the cultural-symbol dimension, F = 27.64, p < 0.0001, η² = 0.434; post hoc tests showed highly significant differences between Doubao and Midjourney (p = 0.0018, d = 1.19), as well as between GPT-4 and Midjourney (p < 0.0001, d = 2.37). For the emotional-ontology dimension, F = 20.80, p < 0.0001, η² = 0.386; all pairwise differences among the three models were significant (p < 0.05), with the GPT-4 versus Midjourney contrast being the most pronounced (p < 0.0001, d = 2.06). For the visual-aesthetics dimension, F = 2.67, p = 0.0827, η² = 0.086, indicating no significant between-group difference, as shown in Figure 6. In the boxplots presented in Figure 6, white squares denote the mean values of the qualitative scores for each model group, while open circles represent outlier data points lying beyond the whiskers of the boxplots. To further validate robustness, we additionally conducted Kruskal–Wallis nonparametric rank-sum tests, and the significance patterns and model-ranking results were fully consistent with those obtained from Welch ANOVA. Overall, the expert ratings were highly consistent with the quantitative results: GPT-4 and Doubao performed better on the cultural and emotional dimensions, whereas Midjourney received lower scores.

4.2.3. Threshold and Model Selection

Based on the combined quantitative and qualitative results, images with expert questionnaire scores no lower than four were labeled as “high-quality illustrations”. The optimal decision threshold was then learned automatically by the five-fold cross-validation stratified by idiom category, using the threshold that maximized the F1 score. To prevent threshold selection from overfitting the current sample set, 20% of the samples were further reserved as an independent test set, also stratified by idiom category, for external validation. The results showed that the F1 score on the independent test set deviated by less than 2% from the cross-validated optimum. Cross-category testing further indicated that the threshold remained stable and effective across all four categories, with no obvious performance degradation, demonstrating good robustness within the idiom domain. Because the threshold was learned from the data distribution and classification performance rather than manually specified, it is objective and highly reproducible.

Threshold sensitivity analysis further showed that, around the optimal threshold identified by cross-validation, precision, recall, and F1-score remained relatively stable across all models. GPT-4 achieved the best overall discriminative performance across the three evaluation dimensions, with optimal thresholds of 0.75 ± 0.03 for cultural symbols (validation F1 = 0.82 ± 0.09), 0.68 ± 0.01 for emotional ontology (validation F1 = 0.86 ± 0.08), and 0.70 ± 0.00 for visual aesthetics (validation F1 = 0.90 ± 0.15), as shown in Figure 7. These results indicate that the thresholds for different dimensions are all reasonably stable. Although the visual-aesthetics dimension exhibited relatively weaker separability, the overall filtering performance remained high.

To validate the practical utility of these thresholds, a user preference study was conducted: (1) participants—80 undergraduate and graduate students majoring in design; (2) task—forced-choice preference between pairs of illustrations before and after the Truth–Goodness–Beauty threshold filtering (40 images in total, two per idiom for 20 idioms); (3) result—the proportion of threshold-filtered illustrations selected by participants was 78.1%. Further analysis indicated that, although the probability-density overlap between high- and low-quality samples was relatively large for the visual-aesthetics dimension (i.e., lower separability), most participants reported that cultural-symbol accuracy and emotional alignment were the primary selection criteria and that acceptable aesthetics sufficed. In summary, GPT-4 combined with the Truth–Goodness–Beauty framework produced illustrations that align closely with human judgments of cultural fidelity, emotional expressiveness, and visual appeal.

4.3. Validation of the Evaluation Framework

To systematically verify the scientific soundness and practical utility of the proposed Truth–Goodness–Beauty evaluation framework, this study conducted quantitative validation from three perspectives—validity, objectivity, and generalization capability—using expert double-blind annotations as the gold standard and CLIPScore as a general-purpose baseline. In addition, the ablation experiments were performed to examine the necessity and incremental benefit of the three-dimensional joint evaluation scheme.

4.3.1. Validity, Objectivity, and Generalization Capability

We first used the expert mean score as the gold standard to evaluate the consistency between the proposed quantitative metrics and human judgments. The results show that the cultural-symbol dimension was significantly positively correlated with the expert ratings, with Spearman’s ρ = 0.353 (p = 0.006) and Pearson’s r = 0.362 (p = 0.005). The correlation between the emotional-ontology dimension and expert ratings was even stronger with Spearman’s ρ = 0.469 (p < 0.0001) and Pearson’s r = 0.593 (p < 0.0001). By contrast, CLIPScore (Spearman’s ρ = 0.172) and CRI (Spearman’s ρ = 0.236) exhibited markedly weaker correlations with expert ratings, indicating that neither general-purpose image–text alignment metrics nor existing cultural-relevance metrics can adequately capture the culturally specific symbolism and affective ontology embodied in idiom illustrations. Further ROC analysis showed that the AUC values for the cultural-symbol, emotional-ontology, and visual-aesthetics dimensions were 0.675, 0.746, and 0.600, respectively, all exceeding the random-discrimination level (AUC = 0.5). Notably, the 95% confidence intervals of the AUCs for the cultural-symbol and emotional-ontology dimensions did not include 0.5, demonstrating stable discriminative ability. The visual-aesthetics dimension showed relatively weaker discriminative power, consistent with the correlation analysis, as shown in Figure 8, Figure 9 and Table 3.

From the perspective of method consistency, the Bland–Altman analysis indicated that 96.7% of the sample points fell within the 95% limits of agreement (LoA), and regression analysis detected no significant proportional bias (p = 0.216). These findings suggest that the proposed quantitative metrics do not exhibit systematic deviation from the expert gold standard and that the two evaluation modes are highly consistent in overall trend, confirming the framework’s stability and reproducibility. Overall, the proposed framework shows favorable discriminative validity in culture-intensive illustration tasks and, compared with the general baseline, more accurately captures the latent cultural semantics, affective connotations, and aesthetic characteristics of the idiom images.

Regarding generalization robustness, the performance ranking of the three models remained consistent across the four idiom subsets covered in this experiment: GPT-4 performed best overall, followed by Doubao, while Midjourney ranked lowest. Moreover, the evaluation results on the cultural-symbol and emotional-ontology dimensions did not show obvious degradation, indicating that the framework is stable in the idiomatic scenarios covered by this study. It should be noted, however, that this conclusion primarily applies to the idiom illustration tasks, and transferability to other culture-intensive tasks still requires further validation on larger-scale datasets.

4.3.2. Ablation Experiments

To verify the necessity and joint effectiveness of the Truth–Goodness–Beauty evaluation framework, we constructed six ablation settings: no filtering, single-dimensional cultural filtering, single-dimensional emotional filtering, single-dimensional aesthetic filtering, full-dimensional filtering with nested cross-validated thresholds, and a 100-run bootstrap random-sampling baseline. The expert mean score and the proportion of high-quality samples were used as evaluation criteria for comparative analysis. The Kruskal–Wallis nonparametric test indicated a marginally significant overall difference among the six strategies (H = 12.228, df = 5, and p = 0.0571). Among them, the full-dimensional filtering strategy based on cross-validated thresholds achieved the best performance, with an expert mean score of 4.172 and a high-quality-sample proportion of 78.6%. After the Dunn–Bonferroni multiple-comparison correction, the pairwise differences did not reach strict statistical significance; however, Cliff’s delta effect-size analysis showed that full-dimensional filtering produced medium practical effects relative to both the no-filtering condition and the random baseline, with clearly improved scores and high-quality-sample ratios. Among the single-dimensional settings, emotional filtering came closest to the full-dimensional scheme, whereas the remaining single-dimension constraints exhibited relatively limited filtering capability, thereby confirming the value of three-dimensional synergistic gains.

In summary, the Truth–Goodness–Beauty dimensions are not a simple linear superposition; rather, they form a mutually constrained and complementary quality-evaluation loop. This framework demonstrates strong effectiveness and stability in terms of filtering efficiency, category robustness, and sample impartiality, as shown in Figure 10.

5. Conclusions

This work proposes an integrated technical pipeline and process design for automated generation and evaluation of idiom illustrations. First, a cultural knowledge graph of idioms is constructed to establish a priori deep cultural-semantic constraints. In addition, historical records and encyclopedia entries (e.g., Baidu Baike) were used as knowledge sources, and GPT-4 was leveraged to generate structured Chinese prompts that improve the cultural grounding of downstream image generation. Second, the paper introduces a three-dimensional evaluation framework—Truth (cultural symbols), Goodness (affective ontology), and Beauty (visual aesthetics)—and operationalizes each dimension as follows: Truth is evaluated via Chinese visual-language pre-trained models to jointly assess macro semantic consistency and micro-symbol presence; Goodness is quantified by a two-level, cross-modal alignment of emotion polarity and fine-grained sub-emotions to measure image–text emotional congruence; Beauty combines interpretable computational aesthetic features (composition, color, and line) with a data-driven NIMA deep aesthetic model to balance theoretical guidance and learned prediction. The system executes a generate–evaluate–regenerate loop with an embedded automatic prompt optimization mechanism. Empirical results indicate that images filtered by this framework achieved a selection rate of 78.1% in human preference tests, suggesting that the approach is promising for improving cultural fidelity, emotional transmission, and visual attractiveness. Moreover, the framework demonstrates promising scenario adaptability and may be extended to applications such as digital cultural heritage preservation, Chinese language education, and customized cultural and creative product design.

Limitations remain. First, although the proposed framework was validated using a stratified sampling strategy and the current experiments demonstrate its scalability to some extent, its present scope remains limited to the single task of idiom illustration generation, and it has not yet been standardized on large-scale, multi-category traditional cultural image datasets. Second, the separability and interpretability of the visual-aesthetics metric are lower than those for cultural symbols and emotional alignment; existing quantitative measures do not fully capture higher-order aesthetic attributes. Third, inter-model variability in emotional mapping and stylistic rendering suggests a need for cross-model adaptation strategies to improve robustness.

Future work will pursue four directions: (1) scale the corpus and build benchmark datasets spanning narrative types and artistic styles to improve generalization; (2) deepen aesthetic evaluation by integrating behavioral and neurophysiological evidence to capture high-order aesthetic features; (3) incorporate human–machine collaboration (e.g., preference-driven reinforcement learning, interactive prompt fine-tuning) to narrow the gap between automatic evaluation and human aesthetic judgment and to enhance the system’s practical applicability; and (4) extend the Truth–Goodness–Beauty framework to other cultural carriers, including classical poetry, mythological narratives, and folk motifs, to establish a more general evaluation system for the digital generation of Chinese traditional culture and further enhance the system’s practical applicability.

Author Contributions

J.L. conceived and designed the study; J.L., Y.T. and W.W. analyzed the data and discussed the results; and J.L. wrote the paper; Y.T. and W.W. assisted with the writing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of Hangzhou City University (Approval No. [HZCU2025-34] and date of approval 30 November 2025).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study prior to participation. All participants were informed of the purpose of the study and their rights, including voluntary participation and the right to withdraw at any time without penalty.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

During the preparation of this manuscript, the authors used Doubao, GPT-4, and Midjourney 6.0 for the creation of Chinese idiom allusion illustrations and Python 3.14.5 data analysis. The authors have reviewed and edited all AI-generated output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1

Table A1. Quantitative Calculation Results of Chinese Idioms.

Idiom	Model	Cultural Symbols (%)			Affective Ontology (%)
		Macro-Level Semantic Consistency	Micro-Level Symbol Similarity	Overall Score	First-level Affective Polarity	Second-Level Sub-Emotions	Overall Score	Rule-Based Illustration Aesthetics			Deep-Learning Based Assessment	Overall Score
		Macro-Level Semantic Consistency	Micro-Level Symbol Similarity	Overall Score	First-level Affective Polarity	Second-Level Sub-Emotions	Overall Score	Composition Balance	Color Harmony	Line Expressiveness	Deep-Learning Based Assessment	Overall Score
hua she tian zu	Doubao	82.6	72.8	77.7	90.4	51.2	70.8	67.4	84.0	72.8	67.5	71.1
	GPT-4	81.7	70.5	76.1	88.6	52.2	70.4	62.3	81.0	96.0	72.1	75.9
	Midjourney	55.7	45.6	50.7	78.9	47.9	63.4	63.6	88.3	77.4	70.6	73.5
jing di zhi wa	Doubao	87.6	72.1	79.8	88.0	53.0	70.5	63.6	82.3	84.2	72.8	74.8
	GPT-4	85.6	75.0	80.3	89.7	50.1	69.9	59.3	83.0	69.7	69.2	69.9
	Midjourney	93.8	70.1	82.0	85.2	52.0	68.6	64.8	81.7	87.3	70.7	74.3
dui niu tan qin	Doubao	93.9	68.9	81.4	93.2	50.0	71.6	63.9	82.9	89.5	67.7	73.2
	GPT-4	87.3	73.8	80.6	95.9	48.9	72.4	57.1	79.9	94.2	78.6	77.8
	Midjourney	94.2	68.6	81.4	82.6	55.4	69.0	59.4	82.6	94.5	72.2	75.5
shui zhong lao yue	Doubao	100.0	69.7	84.9	89.7	44.7	67.2	69.6	81.2	86.2	66.3	72.7
	GPT-4	93.4	70.0	81.7	86.9	48.5	67.7	58.1	75.3	91.2	63.9	69.4
	Midjourney	100.0	69.4	84.7	79.8	50.8	65.3	61.6	80.8	96.5	70.4	75.0
shou zhu dai tu	Doubao	98.6	62.9	80.7	93.5	45.5	69.5	63.8	76.5	70.4	74.8	72.5
	GPT-4	97.0	63.0	80.0	92.1	47.5	69.8	55.8	85.9	98.3	69.1	74.6
	Midjourney	87.0	62.0	74.5	78.4	55.2	66.8	65.6	84.8	82.0	71.1	74.3
zao bi tou guang	Doubao	99.0	60.1	79.5	97.3	49.3	73.3	70.6	81.9	84.6	71.2	75.1
	GPT-4	96.9	65.5	81.2	96.5	50.5	73.5	59.4	75.6	72.2	70.0	69.5
	Midjourney	88.9	56.3	72.6	85.7	52.9	69.3	62.1	82.4	70.4	71.3	71.5
wen ji qi wu	Doubao	91.7	56.5	74.1	94.9	46.3	70.6	59.0	87.1	72.4	74.4	73.6
	GPT-4	87.5	60.7	74.1	92.3	49.9	71.1	61.1	82.5	99.3	76.7	78.8
	Midjourney	92.4	61.1	76.8	79.5	51.9	65.7	56.5	80.4	99.2	68.9	73.8
zhi lu wei ma	Doubao	92.8	64.6	78.7	95.4	46.0	70.7	73.8	82.4	80.9	62.5	70.8
	GPT-4	97.1	70.0	83.5	95.9	47.9	71.9	60.7	82.2	99.3	70.0	75.4
	Midjourney	93.7	63.3	78.5	83.2	54.2	68.7	61.2	81.0	99.3	74.5	77.5
fu jing qing zui	Doubao	85.1	59.7	72.4	92.5	48.1	70.3	68.4	87.6	71.5	74.6	75.2
	GPT-4	86.0	61.2	73.6	90.8	49.2	70.5	60.3	82.4	99.3	69.9	75.3
	Midjourney	64.0	49.8	56.9	76.3	51.9	64.1	59.7	84.3	81.4	76.2	75.7
zhi shang tan bing	Doubao	85.6	63.7	74.6	89.2	41.6	65.4	62.9	82.3	99.2	64.9	73.2
	GPT-4	82.8	66.4	73.6	86.9	48.5	67.7	59.2	78.0	99.3	64.9	71.9
	Midjourney	85.5	58.4	72.0	78.9	50.3	64.6	61.3	82.0	99.3	71.3	76.1
kua fu zhu ri	Doubao	100.0	70.2	85.1	95.5	47.5	71.5	74.8	75.7	70.1	74.9	74.2
	GPT-4	97.6	75.7	86.7	93.8	46.6	70.2	62.5	76.5	88.6	66.5	71.2
	Midjourney	96.5	70.6	83.6	82.1	54.5	68.3	65.0	80.8	95.6	68.6	74.5
nv wa bu tian	Doubao	90.3	71.0	80.6	94.2	46.2	70.2	75.8	79.6	84.9	62.2	71.2
	GPT-4	86.7	79.3	83.0	92.4	48.6	70.5	58.4	76.8	95.7	69.6	73.3
	Midjourney	89.7	68.8	79.2	78.3	53.5	65.9	60.7	87.3	97.8	70.6	76.3
hou yi she ri	Doubao	84.1	66.2	75.1	93.6	47.2	70.4	61.9	85.6	80.3	76.2	76.1
	GPT-4	91.0	78.2	84.6	90.9	48.7	69.8	58.2	74.3	99.1	69.2	73.2
	Midjourney	100.0	78.0	89.0	96.0	45.0	70.5	60.6	87.6	99.3	69.5	76.0
jing wei tian hai	Doubao	90.4	68.2	79.3	91.8	44.6	68.2	63.4	82.9	84.1	75.7	76.3
	GPT-4	86.5	67.7	77.1	88.7	46.3	67.5	61.9	85.1	98.4	68.1	75.0
	Midjourney	94.2	63.7	78.9	82.9	53.9	68.4	53.4	81.9	88.2	67.1	70.8
chang e ben yue	Doubao	96.8	82.0	89.4	97.3	45.9	71.6	65.9	86.0	69.7	68.9	71.4
	GPT-4	85.6	80.2	82.9	93.6	46.2	69.9	58.8	81.9	93.2	77.2	77.6
	Midjourney	89.5	78.7	84.1	81.7	53.3	67.5	62.7	87.0	75.9	74.2	74.7
xiong you cheng zhu	Doubao	100.0	61.8	80.9	95.2	46.6	70.9	67.1	83.0	96.7	62.6	72.4
	GPT-4	87.6	66.9	77.3	90.5	47.9	69.2	60.2	80.4	99.3	67.5	73.7
	Midjourney	100.0	66.8	83.4	83.6	54.4	69.0	61.4	84.3	98.7	69.9	75.7
zhuan xin zhi zhi	Doubao	89.5	64.1	76.8	89.6	43.8	66.7	69.4	84.4	85.2	72.9	76.3
	GPT-4	98.0	62.8	80.4	95.3	45.5	70.4	59.4	79.4	99.0	64.1	71.7
	Midjourney	90.2	60.1	75.1	77.2	50.4	63.8	63.0	85.3	98.6	70.3	76.3
xue hai wu ya	Doubao	98.0	68.2	83.1	94.9	45.5	70.2	60.1	78.3	92.0	69.4	73.1
	GPT-4	88.1	73.4	80.8	95.5	47.5	71.5	56.9	84.8	99.3	69.6	75.0
	Midjourney	86.5	55.9	71.2	80.1	50.3	65.2	60.7	84.4	99.3	71.1	76.3
bu chi xia wen	Doubao	84.9	69.0	76.9	92.3	47.3	69.8	65.2	86.3	96.5	72.3	77.5
	GPT-4	81.6	69.6	75.6	94.3	46.7	70.5	61.7	81.6	99.3	70.6	75.7
	Midjourney	81.4	68.4	74.9	81.6	50.6	66.1	62.3	85.3	99.3	67.9	75.1
shu neng sheng qiao	Doubao	95.8	75.7	85.8	91.5	45.3	68.4	70.9	84.4	83.2	63.4	71.5
	GPT-4	81.6	69.6	80.8	89.8	47.4	68.6	63.0	80.7	98.3	70.1	75.4
	Midjourney	86.0	73.1	79.6	82.9	49.5	66.2	68.6	84.3	97.6	78.6	81.1
average	Doubao			79.8			69.9					73.5
	GPT-4			79.7			70.2					74.1
	Midjourney			76.5			66.8					75.2

Appendix A.2

Table A2. Qualitative questionnaire.

Title 1	Title 2
“Truth” dimension: Cultural Symbols	This image is consistent with the Chinese idiom/story as I understand it.
	I can discern elements of traditional Chinese culture in this image.
	This image helps deepen my understanding of the particular culture.
	The image communicates respect for, and a profound understanding of, the idiom’s background.
“Goodness” Dimension: Affective Ontology	This image evokes a strong emotional response in me.
	The image effectively conveys a distinct emotional tone.
	The image’s atmosphere evokes deep associations or personal memories.
	The image’s visual elements are congruent with its emotional theme.
“Beauty” Dimension: Visual Aesthetics	I find this image highly visually appealing.
	The color palette of the image is harmonious.
	The overall composition is balanced, and the visual focus is clear.
	The use of line in the image is fluid and expressive.

References

Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the NIPS, Vancouver, BC, Canada, 6–12 December 2020; pp. 1–12. [Google Scholar]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the ICML, Virtual, 18–24 July 2021. [Google Scholar]
Huang, J.; Yang, D. Culturally aware natural language inference. In Proceedings of the EMNLP, Singapore, 6–10 December 2023. [Google Scholar]
Huang, Y.; Fan, Z.; He, Z.; Polisetty, S.; Li, W.; Fung, Y.R. Culture CLIP: Empowering CLIP with cultural awareness through synthetic images and contextualized captions. In Proceedings of the Second Conference on Language Modeling, Montreal, QC, Canada, 7–10 October 2025. [Google Scholar]
Stein, G.; Cresswell, J.C.; Hosseinzadeh, R.; Sui, Y.; Ross, B.L.; Villecroze, V.; Liu, Z.; Caterini, A.L.; Taylor, J.E.T.; Loaiza-Ganem, G. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In Proceedings of the NIPS, New Orleans, LA, USA, 10–16 December 2023; pp. 1–53. [Google Scholar]
Elsharif, W.; Agus, M.; Alzubaidi, M.; She, J. Cultural Relevance Index: Measuring Cultural Relevance in AI-Generated Images. In Proceedings of the IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 7–9 August 2024; pp. 410–416. [Google Scholar]
Johnson, N.; Sudharsan, D.; Hamna; Dalal, S.; Holroyd, T.; Thieme, A.; Heidari, H.; Massiceti, D.; Vaughan, J.W.; Morrison, C. Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics. arXiv 2026, arXiv:2604.02406. [Google Scholar]
Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the ICCV, Paris, France, 1–6 October 2023; pp. 3813–3824. [Google Scholar] [CrossRef]
Yang, A.; Pan, J.; Lin, J.; Men, R.; Zhang, Y.; Zhou, J.; Zhou, C. Chinese CLIP: Contrastive vision-language pretraining in Chinese. arXiv 2022, arXiv:2211.01335. [Google Scholar]
Wu, X.; Zhang, D.; Gan, R.; Lu, J.; Wu, Z.; Sun, R.; Zhang, J.; Zhang, P.; Song, Y. Taiyi-Diffusion-XL: Advancing bilingual text-to-image generation with large vision-language model support. arXiv 2024, arXiv:2401.14688. [Google Scholar]
Jeong, S.; Choi, I.; Yun, Y.; Kim, J. Culture-TRIP: Culturally-Aware Text-to-Image Generation with Iterative Prompt Refinement. In Proceedings of the NAACL-HLT, Albuquerque, NM, USA, 29 April–4 May 2025; pp. 9543–9573. [Google Scholar] [CrossRef]
Borth, D.; Ji, R.; Chen, T.; Breuel, T.; Chang, S.-F. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In Proceedings of the ACM Multimedia, Barcelona, Spain, 21–25 October 2013; pp. 223–232. [Google Scholar] [CrossRef]
Yang, J.; Feng, J.; Huang, H. EmoGen: Emotional image content generation with text-to-image diffusion models. In Proceedings of the CVPR, Seattle, WA, USA, 16–22 June 2024; pp. 6358–6368. [Google Scholar] [CrossRef]
Dang, S.; He, Y.; Ling, L.; Qian, Z.; Zhao, N.; Cao, N. EmotiCrafter: Text-to-emotional-image generation based on valence-arousal model. arXiv 2025, arXiv:2501.05710. [Google Scholar]
Yang, J.; Feng, J.; Luo, W.; Lischinski, D.; Cohen-Or, D.; Huang, H. EmoEdit: Evoking emotions through image manipulation. In Proceedings of the CVPR, Nashville, TN, USA, 10–17 June 2025; pp. 24690–24699. [Google Scholar] [CrossRef]
Paskaleva, R.; Holubakha, M.; Ilic, A.; Motamed, S.; Van Gool, L.; Paudel, D. A unified and interpretable emotion representation and expression generation. In Proceedings of the CVPR, Seattle, WA, USA, 16–22 June 2024; pp. 2447–2456. [Google Scholar] [CrossRef]
Murray, N.; Marchesotti, L.; Perronnin, F. AVA: A large-scale database for aesthetic visual analysis. In Proceedings of the CVPR, Providence, RI, USA, 16–21 June 2012; pp. 2408–2415. [Google Scholar] [CrossRef]
Kong, S.; Shen, X.; Lin, Z.; Měch, R.; Fowlkes, C. Photo aesthetics ranking network with attributes and content adaptation. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; Volume 9905, pp. 662–679. [Google Scholar] [CrossRef]
Talebi, H.; Milanfar, P. NIMA: Neural image assessment. IEEE Trans. Image Process. 2018, 27, 3998–4011. [Google Scholar] [CrossRef] [PubMed]
Gu, Y.; Xu, S.; Tang, M.; Dong, J. AI supported computer-generated pen-and-ink illustration. In Proceedings of the CSCWD, London, ON, Canada, 14 July 2001; pp. 227–231. [Google Scholar] [CrossRef]
Peirce, C.S. Logic as semiotic: The theory of signs. In Philosophical Writings of Peirce; Buchler, J., Ed.; Dover Publications: New York, NY, USA, 1940; pp. 98–119. [Google Scholar]
Hessel, J.; Holtzman, A.; Forbes, M.; Le Bras, R.; Choi, Y. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the EMNLP, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7514–7528. [Google Scholar] [CrossRef]
Xu, L.H.; Lin, H.F.; Pan, Y.; Ren, H.; Chen, J.M. Constructing the affective lexicon ontology. J. China Soc. Sci. Tech. Inf. 2008, 27, 180–185. [Google Scholar] [CrossRef]
Zhao, S.; Yao, X.; Yang, J.; Jia, G.; Ding, G.; Chua, T.-S.; Schuller, B.W.; Keutzer, K. Affective image content analysis: Two decades review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 6729–6751. [Google Scholar] [CrossRef]
Xu, X.; Wang, T.; Yang, Y.; Zuo, L.; Shen, F.; Shen, H.T. Cross-modal attention with semantic consistence for image–text matching. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 5412–5425. [Google Scholar] [CrossRef]
Picard, R.W. Affective computing: From laughter to IEEE. IEEE Trans. Affect. Comput. 2010, 1, 11–17. [Google Scholar] [CrossRef]
Sharma, S.; Ramaneswaran, S.; Akhtar, M.S.; Chakraborty, T. Emotion-aware multimodal fusion for meme emotion detection. IEEE Trans. Affect. Comput. 2024, 15, 1800–1811. [Google Scholar] [CrossRef]
Liang, Z.; Li, H.; Zhang, R.; Liu, X. Non-uniform circular-structured loss inspired by psychology for image emotion recognition. Multimed. Syst. 2024, 30, 346. [Google Scholar] [CrossRef]
Lee, G.; Yi, S.; Lee, J. A study on deep learning performances of identifying images’ emotion: Comparing performances of three algorithms to analyze fashion items. Appl. Sci. 2025, 15, 3318. [Google Scholar] [CrossRef]
Zeki, S. Clive Bell’s “Significant Form” and the neurobiology of aesthetics. Front. Hum. Neurosci. 2013, 7, 730. [Google Scholar] [CrossRef] [PubMed]
Zhang, J.; Miao, Y.; Yu, J. A comprehensive survey on computational aesthetic evaluation of visual art images: Metrics and challenges. IEEE Access 2021, 9, 77164–77187. [Google Scholar] [CrossRef]
Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
Canny, J. A computational app.roach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, PAMI-8, 679–698. [Google Scholar] [CrossRef]
Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259. [Google Scholar] [CrossRef]
Luo, Y.; Tang, X. Photo and video quality evaluation: Focusing on the subject. In Proceedings of the ECCV; Forsyth, D., Torr, P., Zisserman, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2008; pp. 1–14. [Google Scholar] [CrossRef]
Tversky, B.; Hemenway, K. Objects, parts, and categories. J. Exp. Psychol. Gen. 1984, 113, 169–193. [Google Scholar] [CrossRef] [PubMed]
Moon, P.H.; Spencer, D.E. Geometric formulation of classical color harmony. J. Opt. Soc. Am. 1944, 34, 46–59. [Google Scholar] [CrossRef]
MacQueen, J.B. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; Le Cam, L.M., Neyman, J., Eds.; University of California Press: Berkeley, CA, USA, 1967; pp. 281–297. [Google Scholar]
Sharma, G.; Wu, W.; Dalal, E.N. The CIEDE2000 color-difference formula: Implementation notes, supplementary test data, and mathematical observations. Color Res. Appl. 2005, 30, 21–30. [Google Scholar] [CrossRef]
Cohen-Or, D.; Sorkine, O.; Gal, R.; Leyvand, T.; Xu, Y.-Q. Color harmonization. ACM Trans. Graph. 2006, 25, 624–630. [Google Scholar] [CrossRef]
O’Donovan, P.; Agarwala, A.; Hertzmann, A. Color compatibility from large datasets. ACM Trans. Graph. 2011, 30, 63. [Google Scholar] [CrossRef]
Xie, S.; Tu, Z. Holistically-nested edge detection. In Proceedings of the ICCV, Santiago, Chile, 7–13 December 2015; pp. 1395–1403. [Google Scholar] [CrossRef]
Lowe, D.G. Organization of smooth image curves at multiple scales. In Proceedings of the International Conference on Computer Vision, Tampa, FL, USA, 5–8 December 1988; pp. 558–567. [Google Scholar] [CrossRef]
Martin, D.R.; Fowlkes, C.C.; Malik, J. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 530–549. [Google Scholar] [CrossRef] [PubMed]
Hager, M.; Hagemann, D.; Danner, D.; Schankin, A. Assessing aesthetic app.reciation of visual artworks—The construction of the Art Reception Survey (ARS). Psychol. Aesthet. Creat. Arts 2012, 6, 320–333. [Google Scholar] [CrossRef]

Figure 1. Research framework diagram of the AI-Idiom system.

Figure 2. Doubao-generated cartoon illustrations of “shui zhong lao yue” ((left) unstructured prompt; (right) structured prompt).

Figure 3. AI-Idiom system prototype.

Figure 4. The generated illustrations (from (left) to (right): GPT-4, Doubao, Midjourney).

Figure 5. Illustration of the Idiom “hua she tian zu” (From (left) to (right): Doubao, GPT-4, Midjourney).

Figure 6. The distribution of the model’s qualitative scores across various dimensions (* p < 0.05, ** p < 0.01, *** p < 0.001).

Figure 7. Diagram of the Optimal Threshold Results.

Figure 8. Scatter plots of the correlations between the quantitative metrics and expert qualitative ratings across the three dimensions.

Figure 9. ROC curves and AUC values for the three dimensions.

Figure 10. Comparison of expert scores and high-quality sample rates across ablation settings.

Table 1. Selected Idioms and Category assignments.

Category	Idioms (Chinese—English)
Fable	画蛇添足 (hua she tian zu, “to draw a snake and add feet”); 井底之蛙 (jing di zhi wa, “a frog at the bottom of a well”); 对牛弹琴 (dui niu tan qin, “playing lute to a cow”); 水中捞月 (shui zhong lao yue, “fishing for the moon”); 守株待兔 (shou zhu dai tu, “waiting by a stump for a rabbit”)
Historical Anecdote	凿壁偷光 (zao bi tou guang, “borrowing light through a hole in the wall”); 闻鸡起舞 (wen ji qi wu, “rise at rooster’s call to practise”); 指鹿为马 (zhi lu wei ma, “calling a deer a horse”); 负荆请罪 (fu jing qing zui, “carry thorned branches to ask forgiveness”); 纸上谈兵 (zhi shang tan bing, “battle on paper”)
Myth/ Legend	夸父逐日 (kua fu zhu ri, “Kuafu chasing the sun”); 女娲补天 (nv wa bu tian, “Nüwa repairing the sky”); 后羿射日 (hou yi she ri, “Houyi shooting the suns”); 精卫填海 (jing wei tian hai, “Jingwei filling the sea”); 嫦娥奔月 (chang e ben yue, “Chang’e ascending to the moon”)
Literary Classic	胸有成竹 (xiong you cheng zhu, “to have a plan in mind”); 专心致志 (zhuan xin zhi zhi, “single-minded devotion”); 学海无涯 (xue hai wu ya, “boundless sea of learning”); 不耻下问 (bu chi xia wen, “not ashamed to ask subordinates”); 熟能生巧 (shu neng sheng qiao, “practice makes perfect”)

Table 2. Results of Cronbach’s α test.

Model	Cultural Symbols	Affective Ontology	Visual Aesthetics
Doubao	0.988	0.991	0.989
GPT-4	0.987	0.987	0.991
Midjourney	0.990	0.989	0.989

Table 3. Correlation comparison between the proposed metrics and the general baseline, using expert ratings as the gold standard.

Evaluation Metric	Pearson r	Spearman ρ	Correlation with Expert Ratings
CLIPScore (Baseline)	0.185 *	0.172 *	weak correlation
Cultural symbols	0.362 **	0.353 **	moderate correlation
Emotional ontology	0.593 **	0.469 **	strong correlation
Visual aesthetics	−0.087 ns	−0.060 ns	weak correlation

Note: ns = not significant (p ≥ 0.05); * p < 0.05; ** p < 0.01.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, J.; Teng, Y.; Wang, W. A Study on the Generation and Evaluation of Illustrations for Chinese Idiom Allusions Based on AIGC. Information 2026, 17, 495. https://doi.org/10.3390/info17050495

AMA Style

Li J, Teng Y, Wang W. A Study on the Generation and Evaluation of Illustrations for Chinese Idiom Allusions Based on AIGC. Information. 2026; 17(5):495. https://doi.org/10.3390/info17050495

Chicago/Turabian Style

Li, Jingxue, Youping Teng, and Weijia Wang. 2026. "A Study on the Generation and Evaluation of Illustrations for Chinese Idiom Allusions Based on AIGC" Information 17, no. 5: 495. https://doi.org/10.3390/info17050495

APA Style

Li, J., Teng, Y., & Wang, W. (2026). A Study on the Generation and Evaluation of Illustrations for Chinese Idiom Allusions Based on AIGC. Information, 17(5), 495. https://doi.org/10.3390/info17050495

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Study on the Generation and Evaluation of Illustrations for Chinese Idiom Allusions Based on AIGC

Abstract

1. Introduction

2. Related Work

2.1. Generative Models and Semantic Alignment

2.2. Sentiment Analysis

2.3. Aesthetic Evaluation

3. Methods

3.1. Prompt Generation

3.2. “Truth” Dimension: Cultural Symbols

3.3. “Goodness” Dimension: Affective Ontology

3.4. “Beauty” Dimension: Visual Aesthetics

3.4.1. Rule-Based Illustration Aesthetics

3.4.2. Deep-Learning-Based Assessment

3.5. System Implementation

4. Experiments

4.1. Illustration Generation

4.2. Determination of Optimal Thresholds

4.2.1. Quantitative Computation

4.2.2. Qualitative Analysis

4.2.3. Threshold and Model Selection

4.3. Validation of the Evaluation Framework

4.3.1. Validity, Objectivity, and Generalization Capability

4.3.2. Ablation Experiments

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A

Appendix A.1

Appendix A.2

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI