
Harnessing Generative Artificial Intelligence to Construct Multimodal Resources for Chinese Character Learning

1 School of Educational Technology, Faculty of Education, Beijing Normal University, Beijing 100875, China
2 Institutes of Science and Development, Chinese Academy of Sciences, Beijing 100190, China
* Author to whom correspondence should be addressed.
Systems 2025, 13(8), 692; https://doi.org/10.3390/systems13080692
Submission received: 3 June 2025 / Revised: 8 August 2025 / Accepted: 11 August 2025 / Published: 13 August 2025

Abstract

In Chinese character learning, distinguishing similar characters is challenging for learners regardless of their proficiency. The difficulty stems from the complex orthography (visual word form) linking symbol, pronunciation, and meaning. Multimedia learning is a promising approach for implementing Chinese character learning strategies. However, the availability of multimodal resources specifically designed for distinguishing similar Chinese characters is limited. Building on the rapid development of generative artificial intelligence (GenAI), we propose a practical framework for constructing multimodal resources, enabling flexible and semi-automated resource generation for Chinese character learning. The framework first constructs image illustrations, owing to their broad applicability across learning contexts. Four other types of multimodal resources implementing learning strategies for similar character learning (summary slides, micro-videos, self-test questions, and basic information) can then be developed in future work. An experiment was conducted in which one group received the constructed multimodal resources and the other received traditional text-based resources for similar character learning. We examined the participants’ learning performance, motivation, satisfaction, and attitudes. The results showed that the multimodal resources significantly improved performance in distinguishing simple characters but were not suitable for non-homophones, i.e., visually similar characters with different pronunciations. Micro-videos introducing character formation knowledge significantly increased students’ learning motivation regarding character evolution and calligraphy. Overall, the resources received high satisfaction ratings, especially the micro-videos and image illustrations. We also discuss the effective design of multimodal resources for implementing learning strategies (e.g., visual mnemonics, character formation knowledge, and group reviews) and the implications for different Chinese character types.

1. Introduction

Chinese characters are logographic, with a complex mapping between orthography and semantics and an ambiguous relationship between orthography and phonology [1,2,3]. Unlike the consistent orthography–phonology relationships in alphabetic systems, the irregular connections between phonetics and orthography in Chinese present challenges for learners [4,5]. Orthographic awareness—referring to the recognition of visual word forms—plays a crucial role in linking visual symbols with phonology and semantics, which has been shown to impact Chinese reading in psycholinguistic research [1]. Hence, Chinese character learning places great emphasis on visual recognition, especially for native learners, including distinguishing visual similarities and differences between characters, identifying radicals (structural components used to classify and compose Chinese characters) [6], and understanding their meanings [5]. Nonetheless, distinguishing similar Chinese characters, both visually and phonologically, remains a considerable challenge for learners regardless of their proficiency. These similarities significantly increase the likelihood of errors. Liu et al. [7] analyzed 4,100 Chinese character errors from published books and found that 76% of the errors were linked to phonological similarities, 46% to visual similarities, and 29% were attributed to a combination of both factors. The four Chinese characters shown in Figure 1 exemplify these similarities. The characters “Zhi” for “single” and “branch” share the same pronunciation with different meanings, while the characters “Shi” and “Dai” for “waiter” and “treat” share similar shapes but have completely different pronunciations and meanings.
The traditional methods of learning these characters, copying and focusing on stroke sequences, help to improve orthographic awareness [1,8]. Several less mechanical learning strategies have also been proposed. Tse et al. [9] suggested teaching Chinese characters in relational clusters and identifying similarities and variations among related characters within these clusters. Chou introduced seven guidelines for teaching similar character forms, such as using visual mnemonics and group character reviews [10,11]. Furthermore, teaching the evolution of Chinese characters is also meaningful, since many characters that are similar today evolved from distinct origins and became similar over time. The pictorial origins of characters are likewise useful for connecting graphic forms with their semantic meanings, as applied in Tse et al.’s study [9]. However, the effective integration and joint enhancement of these strategies for similar character learning have yet to be fully explored. Multimedia learning can be an effective way to implement these learning strategies collaboratively [11,12,13]. Multimedia learning is defined as the presentation of both verbal and visual information to learners, where verbal information includes printed or spoken text and visual information includes static or dynamic images [14]. According to the cognitive theory of multimedia learning, appropriate multimodal instruction that includes both verbal and visual information can aid learners in comprehending learning resources and reduce cognitive load [14,15,16]; the theory is grounded in dual-coding theory [17] and cognitive load theory [18].
However, the previous studies on multimedia learning in language acquisition have often overlooked resources specifically designed to address the challenge of distinguishing similar Chinese characters. Existing multimodal resources in educational technologies for Chinese character learning are typically based on rote memorization and mechanical repetition without contextualization [19], lacking the flexibility for teachers to adapt them based on their teaching strategies for similar character learning. Designing such resources independently is time-consuming and requires advanced technical skills and extensive knowledge. Moreover, the production of multimodal resources is heavily reliant on human effort, making it inefficient for scaling to the thousands of characters in the Chinese language, of which around 3000 are commonly used. Consequently, identifying effective technical methods for constructing multimodal resources to assist in distinguishing similar characters remains a significant and unresolved challenge.
In recent years, generative artificial intelligence (GenAI) has made it possible to create text, images, and other media using foundation models, showing significant potential to address the above challenges. With the increasing scale of training data, model parameters, and computational power, foundation models have demonstrated exceptional capabilities compared with traditional deep learning models, especially in the construction of multimodal content.
Hence, we propose a novel and effective framework for constructing multimodal learning resources assisted by GenAI for Chinese character learning. The framework was first implemented to generate one typical multimodal resource: the image illustration. To investigate the effectiveness of the constructed resources, the other multimodal resources (summary slides, micro-videos, basic information, and self-test questions) were initially designed by hand to implement the learning strategies, with automated construction planned for future work. To maintain consistent quality, the image illustrations were also manually refined to meet human-made standards. An experiment was then conducted to examine learners’ performance, motivation, satisfaction, and attitudes when using these learning resources. The study contributes guidance on designing GenAI-assisted construction methods for multimodal resources and on implementing learning strategies tailored to different types of Chinese characters and multimodal representations for distinguishing similar characters.

2. Literature Review

2.1. Multimedia Learning Methods in Language Learning

Numerous studies have investigated the design of multimedia learning methods and demonstrated their effectiveness for language acquisition. For second-language learners of English and German vocabulary, combining verbal and visual elements in dual glossing modes proved more effective than a single glossing mode, using visual resources such as graphic illustrations, micro-videos, and animations [20,21,22,23,24,25,26].
In Chinese character learning, the effectiveness of multimodal presentations has also been studied and explored. For example, Chang [11] empirically investigated visual mnemonics by pairing Chinese characters with corresponding images to teach visually similar characters and demonstrated their effectiveness in both writing and reading Chinese characters. Lee et al. [13] highlighted the positive influence of integrative multimodal instruction on the recall and writing of Chinese character phonetic symbols, radicals, word meanings, and complex words. These benefits were evident in both immediate and delayed post-tests of young native speakers in learning traditional Chinese characters. Chen et al. [12] investigated the effects of various multimedia strategies in instructional presentation and practice on strokes, radicals, and writing of Chinese characters. The study found that the radical highlighting strategy, combined with visual cue practice, significantly improved the performance of non-native beginners in writing Chinese characters. Kuo and Hooper [27] investigated the effects of dual coding for second language learners, where the verbal information includes the English translation and the description of the character’s etymology, and the visual information refers to the image illustrations corresponding to the character. They found that learners’ performance on both immediate and delayed post-tests was better when dual coding was used than when verbal coding alone was used.
In short, both images and micro-videos, combined with text, are effective for meaning comprehension in language learning. In Chinese character learning specifically, previous studies placed more emphasis on basic character recognition and paid less attention to distinguishing similar characters with different meanings. In addition, they mostly introduced only one simplified meaning per character for beginners, whereas each character commonly has multiple meanings, which learners must understand in order to use characters in words and to distinguish similar characters.
Therefore, constructing multimodal resources that convey the multiple, nuanced meanings of each character is needed to enhance learner understanding. However, the current process of resource generation is predominantly manual, resulting in limited availability and scalability. Investigating and implementing technically efficient methods for resource generation is necessary.

2.2. Generative Artificial Intelligence and Foundation Models

The emergence of GenAI marks a new era in AI. As the key technology behind GenAI, foundation models were first prominently developed in natural language processing [28] and are expanding into multimodal domains such as image, video, audio, and music, providing an effective solution for automating the generation of multimodal resources.
First, large language models (LLMs) show remarkable adaptability through prompt engineering, handling various downstream tasks such as text generation, translation, summarization, and interactive conversation without further training. For example, the prominent application ChatGPT [29] can respond to subsequent instructions in a zero-shot manner without parameter changes and can refine its responses based on human feedback [30].
Moreover, in the multimodal domain, text–image foundation models such as CLIP [31] can perform multiple cross-modal downstream tasks, including cross-modal retrieval, as well as generation tasks when paired with generative models such as diffusion models [32]. However, many of these text–image models are pre-trained predominantly on English data and therefore lack Chinese language understanding. Chinese-CLIP [33], developed by Alibaba Group, is a multimodal foundation model tailored to Chinese and built on the CLIP framework. Trained on a large dataset of 200 million Chinese image–text pairs, it shows strong performance in cross-modal retrieval of Chinese content: in text-to-image retrieval on the Flickr30K-CN dataset, the desired image appears among the top five retrieved images 86.9% of the time. Furthermore, diffusion models, a class of generative models, operate in two distinct phases: a forward process that corrupts data by injecting noise, and a learned reverse process that reconstructs the data distribution for generation [34]. In text-to-image generation, a diffusion model can serve as a decoder conditioned on image embeddings derived from text, forming the two-step process realized in unCLIP (DALL·E 2) [35].
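For reference, the standard DDPM formulation of this two-phase process (the textbook formulation, not something specific to the models cited above) writes the forward noising step with a variance schedule β_t as

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right),
\quad \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s),
```

while the learned reverse process p_θ(x_{t−1} | x_t) is trained to invert this corruption step by step, enabling generation from pure noise.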
Although foundation models have substantial generative capabilities, their application to generating educational resources remains limited due to the professional knowledge and safety assurance that education demands. Optimizing the alignment between training data and application context, and integrating different functional models, are crucial for maximizing the diverse potential of foundation models. Moreover, although GenAI is often praised for its creative capabilities, it tends to produce hallucinated results, which may pose risks for learners. Thus, an appropriate approach to using the models and strict quality standards for the resources are two important considerations when constructing learning resources.

2.3. The Present Study

This study proposes a multimodal learning resource construction framework that leverages synergistic foundation models to integrate learning strategies into multimodal resources aimed at helping learners distinguish similar Chinese characters. The study seeks to gather feedback from learners on these multimodal resources, which will provide valuable insights for designing effective learning strategies with multimodal representations. Specifically, the following research questions were addressed:
(1). How can multimodal resources be efficiently constructed to aid learners in distinguishing similar Chinese characters?
(2). Are these multimodal resources effective in distinguishing various types of similar Chinese characters?
(3). Which feature designs in multimodal resources are considered effective to distinguish similar Chinese characters?

3. Methods

3.1. Multimodal Learning Resource Construction Framework

This study proposes a framework of multimodal learning resource construction using foundation models, as shown in Figure 2. The framework consists of two main steps, namely, text augmentation and multimodal resource construction. The knowledge database and the resource database serve as professional and reliable resources created by human intelligence. To guarantee the quality and security of the constructed materials, humans are in charge of the final review in a human–AI collaboration manner.
In the text augmentation step, the LLM plays a crucial role in transforming concise knowledge into comprehensive descriptions and instructions for subsequent resource construction. Through prompt engineering, context, instructions, and output indicators are specified to enable the LLM to generate the augmented text. This augmented text can assist in the resource access process, such as Internet searching, to build the resource database. In the multimodal resource construction step, the learning resources can be constructed either by retrieval or by generation. In the retrieval branch, the feature database of the resource database is pre-built by the multimodal foundation model. Using the augmented text as a query, the multimodal foundation model searches the feature database for the target cross-modal resources. In the generation branch, the resources are generated directly by the generative foundation model, guided by the corresponding augmented text. Finally, humans verify and refine the outputs to ensure high-quality learning resources.
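To make the two-step flow concrete, the following is a minimal orchestration sketch; all helper functions are hypothetical placeholders rather than the authors' implementation (concrete versions of the augmentation and retrieval steps are sketched in Section 3.2.1):

```python
# Hypothetical orchestration of the framework: text augmentation, then
# retrieval or generation, with the final decision left to a human reviewer.
from typing import List

def llm_augment(entry: str) -> str:
    """Step 1: expand a concise knowledge entry via an LLM prompt chain."""
    raise NotImplementedError  # see the prompt design in Figure 6

def retrieve_candidates(query: str, k: int) -> List[str]:
    """Retrieval branch: query the pre-built cross-modal feature database."""
    raise NotImplementedError  # e.g., Chinese-CLIP features + cosine similarity

def generate_candidates(query: str, k: int) -> List[str]:
    """Generation branch: sample directly from a generative foundation model."""
    raise NotImplementedError  # e.g., a text-to-image diffusion model

def construct_resource(entry: str, mode: str = "retrieval") -> List[str]:
    """Produce candidate resources for the final human review step."""
    augmented = llm_augment(entry)
    if mode == "retrieval":
        return retrieve_candidates(augmented, k=5)
    return generate_candidates(augmented, k=5)
```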
Taking the construction of image illustrations for Chinese characters as an example, both text-to-image retrieval and image generation approaches can be adopted. In the retrieval approach, the Chinese-CLIP model is an appropriate choice for matching Internet images with characters’ text definitions. Exemplary results are presented in Figure 3, and their generation process is detailed in Section 3.2. Alternatively, generative foundation models can create original images for each definition of a Chinese character. Figure 4 depicts an example generated in portrait style by HiDream.AI [36], an online text-to-image application. Both approaches require generating multiple candidates for final human selection. Although the generated images in Figure 4 are aesthetically high-quality, numerous factual errors can occur during generation, such as hands with six fingers, due to the instability and randomness inherent in foundation model generation. In contrast, retrieved images are collected from the Internet and are generally accurate, without factual errors. Considering the need for safe educational resources and an efficient process, we adopt the retrieval approach in this study for constructing image illustrations of Chinese characters.

3.2. Implementation of Multimodal Learning Resource Construction Framework

In this preliminary work, as shown in Figure 5, we first explored the construction of image illustrations through the retrieval process of the multimodal learning resource construction framework, owing to their broad applicability across learning contexts and their relatively feasible and scalable generation process. Additionally, to implement learning strategies for similar character learning, the other multimodal resources were created manually by the researchers. A total of five categories of multimodal learning resources were designed, as outlined in Table 1.

3.2.1. Image Illustration Construction Using Foundation Models

As shown in Figure 5, image illustrations aligned with each text definition are developed. The Xinhua Dictionary serves as the knowledge database, providing basic information about Chinese characters. Human selection plays a crucial role in the final refinement, enhancing the quality of the image illustrations.
In the text augmentation step, the goal is to enrich the description of each character’s definition for Internet image searching, using the ChatGPT API (gpt-3.5-turbo model). Several rounds of prompts are created, giving step-by-step implementation instructions. Each prompt includes context introducing the previous response or prior knowledge, along with an instruction specifying the desired operation and output format requirements.
Specifically, a three-step procedure is implemented, as shown in Figure 6. First, the model is instructed to generate three synonymous sentences for the query phrase. Second, it is asked to enhance each phrase with figurative descriptions related to the topic, based on the output from step one. Third, to mitigate distractions from indirect metaphors introduced in the second step, the model rephrases the sentences and eliminates irrelevant symbolic comparisons. Take the Chinese character “Zhi” (branch) as an example. It functions as a measure word and has three frequently used definitions. Each definition contains a meaning description as the topic phrase and several sample words as query phrases, separated by a colon. For instance, one of its definitions is “used for a song or piece of music: two new pieces of music.” Following the three-step procedure, the enriched text can be {"phrase1": "Two brand new tracks that exude enchantment in the bloom of notes.", "phrase2": "Two brand new songs that reveal beautiful melodies in the interweaving of melodies.", "phrase3": "Two brand new pieces of music that bring freshness in the flow of notes."}.
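A minimal sketch of this three-step chain is shown below, using the current OpenAI Python SDK; the prompt wording and the JSON output convention are illustrative paraphrases of Figure 6, not the authors' exact prompts:

```python
# Three-round prompt chain for text augmentation (illustrative wording).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

STEPS = [
    # Step 1: three synonymous sentences for the query phrase.
    "Generate three synonymous sentences for the phrase: '{text}'. "
    "Return a JSON object with keys phrase1, phrase2, phrase3.",
    # Step 2: enrich each sentence with figurative descriptions of the topic.
    "Enhance each phrase below with figurative descriptions related to the "
    "topic '{topic}', keeping the same JSON format:\n{text}",
    # Step 3: remove irrelevant symbolic comparisons introduced in step 2.
    "Rephrase each phrase below to eliminate irrelevant symbolic comparisons, "
    "keeping the same JSON format:\n{text}",
]

def augment_definition(topic: str, query_phrase: str) -> str:
    """Feed each round's response into the next; return the final JSON text."""
    text = query_phrase
    for template in STEPS:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user",
                       "content": template.format(topic=topic, text=text)}],
        )
        text = response.choices[0].message.content
    return text

# e.g., augment_definition("used for a song or piece of music",
#                          "two new pieces of music")
```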
Using both the augmented and the original text definitions as query strings, 30 image candidates for each sample word within each definition are crawled from the Internet, building the image resource database. To refine the candidates, we employ Chinese-CLIP to extract features from the text definitions and image candidates. The image features are stored in a cross-modal feature database, and the features of each text definition are used as queries to retrieve the top five aligned images via cosine similarity.
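The retrieval step can be sketched as follows with the Hugging Face port of Chinese-CLIP; the checkpoint name is the public OFA-Sys release and is an assumption for illustration, as the paper does not specify which model size was used:

```python
# Rank crawled image candidates against a text definition with Chinese-CLIP.
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

NAME = "OFA-Sys/chinese-clip-vit-base-patch16"  # assumed checkpoint
model = ChineseCLIPModel.from_pretrained(NAME)
processor = ChineseCLIPProcessor.from_pretrained(NAME)

def rank_candidates(query_text: str, image_paths: list[str], top_k: int = 5):
    """Return the top-k candidate images most similar to the text definition."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    with torch.no_grad():
        img_feats = model.get_image_features(
            **processor(images=images, return_tensors="pt"))
        txt_feats = model.get_text_features(
            **processor(text=[query_text], padding=True, return_tensors="pt"))
    # After L2 normalization, cosine similarity is a plain dot product.
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    sims = (txt_feats @ img_feats.T).squeeze(0)
    order = sims.argsort(descending=True)[:top_k]
    return [(image_paths[i], sims[i].item()) for i in order]
```

The returned similarity scores can then be surfaced to the human reviewer as the alignment clues described in the next paragraph.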
Finally, the most appropriate image from the candidates for each definition is selected by a human, with the assistance of cosine similarity clues indicating the degree of alignment between the image and text definition. If the results are not appropriate enough, humans can conduct additional web searches to find images that accurately capture the underlying meaning of the target text definition. To better evaluate the effectiveness of the resource design, we carefully searched for the most appropriate image for each definition in this study. The outcomes of the three definitions of the Chinese character “Zhi” (branch) are presented in the first row of Figure 3.

3.2.2. Multimodal Learning Resources Design

In addition to image illustrations, a total of five categories of learning materials in three modalities were developed to implement the multimodal learning strategies for distinguishing similar characters, as described below. They are also planned to be further developed following the proposed semi-automated multimodal learning resource construction framework, with specific refinements tailored to their needs.
(1) Image illustrations serve as visual mnemonics corresponding to the Chinese character’s text definitions. Each image is tailored to illustrate one meaning of the character, either depicting sample words or conveying the meaning description.
(2) Summary slides review the key points from the other resources, presenting similar Chinese characters in relational clusters, including their pinyin, meanings and sample words, pictorial representations, current and ancient forms, and the evolution between them.
(3) Micro-videos analyze Chinese characters mainly from the evolution perspective, introducing formation knowledge, as shown in Figure 7. Four short sections explain how to write the character, how its form evolves in calligraphy, how it was created and conveys meaning, and how to use it in words.
(4) Self-test questions are presented in a multiple-choice format to assess learners’ grasp of the knowledge covered in the micro-videos. Learners can request answers and receive immediate feedback after making their choices.
(5) Basic information mainly contains the pinyin and multiple definitions of Chinese characters. Pinyin represents the pronunciation of the Chinese character in the official romanization system. Each definition consists of a meaning description and sample words demonstrating the character’s usage.
Figure 7. Screenshots of the micro-videos for “Branch” (支) and “Single” (只).

4. Experiment

4.1. Participants

A total of 61 participants were randomly recruited via an online broadcast on a university forum in Beijing, China, and were divided into an experimental group (N = 31) and a control group (N = 30). All participants were native Chinese speakers proficient in Mandarin, from various academic backgrounds. Their demographic characteristics, including gender, age, level of education, and self-assessed proficiency in Chinese based on previous learning experiences in school, were collected.

4.2. Measuring Tools

The measuring tools included learning achievement tests and questionnaires on satisfaction, utility selection, and learning motivation. The learning achievement test consisted of 10 multiple-choice questions assessing the participants’ ability to discriminate between similar characters. These questions were designed according to a specific procedure. First, 61 commonly used Chinese characters were selected from the Chinese dictionary, forming 28 groups of similar characters that are easily confused in written Chinese. Second, the characters were classified into three difficulty levels based on stroke count: 20 characters with 1–5 strokes, 30 with 6–10 strokes, and 11 with 11–15 strokes. Finally, 21 characters were selected for the study following four principles: (a) covering all three difficulty levels, (b) having significant differences in original forms between paired characters, (c) being easily confused with each other, and (d) having clear formation explanations. These 21 characters formed 9 groups of similar characters and were used to construct 10 questions: 8 pairs of characters for 8 questions and one group of 5 similar characters for 2 questions. The pre-test and post-test questions used the same 21 characters but with different test words containing those characters.
The satisfaction post-questionnaire for the experimental group was modified from Chu et al.’s work [37] with a reliability coefficient of 0.91. It consisted of 9 items on a 5-point Likert-scale to measure learners’ overall satisfaction with the learning resources. For example, participants were asked to rate statements such as, “When learning in this way, I learned how to observe the Chinese character from new perspectives.” Meanwhile, for the control group, a post-questionnaire with one item was also designed to investigate the learners’ overall satisfaction with the provided text-based resource.
The utility selection post-questionnaire was designed for the experimental group to evaluate which modality of resource was most beneficial among the text definition in basic information, image illustration, micro-video, summary slide, and self-test question.
The learning motivation pre-questionnaires and post-questionnaires were derived from the learning motivation questionnaire developed by Hwang et al. [38] with a Cronbach’s alpha value of 0.79. The questionnaire was modified to investigate experimental group learners’ motivation toward contents of image illustrations and micro-videos, consisting of seven five-point Likert-scale items, such as “I think learning image illustration is interesting and valuable”.

4.3. Procedure

The experiment procedure (see Figure 8) consisted of five steps conducted online with each participant individually. Instructions were prepared separately for the experimental and control groups, guiding participants to log in to the online experimental platform and complete the experiment while their screen activity was recorded.
First, the experimental group completed a pre-questionnaire on learning motivation toward image illustrations and the evolution knowledge of Chinese characters. Then, all participants completed the pre-test, which consisted of 10 questions on distinguishing similar characters. Second, Chinese character learning resources were provided based on each learner’s pre-test performance, covering the characters the learner had answered incorrectly. As shown in Figure 9, the experimental group could access all five kinds of multimodal resources by clicking the corresponding buttons, while the control group could access only one text-based resource, i.e., the basic information, without other buttons. Third, all participants took the post-test, which contained the same characters as the pre-test but with different test words. Fourth, post-questionnaires were administered to both groups. For the experimental group, the questionnaires included demographic questions, overall satisfaction with the learning resources, utility selection of the multimodal resources, and the learning motivation post-questionnaire. For the control group, the questionnaires included demographic questions and satisfaction with the text-based resource; the control group received no questions on utility selection or learning motivation regarding the multimodal resources, since they had no access to them. Finally, online text interviews were conducted to further understand participants’ attitudes toward the provided learning resources.

5. Results

5.1. Demographic Characteristics of the Participants

To ensure the equivalence of the experimental and control groups, Chi-square tests and the Mann–Whitney U test were performed to analyze differences in gender, age, level of education, and self-assessed proficiency in Chinese. Notably, undergraduate students who had not yet earned a Bachelor’s degree were classified under secondary education. As presented in Table 2, there were no significant differences in demographic characteristics between the two groups.

5.2. Learning Performance

ANCOVA was used to explore differences in learning performance between the experimental and control groups, with the pre-test scores as the covariate and the post-test scores as the dependent variable. Chinese characters can generally be categorized into two types: simple and compound characters. Simple characters consist of a single radical component, while compound characters combine two or more radical components; compound characters make up more than 80% of all Chinese characters [39,40]. Thus, the questions were separated into a simple character category (N = 2) and a compound character category (N = 8). Levene’s test was performed to check homogeneity of variance for both categories in the post-test, with F = 0.229, p = 0.634 > 0.05 for the simple character category and F = 1.137, p = 0.291 > 0.05 for the compound character category, indicating that ANCOVA could be applied. The ANCOVA results for the post-test scores are shown in Table 3, where the adjusted mean is each group’s mean score after statistically controlling for the covariate. For the simple character category, the analysis revealed a significant difference between the two groups (F = 5.799, p < 0.05). For the compound character category, however, there was no significant difference between the two groups (F = 3.294, p > 0.05). Furthermore, when examining similar characters from a phonological perspective, as shown in Table 4, both groups improved on homophones, but the experimental group exhibited significantly decreased performance on non-homophones compared with the control group.
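As a reference point, this analysis pipeline can be reproduced along the following lines; the DataFrame layout (columns group, pre, post) is an assumption, and pingouin is one of several packages offering ANCOVA:

```python
# Levene's test for homogeneity of variance, then ANCOVA with the
# pre-test score as covariate (assumed columns: group, pre, post).
import pandas as pd
import pingouin as pg
from scipy.stats import levene

def analyze_category(df: pd.DataFrame) -> pd.DataFrame:
    exp = df.loc[df["group"] == "experimental", "post"]
    ctl = df.loc[df["group"] == "control", "post"]
    f, p = levene(exp, ctl)  # p > .05 -> variances may be treated as equal
    print(f"Levene's test: F = {f:.3f}, p = {p:.3f}")
    # Post-test as DV, group as between-subjects factor, pre-test as covariate.
    return pg.ancova(data=df, dv="post", between="group", covar="pre")
```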

5.3. Learning Motivation

Two learning motivation questionnaires were administered to the experimental group, addressing image illustrations and the character evolution and calligraphy introduced in the micro-videos. The paired t-test results are presented in Table 5. Learners maintained high learning motivation toward image illustrations both before (M = 4.088, SD = 0.862) and after (M = 4.129, SD = 0.713) using the learning resources. Meanwhile, learners showed a significant increase in learning motivation toward character evolution and calligraphy between the two rounds (t = −2.698, p < 0.05).
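The comparison is a standard paired t-test on each participant's pre/post questionnaire means; a minimal sketch, assuming two aligned arrays of per-participant means:

```python
# Paired t-test on pre/post motivation ratings; a negative t statistic
# indicates post > pre, matching the sign reported in Table 5.
import numpy as np
from scipy.stats import ttest_rel

def motivation_change(pre: np.ndarray, post: np.ndarray) -> tuple[float, float]:
    t, p = ttest_rel(pre, post)
    return t, p
```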

5.4. Satisfaction and Attitude

In the results of the utility selection, the micro-video received the most votes with 32% (10 votes) of the total. The image illustration came in second with 29% (9 votes). The summary slide received 16% (5 votes), the self-test question received 13% (4 votes), and the text definition received 10% (3 votes). Overall, learners’ satisfaction with the multimodal learning platform was high (M = 4.344, SD = 0.658).
To further explore learners’ attitudes toward each multimodal resource, an online text interview was conducted with each participant in the experimental group at the end of the experiment. The open-ended question was, “Which resource do you find most useful and why?” We summarized the reasons given in each response in a bottom-up manner; 30 related reasons were mentioned and coded, as shown in Table 6. If multiple reasons were provided in a single response, each was counted individually; responses with no explicit reasons were excluded from the count. Notably, participants identified two common reasons for considering image illustrations and micro-videos useful, “aiding memorization” and “stimulating interest”, which underscores the effectiveness of the multimodal resources in similar character learning. For the control group, despite not experiencing the multimodal learning resources, learners expressed high satisfaction with the textual information provided in their learning platform (M = 4.091, SD = 0.621), text being the learning mode most familiar to native Chinese students through long periods of training.

6. Discussion

6.1. Efficient Framework Design for Constructing Multimodal Learning Resources

The synergy of different functional foundation models offers a feasible approach to constructing multimodal learning resources, addressing the first research question. Specifically, building on multimedia learning methods, this study proposed and implemented a multimodal learning resource construction framework. Unlike previous studies that relied on completely manual resource generation, the present study designed a grounded approach utilizing foundation models such as ChatGPT and Chinese-CLIP to construct image illustrations for each text definition of a character. The integration of an LLM with multimodal models in the proposed framework demonstrates the efficiency and effectiveness of different types of foundation models working together to maximize each other’s strengths. This approach aligns with the design concept of DALL·E 3 [41], which is built natively on ChatGPT. Using foundation models, humans can save significant time and effort that would otherwise be spent searching the Internet for appropriate images with different query terms. Instead, they can simply make final decisions among the image candidates matched by the model, with semantic distance cues assisting the selection.
Analysis of the construction results showed that ChatGPT could effectively follow the instructions in the prompts, including generating synonyms and removing metaphors. This approach was particularly effective for less commonly used meanings of a character, as they could be contextualized with more familiar words to enhance comprehension. With augmented text from ChatGPT, Chinese-CLIP can match image features across modalities more precisely than with the original dictionary descriptions alone. In other words, the synergy between different foundation models can improve the illustration quality for the multiple meanings of each character. Nevertheless, challenges arose with character meanings used as adjectives or adverbs, which are less straightforward to illustrate with images. In such cases, further explanation of the association between the image and the meaning may be necessary.

6.2. Effectiveness of Multimodal Resources in Distinguishing Similar Chinese Characters

Regarding the second research question, the multimodal resources’ effect on learning simple characters surpassed their effect on compound characters and visually similar non-homophones. The results showed significant improvement in learning simple characters but not compound characters. The reason may lie in the nature of the characters themselves. The simple characters tested in this study have modern forms that differ significantly from their original forms. The micro-videos introducing character evolution played a crucial role in helping learners trace modern Chinese characters back to their origins, facilitating the understanding of original meanings and of distinctions between visually similar characters, which may explain the significant improvement in learners’ performance. This finding is consistent with Yu’s claim [42] that simple characters are effectively taught through tracing-back strategies, which help learners associate characters’ meanings with their orthography. Additionally, the image illustrations for multiple meanings of each Chinese character can assist in efficiently comparing the semantics of similar characters, as evidenced in the online text interviews (see the following section). According to the cognitive theory of multimedia learning [15], these image illustrations for simple characters may facilitate the integration of corresponding pictorial and verbal representations in learners’ working memory and thus improve performance.
On the other hand, the reason for the lack of significant learning improvement in compound characters may be because the compound characters contain not only pictographic elements but also ideographs and phonetic compounds, making them more complex than simple characters to explain and understand in a short time. Furthermore, cognitive load is related to the learners’ prior knowledge and task complexity [43]. If learners lack the necessary basic knowledge of character formation, the task becomes even more challenging, thereby demanding greater mental effort [43], which may also affect learners’ anxiety [44]. As stated by a participant in the online text interview:
“Without prior introduction to concepts such as pictograms, ideograms, and phonetic compounds, learners may experience some difficulty when making choices (in the test).”
“The differences between Oracle Bone Script (the original form) and modern script appear to be quite significant. Additionally, the complexity of Oracle Bone Script makes it challenging to retain a lasting impression.”
As a result, learners may fail to use the multimodal resources to internalize and transfer information, and instead rely on prior knowledge of the characters to answer questions in the test. Consequently, no significant difference was observed between the experimental and control groups.
When considering similar characters from a phonological perspective, the multimodal resources showed no significant difference between the two groups for homophones but had negative effects on non-homophones in the experimental group. Investigating further, we found that the visual differences between the non-homophones tested in this study had no direct implications for their pronunciations (e.g., the characters “Shi” and “Dai” for “waiter” and “treat” in Figure 1), owing to the intricate connections among the orthographic, phonological, and semantic systems of Chinese [1]. Although the micro-videos introduced knowledge of character evolution for phonetic radicals, such unintuitive and novel content may overload learners’ limited working memory [45], hindering their ability to comprehend and apply it during testing, even for native speakers. Since cognitive load is closely tied to working memory [46], a heavy cognitive load can ultimately affect performance [47]. This aligns with the potential negative effects of mobile learning environments found by Chu [48].

6.3. Effective Feature Designs in Multimodal Resources for Distinguishing Similar Chinese Characters

The third research question investigated the effective feature designs of the proposed resources. Considering the learning strategies in the multimodal resource design, we implemented visual mnemonic strategies by constructing image illustrations for multiple definitions of each character rather than providing only one basic meaning. Meanwhile, character formation knowledge was conveyed through micro-videos followed by self-testing questions, and a group character review was incorporated into the final summary slide. Participants showed high agreement on the implementation of these strategies, as reflected in their overall satisfaction with the resources. In particular, micro-videos ranked as the most useful, which is consistent with Al-Seghayer’s finding [20]. Al-Seghayer found that dynamic videos were more effective than still images and text for second-language vocabulary acquisition. In the present study, micro-videos were utilized to convey a greater volume of information with higher density, offering comprehensive explanations of the evolution and meanings of Chinese characters. They functioned as effective mediums for knowledge accumulation and cultural transmission. Moreover, by establishing strong correlations between videos, diverse character meanings could be effectively connected, thus enhancing the integrative impact of multiple videos, as noted in Table 6. In addition, the effectiveness of image illustrations is consistent with the findings of Chuang and Ku [49], which showed that, although no significant difference was found between the effectiveness of image–text and image–audio treatments in Chinese character recognition, learners in both groups expressed a strong preference for images corresponding to Chinese characters.
In particular, the character-cluster design employed in the learning strategies across various modalities received high satisfaction for similar character learning, indicating its effectiveness in facilitating meaning comprehension. This may be because the clustering design helps reduce the load on the limited capacity of working memory [50]. Below are some excerpts from the online text interviews indicating the effectiveness of these feature designs:
“I found the image illustration the most helpful, being able to distinguish the biggest differences between similar characters in the shortest time possible” (image illustrations).
“Directly a chart comparing two characters, very concise, at a glance, will be a time saver for college students” (summary slides).
“The video on the evolution of the character was able to tell the story vividly in terms of the shape and meaning of the character, and it also allowed me to better distinguish the implicit differences between similar characters.” (micro-videos)

7. Conclusions, Limitations, and Future Research

In Chinese character learning, distinguishing similar characters poses a significant challenge for learners regardless of their proficiency level. To effectively integrate learning strategies for similar Chinese characters, multimedia learning methods have shown promise. Thus, to enhance the flexibility and practicality of the multimodal resource construction, the study proposed a multimodal resource construction framework using synergistic foundation models and preliminarily implemented the framework to construct image illustrations for Chinese character meanings. The findings of the study indicate that the learning strategies implemented in various multimodal resources, such as visual mnemonics in image illustration, micro-videos of character formation knowledge, and character group reviews in summary slides, are more effective at distinguishing simple characters compared with compound characters and are not suitable for non-homophones, which require further careful consideration of learners’ prior knowledge as well as the orthographic complexity of the characters. The results also showed that learners preferred micro-videos and clustered multimodal presentations when learning similar characters.
Additionally, several limitations require further work. First, the participant group consisted primarily of college-level native speakers. Future research could include a more diverse sample across age groups, educational backgrounds, and second-language learners, and could explore a wider range of visually and phonetically similar characters based on participants’ learning experiences [51]. Another area for improvement is the evaluation of the resource generation process. The present study focuses on implementing the proposed framework to provide an initial demonstration of its feasibility with the assistance of GenAI. In-depth evaluations of the accuracy of generated text, the appropriateness of language expression, and adherence to educational requirements remain to be conducted. Moreover, this study represents an early attempt to apply the proposed framework to image illustration generation. Future work could extend the framework to automate the generation of other resource types, particularly by utilizing advanced multimodal foundation models such as GPT-4o [52]. At the same time, it is important to ensure that generated content is reliable, ethical, and safe [30]. Human intervention, such as review and validation, remains necessary to ensure the security and integrity of educational resources, mitigate potential risks, and ensure content quality.

Author Contributions

Conceptualization, J.Y., J.S. and Y.L.; methodology, J.Y., J.S. and Y.L.; software, J.Y.; formal analysis, J.Y.; investigation, J.Y. and J.S.; resources, J.S. and J.Y.; data curation, J.Y.; writing—original draft preparation, J.Y. and J.S.; writing—review and editing, J.Y., J.S. and Y.L.; supervision, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Tencent (China): 230200024.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of Advanced Innovation Center for Future Education, Beijing Normal University (protocol code AICFE-011 and date of approval 15 May 2023).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tan, L.H.; Spinks, J.A.; Eden, G.F.; Perfetti, C.A.; Siok, W.T. Reading depends on writing, in Chinese. Proc. Natl. Acad. Sci. USA 2005, 102, 8781–8785. [Google Scholar] [CrossRef]
  2. Zhang, H.; Roberts, L. The role of phonological awareness and phonetic radical awareness in acquiring Chinese literacy skills in learners of Chinese as a second language. System 2019, 81, 163–178. [Google Scholar] [CrossRef]
  3. Zhang, L.; Xing, H. The interaction of orthography, phonology and semantics in the process of second language learners’ Chinese character production. Front. Psychol. 2023, 14, 1076810. [Google Scholar] [CrossRef] [PubMed]
  4. Caravolas, M.; Lervåg, A.; Defior, S.; Seidlová Málková, G.; Hulme, C. Different patterns, but equivalent predictors, of growth in reading in consistent and inconsistent orthographies. Psychol. Sci. 2013, 24, 1398–1407. [Google Scholar] [CrossRef] [PubMed]
  5. McBride, C.A. Is Chinese special? Four aspects of Chinese literacy acquisition that might distinguish learning Chinese from learning alphabetic orthographies. Educ. Psychol. Rev. 2016, 28, 523–549. [Google Scholar] [CrossRef]
  6. Shen, H.H.; Ke, C. Radical awareness and word acquisition among nonnative learners of Chinese. Mod. Lang. J. 2007, 91, 97–111. [Google Scholar] [CrossRef]
  7. Liu, C.L.; Lai, M.H.; Tien, K.W.; Chuang, Y.H.; Wu, S.H.; Lee, C.Y. Visually and phonologically similar characters in incorrect Chinese words: Analyses, identification, and applications. ACM Trans. Asian Lang. Inf. Process. (TALIP) 2011, 10, 1–39. [Google Scholar] [CrossRef]
  8. Wu, X.; Li, W.; Anderson, R.C. Reading instruction in China. J. Curric. Stud. 1999, 31, 571–586. [Google Scholar] [CrossRef]
  9. Tse, S.K.; Marton, F.; Ki, W.W.; Loh, E.K.Y. An integrative perceptual approach for teaching Chinese characters. Instr. Sci. 2007, 35, 375–406. [Google Scholar] [CrossRef]
  10. Chou, P.H. Study of Chinese similar form of the character teaching by theory of learning. J. Natl. United Univ. 2009, 6, 79–98. [Google Scholar]
  11. Chang, L.Y.; Tang, Y.Y.; Lee, C.Y.; Chen, H.C. The Effect of Visual Mnemonics and the Presentation of Character Pairs on Learning Visually Similar Characters for Chinese-as-Second-Language Learners. Front. Psychol. 2022, 13, 783898. [Google Scholar] [CrossRef] [PubMed]
  12. Chen, M.P.; Wang, L.C.; Chen, H.J.; Chen, Y.C. Effects of type of multimedia strategy on learning of Chinese characters for non-native novices. Comput. Educ. 2014, 70, 41–52. [Google Scholar] [CrossRef]
  13. Lee, C.P.; Shen, C.W.; Lee, D. The effect of multimedia instruction for Chinese learning. Learn. Media Technol. 2008, 33, 127–138. [Google Scholar] [CrossRef]
  14. Mayer, R.E. The Cambridge Handbook of Multimedia Learning; Cambridge University Press: Cambridge, UK, 2005. [Google Scholar]
  15. Mayer, R.E. Multimedia Learning. In Psychology of Learning and Motivation; Academic Press: Cambridge, MA, USA, 2002; pp. 85–139. [Google Scholar]
  16. Mayer, R.E.; Moreno, R. Nine ways to reduce cognitive load in multimedia learning. Educ. Psychol. 2003, 38, 43–52. [Google Scholar] [CrossRef]
  17. Paivio, A. Mental Representations: A Dual Coding Approach; Oxford University Press: Oxford, UK, 1990. [Google Scholar]
  18. Sweller, J.; Van Merrienboer, J.J.; Paas, F.G. Cognitive architecture and instructional design. Educ. Psychol. Rev. 1998, 10, 251–296. [Google Scholar] [CrossRef]
  19. Olmanson, J.; Liu, X. The Challenge of Chinese Character Acquisition: Leveraging Multimodality in Overcoming a Centuries-Old Problem. Emerg. Learn. Des. J. 2018, 4, 1–9. [Google Scholar]
  20. Al-Seghayer, K. The effect of multimedia annotation modes on L2 vocabulary acquisition: A comparative study. Lang. Learn. Technol. 2005, 3, 133–151. [Google Scholar]
  21. Kim, D.; Gilman, D.A. Effects of text, audio, and graphic aids in multimedia instruction for vocabulary learning. J. Educ. Technol. Soc. 2008, 11, 114–126. [Google Scholar]
  22. Plass, J.L.; Chun, D.M.; Mayer, R.E.; Leutner, D. Cognitive load in reading a foreign language text with multimedia aids and the influence of verbal and spatial abilities. Comput. Hum. Behav. 2003, 19, 221–243. [Google Scholar] [CrossRef]
  23. Ramezanali, N.; Faez, F. Vocabulary learning and retention through multimedia glossing. Lang. Learn. Technol. 2019, 23, 105–124. [Google Scholar]
  24. Zhu, Y.; Mok, P. The role of prosody across languages. In The Routledge Handbook of Second Language Acquisition and Speaking, 1st ed.; Routledge: Abingdon-on-Thames, UK, 2022; pp. 201–214. [Google Scholar]
  25. Figueiredo, S. The efficiency of tailored systems for language education: An app based on scientific evidence and for student-centered approach. Eur. J. Educ. Res. 2023, 12, 583–592. [Google Scholar] [CrossRef]
  26. Figueiredo, S.; Brandão, T.; Nunes, O. Learning Styles Determine Different Immigrant Students’ Results in Testing Settings: Relationship Between Nationality of Children and the Stimuli of Tasks. Behav. Sci. 2019, 9, 150. [Google Scholar] [CrossRef]
  27. Kuo, M.L.A.; Hooper, S. The effects of visual and verbal coding mnemonics on learning Chinese characters in computer-based instruction. Educ. Technol. Res. Dev. 2004, 52, 23–34. [Google Scholar] [CrossRef]
  28. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Liang, P. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258. [Google Scholar]
  29. GPT-4. OpenAI Website. Available online: https://openai.com/research/gpt-4 (accessed on 3 August 2023).
  30. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Lowe, R. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
  31. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sutskever, I. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18 July 2021. [Google Scholar]
  32. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19 June 2022. [Google Scholar]
  33. Yang, A.; Pan, J.; Lin, J.; Men, R.; Zhang, Y.; Zhou, J.; Zhou, C. Chinese CLIP: Contrastive vision-language pretraining in Chinese. arXiv 2022, arXiv:2211.01335. [Google Scholar]
  34. Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M. Diffusion Models: A Comprehensive Survey of Methods and Applications. ACM Comput. Surv. 2023, 56, 39. [Google Scholar] [CrossRef]
  35. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
  36. HiDream.AI Website. Available online: https://www.hidreamai.com/ (accessed on 31 March 2024).
  37. Chu, H.C.; Hwang, G.J.; Tsai, C.C.; Tseng, J.C. A two-tier test approach to developing location-aware mobile learning systems for natural science courses. Comput. Educ. 2010, 55, 1618–1627. [Google Scholar] [CrossRef]
  38. Hwang, G.J.; Yang, L.H.; Wang, S.Y. A concept map-embedded educational computer game for improving students’ learning performance in natural science courses. Comput. Educ. 2013, 69, 121–130. [Google Scholar] [CrossRef]
  39. Wang, M.; Perfetti, C.A.; Liu, Y. Alphabetic readers quickly acquire orthographic structure in learning to read Chinese. Sci. Stud. Read. 2003, 7, 183–208. [Google Scholar] [CrossRef]
  40. Sung, K.Y.; Wu, H.P. Factors influencing the learning of Chinese characters. Int. J. Biling. Educ. Biling. 2011, 14, 683–700. [Google Scholar] [CrossRef]
  41. Betker, J.; Goh, G.; Jing, L.; Brooks, T.; Wang, J.; Li, L.; Ramesh, A. Improving image generation with better captions. Comput. Sci. 2023, 2, 8. [Google Scholar]
  42. Yu, Z. Teaching Chinese character literacy from the perspective of word theory. Teach. Manag. 2018, 9, 80–82. [Google Scholar]
  43. Paas, F.G.W.C.; Van, M.J.J.G. Instructional control of cognitive load in the training of complex cognitive tasks. Educ. Psychol. Rev. 1994, 6, 351–371. [Google Scholar] [CrossRef]
  44. Wang, H.; Zhang, X.; Jin, Y.; Ding, X. Examining the relationships between cognitive load, anxiety, and story continuation writing performance: A structural equation modeling approach. Hum. Soc. Sci. Commun. 2024, 11, 1297. [Google Scholar] [CrossRef]
  45. Baddeley, A.D.; Hitch, G.J. Working memory. Psychol. Learn. Motiv. 1974, 8, 47–90. [Google Scholar]
  46. Paas, F.; Renkl, A.; Sweller, J. Cognitive load theory and instructional design: Recent developments. Educ. Psychol. 2003, 38, 1–4. [Google Scholar] [CrossRef]
  47. DeStefano, D.; LeFevre, J.A. Cognitive Load in Hypertext Reading: A Review. Comput. Hum. Behav. 2007, 23, 1616–1641. [Google Scholar] [CrossRef]
  48. Chu, H.C. Potential negative effects of mobile learning on students’ learning achievement and cognitive load—A format assessment perspective. J. Educ. Technol. Soc. 2014, 17, 332–344. [Google Scholar]
  49. Chuang, H.Y.; Ku, H.Y. The effect of computer-based multimedia instruction with Chinese character recognition. Educ. Media Int. 2011, 48, 27–41. [Google Scholar] [CrossRef]
  50. Zahradníková, M. A qualitative inquiry of character learning strategies by Chinese L2 beginners. Chin. Second Lang. J. 2016, 51, 117–137. [Google Scholar] [CrossRef]
  51. Figueiredo, S.; Silva, C. Cognitive differences in second language learners and the critical period effects. L1 Educ. Stud. Lang. Lit. 2009, 9, 157–178. [Google Scholar] [CrossRef]
  52. Hello GPT-4o. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 26 May 2024).
Figure 1. Examples of two pairs of similar Chinese characters.
Figure 2. The multimodal learning resource construction framework.
Figure 3. Image illustrations of the Chinese characters for “Branch” (支) and “Single” (只) via text-to-image retrieval.
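To make the retrieval path in Figure 3 concrete, the following is a minimal sketch of text-to-image retrieval with a CLIP-style vision-language model: a character's meaning description and a small library of candidate images are embedded in a shared space and ranked by similarity. The checkpoint, helper name, and example inputs are illustrative assumptions rather than the study's actual pipeline; for Chinese-language descriptions, a Chinese-capable CLIP variant would be preferable.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Publicly available CLIP checkpoint (placeholder choice for illustration).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_images(description: str, image_paths: list[str], top_k: int = 3):
    """Return the top_k image paths whose CLIP embedding best matches the text."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[description], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_text: similarity of the single text query to each image
    scores = outputs.logits_per_text.squeeze(0)
    best = torch.topk(scores, k=min(top_k, len(image_paths))).indices
    return [image_paths[i] for i in best.tolist()]

# Hypothetical usage: candidate illustrations for "支" (branch)
# print(retrieve_images("a tree branch", ["branch1.png", "branch2.png", "tree.png"]))
```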
Figure 4. Image illustrations of the Chinese characters for “Branch” (支) and “Single” (只) via text-to-image generation.
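The generation path in Figure 4 instead synthesizes a new illustration from an augmented text description with a latent diffusion model. Below is a minimal sketch using the open-source diffusers library; the checkpoint, prompt wording, and GPU assumption are placeholders, not the study's actual generator.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a publicly available latent diffusion checkpoint (placeholder choice);
# assumes a CUDA-capable GPU is available.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A hypothetical augmented description for "支" (branch).
prompt = "a single tree branch with green leaves, simple flat illustration for learners"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("zhi_branch_illustration.png")
```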
Figure 5. The implementation process of the multimodal learning resource construction framework.
Figure 6. The prompts designed for text augmentation via ChatGPT.
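Figure 6 shows the prompts designed for text augmentation; the call itself could look like the sketch below, which asks an LLM to turn a character's meaning into a drawable scene description. The model name and prompt wording here are assumptions for illustration, not the study's exact prompts.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def augment_description(character: str, meaning: str) -> str:
    """Ask an LLM to turn a character's meaning into a drawable scene."""
    prompt = (  # hypothetical wording; the actual prompts appear in Figure 6
        f"The Chinese character {character} means '{meaning}'. Describe one "
        "concrete visual scene (under 30 words) that a text-to-image model "
        "could draw to illustrate this meaning for beginning learners."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# e.g., augment_description("支", "branch")
```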
Figure 8. The experiment procedure.
Figure 9. The online learning user interface with multimodal resources for the experimental group.
Table 1. Types of multimodal resources for similar character learning.

| Modality | Multimodal Resources |
|---|---|
| Image | 1. Image illustration |
| | 2. Summary slide |
| Video | 3. Micro-video |
| Text | 4. Self-test question |
| | 5. Basic information |
Table 2. Demographic characteristics of the participants.

| Characteristics | All (N = 61), n (%) | Experimental Group (N = 31), n (%) | Control Group (N = 30), n (%) | χ² | df | p |
|---|---|---|---|---|---|---|
| Gender | | | | 0.759 | 1 | 0.384 |
| Male | 13 (21.3%) | 8 (25.8%) | 5 (16.7%) | | | |
| Female | 48 (78.7%) | 23 (74.2%) | 25 (83.3%) | | | |
| Self-assessed proficiency in Chinese | | | | 2.908 | 3 | 0.406 |
| Excellent (A) | 13 (21.3%) | 7 (22.6%) | 6 (20.0%) | | | |
| Good (B) | 35 (57.3%) | 18 (58.1%) | 17 (56.7%) | | | |
| Fair (C) | 11 (18.0%) | 4 (12.9%) | 7 (23.3%) | | | |
| Pass (D) | 2 (3.3%) | 2 (6.5%) | 0 (0%) | | | |
| Fail (F) | 0 (0%) | 0 (0%) | 0 (0%) | | | |
| Level of Education | | | | 3.823 | 3 | 0.281 |
| Secondary Education | 36 (59.0%) | 17 (54.8%) | 19 (63.3%) | | | |
| Bachelor's Degree | 22 (36.1%) | 13 (41.9%) | 9 (30.0%) | | | |
| Master's Degree | 2 (3.3%) | 0 (0%) | 2 (6.7%) | | | |
| Doctoral Degree | 1 (1.6%) | 1 (3.2%) | 0 (0%) | | | |
| | Mean (SD) | Mean (SD) | Mean (SD) | U | Z | p |
| Age | 22.1 (2.47) | 21.7 (2.604) | 22.5 (2.307) | 356.5 | −1.582 | 0.114 |
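For readers who wish to verify the group-equivalence checks in Table 2, the sketch below reproduces the reported gender χ² (0.759, df = 1, p = 0.384) directly from the table's counts and illustrates the Mann-Whitney U test used for age. The age values are synthetic stand-ins, since individual-level data are not published, so only the first result will match the table.

```python
import numpy as np
from scipy import stats

# Gender counts from Table 2: rows = experimental/control, cols = male/female.
gender = np.array([[8, 23],
                   [5, 25]])
chi2, p, df, _ = stats.chi2_contingency(gender, correction=False)
print(f"chi2 = {chi2:.3f}, df = {df}, p = {p:.3f}")  # chi2 = 0.759, p = 0.384

# Age comparison via Mann-Whitney U; synthetic ages for illustration only.
rng = np.random.default_rng(0)
age_exp = rng.normal(21.7, 2.604, 31)
age_ctrl = rng.normal(22.5, 2.307, 30)
u, p = stats.mannwhitneyu(age_exp, age_ctrl, alternative="two-sided")
print(f"U = {u:.1f}, p = {p:.3f}")
```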
Table 3. ANCOVA results for the post-test scores of the two groups on different visually similar character categories.

| Character Category | Question N | Group | N | Mean | SD | Adjusted Mean | Adjusted SD | F | p Value |
|---|---|---|---|---|---|---|---|---|---|
| Simple Character | 2 | Experimental | 31 | 7.32 | 1.037 | 7.437 | 0.168 | 5.799 * | 0.019 |
| | | Control | 30 | 6.97 | 0.801 | 6.848 | 0.171 | | |
| Compound Character | 8 | Experimental | 31 | 29.39 | 2.186 | 29.396 | 0.294 | 3.294 | 0.075 |
| | | Control | 30 | 30.17 | 1.440 | 30.157 | 0.299 | | |

* p < 0.05.
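As a sketch of the analysis behind Table 3: an ANCOVA compares post-test scores across the two groups while holding the pre-test score constant as a covariate. The snippet below uses statsmodels on synthetic data (the real per-participant scores are not published), so the printed F and p will not match the table; only the procedure is illustrated.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
# Synthetic per-participant scores standing in for the unpublished data.
df = pd.DataFrame({
    "group": ["experimental"] * 31 + ["control"] * 30,
    "pre": rng.normal(7.0, 1.0, 61),
})
df["post"] = (0.5 * df["pre"]
              + np.where(df["group"] == "experimental", 0.6, 0.0)
              + rng.normal(0, 0.8, 61))

# ANCOVA: post-test as outcome, group as factor, pre-test as covariate.
model = smf.ols("post ~ C(group) + pre", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F and p for the group effect
```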
Table 4. t-test results for the pre-test and post-test scores of the two groups on different phonologically similar character categories.

| Character Category | Question N | Test | Group | N | Mean | SD | t | p Value |
|---|---|---|---|---|---|---|---|---|
| Homophones | 8 | Pre-test | Experimental | 31 | 28.677 | 2.116 | −1.372 | 0.175 |
| | | | Control | 30 | 29.300 | 1.242 | | |
| | | Post-test | Experimental | 31 | 29.741 | 2.228 | 0.547 | 0.586 |
| | | | Control | 30 | 29.467 | 1.565 | | |
| Non-homophones | 2 | Pre-test | Experimental | 31 | 7.290 | 1.113 | −0.486 | 0.628 |
| | | | Control | 30 | 7.433 | 1.146 | | |
| | | Post-test | Experimental | 31 | 6.968 | 1.448 | −2.439 * | 0.019 |
| | | | Control | 30 | 7.667 | 0.596 | | |

* p < 0.05.
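The comparisons in Table 4 are independent-samples t-tests between the two groups at each test point. A minimal SciPy sketch follows, again on synthetic scores, with the means and SDs borrowed from the non-homophone post-test row for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Synthetic post-test scores with roughly the reported means and SDs.
experimental = rng.normal(6.968, 1.448, 31)
control = rng.normal(7.667, 0.596, 30)

t, p = stats.ttest_ind(experimental, control)  # two-sided by default
print(f"t = {t:.3f}, p = {p:.3f}")
```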
Table 5. Paired t-test results of the learning motivation pre-questionnaires and post-questionnaires of the experimental group.

| Variables | Questionnaire | N | Mean | SD | t | p Value |
|---|---|---|---|---|---|---|
| Image Illustration | Pre-questionnaire | 31 | 4.088 | 0.862 | −0.444 | 0.661 |
| | Post-questionnaire | 31 | 4.129 | 0.713 | | |
| Character Evolution and Calligraphy (Micro-video) | Pre-questionnaire | 31 | 4.074 | 0.856 | −2.698 * | 0.011 |
| | Post-questionnaire | 31 | 4.290 | 0.752 | | |

* p < 0.05.
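Table 5 rests on paired t-tests, since the same participants answered the motivation questionnaire before and after learning. A sketch with synthetic Likert-style ratings (illustration only, so the output will not match the table):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic 5-point motivation ratings for the 31 experimental participants.
pre = np.clip(rng.normal(4.07, 0.86, 31), 1, 5)
post = np.clip(pre + rng.normal(0.2, 0.4, 31), 1, 5)

t, p = stats.ttest_rel(pre, post)  # paired: same participants measured twice
print(f"t = {t:.3f}, p = {p:.3f}")
```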
Table 6. Analysis of the online text interview results.

| Modality | Multimodal Resources | Reason Themes | Frequency |
|---|---|---|---|
| Image | Image illustration | 1. Aiding memorization | 3 |
| | | 2. Efficient for comparing differences between similar characters | 2 |
| | | 3. Stimulating interest | 2 |
| | | 4. Helping to understand the meaning of a Chinese character | 1 |
| | Summary slide | 1. Providing concise and intuitive visual comparisons between similar characters | 5 |
| | | 2. Allowing for review of newly learned content | 1 |
| Video | Micro-video | 1. Helping to understand Chinese character evolution and meanings | 4 |
| | | 2. Linking well across videos to help distinguish similar characters | 3 |
| | | 3. Stimulating interest | 3 |
| | | 4. Aiding memorization | 2 |
| | | 5. Helping to accumulate knowledge | 1 |
| Text | Self-test question | 1. Allowing for testing of newly learned content | 1 |
| | | 2. Helping to distinguish differences in the meanings of Chinese characters | 1 |
| | Basic information | 1. Intuitive | 1 |
| Total | | | 30 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
