1. Introduction
As cultural diversity continues to grow in modern society, understanding the identity of each culture has become increasingly important. In particular, language not only functions as a means of communication but also reflects cultural identity and historical context. In Korean society, Hanja (Chinese characters) and four-character idioms (Sajaseongeo, 四字成語, idioms strictly composed of four characters) serve as representative cultural elements that embody symbolic meaning. Over a long historical period, they have acquired distinct meanings within Korea. In fact, Hanja accounts for approximately 70% of Korean vocabulary and over 90% of technical terms, demonstrating its substantial influence on the Korean language [1,2,3].
Chinese character idioms, a subset of Sino-Korean vocabulary, effectively encapsulate complex meanings or expressions in a concise form by leveraging the logographic nature of Hanja [4]. These idioms encompass both Gosaseongeo (故事成語, idioms derived from classical stories) and Sajaseongeo. Among them, Sajaseongeo are idiomatic expressions consisting of exactly four characters that convey the meaning of long sentences within a compact structure. This study focuses specifically on Sajaseongeo.
Hanja has its roots in China but has been adopted and developed differently across various Asian countries, including Korea, Japan, and Vietnam [5]. Chinese characters were first introduced to the Korean Peninsula during the late Gojoseon period, primarily in the Pyeongan region. In the regions of the Four Commanderies of Han, people are presumed to have been exposed to a relatively well-developed form of Classical Chinese [6,7,8]. By the Three Kingdoms period (4th–7th centuries), Goguryeo, Baekje, and Silla actively adopted and utilized Chinese characters [9,10]. By the Unified Silla period (7th–9th centuries), the Idu script, a writing system that borrows the pronunciation and meaning of Chinese characters to represent the Korean language, had emerged. This development marked the beginning of Korean adaptations of Hanja, in which native Korean words were either written phonetically using Chinese characters or combined with borrowed Chinese vocabulary to create new words [11]. During the Goryeo and Joseon dynasties, Hanja evolved beyond a mere writing system to become a central component of bureaucratic governance. In particular, Confucian classics formed the foundation of state education in the Joseon period, solidifying Hanja vocabulary and four-character idioms as the common linguistic framework among scholars and officials. However, with the advent of the modern era and the implementation of the Hangul-only policy, the use of Hanja gradually declined [12].
Throughout this historical transformation, certain Sino-Korean expressions have developed unique meanings and usage patterns within the Korean cultural context [13,14]. In particular, Korean four-character idioms have evolved in various ways: some derive from classical Korean texts, as seen in Gyeonmunbalgeom (見蚊拔劍, reacting with excessive anger over a trivial matter) and Myoduhyeollyeong (猫頭懸鈴, a futile discussion that cannot be put into practice), while others recombine and condense existing Chinese idioms into uniquely Korean expressions [15].
For instance, the idiom Hongikingan (弘益人間, “to benefit all of humanity”) originates from Samguk Yusa (Memorabilia of the Three Kingdoms), which documents the founding myth of Gojoseon. Another example is Hamheungchasa (咸興差使), a metaphor for a situation in which no news returns after a messenger or errand is sent; here, Hamheung refers to a city in present-day North Korea. Additionally, Sino-Korean synonyms, antonyms, and homophones, despite their roots in the Chinese writing system, have frequently been transformed and redefined under the influence of Korean culture [16]. These examples illustrate that both Hanja and four-character idioms are linguistic features that require not only linguistic knowledge but also cultural understanding and contextual interpretation. Thus, they present unique challenges for evaluating the cultural and contextual comprehension of large language models (LLMs).
Modern LLMs have predominantly been developed and trained on Western languages and cultural frameworks, while non-Western linguistic and cultural contexts remain underrepresented [17,18]. This limitation suggests that LLMs may struggle to fully grasp the contextual nuances of diverse cultures. In response, this study proposes a benchmark for evaluating the natural language processing capability and cultural comprehension of LLMs using Hanja and Sajaseongeo, core cultural elements of Korea. The following sections review the relevant literature, present the experimental design and methodology, describe the experimental setup and results, and conclude with limitations and directions for future research.
3. Methodology
3.1. Benchmark Construction
The benchmark dataset for this study was constructed based on the question types used in the Korean Hanja Proficiency Test (https://www.hanja.re.kr/, accessed on 14 May 2025), a state-certified examination conducted in South Korea to assess proficiency in Hanja. Among the six official testing institutions recognized by the Ministry of Education of South Korea, we referred to the Korean Language Society’s Korean Hanja Proficiency Test. This test is structured into proficiency levels ranging from Grade 8 (beginner) to Grade 1 (advanced), with additional higher-tier classifications, “Special Grade” and “Semi-Special Grade”, for expert-level assessment.
For this study, we focused on Grade 1 to Grade 6, as these levels are widely taken by test candidates and cover the common-use Hanja shared across multiple testing institutions. The Special Grade and Semi-Special Grade levels were excluded since they assess expert-level proficiency and follow different question formats compared to other levels.
Among the common question types in Grades 1 to 6, we selected four primary question types for evaluation: four-character idioms, synonyms, antonyms, and homophones. The Hanja Proficiency Test includes different question types depending on the level, and it primarily focuses on assessing individual Hanja characters. However, questions involving four-character idioms, synonyms, antonyms, and homophones appear across Grades 1 to 6. These four types were selected for this study because they assess not individual characters but word-level expressions that often involve figurative meanings. Moreover, actual questions from each of these types in the official test were used as prompts in the experiment.
The dataset was constructed as follows (a schematic example of the entry format follows the list):
Question Type 1 (four-character idioms): Each entry consists of a four-character idiom, its constituent Chinese characters, and the Korean definition.
Question Type 2 (synonyms) and Question Type 4 (homophones): Each entry includes a two-syllable Sino-Korean word presented as the question, its corresponding Hanja representation, and its Korean definition.
Question Type 3 (antonyms): Each entry consists of a two-syllable Sino-Korean word as the base term, its opposing two-syllable Sino-Korean word, and the Korean definition of the antonym.
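To make the entry structure concrete, the sketch below shows how one record of each question type might be represented; the field names and the antonym example are illustrative assumptions, not the published schema.

```python
# Illustrative records for each question type (field names are hypothetical).
idiom_entry = {  # Question Type 1: four-character idioms
    "idiom": "事必歸正",
    "characters": ["事", "必", "歸", "正"],  # constituent Hanja
    "definition": "everything eventually returns to what is right",  # Korean in the actual dataset
}

word_entry = {  # Question Types 2 and 4: synonyms / homophones
    "word": "가약",  # two-syllable Sino-Korean word (the question)
    "hanja": "佳約",  # Hanja representation
    "definition": "a promise to become married",  # Korean in the actual dataset
}

antonym_entry = {  # Question Type 3: antonyms
    "base_word": "總角",  # base term
    "antonym": "<opposing word>",  # opposing two-syllable Sino-Korean word
    "definition": "<Korean definition of the antonym>",
}
```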
For each question type, the dataset includes 200 samples, ensuring a balanced distribution across all categories. In addition, a supplementary dataset comprising 3829 individual Hanja entries was constructed for the purpose of generating answer choices. This dataset includes information on each character’s Hanja form, Korean meaning, and pronunciation. The dataset is publicly available at https://github.com/es133lolo/Knowing-the-Words-Missing-the-Meaning (accessed on 14 May 2025).
3.2. Evaluation Method
This study followed the procedure illustrated in Figure 1. First, four common question types from the official Hanja Proficiency Test, shared across Grades 1 through 6, were selected. Prompt templates and answer-choice generation rules were then established for each question type. Subsequently, datasets specific to each question type, along with the individual Hanja character data required for generating answer choices, were constructed. Using the ChatGPT API, experiments were conducted on this standardized dataset. The GPT-4o model was employed for the experiments.
Following the prompt instructions, the GPT model was tasked with generating four choices, selecting the correct answer, and providing an explanation for its choice. All responses and explanations generated by GPT were recorded. The correctness of the selected answers and the accuracy of the explanations were reviewed by graduate students specializing in Classical Chinese. Lastly, the selection rate of each answer choice and the accuracy of the model-generated explanations were analyzed.
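As an illustration of this step, a minimal sketch of such an evaluation call with the OpenAI Python client is shown below; the prompt wording, function name, and question formatting are assumptions, not the study’s exact prompts.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt template mirroring the instructions described above:
# generate four choices, select the correct answer, and explain the choice.
PROMPT = (
    "You are answering a question from the Korean Hanja Proficiency Test.\n"
    "Question type: {qtype}\n"
    "Question: {question}\n"
    "Generate four answer choices, select the correct one, "
    "and explain the reason for your selection."
)

def ask_model(qtype: str, question: str, model: str = "gpt-4o") -> str:
    """Send one benchmark item to the model and return its raw response text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(qtype=qtype, question=question)}],
        temperature=0,  # deterministic decoding for a reproducible evaluation
    )
    return response.choices[0].message.content
```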
This study evaluated the GPT model based on three metrics: the accuracy of the selected answers, the accuracy of the explanations, and the consistency in answer selection across different engines. For each GPT engine, the accuracy of explanations corresponding to both the generated and selected choices was assessed by question type. In all four question types, the correct answer was consistently set as choice 1, and choices 2, 3, and 4 were regarded as distractors. However, even if the model selected choice 1, the item was treated as containing an explanatory error if the accompanying explanation included inaccurate information.
Responses and explanations were classified into the following five categories for systematic assessment (a minimal code sketch of this classification follows the list):
Both the answer and explanation were correct.
The answer was correct, but the explanation was incorrect.
The answer was incorrect, but the explanation was correct.
Both the answer and explanation were incorrect.
No explanation was provided, despite being prompted.
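A minimal sketch of this classification, assuming each reviewed response is reduced to three boolean judgments (answer correct, explanation correct, explanation present), might look as follows; the function and label names are illustrative.

```python
def classify_response(answer_correct: bool,
                      explanation_correct: bool,
                      explanation_given: bool) -> str:
    """Assign one of the five assessment categories to a model response."""
    if not explanation_given:
        return "no_explanation"            # category 5: prompted, but no explanation
    if answer_correct and explanation_correct:
        return "both_correct"              # category 1
    if answer_correct:
        return "answer_only_correct"       # category 2
    if explanation_correct:
        return "explanation_only_correct"  # category 3
    return "both_incorrect"                # category 4
```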
By applying these detailed evaluation criteria, the response quality of the GPT models was assessed systematically. For items with inaccurate or flawed explanations, the two experts engaged in discussion to reach a final evaluation. To assess inter-annotator agreement (IAA), Cohen’s Kappa coefficient was calculated. The resulting score, $\kappa$, was 0.607, indicating a substantial level of agreement between the two expert annotators.
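For reference, this agreement score can be reproduced with scikit-learn’s implementation of Cohen’s Kappa; the two label lists below are hypothetical stand-ins for the annotators’ category judgments, not the study’s actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical category labels assigned by the two expert annotators
# to the same set of reviewed items (categories as defined above).
annotator_a = ["both_correct", "answer_only_correct", "both_incorrect",
               "both_correct", "explanation_only_correct"]
annotator_b = ["both_correct", "both_incorrect", "both_incorrect",
               "both_correct", "explanation_only_correct"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")  # the study reports kappa = 0.607
```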
5. Results and Analysis
5.1. Metrics
5.1.1. Accuracy
In this study, we evaluated the accuracy of different GPT engine models in solving the primary question types from the Korean Hanja Proficiency Test: four-character idioms, synonyms, antonyms, and homophones. Accuracy is defined as follows:

$$\mathrm{Accuracy} = \frac{N_{C \wedge E}}{N_{\mathrm{total}}}$$

where $C$ (choice) refers to Choice 1 (the correct answer) and $E$ (explanation) denotes the explanation generated by the large language model. $N_{C \wedge E}$ represents the number of questions for which the GPT model both selected the correct answer and provided an accurate explanation, and $N_{\mathrm{total}}$ denotes the total number of questions evaluated for each question type (in this study, 200 questions per type, for a total of 800 across all types).
5.1.2. Partial Accuracy
Partial Accuracy ($PA$) comprises two categories: cases where the answer was correct but the explanation was incorrect, and cases where the answer was incorrect but the explanation was correct. It is defined as follows:

$$PA = \frac{N_{C \wedge \neg E} + N_{\neg C \wedge E}}{N_{\mathrm{total}}}$$

where $N_{C \wedge \neg E}$ is the number of cases where the model selected the correct answer but provided an incorrect explanation, and $N_{\neg C \wedge E}$ is the number of cases where the model selected an incorrect answer but provided a correct explanation.
5.1.3. Error Rate
The Error Rate ($ER$) measures the proportion of cases where both the answer and the explanation were incorrect, calculated as follows:

$$ER = \frac{N_{\neg C \wedge \neg E}}{N_{\mathrm{total}}}$$

where $N_{\neg C \wedge \neg E}$ represents the number of completely incorrect responses (i.e., cases where the model provided both an incorrect answer and an incorrect explanation).
5.1.4. No Explanation Provided Rate
The No Explanation Provided ($NEP$) rate refers to the proportion of cases where the model did not generate an explanation despite being prompted to do so. It is calculated as follows:

$$NEP = \frac{N_{C \wedge \varnothing}}{N_{\mathrm{total}}}$$

where $N_{C \wedge \varnothing}$ is the number of questions for which the model provided a correct answer but failed to generate an explanation.
5.1.5. Mean Scores for Comparative Performance Analysis Across Models
To compare the performance of different models, we computed mean scores by averaging the accuracy across the four question types:

$$\mathrm{Mean} = \frac{1}{4} \sum_{i=1}^{4} \mathrm{Acc}_i$$

where $\mathrm{Acc}_i$ represents the accuracy score for each of the four question types.
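Under the assumption that each graded item carries one of the five category labels from Section 3.2, the sketch below shows how these four rates and the mean score could be computed; it is an illustration, not the study’s actual evaluation script.

```python
from collections import Counter

def compute_metrics(categories: list[str]) -> dict[str, float]:
    """Compute Accuracy, Partial Accuracy, Error Rate, and NEP rate
    from per-item category labels (see the classification sketch in Section 3.2)."""
    n = len(categories)  # 200 items per question type in this study
    counts = Counter(categories)
    return {
        "accuracy": counts["both_correct"] / n,
        "partial_accuracy": (counts["answer_only_correct"]
                             + counts["explanation_only_correct"]) / n,
        "error_rate": counts["both_incorrect"] / n,
        "nep_rate": counts["no_explanation"] / n,
    }

def mean_score(per_type_accuracies: list[float]) -> float:
    """Average the accuracy scores of the four question types."""
    return sum(per_type_accuracies) / len(per_type_accuracies)
```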
5.2. Overall Performance Evaluation
Table 2 presents the evaluation results of GPT-4o’s understanding of questions related to four-character idioms and Sino-Korean words, based on the metrics outlined in Section 5.1. GPT-4o demonstrated consistently high accuracy across all question types. In particular, it achieved 97.0% accuracy for Question Type 2 and 97.5% for Question Type 4. Even in Question Types 1 and 3, the model recorded accuracies of 88.0% and 97.0%, respectively, maintaining a performance level above 88% across all types.
An analysis of the incorrect responses revealed that GPT-4o rarely selected incorrect answer choices. Notably, in Question Types 2 and 4, the error rate remained below 3%, suggesting that the model is highly capable of distinguishing the most semantically appropriate choice among distractors. Furthermore, the near-zero rate of unanswered questions across all types indicates that GPT-4o generated responses with a high level of confidence when presented with questions involving four-character idioms and Hanja.
The highest accuracy was observed for Question Type 4 (97.5%), demonstrating GPT-4o’s ability to identify the most semantically suitable Hanja character among those with identical pronunciations. In contrast, the lowest accuracy was found for Question Type 1. Within this category, the model most frequently selected choice 3, which corresponds to a Hanja expression with the same meaning as the correct answer. This suggests that GPT-4o may encounter difficulties in contextually distinguishing between semantically equivalent Hanja expressions when selecting the most contextually appropriate form.
5.3. Evaluation of LLMs’ Cultural Understanding
To evaluate the cultural understanding of large language models, the accuracy of explanations generated by the GPT engine was also adopted as a key evaluation metric. This is because accuracy alone does not clearly indicate whether the model relied solely on the meanings of individual Hanja characters or correctly understood the semantic and cultural context of the given four-character idiom or Sino-Korean word. Therefore, higher accuracy in explanation was interpreted as indicative of stronger cultural comprehension by the model.
As shown in Table 3, GPT-4o demonstrated high explanation accuracy across all question types. The proportion of responses in which both the selected answer and the corresponding explanation were correct reached 88.0% for four-character idioms, 84.0% for synonyms, 86.5% for antonyms, and 90.5% for homophones, maintaining an overall accuracy of over 84%. These results suggest that GPT-4o not only selects correct answers but also possesses the ability to appropriately interpret the cultural and semantic context underlying those answers.
In the case of Question Type 4, GPT-4o achieved both a 97.5% accuracy rate and a 90.5% explanation accuracy, indicating a highly stable performance in distinguishing between identically pronounced Hanja characters and in providing meaning-based interpretations. In contrast, Question Types 2 and 3 showed relatively higher explanation error rates of 13.0% and 10.5%, respectively. This suggests that the model may have relative weaknesses in the contextual reasoning required to explain fine-grained semantic differences between closely related concepts.
5.4. Comparative Evaluation Across Models
To compare the overall performance of each model, average scores were calculated. A random sample of 100 questions was drawn from the experimental items used in Section 5.2 and Section 5.3. The additional comparative evaluation included three GPT-based engines (GPT-3.5-turbo, GPT-4-turbo, and GPT-4o) as well as a non-GPT model, Claude 3 Opus.
As shown in Table 4, GPT-4o exhibited the highest performance in terms of cultural understanding. The model achieved accuracy rates of 89%, 84%, 86%, and 93% for Question Types 1 through 4, respectively, maintaining stable performance above 84% across all categories. Notably, its overall average score of 88.0 was the highest among the evaluated models, suggesting that GPT-4o demonstrated the strongest ability to interpret cultural context, even when compared with the other GPT-based models.
In contrast, GPT-3.5-turbo recorded the lowest overall average score at 48.75. Its performance was particularly weak on antonyms and homophones, with accuracy rates of 24% and 7%, respectively. Although the model was prompted to provide explanations, it responded only with the selected choice number, omitting the requested justifications.
GPT-4-turbo yielded a mid-range performance, with an overall average of 78.0. However, it showed comparatively lower performance on the synonym task, with an accuracy rate of 75%.
Claude 3 Opus, the only non-GPT model included in the comparison, achieved the highest score of 97% on the homophone task. However, it demonstrated lower accuracy on synonyms (55%) and antonyms (45%). A notable issue with Claude was the frequency of incorrect explanations: out of 100 questions, 35 explanations were incorrect for synonyms and 48 for antonyms, despite the model selecting the correct answer choices. Furthermore, during the antonym evaluation, Claude occasionally generated and answered additional questions beyond those specified in the prompt. All responses to these self-generated “second question” items were evaluated as incorrect. Claude autonomously generated 21 such additional questions; excluding those, it produced correct explanations for 18 questions from the original prompt.
Taken together, GPT-4o demonstrated consistently high and reliable accuracy across all four types of culturally grounded Hanja questions. These results suggest that GPT-4o is the most culturally adept model among those tested, exhibiting the most refined reasoning abilities in the context of Hanja literacy evaluation.
5.5. Limitations
This study limited its scope to four question types from the Korean Hanja Proficiency Test that emphasize the evaluation of Sino-Korean words and four-character idioms with broader semantic meanings, rather than focusing on the meanings of individual Hanja characters. Accordingly, the dataset used for the study was curated to align with these four question types, and the experiments were conducted using approximately 200 items per type. This constraint inherently limits the scale and diversity of the dataset.
In addition, the experiments were conducted exclusively on closed large language models. Future research should consider conducting comparative evaluations using open-source large language models on the same dataset to examine potential differences in performance.
6. Discussion
All three GPT engines failed to correctly answer Question Type 1 items in which the correct choice was a Hanja character conveying negation. Hanja characters such as 不, 否, 弗, 非, 未, 無, and 莫 all represent forms of negation, typically meaning “not” or “none.” While these characters share a similar core meaning, their usage differs depending on context and the degree of negation. However, most dictionaries define them uniformly as “not” or “none,” without clarifying the nuanced distinctions or usage contexts among them. This limitation appears to have led the GPT models to misinterpret these differences. For example, for the four-character idiom 無所不爲, meaning “there is nothing one cannot do” and typically associated with absolute power, GPT-4o answered incorrectly by selecting the character 莫, which is synonymous with the correct character. Likewise, GPT-4-turbo selected 未 instead of 非 in the idiom 非夢似夢 (“a state between sleep and wakefulness”), and GPT-3.5-turbo replaced 不 with 未 in 不撤晝夜 (“working tirelessly day and night”).
Additional instances were also found in which the models failed to distinguish between Hanja characters with similar meanings. For example, in the four-character idiom 事必歸正 (meaning “everything eventually returns to what is right”), GPT confused 事 (“affair” or “matter”) with 業 (“task” or “profession”), both of which can signify “work” or “activity.” Similarly, in 打草驚蛇 (meaning “to startle a snake by striking the grass”), the character 打 (“to strike”) was confused with 擊 (“to hit” or “to attack”). These errors indicate that GPT has difficulty discerning the subtle semantic nuances between individual Hanja characters. Notably, even for idioms such as 無所不爲 and 事必歸正—both of which are widely used in Korean—GPT exhibited confusion among Hanja characters with similar meanings.
The Sino-Korean words 佳約, 約婚, 定婚, and 婚約 all denote “a promise to become married.” These expressions are commonly used in Korean society. When 佳約 was the correct answer, GPT failed to select it, likely due to its primary dictionary definition of “a beautiful promise.” However, in both Korean and Chinese cultural contexts, the promise of lifelong partnership through marriage is viewed as a beautiful and virtuous act. Therefore, 佳約 also implicitly conveys the meaning of a “marriage agreement.” Similarly, the idiom 一瀉千里, which literally means “a river flows a thousand li in a single rush,” figuratively refers to something progressing rapidly and without obstruction. In this idiom, the character “一” not only means “one” but can also mean “once” or “in one go.” However, Claude interpreted “一” solely as “one,” its most basic and common meaning. These examples suggest that both Claude and GPT were trained primarily on the canonical definitions of Sino-Korean words and four-character idioms widely used in Korean, and as a result, had difficulty recognizing the extended or context-specific meanings these expressions may carry.
Although four-character idioms consist of only four Hanja characters, they often convey broad and complex meanings. Many are metaphorical in nature and are rooted in ancient stories or classical literature. These characteristics make accurate interpretation difficult for models such as GPT and Claude. For instance, in the idiom 犬馬之勞 (meaning “humble service”), the term 犬馬 (“dogs and horses”) is a well-established metaphor from classical texts. However, GPT-4o did not associate the idiom with its historical or cultural background and instead focused on finding a meaning closest to the idiom’s Korean gloss by analyzing the individual Hanja components. As a result, GPT-4o selected the character 繭 (“cocoon”), likely based on its connotation of smallness, thereby failing to associate 犬馬 with the metaphor of loyal yet humble laborers such as dogs and horses.
In Question Type 2, the terms 瓜年, 瓜期, 瓜滿, 瓜時, and 破瓜 all refer to the end of a government official’s term. Additionally, 瓜年 is used to describe a woman who has reached marriageable age. Both meanings are derived metaphorically from the ripening of cucumbers. In this context, the cucumber is symbolically associated with both official duties and women, and these meanings emerged from cultural metaphors linking seasonal ripeness with life transitions. However, the GPT models focused only on the literal meaning of the character 瓜 (“cucumber”) and failed to connect it with its culturally derived metaphorical extensions. As a result, the models generated inaccurate explanations such as “the year for harvesting cucumbers” and produced incorrect answers. This case suggests that GPT engines face significant challenges in interpreting metaphorical expressions embedded in culturally specific linguistic contexts.
The origin of Hanja can be traced back to China, and the meanings of many Sino-Korean words are based on usage patterns established in Chinese contexts. However, some Sino-Korean words have diverged in meaning or usage within Korean society. For example, the term 總角 refers to an “unmarried man” in Korean, whereas in Chinese, it is interpreted as “a minor” or “a young child.” The GPT-4o model interpreted 總角 as “a child” and attempted to select the antonym “adult” as the correct answer, but was unable to do so because that option was not included among the choices. This outcome likely reflects the fact that the number of users of Chinese characters is significantly higher in China than in Korea, and that the amount of Hanja-related data available online is disproportionately skewed toward Chinese usage. In other words, GPT has been far more exposed to Chinese interpretations of Hanja, and, thus, has had insufficient training on meanings and usage cases that are specific to the Korean context.
This example underscores the need for further efforts to accurately reflect Korean language and cultural perspectives. In order for large language models to attain a deeper understanding of Korean Hanja culture, a large-scale parallel corpus of Korean and Hanja is necessary. Such a dataset must not only include surface-level meanings but also encompass culturally embedded and context-specific interpretations that are prevalent in Korean society. Moreover, to accurately infer the meanings of Sino-Korean words and four-character idioms, it is essential to shift the modeling approach away from focusing solely on the literal meanings of individual Hanja characters. Instead, the models should be informed by Korean classical literature and real-world language usage data that better capture the cultural and metaphorical dimensions of these expressions.
7. Conclusions and Future Work
This study proposes a novel benchmark for evaluating the cultural understanding and natural language processing capabilities of large language models, grounded in Hanja and four-character idioms, which are significant linguistic assets in Korean culture. Reflecting the official question types used in Korea’s Hanja Proficiency Test, the benchmark includes four question types: four-character idioms, synonyms, antonyms, and homophones. Based on this framework, the study systematically compares and analyzes the performance of both GPT-based and non-GPT-based large language models.
The experimental results indicate that GPT-4o achieved the highest accuracy and explanation quality across all four question types. This suggests that the model can comprehend the nuanced meanings of Sino-Korean words and four-character idioms as used in Korean, as well as interpret their cultural contexts. The model demonstrated particularly strong performance on antonym and homophone questions, and the four-character idiom tasks enabled an assessment of how large language models infer metaphorical expressions. However, difficulties remain in distinguishing the subtle semantic nuances of individual Hanja characters. GPT-4o often relied on the primary dictionary definitions of Hanja and tended to interpret them according to Chinese usage rather than meanings specific to Korean society.
This study reveals the current limitations in large language models’ understanding of Hanja as used in Korean contexts, underscoring the need for evaluation tools that reflect the uniquely Korean interpretations of Hanja, distinct from Chinese conventions. To develop models that accurately represent individual cultures and languages, including Korean, it is essential to build datasets that reflect the culturally embedded meanings of Korean Hanja. Moreover, collaborative frameworks that incorporate multiple tools—such as compound systems—should be actively adopted.
By constructing a benchmark based on Korean–Hanja pairs rather than a single language, this study introduces a new approach to evaluating cultural comprehension in large language models through Sino-Korean words and four-character idioms. This research contributes to advancing performance evaluation from a purely quantitative focus to a more qualitative evaluation that incorporates cultural context, bridging the fields of humanities and natural language processing. Ultimately, it aims to serve as a foundation for global benchmarking efforts that assess the linguistic and cultural competencies of large language models in non-English languages, including Korean.