Article

KLSBench: Evaluating LLM Capabilities on Korean Literary Sinitic Texts in Historical Context

1 Department of Artificial Intelligence, Ajou University, Suwon 16499, Republic of Korea
2 Department of Sinographic Literatures, Korea University, Seoul 02841, Republic of Korea
* Authors to whom correspondence should be addressed.
Appl. Sci. 2026, 16(2), 953; https://doi.org/10.3390/app16020953
Submission received: 29 November 2025 / Revised: 23 December 2025 / Accepted: 12 January 2026 / Published: 16 January 2026

Featured Application

KLSBench provides essential evaluation protocols for deploying LLM-powered applications on digitized Korean Literary Sinitic archives. This benchmark enables systematic assessment of automated annotation systems, intelligent search interfaces, and educational platforms for digital humanities projects, supporting historians, philologists, and students in accessing and analyzing historical text collections.

Abstract

Large language models (LLMs) show limited capability in processing low-resource historical languages due to insufficient training data and domain-specific linguistic structures. Korean Literary Sinitic (KLS), the principal written medium of the Joseon dynasty, remains particularly under-resourced despite its lexical overlap with modern Korean and shared script with classical Chinese. To enable systematic evaluation in this domain, we introduce KLSBench, a comprehensive benchmark for assessing LLM performance on KLS. KLSBench contains 7871 instances sourced from Joseon dynasty civil service examination archives and parallel corpora of the Four Books, and encompasses five task categories: classification, retrieval, punctuation restoration, natural language inference, and translation. Our evaluation suggests KLSBench could work as an effective diagnostic tool that distinguishes lexical recall from deeper linguistic comprehension in low-resource historical languages. Beyond establishing evaluation baselines, KLSBench provides practical frameworks for deploying LLM-based tools in digital humanities contexts, including automated annotation systems and intelligent search interfaces for classical text repositories.

1. Introduction

Large language models (LLMs) [1,2,3,4] have demonstrated impressive performance on a wide range of natural language processing tasks [5,6,7] and benchmarks, particularly for modern high-resource languages [8,9]. However, the ability of LLMs to generalize from high-resource modern benchmarks to low-resource historical languages remains poorly understood, particularly in scenarios where training data is limited and linguistic conventions differ substantially from modern usage. Addressing this limitation is important for evaluating the practical applicability of LLMs to historical and low-resource language processing tasks.
A key challenge in studying this problem is the lack of appropriate evaluation benchmarks. Existing multilingual and historical benchmarks often focus on modernized or translated texts, making it difficult to isolate whether LLMs can leverage cross-lingual knowledge when direct supervision is limited [10,11,12]. As a result, we currently lack systematic evidence on how LLMs perform when faced with historical languages that are low-resource yet structurally connected to high-resource languages.
Korean Literary Sinitic (KLS; see Appendix C for a definition) serves as an appropriate testbed for addressing this gap. Historically used in Korea as the primary written language for scholarship and administration, KLS is currently low-resource in terms of digital corpora, while maintaining close linguistic connections to modern Chinese through shared logographic characters and to modern Korean through extensive Sino-Korean vocabulary (see Appendix C for a definition) [13,14]. This combination of data scarcity and cross-lingual relatedness enables systematic evaluation of whether LLMs can leverage cross-lingual information to mitigate limited in-language resources, thereby making KLS well suited for examining cross-lingual generalization in low-resource historical language settings.
Beyond data scarcity, KLS also poses linguistic challenges that further stress current models. Texts are written in condensed classical syntax with implicit grammatical relations [15] and are typically preserved without punctuation (白文; see Appendix C for a definition), requiring models to infer sentence boundaries and structure. Moreover, KLS texts contain dense cultural references and rhetorical conventions rooted in classical East Asian traditions [16]. These characteristics distinguish KLS from modern language benchmarks and highlight the need for evaluation settings that reflect the realities of historical text processing.
To evaluate LLM performance under these conditions, we introduce KLSBench, a benchmark designed to assess understanding of Korean Literary Sinitic. Following the design philosophy of C³Bench [17] for Chinese classical texts, KLSBench is constructed from Joseon dynasty civil service examination materials and canonical Confucian texts from the Four Books (四書). The benchmark comprises 7871 instances across five tasks—classification, retrieval, punctuation, natural language inference, and translation—covering both linguistic form and semantic understanding. We evaluate seven LLMs, including proprietary models (GPT-4 Turbo, GPT-3.5 Turbo, Claude 3.5 Sonnet, Claude 3 Opus) and open-source models (LLaMA 3.1, Qwen 2.5, EXAONE 3.0). Our results reveal clear performance stratification across task types, offering new insights into the strengths and limitations of current LLMs in leveraging cross-lingual knowledge for low-resource historical languages.
The main contributions of this work are as follows:
1. We introduce KLSBench, the first comprehensive benchmark for evaluating LLM performance on Korean Literary Sinitic, which contains 7871 instances across five task categories, constructed from Joseon civil service examination materials and parallel corpora of the Four Books.
2. We conduct a systematic evaluation of seven LLMs covering general-purpose, Chinese-specialized, and Korean-specialized pretraining profiles. The evaluation results provide a quantitative performance landscape for KLS understanding, enabling reproducible comparison across model families and establishing a public leaderboard.
3. We publicly release KLSBench along with evaluation code and detailed documentation to facilitate reproducible research and continued advancement in Korean Literary Sinitic understanding. The benchmark, code, and zero-shot outcomes are available at https://songhune.github.io/korean_R-CoA/ (accessed on 10 January 2026).
Section 2 reviews related work on LLM evaluation benchmarks and classical text processing in historical languages. Section 3 presents the construction of KLSBench, including data sources, task design for all five tasks, and dataset statistics. Section 4 describes the experimental setup, evaluated models, and evaluation metrics. Section 5 presents comprehensive results and analyses for each task, including overall performance, model comparisons, and error analysis. Section 6 discusses key findings, implications for classical text understanding, and limitations. Finally, Section 7 concludes the paper and outlines directions for future research.

2. Related Works

2.1. LLM Evaluation Benchmarks

Large language model development has driven the creation of evaluation benchmarks that assess capabilities across tasks and domains. MMLU (Massive Multi-task Language Understanding) [8] evaluates models across 57 subjects including humanities, STEM, and social sciences. HELM (Holistic Evaluation of Language Models) [9] provides a framework covering accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency across 42 scenarios. BigBench [18] extends evaluation with over 200 tasks contributed by the research community.
Domain-specific benchmarks evaluate specialized knowledge and reasoning. Legal-Bench [19] assesses legal reasoning across tasks including case analysis and contract understanding. Medical benchmarks such as MedQA [20] and PubMedQA [21] evaluate medical question answering using USMLE-style questions and biomedical literature. ScienceQA [22] provides multimodal science question answering across physics, chemistry, and biology.
Multilingual evaluation has received attention [11,23]. XTREME [24] and XGLUE [25] benchmark cross-lingual understanding and generation across 40+ languages, focusing on modern high-resource languages. XNLI [26] extends natural language inference evaluation to 14 languages including low-resource languages. For Korean, KorNLI and KorSTS [27] evaluate natural language inference and semantic textual similarity in contemporary Korean, while KLUE [28] provides a benchmark suite for modern Korean language understanding.

2.2. Low-Resource and Classical Language Processing

Evaluating LLMs on classical and historical languages presents challenges beyond modern language processing. Cao et al. [17] introduced C³Bench, a classical Chinese understanding benchmark comprising approximately 50,000 instances across five tasks spanning ten domains of classical literature. Their evaluation of 15 LLMs revealed performance gaps between classical and modern Chinese understanding, showing that proficiency in contemporary language does not automatically transfer to classical texts. This work established baselines for classical Chinese evaluation, yet evaluating classical texts in other East Asian contexts remains unexplored.
Literary Sinitic text processing involves difficulties distinct from modern language tasks [29,30,31]. The absence of explicit word boundaries necessitates segmentation algorithms [15], while named entity recognition must account for historical figures, places, and institutions no longer in contemporary use [16,32]. Classical texts employ grammatical structures and literary conventions that differ from modern forms [33], requiring linguistic knowledge for interpretation. These challenges are compounded by limited training data [34], absence of modern native speakers, and sparse linguistic resources [35,36].
Korean classical texts present challenges beyond those addressed by C³Bench for Chinese classical texts [17]. KLS is a low-resource language, and traditional Literary Sinitic written in orthodox characters (正體字) has been largely excluded from contemporary Chinese NLP research following mainland China's standardization of simplified characters (簡化字) in the 1950s–1960s. However, KLS maintains connections to two high-resource languages: modern Chinese and modern Korean. Modern Chinese is historically and lexically connected to KLS through their shared logographic character system and substantial overlap in classical lexical forms, while modern Korean has inherited a substantial body of vocabulary from Sino-Korean etymological roots. This creates a scenario in which transfer learning from related high-resource languages could compensate for training data scarcity, yet whether LLMs exploit these cross-lingual bridges remains unverified.
Korean's dual-script nature (Hangul and Literary Sinitic; see Appendix C for a detailed explanation), character-encoding complexities (multiple Unicode normalization forms, polyphonic characters such as 更 and 度, and initial-sound law variations), and region-specific cultural interpretations require specialized preprocessing. No evaluation framework exists for Korean classical literature understanding. Establishing such a benchmark addresses this gap and helps clarify how LLMs leverage cross-lingual knowledge for low-resource historical languages connected to contemporary high-resource languages.

3. KLSBench: Benchmark Construction

Figure 1 provides an overview of the KLSBench architecture from data sources through evaluation. The benchmark integrates two data sources (Joseon dynasty Gwageo records and Four Books) through a data processing pipeline. We evaluate five tasks using seven LLMs and produce performance analysis. Task examples are provided in the Appendix A.1.

3.1. Data Sources

KLSBench uses two primary data sources drawn from the Joseon dynasty’s literary and examination traditions.

3.1.1. Joseon Dynasty Civil Service Examination Records

The Gwageo (科擧) civil service examination system selected government officials during the Joseon dynasty, operating from 1392 until its abolition in the Gabo Reforms of 1894. The examination consisted of multiple stages and required mastery of KLS literature, philosophy, and literary genres. The dataset draws from historical examination records including original questions and model answers preserved in literary collections and examination anthologies.
We extract data from three classes in the knowledge graph: Question, Answer, and Sigwon (physical answer sheets). After preprocessing including duplicate removal, length filtering, format standardization, and quality verification, we retained 2849 instances from an original corpus of 3348 question–answer pairs. Each instance includes the original KLS text with Korean and English translations, spanning 21 literary genres. Figure 2 shows the temporal distribution of examination questions across the Joseon dynasty, with concentration in later periods due to better preservation.

3.1.2. Four Books (四書)

The Four Books (Saseo: 論語 Analects, 孟子 Mencius, 大學 Great Learning, 中庸 Doctrine of the Mean) are the canonical texts of Neo-Confucian education and served as the foundation of the Gwageo examination curriculum. These texts, part of the Four Books and Five Classics (四書五經) corpus, contain Confucian philosophical concepts and ethical principles that shaped intellectual discourse throughout the Joseon dynasty.
We use trilingual editions of the Four Books published by the Korean Institute for Classical Humanities [37,38,39], based on the Jipju (集註) tradition by Zhu Xi (朱熹, 1130–1200), which served as the official text for Joseon civil service examinations. We extract data from Work (KLS original texts) and Collection (Korean and English translations with commentaries) classes in the knowledge graph. After preprocessing including duplicate removal, segmentation standardization, and quality verification, we obtained 2119 instances from an original corpus of 2624 passage–translation–commentary triplets.
Table 1 summarizes the composition and utilization of both data sources across KLSBench tasks.

3.2. Task Design

KLSBench comprises five tasks to evaluate different dimensions of classical literature comprehension. Each task addresses capabilities required for understanding historical texts, ranging from surface-level text processing to semantic reasoning. Examples of each task are provided in Appendix A (Tables A3–A6).

3.2.1. Classification Task

Objective: the classification task evaluates LLMs’ ability to identify literary genres (文體; see Appendix C for a definition) in KLS texts. During the Joseon dynasty, examinations required mastery of literary genres, each governed by distinct conventions regarding structure, argumentation, and stylistic features. Accurate classification requires understanding genre-specific characteristics rather than recognizing superficial textual patterns.
Data Construction: we construct a dataset of 808 instances spanning 21 literary genres extracted from Gwageo examination records. The major genres include bu (賦, rhyme-prose), si (詩, regulated poetry), eui (疑, essays on doubtful points), ui (義, argumentative essays), chaek (策, policy proposals), pyo (表, memorials), ron (論, discourses), myeong (銘, inscriptions), and jeon (箋, annotations). We employ stratified sampling with 95 instances per major genre, supplemented by samples from minor genres. Each instance consists of KLS text extracted from either the abstract or full content of examination questions and answers.
Evaluation: we report accuracy as the primary metric, supplemented by macro-averaged precision, recall, and F1 scores to account for class imbalances across the 21 genres.
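As a concrete illustration, the sketch below shows how these metrics could be computed with scikit-learn (the library named in Section 4.2); the toy labels are illustrative, not benchmark data.

```python
# Minimal sketch of the classification metrics described above, assuming
# `gold` and `pred` are lists of genre labels (one per benchmark instance).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

gold = ["賦", "詩", "策", "疑"]   # reference genre labels (toy example)
pred = ["賦", "詩", "詩", "疑"]   # model predictions (toy example)

accuracy = accuracy_score(gold, pred)
# Macro averaging weights each of the 21 genres equally, which compensates
# for the class imbalance between major and minor genres.
precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.3f} macro-P={precision:.3f} "
      f"macro-R={recall:.3f} macro-F1={f1:.3f}")
```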

3.2.2. Retrieval Task

Objective: the retrieval task assesses LLMs’ ability to identify the source text and chapter from which a KLS passage originates. This task simulates source attribution and tests models’ familiarity with canonical texts that formed the foundation of Joseon dynasty education.
Data Construction: we construct 1209 instances from the Four Books, as these texts are well-defined canonical works with established chapter structures. The distribution reflects the relative sizes of the source texts: Lunyu (論語, Analects, 500 instances), Mengzi (孟子, Mencius, 500 instances), Zhongyong (中庸, Doctrine of the Mean, 137 instances), and Daxue (大學, Great Learning, 72 instances). Each instance requires the model to output both the book title and the chapter from which the passage is drawn.
Evaluation: we measure accuracy based on exact match of both book title and chapter name. Partial credit is not awarded, as precise source attribution requires identifying both components correctly.
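A minimal sketch of this all-or-nothing scoring is shown below; the dictionary field names (`book`, `chapter`) are hypothetical and used only for illustration.

```python
# Illustrative exact-match scoring for the retrieval task: a prediction counts
# as correct only if both the book title and the chapter match the reference.
def retrieval_accuracy(predictions, references):
    correct = 0
    for pred, ref in zip(predictions, references):
        if pred["book"] == ref["book"] and pred["chapter"] == ref["chapter"]:
            correct += 1
    return correct / len(references)

refs  = [{"book": "論語", "chapter": "學而"}, {"book": "孟子", "chapter": "梁惠王上"}]
preds = [{"book": "論語", "chapter": "學而"}, {"book": "論語", "chapter": "學而"}]
print(retrieval_accuracy(preds, refs))  # 0.5 — no partial credit is awarded
```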

3.2.3. Punctuation Task

Objective: Literary Sinitic texts were traditionally written without punctuation marks (白文), requiring readers to segment sentences and clauses based on syntactic and semantic understanding. The punctuation restoration task evaluates LLMs’ ability to insert punctuation marks into unpunctuated KLS texts, testing their comprehension of classical grammar, sentence boundaries, and logical structure.
Data Construction: we create 2000 instances by removing punctuation from KLS texts. The dataset maintains balanced representation from both data sources: 1000 instances (50%) from Four Books and 1000 instances (50%) from Gwageo examination records. Preprocessing includes filtering for minimum text length (10 characters or more). We preserve only content-bearing punctuation marks (periods, commas, question marks) used in modern scholarly editions while removing formatting punctuation. This task requires models to understand KLS syntax and discourse structure to identify clause and sentence boundaries.
Evaluation: we adopt character-level F1 score as the primary metric. We treat punctuation restoration as a sequence labeling task. Additionally, we report ROUGE-1, ROUGE-2, and ROUGE-L scores to capture n-gram overlap and longest common subsequence similarity between predicted and reference punctuated texts.
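The following sketch illustrates one way such a character-level F1 could be computed, under the assumption that the predicted and reference texts share identical base characters; it is an illustrative reading of the metric, not the released evaluation code.

```python
# Character-level F1 sketch for punctuation restoration. Each base character
# is labeled with the punctuation mark that follows it (or "" if none), and
# precision/recall are computed over the inserted marks only.
PUNCT = set("。，？.,?")

def punctuation_labels(text):
    """Map each base-character index to the punctuation mark that follows it."""
    labels, base_idx = {}, -1
    for ch in text:
        if ch in PUNCT:
            labels[base_idx] = ch            # attach mark to the preceding character
        else:
            base_idx += 1
            labels.setdefault(base_idx, "")  # no mark after this character (so far)
    return labels

def punctuation_char_f1(pred_text, ref_text):
    pred, ref = punctuation_labels(pred_text), punctuation_labels(ref_text)
    tp = sum(1 for i, mark in pred.items() if mark and ref.get(i) == mark)
    fp = sum(1 for mark in pred.values() if mark) - tp
    fn = sum(1 for mark in ref.values() if mark) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# One of two inserted marks matches the reference, so precision = recall = 0.5.
print(punctuation_char_f1("學而時習之，不亦說乎。", "學而時習之，不亦說乎？"))  # 0.5
```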

3.2.4. Natural Language Inference Task

Objective: Natural language inference (NLI) evaluates LLMs’ ability to determine logical relationships between pairs of statements. Given a premise and hypothesis, models must classify the relationship as entailment (the hypothesis logically follows from the premise), contradiction (the hypothesis contradicts the premise), or neutral (no logical relationship exists).
Data Construction: we generate 1854 NLI instances using construction strategies to ensure coverage of inference types. The label distribution reflects natural occurrence patterns: entailment (1313 instances, 70.8%), neutral (400 instances, 21.6%), and contradiction (141 instances, 7.6%). We employ six generation categories:
1. Translation equivalence: pairing original KLS with Korean translations
2. Cross-lingual entailment: connecting Korean and English translation pairs
3. Cross-text relation: comparing passages from different texts or chapters
4. Negation-based: introducing logical negation to create contradictions
5. Metaphorical reasoning: testing understanding of figurative language
6. Analogical reasoning: evaluating comprehension of analogies and parallels
This approach ensures the task assesses reasoning capabilities beyond semantic similarity matching.
Evaluation: we report overall accuracy as the primary metric, complemented by per-class precision, recall, and F1 scores for each of the three labels (entailment, neutral, contradiction) to identify specific reasoning weaknesses.

3.2.5. Translation Task

Objective: the translation task evaluates LLMs’ multilingual comprehension and generation across three languages: KLS, Korean, and English. Translation requires lexical knowledge and understanding of cultural context, philosophical concepts, and stylistic conventions across linguistic boundaries.
Data Construction: we construct 2000 translation instances covering two language pair directions. Literary Sinitic to Korean translation (1320 instances, 66%) constitutes the majority, reflecting the importance of rendering classical texts into vernacular Korean. Korean-to-English translation (680 instances, 34%) tests the ability to convey Korean interpretations of classical concepts into English. The asymmetric distribution mirrors translation practice priorities in classical Korean studies, where vernacularization of KLS remains the primary activity.
Evaluation: we employ BLEU score [40] as the primary evaluation metric. This metric measures n-gram precision between model outputs and reference translations. We additionally report ROUGE-1, ROUGE-2, and ROUGE-L scores [41] to capture recall-oriented similarity and lexical overlap. Translation quality assessment requires human expert judgment, which we use for qualitative error analysis.
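As an illustration, the sketch below computes segment-level BLEU and ROUGE with the NLTK and rouge-score packages referenced in Section 4.2 (Korean-to-English direction shown); the whitespace tokenization and smoothing choices are assumptions and may differ from the exact pipeline used.

```python
# Illustrative segment-level translation scoring with NLTK and rouge-score.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Is it not a pleasure to learn and to practice it at due times?"
candidate = "Is it not a joy to learn and practice at the proper time?"

# BLEU over whitespace tokens; smoothing avoids zero scores on short segments.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1/2/L F1 between reference and candidate.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU = {bleu:.3f}")
for name, result in rouge.items():
    print(f"{name}: F1 = {result.fmeasure:.3f}")
```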
Table 2 presents examples illustrating translation challenges in KLSBench. The first example demonstrates KLS-to-Korean translation, where condensed classical syntax (子曰學而時習之不亦說乎) requires expansion in the target language to convey implicit subjects and philosophical nuances. The second example shows Korean-to-English translation, requiring preservation of formal register while capturing culturally specific concepts of compassion (차마 해치지 못하는 마음). The third example highlights the challenge of translating core Neo-Confucian terminology (克己復禮, 仁) that resists direct lexical equivalence and requires cultural contextualization rather than word-by-word correspondence.

3.3. Data Statistics and Quality Control

3.3.1. Overall Statistics

Table 3 presents statistics for KLSBench, showing the distribution of instances across all five tasks and their evaluation metrics.

3.3.2. Label Distribution

For tasks requiring categorical prediction, we provide detailed label distribution statistics. Table 4 shows the distribution for Classification and NLI tasks, which exhibit different balance characteristics.
The Classification task contains 21 literary genres with imbalanced distribution. Six major genres (詩, 表, 策, 義, 疑, 賦) contain exactly 95 instances each through stratified sampling, while the remaining 15 minor genres range from 2 to 53 instances, reflecting the natural frequency distribution of classical literary genres in examination records. The NLI task displays natural label imbalance. Entailment constitutes the majority (70.8%), which reflects the prevalence of translation equivalence pairs in our construction methodology.

3.3.3. Text Length Statistics

Text length variation impacts both task difficulty and computational requirements. Table 5 presents length statistics across all tasks, measured in characters for Korean text and tokens for KLS.
The Classification task exhibits the highest mean length (324 characters) and maximum length (3847 characters), reflecting the inclusion of complete examination essay responses. Retrieval and NLI tasks feature shorter texts (mean 142–156 characters), corresponding to canonical philosophical passages from the Four Books. This length diversity ensures the benchmark tests comprehension capabilities across varying context windows.
Figure 3 illustrates text length distributions across major literary genres in the classification task. This demonstrates that different literary genres exhibit characteristic length patterns. For instance, bu (賦, rhyme-prose) and chaek (策, policy proposals) tend toward longer compositions, while si (詩, regulated poetry) maintains relatively uniform shorter lengths due to strict formal constraints.

3.3.4. Quality Control Measures

We implement rigorous quality control procedures to ensure benchmark reliability and validity:
Duplicate Detection and Removal. We identify and eliminate duplicate instances using exact text matching and near-duplicate detection based on character-level edit distance. From initial corpora of 3348 Gwageo records and 2624 Four Books passages, we removed 499 and 505 duplicates, respectively. We retained 2849 and 2119 unique instances.
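The sketch below illustrates this kind of exact plus near-duplicate filtering; the similarity function (difflib's SequenceMatcher) and the 0.9 threshold are illustrative stand-ins rather than the exact procedure used.

```python
# Illustrative near-duplicate filter: exact matches are dropped first, then
# pairs whose character-level similarity exceeds a threshold are removed.
from difflib import SequenceMatcher

def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    # SequenceMatcher ratio approximates character-level similarity.
    return SequenceMatcher(None, a, b).ratio() >= threshold

def deduplicate(texts):
    kept = []
    for text in texts:
        if any(text == k or is_near_duplicate(text, k) for k in kept):
            continue
        kept.append(text)
    return kept

corpus = ["子曰學而時習之", "子曰學而時習之。", "有朋自遠方來"]
print(deduplicate(corpus))  # the second entry is filtered as a near duplicate
```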
Length Filtering. We apply minimum length thresholds to exclude trivially short instances that provide insufficient context for meaningful evaluation. Specifically, we filter instances shorter than 10 characters for all tasks. This ensures that remaining examples contain substantive content requiring actual comprehension rather than simple pattern matching.
Format Standardization. To ensure stable processing of KLS text, we apply Unicode Normalization Form C (NFC) to all instances. This step resolves inconsistencies arising from heterogeneous character sources, particularly when a visually identical graph such as 女 (female) can be encoded either as U+5973 or, in Korean-compatibility mode, as U+F981. In historical corpora annotated under Korean orthographic variants (due to the 頭音法則, the initial-sound rule), divergent initial-sound representations ('녀' vs. '여') often co-occur with differing codepoints rather than reflecting a purely phonological difference. Without normalization, a tokenizer may treat U+5973 and U+F981 (or variant sequences) as distinct symbols, causing artificial sparsity, segmentation mismatches, or embedding misalignment. NFC consolidation maps such variants to a canonical form (typically U+5973), unifying both initial-sound-rule-related and encoding-derived variants and ensuring consistent downstream processing of characters that are linguistically identical.
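A minimal Python illustration of this consolidation, showing the Korean-compatibility ideograph being folded into its canonical codepoint:

```python
# NFC consolidation example: the compatibility form of 女 (U+F981) is mapped
# back to the canonical codepoint (U+5973) so both variants tokenize identically.
import unicodedata

compat = "\uF981子好學"           # text containing the compatibility ideograph
canonical = unicodedata.normalize("NFC", compat)

print(hex(ord(compat[0])))        # 0xf981
print(hex(ord(canonical[0])))     # 0x5973 — unified with the canonical form
print(canonical == "\u5973子好學")  # True
```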
Label Validation. For Classification and Retrieval tasks derived from structured databases, we validate label accuracy by cross-referencing with original source metadata. For NLI instances generated through automated strategies, we conduct manual verification of a random 10% sample (185 instances). We achieve 94.6% agreement with intended label assignments. Disagreement cases primarily involved subtle neutral-entailment boundary judgments in cross-text relations.
Translation Quality Verification. Reference translations from the Korean Institute for Classical Humanities [42] are officially published scholarly materials, providing a reliable baseline for the Translation task.
These quality control measures ensure that KLSBench provides reliable, high-quality evaluation data suitable for systematic benchmarking of LLM capabilities on Korean classical literature understanding.

4. Experimental Setup

4.1. Evaluated Models

We evaluate seven large language models, including proprietary API-based systems and open-source models. These models are grouped by access mode (proprietary vs. open-source) and linguistic pretraining backgrounds. This selection covers a range of model families and training profiles under a unified evaluation setting. Across both proprietary and open-source models, we standardize the query format and prompting procedure following the empirical inference protocol of Yamaguchi et al. [43] to ensure consistent and reproducible evaluation inputs. Table 6 summarizes the specifications of all evaluated models.
Proprietary Models. We evaluate four proprietary models accessed through their official APIs. GPT-4 Turbo and GPT-3.5 Turbo (OpenAI) [1] represent the GPT-4 and GPT-3.5 families and are evaluated via the OpenAI API. Claude 3.5 Sonnet and Claude 3 Opus (Anthropic) represent the Claude 3 family and are accessed through the Anthropic API. All proprietary models are evaluated using default inference configurations provided by their respective APIs. Due to their closed-source nature, architectural details and parameter counts are not publicly disclosed.
Open-Source Models. We evaluate three open-source instruction-tuned models [44,45] in the 7–8B parameter range. This narrow parameter range is chosen to control for model size and to support a more balanced comparison among open-source models under local inference settings. Llama 3.1 8B Instruct (Meta) [2] is a general-purpose multilingual model. Qwen 2.5 7B Instruct (Alibaba Cloud) [3] is trained with a strong emphasis on Chinese-language data. EXAONE 3.0 7.8B Instruct (LG AI Research) [46] is developed using Korean-language resources. All open-source models are evaluated using their official Hugging Face implementations with identical instruction-following templates and decoding settings.

4.2. Evaluation Metrics

We employ task-specific evaluation metrics aligned with established practices in NLP benchmarks. All metrics are computed using standard implementations from scikit-learn (classification metrics) and NLTK/rouge-score libraries (generation metrics).

4.2.1. Classification and Retrieval Tasks

For Classification and Retrieval tasks, we report the following:
  • Accuracy: Proportion of correctly predicted labels, serving as the primary metric.
  • Macro-averaged F1: Harmonic mean of precision and recall averaged across all classes, addressing potential class imbalance.
  • Per-class Precision and Recall: Class-specific metrics to identify strengths and weaknesses across different literary genres (classification) or source texts (retrieval).

4.2.2. Punctuation Restoration Task

For the Punctuation task, we treat punctuation insertion as a sequence labeling problem and evaluate the following:
  • Character-level F1 Score: Primary metric measuring precision and recall of punctuation mark placement at the character level.
  • ROUGE-L F1: Longest common subsequence-based similarity between predicted and reference punctuated texts, capturing overall structural preservation.
  • ROUGE-1 and ROUGE-2 F1: Unigram and bigram overlap metrics, respectively, providing complementary assessment of local accuracy.

4.2.3. Natural Language Inference Task

For NLI, we report the following:
  • Overall Accuracy: Primary metric measuring correct classification across all three labels.
  • Per-class F1 Scores: Separate F1 scores for entailment, neutral, and contradiction to diagnose model behavior across different logical relationships.
  • Confusion Matrix Analysis: Qualitative assessment of systematic misclassification patterns, particularly distinguishing between neutral-entailment and neutral-contradiction confusions.

4.2.4. Translation Task

For Translation, we employ multiple automatic metrics to capture different aspects of translation quality:
  • BLEU Score: Primary metric measuring n-gram precision with brevity penalty, standard for machine translation evaluation.
  • ROUGE-1, ROUGE-2, ROUGE-L F1: Recall-oriented metrics providing complementary assessment of lexical overlap and structural similarity.
We acknowledge that automatic metrics provide approximations of translation quality and conduct qualitative error analysis on sample translations to complement quantitative results.

4.3. Implementation Details

4.3.1. Hardware Configuration

All experiments are conducted on a single NVIDIA H100 GPU (80 GB VRAM) with 12-core CPU and 240 GB system memory (238 GB application memory, 2 GB shared memory). This configuration supports efficient inference for both API-based models (via network requests) and local deployment of open-source models. API models are accessed through official endpoints with standard rate limiting and retry mechanisms. Open-source models are loaded using PyTorch with bfloat16 precision for memory efficiency while maintaining numerical stability.
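For illustration, an open-source model could be loaded for local bfloat16 inference as sketched below; the checkpoint identifier is assumed to be the official Hugging Face name and may need adjusting.

```python
# Sketch of local inference setup in bfloat16 with the Transformers library.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # memory-efficient yet numerically stable
    device_map="auto",            # place weights on the available GPU
)
```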

4.3.2. Prompt Design and Inference Strategy

We adopt a zero-shot evaluation paradigm to assess models’ inherent capabilities without task-specific examples. All prompts follow a consistent structure:
1. System Prompt: Task-specific instructions clearly defining the expected input–output format and behavioral constraints.
2. User Prompt: The actual test instance formatted according to task requirements.
For classification, we provide the KLS text and request identification of the literary style from a predefined list. For retrieval, we present a passage and ask for the source book and chapter. For punctuation, we provide unpunctuated KLS text (白文) and request insertion of appropriate punctuation marks to segment sentences and clauses. For NLI, we present premise–hypothesis pairs and request classification into entailment, neutral, or contradiction. For translation, we specify the source language, target language, and text to be translated.
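As an illustration of this structure, the sketch below builds a zero-shot NLI prompt in the chat-message format consumed by the evaluated models; the wording is a hypothetical rendering, not the released prompt text.

```python
# Hypothetical zero-shot NLI prompt following the system/user structure above.
def build_nli_messages(premise: str, hypothesis: str):
    system = (
        "You are given a premise and a hypothesis drawn from Korean Literary "
        "Sinitic texts. Classify their relationship as exactly one of: "
        "entailment, neutral, contradiction. Answer with the label only."
    )
    user = f"Premise: {premise}\nHypothesis: {hypothesis}\nLabel:"
    return [
        {"role": "system", "content": system},  # task-specific instructions
        {"role": "user", "content": user},      # the formatted test instance
    ]

messages = build_nli_messages("子曰學而時習之不亦說乎",
                              "배움을 때때로 익히는 것은 기쁜 일이다")
```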
In designing the prompt descriptions, we use the descriptive conventions widely used in the most established LLM benchmarks, GLUE [47]. Accordingly, each task includes a clear statement of its objective, a precise explanation of the input structure, explicit definitions of all output labels, and a concise summary of the evaluation protocol. Although such benchmarks do not impose a rigid template, they consistently provide these components, and we followed the same principles to ensure clarity, comparability, and reproducibility.

4.3.3. Inference Parameters

We standardize inference hyperparameters across all models to ensure fair comparison:
  • Temperature: 0.0 (deterministic decoding for reproducibility).
  • Top-p (nucleus sampling): 1.0 (disabled due to temperature = 0.0).
  • Max output tokens: Task-dependent (classification/retrieval/NLI: 100; punctuation: 2000; translation: 1500).
  • Random seed: 42 (for open-source models).
For API models, we use the latest stable API versions as of June–September 2025, including OpenAI API v1 and Anthropic API. For open-source models, we use Transformers library version 4.40.0 with PyTorch 2.2.0.
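For example, a single deterministic call under these settings might look like the following sketch (OpenAI Python client shown; the model alias and message construction are assumptions for illustration).

```python
# Deterministic inference call with the standardized parameters above.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=messages,    # e.g., built as in the prompt sketch above
    temperature=0.0,      # deterministic decoding for reproducibility
    top_p=1.0,            # nucleus sampling effectively disabled
    max_tokens=100,       # task-dependent cap (classification/retrieval/NLI)
)
print(response.choices[0].message.content)
```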

4.3.4. Reproducibility

To ensure reproducibility, we provide the following:
  • Complete evaluation code and prompts available at the project repository;
  • Fixed random seeds for all stochastic processes;
  • API version specifications and model checkpoint identifiers;
  • Detailed hyperparameter configurations;
  • Raw model outputs for all evaluated instances.
All evaluation runs were conducted from June to September 2025, and API model behaviors are documented with their respective version timestamps to account for potential model updates. KLSBench will undergo regular updates in the future, including continuous evaluation of new LLM models, data quality improvements through expert validation, and expansion with additional tasks.

5. Results and Analysis

5.1. Overall Performance

Table 7 presents comprehensive performance across all seven evaluated models and five tasks. The results reveal dramatic performance variations both across tasks and across models. This demonstrates that classical literature understanding presents fundamentally different challenges from modern text processing.

5.2. Task-Specific Analysis

5.2.1. Classification Results

Literary style classification remains challenging, with mean accuracy of 31.8%. Performance shows substantial variation across models: Qwen 2.5 7B achieves the highest accuracy at 51.5%, dramatically outperforming Claude 3 Opus (44.0%), GPT-3.5 Turbo (36.6%), GPT-4 Turbo (32.7%), EXAONE 3.0 (31.2%), Llama 3.1 8B (15.6%), and Claude Sonnet 4.5 (11.3%). The performance range (11.3% to 51.5%) reveals substantial differences in classical genre recognition capabilities. This hierarchy demonstrates that Chinese-specialized pretraining provides decisive advantages: Qwen 2.5 7B substantially exceeds both general-purpose (ChatGPT, Claude) and Korean-specialized (EXAONE) models, revealing that exposure to classical Chinese texts during pretraining is critical for rhetorical genre recognition. The substantial performance gap between Qwen 2.5 7B’s 51.5% and the next-best Claude 3 Opus at 44.0% underscores the importance of classical corpus exposure.
The confusion matrix analysis (Figure 4) reveals systematic error patterns that illuminate the linguistic challenges of classical genre recognition. Models exhibit strong bias toward predicting 詩 (poetry), the most structurally distinctive genre, resulting in high false-positive rates, with certain genres frequently misclassified as poetry. This pattern suggests models rely heavily on surface-level stylistic features (metrical patterns, parallelism) that characterize poetry but fail to recognize the deeper rhetorical structures distinguishing other genres. Conversely, 疑 (essays on doubtful points) demonstrates relatively strong diagonal strength (56.4% recall), indicating that its distinctive question–answer dialectical structure provides more robust classification signals. The particularly weak diagonals for 賦 (13.8%) and 義 (13.3%) confirm that these genres' subtle argumentative conventions, which require understanding of classical rhetorical theory rather than surface textual patterns, remain largely inaccessible to current LLMs.

5.2.2. Retrieval Results

Retrieval demonstrates moderate performance (mean 64.9%), with considerable variation across models. Claude Sonnet 4.5 achieves the highest accuracy (91.7%), followed by Claude 3 Opus (80.8%), GPT-4 Turbo (75.8%), EXAONE 3.0 (63.3%), GPT-3.5 Turbo (60.0%), and Qwen 2.5 (59.2%). Notably, Llama 3.1 8B shows the lowest performance (23.3%), suggesting limited exposure to East Asian classical texts. The task relies on text matching and memorization rather than deep comprehension. The Four Books represent canonical texts extensively cited across East Asian intellectual history, likely appearing in various forms within LLM training data. Models can succeed through pattern matching without necessarily understanding philosophical content.

5.2.3. Punctuation Results

Punctuation restoration exhibits the most dramatic performance variation across models, with F1 scores [48] ranging from 0.356 to 0.940. Most remarkably, GPT-3.5 Turbo achieves an F1 score of 0.940, substantially outperforming all other models including the ostensibly more capable GPT-4 Turbo (0.768). Llama 3.1 (0.839) also demonstrates strong performance, while Qwen 2.5 (0.511), Claude 3 Opus (0.473), Claude Sonnet 4.5 (0.406), and EXAONE 3.0 (0.356) lag significantly behind.

5.2.4. Natural Language Inference Results

NLI performance reveals surprisingly strong reasoning capabilities, with mean accuracy of 59.3%. GPT-4 Turbo achieves the highest performance at 82.7%, followed by Qwen 2.5 (75.7%), GPT-3.5 Turbo (74.1%), Llama 3.1 (74.1%), EXAONE 3.0 (71.9%), and Claude 3 Opus (36.2%). Critically, Claude Sonnet 4.5 achieves near-zero accuracy (0.5%), suggesting severe limitations in its instruction-following or reasoning capabilities for this specific task format.
The NLI task requires determining logical relationships between premise–hypothesis pairs drawn from classical philosophical texts. This demands comprehension of logical structure and philosophical concepts. The strong performance of GPT-4 Turbo and several other models suggests that modern LLMs possess substantial formal reasoning capabilities over KLS semantic content. However, Claude Sonnet 4.5’s catastrophic failure (0.5%) despite strong retrieval performance (91.7%) reveals that NLI success depends critically on task-specific instruction tuning rather than general language understanding.

5.2.5. Translation Results

Translation demonstrates uniformly low but variable performance, with BLEU scores ranging from 0.033 to 0.329. Claude Sonnet 4.5 leads at 0.329, followed by GPT-4 Turbo (0.307), GPT-3.5 Turbo (0.253), Claude 3 Opus (0.250), Llama 3.1 (0.189), EXAONE 3.0 (0.144), and Qwen 2.5 (0.033).
These low BLEU scores reflect fundamental challenges in classical-to-modern language translation. Literary Sinitic employs highly condensed syntax and culturally embedded idioms requiring substantial interpretive work. Reference translations often include explanatory glosses absent from model outputs, yielding low n-gram overlap despite potential semantic adequacy. The relatively consistent performance suggests improvements require advances in cross-lingual understanding of classical concepts rather than model-specific optimizations.

5.3. Model Comparison and Key Insights

Models demonstrate task-specific strengths but struggle to achieve balanced performance across all dimensions, indicating that classical literature understanding comprises multiple orthogonal skill sets. Individual models excel at specific tasks: Claude Sonnet 4.5 achieves exceptional retrieval (91.7%), GPT-3.5 Turbo dominates punctuation (94.0%), GPT-4 Turbo leads NLI (82.7%), and Qwen 2.5 7B achieves the highest classification accuracy (51.5%). However, no single model maintains strong performance across all five tasks. Critically, classification remains challenging across all models (mean 31.8%), though Qwen 2.5 7B (51.5%) substantially outperforms all competitors, demonstrating the decisive advantage of Chinese-specialized pretraining for classical genre recognition. The substantial performance gap between Qwen (51.5%) and Claude 3 Opus (44.0%) reveals that classical corpus exposure during pretraining provides critical benefits, while EXAONE’s moderate performance (31.2%) demonstrates meaningful but limited cross-lingual transfer from Korean. Open-source models show competitive NLI performance (Qwen 75.7%, Llama 74.1%, EXAONE 71.9%), challenging assumptions that reasoning tasks require proprietary models, while classification results reveal that pretraining corpus composition substantially impacts understanding of Neo-Confucian rhetorical taxonomy.

5.4. Temperature Ablation Study

To assess robustness of model performance to generation parameters, we conduct a temperature ablation study at settings 0.0 (deterministic), 0.3 (moderate), and 0.7 (high) for GPT-3.5 Turbo, GPT-4 Turbo, Claude 3 Opus, and Claude Sonnet 4.5. All other hyperparameters remain fixed as specified in Section 4.
Table 8 shows that most models exhibit stable performance across temperature settings. Commercial models (GPT-3.5, GPT-4, Claude Sonnet 4.5) demonstrate robust performance with mean absolute deviation below 0.03 for most task–model combinations, validating our use of temperature = 0.0 for main evaluations. GPT-3.5’s punctuation (0.940) and GPT-4’s retrieval (0.758) remain nearly constant across all temperatures.
Claude 3 Opus shows higher temperature sensitivity, with performance degrading at T = 0.7 due to increased stochasticity disrupting token-level precision. Retrieval demonstrates the highest stability across all models, confirming this task’s reliance on pattern recognition rather than generation quality. These findings confirm that observed performance differences reflect genuine capability gaps rather than parameter-dependent artifacts, and that the task difficulty hierarchy (retrieval > NLI > punctuation > translation > classification) remains consistent across temperature settings.

6. Discussion

6.1. Key Findings from KLSBench Evaluation

Our evaluation reveals three critical insights for low-resource historical language processing. First, pretraining corpus composition critically determines genre recognition capabilities: the Chinese-specialized model substantially outperforms Korean-specialized alternatives for classical genre classification, demonstrating that exposure to KLS texts during pretraining provides decisive advantages despite KLS’s Korean lexical elements. Second, logical reasoning capabilities transfer robustly across domains: multiple models achieve strong NLI performance despite limited classical Chinese exposure, suggesting that formal inference abilities generalize beyond training distribution. However, catastrophic failures in specific models reveal brittleness in instruction-following. Third, task difficulty stratification reflects distinct cognitive demands: lexical matching (retrieval) proves substantially easier than genre recognition (classification), which requires understanding of Neo-Confucian rhetorical conventions and historical context.

6.2. Implications for Historical Language Processing

These findings carry practical implications for historical text processing. Current LLMs demonstrate utility for specific tasks like retrieval and punctuation assistance, with moderate capabilities for genre classification and logical inference. However, deploying LLMs for cultural–historical understanding requires careful model selection and expert verification. Future development should prioritize instruction-tuning integrating rhetorical theory and philosophical concepts beyond classical corpus exposure, exploring approaches including continued pretraining [5,49], retrieval-augmented generation [50,51], and multi-task learning [44,45].

6.3. Generality and Extensibility

While KLSBench is constructed for Korean Literary Sinitic, the benchmark design itself is not limited to this language. Korean Literary Sinitic constitutes a particularly informative case because it exhibits several properties commonly observed in historical and low-resource languages, including limited annotated data, condensed classical syntax, and sustained connections to high-resource modern languages through shared writing systems and lexical overlap [17,31,33]. These characteristics make KLS suitable for examining cross-lingual generalization, while remaining representative rather than exceptional among historical language settings.
More generally, the design of KLSBench follows principles that are not tied to any single language. First, the benchmark draws on historically authoritative and educational texts to ensure linguistic and semantic validity, a practice widely adopted in prior work on classical and historical language benchmarks [17,18]. Second, it includes a range of task types that assess both surface-level form understanding (e.g., punctuation and classification) and deeper semantic reasoning (e.g., inference and translation), in line with established multi-task benchmark design practices [25,47]. Third, the benchmark minimizes reliance on modernized preprocessing, allowing models to be evaluated on texts that remain close to their original historical form, consistent with prior studies on classical text processing [16,31]. Together, these design choices support application of the benchmark framework to other historical or low-resource languages with comparable textual traditions.
Accordingly, the KLSBench framework can be adapted to other settings, including classical variants of Chinese used in different regions, historical Japanese texts and other classical languages supported by long-standing scholarly corpora [17,33]. Although language-specific expertise is required for data curation and annotation, the overall benchmark structure, task formulation, and evaluation protocol remain transferable. From this perspective, KLSBench is intended not only as a benchmark for Korean Literary Sinitic, but also as a reference framework for the development of evaluation benchmarks for historical and low-resource languages more broadly.

7. Conclusions and Future Work

This study presents KLSBench, the first benchmark for KLS text understanding, comprising 7871 instances across five tasks. Our evaluation of seven LLMs reveals dramatic performance variations. KLSBench establishes evaluation protocols for deploying LLM-powered applications in digital humanities contexts. Beyond benchmark metrics, this work provides practical guidance for developing automated annotation systems, intelligent search interfaces, and educational platforms that can assist researchers in accessing and analyzing digitized historical text collections. KLSBench thus contributes both domain-specific evaluation infrastructure and actionable frameworks for building robust NLP applications on classical language resources.

7.1. Limitations

Benchmark Scope and Coverage. KLSBench comprises 7871 instances, smaller than C³Bench (50,000 instances) for Chinese classical texts. This reflects the availability of Korean historical text data, which is less extensively digitized and annotated than Chinese literature. Our data sources are limited to two categories: Joseon dynasty civil service examination records and the Four Books. This focus excludes substantial portions of KLS, including historical chronicles, private correspondence, Buddhist and Taoist texts, medical treatises, and administrative documents; our benchmark primarily represents scholarly and canonical texts. Future expansion should incorporate broader coverage across genres and historical periods.
Language Pairs and Multilingual Evaluation. Our Translation task covers two language directions: KLS–Korean and Korean–English. This reflects historical practices in Korean classical studies but excludes other language pairs: direct KLS–English translation, KLS–Chinese comparison (to assess differences from Chinese Literary Sinitic), and KLS–Japanese cross-lingual understanding (to compare with the kanbun tradition). Our translation evaluation relies on automatic metrics (BLEU, ROUGE), which capture only limited aspects of translation quality, namely lexical overlap and structural similarity, rather than cultural appropriateness, stylistic fidelity, or preservation of philosophical nuance. Human evaluation would provide a more complete assessment of translation quality.
Evaluation Protocols and Settings. Our evaluation is limited to zero-shot settings, providing only task instructions to LLMs without in-context examples. This measures out-of-the-box model capabilities but does not assess whether few-shot learning could improve performance. Prior research shows that few-shot prompting can yield gains on reasoning tasks, suggesting our NLI and classification results may be pessimistic. We use a single prompt template for all models, which may not be optimal for specific models or tasks. Prompt optimization would likely improve absolute performance, though it is unlikely to change model rankings. Our automatic metrics (accuracy, F1, BLEU, ROUGE) provide scalable evaluation but do not fully capture human judgment. For generative tasks, especially translation and punctuation, human evaluation would provide a more thorough assessment of output quality.
Model Selection and Scope. We evaluate seven LLMs, but this selection is constrained by availability and computational resources. We do not evaluate larger open-source models (e.g., Llama 3.1 70B, Qwen 2.5 72B), limiting assessment of how model size scales to historical text understanding. We do not evaluate domain-specific models or models fine-tuned on KLS, as such models are not publicly available. Our evaluation is conducted from June–September 2025, and API model behaviors may change due to ongoing updates. Future work should evaluate a broader range of models, scale to larger model sizes, and explore fine-tuning and domain adaptation strategies.
Data Quality and Annotation Process. Our benchmark data derives from scholarly resources (Korean Institute for Classical Humanities, Academy of Korean Studies), which ensures high quality but limits our control over annotation processes. Classification labels derive from original metadata, which may reflect historical categorization rather than modern literary theory. NLI instances are generated through automated strategies and receive sample validation (94.6% agreement), but 5.4% disagreement indicates label noise. Translations are provided by the Korean Institute for Classical Humanities and officially published, but translation choices may vary across translators and periods, introducing subjectivity into translation evaluation.
Future work should improve annotation quality through expert validation, conduct inter-annotator reliability studies, and explore the incorporation of structured linguistic representations as complementary annotations for evaluating and supporting model behavior.

7.2. Future Research Directions

Our findings point to a methodological extension that directly addresses the structural challenges revealed by KLSBench. Korean Literary Sinitic is a highly formalized historical language characterized by stable syntactic and rhetorical regularities, including clause-based organization, parallel constructions, and genre-specific compositional schemes. Because such structure is not explicitly represented in the current benchmark, models are required to implicitly infer latent organization in tasks such as punctuation restoration, genre classification, and translation, which contributes to output instability and inter-genre confusion.
A natural extension of this work is to incorporate explicit structural representations as a complementary layer to probabilistic LLM outputs. This includes formalizing linguistic structure through clause segmentation, rhetorical templates, or relation-based representations derived through information extraction techniques (e.g., named entity and relation extraction). Such information can be organized into symbolic resources, including graph-based representations such as RDF-style subject–predicate–object triples, enabling explicit modeling of entities, relations, and hierarchical structure in KLS texts.
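As a small illustration of the kind of symbolic resource envisioned here, the sketch below encodes one relation from the Analects as RDF-style triples with rdflib; the namespace and predicate names are invented for the example.

```python
# Hypothetical RDF-style subject-predicate-object triples for a KLS relation.
from rdflib import Graph, Literal, Namespace

KLS = Namespace("http://example.org/kls/")  # invented namespace for illustration
g = Graph()

# "顏淵問仁" — Yan Yuan asks about ren (仁); ren is glossed by 克己復禮.
g.add((KLS["YanYuan"], KLS["asksAbout"], KLS["Ren"]))
g.add((KLS["Ren"], KLS["definedBy"], Literal("克己復禮")))

for s, p, o in g:
    print(s, p, o)
```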
Under these conditions, hybrid inference paradigms that combine neural generation with post hoc structural validation offer a promising direction. In this paradigm, an LLM first generates multiple candidate outputs, which are subsequently filtered using KLS-specific syntactic or rhetorical constraints to ensure structural coherence and consistency. This approach has the potential to directly address the limitations observed in our zero-shot evaluation and provides a principled way to integrate generative flexibility with formal constraints.
Beyond the present evaluation, KLSBench opens several avenues for future research. Expanding benchmark coverage to additional historical genres would enable broader assessment across diverse textual traditions. Comparative studies across Literary Sinitic traditions may further clarify shared and localized properties of classical text processing. At the same time, developing human–AI collaborative frameworks remains essential for responsible deployment in digital humanities contexts. By providing systematic evaluation protocols and baselines, KLSBench establishes a foundation for such interdisciplinary exploration.

Author Contributions

Conceptualization, S.-H.H., W.-S.Y., X.M. and T.-S.C.; methodology, S.-H.H. and W.-S.Y.; software, S.-H.H.; validation, S.-H.H., W.-S.Y. and X.M.; formal analysis, S.-H.H.; investigation, S.-H.H. and W.-S.Y.; resources, W.-S.Y. and T.-S.C.; data curation, S.-H.H. and W.-S.Y.; writing—original draft preparation, S.-H.H.; writing—review and editing, S.-H.H., W.-S.Y., X.M. and T.-S.C.; visualization, S.-H.H.; supervision, X.M. and T.-S.C.; project administration, T.-S.C.; funding acquisition, T.-S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) under the ITRC (Information Technology Research Center) support program (IITP-2025-RS-2021-II212051) funded by the Korea government (MSIT) and the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2025S1A6B5A02004128, NRF-2022S1A5C2A02093644).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The KLSBench dataset, evaluation code, and model outputs are publicly available at https://songhune.github.io/korean_R-CoA/ (accessed on 10 January 2026). All materials are released under open-source licenses to facilitate reproducible research.

Acknowledgments

We appreciate the administrative and technical support from the Ajou University BK21 Graduate School Innovation Support program (project name: Evaluation of Classical Chinese Semantic Relations Based on Joseon Civil Service Examination Texts: Comparing LLM Performance from NLI and STS Perspectives) in dataset curation and infrastructure management. We also express our gratitude to the Korea University Department of Sinographic Literatures for providing valuable advisory support on classical Chinese resources.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLM: Large Language Model
KLS: Korean Literary Sinitic
NLI: Natural Language Inference
BLEU: Bilingual Evaluation Understudy
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
API: Application Programming Interface

Appendix A. Dataset Composition and Distribution Analysis

This appendix provides detailed visualizations of KLSBench’s dataset composition. These visualizations complement the statistical summaries presented in the main text. The figures reveal the dataset’s structural characteristics, temporal distribution patterns, and linguistic diversity across multiple dimensions.
Table A1 presents the distribution of 21 literary genres used in the genre classification task. The dataset covers a wide range of Korean Literary Sinitic genres, including philosophical treatises (論), historical narratives (記), expository essays (說), commemorative inscriptions (銘), and poetic compositions.
To reflect historical genre prevalence while maintaining evaluation stability, six major genres were constructed using stratified sampling with 95 instances each. The remaining 15 minor genres exhibit naturally skewed frequencies, ranging from 2 to 53 instances.
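For illustration, the following sketch shows how such a per-genre cap can be implemented; the record schema, random seed, and capping logic are assumptions for exposition and do not reproduce the exact KLSBench construction pipeline.

```python
import random
from collections import defaultdict

random.seed(0)  # illustrative seed, not the one used to build KLSBench

MAJOR_GENRES = {"賦", "詩", "疑", "義", "策", "表"}
INSTANCES_PER_MAJOR_GENRE = 95

def stratified_sample(records: list[dict]) -> list[dict]:
    """Cap each major genre at a fixed size; keep minor genres at natural frequencies.

    Each record is assumed to be a dict with at least 'genre' and 'text' keys
    (a hypothetical schema used only for this sketch).
    """
    by_genre = defaultdict(list)
    for record in records:
        by_genre[record["genre"]].append(record)

    sampled = []
    for genre, items in by_genre.items():
        if genre in MAJOR_GENRES:
            k = min(INSTANCES_PER_MAJOR_GENRE, len(items))
            sampled.extend(random.sample(items, k))
        else:
            sampled.extend(items)
    return sampled
```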
Table A2 presents the distribution of source texts across the retrieval task. The Four Books corpus—Analects (論語), Mencius (孟子), Great Learning (大學), and Doctrine of the Mean (中庸)—shows unequal representation reflecting the original texts’ varying lengths and canonical significance.
Table A1. Classification task: genre distribution (21 classes).
Major Genres (6 classes, 95 instances each)
賦 (Fu) | 95
義 (Yi) | 95
詩 (Shi) | 95
策 (Ce) | 95
疑 (Yi) | 95
表 (Biao) | 95
Subtotal | 570
Minor Genres (15 classes, 2–53 instances each)
箴 (Zhen) | 53
頌 (Song) | 49
箋 (Jian) | 24
論 (Lun) | 12
銘 (Ming) | 9
詩義 | 7
易義 | 7
書義 | 6
禮義 | 6
詔 (Zhao) | 5
制 (Zhi) | 3
論 (Lun) | 2
講 (Jiang) | 2
擬 (Ni) | 2
Subtotal | 238
Total (21 classes) | 808
Table A2. Retrieval task: distribution by source book.
Source Book | Count
論語 (Analects) | 500
孟子 (Mencius) | 500
中庸 (Doctrine of the Mean) | 137
大學 (Great Learning) | 72
Total | 1209

Appendix A.1. Task Examples

This subsection presents comprehensive examples for each of the five tasks in KLSBench, organized into separate tables (Table 2 and Tables A3–A6) for clarity. These examples are carefully selected to illustrate the diverse challenges inherent in Korean Literary Sinitic text understanding. Each task requires distinct cognitive capabilities: genre recognition demands understanding of rhetorical conventions and compositional structures; retrieval tests familiarity with canonical texts; punctuation requires syntactic parsing of unpunctuated classical Chinese; natural language inference evaluates logical reasoning over philosophical content; and translation assesses cross-lingual semantic transfer while preserving cultural nuances. The examples below demonstrate why these tasks pose significant challenges even for state-of-the-art large language models, as discussed in Section 5.
Translation. Table 2 (Section 3) presents translation task examples demonstrating three primary challenges: (1) syntactic expansion from condensed classical forms to modern grammatical structures, (2) cultural–linguistic adaptation of philosophical concepts across language boundaries, and (3) interpretation of Neo-Confucian terminology that resists direct lexical equivalence. These challenges explain the uniformly low BLEU scores in our evaluation (mean 21.5%, Section 5), as automatic metrics cannot assess philosophical meaning preservation or cultural contextualization. See Section 3 for detailed examples and analysis.
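For reference, the snippet below illustrates one way such corpus-level BLEU scores can be computed with the sacrebleu package; the character-level tokenization setting and the toy sentence pair are assumptions, and the exact scoring configuration used in our evaluation may differ.

```python
from sacrebleu.metrics import BLEU

# Character-level tokenization is a common choice for Korean and Classical
# Chinese targets; this is an assumption, not necessarily the paper's setting.
bleu = BLEU(tokenize="char")

hypotheses = ["공자께서 배우고 때때로 익히면 기쁘지 아니한가라고 말씀하셨다."]
references = [["공자께서 말씀하시기를, 배우고 때때로 익히면 또한 기쁘지 아니한가."]]

score = bleu.corpus_score(hypotheses, references)
print(f"BLEU = {score.score:.1f}")
```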
Classification. Table A3 demonstrates genre classification through typical examples of philosophical discourse, historical narrative, and expository essays from the civil service examination corpus.
The classification task requires models to identify literary genres (文體) based on subtle rhetorical and structural features rather than explicit lexical markers. Example 1 illustrates philosophical discourse (論) characterized by Confucius’s didactic teaching style with rhetorical questions (不亦...乎 pattern) that guide readers toward moral insights about learning, friendship, and self-cultivation. Example 2 shows historical narrative (記) distinguished by temporal markers (三年秋七月), chronological event sequencing, and formal diplomatic language documenting tributary relations. Example 3 presents expository essay (說) featuring analogical reasoning—the water-and-soil metaphor (水之性清而土汩之) serves as philosophical argumentation establishing that human nature requires cultivation to resist external corruption. These genre distinctions were fundamental to Joseon civil service examinations, where candidates had to master diverse compositional forms. Our evaluation reveals that models achieve 31.8% mean accuracy on this task (Section 5), with substantial variation (11.3% to 51.5%), indicating that while contemporary LLMs demonstrate moderate capabilities in recognizing certain classical literary genres, deep understanding of Neo-Confucian rhetorical taxonomy remains challenging.
Table A3. Classification task examples.
Example 1—Philosophical Discourse (論)
Input: 子曰學而時習之不亦說乎有朋自遠方來不亦樂乎人不知而不慍不亦君子乎
Expected Output: 論 (Ron, Discourse)
Rationale: This passage exhibits characteristic features of philosophical discourse: didactic tone, rhetorical questions, and moral reasoning about learning and self-cultivation.
Example 2—Historical Narrative (記)
Input: 太祖在位三年秋七月朝鮮國王遣使來朝貢方物其使曰臣國自古事大以誠今聞天子即位故遣臣來獻
Expected Output: 記 (Gi, Record)
Rationale: Contains temporal markers, chronological structure, and narrative of historical events involving diplomatic relations.
Example 3—Expository Essay (說)
Input: 夫水之性清而土汩之故不清人之性善而物誘之故不善是以君子必慎其獨也
Expected Output: 說 (Seol, Exposition)
Rationale: Analytical argumentation using analogical reasoning (water metaphor) to explain abstract philosophical principles.
Retrieval. Table A4 illustrates the retrieval task with passages from the Four Books, testing models’ ability to identify source texts and specific chapters.
This task evaluates whether models have memorized canonical Confucian texts that formed the intellectual foundation of Joseon dynasty education. Example 1 presents the famous definition of 仁 (humaneness) through 克己復禮 (self-discipline and returning to propriety) from the Analects’ Yan Yuan chapter—one of the most frequently cited passages in Neo-Confucian discourse. Example 2 shows Mencius’s foundational argument for innate moral sentiments (不忍人之心, the heart that cannot bear to see others suffer), establishing the philosophical basis for compassionate governance. Example 3 quotes the opening of the Great Learning, articulating the three fundamental principles (三綱領: 明明德, 親民, 止於至善) that structure classical education. Unlike other tasks requiring deep semantic understanding, retrieval primarily tests pattern recognition and text memorization. Our results confirm this distinction: models achieve 64.9% mean accuracy on retrieval (Section 5), with substantial variation across models, suggesting varying exposure to canonical texts in pretraining corpora. This moderate performance indicates that some models have memorized portions of the Four Books, yet the task remains challenging overall; retrieval nonetheless achieves higher accuracy than culturally embedded tasks such as classification (31.8%) and is comparable to reasoning tasks such as NLI (59.3%).
Table A4. Retrieval task examples.
Example 1—Analects Passage
Input: 子曰克己復禮為仁一日克己復禮天下歸仁焉為仁由己而由人乎哉
Expected Output: 論語顏淵篇 (Analects, Chapter Yan Yuan 12.1)
Context: Famous passage defining 仁 (humaneness) through self-discipline and ritual propriety.
Example 2—Mencius Passage
Input: 孟子曰人皆有不忍人之心先王有不忍人之心斯有不忍人之政矣
Expected Output: 孟子公孫丑上篇 (Mencius, Gongsun Chou I 2A.6)
Context: Foundational argument for innate moral sentiments and compassionate governance.
Example 3—Great Learning Passage
Input: 大學之道在明明德在親民在止於至善知止而後有定
Expected Output: 大學經一章 (Great Learning, Classic Text Chapter 1)
Context: Opening statement defining the three fundamental principles of classical education.
Punctuation. Table A5 shows punctuation restoration examples, demonstrating the challenge of segmenting unpunctuated classical Chinese text into properly delimited clauses.
Table A5. Punctuation task examples.
Example 1—Dialogue with Multiple Clauses
Input (白文): 子曰學而時習之不亦說乎有朋自遠方來不亦樂乎人不知而不慍不亦君子乎
Expected Output: 子曰, 學而時習之, 不亦說乎. 有朋自遠方來, 不亦樂乎. 人不知而不慍, 不亦君子乎.
Key Challenge: Identifying parallel rhetorical structures and distinguishing main clauses from subordinate clauses.
Example 2—Complex Conditional Statement
Input (白文): 孟子曰人皆有不忍人之心所以謂人皆有不忍人之心者今人乍見孺子將入於井皆有怵惕惻隱之心
Expected Output: 孟子曰, 人皆有不忍人之心. 所以謂人皆有不忍人之心者, 今人乍見孺子將入於井, 皆有怵惕惻隱之心.
Key Challenge: Parsing complex embedded structures with causal reasoning and hypothetical scenarios.
Classical Literary Sinitic texts were traditionally written as continuous character sequences without punctuation marks (白文), requiring readers to parse sentence and clause boundaries through deep understanding of classical grammar and discourse structure. Example 1 illustrates the complexity of punctuating parallel rhetorical structures: the three rhetorical questions (不亦說乎, 不亦樂乎, 不亦君子乎) must be properly segmented to distinguish the main attribution clause (子曰) from three coordinate philosophical statements about learning, friendship, and self-cultivation. Each statement contains subordinate clauses (學而時習之, 有朋自遠方來, 人不知而不慍) that must be delimited with commas before the concluding rhetorical questions. Example 2 presents even greater complexity with embedded causal structures: the passage contains a primary assertion (人皆有不忍人之心), followed by an explanatory clause introduced by 所以謂...者 (the reason for saying...), which itself contains a hypothetical scenario (今人乍見孺子將入於井) demonstrating the innate moral sentiment. Proper punctuation requires understanding both syntactic structure and logical argumentation flow. Our evaluation reveals dramatic model-dependent performance variation (35.6% to 94.0% F1), with GPT-3.5 Turbo achieving an exceptional 94.0% F1 while Claude models struggle below 48% F1 (Section 5). This variance suggests punctuation restoration captures specialized grammatical knowledge of classical Chinese syntax that varies significantly across models' training data compositions.
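The sketch below illustrates one plausible way to score punctuation restoration as clause-boundary F1, by comparing the boundary positions implied by predicted and gold punctuation over the same character stream; it is not necessarily the exact scoring script used for KLSBench.

```python
PUNCT = ",.。，、"

def boundaries(punctuated: str) -> set[int]:
    """Indices in the unpunctuated character stream after which a clause boundary occurs."""
    idx, bounds = 0, set()
    for ch in punctuated:
        if ch in PUNCT:
            bounds.add(idx)        # boundary after the idx-th content character
        elif not ch.isspace():
            idx += 1               # count only content characters
    return bounds

def boundary_f1(pred: str, gold: str) -> float:
    p, g = boundaries(pred), boundaries(gold)
    if not p or not g:
        return 0.0
    precision = len(p & g) / len(p)
    recall = len(p & g) / len(g)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

gold = "子曰, 學而時習之, 不亦說乎."
pred = "子曰, 學而時習之不亦說乎."   # one internal boundary missed
print(f"boundary F1 = {boundary_f1(pred, gold):.2f}")  # 0.80
```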
Natural Language Inference. Table A6 presents natural language inference examples with premise–hypothesis pairs labeled as entailment, neutral, or contradiction, assessing models’ logical reasoning capabilities.
Table A6. Natural language inference task examples.
Example 1—Entailment
Premise (KLS): 子曰學而時習之不亦說乎
Hypothesis (Korean): 공자께서 배움과 복습의 기쁨에 대해 말씀하셨다
Label: Entailment
Justification: The hypothesis accurately captures the semantic content of Confucius discussing the joy of learning and practice.
Example 2—Neutral
Premise (KLS): 子曰克己復禮為仁
Hypothesis (Korean): 공자께서 예의범절을 지켜야 한다고 말씀하셨다
Label: Neutral
Justification: While related, the hypothesis focuses narrowly on ritual propriety without capturing the core concept of self-discipline or humaneness.
Example 3—Contradiction
Premise (KLS): 子曰人不知而不慍不亦君子乎
Hypothesis (Korean): 공자께서 남이 알아주지 않으면 화를 내야 한다고 하셨다
Label: Contradiction
Justification: The hypothesis directly contradicts the premise’s assertion that a noble person remains unperturbed by lack of recognition.
The NLI task evaluates whether models can determine logical relationships between statements drawn from classical philosophical texts, requiring both semantic understanding and formal reasoning. Example 1 demonstrates clear entailment: the Korean hypothesis (공자께서 배움과 복습의 기쁨에 대해 말씀하셨다, “Confucius spoke about the joy of learning and practice”) accurately captures the semantic content of the classical Chinese premise (子曰學而時習之不亦說乎, “The Master said: to learn and regularly practice, is that not a pleasure?”). This translation-equivalence relationship represents the most straightforward inference type. Example 2 illustrates neutral relationship requiring subtle philosophical discrimination: while the hypothesis mentions ritual propriety (예의범절을 지켜야 한다, “one should observe ritual propriety”), it fails to capture the premise’s core concepts of self-discipline (克己) and humaneness (仁), representing only partial semantic overlap without full entailment. Example 3 shows direct contradiction through logical negation: the premise asserts that a noble person remains calm when unrecognized (人不知而不慍不亦君子乎), while the hypothesis claims Confucius said one should become angry when unrecognized (남이 알아주지 않으면 화를 내야 한다), directly inverting the ethical teaching. These examples require models to reason about philosophical concepts and logical relationships rather than merely matching surface lexical patterns. Our evaluation reveals surprisingly strong reasoning capabilities: mean NLI accuracy reaches 59.3%, with GPT-4 Turbo achieving 82.7% and multiple models exceeding 70% accuracy (Section 5). This strong performance indicates that current LLMs possess substantial formal reasoning capabilities over classical philosophical content, though performance varies significantly by model, with Claude Sonnet 4.5 achieving near-zero accuracy (0.5%) despite strong retrieval performance.
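As a concrete illustration of the zero-shot protocol for this task, the sketch below builds an NLI prompt, maps a free-form model response to one of the three labels, and computes accuracy. The prompt wording and the fallback label are assumptions of this sketch, not the exact template used in our experiments.

```python
LABELS = ("entailment", "neutral", "contradiction")

def build_prompt(premise: str, hypothesis: str) -> str:
    """Zero-shot NLI prompt; the wording is illustrative, not the paper's exact template."""
    return (
        "Premise (Korean Literary Sinitic): " + premise + "\n"
        "Hypothesis (Korean): " + hypothesis + "\n"
        "Answer with exactly one word: entailment, neutral, or contradiction."
    )

def parse_label(model_output: str) -> str:
    """Map a free-form model response to one of the three NLI labels."""
    text = model_output.lower()
    for label in LABELS:
        if label in text:
            return label
    return "neutral"  # fallback when no label is recognized (an assumption of this sketch)

def accuracy(predictions: list[str], golds: list[str]) -> float:
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)

prompt = build_prompt("子曰學而時習之不亦說乎", "공자께서 배움과 복습의 기쁨에 대해 말씀하셨다")
print(parse_label("This is Entailment, because the hypothesis restates the premise."))  # entailment
```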

Appendix A.2. Dataset Composition Summary

The dataset composition reflects careful balance between naturalistic representation and controlled evaluation requirements. Genre distribution (Table A1) maintains diversity while respecting historical frequencies, ensuring models encounter authentic distribution patterns. Source text distribution (Table A2) follows canonical significance, with more extensive texts contributing proportionally more instances.
Table A7 presents the label distribution for NLI and translation task compositions. The NLI task covers three classes (entailment, neutral, contradiction) generated through controlled instance construction, although the label distribution is skewed toward entailment. The translation pairs show the multilingual structure of KLSBench, with Classical Chinese-to-Korean translation comprising the majority of instances (1320) compared to Korean-to-English translation (680).
Table A7. NLI label and translation pair distributions.
NLI Task
Entailment | 1313
Neutral | 400
Contradiction | 141
Total | 1854
Translation Task
Classical Chinese → Korean | 1320
Korean → English | 680
Total | 2000
Table 2 and Tables A3–A6 provide concrete examples for each of the five tasks, illustrating the input formats, expected outputs, and evaluation criteria that models must satisfy.

Appendix B. Detailed Model Performance Analysis

This appendix presents fine-grained model performance analysis across multiple dimensions. The analysis reveals task-specific patterns, error characteristics, and model-dependent behaviors not fully captured by aggregate metrics.

Appendix B.1. Genre-Specific Classification Performance

Figure A1 reveals task-specific difficulty patterns where certain genres like poetry (詩, 65.5%) and essays on doubtful points (疑, 56.4%) achieve substantially higher accuracy than others like rhyme-prose (賦, 13.8%) and meaning clarification (義, 13.3%), possibly reflecting more distinctive structural conventions and surface features in the former. The mean classification accuracy of 31.8% indicates moderate but challenging performance rather than complete failure, though substantial variation exists across genres. This pattern suggests models can leverage surface-level stylistic features for certain genres but struggle with those requiring deeper understanding of classical rhetorical conventions and argumentative structures.
Figure A1. Per-genre classification accuracy revealing task-specific difficulty patterns.
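For completeness, per-genre accuracy of the kind plotted in Figure A1 can be computed as per-class recall over (gold, prediction) pairs, as in the sketch below; the toy labels are hypothetical and used for illustration only.

```python
from collections import Counter

def per_genre_accuracy(golds: list[str], preds: list[str]) -> dict[str, float]:
    """Per-genre recall: the fraction of each genre's instances predicted correctly."""
    totals = Counter(golds)
    correct = Counter(g for g, p in zip(golds, preds) if g == p)
    return {genre: correct[genre] / totals[genre] for genre in totals}

# Hypothetical gold and predicted genres, for illustration only.
golds = ["詩", "詩", "賦", "疑"]
preds = ["詩", "論", "論", "疑"]
print(per_genre_accuracy(golds, preds))  # {'詩': 0.5, '賦': 0.0, '疑': 1.0}
```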

Appendix B.2. Per-Book Retrieval Performance

Figure A2 shows differential memorization patterns reflecting canonical text hierarchies in pretraining data. The Analects (論語) achieves the highest mean retrieval accuracy across models, reflecting its canonical status and likely overrepresentation in pretraining corpora. The Doctrine of the Mean (中庸), despite being shorter and potentially more concentrated in representation, shows moderately lower performance, suggesting that retrieval success depends not merely on frequency but also on textual distinctiveness. Claude Sonnet 4.5 achieves the strongest overall retrieval performance (91.7%), while other models show substantial variation (23.3% to 80.8%), indicating differential exposure to canonical texts during pretraining.
Figure A2. Per-book retrieval accuracy showing differential memorization patterns.

Appendix B.3. Error Pattern Analysis

Figure A3 distinguishes between systematic bias (classification) and partial reasoning capability (NLI), providing nuanced understanding of failure modes. Classification errors predominantly manifest as systematic bias toward majority classes, with models defaulting to frequent genres (論, 記) regardless of input characteristics—indicating reliance on prior probabilities rather than content analysis. NLI errors show different patterns: models achieving above-random performance exhibit confusion between entailment and neutral labels, suggesting partial logical reasoning capability, while models at chance level distribute errors uniformly, indicating complete reasoning failure.
Figure A3. Error pattern analysis distinguishing systematic bias from partial reasoning capability.

Appendix C. Glossary

Table A8. Glossary of key terms used in this paper.
Literary Sinitic (文言文): A classical written language historically used in East Asia, characterized by condensed syntax, implicit grammatical relations, and reliance on shared cultural and philosophical knowledge.
Korean Literary Sinitic (KLS): Literary Sinitic texts produced and used in the Korean historical and institutional context, particularly during the Joseon dynasty, reflecting local genre conventions and rhetorical practices.
Hangul (한글): The modern Korean alphabetic writing system used for contemporary Korean, distinct from the character-based script used in Literary Sinitic texts.
Sino-Korean vocabulary: Korean lexical items derived from Chinese characters, forming a substantial portion of modern Korean vocabulary and providing partial lexical continuity with Literary Sinitic.
Unpunctuated text (白文): Classical text presented without modern punctuation or sentence segmentation, reflecting original manuscript conventions and requiring implicit structural interpretation.
Literary genre (文體): A classification label indicating the rhetorical or functional type of a text (e.g., poetry, argumentative essay), derived from historical metadata associated with examination texts.

Figure 1. KLSBench architecture showing the evaluation pipeline. Data from two sources (Gwageo examination records and Four Books) undergoes processing before evaluation across five tasks (Classification, Retrieval, Punctuation, NLI, Translation) using seven models (four proprietary and three open-source models). The right panel shows mean performance across tasks: Retrieval (64.9%), Punctuation (61.3%), NLI (59.3%), Translation (21.5%), and Classification (31.8%). See Appendix A.1 for task examples.
Figure 2. Temporal distribution of Gwageo examination questions across the Joseon dynasty (1392–1897). The distribution reflects historical preservation patterns, with higher density in later periods (18th–19th centuries). Questions span multiple centuries, providing temporal diversity for evaluation across different historical periods.
Figure 3. Text length distribution comparison across major literary genres in the classification task. The box plots show median length (center line), interquartile range (box), and outliers (points) for each literary genre, revealing substantial variation in typical essay lengths across different genres.
Figure 4. Average confusion matrix across seven models for classification of six major literary genres (詩: poetry, 疑: essays on doubtful points, 賦: fu/rhapsody, 義: argumentative prose). The matrix reveals systematic misclassification patterns: models consistently confuse structurally similar genres, with certain genres frequently misclassified as others, while achieving relatively high recall on poetry (詩, 65.5%) and essays on doubtful points (疑, 56.4%). The strong diagonal for 疑 indicates distinguishable rhetorical features, while weak diagonals for 賦 (13.8%) and 義 (13.3%) reveal fundamental challenges in recognizing these genres’ subtle argumentative structures.
Table 1. Data source composition and KLSBench task utilization.
Source | Data Class | Content | Tasks
Gwageo Records (2849) | Question | Test prompts with 21 literary genres | Classification (808)
Gwageo Records (2849) | Answer | Model responses from anthologies | Punctuation (1000)
Gwageo Records (2849) | Sigwon | Historical answer sheets with grades | Translation (1320)
Four Books (2119) | Work | Literary Sinitic original texts | Retrieval (1209)
Four Books (2119) | Collection | Korean/English translations with commentaries | NLI (1854), Punctuation (1000), Translation (680)
Total Instances | 4968
Coverage | All 5 tasks
Table 2. Examples of translation challenges in KLSBench.
Example 1: KLS to Korean
Source (KLS): 子曰學而時習之不亦說乎有朋自遠方來不亦樂乎
Target (Korean): 공자께서 말씀하시기를, 배우고 때때로 익히면 또한 기쁘지 아니한가. 벗이 먼 곳에서 찾아오니 또한 즐겁지 아니한가.
Challenge: Preserving classical philosophical tone while conveying rhetorical questions in modern Korean.
Example 2: Korean to English
Source (Korean): 孟子께서 말씀하시기를, 사람은 모두 남을 차마 해치지 못하는 마음이 있다.
Target (English): Mencius said, "All people have a heart that cannot bear to see the suffering of others."
Challenge: Maintaining philosophical register and capturing nuanced compassion concepts in English academic style.
Example 3: Culturally Embedded Terms
Source (KLS): 克己復禮為仁一日克己復禮天下歸仁焉
Target (Korean): 자기를 이기고 예로 돌아가는 것이 인을 행하는 것이니, 하루라도 자기를 이기고 예로 돌아가면 천하가 인으로 돌아올 것이다.
Table 3. KLSBench overall statistics.
Task | Instances | Percentage | Metrics
Classification | 808 | 10.3% | Acc, P, R, F1
Retrieval | 1209 | 15.4% | Accuracy
Punctuation | 2000 | 25.4% | F1, ROUGE
NLI | 1854 | 23.6% | Acc, P, R, F1
Translation | 2000 | 25.4% | BLEU, ROUGE
Total | 7871 | 100% | -
Table 4. Label distribution for classification and NLI tasks.
Task | Label | Count | Percentage
Classification ¹ | 賦 (Bu) | 95 | 11.8%
Classification ¹ | 詩 (Si) | 95 | 11.8%
Classification ¹ | 疑 (Eui) | 95 | 11.8%
Classification ¹ | 義 (Ui) | 95 | 11.8%
Classification ¹ | 策 (Chaek) | 95 | 11.7%
Classification ¹ | 表 (Pyo) | 95 | 11.7%
Classification ¹ | 論 (Ron) | 51 | 6.3%
Classification ¹ | 銘 (Myeong) | 53 | 6.6%
Classification ¹ | Others (13 categories) | 134 | 16.5%
NLI | Entailment | 1313 | 70.8%
NLI | Neutral | 400 | 21.6%
NLI | Contradiction | 141 | 7.6%
¹ Percentages may not sum to exactly 100% due to rounding.
Table 5. Text length statistics across tasks (in characters).
Task | Mean | Median | Min | Max
Classification (input) | 324 | 198 | 12 | 3847
Retrieval (input) | 156 | 98 | 8 | 1245
Punctuation (input) | 187 | 142 | 10 | 1892
NLI (premise) | 142 | 108 | 6 | 982
NLI (hypothesis) | 138 | 105 | 6 | 976
Translation (source) | 165 | 127 | 8 | 1456
Table 6. Specifications of evaluated models.
Model | Type | Parameters | Release
Proprietary API Models
GPT-4 Turbo | API | Unknown | 2024
GPT-3.5 Turbo | API | Unknown | 2023
Claude 3.5 Sonnet | API | Unknown | 2024
Claude 3 Opus | API | Unknown | 2024
Open-Source Models
Llama 3.1 Instruct | Open | 8B (Billion) | 2024
Qwen 2.5 Instruct | Open | 7B | 2024
EXAONE 3.0 Instruct | Open | 7.8B | 2024
Table 7. Overall performance across all tasks and models.
Model | Type | Classif. (Acc) | Retrieval (Acc) | Punct. (F1) | NLI (Acc) | Trans. (BLEU)
GPT-4 Turbo | API | 0.327 | 0.758 | 0.768 | 0.827 | 0.307
GPT-3.5 Turbo | API | 0.366 | 0.600 | 0.940 | 0.741 | 0.253
Claude Sonnet 4.5 | API | 0.113 | 0.917 | 0.406 | 0.005 | 0.329
Claude 3 Opus | API | 0.440 | 0.808 | 0.473 | 0.362 | 0.250
Llama 3.1 8B | Open | 0.156 | 0.233 | 0.839 | 0.741 | 0.189
Qwen 2.5 7B | Open | 0.515 | 0.592 | 0.511 | 0.757 | 0.033
EXAONE 3.0 7.8B | Open | 0.312 | 0.633 | 0.356 | 0.719 | 0.144
Mean | - | 0.318 | 0.649 | 0.613 | 0.593 | 0.215
Std Dev | - | 0.098 | 0.202 | 0.227 | 0.293 | 0.097
Table 8. Temperature sensitivity analysis: model performance across temperature settings.
Model | Task | T = 0.0 | T = 0.3 | T = 0.7
GPT-3.5 Turbo | Classification | 0.366 | 0.386 | 0.290
GPT-3.5 Turbo | Retrieval | 0.600 | 0.642 | 0.633
GPT-3.5 Turbo | Punctuation | 0.940 | 0.941 | 0.936
GPT-3.5 Turbo | NLI | 0.741 | 0.728 | 0.735
GPT-3.5 Turbo | Translation | 0.253 | 0.243 | 0.230
GPT-4 Turbo | Classification | 0.327 | 0.351 | 0.305
GPT-4 Turbo | Retrieval | 0.758 | 0.758 | 0.775
GPT-4 Turbo | Punctuation | 0.768 | 0.770 | 0.775
GPT-4 Turbo | NLI | 0.827 | 0.822 | 0.834
GPT-4 Turbo | Translation | 0.307 | 0.307 | 0.298
Claude 3 Opus | Classification | 0.440 | 0.403 | 0.000
Claude 3 Opus | Retrieval | 0.808 | 0.808 | 1.000
Claude 3 Opus | Punctuation | 0.473 | 0.465 | 0.000
Claude 3 Opus | NLI | 0.362 | 0.379 | 0.000
Claude 3 Opus | Translation | 0.250 | 0.124 | 0.000
Claude Sonnet 4.5 | Classification | 0.113 | 0.125 | 0.100
Claude Sonnet 4.5 | Retrieval | 0.917 | 0.917 | 0.900
Claude Sonnet 4.5 | Punctuation | 0.406 | 0.415 | 0.404
Claude Sonnet 4.5 | NLI | 0.005 | 0.010 | 0.000
Claude Sonnet 4.5 | Translation | 0.329 | 0.334 | 0.329
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
