This appendix provides detailed visualizations of KLSBench’s dataset composition. These visualizations complement the statistical summaries presented in the main text. The figures reveal the dataset’s structural characteristics, temporal distribution patterns, and linguistic diversity across multiple dimensions.
Table A1 presents the distribution of 21 literary genres used in the genre classification task. The dataset covers a wide range of Korean Literary Sinitic genres, including philosophical treatises (論), historical narratives (記), expository essays (說), commemorative inscriptions (銘), and poetic compositions.
To reflect historical genre prevalence while maintaining evaluation stability, six major genres were constructed using stratified sampling with 95 instances each. The remaining 15 minor genres exhibit naturally skewed frequencies, ranging from 2 to 53 instances.
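A minimal sketch of how such a capped stratified sample could be drawn from a genre-labeled corpus is given below; the record schema (a `genre` field), the random seed, and the data-loading step are hypothetical, while the cap of 95 and the set of major genres follow the description above and Table A1.

```python
# Hedged sketch: cap the six major genres at 95 instances each and keep the
# minor genres at their natural frequencies. Record schema is illustrative.
import random
from collections import defaultdict

MAJOR_GENRES = {"賦", "詩", "疑", "義", "策", "表"}
MAJOR_CAP = 95  # instances per major genre, as described above

def stratified_genre_sample(records, seed=0):
    """`records`: iterable of dicts with a 'genre' key (assumed schema)."""
    rng = random.Random(seed)
    by_genre = defaultdict(list)
    for rec in records:
        by_genre[rec["genre"]].append(rec)

    sample = []
    for genre, items in by_genre.items():
        if genre in MAJOR_GENRES and len(items) > MAJOR_CAP:
            sample.extend(rng.sample(items, MAJOR_CAP))  # downsample major genre
        else:
            sample.extend(items)  # minor genres keep their skewed counts
    return sample
```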
Table A1. Classification task: genre distribution (21 classes).

| Genre | Count | Genre | Count |
|---|---|---|---|
| Major genres (6 classes, 95 instances each) | | | |
| 賦(Fu) | 95 | 義(Yi) | 95 |
| 詩(Shi) | 95 | 策(Ce) | 95 |
| 疑(Yi) | 95 | 表(Biao) | 95 |
| Subtotal | 570 | | |
| Minor genres (15 classes, 2–53 instances each) | | | |
| 箴(Zhen) | 53 | 書義 | 6 |
| 頌(Song) | 49 | 禮義 | 6 |
| 箋(Jian) | 24 | 詔(Zhao) | 5 |
| 論(Lun) | 12 | 制(Zhi) | 3 |
| 銘(Ming) | 9 | 論(Lun) | 2 |
| 詩義 | 7 | 講(Jiang) | 2 |
| 易義 | 7 | 擬(Ni) | 2 |
| Subtotal | 238 | | |
| Total (21 classes) | 808 | | |
Table A2. Retrieval task: distribution by source book.

| Source Book | Count |
|---|---|
| 論語(Analects) | 500 |
| 孟子(Mencius) | 500 |
| 中庸(Doctrine of the Mean) | 137 |
| 大學(Great Learning) | 72 |
| Total | 1209 |
Appendix A.1. Task Examples
This subsection presents comprehensive examples for each of the five tasks in KLSBench, organized into separate tables (Table 2 and Tables A3–A6) for clarity. These examples are carefully selected to illustrate the diverse challenges inherent in Korean Literary Sinitic text understanding. Each task requires distinct cognitive capabilities: genre recognition demands understanding of rhetorical conventions and compositional structures; retrieval tests familiarity with canonical texts; punctuation requires syntactic parsing of unpunctuated classical Chinese; natural language inference evaluates logical reasoning over philosophical content; and translation assesses cross-lingual semantic transfer while preserving cultural nuances. The examples below demonstrate why these tasks pose significant challenges even for state-of-the-art large language models, as discussed in Section 5.
Translation.
Table 2 (Section 3) presents translation task examples demonstrating three primary challenges: (1) syntactic expansion from condensed classical forms to modern grammatical structures, (2) cultural–linguistic adaptation of philosophical concepts across language boundaries, and (3) interpretation of Neo-Confucian terminology that resists direct lexical equivalence. These challenges explain the uniformly low BLEU scores in our evaluation (mean 21.5%, Section 5), as automatic metrics cannot assess philosophical meaning preservation or cultural contextualization. See Section 3 for detailed examples and analysis.
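For reference, the sketch below shows how a corpus-level BLEU score of this kind could be computed with the off-the-shelf sacrebleu library; the segment strings, the single-reference setup, and the character-level tokenization are illustrative assumptions rather than the benchmark's exact scoring configuration.

```python
# Minimal sketch of corpus-level BLEU scoring with sacrebleu
# (pip install sacrebleu). Inputs are illustrative placeholders.
from sacrebleu.metrics import BLEU

hypotheses = [
    "배우고 때때로 익히면 또한 기쁘지 아니한가.",        # model output (example)
]
references = [
    "배우고 때로 그것을 익히면 또한 기쁘지 않겠는가.",    # gold Korean translation (example)
]

# Character-level tokenization is one reasonable choice for Korean targets;
# the paper's exact tokenization setting is an assumption here.
bleu = BLEU(tokenize="char")
result = bleu.corpus_score(hypotheses, [references])  # one reference stream
print(f"BLEU = {result.score:.1f}")  # 0-100 scale, comparable to the reported 21.5
```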
Classification.
Table A3 demonstrates genre classification through typical examples of philosophical discourse, historical narrative, and expository essays from the civil service examination corpus.
The classification task requires models to identify literary genres (文體) based on subtle rhetorical and structural features rather than explicit lexical markers. Example 1 illustrates philosophical discourse (論) characterized by Confucius's didactic teaching style with rhetorical questions (不亦...乎 pattern) that guide readers toward moral insights about learning, friendship, and self-cultivation. Example 2 shows historical narrative (記) distinguished by temporal markers (三年秋七月), chronological event sequencing, and formal diplomatic language documenting tributary relations. Example 3 presents an expository essay (說) featuring analogical reasoning: the water-and-soil metaphor (水之性清而土汩之) serves as philosophical argumentation establishing that human nature requires cultivation to resist external corruption. These genre distinctions were fundamental to Joseon civil service examinations, where candidates had to master diverse compositional forms. Our evaluation reveals that models achieve 31.8% mean accuracy on this task (Section 5), with substantial variation (11.3% to 51.5%), indicating that while contemporary LLMs demonstrate moderate capabilities in recognizing certain classical literary genres, deep understanding of Neo-Confucian rhetorical taxonomy remains challenging.
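A minimal evaluation loop for this task could look like the following sketch, which asks a model to pick one genre label and scores exact-match accuracy; the prompt wording, the `query_model` placeholder, and the example schema are hypothetical illustrations, not the released evaluation harness.

```python
# Hedged sketch of genre-classification scoring: exact match between the
# predicted genre label and the gold label. `query_model(prompt) -> str`
# stands in for whatever LLM API is being evaluated.
GENRES = ["賦", "詩", "疑", "義", "策", "表", "箴", "頌", "論", "記", "說", "銘"]  # abridged list

def classify(text: str, query_model) -> str:
    prompt = (
        "Choose exactly one genre (文體) from the list for the following "
        f"Literary Sinitic text.\nGenres: {', '.join(GENRES)}\n"
        f"Text: {text}\nGenre:"
    )
    return query_model(prompt).strip()

def genre_accuracy(examples, query_model) -> float:
    """`examples`: list of {'text': ..., 'genre': ...} dicts (assumed schema)."""
    correct = sum(classify(ex["text"], query_model) == ex["genre"] for ex in examples)
    return correct / len(examples)
```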
Table A3. Classification task examples.
| Example 1—Philosophical Discourse (論) |
| Input: 子曰學而時習之不亦說乎有朋自遠方來不亦樂乎人不知而不慍不亦君子乎 |
| Expected Output: 論(Ron, Discourse) |
| Rationale: This passage exhibits characteristic features of philosophical discourse: didactic tone, rhetorical questions, and moral reasoning about learning and self-cultivation. |
| Example 2—Historical Narrative (記) |
| Input: 太祖在位三年秋七月朝鮮國王遣使來朝貢方物其使曰臣國自古事大以誠今聞天子即位故遣臣來獻 |
| Expected Output: 記(Gi, Record) |
| Rationale: Contains temporal markers, chronological structure, and narrative of historical events involving diplomatic relations. |
| Example 3—Expository Essay (說) |
| Input: 夫水之性清而土汩之故不清人之性善而物誘之故不善是以君子必慎其獨也 |
| Expected Output: 說(Seol, Exposition) |
| Rationale: Analytical argumentation using analogical reasoning (water metaphor) to explain abstract philosophical principles. |
Retrieval.
Table A4 illustrates the retrieval task with passages from the Four Books, testing models’ ability to identify source texts and specific chapters.
This task evaluates whether models have memorized canonical Confucian texts that formed the intellectual foundation of Joseon dynasty education. Example 1 presents the famous definition of 仁(humaneness) through 克己復禮 (self-discipline and returning to propriety) from the Analects' Yan Yuan chapter, one of the most frequently cited passages in Neo-Confucian discourse. Example 2 shows Mencius's foundational argument for innate moral sentiments (不忍人之心, the heart that cannot bear to see others suffer), establishing the philosophical basis for compassionate governance. Example 3 quotes the opening of the Great Learning, articulating the three fundamental principles (三綱領: 明明德, 親民, 止於至善) that structure classical education. Unlike other tasks requiring deep semantic understanding, retrieval primarily tests pattern recognition and text memorization. Our results confirm this distinction: models achieve 64.9% mean accuracy on retrieval (Section 5), with substantial variation across models, suggesting varying exposure to canonical texts in pretraining corpora. This moderate performance indicates that while some models have memorized portions of the Four Books, the task remains challenging overall; retrieval nonetheless yields higher accuracy than culturally embedded tasks such as classification (31.8%) and is comparable to reasoning tasks such as NLI (59.3%).
Table A4. Retrieval task examples.
| Example 1—Analects Passage |
| Input: 子曰克己復禮為仁一日克己復禮天下歸仁焉為仁由己而由人乎哉 |
| Expected Output: 論語顏淵篇 (Analects, Chapter Yan Yuan 12.1) |
| Context: Famous passage defining 仁 (humaneness) through self-discipline and ritual propriety. |
| Example 2—Mencius Passage |
| Input: 孟子曰人皆有不忍人之心先王有不忍人之心斯有不忍人之政矣 |
| Expected Output: 孟子公孫丑上篇 (Mencius, Gongsun Chou I 2A.6) |
| Context: Foundational argument for innate moral sentiments and compassionate governance. |
| Example 3—Great Learning Passage |
| Input: 大學之道在明明德在親民在止於至善知止而後有定 |
| Expected Output: 大學經一章 (Great Learning, Classic Text Chapter 1) |
| Context: Opening statement defining the three fundamental principles of classical education. |
Punctuation.
Table A5 shows punctuation restoration examples, demonstrating the challenge of segmenting unpunctuated classical Chinese text into properly delimited clauses.
Table A5. Punctuation task examples.
| Example 1—Dialogue with Multiple Clauses |
| Input (白文): 子曰學而時習之不亦說乎有朋自遠方來不亦樂乎人不知而不慍不亦君子乎 |
| Expected Output: 子曰, 學而時習之, 不亦說乎. 有朋自遠方來, 不亦樂乎. 人不知而不慍, 不亦君子乎. |
| Key Challenge: Identifying parallel rhetorical structures and distinguishing main clauses from subordinate clauses. |
| Example 2—Complex Conditional Statement |
| Input (白文): 孟子曰人皆有不忍人之心所以謂人皆有不忍人之心者今人乍見孺子將入於井皆有怵惕惻隱之心 |
| Expected Output: 孟子曰, 人皆有不忍人之心. 所以謂人皆有不忍人之心者, 今人乍見孺子將入於井, 皆有怵惕惻隱之心. |
| Key Challenge: Parsing complex embedded structures with causal reasoning and hypothetical scenarios. |
Classical Literary Sinitic texts were traditionally written as continuous character sequences without punctuation marks (白文), requiring readers to parse sentence and clause boundaries through deep understanding of classical grammar and discourse structure. Example 1 illustrates the complexity of punctuating parallel rhetorical structures: the three rhetorical questions (不亦說乎, 不亦樂乎, 不亦君子乎) must be properly segmented to distinguish the main attribution clause (子曰) from three coordinate philosophical statements about learning, friendship, and self-cultivation. Each statement contains subordinate clauses (學而時習之, 有朋自遠方來, 人不知而不慍) that must be delimited with commas before the concluding rhetorical questions. Example 2 presents even greater complexity with embedded causal structures: the passage contains a primary assertion (人皆有不忍人之心), followed by an explanatory clause introduced by 所以謂...者 (the reason for saying...), which itself contains a hypothetical scenario (今人乍見孺子將入於井) demonstrating the innate moral sentiment. Proper punctuation requires understanding both syntactic structure and the flow of logical argumentation. Our evaluation reveals dramatic model-dependent performance variation (35.6% to 94.0% F1), with GPT-3.5 Turbo achieving an exceptional 94.0% F1 while Claude models remain below 48% (Section 5). This variance suggests punctuation restoration captures specialized grammatical knowledge of classical Chinese syntax that varies significantly across model training compositions.
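One way to score this task is boundary-level F1 over the positions in the unpunctuated text where a punctuation mark is inserted, as in the hedged sketch below; treating every inserted mark as a break after a character index is an assumption made for illustration, not a description of the official scorer.

```python
# Hedged sketch: boundary-level F1 for punctuation restoration. A "boundary"
# is the 0-based index of the unpunctuated (白文) character that a punctuation
# mark follows. This scoring scheme is an illustrative assumption.
PUNCT = set("，。、．,.?! ")  # marks and spaces stripped when aligning to the 白文

def boundaries(punctuated: str) -> set:
    idx, breaks = -1, set()
    for ch in punctuated:
        if ch in PUNCT:
            if idx >= 0:
                breaks.add(idx)   # break after the last content character seen
        else:
            idx += 1              # advance position in the unpunctuated text
    return breaks

def punct_f1(gold: str, pred: str) -> float:
    g, p = boundaries(gold), boundaries(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

# Example 1 from Table A5, with one comma dropped in the hypothetical prediction:
gold = "子曰, 學而時習之, 不亦說乎. 有朋自遠方來, 不亦樂乎. 人不知而不慍, 不亦君子乎."
pred = "子曰, 學而時習之不亦說乎. 有朋自遠方來, 不亦樂乎. 人不知而不慍, 不亦君子乎."
print(round(punct_f1(gold, pred), 3))  # precision 1.0, recall 6/7 -> F1 ≈ 0.923
```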
Natural Language Inference.
Table A6 presents natural language inference examples with premise–hypothesis pairs labeled as entailment, neutral, or contradiction, assessing models’ logical reasoning capabilities.
Table A6. Natural language inference task examples.
| Example 1—Entailment |
| Premise (KLS): 子曰學而時習之不亦說乎 |
| Hypothesis (Korean): 공자께서 배움과 복습의 기쁨에 대해 말씀하셨다 |
| Label: Entailment |
| Justification: The hypothesis accurately captures the semantic content of Confucius discussing the joy of learning and practice. |
| Example 2—Neutral |
| Premise (KLS): 子曰克己復禮為仁 |
| Hypothesis (Korean): 공자께서 예의범절을 지켜야 한다고 말씀하셨다 |
| Label: Neutral |
| Justification: While related, the hypothesis focuses narrowly on ritual propriety without capturing the core concept of self-discipline or humaneness. |
| Example 3—Contradiction |
| Premise (KLS): 子曰人不知而不慍不亦君子乎 |
| Hypothesis (Korean): 공자께서 남이 알아주지 않으면 화를 내야 한다고 하셨다 |
| Label: Contradiction |
| Justification: The hypothesis directly contradicts the premise’s assertion that a noble person remains unperturbed by lack of recognition. |
The NLI task evaluates whether models can determine logical relationships between statements drawn from classical philosophical texts, requiring both semantic understanding and formal reasoning. Example 1 demonstrates clear entailment: the Korean hypothesis (공자께서 배움과 복습의 기쁨에 대해 말씀하셨다, “Confucius spoke about the joy of learning and practice”) accurately captures the semantic content of the classical Chinese premise (子曰學而時習之不亦說乎, “The Master said: to learn and regularly practice, is that not a pleasure?”). This translation-equivalence relationship represents the most straightforward inference type. Example 2 illustrates a neutral relationship requiring subtle philosophical discrimination: while the hypothesis mentions ritual propriety (예의범절을 지켜야 한다, “one should observe ritual propriety”), it fails to capture the premise’s core concepts of self-discipline (克己) and humaneness (仁), representing only partial semantic overlap without full entailment. Example 3 shows direct contradiction through logical negation: the premise asserts that a noble person remains calm when unrecognized (人不知而不慍不亦君子乎), while the hypothesis claims Confucius said one should become angry when unrecognized (남이 알아주지 않으면 화를 내야 한다), directly inverting the ethical teaching. These examples require models to reason about philosophical concepts and logical relationships rather than merely matching surface lexical patterns. Our evaluation reveals surprisingly strong reasoning capabilities: mean NLI accuracy reaches 59.3%, with GPT-4 Turbo achieving 82.7% and multiple models exceeding 70% accuracy (Section 5). This strong performance indicates that current LLMs possess substantial formal reasoning capabilities over classical philosophical content, though performance varies significantly by model, with Claude Sonnet 4.5 achieving near-zero accuracy (0.5%) despite strong retrieval performance.
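Scoring this task amounts to mapping a model's free-form answer onto one of the three labels and computing accuracy; the sketch below is one plausible implementation, with the keyword lists, prompt wording, and the `query_model` placeholder as illustrative assumptions rather than the benchmark's released scorer.

```python
# Hedged sketch: normalize a free-form model response to an NLI label and
# compute three-way accuracy. Keyword lists and prompt wording are illustrative.
LABEL_KEYWORDS = {
    "contradiction": ["contradict", "모순"],
    "neutral": ["neutral", "중립"],
    "entailment": ["entail", "함의"],
}

def normalize_label(response: str):
    text = response.lower()
    for label, keywords in LABEL_KEYWORDS.items():
        if any(k in text for k in keywords):
            return label
    return None  # unparseable responses are scored as incorrect

def nli_accuracy(examples, query_model) -> float:
    """`examples`: list of dicts with 'premise', 'hypothesis', 'label' (assumed schema)."""
    correct = 0
    for ex in examples:
        prompt = (
            f"Premise (Literary Sinitic): {ex['premise']}\n"
            f"Hypothesis (Korean): {ex['hypothesis']}\n"
            "Answer with exactly one word: entailment, neutral, or contradiction."
        )
        if normalize_label(query_model(prompt)) == ex["label"]:
            correct += 1
    return correct / len(examples)
```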
Appendix A.2. Dataset Composition Summary
The dataset composition reflects careful balance between naturalistic representation and controlled evaluation requirements. Genre distribution (Table A1) maintains diversity while respecting historical frequencies, ensuring models encounter authentic distribution patterns. Source text distribution (Table A2) follows canonical significance, with more extensive texts contributing proportionally more instances.
Table A7 presents the label distribution for the NLI task and the language-pair composition of the translation task. The NLI task covers three classes (entailment, neutral, contradiction) generated through controlled instance construction, with the label distribution skewed toward entailment (1313 of 1854 instances). The translation pairs show the multilingual structure of KLSBench, with Classical Chinese-to-Korean translation comprising the majority of instances (1320) compared to Korean-to-English translation (680).
Table A7. NLI label and translation pair distributions.
| NLI Label | Count | Translation Pair | Count |
|---|---|---|---|
| Entailment | 1313 | Classical Chinese → Korean | 1320 |
| Neutral | 400 | Korean → English | 680 |
| Contradiction | 141 | | |
| Total | 1854 | Total | 2000 |
Table 2 and Tables A3–A6 provide concrete examples for each of the five tasks, illustrating the input formats, expected outputs, and evaluation criteria that models must satisfy.