1. Introduction
Classical Chinese texts are characterized by terse syntax, implicit semantics, and abundant polysemy: a character or phrase often has different interpretations owing to historical change, textual context, and cultural shifts. These characteristics challenge contemporary natural language processing systems, because large language models (LLMs) are typically trained on contemporary text [1]. When applied to the study of ancient Chinese texts and literature, these models suffer from various problems, including hallucinated content, superficial interpretation, and semantic shift. Classical Chinese teaching materials often organize information about poems, authors, dynasties, and annotations in tables, but current retrieval-augmented generation frameworks seldom search these tables explicitly [2,3].
Retrieval-augmented generation (RAG) avoids the tedious and time-consuming process of retraining large models by injecting critical information from external knowledge sources into the prompt, thus enabling better performance on question-answering tasks [4]. Nevertheless, several problems remain: many existing RAG frameworks treat a retrieved narrative passage or table as a single block of text. This design introduces two key problems when dealing with ancient Chinese resources: (1) context fragmentation, in which chunking splits content that belongs to a single table; and (2) implicit inference, in which the LLM must painstakingly scan a whole table to find the correct record, which increases the error rate.
To overcome these drawbacks, we treat every row of a table as a single semantic unit, explicitly keeping the semantics of each row intact to enable engineering-oriented retrieval. Our key observation is that in both educational and literary datasets, each row usually represents a semantically complete record. Making the row the unit of retrieval changes both the granularity of retrieval and the reasoning behavior that follows it [5].
This study makes three key contributions:
First, we propose a table-based row-level semantic approach that splits tables into independent row-level units.
Second, we design a row-level embedding and retrieval method to achieve precise semantic localization.
Finally, we implement this method at the system level and evaluate it for ancient Chinese learning tasks.
The core contribution of this study does not lie in the isolated use of LangChain, Chroma, Ollama, or any specific embedding or generation model backend, but in the design of the Smart Table-Aware Loader. This loader reconstructs the document loading and semantic representation process by preserving the row-level semantic relationships within tables, allowing each row to serve as an independent retrieval unit.
2. Related Work
Initially, research on ancient Chinese processing primarily relied on manually compiled dictionaries or explanatory notes. In recent years, the continuous development of LLMs has made data-driven translation and interpretation methods a viable option. However, due to the high semantic density and contextual dependence of ancient Chinese, approaches based on large models often encounter the problem of hallucination.
TongGu demonstrated that integrating external knowledge through retrieval-enhanced mechanisms can improve the semantic localization capability in ancient Chinese tasks [6]. However, these methods primarily focus on unstructured textual knowledge and fail to explicitly model the common structured tables found in educational materials.
RAG has been widely adopted to enhance factual grounding without requiring retraining of language models. Early RAG frameworks primarily focused on block-level text retrieval, while recent studies have begun exploring structured and hybrid documents.
HD-RAG can retrieve documents containing both text and tables, while TableRAG further prioritizes tables as the primary knowledge source in RAG. TableRAG mainly focuses on SQL-style or database tables, whereas our work addresses tables embedded in documents. These studies highlight the importance of table perception, but they typically operate only at the table or block level and delegate row selection to the LLM [7,8].
Beyond RAG, semantic table retrieval techniques have been extensively studied. Early research enhanced query-table alignment by integrating keyword matching with table-aware representations. Recent domain-aware retrieval methods model table headers and cells independently, and studies have shown that preserving table structure improves retrieval accuracy [9,10].
Unlike existing approaches, this study focuses on granularity control at the operational system level. By decomposing tables into independent semantic units at the row level and applying a one-row, one-embedding strategy during document ingestion, the method shifts the filtering responsibility from the generator to the retriever. This design enhances retrieval determinism and interpretability and is particularly suitable for classical Chinese education datasets, where each table row represents a complete semantic record.
While works such as HD-RAG and TableRAG highlight the importance of table-aware retrieval, their focus is mainly on hybrid documents or SQL-style/database tables. In contrast, our work addresses document-embedded tables, where each row naturally corresponds to a complete semantic record. Therefore, our contribution lies not in replacing existing structured retrieval methods in general, but in introducing a loader-centered row-level modeling strategy tailored to this document setting.
3. Data Acquisition and Processing
3.1. Initial Corpus Construction
The classical Chinese corpus was collected from an official website for Chinese education textbooks, which we selected for its linguistic accuracy, reliability for teaching, and consistency with the official textbook contents. Compared with other webpages, this source contains far fewer annotation errors, advertising interference, and semantic inconsistencies [11].
We converted the raw data into Markdown documents and applied a cleaning step that filtered out ads, navigation text, duplicated titles, and non-annotated content. Next, for every poem or classical text, the following core fields were retrieved:
This step outputs a clean, normalized corpus that serves as the input for the semantic expansion described in the next subsection.
3.2. Semantic Expansion via Radial and Chain Extensions
In addition to the original corpus, we further expand the data using radial expansion and chain expansion. In radial expansion, the core meaning is extended to several semantically related meanings from different perspectives, while chain expansion models the step-by-step evolution of meaning through a sequence of intermediate semantic links.
Through these two mechanisms, polysemous words in classical Chinese are broken down into layered senses, producing additional entries that make explicit the semantic relationships usually left implicit in the original text [12].
Radial extension: a single original meaning extends in several different directions.
For example, the word “节” originally means a bamboo joint. From this original meaning arise the following derivations, as shown in Figure 1.
For trees, it refers to knots in the wood. The Book of the Later Han, “Biography of Yu Xu”: “盘根错节” (“twisted roots and gnarled knots”).
For animals, it refers to the joints.
For time, it is the solar term.
For music, it is rhythm. Bai Juyi’s “Pipa Xing”: “钿头银篦击节碎” (“her inlaid silver comb shattered as she beat the rhythm”).
For social politics, it refers to laws and proper limits. The Book of Rites, “Qu Li”: “礼不逾节” (“ritual does not overstep its proper bounds”).
For morality, it is integrity. Wen Tianxiang’s “Song of Righteousness”: “时穷节乃见” (“in times of hardship, integrity shows itself”).
For action, it is moderation and thrift. The Analects, “Xue Er”: “节用而爱人” (“be frugal in expenditure and love the people”).
This mode of derivation is flexible: the same original meaning can give rise to new senses from many different angles. When designing the data source, we therefore have to enumerate all the possible meanings produced by radial derivation [13].
Chain extension: meaning A gives rise to meaning B, and meaning B gives rise to meaning C, so that the senses are connected one after another like the links of a chain.
For example, “要” is the ancient form of “腰” (waist); the character depicts a person with both hands on the waist. The Shuowen Jiezi glosses it as “the middle of the body” (“要，身中也”), and Mozi, “Universal Love”, uses it in this original sense: “King Ling of Chu was fond of men with slender waists” (“昔者楚灵王好士细要”). Because of the position of the waist in the body, the word further acquired the meanings “middle”, “intercept midway”, “coerce”, “seek to obtain”, and “need”. Drawn as a chain, this is waist → (1) middle → (2) intercept midway → (3) coerce → (4) seek to obtain → (5) need (Figure 2).
The peculiarity of chain extension is that each meaning is directly related only to the two meanings adjacent to it and is remote from the others. Take “要” as an example: “need” is a commonly used meaning in modern Chinese, and at first glance it is entirely unrelated to the meaning “waist”. However, once the intermediate links between them are identified, the development of the word's meaning becomes clear. The two types of extension, “radial” and “chain”, are not strictly separate. For example, in addition to the meanings listed above, “要” has also developed other senses from different perspectives, such as “important”, “brief”, “essential”, and “agreement”.
The meaning “intercept midway” also leads to the meaning “invite”, as in Tao Yuanming’s “The Story of the Peach Blossom Spring”: “便要还家” (“he then invited him back to his home”). This is again a radial extension; represented graphically, the senses of a word form a tree diagram.
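The relationship between chain and radial extension described above can be sketched as a small sense tree. This is an illustrative data structure only, not the authors' corpus format: the English glosses are paraphrased from the “要” example, and the names `CHAIN_YAO`, `RADIAL_BRANCHES`, `build_tree`, and `derivation_path` are hypothetical.

```python
from collections import defaultdict

# Chain extension of "要": each sense derives from the one before it.
# Glosses are paraphrased from the example in the text (assumption).
CHAIN_YAO = ["waist", "middle", "intercept midway", "coerce", "seek to obtain", "need"]
# Radial branch: "invite" derives from "intercept midway".
RADIAL_BRANCHES = [("intercept midway", "invite")]


def chain_edges(senses):
    """Turn an ordered chain of senses into (parent, child) derivation edges."""
    return list(zip(senses, senses[1:]))


def build_tree(edges):
    """Group derived senses under the sense they come from."""
    tree = defaultdict(list)
    for parent, child in edges:
        tree[parent].append(child)
    return dict(tree)


def derivation_path(tree, root, target):
    """Return the list of senses leading from root to target, or None."""
    if root == target:
        return [root]
    for child in tree.get(root, []):
        path = derivation_path(tree, child, target)
        if path is not None:
            return [root] + path
    return None


sense_tree = build_tree(chain_edges(CHAIN_YAO) + RADIAL_BRANCHES)
print(" -> ".join(derivation_path(sense_tree, "waist", "need")))
print(" -> ".join(derivation_path(sense_tree, "waist", "invite")))
```

Storing each derivation as a parent-child edge lets one structure hold both extension types: a chain is a path with a single child at every step, and a radial extension is a node with several children.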
The expanded corpus is converted to vectors through LangChain and stored in the Chroma vector database, from which related vectors can be retrieved accurately and with low latency; this supports table-aware row-wise RAG. Examples of the processed spreadsheets are shown in Figure 3.
4. System Overview
The system was implemented in Python 3.12.3 using LangChain, Chroma, and Ollama. For vector generation, we used the nomic-embed-text embedding model via OllamaEmbeddings. For answer generation, we used qwen3:8b with temperature set to 0.1. Text content was split using RecursiveCharacterTextSplitter with chunk_size = 200 and chunk_overlap = 20, while tables were preserved as row-level semantic units by the Smart Table-Aware Loader. The vector store was built with Chroma. The system was deployed locally through Ollama.
Our implementation choices were made to support a fully local, lightweight, and reproducible pipeline while still meeting the experimental requirements of the study. Therefore, these settings were chosen as task-oriented engineering configurations that support resource-saving local deployment and reproducibility, rather than being claimed as the theoretical novelty of the work.
The proposed system (Figure 4) is designed solely for classical Chinese texts. It comprises three layers: (1) the document ingestion layer, (2) the representation and indexing layer, and (3) the retrieval-augmented generation layer.
4.1. Document Ingestion Layer
The document ingestion layer ingests the original classical Chinese resources, including the classical poems in the textbook and the radial and chain semantic expansion corpora, and converts these data into a machine-readable format.
In this step, the Smart Table-Aware Loader distinguishes narrative text from table content within the document. We extract the content while keeping an explicit representation of tables rather than turning them into plain text. This allows us to keep track of the semantic relations within each row afterwards, without disturbing the semantic coherence of row-wise information.
4.2. Representation and Indexing Layer
In this step, we represent and index the incoming documents as retrievable semantic vectors. Narrative text is segmented with a recursive splitting technique that maintains context while keeping segments at reasonable lengths. In contrast, tabular data are decomposed into semantically atomic rows that represent complete pieces of knowledge, such as poem–author–dynasty–interpretation.
Every text segment and table row is individually encoded using an embedding model. The resulting vectors are stored in a vector database alongside structured metadata covering the document type, table name, column names, row information, etc., which enables fine-grained search with semantic provenance.
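As a minimal sketch of the row-level representation described above, the following code turns each table row into one retrievable unit with provenance metadata. It is not the Smart Table-Aware Loader itself: the `RowUnit` class, the `rows_to_units` function, the metadata keys, the file name, and the sample interpretations are all hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass
class RowUnit:
    """One table row prepared for embedding (hypothetical structure)."""
    text: str
    metadata: dict


def rows_to_units(table_name, columns, rows, source):
    """Serialize each row as 'column: value' pairs and attach provenance metadata."""
    units = []
    for index, row in enumerate(rows):
        text = "; ".join(f"{col}: {val}" for col, val in zip(columns, row))
        units.append(RowUnit(
            text=text,
            metadata={
                "doc_type": "table_row",  # distinguishes rows from narrative chunks
                "table_name": table_name,
                "columns": list(columns),
                "row_index": index,
                "source": source,
            },
        ))
    return units


# Illustrative sample data; the interpretation column is invented for the example.
columns = ("poem", "author", "dynasty", "interpretation")
rows = [
    ("Quiet Night Thought", "Li Bai", "Tang", "homesickness under moonlight"),
    ("Spring Dawn", "Meng Haoran", "Tang", "regret for fallen blossoms"),
]
units = rows_to_units("poem_annotations", columns, rows, source="textbook.docx")
for unit in units:
    print(unit.metadata["row_index"], unit.text)
```

Keeping the column names inside the serialized text means the embedding sees which value plays which role, while the metadata lets a retrieved vector be traced back to its table and row.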
4.3. Retrieval-Augmented Generation Layer
The retrieval-augmented generation layer coordinates query comprehension, context retrieval, evidence fusion, and answer generation. In addition to the standard retrieval pipeline, the proposed system introduces Dual-Channel Retrieval and Fusion to jointly leverage authoritative textual context and model-driven semantic reasoning [14,15].
4.3.1. Channel I: RAG-Based Contextual Retrieval
The first channel follows the traditional RAG paradigm and focuses on context retrieval. Given a user query, it conducts a similarity search within the vector database to retrieve the related full-text classical Chinese poems, their original source texts, and textbook-aligned annotations. The retrieved source text usually contains the poem text, author, dynasty, and official interpretations.
These components are then passed to a local LLM, which reasons over the evidence retrieved from the local knowledge base. This enables the system to answer abstract questions, disambiguate across text boundaries, and stay faithful to the literature. The channel ensures that its answers remain consistent with what is taught in the educational texts, without introducing bias.
4.3.2. Channel II: Model-Driven Semantic Retrieval
The second channel conducts model-driven semantic search with the aid of the reasoning ability of a local LLM. In this channel, the user question and keywords extracted from the candidate poems are analyzed together to discover salient topics, ambiguous words, and implicit semantic cues.
The two channels are not intended to contribute in an identical or strictly additive manner. Instead, they play complementary roles within the same retrieval-augmented framework. Channel I provides authoritative text-based grounding for answer generation, whereas Channel II provides semantic focusing when the query requires finer-grained reasoning over preserved row-level records or implicit semantic cues. In the current system, route selection is performed according to question characteristics and evidence type. For example, when a query is identified as table-oriented, the system preferentially invokes the table-focused retrieval chain; otherwise, it uses the general RAG chain. If neither route retrieves sufficient evidence, the system falls back to direct answering by the local model.
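The route selection described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the system's actual router: the keyword heuristic in `is_table_oriented`, the `TABLE_HINTS` list, the `min_evidence` threshold, and the choice to try the general RAG chain after an empty table-chain result are all hypothetical.

```python
from typing import Callable, List, Tuple

# A retrieval chain maps a query to a list of evidence strings (assumption).
Retriever = Callable[[str], List[str]]

# Hypothetical cue words suggesting that the answer lives in a table row.
TABLE_HINTS = ("author", "dynasty", "annotation", "meaning of")


def is_table_oriented(query: str) -> bool:
    """Very rough stand-in for the system's question classification."""
    lowered = query.lower()
    return any(hint in lowered for hint in TABLE_HINTS)


def select_route(query: str, table_chain: Retriever, rag_chain: Retriever,
                 min_evidence: int = 1) -> Tuple[str, List[str]]:
    """Prefer the table chain for table-oriented queries, then general RAG,
    and fall back to direct answering when no route finds enough evidence."""
    candidates = []
    if is_table_oriented(query):
        candidates.append(("table", table_chain))
    candidates.append(("rag", rag_chain))
    for name, chain in candidates:
        evidence = chain(query)
        if len(evidence) >= min_evidence:
            return name, evidence
    return "direct", []


# Stub chains for demonstration only.
table_stub = lambda q: ["poem: Spring Dawn; author: Meng Haoran; dynasty: Tang"]
empty_stub = lambda q: []
print(select_route("Who is the author of Spring Dawn?", table_stub, empty_stub))
print(select_route("Describe the mood of the poem.", table_stub, empty_stub))
```

Returning the route name together with the evidence keeps the decision inspectable, which fits the emphasis on interpretability in the surrounding text.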
5. Table-Aware Row-Level Semantic Modeling
In this section, we describe the study’s main contribution: table-aware row-wise semantic modeling. Unlike traditional RAG approaches, in which a table is treated as an unstructured block of text, the proposed method uses the table structure to redefine the semantic granularity of retrieval and reasoning [16]. The detailed steps are shown in Figure 5, and representative code excerpts are provided in Figure 6.
5.1. Motivation: Limitations of Table-Level Encoding
In traditional RAG pipelines, tables are usually flattened to plain text and then encoded as a single vector. Although this is computationally cheap, it neglects the structural semantics of tables. In educational and philological data (especially for learning classical Chinese), each table row often corresponds to a complete fact or interpretation unit, for instance, one poem together with its author, dynasty, and interpretation [17].
5.2. Table-Aware Parsing and Structural Preservation
The platform performs a table-aware parse on ingested documents. It detects tables and stores them as tables rather than converting their contents to free text. Each table is split into rows, and the associations between the columns within each row are preserved [18,19].
Formally, a table $T$ comprising $n$ rows is represented as follows:
$$T = \{r_1, r_2, \ldots, r_n\},$$
where each row $r_i$ constitutes a semantically complete record containing multiple attributes. This structural preservation retains strong semantic correlations within a row while avoiding weak or unrelated correlations across rows.
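The decomposition of a table into its rows can be illustrated with a small parser for Markdown pipe tables. This is a simplified sketch rather than the authors' loader: it assumes a well-formed table whose cells contain no `|` characters, and the function name `parse_markdown_table` and the sample table are hypothetical.

```python
def parse_markdown_table(markdown: str):
    """Split a Markdown pipe table into one dict per row, keyed by column header."""
    cells = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown.splitlines()
        if line.strip().startswith("|")
    ]
    header, body = cells[0], cells[1:]
    # Drop the separator row (cells made only of dashes and colons).
    body = [row for row in body
            if not all(cell and set(cell) <= set("-:") for cell in row)]
    return [dict(zip(header, row)) for row in body]


MARKDOWN_TABLE = """
| poem | author | dynasty |
| --- | --- | --- |
| Quiet Night Thought | Li Bai | Tang |
| Spring Dawn | Meng Haoran | Tang |
"""

records = parse_markdown_table(MARKDOWN_TABLE)
for record in records:
    print(record)
```

Each resulting dict corresponds to one row of the table and keeps the column-to-value associations that flattening would lose.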
5.3. Comparison of Document Parsing and Loading Strategies
Current document ingestion pipelines for retrieval-augmented generation often rely on general-purpose loaders such as python-docx, unstructured, and docx2txt. Although these loaders can extract plain text from DOCX files, they do not handle well-structured educational content with rich tabular information, such as classical Chinese textbooks [20,21].
Although libraries such as python-docx give lower-level access to all the parts of a DOCX file, mapping the semantics between cells or rows requires additional hand-written logic and is cumbersome. Without that logic, tables are treated as raw text or loose groups of cells, which leads to serious semantic disintegration.
The docx2txt loader is designed for ease of use and low latency. Although it works well with prose-rich documents, it breaks the row–column associations when flattening a table into text. This can produce disordered table cells, forcing the LLM to rebuild the structure implicitly.
The unstructured loader keeps large atomic units such as headings, lists, and tables intact. However, it still handles tables at the block level: it treats each entire table as one block instead of splitting it into semantically meaningful rows, which leads to coarse retrieval granularity.
In contrast to approaches based on flattening or block encoding, the proposed Smart Table-Aware Loader implements row-wise semantic modeling. The loader identifies the table structure and then divides each table into a series of rows, each of which is a semantic unit. Each row is treated as an independent piece of knowledge while retaining the relationships between its columns. A detailed comparison is given in Table 1.
5.4. Loader Design Impact on Embedding and Retrieval
The Smart Table-Aware Loader establishes a one-to-one correspondence between semantic records and document units before embedding. Each row-level unit is independently embedded and indexed, ensuring that one vector corresponds to exactly one record.
In the retrieval process, similarity matching is performed on complete row-level semantic units rather than on noisy blocks, so the retriever can directly select the relevant records, which reduces the implicit inference required of the generative model.
The novelty lies not in the embedding architecture itself but in rethinking the data loading process so that semantics are captured as early as possible: the Smart Table-Aware Loader provides the foundation for row-wise embedding with precise retrieval and stable generation.
Unlike methods that preserve tables mainly at the table or block level, our approach performs semantic decomposition earlier during document ingestion. The Smart Table-Aware Loader establishes a one-to-one correspondence between a semantic record and an embedding unit, so that record selection is handled primarily by the retriever rather than being left implicitly to the generator.
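The effect of the one-record, one-vector correspondence can be seen in a toy comparison. The sketch below uses a bag-of-words cosine similarity as a deliberately crude stand-in for the nomic-embed-text embeddings used in the system, and the three sample rows are illustrative; only the contrast between row-level and block-level units is the point.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase token counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, units, k=1):
    """Rank units by similarity to the query and return the best k."""
    q = embed(query)
    return sorted(units, key=lambda u: cosine(q, embed(u)), reverse=True)[:k]


rows = [
    "poem: Quiet Night Thought author: Li Bai dynasty: Tang",
    "poem: Spring Dawn author: Meng Haoran dynasty: Tang",
    "poem: Prelude to Water Melody author: Su Shi dynasty: Song",
]
block = [" ".join(rows)]  # block-level loading: the whole table is one unit

query = "who wrote Spring Dawn"
print(top_k(query, rows))   # the retriever selects the single matching record
print(top_k(query, block))  # the generator would still have to find the row
```

With row-level units the retriever returns exactly one record, whereas with the block-level unit every row comes back together and the selection is deferred to the generator.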
5.5. Advantages over Conventional RAG
The advantages of the proposed table-aware row-wise semantics model, compared with existing table encoders, are as follows:
Increased retrieval accuracy due to minimized semantic conflicts among unrelated records.
Improved explainability through clear row-level evidence.
Reduced hallucination by limiting implicit reasoning in the LLM.
Better compatibility with structured educational knowledge and classical Chinese learning scenarios.
Taken together, these properties make row-level semantic modeling a principled and effective foundation for table-aware retrieval-augmented generation. Conventional RAG pipelines flatten a table to text and encode it with a single embedding. As shown in Figure 7 (left), a query then fetches the entire table, forcing the LLM to reason over multiple rows and implicitly infer relevance, which makes the procedure nondeterministic and susceptible to hallucination.
6. Experimental Evaluation
6.1. Minimal Row-Level Retrieval Benchmark
To assess the performance of row-level semantic retrieval, we design a minimal benchmark built directly on the tables in the corpus, where each row is a complete semantic knowledge entry consisting of a poem, a keyword, and the expanded semantics of that keyword obtained by radial expansion and chain expansion [22,23,24]. The dataset contains 64 questions on representative poems covering diverse semantic phenomena, including metaphorical symbolism, political allegory, and philosophical notions. Each question is paired with a single ground-truth row, which makes the results easy to analyze.
Our question generator targets the underlying meanings and concepts behind each word rather than simple keyword lookup. When a question states what a word in a poem symbolizes or how it relates to other words, the model can locate the relevant rows more precisely. In this way, we simulate how learners study classical Chinese texts and thereby test the semantic recognition ability of the retriever. The query test set is shown in Figure 8.
We define a new metric called Row Hit@k, which evaluates whether the ground-truth semantic row corresponding to a given query appears in the first k retrieved items or not.
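Let $q$ denote a query; $R_k(q)$ denote the set of top-k rows retrieved for $q$; and $r_q^{*}$ denote the unique ground-truth row associated with $q$.
Row Hit@k is defined as follows:
$$\text{RowHit@}k(q) = \begin{cases} 1, & \text{if } r_q^{*} \in R_k(q), \\ 0, & \text{otherwise.} \end{cases}$$
For a set of queries $Q$, the overall Row Hit@k score is computed as the average:
$$\text{RowHit@}k = \frac{1}{|Q|} \sum_{q \in Q} \text{RowHit@}k(q),$$
where $Q$ denotes the set of queries, $r_q^{*}$ the ground-truth semantic row, and $R_k(q)$ the top-k retrieved rows.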
A retrieval result is considered a hit if the correct document is found in the first k results. The results presented in this experiment consider k = 1, 3, and 5. All retrieval setups—including embedding models, vector databases, and similarity metrics—were standardized between the various document loaders.
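The metric can be computed with a few lines of code. The sketch below assumes that each query result is available as a ranked list of row identifiers paired with the identifier of its ground-truth row; the function names and the four toy results are hypothetical and are not taken from the benchmark.

```python
def row_hit_at_k(retrieved, truth, k):
    """Return 1 if the ground-truth row is among the first k retrieved rows, else 0."""
    return 1 if truth in retrieved[:k] else 0


def mean_row_hit_at_k(results, k):
    """Average Row Hit@k over (ranked_row_ids, ground_truth_id) pairs."""
    return sum(row_hit_at_k(ranked, truth, k) for ranked, truth in results) / len(results)


# Toy results: ranked row IDs per query, plus the correct row ID.
toy_results = [
    (["r1", "r2", "r3", "r4", "r5"], "r1"),  # hit at rank 1
    (["r5", "r2", "r9", "r1", "r3"], "r2"),  # hit at rank 2
    (["r7", "r8", "r9", "r3", "r4"], "r4"),  # hit at rank 5
    (["r1", "r2", "r3", "r4", "r5"], "r6"),  # miss
]

for k in (1, 3, 5):
    print(f"Row Hit@{k} = {mean_row_hit_at_k(toy_results, k):.2f}")
```

Because each query has exactly one ground-truth row, the score is non-decreasing in k, which matches the pattern reported for k = 1, 3, and 5.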
The test is not for broad statistical validation, but is a method-oriented diagnostic assessment to verify whether table-aware row-level modeling can reliably access target semantic data and reduce reliance on implicit inference during generation.
Table 2 shows the Row Hit@k scores for the different document loaders on the minimal row-level retrieval task. The python-docx, docx2txt, and unstructured loaders obtained low Row Hit@1 scores of 0.28 to 0.34, suggesting that when tables are handled at the block level, the retriever cannot directly identify the correct semantic row even when contextual information is available. Although a larger k improves their scores, the gain is limited by semantic interference from fragmented or mixed rows. The Smart Table-Aware Loader achieves high row hit rates at all values of k, with a Row Hit@1 of about 0.69, meaning that the retriever ranks the correct semantic row first for roughly 70% of the queries. As k increases to 3 and 5, the hit rate rises to 0.88 and 0.94, respectively, indicating that the correct row is almost always among the retrieved results.
Our experiments indicate that row-wise semantic modeling improves retrieval determinism. The Smart Table-Aware Loader mitigates cross-row semantic entanglement because it shifts the filtering operation from generation time to retrieval time. The LLM therefore has less need to infer the relevant row from raw table data, which reduces the potential for hallucinations and is consistent with our original motivation (Section 5).
6.2. User Evaluation Based on Student Feedback
In addition to establishing a test set for evaluation, a user-based survey was conducted to assess the educational impact and practicality of the system. A total of 38 students participated in the evaluation and scored the system on a five-point Likert scale, where 1 represented the lowest score and 5 the highest. In the evaluation, students used the system to search for the interpretations of specific words in ancient poems and rated their satisfaction with the search results. We asked about the search results rather than the system itself because students perceive only the results, not the system's internals.
The current student-based questionnaire can only reflect users’ perceptions of the system’s effectiveness in practice and does not allow us to draw strong conclusions about the system itself. Therefore, our next step will be to conduct a more systematic evaluation involving frontline teachers. We believe that feedback from experienced teachers will provide substantial support for a more rigorous assessment of the system.
Figure 9 presents the student score distribution. The results show that 81.6% of the participants gave a score of 4 or 5, reflecting high overall satisfaction. Only one student gave a score below 3, indicating that negative experiences were rare.
6.3. Inference Efficiency Across Different Language Models
We measure the average end-to-end latency of the system with different Qwen models while keeping the retrieval and prompt-construction steps fixed, and we use this latency as the indicator of computational efficiency.
Figure 10 compares the response times of the three models. The average response time of Qwen2-7B is 11.2 s, Qwen2.5-7B shortens it to 9.5 s, and Qwen3-8B is the fastest with a mean latency of only 8.3 s. In this setup, the newer and more capable local models tended to respond faster, even when the parameter count was slightly higher, as with Qwen3-8B.
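A latency comparison of this kind can be collected with a small timing harness. The sketch below is an assumption about how such a measurement might be set up, not the authors' script: `average_latency` is a hypothetical helper, `answer_fn` stands for whatever callable runs the full pipeline for one query, and the injectable `clock` exists only to make the helper testable.

```python
import time
from statistics import mean


def average_latency(answer_fn, queries, clock=time.perf_counter):
    """Run answer_fn on every query and return the mean wall-clock time in seconds."""
    durations = []
    for query in queries:
        start = clock()
        answer_fn(query)  # full pipeline call: retrieval, prompt, generation
        durations.append(clock() - start)
    return mean(durations)


# Demonstration with a stand-in pipeline that does no real work.
demo_queries = ["query one", "query two", "query three"]
print(f"{average_latency(lambda q: q.upper(), demo_queries):.6f} s")
```

Running the same query list against each model backend, with retrieval and prompt construction unchanged, isolates the generation model as the only varying factor.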
6.4. Limitations of the Benchmark
The row-level retrieval benchmark has the following limitations:
(1) The benchmark covers only poems from classical Chinese textbooks, because their rows are easier to label and retrieve correctly; it therefore does not provide a complete statistical evaluation or cover a broad spectrum of genres and historical epochs.
(2) The benchmark is based on well-structured spreadsheet data, in which each row constitutes a semantically consistent knowledge unit. Although this premise holds for many educational and literature-based materials, the proposed evaluation method may not fully reflect retrieval performance on noisy, loosely structured tables.
(3) The benchmark focuses not on end-to-end response quality but on retrieval determinism and semantic coherence. While this goal is in line with the aims of the current work, future research may use larger benchmarks and more granular task evaluations to better test how well row-level semantic representation generalizes to other domains or applications.
(4) The present experiments validate the effectiveness of row-level semantic preservation relative to flattened or block-level document loading. However, since the current study does not include a head-to-head comparison with systems such as TableRAG, the results should be interpreted as evidence for the value of loader-centered row-level granularity in this document setting, rather than as a universal claim of superiority over all structured retrieval approaches.
7. Conclusions
In this study, we propose a table-aware row-level RAG method for classical Chinese comprehension over highly structured educational knowledge such as textbooks. Because each row is treated separately, the proposed method mitigates semantic confusion between irrelevant entries and shifts fine-grained discrimination from the generator to the retrieval component. This enables more deterministic grounding and better explainability than traditional table-level encodings [25,26].
From an engineering perspective, we believe our main novelty lies not in modifying the underlying language model but in how a document is ingested and semantically represented. Our experiments indicate that controlling the granularity of retrieval in a principled way improves retrieval accuracy and is accompanied by favorable inference efficiency and user satisfaction, with no need for model retraining or additional computation cost.
It should be noted that the proposed system assumes reasonably well-structured tabular data. Handling highly noisy or weakly structured tables remains a challenge for future work. In addition, while the proposed framework is evaluated on classical Chinese learning tasks, the underlying design principles are not limited to this domain and may be extended to other applications involving structured knowledge retrieval [27,28].
In conclusion, our experiments show that table-aware row-level semantic modeling facilitates simple yet effective information retrieval, thus providing a robust foundation for RAG systems within education domains.