Article

TechDocRAG: Relation-Preserving Retrieval-Augmented Generation (RAG) for Technical Documents

1
Department of EECI Engineering, Hanyang University, Seoul 04763, Republic of Korea
2
Division of Electrical Engineering, Hanyang University, Seoul 04763, Republic of Korea
*
Author to whom correspondence should be addressed.
AI 2026, 7(5), 161; https://doi.org/10.3390/ai7050161
Submission received: 15 March 2026 / Revised: 21 April 2026 / Accepted: 27 April 2026 / Published: 6 May 2026

Abstract

Technical documents differ from general text corpora in ways that complicate retrieval-augmented generation (RAG). Evidence for a single answer is often distributed across numbered clauses, tables, figures, captions, and ordered procedures rather than expressed in one passage. Standard RAG pipelines typically flatten these elements into independent chunks. This can break the document relations needed for exact evidence tracing. We introduce TechDocRAG, a relation-preserving framework for technical document question answering. The framework represents each document as a heterogeneous element graph and aligns three retrieval views for each element: technical identifiers, semantic summaries, and raw evidence. At query time, retrieval proceeds from identifier-aware recall to summary-level reranking and raw evidence bundling. We evaluate TechDocRAG on four benchmarks with more than 7500 evaluated question–answer pairs covering product manuals, engineering documents, and long multimodal PDFs. Across the suite, TechDocRAG improves the mean end-to-end score by 20.3 points over the strongest flat baseline and by 9.3 points over the strongest non-flat baseline. On the evidence-annotated subset, the strict raw evidence hit rate increases from 0.510 to 0.942. Resource profiling shows query-time latency comparable to standard hybrid retrieval. Robustness tests show gradual degradation under relation loss, but clear sensitivity to severe identifier corruption. Overall, the results indicate that reliable RAG for technical documents depends less on retrieving more passages than on preserving the relations that make evidence interpretable.

1. Introduction

Technical documents are core references in engineering practice. Specifications, user guides, product manuals, and standards describe how systems are designed, configured, and operated. They differ from open-domain prose in both structure and evidentiary form. Evidence is often distributed across clauses, tables, figures, captions, and procedures. A definition may appear in one clause, while the valid range appears in a table and the operating condition is clarified in a later step. Benchmarks on manuals, engineering documents, and long multimodal PDFs show that this pattern is common and that text-only passage retrieval is often insufficient [1,2,3,4].
RAG provides a useful basis for this problem because it grounds generation in external evidence rather than relying only on model memory [5]. Recent general-purpose RAG methods improve the retrieval pipeline in several ways. Self-RAG adds adaptive retrieval and self-reflection [6]. CRAG repairs weak retrieval results [7]. RAPTOR searches over recursive summary trees [8]. GraphRAG, LightRAG, and HippoRAG 2 use graph structure or memory-like propagation [9,10,11]. VisRAG extends retrieval to visual document representations [12]. These methods provide strong comparison points, but their retrieval units are usually passages, summaries, entities, graph communities, or pages.
Technical document QA often requires a smaller and more precise evidence unit. Many queries are anchored by exact identifiers, such as clause numbers, parameter names, figure labels, or revision tags. The answer also depends on a linked set of elements. Examples include a clause and its cited table, a figure and its caption, or a troubleshooting step and its neighboring steps. A generic retriever may find related text without returning the complete evidence chain.
Failures may also arise before retrieval. OCR and PDF parsing can distort identifiers, units, layout, or figure–caption alignment. Retrieval then starts from a damaged representation. OCR Hinders RAG shows that such errors propagate into both retrieval and generation [13]. For technical documents, high recall alone is therefore not enough. The system must also preserve the relations that make the evidence interpretable.
TechDocRAG addresses this problem by keeping document elements and their links explicit. It parses each document into typed elements, including clauses, paragraphs, tables, figures, captions, sections, and procedure steps. It then performs retrieval in three stages: identifier-aware recall, summary-level reranking, and raw evidence bundling. The parsed elements remain retrieval units throughout the pipeline instead of being reduced to anonymous text chunks.

2. Related Work

2.1. Hybrid and Lexical–Semantic Retrieval

Hybrid retrieval combines exact lexical matching with dense semantic retrieval. It is a common baseline for document QA, especially when queries mix domain terms with natural-language paraphrases. In technical documents, two limitations are recurrent. First, OCR errors, formatting changes, or token fragmentation can damage exact identifiers. Second, linked evidence is easily separated. A system may retrieve the relevant clause but miss the table or figure cited by that clause. Hybrid retrieval is therefore competitive, but it does not fully address technical document QA.

2.2. Adaptive and Corrective RAG

Self-RAG and CRAG make retrieval more selective. Self-RAG learns when to retrieve and how to critique a draft answer [6]. CRAG estimates retrieval quality and repairs weak results [7]. These mechanisms are useful in noisy corpora and serve as strong general baselines. Their focus, however, is not the same as ours. Technical document QA often fails because the retrieved material omits linked evidence, not because the system simply chose the wrong number of passages. Self-RAG and CRAG do not explicitly model those element-to-element relations.

2.3. Hierarchical, Graph-Based, and Memory-Oriented RAG

RAPTOR, GraphRAG, LightRAG, and HippoRAG 2 organize retrieval with richer structures. RAPTOR uses recursive summary trees [8]. GraphRAG builds graph communities and summarizes them for a query [9]. LightRAG combines graph indexing with dual-level retrieval [10]. HippoRAG 2 treats retrieval as a memory problem and uses graph propagation with deeper passage integration [11]. These methods improve long-document retrieval and reasoning. Their structures are mainly semantic, however. They do not directly preserve literal document relations such as clause references, caption links, table lookups, and procedure order.

2.4. Multimodal Document RAG and Evidence Fidelity

VisRAG highlights an important issue in document QA: visual layout may be part of the evidence, and text extraction can discard it [12]. OCR Hinders RAG reaches a related conclusion from the parsing side. When the parsed representation is wrong, downstream retrieval quality drops [13]. Technical document QA therefore needs more than a cleaned text transcript. It requires raw evidence objects and explicit links among them.

2.5. Technical Document QA Benchmarks and Positioning

Recent benchmarks clarify the scope of the task. MPMQA uses product manuals and evaluates both page retrieval and answer generation [1]. DesignQA tests grounded understanding over engineering regulations, CAD images, and drawings [2]. MMLongBench-Doc and LongDocURL focus on long-document reasoning, cross-page evidence, and evidence location in visually rich PDFs [3,4]. These benchmarks show that the hard part is not only finding related text. The system must also localize and connect different evidence types. TechDocRAG targets this gap. Compared with adaptive and corrective RAG, it focuses on relation-preserving evidence units. Compared with hierarchical and graph-based RAG, it focuses on document-element connectivity rather than abstract knowledge organization. Compared with multimodal document RAG, it adds identifier-aware retrieval and raw evidence traceability for technical documents.

3. Problem Definition

3.1. Technical Documents as Heterogeneous Element Graphs

Let $\mathcal{C} = \{d_1, \ldots, d_N\}$ denote a corpus of technical documents. Each document $d$ may be a specification, user guide, maintenance manual, or technical standard. We represent $d$ as a heterogeneous element graph
\[ G_d = (V_d, E_d), \]
where each node $v \in V_d$ is one document element. Examples include a clause, paragraph, table, figure, caption, section, or procedure step. Each node has the form
\[ v = (\tau_v, r_v, k_v, s_v, m_v), \]
where $\tau_v$ is the element type, $r_v$ is the raw document object, $k_v$ is the set of technical identifiers and keywords, $s_v$ is a semantic summary, and $m_v$ is metadata. The metadata include the page index, bounding box, section path, document type, version, and normative label.
The edge set preserves both structure and references. Structural edges include contains, precedes, same_section, and step_next. Referential edges include clause_ref, table_ref, figure_ref, caption_of, same_identifier, version_of, and supersedes. The graph therefore keeps dependencies that flat chunking usually removes.
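As an illustration of the element graph described above, the following minimal Python sketch stores typed nodes and labeled edges. The class names, field names, and sample content are assumptions made for this example; the paper does not specify an implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One node v = (type, raw object, identifiers, summary, metadata)."""
    node_id: str
    etype: str                     # e.g. "clause", "table", "figure", "caption"
    raw: str                       # raw document object (plain text stand-in)
    identifiers: set = field(default_factory=set)
    summary: str = ""
    meta: dict = field(default_factory=dict)


class ElementGraph:
    """Directed graph whose edges carry a relation label."""

    def __init__(self):
        self.nodes = {}
        self.edges = []            # (source_id, relation, target_id)

    def add_node(self, element):
        self.nodes[element.node_id] = element

    def add_edge(self, src, relation, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, relation, dst))

    def neighbors(self, node_id, relations):
        """Targets reachable from node_id over the given relation types."""
        return [d for s, r, d in self.edges if s == node_id and r in relations]


# Hypothetical sample content: a clause that cites a table with a caption.
graph = ElementGraph()
graph.add_node(Element("c1", "clause", "Torque limits are given in Table 3.", {"4.2.1"}))
graph.add_node(Element("t3", "table", "Bolt | Torque", {"Table 3"}))
graph.add_node(Element("cap3", "caption", "Table 3. Torque limits.", {"Table 3"}))
graph.add_edge("c1", "table_ref", "t3")
graph.add_edge("cap3", "caption_of", "t3")
```

Keeping the relation label on every edge is what lets later stages follow only the link types relevant to a query, which flat chunking cannot express.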

3.2. Task Formulation

For a query $q$, retrieval should return more than isolated passages. It should return an evidence subgraph over the corpus-level graph $G = \bigcup_{d \in \mathcal{C}} G_d$. The target subgraph $H_q \subseteq G$ must be relevant and structurally complete. We write the objective as
\[ H_q^{*} = \arg\max_{H \subseteq G} \big( \mathrm{Rel}(q, H) + \lambda\, \mathrm{Conn}(H) + \gamma\, \mathrm{Valid}(q, H) - \mu\, \mathrm{Cost}(H) \big), \]
where $\mathrm{Rel}(q, H)$ measures lexical and semantic relevance, $\mathrm{Conn}(H)$ measures evidence connectivity and completeness, $\mathrm{Valid}(q, H)$ measures metadata consistency such as version or document type, and $\mathrm{Cost}(H)$ represents retrieval and context-construction cost.
The answer generator is conditioned on the selected evidence subgraph:
\[ (y_q, \Pi_q) = f_\theta(q, H_q^{*}), \]
where $\Pi_q$ stores provenance links from generated claims to raw evidence nodes. This formulation makes the retrieval target explicit. The goal is to recover the connected evidence needed for grounded technical QA, not merely to find semantically related content.
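To make the trade-off in the objective concrete, the sketch below scores two hand-made candidate subgraphs. It assumes that the cost term enters with a negative sign, and the weights and component scores are invented for illustration only.

```python
def subgraph_score(rel, conn, valid, cost, lam=0.5, gamma=0.3, mu=0.1):
    """Rel + lambda * Conn + gamma * Valid - mu * Cost (illustrative weights)."""
    return rel + lam * conn + gamma * valid - mu * cost


# Hypothetical component scores for two candidate evidence subgraphs.
candidates = {
    "clause_only": dict(rel=0.80, conn=0.0, valid=1.0, cost=1.0),
    "clause_plus_table": dict(rel=0.75, conn=1.0, valid=1.0, cost=2.0),
}

best = max(candidates, key=lambda name: subgraph_score(**candidates[name]))
```

With these numbers the connected candidate is selected even though its relevance term is slightly lower, which is the behavior the connectivity term is meant to encourage.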

4. Proposed Framework

4.1. System Overview

TechDocRAG operates in four stages. It first parses each document into heterogeneous elements and records structural and referential links. It then aligns each element with three views: technical identifiers, semantic summaries, and raw document objects. At query time, it infers the retrieval intent and selects the relation types used for expansion. Retrieval then proceeds from identifier recall to summary-level reranking and raw evidence bundling before the answer is generated with provenance.
Figure 1 summarizes the architecture. The figure separates offline index construction from online query-time retrieval and grounded generation.

4.2. Relation-Preserving Parsing and Canonicalization

Document parsing follows the units that readers use to navigate technical material: headings, clauses, paragraphs, tables, figures, captions, list items, and procedure steps. The system converts these units into typed graph nodes and stores layout metadata such as page position, bounding boxes, section hierarchy, and local order.
This stage also canonicalizes technical identifiers. The same evidential concept may appear under several surface forms, including variant clause citations, parameter aliases, command syntax, release names, and version labels. We normalize these identifiers before indexing. For each node $v$, identifier extraction and summary generation are defined as
\[ k_v = \mathrm{ExtractId}(r_v, m_v), \qquad s_v = \mathrm{Summarize}(r_v, \mathcal{N}(v)), \]
where $\mathcal{N}(v)$ denotes the local relation neighborhood of $v$. The local neighborhood is included because many elements are only interpretable with nearby context.
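Identifier canonicalization can be sketched with a small rule-based normalizer. The alias table, the regular expression, and the `kind:number` output format below are assumptions for this example, not the normalization rules used in the paper.

```python
import re

# Hypothetical alias table: surface form -> canonical element kind.
KIND_ALIASES = {
    "clause": "clause", "cl": "clause", "sec": "clause",
    "section": "clause", "§": "clause",
    "table": "table", "tab": "table", "tbl": "table",
    "figure": "figure", "fig": "figure",
}

# A kind word (or "§"), an optional dot, then a dotted number such as 4.2.1.
ID_PATTERN = re.compile(r"(?P<kind>§|[A-Za-z]+)\.?\s*(?P<num>\d+(?:\.\d+)*[a-z]?)")


def canonicalize_identifiers(text):
    """Return canonical identifiers such as 'figure:3' found in text."""
    found = set()
    for match in ID_PATTERN.finditer(text):
        kind = KIND_ALIASES.get(match.group("kind").lower())
        if kind is not None:       # ignore words that are not known kinds
            found.add(f"{kind}:{match.group('num')}")
    return found


ids = canonicalize_identifiers("See Fig. 3 and Tbl 2.1; Clause 4.2.1 applies.")
```

Mapping variant citations to one canonical key means that a query mentioning "Figure 3" and a paragraph mentioning "Fig. 3" meet at the same index entry.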

4.3. Identifier–Summary–Raw Database Construction

After parsing, each element is stored in three aligned views. For each element $v$, the identifier set, summary, raw object, and metadata remain tied to the same element identity:
\[ \phi(v) = \{k_v, s_v, r_v, m_v\}. \]
At the corpus level, the database is defined as
\[ \mathcal{D} = (\mathcal{I}_{\mathrm{id}}, \mathcal{I}_{\mathrm{sum}}, \mathcal{R}, G), \]
where $\mathcal{I}_{\mathrm{id}}$ is the identifier index, $\mathcal{I}_{\mathrm{sum}}$ is the summary index, $\mathcal{R}$ is the raw evidence store, and $G$ is the relation store. The stores serve different purposes. The identifier index keeps sparse lexical evidence, including clause numbers, section paths, parameter names, command tokens, API names, error codes, figure labels, table identifiers, units, and other domain-specific anchors. The summary index stores compact semantic summaries for elements and their local contexts. The raw store preserves modality-native evidence, such as verbatim text spans, structured tables, figure regions paired with captions, and ordered procedure segments.
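The alignment of the three views can be illustrated with a toy in-memory store in which every view is keyed by the same element id. The class and method names are hypothetical, and a real system would likely back each view with a dedicated index.

```python
from collections import defaultdict


class TechDocStore:
    """Toy stand-in for the four stores; all names are illustrative."""

    def __init__(self):
        self.id_index = defaultdict(set)     # identifier -> element ids
        self.summaries = {}                  # element id -> summary text
        self.raw = {}                        # element id -> raw evidence object
        self.relations = defaultdict(list)   # element id -> [(relation, target)]

    def add(self, elem_id, identifiers, summary, raw):
        for ident in identifiers:
            self.id_index[ident].add(elem_id)
        self.summaries[elem_id] = summary
        self.raw[elem_id] = raw

    def link(self, src, relation, dst):
        self.relations[src].append((relation, dst))

    def lookup(self, identifier):
        """Resolve an identifier to (element id, summary, raw) triples."""
        ids = sorted(self.id_index.get(identifier, ()))
        return [(i, self.summaries[i], self.raw[i]) for i in ids]


# Hypothetical sample content.
store = TechDocStore()
store.add("t3", {"Table 3", "torque"}, "Torque limits per bolt size",
          {"rows": [["M8", 25], ["M10", 50]]})
store.add("c1", {"4.2.1", "torque"}, "Clause requiring the Table 3 torque",
          "Tighten bolts per Table 3.")
store.link("c1", "table_ref", "t3")
```

Because an identifier hit, a summary, and a raw object all share one key, a match in any view can be traced to the exact evidence object without re-searching.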

4.4. Query Analysis and Intent-Aware Graph Expansion

Technical document queries require different retrieval paths. We therefore decompose each query into three components:
\[ (k_q, e_q, z_q) = \mathrm{Analyze}(q), \]
where $k_q$ is the set of extracted technical identifiers and keywords, $e_q$ is the semantic query representation, and $z_q$ is the query intent. Typical intents include definition lookup, requirement lookup, procedural guidance, troubleshooting, comparison, multimodal interpretation, and version-sensitive reasoning. The predicted intent determines the graph expansion policy $\Omega(z_q)$.
Figure 2 shows the query-time control flow. The flowchart focuses on online decisions rather than the full system architecture. It shows how identifier extraction and intent prediction select the relation expansion policy before summary reranking, raw evidence bundling, and grounded generation.
Table 1 makes the query-time policy explicit by listing the relation types, traversal depth, and evidence bundles used for each intent class.
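A rough sketch of query analysis and policy selection follows. The keyword cues, the policy entries, and the traversal depths are placeholders chosen for this example and do not reproduce the paper's Table 1; the query embedding is omitted for brevity.

```python
import re

# Hypothetical expansion policy: intent -> (relation types to follow, depth).
EXPANSION_POLICY = {
    "procedure": ({"step_next", "precedes"}, 2),
    "version": ({"version_of", "supersedes"}, 1),
    "multimodal": ({"figure_ref", "caption_of"}, 1),
    "requirement": ({"clause_ref", "table_ref"}, 2),
    "definition": ({"same_identifier"}, 1),
}

# Hypothetical keyword cues, checked in order; the first match wins.
INTENT_CUES = [
    ("procedure", ("how do i", "steps", "procedure")),
    ("version", ("version", "revision", "release")),
    ("multimodal", ("figure", "diagram", "shown")),
    ("requirement", ("require", "shall", "limit", "maximum", "minimum")),
]

ID_PATTERN = re.compile(r"\b(?:clause|table|figure)\s+\d+(?:\.\d+)*")


def analyze(query):
    """Return (identifiers, intent); the embedding e_q is omitted here."""
    lowered = query.lower()
    identifiers = set(ID_PATTERN.findall(lowered))
    intent = "definition"          # default when no cue matches
    for name, cues in INTENT_CUES:
        if any(cue in lowered for cue in cues):
            intent = name
            break
    return identifiers, intent


ids, intent = analyze("What torque does Clause 4.2.1 require in Table 3?")
relations, depth = EXPANSION_POLICY[intent]
```

A learned intent classifier could replace the keyword cues; the point of the sketch is only that the predicted intent selects which relation types are traversed.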

4.5. Coarse-to-Fine Retrieval and Grounded Generation

At query time, retrieval has three steps: identifier-aware recall, summary-level reranking, and raw evidence resolution. The first step retrieves candidate elements from the identifier index:
\[ C_{\mathrm{id}} = \operatorname{TopK}_{v \in V} \big[ \lambda_1\, \mathrm{BM25}(q, k_v) + \lambda_2\, \mathrm{IdMatch}(k_q, k_v) + \lambda_3\, \mathrm{MetaMatch}(q, m_v) \big]. \]
This step captures exact anchors such as clause numbers, parameter identifiers, release tags, command strings, and table or figure labels. The second step expands the candidate set according to the intent-specific relation policy,
\[ \tilde{C}_{\mathrm{id}} = C_{\mathrm{id}} \cup \mathrm{Expand}(C_{\mathrm{id}}, \Omega(z_q)), \]
and reranks the expanded candidates in summary space:
\[ C_{\mathrm{sum}} = \operatorname{TopL}_{v \in \tilde{C}_{\mathrm{id}}} \big[ \alpha \cos(e_q, e_v) + \beta\, \mathrm{RelScore}(v, C_{\mathrm{id}}) + \gamma\, \mathrm{TypePrior}(z_q, \tau_v) \big]. \]
The third step resolves reranked summary nodes into raw evidence bundles:
\[ C_{\mathrm{raw}} = \bigcup_{v \in C_{\mathrm{sum}}} \mathrm{Bundle}(v). \]
Bundling assembles relation-complete evidence units. Examples include a clause with its referenced table, or a figure crop with its caption and referring paragraph. The final evidence set is packed under a context budget $B$,
\[ E_q = \mathrm{Pack}(C_{\mathrm{raw}}, B), \]
and passed to the answer generator:
\[ (\hat{y}_q, \Pi_q) = g_\theta(q, E_q). \]
Algorithm 1 summarizes the retrieval and bundling procedure.
Algorithm 1. TechDocRAG coarse-to-fine retrieval pipeline.
Require: Query $q$; element graph $G = (V, E)$; indices $(\mathcal{I}_{\mathrm{id}}, \mathcal{I}_{\mathrm{sum}})$; budget $B$; intent policy $\Omega$
Ensure: Grounded answer $y_q$ and provenance set $\Pi_q$
  Query analysis
   1: $(k_q, e_q, z_q) \leftarrow \textsc{Analyze}(q)$ ▹ Extract identifiers, query embedding, and intent
  Level 1: Identifier-aware recall
   2: $C_{\mathrm{id}} \leftarrow \textsc{Search}(\mathcal{I}_{\mathrm{id}}, k_q, \mathrm{top\_k} = 10)$ ▹ Direct target recall
  Level 2: Intent-aware graph expansion
   3: $C_{\mathrm{expanded}} \leftarrow C_{\mathrm{id}}$
   4: for all $v \in C_{\mathrm{id}}$ do
   5:     $C_{\mathrm{expanded}} \leftarrow C_{\mathrm{expanded}} \cup \textsc{Traverse}(G, v, \Omega(z_q), 2)$
   6: end for
  Level 3: Summary-level reranking
   7: $C_{\mathrm{sum}} \leftarrow \textsc{Rerank}(C_{\mathrm{expanded}}, e_q, \mathcal{I}_{\mathrm{sum}}, \mathrm{top\_l} = 5)$ ▹ Filter via semantic similarity
  Level 4: Raw evidence bundling
   8: $C_{\mathrm{raw}} \leftarrow \emptyset$
   9: for all $v \in C_{\mathrm{sum}}$ do
  10:     $C_{\mathrm{raw}} \leftarrow C_{\mathrm{raw}} \cup \textsc{FetchRaw}(v) \cup \textsc{FetchConnected}(G, v, \{\mathrm{caption}, \mathrm{table}\})$
  11: end for
  Generation
  12: $E_q \leftarrow \textsc{Pack}(C_{\mathrm{raw}}, B)$
  13: $(y_q, \Pi_q) \leftarrow \textsc{Generate}(q, E_q)$
  14: return $y_q, \Pi_q$
We define provenance at the claim level as
\[ \Pi_q = \{(c_i, U_i)\}_{i=1}^{M}, \qquad U_i \subseteq C_{\mathrm{raw}}, \]
where $c_i$ is a generated claim and $U_i$ is the set of raw evidence objects supporting that claim.
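The retrieval and bundling steps of Algorithm 1 can be sketched end to end on a toy corpus. Everything below is a simplification under stated assumptions: identifier recall is set overlap rather than BM25, reranking uses token overlap between the query and each summary instead of embedding similarity, the budget is counted in words, and the corpus, edges, and function names are invented. Generation and provenance are omitted.

```python
def tokens(text):
    return set(text.lower().replace("?", " ").replace(".", " ").split())


# Hypothetical toy corpus: element id -> identifiers, summary, raw evidence.
ELEMENTS = {
    "c1": {"ids": {"clause:4.2"}, "summary": "wheel bolt torque requirement",
           "raw": "4.2 Tighten wheel bolts to the torque in Table 3."},
    "t3": {"ids": {"table:3"}, "summary": "torque values per bolt size",
           "raw": "Table 3: M8 = 25 N·m; M10 = 50 N·m"},
    "cap3": {"ids": {"table:3"}, "summary": "caption of torque table",
             "raw": "Table 3. Bolt torque limits."},
    "c9": {"ids": {"clause:9.1"}, "summary": "paint colour options",
           "raw": "9.1 Paint may be red or blue."},
}
EDGES = [("c1", "table_ref", "t3"), ("cap3", "caption_of", "t3")]


def neighbors(node, relations):
    """Nodes linked to `node` in either direction over the given relations."""
    out = set()
    for src, rel, dst in EDGES:
        if rel in relations:
            if src == node:
                out.add(dst)
            elif dst == node:
                out.add(src)
    return out


def recall(query_ids, top_k=10):
    scored = [(len(query_ids & e["ids"]), nid) for nid, e in ELEMENTS.items()]
    hits = sorted((p for p in scored if p[0] > 0), key=lambda p: (-p[0], p[1]))
    return [nid for _, nid in hits][:top_k]


def expand(seeds, relations, depth=2):
    seen, frontier = set(seeds), set(seeds)
    for _ in range(depth):
        frontier = {n for f in frontier for n in neighbors(f, relations)} - seen
        seen |= frontier
    return seen


def rerank(candidates, query, top_l=5):
    q = tokens(query)

    def score(nid):
        s = tokens(ELEMENTS[nid]["summary"])
        return len(q & s) / len(q | s)     # Jaccard overlap as a stand-in

    return sorted(candidates, key=lambda n: (-score(n), n))[:top_l]


def bundle(selected):
    out = []
    for nid in selected:
        linked = sorted(neighbors(nid, {"caption_of", "table_ref"}))
        for member in [nid, *linked]:
            if member not in out:
                out.append(member)
    return out


def pack(node_ids, budget_words):
    packed, used = [], 0
    for nid in node_ids:
        cost = len(ELEMENTS[nid]["raw"].split())
        if used + cost <= budget_words:    # greedy fill, skip what does not fit
            packed.append(nid)
            used += cost
    return packed


def retrieve(query, query_ids, relations, budget_words=40):
    seeds = recall(query_ids)
    selected = rerank(expand(seeds, relations), query)
    return pack(bundle(selected), budget_words)


evidence = retrieve("What torque applies to wheel bolts?",
                    {"clause:4.2"}, {"table_ref", "caption_of"})
```

In this toy run the query names only the clause, yet the table and its caption arrive through the relation edges, while the unrelated clause `c9` never enters the candidate set.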

5. Experimental Setup

5.1. Datasets

The evaluation uses four benchmarks. MPMQA covers multimodal question answering on product manuals. Its PM209 corpus includes 209 manuals and 22,021 human-annotated question–answer pairs [1]. DesignQA focuses on engineering document understanding with Formula SAE regulations, CAD images, and engineering drawings [2]. MMLongBench-Doc serves as a long-context stress test. It contains 1062 expert-annotated questions over 130 lengthy PDFs, with an average length of 49.4 pages [3]. LongDocURL provides 2325 question–answer pairs spanning more than 33,000 pages. It separates understanding, reasoning, and locating tasks [4]. The parenthetical counts in Table 4 indicate the full benchmark scale. To keep the comparison balanced, the controlled evaluation protocol uses a curated subset of more than 7500 question–answer pairs across the four benchmarks.

5.2. Baselines

The comparison includes simple retrieval baselines and recent general-purpose RAG systems. We include standard flat baselines based on dense retrieval and a dense+BM25 hybrid. We then compare against Self-RAG [6], CRAG [7], RAPTOR [8], GraphRAG [9], LightRAG [10], HippoRAG 2 [11], and VisRAG [12]. Together, they cover adaptive retrieval, corrective retrieval, hierarchical retrieval, graph-based retrieval, memory-oriented retrieval, and multimodal document retrieval.

5.3. Fair Comparison and Implementation Details

CRAG is restricted to corpus-only retrieval; no external web search is allowed. Methods without native raw-visual access operate on the same canonicalized OCR/text+table representation. In Table 2 and Table 3, all systems use the same answer generator, Gemini-3.1-Flash-Lite-Preview. They use the same evidence budget whenever their design permits it. For page- or region-level retrievers, the final context is normalized to an equivalent budget of 2048 text tokens and 10 visual regions.

5.4. Evaluation Metrics

We evaluate standard retrieval metrics—Recall@k, MRR, and nDCG@k—at three granularities: identifier, element, and evidence bundle. Answer quality is measured with EM, token-level F1, task accuracy, or the benchmark-specific metric, depending on the dataset. Because the paper focuses on evidence chains, we also report relation-aware grounding metrics.
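For reference, the standard retrieval metrics can be computed per query as in the sketch below, using binary relevance. The function names are chosen for this example; MRR is the mean of the reciprocal rank over all queries.

```python
import math


def recall_at_k(ranked, gold, k):
    """Fraction of gold items that appear in the top k results."""
    return len(set(ranked[:k]) & gold) / len(gold)


def reciprocal_rank(ranked, gold):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, gold, k):
    """Binary-relevance nDCG with a log2 position discount."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, item in enumerate(ranked[:k], start=1) if item in gold)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(k, len(gold)) + 1))
    return dcg / ideal if ideal else 0.0


ranked = ["a", "b", "c", "d"]
gold = {"b", "d"}
r2 = recall_at_k(ranked, gold, 2)
rr = reciprocal_rank(ranked, gold)
ndcg4 = ndcg_at_k(ranked, gold, 4)
```

The same functions apply at each granularity (identifier, element, or evidence bundle) by changing what the ranked items and gold sets contain.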
Let $\mathcal{Q}$ denote the evaluation query set. For query $q$, let $V_q$ and $E_q$ be the gold evidence nodes and edges, and let $\hat{S}_q$, $\hat{V}_q$, and $\hat{E}_q$ be the retrieved summary nodes, raw evidence nodes, and relation edges.
We define the Raw Evidence Hit Rate:
\[ \mathrm{REHR@}K = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} \mathbb{1}\big[ V_q \cap \hat{V}_q^{(K)} \neq \emptyset \big]. \]
We define Summary-to-Raw Trace Accuracy:
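\[ \mathrm{SRTA} = \frac{1}{\sum_{q \in \mathcal{Q}} |\hat{S}_q|} \sum_{q \in \mathcal{Q}} \sum_{s \in \hat{S}_q} \mathbb{1}\big[ \mathrm{Align}(s) \in V_q \big]. \]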
We define Evidence Connectivity Recall:
\[ \mathrm{ECR} = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} \frac{|\hat{E}_q \cap E_q|}{|E_q|}. \]
For version-sensitive tasks, we define the Version Consistency Score:
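\[ \mathrm{VCS} = \frac{1}{\sum_{q \in \mathcal{Q}} M_q} \sum_{q \in \mathcal{Q}} \sum_{i=1}^{M_q} \mathbb{1}\big[ \mathrm{ver}(c_{q,i}) = \mathrm{ver}(q) \big]. \]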
For procedure-centric queries, we additionally report Procedure Order Accuracy:
\[ \mathrm{POA} = \frac{1}{|\mathcal{Q}_{\mathrm{proc}}|} \sum_{q \in \mathcal{Q}_{\mathrm{proc}}} \frac{|O(\hat{P}_q) \cap O(P_q)|}{|O(P_q)|}. \]
Finally, we measure the Claim Support Rate:
\[ \mathrm{CSR} = \frac{1}{\sum_{q \in \mathcal{Q}} M_q} \sum_{q \in \mathcal{Q}} \sum_{i=1}^{M_q} \mathbb{1}\big[ c_{q,i} \text{ is supported by at least one raw evidence bundle} \big]. \]
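Three of the relation-aware metrics can be sketched directly from their definitions. The sketch assumes that a raw evidence hit means at least one gold evidence node appears among the top K retrieved nodes, and that claim support has already been judged per claim; the data layout and sample values are invented.

```python
def rehr_at_k(gold_nodes, retrieved_nodes, k):
    """Share of queries whose top-k raw nodes contain a gold evidence node."""
    hits = sum(1 for q in gold_nodes
               if gold_nodes[q] & set(retrieved_nodes[q][:k]))
    return hits / len(gold_nodes)


def ecr(gold_edges, retrieved_edges):
    """Mean per-query fraction of gold relation edges that were recovered."""
    per_query = [len(retrieved_edges[q] & gold_edges[q]) / len(gold_edges[q])
                 for q in gold_edges]
    return sum(per_query) / len(per_query)


def csr(claim_support):
    """Supported claims divided by all claims, pooled over queries."""
    flags = [flag for per_q in claim_support.values() for flag in per_q]
    return sum(flags) / len(flags)


# Hypothetical annotations for two queries.
gold_nodes = {"q1": {"t3"}, "q2": {"f2"}}
retrieved_nodes = {"q1": ["c1", "t3"], "q2": ["c7", "c8"]}
gold_edges = {
    "q1": {("c1", "table_ref", "t3"), ("cap3", "caption_of", "t3")},
    "q2": {("c7", "figure_ref", "f2")},
}
retrieved_edges = {"q1": {("c1", "table_ref", "t3")}, "q2": set()}
claim_support = {"q1": [True, True, False], "q2": [True]}

rehr2 = rehr_at_k(gold_nodes, retrieved_nodes, 2)
ecr_value = ecr(gold_edges, retrieved_edges)
csr_value = csr(claim_support)
```

Note the different averaging: ECR is a mean over queries, whereas CSR pools claims across queries, so queries with many claims weigh more in CSR.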

6. Results

We first report end-to-end performance, followed by grounding quality, query-type behavior, component ablations, and robustness analyses.

6.1. Overall Performance

Table 4 summarizes end-to-end answer quality on the four-benchmark evaluation suite. TechDocRAG is best on all four benchmarks, reaching 68.5 on MPMQA, 62.2 on DesignQA, 58.4 on MMLongBench-Doc, and 55.2 on LongDocURL. Averaged across the four benchmarks, the margin over the strongest flat baseline, Hybrid (Dense+BM25), is 20.3 points, while the margin over the strongest non-flat baseline, VisRAG, is 9.3 points. The improvement is not confined to a single dataset or question type. It appears across manuals, engineering documents, and long visually rich PDFs.
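As a quick check on the averaged figures, the four reported TechDocRAG scores give the following mean:

\[
\bar{s}_{\mathrm{TechDocRAG}} = \frac{68.5 + 62.2 + 58.4 + 55.2}{4} = \frac{244.3}{4} \approx 61.1 .
\]

If the stated margins are differences between four-benchmark means, which is an assumption about how they were computed, the implied baseline means are roughly $61.1 - 20.3 \approx 40.8$ for Hybrid (Dense+BM25) and $61.1 - 9.3 \approx 51.8$ for VisRAG. These are derived values, not figures reported in Table 4.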

6.2. Grounding Quality

The grounding results show a similar pattern. Table 5 compares raw evidence hit rate under progressively relaxed matching criteria. At the strictest level (L0, exact identifier match), TechDocRAG reaches 0.942, whereas the hybrid baseline reaches 0.510. The gap narrows as the criterion is relaxed, but it does not disappear. This indicates that TechDocRAG is not merely retrieving semantically related context. It more often recovers the exact evidential node or its immediate neighborhood.
Table 6 reports the same comparison on the evidence-annotated subset. The same pattern appears in Recall@10, MRR, SRTA, ECR, and CSR. The gains in SRTA and ECR are especially relevant. They show that the summary layer and relation layer do more than rerank generic passages; they preserve the path from a retrieved summary to the raw evidence used in the final answer.

6.3. Breakdown by Query Type

Table 7 shows where the gains are largest. The margin is modest on clause lookup and parameter questions, where strong lexical baselines already do reasonably well. It becomes much larger on procedures, text–table questions, cross-reference resolution, and version-sensitive questions. These are the cases in which retrieval must carry structural information forward instead of treating evidence as independent chunks.

6.4. Ablation Study

The ablation results in Table 8 match the intended role of each component. Removing identifier-aware recall causes the sharpest drop in REHR. This is expected because exact anchors are often the most reliable entry point into a technical document. Removing relation edges hurts ECR the most. This indicates that graph connectivity is what allows the system to reconstruct the evidence chain after initial recall. Removing raw bundling mainly hurts CSR. This suggests that answer quality deteriorates when the generator receives disconnected fragments instead of coherent evidence bundles. The query-type analysis mirrors this pattern: relation edges matter most for cross-reference and procedure queries, raw bundling matters most for text–table and text–figure questions, and version metadata is decisive for version-sensitive cases.

6.5. Resource Costs and Technical Requirements

Table 9 summarizes the resource profile. TechDocRAG requires more offline indexing time than flat retrieval because it has to construct the element graph and align identifiers, summaries, and raw evidence. Even so, it remains lighter than the graph-intensive baseline. On DesignQA, indexing is roughly twice as fast as HippoRAG 2 and the resulting index is considerably smaller. More importantly, the query-time latency is close to standard hybrid retrieval. The main gains therefore do not appear to come from larger context windows or a much larger online compute budget.

6.6. Robustness Under Identifier Corruption and Relation Dropout

We also evaluate sensitivity to noisy parsing. Table 10 separates two perturbations. In Experiment 2A, OCR-like identifier corruption is injected into clause IDs, parameter names, and figure/table labels. In Experiment 2B, key structural edges such as table_ref and step_next are randomly removed. The results separate two failure modes. Identifier corruption is tolerable up to about 10%, after which performance collapses sharply. Relation loss is much less damaging: even after dropping 30% of the relevant edges, ECR decreases only moderately. Thus, TechDocRAG is robust once retrieval is anchored correctly, but first-stage identifier recall still depends on reasonably faithful parsing.
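The two perturbations can be approximated as follows. The character confusion table, the one-character-per-identifier corruption rule, and the function names are assumptions for this sketch; the paper does not describe its noise model at this level of detail.

```python
import random

# Hypothetical OCR-like confusions between visually similar characters.
OCR_CONFUSIONS = {"0": "O", "1": "l", "5": "S", "8": "B", ".": ","}


def corrupt_identifiers(identifiers, rate, rng):
    """With probability `rate`, swap one confusable character per identifier."""
    corrupted = []
    for ident in identifiers:
        if rng.random() < rate:
            positions = [i for i, ch in enumerate(ident) if ch in OCR_CONFUSIONS]
            if positions:          # identifiers without confusable chars stay
                i = rng.choice(positions)
                ident = ident[:i] + OCR_CONFUSIONS[ident[i]] + ident[i + 1:]
        corrupted.append(ident)
    return corrupted


def drop_relations(edges, rate, rng, targets=("table_ref", "step_next")):
    """Remove each edge of a targeted relation type with probability `rate`."""
    return [e for e in edges if e[1] not in targets or rng.random() >= rate]


rng = random.Random(0)
ids = ["4.2.1", "Table 3", "Fig. 5"]
edges = [("c1", "table_ref", "t3"), ("s1", "step_next", "s2"),
         ("cap3", "caption_of", "t3")]

noisy_ids = corrupt_identifiers(ids, rate=1.0, rng=rng)
kept_edges = drop_relations(edges, rate=1.0, rng=rng)
```

Sweeping `rate` for each perturbation separately, with the rest of the pipeline fixed, gives the kind of two-axis sensitivity comparison described above.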

7. Discussion

The results point to a consistent pattern. Performance improves when retrieval returns a connected evidence set rather than an isolated chunk. Technical questions usually do not fail because no related text is found. They fail because the retrieved context is incomplete. For example, a clause may appear without the table that constrains it, or a procedure step may appear without the surrounding steps that make it interpretable. The gains in REHR, SRTA, and ECR support this interpretation.

7.1. The Evidence-Chain Gap

The hybrid baseline is already competitive on standard retrieval tasks. Graph-based baselines are also strong in long-context reasoning. The remaining gap comes from another source. OCR and PDF parsing make technical identifiers brittle, and small formatting changes can break lexical matching. Even when a baseline retrieves the right clause text, it still has to recover the evidence linked to that clause. Without explicit document relations, this second step is unreliable. The strict REHR contrast between Hybrid RAG and TechDocRAG illustrates the point. Both systems can find related text, but TechDocRAG more consistently retrieves the exact evidential node and its immediate support.

7.2. Resource Profile and Robustness

The additional experiments show that these gains do not require an impractical resource budget. TechDocRAG adds offline work because it constructs an element graph and aligns three retrieval views. At query time, however, latency remains close to standard hybrid retrieval and well below the heavier graph baseline in the profiling study. Thus, the method improves grounding without a larger inference budget.
The robustness results also clarify the limitations. Relation dropout causes gradual degradation. This suggests that summary–raw alignment provides redundancy when part of the graph is missing. Identifier corruption is more damaging. Performance remains stable under moderate corruption, but it collapses when the parser damages technical anchors too severely. The system is therefore more tolerant of partial relation loss than of severe identifier loss in the first retrieval stage.

7.3. Limitations and Practical Mitigations

TechDocRAG still depends on document parsing quality. Errors in table extraction, caption alignment, or clause numbering can remove the relations that retrieval relies on [13]. The system also handles explicit references more reliably than implicit ones. Many manuals and standards use layout conventions or shorthand references that are clear to readers but difficult to recover automatically. Cross-version ambiguity remains another limitation. Closely related revisions often reuse identifiers while changing the conditions attached to them. Finally, relation-aware grounding metrics require manual evidence annotation, which limits how broadly they can be applied.
Several mitigations are practical. Confidence-based OCR filtering can flag low-quality pages before graph construction. Selective parser fallback or re-parsing can target pages with corrupted identifiers. Rule-based validation of clause IDs, figure labels, and version tags can catch common normalization failures. Relation-confidence thresholds can also help when table–caption or figure–caption alignment is uncertain. These steps do not remove the problem entirely, but they reduce the chance that retrieval starts from a damaged representation.
Cost-aware parsing is also important for deployment. High-accuracy parsing does not need to be applied uniformly to every page. A lightweight first pass can extract text blocks, bounding boxes, section paths, and candidate technical identifiers. Rule-based validators can then check whether clause numbers, table labels, figure labels, units, and version tags are internally consistent. More expensive OCR, layout analysis, or vision-based extraction can be reserved for low-confidence pages or regions. This includes pages with corrupted identifiers or inconsistent cross-references. Nonstandard layouts and overlapping logical structures do not need to be forced into a single tree. The graph can represent the same element with multiple relation edges and confidence scores. When confidence remains low, the system can fall back to page-level or region-level retrieval and mark the provenance as uncertain. This cascaded strategy can reduce average parsing cost while limiting the effect of parser failures on downstream retrieval.
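The cascaded strategy can be sketched as a page-level confidence check followed by a routing decision. The validation rules, thresholds, and parser labels below are hypothetical and only illustrate the control flow.

```python
import re

# Assumed clause-ID shape: dotted numbers such as 4.2.1.
CLAUSE_ID = re.compile(r"^\d+(\.\d+)*$")


def page_confidence(clause_ids, referenced_tables, defined_tables):
    """Fraction of rule checks passed by the lightweight first pass."""
    checks = [bool(CLAUSE_ID.match(c)) for c in clause_ids]
    checks += [t in defined_tables for t in referenced_tables]
    return sum(checks) / len(checks) if checks else 1.0


def choose_parser(confidence, reparse_below=0.9, fallback_below=0.5):
    """Route a page to a parsing tier based on validation confidence."""
    if confidence < fallback_below:
        return "page_level_fallback"   # retrieve whole page, mark uncertain
    if confidence < reparse_below:
        return "expensive_reparse"     # rerun OCR or layout analysis
    return "lightweight"


clean = page_confidence(["4.2.1", "4.2.2"], ["Table 3"], {"Table 3"})
noisy = page_confidence(["4.2.l", "4.2.2"], ["Table 3"], {"Table 3"})
broken = page_confidence(["4,2", "A?"], ["Table 9"], set())
routes = [choose_parser(c) for c in (clean, noisy, broken)]
```

Only the second and third pages incur extra cost here, which is the intended effect of reserving expensive parsing for low-confidence regions.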

8. Conclusions

This paper examined retrieval-augmented generation for technical documents, focusing on manuals, engineering documents, and standards. In technical document QA, the main difficulty is not document length alone. The evidence for one answer is usually spread across different kinds of document objects and tied together by explicit references and local structure. Standard chunk-based retrieval tends to break those links.
TechDocRAG addresses this problem by keeping document elements and their relations explicit throughout the pipeline. Each element is indexed through identifiers, summaries, and raw evidence; retrieval moves from exact anchors to semantic reranking and finally to bundled supporting evidence. Across four benchmarks and more than 7500 evaluated question–answer pairs, the method improves answer quality, grounding quality, and relation recovery relative to strong generic RAG baselines. The mean end-to-end improvement is 20.3 points over the strongest flat baseline and 9.3 points over the strongest non-flat baseline, while strict raw evidence hit rate rises from 0.510 to 0.942 on the evidence-annotated subset.
The profiling and robustness results clarify both the strengths and the limits of the approach. Query-time latency stays close to standard hybrid retrieval, but the method still relies on reasonably faithful parsing at the identifier stage. Future work should therefore focus on cost-aware technical document parsing, stronger handling of nonstandard layouts and implicit references, and more robust conflict resolution across document versions. Appendix A summarizes the notation used in the formulation, and Appendix B lists representative structural and referential relations used in the element graph.

Author Contributions

Conceptualization, S.L.; methodology, S.L.; software, S.L.; validation, S.L.; formal analysis, S.L.; investigation, S.L.; resources, S.L.; data curation, S.L.; writing—original draft preparation, S.L.; writing—review and editing, S.L. and M.C.; visualization, S.L.; supervision, M.C.; project administration, M.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Public benchmark datasets were analyzed in this study. These datasets are available from the sources cited in the manuscript, including MPMQA, DesignQA, MMLongBench-Doc, and LongDocURL. No new large-scale benchmark dataset was created. Derived annotations, prompts, and implementation details supporting the findings of this study are available from the first author upon request.

Acknowledgments

The authors thank the maintainers of the public technical document benchmarks used in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RAG: Retrieval-Augmented Generation
QA: Question Answering
OCR: Optical Character Recognition
MRR: Mean Reciprocal Rank
EM: Exact Match
REHR: Raw Evidence Hit Rate
SRTA: Summary-to-Raw Trace Accuracy
ECR: Evidence Connectivity Recall
VCS: Version Consistency Score
CSR: Claim Support Rate

Appendix A. Notation Summary

Table A1 summarizes the main symbols used in the problem formulation and retrieval pipeline.
Table A1. Notation summary.
Symbol | Meaning
$C$ | Corpus of technical documents
$G_d = (V_d, E_d)$ | Element graph for document $d$
$\tau_v$ | Element type of node $v$
$r_v$ | Raw document object of node $v$
$k_v$ | Technical identifiers and keywords of node $v$
$s_v$ | Semantic summary of node $v$
$m_v$ | Metadata of node $v$
$H_q$ | Retrieved evidence subgraph for query $q$
$E_q$ | Packed evidence context passed to the generator
$\Pi_q$ | Claim-level provenance mapping
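As an illustration of how the notation in Table A1 could map onto a data structure, the following is a minimal sketch rather than the authors' implementation. The class names, field names, and the example clause and table are hypothetical; only the correspondence to $\tau_v$, $r_v$, $k_v$, $s_v$, $m_v$, $V_d$, and $E_d$ comes from the table.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ElementNode:
    """One node v of the element graph (field names are illustrative)."""
    node_id: str
    element_type: str                                        # tau_v
    raw: Any                                                 # r_v
    identifiers: list[str] = field(default_factory=list)     # k_v
    summary: str = ""                                        # s_v
    metadata: dict[str, Any] = field(default_factory=dict)   # m_v


@dataclass
class ElementGraph:
    """Element graph G_d = (V_d, E_d) for one document d."""
    doc_id: str
    nodes: dict[str, ElementNode] = field(default_factory=dict)      # V_d
    edges: list[tuple[str, str, str]] = field(default_factory=list)  # E_d

    def add_node(self, node: ElementNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        # Edges are stored as (source, relation type, target) triples.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, relation, dst))


# Hypothetical two-node document: a clause that refers to a table.
graph = ElementGraph(doc_id="manual-001")
graph.add_node(ElementNode("c4.2", "clause", "See Table 7 for the allowed range.",
                           identifiers=["4.2"], summary="Points to the allowed range."))
graph.add_node(ElementNode("t7", "table", [["mode", "range"]],
                           identifiers=["table 7"], metadata={"page": 12}))
graph.add_edge("c4.2", "table_ref", "t7")
```

Keeping the three retrieval views (identifiers, summary, raw object) on the same node is what allows a hit in one view to be traced to the other two without a separate lookup.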

Appendix B. Representative Relation Types

Table A2 lists representative structural and referential relations used by TechDocRAG.
Table A2. Representative relation types used in the element graph.
Relation | Description
contains | Hierarchical containment between sections and elements
precedes | Local reading order between neighboring elements
same_section | Membership within the same section or subsection
step_next | Sequential order between procedural steps
clause_ref | Explicit reference from one clause to another
table_ref | Reference from text to a table
figure_ref | Reference from text to a figure
caption_of | Link between a figure or table and its caption
same_identifier | Reuse of the same technical identifier across elements
version_of/supersedes | Cross-version relation between revisions

References

  1. Zhang, L.; Hu, A.; Zhang, J.; Hu, S.; Jin, Q. MPMQA: Multimodal Question Answering on Product Manuals. arXiv 2023, arXiv:2304.09660.
  2. Doris, A.C.; Grandi, D.; Tomich, R.; Alam, M.F.; Ataei, M.; Cheong, H.; Ahmed, F. DesignQA: A Multimodal Benchmark for Evaluating Large Language Models’ Understanding of Engineering Documentation. arXiv 2024, arXiv:2404.07917.
  3. Ma, Y.; Zang, Y.; Chen, L.; Chen, M.; Jiao, Y.; Li, X.; Lu, X.; Liu, Z.; Ma, Y.; Dong, X.; et al. MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations. arXiv 2024, arXiv:2407.01523.
  4. Deng, C.; Yuan, J.; Bu, P.; Wang, P.; Li, Z.Z.; Xu, J.; Li, X.H.; Gao, Y.; Song, J.; Zheng, B.; et al. LongDocURL: A Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating. arXiv 2024, arXiv:2412.18424.
  5. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020.
  6. Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv 2023, arXiv:2310.11511.
  7. Yan, S.Q.; Gu, J.C.; Zhu, Y.; Ling, Z.H. Corrective Retrieval Augmented Generation. arXiv 2024, arXiv:2401.15884.
  8. Sarthi, P.; Abdullah, S.; Tuli, A.; Khanna, S.; Goldie, A.; Manning, C.D. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
  9. Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Metropolitansky, D.; Ness, R.O.; Larson, J. From Local to Global: A GraphRAG Approach to Query-Focused Summarization. arXiv 2024, arXiv:2404.16130.
  10. Guo, Z.; Xia, L.; Yu, Y.; Ao, T.; Huang, C. LightRAG: Simple and Fast Retrieval-Augmented Generation. arXiv 2024, arXiv:2410.05779.
  11. Gutiérrez, B.J.; Shu, Y.; Qi, W.; Zhou, S.; Su, Y. From RAG to Memory: Non-Parametric Continual Learning for Large Language Models. arXiv 2025, arXiv:2502.14802.
  12. Yu, S.; Tang, C.; Xu, B.; Cui, J.; Ran, J.; Yan, Y.; Liu, Z.; Wang, S.; Han, X.; Liu, Z.; et al. VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents. arXiv 2024, arXiv:2410.10594.
  13. Zhang, J.; Zhang, Q.; Wang, B.; Ouyang, L.; Wen, Z.; Li, Y.; Chow, K.H.; He, C.; Zhang, W. OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation. arXiv 2024, arXiv:2412.02592.
Figure 1. Block diagram of the proposed TechDocRAG framework.
Figure 2. Query-time flowchart of TechDocRAG.
Table 1. Intent-specific relation expansion and evidence bundling policy.
Intent | Expanded Relations | Max Hop | Primary Target | Bundle Composition | Example Query
Definition/requirement | contains, same_identifier, clause_ref | 1–2 | Clause or paragraph | Matched clause, defining paragraph, nearby normative note, and linked table header when present | “What does parameter X mean in Section 4.2?”
Procedure/troubleshooting | step_next, contains, same_section | 2 | Procedure steps | Retrieved step, preceding and following steps, warning/note blocks, and section title | “How do I recalibrate sensor Y?”
Text–table reasoning | table_ref, caption_of, same_identifier | 2 | Table row/header path | Referring clause, relevant row or cell, header path, and table caption | “Which temperature range is allowed for mode A?”
Text–figure grounding | figure_ref, caption_of, same_section | 2 | Figure/caption pair | Figure crop, caption, and referring paragraph | “What does Figure 3 indicate about the connector layout?”
Cross-reference resolution | clause_ref, table_ref, figure_ref | 2 | Referenced element | Source clause plus the cited clause, table, or figure and its local context | “In Section 5.2, what does Table 7 clarify?”
Version-sensitive query | version_of, supersedes, same_identifier | 1 | Active revision nodes | Current clause, matched prior revision when needed, and version metadata | “What changed for parameter Z in revision B?”
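The policy in Table 1 can be read as a relation-filtered, hop-bounded graph expansion. The sketch below is one possible reading under stated assumptions, not the authors' code: the intent keys, the function name `expand`, the choice to treat edges as undirected, the use of the upper bound 2 for the "1–2" entry, and the example node names are all hypothetical. Only the relation sets and hop limits are taken from the table.

```python
from collections import deque

# Relation sets and hop limits copied from Table 1; the dictionary keys are
# illustrative labels, and "1-2" hops is simplified to its upper bound.
INTENT_POLICY = {
    "definition": ({"contains", "same_identifier", "clause_ref"}, 2),
    "procedure": ({"step_next", "contains", "same_section"}, 2),
    "text_table": ({"table_ref", "caption_of", "same_identifier"}, 2),
    "text_figure": ({"figure_ref", "caption_of", "same_section"}, 2),
    "cross_reference": ({"clause_ref", "table_ref", "figure_ref"}, 2),
    "version": ({"version_of", "supersedes", "same_identifier"}, 1),
}


def expand(edges, seeds, intent):
    """Breadth-first expansion from seed nodes over the relations allowed for
    the intent. Returns {node_id: hop distance} for every reached node."""
    relations, max_hop = INTENT_POLICY[intent]
    adjacency = {}
    for src, rel, dst in edges:
        if rel in relations:
            # Assumption: relations can be followed in both directions.
            adjacency.setdefault(src, []).append(dst)
            adjacency.setdefault(dst, []).append(src)
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if hops[node] == max_hop:
            continue  # hop budget exhausted along this path
        for neighbor in adjacency.get(node, []):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# Hypothetical edges around a clause that cites Table 7.
edges = [
    ("clause_5.2", "table_ref", "table_7"),
    ("caption_7", "caption_of", "table_7"),
    ("caption_7", "same_identifier", "clause_9.1"),  # three hops from the seed
    ("clause_5.2", "step_next", "step_1"),           # not a text-table relation
]
bundle = expand(edges, ["clause_5.2"], "text_table")
print(bundle)
```

In this example the expansion reaches the table and its caption but stops before `clause_9.1` (beyond the hop limit) and never follows the `step_next` edge, which matches the intent-specific filtering that Table 1 describes.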
Table 2. Fair-comparison conditions across baselines.
Method | Retrieval Unit | External Search | Multimodal Access | Reranking/Repair Stage | Generator Interface and Budget Normalization
Dense Chunk RAG | Fixed text chunks | No | OCR text + tables only | Dense similarity ranking | Gemini-3.1-Flash-Lite-Preview; packed to 2048 text tokens
Hybrid (Dense+BM25) | Fixed text chunks | No | OCR text + tables only | Dense + BM25 fusion (RRF) | Gemini-3.1-Flash-Lite-Preview; packed to 2048 text tokens
Self-RAG | Fixed chunks | No | OCR text + tables only | Self-reflective retrieval control | Gemini-3.1-Flash-Lite-Preview; same text budget
CRAG | Fixed chunks | No | OCR text + tables only | Corpus-only corrective retrieval | Gemini-3.1-Flash-Lite-Preview; same text budget
RAPTOR | Recursive summary tree | No | OCR text + tables only | Tree-level retrieval over summaries | Gemini-3.1-Flash-Lite-Preview; same text budget
GraphRAG | Entity/community graph | No | OCR text + tables only | Graph traversal with community summaries | Gemini-3.1-Flash-Lite-Preview; same text budget
LightRAG | Dual graph/vector index | No | OCR text + tables only | Graph + vector fusion | Gemini-3.1-Flash-Lite-Preview; same text budget
HippoRAG 2 | Memory graph + passages | No | OCR text + tables only | Memory-guided propagation | Gemini-3.1-Flash-Lite-Preview; same text budget
VisRAG | Pages/visual regions | No | Native visual access | Visual page retrieval and reranking | Gemini-3.1-Flash-Lite-Preview; normalized to 2048 text tokens + 10 visual regions
TechDocRAG | Identifier–summary–raw graph | No | Native text/tables/figures | Summary reranking + raw bundling | Gemini-3.1-Flash-Lite-Preview; normalized to 2048 text tokens + 10 visual regions
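Table 2 lists reciprocal rank fusion (RRF) as the fusion step of the Hybrid (Dense+BM25) baseline. For readers unfamiliar with it, a minimal sketch follows. The constant `k = 60` is a commonly used default and is an assumption here, since the value used in the experiments is not stated in this section; the chunk IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs: each list contributes 1 / (k + rank)
    to an item's score, with ranks starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


dense_ranking = ["c3", "c1", "c2"]   # hypothetical dense-retriever output
bm25_ranking = ["c1", "c4", "c3"]    # hypothetical BM25 output
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)  # ['c1', 'c3', 'c4', 'c2']
```

Because RRF uses only ranks, it needs no score calibration between the dense and lexical retrievers, which is why it is a common choice for this kind of hybrid baseline.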
Table 3. Implementation details of TechDocRAG.
Component | Configuration
Document parser | OCR-based multimodal parser with layout segmentation, figure–caption alignment, and typed element extraction
Identifier canonicalization | Rule-based normalization of clause IDs, figure/table labels, parameter names, and version tags, with fuzzy matching at query time
Identifier index | BM25 over normalized identifiers, clause numbers, labels, and domain keywords
Summary vector index | all-MiniLM-L6-v2
Query analyzer | TinyLlama-1.1B-Chat-v1.0, batch size 16, used for identifier extraction and intent prediction
Answer generator | Gemini-3.1-Flash-Lite-Preview, shared across all methods under the fair-comparison protocol
Relation extraction | Explicit cross-reference parsing plus layout heuristics for caption_of, table_ref, figure_ref, and step_next edges
Default retrieval hyperparameters | Top-10 identifier recall, 2-hop graph expansion, top-5 summary reranking
Context budget | 2048 text tokens and 10 visual regions
Evaluation hardware | Single-GPU inference; resource profiling reported separately in Table 9
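To make the identifier canonicalization row of Table 3 concrete, the following sketch combines rule-based normalization with fuzzy matching at query time. It is an illustration under assumptions, not the rules used in the paper: the regular expressions, the canonical prefixes, the 0.8 similarity cutoff, and the toy index are all hypothetical, and the fuzzy step uses the standard-library `difflib` rather than whatever matcher the authors used.

```python
import re
from difflib import get_close_matches


def canonicalize(identifier):
    """Map surface variants of a label to one canonical lowercase form."""
    text = re.sub(r"\s+", " ", identifier.strip().lower())
    # Illustrative rules: only rewrite a prefix that is followed by a digit.
    text = re.sub(r"^(figure|fig\.?)\s*(?=\d)", "figure ", text)
    text = re.sub(r"^(table|tab\.?)\s*(?=\d)", "table ", text)
    text = re.sub(r"^(section|sec\.?|clause|§)\s*(?=\d)", "clause ", text)
    return text


def lookup(query_identifier, index, cutoff=0.8):
    """Exact match on the canonical form first, then a fuzzy fallback."""
    key = canonicalize(query_identifier)
    if key in index:
        return index[key]
    close = get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None


# Hypothetical identifier index: canonical identifier -> node ID.
index = {"figure 3": "node_fig3", "table 7": "node_tab7", "clause 4.2": "node_c42"}

print(lookup("Fig. 3", index))       # node_fig3 (rule-based match)
print(lookup("Section 4.2", index))  # node_c42  (rule-based match)
print(lookup("Tabel 7", index))      # node_tab7 (fuzzy match on a typo)
print(lookup("Annex Q", index))      # None      (nothing close enough)
```

Trying the exact canonical form before the fuzzy fallback keeps clean identifiers from being matched to near neighbors such as "table 7" versus "table 1", which matters when identifiers differ by a single character.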
Table 4. End-to-end answer performance on the full evaluation suite. Parenthetical counts denote benchmark scale; the reported comparison is computed on the balanced 7500+ QA evaluation subset described in Section 5.
Method | MPMQA (22k) ↑ | DesignQA (4k) ↑ | MMLong (1k) ↑ | LongDoc (2k) ↑ | Avg. Gain ↑
Dense Chunk RAG | 44.8 | 40.5 | 33.2 | 31.4 | -
Hybrid (Dense+BM25) | 48.2 | 44.1 | 36.5 | 34.2 | +3.4
Self-RAG | 51.8 | 47.5 | 39.8 | 37.5 | +6.8
CRAG | 52.4 | 48.8 | 40.5 | 38.2 | +7.5
RAPTOR | 50.5 | 46.2 | 41.4 | 38.8 | +5.4
GraphRAG | 54.8 | 50.5 | 44.8 | 42.4 | +9.8
LightRAG | 55.6 | 51.2 | 46.1 | 43.8 | +10.7
HippoRAG 2 | 56.5 | 52.8 | 47.5 | 45.2 | +11.6
VisRAG | 54.2 | 57.5 | 48.8 | 46.5 | +12.6
TechDocRAG (Ours) | 68.5 | 62.2 | 58.4 | 55.2 | +21.2
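The mean-score gaps quoted in the abstract (20.3 and 9.3 points) can be traced back to the per-benchmark columns of Table 4. The short check below assumes that the mean end-to-end score is the unweighted mean over the four benchmark columns, and that Hybrid (Dense+BM25) and VisRAG are the baselines meant by "strongest flat" and "strongest non-flat"; both assumptions are inferred from the numbers rather than stated in this section.

```python
# Per-benchmark scores copied from Table 4
# (MPMQA, DesignQA, MMLongBench-Doc, LongDocURL).
scores = {
    "Hybrid (Dense+BM25)": [48.2, 44.1, 36.5, 34.2],
    "VisRAG": [54.2, 57.5, 48.8, 46.5],
    "TechDocRAG": [68.5, 62.2, 58.4, 55.2],
}

# Assumption: unweighted mean over the four benchmarks.
means = {method: sum(values) / len(values) for method, values in scores.items()}

gap_vs_flat = means["TechDocRAG"] - means["Hybrid (Dense+BM25)"]
gap_vs_non_flat = means["TechDocRAG"] - means["VisRAG"]

print(f"gap vs. Hybrid: {gap_vs_flat:.1f}")      # 20.3
print(f"gap vs. VisRAG: {gap_vs_non_flat:.1f}")  # 9.3
```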
Table 5. Retrieval performance under extended strategy relaxation.
Strategy Level | Hit Criteria | Dense REHR ↑ | Hybrid REHR ↑ | TechDocRAG REHR ↑
L0 (Strict) | Exact ID Match | 0.452 | 0.510 | 0.942
L1 (1-hop) | Within 1-hop | 0.585 | 0.622 | 0.965
L2 (2-hop) | Within 2-hops | 0.642 | 0.681 | 0.984
L3 (3-hop) | Within 3-hops | 0.725 | 0.758 | 0.992
L4 (4-hop) | Within 4-hops | 0.784 | 0.812 | 1.000
Table 6. Retrieval and grounding performance on the full evidence-annotated subset.
Method | Recall@10 ↑ | MRR ↑ | REHR@10 ↑ | SRTA ↑ | ECR ↑ | CSR ↑
Dense Chunk RAG | 42.5 | 0.28 | 0.452 | 0.412 | 0.382 | 0.425
Hybrid (Dense+BM25) | 45.8 | 0.31 | 0.510 | 0.485 | 0.425 | 0.512
Self-RAG | 48.2 | 0.34 | 0.542 | 0.512 | 0.451 | 0.545
CRAG | 49.4 | 0.35 | 0.551 | 0.522 | 0.464 | 0.552
RAPTOR | 52.1 | 0.38 | 0.582 | 0.574 | 0.505 | 0.578
GraphRAG | 55.5 | 0.41 | 0.654 | 0.622 | 0.585 | 0.624
LightRAG | 56.8 | 0.43 | 0.682 | 0.654 | 0.621 | 0.652
HippoRAG 2 | 58.5 | 0.45 | 0.721 | 0.685 | 0.652 | 0.704
VisRAG | 54.2 | 0.40 | 0.654 | 0.611 | 0.582 | 0.625
TechDocRAG | 69.2 | 0.56 | 0.942 | 0.914 | 0.852 | 0.942
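Recall@10 and MRR in Table 6 are standard ranking metrics. The sketch below shows their textbook definitions on hypothetical toy data; it does not reproduce the paper's evaluation code, and the paper-specific metrics (REHR, SRTA, ECR, CSR) are defined in the main text and are not implemented here. Table 6 reports Recall@10 on a 0–100 scale, whereas this function returns a fraction.

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical (ranked results, gold evidence) pairs for three queries.
runs = [
    (["a", "b", "c"], {"b"}),       # first hit at rank 2
    (["x", "y", "z"], {"z", "w"}),  # first hit at rank 3, one of two found
    (["m"], {"n"}),                 # no hit
]

mrr = sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)
mean_recall = sum(recall_at_k(r, g) for r, g in runs) / len(runs)
print(f"MRR = {mrr:.3f}, Recall@10 = {mean_recall:.3f}")
```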
Table 7. Performance breakdown by technical document query type.
Method | Clause | Param. | Procedure | Text–Table | Text–Figure | Cross-Ref | Version | Macro Avg.
Hybrid Chunk RAG | 52.1 | 48.4 | 35.2 | 31.4 | 28.5 | 25.1 | 22.4 | 34.7
Self-RAG | 54.5 | 51.2 | 38.4 | 34.2 | 31.0 | 28.5 | 24.8 | 37.5
CRAG | 55.2 | 52.5 | 40.1 | 35.8 | 32.4 | 30.2 | 26.5 | 38.9
RAPTOR | 58.4 | 54.1 | 44.2 | 39.5 | 35.8 | 34.2 | 30.1 | 42.3
GraphRAG | 61.2 | 58.5 | 48.5 | 44.2 | 40.1 | 38.5 | 34.2 | 46.4
LightRAG | 58.8 | 56.4 | 46.1 | 42.5 | 38.2 | 36.4 | 32.5 | 44.4
HippoRAG 2 | 62.5 | 60.1 | 51.2 | 47.5 | 43.4 | 41.2 | 37.5 | 49.1
VisRAG | 55.4 | 53.2 | 42.1 | 49.5 | 52.4 | 32.1 | 28.4 | 44.7
TechDocRAG | 65.8 | 64.2 | 62.1 | 60.5 | 58.4 | 57.2 | 54.1 | 60.3
Table 8. Ablation study of the proposed framework.
Variant | Main Score ↑ | REHR@10 ↑ | ECR ↑ | CSR ↑ | VCS ↑ | POA ↑ | Latency (s) ↓
Full Model | 62.5 | 0.94 | 0.85 | 0.82 | 0.78 | 0.75 | 0.42
w/o Relation Edges | 51.2 | 0.88 | 0.42 | 0.65 | 0.74 | 0.48 | 0.35
w/o Identifier Recall | 54.8 | 0.65 | 0.72 | 0.75 | 0.68 | 0.65 | 0.32
w/o Summary Routing | 56.4 | 0.91 | 0.81 | 0.78 | 0.72 | 0.70 | 0.38
w/o Raw Bundling | 58.2 | 0.92 | 0.83 | 0.45 | 0.75 | 0.72 | 0.36
w/o Intent-Aware Expansion | 55.6 | 0.90 | 0.58 | 0.72 | 0.74 | 0.61 | 0.38
w/o Version Metadata | 59.4 | 0.93 | 0.84 | 0.81 | 0.32 | 0.74 | 0.40
Table 9. Resource and technical requirements measured on DesignQA.
MethodOffline Indexing Time (s)Index Size (MB)Avg. Query Latency (ms)Peak GPU Memory (MB)
Dense Chunk RAG2.52.05.2∼150
Hybrid (Dense+BM25)3.83.57.8∼180
HippoRAG 212.48.245.6∼450
TechDocRAG (Ours)6.362.288.3314.4
Table 10. Robustness analysis under parsing and relation perturbation.
Noise/Loss Level | REHR (Exp. 2A: ID Corruption) ↑ | ECR (Exp. 2B: Edge Dropout) ↑
0% (baseline) | 1.0000 | 0.9875
5% | 1.0000 | 0.9825
10% | 1.0000 | 0.9725
20% | 0.0700 | 0.9525
30% | - | 0.9310
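The edge-dropout condition in Table 10 (Exp. 2B) amounts to removing a fixed fraction of relation edges before retrieval. The sketch below shows one simple way to generate such a perturbation; it assumes uniform random dropout with a fixed seed, which may differ from the protocol used in the paper, and the function name and toy edge list are hypothetical.

```python
import random


def drop_edges(edges, rate, seed=0):
    """Return a copy of `edges` with round(rate * len(edges)) edges removed
    uniformly at random. A fixed seed keeps the perturbation reproducible."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    rng = random.Random(seed)
    n_drop = round(rate * len(edges))
    dropped = set(rng.sample(range(len(edges)), n_drop))
    return [edge for i, edge in enumerate(edges) if i not in dropped]


# Hypothetical chain of 100 reading-order edges.
edges = [(f"n{i}", "precedes", f"n{i + 1}") for i in range(100)]

for rate in (0.0, 0.05, 0.10, 0.20, 0.30):
    kept = drop_edges(edges, rate)
    print(f"{rate:.0%} dropout -> {len(kept)} edges kept")
```

Re-running retrieval on each perturbed edge list and recomputing ECR would then give a degradation curve of the kind reported in the right-hand column of Table 10.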
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lee, S.; Choi, M. TechDocRAG: Relation-Preserving Retrieval-Augmented Generation (RAG) for Technical Documents. AI 2026, 7, 161. https://doi.org/10.3390/ai7050161

AMA Style

Lee S, Choi M. TechDocRAG: Relation-Preserving Retrieval-Augmented Generation (RAG) for Technical Documents. AI. 2026; 7(5):161. https://doi.org/10.3390/ai7050161

Chicago/Turabian Style

Lee, Seungjoon, and Myungryul Choi. 2026. "TechDocRAG: Relation-Preserving Retrieval-Augmented Generation (RAG) for Technical Documents" AI 7, no. 5: 161. https://doi.org/10.3390/ai7050161

APA Style

Lee, S., & Choi, M. (2026). TechDocRAG: Relation-Preserving Retrieval-Augmented Generation (RAG) for Technical Documents. AI, 7(5), 161. https://doi.org/10.3390/ai7050161
