1. Introduction
The rapid advancement of Large Language Models has recently led to remarkable performance across a wide array of language modeling tasks [
1]. Within this landscape, the ability to accurately and efficiently comprehend and summarize long-form documents has emerged as a key frontier challenge [
2,
3]. To address this, the research community has largely pursued two principal technical approaches: one focusing on long-context architectures that expand the model’s native context window, and the other on the Retrieval-Augmented Generation paradigm [
4,
5]. Unlike long-context strategies that attempt to process the entire document in one pass, RAG offers a more flexible and computationally efficient solution. It functions by precisely locating and extracting a small set of critical information snippets to guide the LLM’s generation process. This fundamental “retrieve-then-generate” principle has demonstrated its efficacy in a multitude of knowledge-intensive tasks [
6,
7]. Notably, our approach operates in a zero-shot manner without dataset-specific fine-tuning, prioritizing strong generalization while maintaining competitive generation quality.
However, despite these advantages, conventional RAG methods face a critical limitation: they often fail to capture the holistic meaning and global context necessary for understanding complex documents [
6,
7,
8]. This deficiency arises primarily from the widely used “chunk flattening” strategy. Unlike human reading, which follows a continuous narrative flow, this approach mechanically partitions documents into fixed-size segments based on token counts rather than semantic boundaries [
9,
10]. Such arbitrary fragmentation fundamentally undermines sequential coherence by severing long-range dependencies—often cutting through cause-and-effect chains and leaving premises detached from their conclusions [
11]. By treating segments as independent units, the retriever becomes blind to the cross-referential links and recurrent motifs that bind scattered information into a unified whole, leading to phenomena such as the “lost in the middle” effect where central contextual information is overlooked [
11,
12]. Consequently, the intrinsic connections between semantic units are lost, making the reconstruction of global context exceptionally difficult. This issue is particularly pronounced in domains that rely heavily on deep contextual understanding, such as clinical documents [
13] and legal texts [
14]. To overcome these structural deficiencies, graph-augmented retrieval-augmented generation (Graph RAG) has emerged. Its core idea is to convert a document into a graph structure, thereby explicitly modeling entities, concepts, and the complex relationships among them. This paradigm has undergone a notable research evolution: early work primarily focused on enhancing language models with pre-constructed external knowledge graphs [
15,
16], but this approach was constrained by high construction costs and limited flexibility. Consequently, recent research has shifted towards text-native Graph RAG, which dynamically constructs graph structures directly from the source text, treating “graph construction for content” as a core task [
17]. This approach has already demonstrated potential in general-purpose frameworks [
18] and for tasks such as summarization [
19,
20].
Despite this progress, Graph RAG still faces several significant challenges when applied to long-document summarization. At the representation level, existing graph construction methods typically compress the rich semantic hierarchy of a document into a single-layer graph structure. Such methods, including those that rely on entity-relation extraction to identify named entities and their connections or those that link co-occurring [
21] or similar terms through keyword association, fail to preserve the multi-level logical hierarchy from high-level themes and mid-level arguments down to low-level details [
22]. Although some studies have begun to explore hierarchical or multi-granularity graph construction, including hierarchical Graph Neural Networks [
23,
24] and hierarchical Transformers [
25,
26], the challenge of systematically representing the complete logical hierarchy of a document remains unsolved. At the reasoning level, mainstream homogeneous graph models, such as standard Graph Neural Networks or Graph Language Models [
27], apply the same aggregation mechanism to all nodes in the graph. This approach fails to differentiate between the distinct reasoning patterns required for processing abstract concepts and concrete facts. Although the Mixture-of-Experts architecture, which routes different inputs to specialized sub-networks, has proven effective in multi-task scenarios [
28,
29], its potential when combined with graph attention networks to handle heterogeneous nodes has yet to be thoroughly explored. At the training level, the lack of fine-grained, node-level annotations [
14] for long-document summarization tasks means that existing Graph RAG methods often focus on optimizing intermediate stages, such as graph construction and retrieval [
17,
30]. This practice creates a significant gap between their training objectives and the final summarization task. Lacking the direct supervisory signals needed to identify critical nodes, the model struggles to learn precise information selection strategies, rendering the training process fraught with uncertainty [
31,
32].
Motivated by challenges in representation, reasoning, and training in standard RAG systems, we propose MOEGAT, a graph-augmented retrieval framework (
Figure 1). MOEGAT takes a document represented as an Orthogonal Context Graph and a user query as inputs. The query is encoded into an embedding that serves two roles: (1) driving a Router module to dynamically select the top-k most relevant Graph Attention Network experts, and (2) guiding final node scoring. The selected experts process graph nodes independently along two orthogonal subspaces, sequential discourse structure and long-range semantic dependencies, followed by non-linear integration of their outputs. The fused node representations are then scored against the query embedding in a Scoring Head. The top-N highest-scoring nodes (e.g., N = 6 in
Figure 1) are retrieved and passed to a Large Language Model for generation.
MOEGAT consists of three synergistic modules that work together to enable structure-aware retrieval and reasoning. The first module constructs an Orthogonal Context Graph. This graph explicitly encodes the document’s intrinsic structure through two types of orthogonal edges: sequential adjacency edges that preserve the linear narrative flow and textual order, and semantic proximity edges that capture global thematic relationships across distant segments. This dual representation provides a comprehensive structural prior that mitigates the context loss in traditional chunk-based approaches. The second module is a query-aware Mixture-of-Experts Graph Attention Network. Given the query embedding, a router dynamically selects a sparse set of specialized experts. Each chosen expert operates along a distinct reasoning pathway—corresponding to a specific semantic subspace such as sequential discourse or long-range semantics—by performing query-modulated attention and neighborhood aggregation within that pathway. This enables adaptive, task-specific context propagation across the graph. The third module introduces self-supervised training objectives to address the supervision gap. We generate differentiable soft labels by computing BERT-Score-based [
33] semantic similarity between each node and the reference summary. Training is then formulated as a pairwise ranking task, supplemented by auxiliary objectives that promote balanced expert utilization and stable optimization.
Experiments on multiple public long-document summarization datasets show that MOEGAT achieves competitive performance. Our main contributions are summarized as follows:
- (1)
Orthogonal Context Graph Modeling: A dual-edge graph that jointly captures sequential discourse and global semantic relationships, providing a richer structural prior for retrieval.
- (2)
Hierarchical Graph Reasoning with Dynamic Expert Routing: A sparse Mixture-of-Experts GAT with query-guided expert selection and subspace-specific attention, enabling efficient and adaptive context aggregation.
- (3)
Self-supervised Multi-objective Training: A ranking-based paradigm with soft semantic labels and balance regularization, facilitating effective training without fine-grained annotations.
2. Related Work
While prior Graph-RAG approaches mainly improve local retrieval for question-answering tasks using pre-constructed graphs [
34,
35,
36], they treat documents as flat chunks, failing to preserve the sequential flow and global thematic structure needed for holistic long-document summarization. In contrast, MOEGAT introduces an Orthogonal Context Graph to disentangle local sequential and long-range semantic dependencies within the document, and pairs it with a query-aware Mixture-of-Experts Graph Attention Network for dynamic reasoning, thereby enabling effective holistic long-document summarization.
More broadly, Retrieval-Augmented Generation has emerged as a mainstream paradigm to address the inherent knowledge limitations of Large Language Models and to mitigate their generation of hallucinations. Conventional RAG methods primarily rely on vector-based retrieval over unstructured text corpora, which essentially treats knowledge as isolated information fragments, thereby struggling to effectively capture and utilize the complex structural relationships embedded within the text [
37]. To overcome this limitation, researchers have recently begun to explore the integration of graph structures into the RAG framework, leading to the development of Graph-Augmented RAG methods. Graph structures can explicitly model the semantic associations between entities and have been demonstrated to enhance model performance in various question-answering and reasoning tasks. Specifically, current research is advancing along two primary fronts: one focused on structured retrieval enhancement based on graph indexing, and the other dedicated to leveraging Graph Neural Networks for representation learning and reasoning enhancement.
Structured Retrieval Enhancement Based on Graph Indexing. The core idea of this direction is to index knowledge by constructing text into a graph structure, which enables the retrieval of semantically coherent knowledge subgraphs to mitigate semantic drift and provide LLMs with precise context. For instance, some studies construct domain-specific knowledge graphs to support complex fault diagnosis [
38], while others utilize graph structures to address the challenges of cross-document information extraction and fusion, thereby improving the accuracy and detail of information retrieval [
39]. Other work models question-answering as an optimization problem on the graph, seeking an optimal subgraph to balance retrieval relevance with generation efficiency [
34].
GNN-Driven Representation Learning and Reasoning Enhancement. In contrast to static graph indexing, this line of research aims to integrate the dynamic learning capabilities of Graph Neural Networks (GNNs) to mine the deep, implicit semantic information within the graph. Through its characteristic message-passing mechanism, a GNN aggregates features from neighboring nodes, allowing the vector representation of a node to encapsulate not only its own information but also the semantics of its contextual environment. This creates a deep complementarity with the text comprehension abilities of LLMs. This capability serves multiple roles within the RAG pipeline: during the retrieval phase, GNNs can enhance recall quality by modeling semantic relations between passages [
35]; in the post-retrieval stage, they can function as re-rankers, optimizing the organizational logic of candidate documents based on graph-based structural associations [
40]; furthermore, end-to-end GNN-RAG frameworks have been developed to tackle question-answering scenarios that require complex graph reasoning [
36].
Existing Graph-RAG research, though varied in its technical approaches, is typically based on pre-constructed graph structures, and its application paradigms have primarily centered on question-answering tasks, where the objective is to enhance the quality of retrieval and generation for locally relevant information [
34,
35]; even when applied to long-document summarization, these methods are often confined to query-focused summarization [
9]. However, for summarization tasks that require a holistic comprehension of long documents, such QA-centric paradigms face significant challenges due to their inability to effectively integrate global information. Therefore, this paper explores the application of graph structures to enhance global summarization capabilities for long documents, with a focus on identifying and integrating the core themes and logical threads that span the entire text to produce a summary that reflects the document’s overall content.
3. Methods
We designed a graph-augmented retrieval framework for the task of long-document summarization, the overall architecture of which is illustrated in
Figure 1. The framework first converts a long document into an Orthogonal Context Graph (Ortho-Graph). Subsequently, we employ a query-aware Mixture-of-Experts Graph Attention Network (MoeGAT) to rank nodes on the Orthogonal Context Graph and retrieve query-relevant information. The Mixture-of-Experts module uses a gating network to route the input query to a subset of specialized experts, each comprising a lightweight Graph Attention layer focused on one graph subspace (sequential or semantic). The activated experts collectively form a query-tailored reasoning pathway, a dynamic sub-network that aggregates information along sequential or semantic edges based on the query's topical demands. The method for graph construction and the specific architecture of our model are detailed in the following sections.
3.1. Construction of the Orthogonal Context Graph (Ortho-Graph)
Conventional RAG approaches often process long documents by flattening them into independent, sequential chunks. This process inherently discards the rich structural information vital for deep comprehension. To address this limitation, we propose the Orthogonal Context Graph (Ortho-Graph). The Ortho-Graph is designed to explicitly represent two fundamental dimensions of text understanding: local sequential coherence and global semantic associations. We treat these dimensions as conceptually orthogonal. This orthogonality is formally realized by separating the information flow into two mutually exclusive axes: a sequential axis, which preserves the linear narrative flow, and a semantic axis, which captures long-range logical dependencies independent of linear proximity.
The Ortho-Graph is constructed by first splitting the document into an ordered sequence of text chunks (chunk size $c$, overlap $o$). Each chunk corresponds to a node $v_i$, and a pre-trained embedding model maps each node to a representation vector $x_i \in \mathbb{R}^d$, forming the node feature matrix $X \in \mathbb{R}^{N \times d}$.
Graph connectivity is defined via an adjacency matrix $A \in \{0,1\}^{N \times N}$, which is decomposed into two structurally distinct components: $A = A_{\text{seq}} + A_{\text{sem}}$. An edge exists between nodes $v_i$ and $v_j$ if $A_{ij} \neq 0$.
The sequential adjacency matrix $A_{\text{seq}}$ captures local narrative flow by linking only consecutive chunks:

$(A_{\text{seq}})_{ij} = \delta_{|i-j|,\,1},$

where $\delta$ is the Kronecker delta (see Supplementary Section S1 for details).
The semantic adjacency matrix $A_{\text{sem}}$ connects nodes based on content similarity while deliberately excluding local sequential neighbors to maintain orthogonality. This is enforced by a neighborhood exclusion mask, ensuring disjoint supports ($A_{\text{seq}} \odot A_{\text{sem}} = 0$) and restricting semantic edges to long-range associations. Full details of the masking and top-k selection process are provided in Supplementary Section S2.
By integrating sequential coherence with global semantic relations, the Ortho-Graph provides a rich structural prior that enhances retrieval and supports more coherent summarization.
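To make the construction concrete, here is a minimal NumPy sketch. The function and variable names are our own, and cosine similarity with a one-hop exclusion window is an assumption; the exact masking and top-k details live in Supplementary Section S2.

```python
import numpy as np

def build_ortho_graph(X: np.ndarray, k: int = 5, exclusion_window: int = 1):
    """Construct sequential and semantic adjacency matrices for the Ortho-Graph.

    X: (N, d) matrix of chunk embeddings (e.g., from a pre-trained encoder).
    k: number of semantic neighbors per node.
    exclusion_window: sequential neighbors excluded from semantic edges.
    """
    N = X.shape[0]

    # Sequential axis: link only consecutive chunks.
    A_seq = np.zeros((N, N))
    for i in range(N - 1):
        A_seq[i, i + 1] = A_seq[i + 1, i] = 1.0

    # Cosine similarity between all chunk pairs.
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    sim = Xn @ Xn.T

    # Neighborhood exclusion mask: forbid self-loops and local neighbors
    # so that semantic edges stay long-range (disjoint from A_seq).
    for i in range(N):
        lo, hi = max(0, i - exclusion_window), min(N, i + exclusion_window + 1)
        sim[i, lo:hi] = -np.inf

    # Semantic axis: top-k most similar non-adjacent chunks per node.
    A_sem = np.zeros((N, N))
    for i in range(N):
        for j in np.argsort(sim[i])[-k:]:
            if np.isfinite(sim[i, j]):
                A_sem[i, j] = A_sem[j, i] = 1.0

    return A_seq, A_sem
```

Symmetrizing each semantic edge keeps both adjacency matrices undirected while preserving the disjoint-support property.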
3.2. Node-Adaptive Dynamic Routing
Nodes in the Ortho-Graph play different functional roles, yet standard Graph Neural Networks apply uniform message passing that ignores this heterogeneity. To address this, we introduce a node-adaptive dynamic routing mechanism that builds a customized reasoning pathway for each node from a sparse subset of expert modules.
The process works as follows. Given node features $x_i \in \mathbb{R}^d$, we first compute expert affinities:

$s_{i,e} = x_i^{\top} w_e, \quad e = 1, \dots, E,$

where $W_g = [w_1, \dots, w_E] \in \mathbb{R}^{d \times E}$ is learnable. Each $s_{i,e}$ measures how well node $v_i$ matches expert $e$. For efficiency and specialization, we apply sparse activation. For each node $v_i$, we select the top-k experts with the highest affinity (see Supplementary Section S3 for selection details):

$\mathcal{E}_i = \operatorname{TopK}\big(\{s_{i,e}\}_{e=1}^{E},\, k\big).$

We then normalize weights locally over the selected experts:

$g_{i,e} = \frac{\exp(s_{i,e})}{\sum_{e' \in \mathcal{E}_i} \exp(s_{i,e'})}, \quad e \in \mathcal{E}_i.$

This ensures $\sum_{e \in \mathcal{E}_i} g_{i,e} = 1$ for each node. After the experts process the graph in parallel, the final node representation is a weighted combination:

$h_i = \sum_{e \in \mathcal{E}_i} g_{i,e}\, h_i^{(e)}.$

Here, $h_i^{(e)}$ is the output of expert $e$ for node $v_i$. This design promotes expert specialization while keeping computation efficient through sparsity.
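The routing computation admits a compact NumPy sketch. Function names are hypothetical, and the gating matrix is random here purely for illustration; in the model it is learned jointly with the experts.

```python
import numpy as np

def route_nodes(X: np.ndarray, W_g: np.ndarray, k: int = 2):
    """Top-k expert routing with locally renormalized softmax weights.

    X:   (N, d) node features.
    W_g: (d, E) gating matrix (learnable in the model).
    Returns per-node expert indices (N, k) and gate weights (N, k).
    """
    scores = X @ W_g                                  # (N, E) affinities
    top = np.argsort(scores, axis=1)[:, -k:]          # top-k experts per node
    top_scores = np.take_along_axis(scores, top, axis=1)
    # Softmax restricted to the selected experts (sums to 1 per node).
    z = np.exp(top_scores - top_scores.max(axis=1, keepdims=True))
    gates = z / z.sum(axis=1, keepdims=True)
    return top, gates

def combine_experts(expert_outputs: np.ndarray, top: np.ndarray,
                    gates: np.ndarray):
    """Weighted combination over the selected experts for each node.

    expert_outputs: (E, N, d_out) outputs of every expert for every node.
    """
    N = top.shape[0]
    sel = expert_outputs[top, np.arange(N)[:, None]]  # (N, k, d_out)
    return (gates[..., None] * sel).sum(axis=1)       # (N, d_out)
```

In practice only the selected experts would be evaluated per node; computing all expert outputs here keeps the sketch short.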
3.3. Query-Modulated Graph Attention Expert
Each activated expert is a Query-Modulated Graph Attention Expert (QM-GAE). Its goal is query-aware relational reasoning on the Ortho-Graph.
Figure 2 illustrates the overall process, which has three steps. First, the input query is encoded as a dense vector $q \in \mathbb{R}^d$. Second, each node feature $x_i$ is made query-aware: we concatenate $x_i$ and $q$, then pass them through an MLP:

$\tilde{x}_i = \mathrm{MLP}\big([x_i \,\Vert\, q]\big).$

Third, multi-head graph attention aggregates information from neighbors. For head $h$, the attention coefficient is:

$\alpha_{ij}^{(h)} = \frac{\exp\big(\mathrm{LeakyReLU}\big(a^{(h)\top}\big[W^{(h)}\tilde{x}_i \,\Vert\, W^{(h)}\tilde{x}_j\big]\big)\big)}{\sum_{j' \in \mathcal{N}_i} \exp\big(\mathrm{LeakyReLU}\big(a^{(h)\top}\big[W^{(h)}\tilde{x}_i \,\Vert\, W^{(h)}\tilde{x}_{j'}\big]\big)\big)},$

where $\mathcal{N}_i$ includes node $v_i$ and its direct neighbors, $a^{(h)}$ is a shared weight vector, and $W^{(h)}$ is the weight matrix for the $h$-th attention head. The final node representation of the expert is the concatenation of all heads:

$h_i^{(e)} = \big\Vert_{h=1}^{H}\, \sigma\Big(\textstyle\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{(h)} W^{(h)} \tilde{x}_j\Big).$
By injecting query semantics into both feature projection and attention (as shown in
Figure 2), QM-GAE produces task-relevant node representations. See
Supplementary Section S4 for full notation and implementation details.
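A single query-modulated attention head can be sketched as below. This is an illustrative NumPy version with assumed shapes and names; the full expert uses multiple heads whose outputs are concatenated, as described above.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def qm_gat_head(X, q, A, W_mlp, W, a):
    """One query-modulated graph attention head (illustrative sketch).

    X: (N, d) node features;  q: (d,) query embedding;  A: (N, N) adjacency.
    W_mlp: (2d, d) fuses [x_i || q] into query-aware features.
    W: (d, d_out) head projection;  a: (2*d_out,) shared attention vector.
    """
    N, d = X.shape
    # Step 2: query-aware projection of every node via concat + MLP.
    Xq = leaky_relu(np.concatenate([X, np.tile(q, (N, 1))], axis=1) @ W_mlp)
    Z = Xq @ W                                           # (N, d_out)
    # Step 3: attention logits e_ij = a^T [z_i || z_j], vectorized.
    a_src, a_dst = a[: Z.shape[1]], a[Z.shape[1]:]
    logits = leaky_relu((Z @ a_src)[:, None] + (Z @ a_dst)[None, :])
    # Restrict attention to each node's neighborhood (self-loop included).
    mask = (A + np.eye(N)) > 0
    logits = np.where(mask, logits, -np.inf)
    alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return np.tanh(alpha @ Z)                            # (N, d_out)
```

Masking with negative infinity before the softmax is the standard way to keep attention weights zero outside $\mathcal{N}_i$.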
3.4. Node Ranking and Summary Generation
Upon completion of the query-aware graph reasoning by the expert modules, a final representation $h_i$ is generated for each node. This representation is rich in contextual information and aligned with the query's intent. The ultimate task of this stage is to convert these node representations into a coherent, natural-language summary that synthesizes the core information. This process is initiated by a Scoring Head, which quantifies the relevance of each node by projecting its final representation $h_i$ back into the semantic space of the query $q$. This is achieved by computing the inner product $r_i = h_i^{\top} q$, which yields a scalar relevance score $r_i$ for each node $v_i$. These scores provide the quantitative basis for information selection. Based on these scores, the top-N nodes with the highest values are selected; the set of indices of these nodes identifies the most salient content fragments within the document. Finally, this context is injected into an instruction template and synthesized into the final summary $S$ by a Large Language Model.
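The scoring-and-selection step admits a compact sketch. The helper names and the prompt template are illustrative assumptions; top-N = 6 follows our retrieval setting.

```python
import numpy as np

def select_top_nodes(H, q, chunks, top_n=6):
    """Score nodes by inner product with the query and keep the top-N.

    H: (N, d) final node representations; q: (d,) query embedding;
    chunks: list of N chunk texts.
    """
    scores = H @ q                                   # r_i = h_i . q
    idx = np.argsort(scores)[::-1][:top_n]           # highest scores first
    idx = np.sort(idx)                               # restore document order
    return idx, [chunks[i] for i in idx]

def build_prompt(query, selected_chunks):
    """Inject retrieved context into a simple instruction template
    (the exact template wording is an illustrative assumption)."""
    context = "\n\n".join(selected_chunks)
    return ("Summarize the following document excerpts with respect to the "
            f"query.\n\nQuery: {query}\n\nContext:\n{context}\n\nSummary:")
```

Re-sorting the selected indices into document order keeps the retrieved context readable for the generator.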
3.5. Self-Supervised Semantic Alignment and Expert Load Balancing
To train the model effectively, we employ a composite loss function that pursues two main objectives: learning a meaningful node-importance ranking and ensuring balanced utilization of experts. The total loss combines a ranking term and a load-balancing term:

$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{rank}} + \lambda_2 \mathcal{L}_{\text{bal}}.$
The ranking loss is self-supervised. It uses semantic relevance scores derived from BERTScore, computed with DeBERTa-xlarge-mnli embeddings between each node and the reference summary [
20]. Nodes whose relevance exceeds the mean form the positive set $\mathcal{V}^{+}$; the remainder form the negative set $\mathcal{V}^{-}$. A pairwise margin ranking loss is then applied over all positive-negative pairs:

$\mathcal{L}_{\text{rank}} = \frac{1}{|\mathcal{P}|} \sum_{(p,\,n) \in \mathcal{P}} \max\big(0,\; m - s_p + s_n\big),$

where $\mathcal{P} = \mathcal{V}^{+} \times \mathcal{V}^{-}$ represents the set of all positive-negative sample pairs in the graph, $m$ is a margin hyperparameter that controls the degree of separation between positive and negative samples, and $s_p$ and $s_n$ are the model's predicted scores for the positive and negative sample nodes, respectively. This encourages the model to assign higher scores to more relevant nodes without requiring manual annotations.
To prevent routing collapse and promote expert specialization, we introduce an auxiliary load-balancing loss. This term combines a utilization component, which encourages uniform assignment of nodes across experts, and a confidence component, which favors decisive routing decisions. The detailed formulation, including both components averaged over layers, is provided in
Supplementary Section S5. Together, these objectives enable stable training and high-quality summarization performance.
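The pairwise ranking objective can be sketched as follows. This is a NumPy illustration with an assumed margin value; during training the scores are differentiable model outputs and the relevance labels come from BERTScore against the reference summary.

```python
import numpy as np

def margin_ranking_loss(scores, relevance, margin=0.1):
    """Self-supervised pairwise margin ranking loss.

    scores:    (N,) model scores for each node.
    relevance: (N,) soft labels (e.g., BERTScore vs. the reference summary).
    Nodes above the mean relevance are positives; the rest are negatives.
    The margin value 0.1 is an illustrative assumption.
    """
    pos = scores[relevance > relevance.mean()]
    neg = scores[relevance <= relevance.mean()]
    if len(pos) == 0 or len(neg) == 0:
        return 0.0
    # All positive-negative pairs via broadcasting: max(0, m - s_p + s_n).
    losses = np.maximum(0.0, margin - pos[:, None] + neg[None, :])
    return float(losses.mean())
```

The loss is zero whenever every positive node already outscores every negative node by at least the margin.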
4. Results
The preceding sections have introduced our core contributions: an Orthogonal Context Graph that explicitly separates local discourse from global semantic associations, and a node-adaptive Mixture-of-Experts routing mechanism that forms query-tailored reasoning pathways under self-supervised objectives. We now turn to a comprehensive experimental evaluation on three public long-document summarization datasets to demonstrate the effectiveness of these designs.
Experimental Setup
Datasets: Experiments were conducted on three public long-text summarization datasets to ensure the robustness and generalizability of our evaluation. These datasets cover diverse domains and introduce distinct structural challenges. QMSum [
41] consists of meeting transcripts from various domains; in this study, we focused on its “general query” portion, which allows validation of the model’s ability to integrate global information in informal, multi-speaker conversational scenarios. BookSum [
42] provides long narrative texts from literary works, thereby testing the model’s effectiveness in capturing long-range dependencies within extremely long contexts and complex narrative structures. Finally, WCEP [
43] addresses multi-document summarization of news events, enabling assessment of cross-document information deduplication, fusion, and organization. Performance was quantified using the standard ROUGE metrics [
44], namely ROUGE-1, ROUGE-2, and ROUGE-L, which measure content coverage and accuracy through lexical overlap between generated and reference summaries.
Implementation Details: In our experiments, long documents were segmented into text chunks of 256 tokens with an overlap of 64 tokens. Initial embeddings were generated using the Contriever model. Using these text chunks as nodes, we constructed a base graph by connecting adjacent chunks to preserve the local narrative flow and by connecting each chunk to its 5 semantically closest non-adjacent chunks to establish global relationships. This graph was then input into a 3-layer Moe-GAT model, where the node and hidden-layer dimensions were both set to 768. Each layer contained 4 experts, and a Top-2 routing strategy was used to dynamically select experts for each node. The model was trained using the AdamW optimizer with an initial learning rate of 1 × 10⁻⁴. The composite loss function consisted of two components: a self-supervised ranking loss (with weight $\lambda_1$) and an expert load-balancing loss (with weight $\lambda_2$), ensuring that the model learned node importance while maintaining a balanced utilization of experts. During the final summary generation stage, our trained Moe-GAT model functioned as the retriever. The query and the complete Orthogonal Context Graph were fed into the model, which then computed relevance scores for all nodes in the graph relative to the query. The top 6 most relevant nodes were selected based on these scores, and their content was input into the LLaMA-2-7b-chat model to generate the final summary via a greedy decoding strategy.
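For reference, the hyperparameters above can be collected into a single configuration sketch. The key names are our own; the loss weights are omitted because their values are not specified in the text.

```python
# Hyperparameters from the implementation details above (key names are ours).
CONFIG = {
    "chunk_size": 256,          # tokens per chunk
    "chunk_overlap": 64,        # token overlap between consecutive chunks
    "embedder": "Contriever",   # initial node embeddings
    "semantic_neighbors": 5,    # non-adjacent semantic edges per node
    "num_layers": 3,            # Moe-GAT depth
    "hidden_dim": 768,          # node and hidden-layer dimensions
    "num_experts": 4,           # experts per layer
    "top_k_experts": 2,         # routing sparsity
    "optimizer": "AdamW",
    "learning_rate": 1e-4,      # initial learning rate
    "top_n_nodes": 6,           # retrieved chunks fed to the LLM
    "generator": "LLaMA-2-7b-chat",
    "decoding": "greedy",
}
```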
Baselines: To ensure a comprehensive evaluation, we compare our proposed Moe-GAT model with multiple representative baselines, including: (1) Sparse Retrievers (i.e., BM25 [
41] and TF-IDF [
42]), (2) Dense Retrievers (i.e., Contriever [
43], DPR [
44], SBERT [
45], and Dragon [
46]), (3) Full-context Modeling (i.e., directly processing entire documents without retrieval), and (4) Structured Retrieval Methods (i.e., Node2Vec [
47], GOR [
16], and Thought-R [
48]). This systematic comparison enables thorough validation of our model’s effectiveness across different technical approaches.
4.1. Main Results
We conducted comprehensive experiments on the QMSum, WCEP, and BookSum datasets, comparing our proposed method against several baselines to evaluate its long-context global summarization capabilities. The results are presented in
Table 1. Our proposed Moe-GAT consistently outperforms all baseline methods across nearly all evaluation metrics and datasets. Thanks to the mixture-of-experts architecture combined with graph attention mechanisms, our model can effectively capture and integrate complex semantic relationships across different text chunks, thereby enhancing the retrieval and representation of key information for summarization. Moreover, the structured modeling of document content enables more precise identification of salient content, which is crucial for generating high-quality summaries.
Moe-GAT shows clear superiority over traditional retrieval methods. While structured retrieval methods like GOR demonstrate competitive performance, our approach further advances the state-of-the-art by leveraging more sophisticated graph-based representations. Compared to full-context modeling, Moe-GAT achieves better performance with significantly shorter input lengths (approximately 1.5K tokens), demonstrating its efficiency in handling long documents without information loss from truncation.
Additional Findings. (1) Sparse retrievers (BM25 and TF-IDF) produce suboptimal results as they rely solely on lexical matching without capturing semantic meanings. (2) Dense retrievers show varying performance, with Contriever achieving relatively better results, though still inferior to our method due to limitations in modeling global document structure. (3) Node2Vec produces unsatisfactory results since its random walk-based embeddings cannot be effectively optimized for the summarization task. (4) Although Thought-R demonstrates competitive results, it is still inferior to Moe-GAT due to insufficient exploration of the hierarchical relationships between document segments. (5) The “Full Context” approach suffers from information loss when processing extremely long documents, resulting in suboptimal performance despite accessing the complete document content.
Overall, Moe-GAT achieves the best results compared with various baselines across all three datasets, demonstrating the effectiveness of our proposed method in addressing the challenges of long-context global summarization.
4.2. Generalization Capability of the Framework Across Different Large Language Models
To rigorously evaluate the generalization capability and model-agnostic nature of the proposed framework, a comprehensive experimental analysis was conducted on three representative text summarization datasets: QMSum, Booksum, and WCEP. The experiments utilized several different backbone language models with the aim of verifying that the performance improvements stem from the intrinsic advantages of the Moe-GAT architecture itself, rather than from a coincidental synergy with any specific large model. We selected a series of high-performing and representative open-source large language models as backbones and evaluated them under two settings: “LLM-Only” and “Ours + LLM”. By comparing the performance of each backbone model, both standalone and in conjunction with the Moe-GAT framework, on the ROUGE-L, ROUGE-1, and ROUGE-2 metrics (with results presented in
Table 2,
Table 3 and
Table 4), we have derived the following key findings.
The experimental results demonstrate that the Moe-GAT framework induces universal and robust performance gains across various summarization tasks. On the QMSum dataset, which is characterized by multi-turn dialogs with complex interlocutor dependencies, all tested backbone models achieved consistent improvements in their ROUGE-L and ROUGE-1 scores after the integration of Moe-GAT. This improvement stems primarily from the Orthogonal Context Graph, which effectively preserves sequential discourse flows across turns. Such flows are frequently disrupted during direct full-text summarization by standalone LLMs.
This positive trend was further validated on the Booksum and WCEP datasets. On the Booksum dataset, which features lengthy narrative texts with strong sequential coherence, all seven models exhibited clear performance improvements when combined with our framework. The gains for Hunyuan-MT-7B were especially significant, achieving substantial leaps of over 12 points in ROUGE-L and 24 points in ROUGE-1. These pronounced gains highlight the framework’s strength in modeling long-range sequential dependencies via its dedicated sequential axis. This specifically benefits models that struggle with processing extended contexts in full-text mode.
Similarly, a comprehensive and consistent performance enhancement was observed across all models on the WCEP dataset, a multi-document news clustering task rich in thematic overlaps. In this case, the semantic axis of the Orthogonal Context Graph plays a pivotal role by capturing cross-document thematic links. This enables more accurate consensus identification than direct full-text processing by standalone LLMs. Such systematic improvement across different domains and task difficulties provides strong evidence for the universal effectiveness of the Moe-GAT framework in enhancing summary generation quality.
Furthermore, the Moe-GAT framework possesses excellent versatility and broad enhancement capabilities. Evaluation across text summarization tasks confirms that the framework is effective for a diverse range of large language models with heterogeneous architectures, including those from the GPT, Gemini, and Qwen families. Regardless of the backbone model, the introduction of Moe-GAT consistently improved summary quality. This confirms that its success relies on universal design principles rather than the internal mechanisms of any specific model.
Further quantitative analysis reveals an important pattern: the magnitude of the performance gain is inversely correlated with the baseline capability of the backbone model. Stronger models, such as InternLM2.5-7B-Chat or GLM-4-9B, already possess relatively effective internal context modeling for full-text summarization, leaving less room for external structural augmentation. In contrast, weaker baselines suffer more severely from context limitations in direct full-text processing. Examples include Hunyuan-MT-7B on Booksum or various Gemini variants. For these models, the query-aware Moe-GAT and Orthogonal Context Graph provide substantial complementary structural signals, resulting in larger absolute gains. Specifically, the framework provides robust yet modest improvements for highly optimized models. Conversely, it yields more substantial performance leaps for models with weaker baseline capabilities. This phenomenon indicates that Moe-GAT can adaptively provide a commensurate level of optimization based on the potential of different models, thereby universally elevating summarization performance across a wide range of architectures.
The experimental results demonstrate that: (1) The Moe-GAT framework induces universal and robust performance gains across different types of summarization tasks. On the QMSum dataset (characterized by multi-turn dialogs with complex interlocutor dependencies), all tested backbone models achieved consistent improvements in their ROUGE-L and ROUGE-1 scores after the integration of Moe-GAT. This is primarily because the Orthogonal Context Graph effectively preserves sequential discourse flows across turns, which are frequently disrupted in direct full-text summarization by standalone LLMs. This positive trend was further validated on the Booksum and WCEP datasets. Particularly on the Booksum dataset (featuring lengthy narrative texts with strong sequential coherence), all seven models exhibited clear performance improvements when combined with our framework, with the improvement for Hunyuan-MT-7B being especially significant, achieving substantial leaps of over 12 points in ROUGE-L and 24 points in ROUGE-1. These pronounced gains on Booksum highlight the framework’s strength in modeling long-range sequential dependencies via its dedicated sequential axis, particularly benefiting models that struggle with processing extended contexts in full-text mode. On the WCEP dataset (a multi-document news clustering task rich in thematic overlaps), a comprehensive and consistent performance enhancement was also observed across all models. Here, the semantic axis of the Orthogonal Context Graph plays a pivotal role by capturing cross-document thematic links, enabling more accurate consensus identification than direct full-text processing by standalone LLMs. This systematic improvement, spanning different domains and task difficulties, provides strong evidence for the universal effectiveness of the Moe-GAT framework in enhancing the quality of summary generation. (2) The Moe-GAT framework possesses excellent versatility and broad enhancement capabilities. 
In the evaluation of text summarization tasks, the framework was validated on a diverse range of large language models with heterogeneous architectures, including those from the GPT, Gemini, and Qwen families. Regardless of the backbone, introducing MoE-GAT consistently improved the quality of the generated summaries, confirming that its design principles apply universally rather than depending on the internal mechanisms of any specific model. Further quantitative analysis reveals an important pattern: the magnitude of the performance gain is inversely correlated with the baseline capability of the backbone model. This pattern arises because stronger models (e.g., InternLM2.5-7B-Chat or GLM-4-9B) already possess relatively effective internal context modeling for full-text summarization, leaving less room for external structural augmentation. In contrast, weaker baselines (e.g., Hunyuan-MT-7B on BookSum or the Gemini 2.5 variants) suffer more severely from context limitations in direct full-text processing; the query-aware MoE-GAT and Orthogonal Context Graph therefore provide substantial complementary structural signals, resulting in larger absolute gains. Specifically, for highly optimized models such as InternLM2.5-7B-Chat, the framework provides robust yet modest improvements, whereas for models with weaker baseline capabilities it yields more substantial performance leaps. This indicates that MoE-GAT adaptively provides a commensurate level of optimization based on the potential of each model, thereby universally elevating summarization performance across a wide range of architectures.
4.3. Ablation Study
We performed a series of ablation experiments to deeply investigate the contribution of each core component in the proposed framework. The results appear in
Table 5. Our analysis reveals the distinct and complementary roles of each component, allowing us to quantify their relative importance to the overall performance.
(1) The MoE architecture is fundamental for modeling long and heterogeneous documents. We implemented the “w/o MoE” variant by setting the number of experts to 1, forcing all graph nodes to be processed by a single expert with identical parameters. The resulting performance degradation, consistent across all datasets, underscores that expert specialization is non-trivial. The impact is most pronounced on BookSum, which features the longest and most thematically diverse narratives, with a 15.1% relative drop in ROUGE-L. This indicates that the MoE framework contributes most significantly to handling the varied reasoning demands of lengthy texts.
(2) The orthogonal graph decomposition is crucial, with the semantic and sequential structures contributing differentially depending on the dataset characteristics. The “w/o Semantic” variant, which removes semantic adjacency, causes the most severe drops on datasets rich in thematic content; for instance, ROUGE-1 on BookSum plunges from 30.2 to 17.1 (a 43.3% relative drop), highlighting that the semantic graph is the primary driver for capturing long-range topical dependencies. Conversely, the “w/o Sequential” variant, which removes sequential links, more significantly harms coherence in structured dialogs or narratives, as seen in the notable ROUGE-L decrease on QMSum (from 22.1 to 18.3). Therefore, while both structures are essential, the semantic component contributes most to overall informativeness on content-rich datasets, whereas the sequential component is vital for maintaining local flow.
(3) The composite training objective ensures task alignment and training stability. The variant that removes the self-supervised ranking loss shows clear performance reductions across datasets, highlighting this loss’s role in producing discriminative node representations optimized for relevance ranking in summarization. The variant that removes the load-balancing loss leads to moderate but consistent declines; for example, ROUGE-1 drops from 30.2 to 21.2 on BookSum. The drop is larger here because longer documents generate more nodes and thus amplify load-imbalance issues. This outcome confirms that explicit load balancing prevents expert collapse and preserves the full representational power of the MoE architecture.
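The composite objective combines a self-supervised ranking term with a load-balancing term. The sketch below illustrates one plausible form of each; both are assumptions for illustration (a margin-based pairwise ranking loss over golden versus ordinary nodes, and a Switch-Transformer-style balancing penalty), and the placeholder weights `lambda_rank` and `lambda_bal` are not the paper’s coefficients.

```python
import numpy as np

def ranking_loss(scores, gold_mask, margin=1.0):
    """Margin ranking term (assumed form): every golden node should
    outscore every ordinary node by at least `margin`."""
    pos = scores[gold_mask]
    neg = scores[~gold_mask]
    # Pairwise hinge over all (golden, ordinary) pairs.
    diffs = margin - (pos[:, None] - neg[None, :])
    return np.maximum(diffs, 0.0).mean()

def load_balancing_loss(router_probs):
    """Switch-Transformer-style balancing term (assumed form):
    penalizes deviation of per-expert load from the uniform 1/E share.
    `router_probs` has shape (num_tokens, num_experts)."""
    n_experts = router_probs.shape[1]
    # Fraction of tokens hard-routed to each expert.
    frac_tokens = np.bincount(router_probs.argmax(1),
                              minlength=n_experts) / len(router_probs)
    # Mean routing probability mass per expert.
    frac_probs = router_probs.mean(axis=0)
    return n_experts * np.sum(frac_tokens * frac_probs)

def composite_objective(scores, gold_mask, router_probs,
                        lambda_rank=1.0, lambda_bal=0.01):
    # Placeholder weights; the paper's actual coefficients are not shown here.
    return (lambda_rank * ranking_loss(scores, gold_mask)
            + lambda_bal * load_balancing_loss(router_probs))
```

With perfectly separated scores the ranking term vanishes, while the balancing term reaches its minimum of 1.0 under uniform routing, so the two terms pull in complementary directions during training.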
4.4. Computational Efficiency Analysis
We address concerns regarding the computational complexity of MoE-GAT by distinguishing between the graph construction (preprocessing) and inference stages. Measurements are conducted on the WCEP dataset using a consistent LLM backbone, excluding the LLM generation time, which is identical across methods.
Graph construction entails splitting the document into overlapping chunks, encoding them with a pre-trained embedding model to generate node representations, and constructing the sparse sequential and masked top-k semantic adjacency matrices. The dominant cost lies in the embedding step and varies significantly with the embedding model. Subsequent steps—building the sparse sequential adjacency matrix (O(N)) and the masked top-k semantic adjacency matrix—are considerably lighter, as they operate on precomputed embeddings with efficient sparse operations. Graph construction is performed once per document and can be fully offline, making it amortizable in practical deployments.
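The two-axis construction described above can be sketched as follows, operating on precomputed chunk embeddings. The exact masking rules (excluding self-loops and sequential edges from the semantic axis) and the symmetric treatment of sequential links are illustrative assumptions, not the paper’s specification.

```python
import numpy as np

def build_orthogonal_adjacency(embeddings: np.ndarray, k: int = 5):
    """Build the sequential and masked top-k semantic adjacency matrices
    from an (N, d) matrix of chunk embeddings (illustrative sketch)."""
    n = embeddings.shape[0]

    # Sequential axis: link each chunk to its successor -- O(N) edges.
    seq_adj = np.zeros((n, n), dtype=np.float32)
    idx = np.arange(n - 1)
    seq_adj[idx, idx + 1] = 1.0
    seq_adj[idx + 1, idx] = 1.0

    # Semantic axis: cosine similarity, masked so that self-loops and
    # edges already covered by the sequential axis are excluded, then
    # sparsified to each node's top-k most similar neighbors.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)
    sim[seq_adj > 0] = -np.inf  # keep the two axes disjoint (assumption)
    sem_adj = np.zeros((n, n), dtype=np.float32)
    topk = np.argsort(-sim, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    sem_adj[rows, topk.ravel()] = 1.0
    return seq_adj, sem_adj
```

Because both matrices hold at most O(N) and O(kN) nonzeros respectively, they can be stored in sparse form and reused across all queries against the same document.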
During inference, MoE-GAT performs dynamic routing, query-aware Mixture-of-Experts graph attention propagation over the pre-constructed sparse Orthogonal Context Graph, and node ranking with the learned scoring mechanism. The high sparsity of the graph keeps attention efficient. As shown in
Table 6, MoE-GAT averages 1.2 s per query, versus 0.58 s for GoR, a modest increase of roughly 0.6 s that stems mainly from dynamic routing and the multi-expert layers. Given the substantial gains over GoR (+14.9% ROUGE-L, +18.1% ROUGE-1, +18.4% ROUGE-2), this minor overhead is well justified in practical applications.
5. Discussion
5.1. Efficacy of the Scoring Mechanism
This mechanism is a critical component of the framework; its core function is to rank nodes accurately. To evaluate it quantitatively, we computed the model-predicted score for each node on the test set and compared it against the corresponding ground-truth score, plotting predicted versus target scores to inspect their correlation. Results are visualized in
Figure 3.
Figure 3a shows the linear regression fit between model-predicted relevance scores and ground-truth scores for each node. We observe a strong positive linear correlation, evidenced by a Pearson correlation coefficient of r = 0.759 (
p < 0.001). Additionally, Spearman’s rank correlation coefficient of ρ = 0.714 confirms high monotonic consistency in ranking. The coefficient of determination, R² = 0.576, indicates that the model explains more than half of the variance in the ground-truth scores. Compared to typical relevance prediction performance reported in complex natural language understanding tasks [
53], these results demonstrate robust and reliable score prediction capabilities. This finding is further supported by the two-dimensional kernel density estimation in
Figure 3b, where the joint distribution reveals that most data points concentrate tightly around the regression line. This pattern confirms consistent performance across the majority of samples and validates that our scoring head successfully assigns higher relevance to critical content segments.
5.2. Sensitivity Analysis of the Number of Semantic Neighbors
We conducted a systematic study to examine the influence of
k, the number of semantic neighbors in the Orthogonal Context Graph, on ranking performance. Results for the WCEP and QMSum datasets are shown in
Figure 4. As
k increases from small values, performance on both datasets improves rapidly at first, reflecting the benefit of incorporating more long-range semantic connections. Performance then reaches a peak and subsequently declines or plateaus, indicating the onset of noise from overly distant or irrelevant neighbors. The optimal
k and the model’s sensitivity to this hyperparameter vary markedly between datasets. On the WCEP dataset, which features high content heterogeneity due to multi-document news clusters containing diverse topics, advertisements, and ancillary elements, the model achieves peak ranking performance at
k = 5. Beyond this point, further increases in
k lead to a steady decline, as additional semantic edges frequently connect core content to noisy or irrelevant segments. This noise dilutes the signal during graph propagation and reduces retrieval precision. In contrast, the QMSum dataset consists of meeting transcripts that are more semantically dense and thematically coherent within each document. Here, the model attains its highest performance at
k = 7, with only minor degradation even as
k continues to increase up to 10 or higher. This greater robustness suggests that extra semantic neighbors in QMSum are more likely to link genuinely related utterances across speakers or agenda items, providing useful supplementary context rather than noise.
These dataset-specific patterns underscore the value of tuning k based on document characteristics: lower values suit heterogeneous or noisy sources to minimize spurious connections, while higher values are preferable for focused, cohesive texts. Overall, the results confirm that our Orthogonal Context Graph construction is not only effective but also exhibits appropriate sensitivity to this key hyperparameter, enabling strong adaptability across diverse long-document scenarios in practice.
5.3. Analysis of the Impact of the Load-Balancing Loss on Architectural Stability
We performed a quantitative comparison between the full model, which incorporates the load-balancing loss, and an ablated variant without this loss (w/o load-balancing). Results are presented in
Figure 5 from two perspectives: the dynamic evolution of expert load during training (a) and the steady-state distribution after convergence (b). Without the load-balancing loss, the routing mechanism exhibits severe imbalance after convergence, as shown in
Figure 5b. One expert (expert 5) processes 45.9% of the input tokens—nearly four times the ideal uniform share of 12.5% for 8 experts—while experts 2, 3, 4, and 7 receive almost no tokens. This “winner-take-all” phenomenon causes the multi-expert system to degenerate into a structure dominated by a single expert, substantially reducing overall model capacity. The load balance score drops dramatically from 0.912 in the full model to 0.341 in the ablated version, confirming routing collapse.
In contrast, the full model with the load-balancing loss maintains healthy utilization.
Figure 5a illustrates that, despite initial fluctuations, the load on all eight experts rapidly converges and stabilizes around the ideal average. This demonstrates that the load-balancing loss serves as an effective regularizer, enforcing uniform input distribution and ensuring each expert receives sufficient training signal. By preserving expert diversity and capacity, this mechanism contributes directly to the superior ROUGE performance observed in the full model.
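A load balance score of the kind cited above can be computed, for example, as the normalized entropy of the per-expert load distribution; this is an assumed definition for illustration, since the paper’s exact metric is not reproduced here.

```python
import numpy as np

def load_balance_score(expert_loads):
    """Normalized entropy of the expert load distribution (assumed
    metric): 1.0 means perfectly uniform utilization, values near 0
    indicate winner-take-all collapse onto a single expert."""
    p = np.asarray(expert_loads, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # 0 * log 0 = 0 by convention
    entropy = -np.sum(p * np.log(p))
    # Normalize by the maximum entropy, log(E) for E experts.
    return entropy / np.log(len(expert_loads))
```

Under this definition, a perfectly uniform 8-expert load scores 1.0, while a heavily skewed distribution of the kind shown in Figure 5b scores far lower, mirroring the qualitative gap between the full and ablated models.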
5.4. Analysis of the Role of the Self-Supervised Ranking Loss in the Separability of the Representation Space
We investigated the role of the self-supervised ranking loss in representation learning through a comparative visualization study. Final node representations from the full model and an ablated variant without the ranking loss are projected into two dimensions using UMAP. Results are shown in
Figure 6, where red points represent golden nodes that are highly relevant to the reference summary and blue points represent ordinary nodes. In the ablated model (right panel), golden and ordinary nodes are heavily intermingled with no discernible separable structure, indicating that the main task objective alone provides insufficient signal for the model to distinguish nodes of varying importance and results in poor representation separability. The full model (left panel) behaves markedly differently: golden nodes form a compact, well-defined cluster clearly separated from the region occupied by ordinary nodes. This pronounced boundary demonstrates that the self-supervised ranking loss supplies a task-aligned supervisory signal that effectively guides the encoder to separate critical from secondary content. By serving as a carefully designed proxy task, the ranking loss imposes structure on the representation space, enabling better differentiation of node importance and contributing directly to improved summary quality.
5.5. Analysis of the Correlation Between Internal Decision Mechanisms and Summarization Performance
High-quality summarization depends critically on the model’s ability to accurately assess sentence importance. To examine this internal decision-making process and its relation to generation performance, we conducted a qualitative visual analysis on 34 test samples from the QMSum dataset. Results are presented in
Figure 7, with the top panel showing ROUGE-L scores for each sample and the bottom panel displaying a heatmap of model-assigned importance scores for sentences in the corresponding source document (darker colors indicate higher importance; selected summary sentences are bordered in dark gray). Based on this visual analysis, the following conclusions were drawn:
(1) Accurate Judgments and an Effective Scoring Mechanism: The heatmaps collectively indicate that the model can precisely identify salient sentences. In the vast majority of samples, the sentences ultimately selected for the summary (highlighted with dark gray borders) correspond closely to the most intensely colored regions of the heatmap, which represent the highest importance scores. For example, in samples 17, 20, and 34, the model exhibits a pronounced “peak” effect, assigning extremely high scores to a few sentences while the rest receive markedly lower ones. This demonstrates strong alignment between the model’s internal scoring and its summary generation behavior, confirming that the designed scoring architecture can effectively identify high-information-density content. (2) Overcoming Positional Bias with Global Information Awareness: The analysis reveals that the model successfully avoids the “lead bias” common in traditional summarization methods. Along the vertical axis of the heatmaps (normalized sentence position), high-scoring salient sentences are not concentrated at the beginning of the documents (the 0.0–0.2 range) but are distributed widely throughout the text. For instance, the key information in sample 20 is concentrated in the initial part of the text, whereas the salient sentences in samples 5, 15, and 33 are located primarily in the middle and latter sections. This indicates that the model possesses a global information-aware capability, enabling it to assess sentence importance based on semantic value rather than positional cues, which is particularly crucial for processing long documents with diverse structures.
(3) Positive Correlation between Scoring Confidence and Summary Quality: A comparison of the ROUGE-L scores and the heatmap patterns reveals a clear association between the two. In samples with high ROUGE-L scores (e.g., samples 17 and 34), the heatmaps exhibit high contrast, with a few sentences receiving scores significantly higher than the rest. Conversely, in samples with lower scores (e.g., samples 5 and 29), the color distribution is relatively uniform. This suggests that the more “confident” the model is in its judgments—that is, when the scores of salient sentences are highly prominent—the higher the quality of the generated summary tends to be. In summary, the model demonstrates high accuracy and global awareness in assessing sentence importance, providing a solution for long-document summarization that is both effective and interpretable.
5.6. Limitations
A primary limitation of the proposed method lies in its strong dependence on the quality of the graph construction. We front-load the complex semantic analysis task; while this reduces the computational burden on the model, it also shifts the performance bottleneck to the graph construction stage. This limitation is particularly pronounced on the BookSum dataset, where for extremely long and complex texts such as books, the current graph construction method struggles to fully capture the macroscopic narrative structure. This reveals a critical issue: even a model with a highly stable and load-balanced internal architecture (see
Figure 5) will have its ultimate efficacy constrained by the quality of its input representation (i.e., the graph structure). Another limitation stems from the implicit division of labor among the experts, which restricts the model’s controllability. While the load-balancing loss successfully ensures that all experts are activated, we did not explicitly guide their functional differentiation. As a result, the model functions as an effective “generalist” rather than a collection of interpretable and steerable “specialists”. Consequently, it lacks the flexibility required for applications that involve generating summaries from a specific perspective.