1. Introduction
The rapid advancement of Large Language Models has recently led to remarkable performance across a wide array of language modeling tasks [
1]. Within this landscape, the ability to accurately and efficiently comprehend and summarize long-form documents has emerged as a key frontier challenge [
2,
3]. To address this, the research community has largely pursued two principal technical approaches: one focusing on long-context architectures that expand the model’s native context window, and the other on the Retrieval-Augmented Generation paradigm [
4,
5]. Unlike long-context strategies that attempt to process the entire document in one pass, RAG offers a more flexible and computationally efficient solution. It functions by precisely locating and extracting a small set of critical information snippets to guide the LLM’s generation process. This fundamental “retrieve-then-generate” principle has demonstrated its efficacy in a multitude of knowledge-intensive tasks [
6,
7]. Notably, our approach operates in a zero-shot manner without dataset-specific fine-tuning, prioritizing strong generalization while maintaining competitive generation quality.
However, despite these advantages, conventional RAG methods face a critical limitation: they often fail to capture the holistic meaning and global context necessary for understanding complex documents [
6,
7,
8]. This deficiency arises primarily from the widely used “chunk flattening” strategy. Unlike human reading, which follows a continuous narrative flow, this approach mechanically partitions documents into fixed-size segments based on token counts rather than semantic boundaries [
9,
10]. Such arbitrary fragmentation fundamentally undermines sequential coherence by severing long-range dependencies—often cutting through cause-and-effect chains and leaving premises detached from their conclusions [
11]. By treating segments as independent units, the retriever becomes blind to the cross-referential links and recurrent motifs that bind scattered information into a unified whole, leading to phenomena such as the “lost in the middle” effect where central contextual information is overlooked [
11,
12]. Consequently, the intrinsic connections between semantic units are lost, making the reconstruction of global context exceptionally difficult. This issue is particularly pronounced in domains that rely heavily on deep contextual understanding, such as clinical documents [
13] and legal texts [
14]. To overcome these structural deficiencies, graph-augmented retrieval-augmented generation (Graph RAG) has emerged. Its core idea is to convert a document into a graph structure, thereby explicitly modeling entities, concepts, and the complex relationships among them. This paradigm has undergone a notable research evolution: early work primarily focused on enhancing language models with pre-constructed external knowledge graphs [
15,
16], but this approach was constrained by high construction costs and limited flexibility. Consequently, recent research has shifted towards text-native Graph RAG, which dynamically constructs graph structures directly from the source text, treating “graph construction for content” as a core task [
17]. This approach has already demonstrated potential in general-purpose frameworks [
18] and for tasks such as summarization [
19,
20].
Despite this progress, Graph RAG still faces several significant challenges when applied to long-document summarization. At the representation level, existing graph construction methods typically compress the rich semantic hierarchy of a document into a single-layer graph structure. Such methods, including those that rely on entity-relation extraction to identify named entities and their connections or those that link co-occurring [
21] or similar terms through keyword association, fail to preserve the multi-level logical hierarchy from high-level themes and mid-level arguments down to low-level details [
22]. Although some studies have begun to explore hierarchical or multi-granularity graph construction, including hierarchical Graph Neural Networks [
23,
24] and hierarchical Transformers [
25,
26], the challenge of systematically representing the complete logical hierarchy of a document remains unsolved. At the reasoning level, mainstream homogeneous graph models, such as standard Graph Neural Networks or Graph Language Models [
27], apply the same aggregation mechanism to all nodes in the graph. This approach fails to differentiate between the distinct reasoning patterns required for processing abstract concepts and concrete facts. Although the Mixture-of-Experts architecture, which routes different inputs to specialized sub-networks, has proven effective in multi-task scenarios [
28,
29], its potential when combined with graph attention networks to handle heterogeneous nodes has yet to be thoroughly explored. At the training level, the lack of fine-grained, node-level annotations [
14] for long-document summarization tasks means that existing Graph RAG methods often focus on optimizing intermediate stages, such as graph construction and retrieval [
17,
30]. This practice creates a significant gap between their training objectives and the final summarization task. Lacking the direct supervisory signals needed to identify critical nodes, the model struggles to learn precise information selection strategies, rendering the training process fraught with uncertainty [
31,
32].
Motivated by challenges in representation, reasoning, and training in standard RAG systems, we propose MOEGAT, a graph-augmented retrieval framework (
Figure 1). MOEGAT takes a document represented as an Orthogonal Context Graph and a user query as inputs. The query is encoded into an embedding that serves two roles: (1) driving a Router module to dynamically select the top-k most relevant Graph Attention Network experts, and (2) guiding final node scoring. The selected experts process graph nodes independently along two orthogonal subspaces, sequential discourse structure and long-range semantic dependencies, followed by non-linear integration of their outputs. The fused node representations are then scored against the query embedding in a Scoring Head. The top-N highest-scoring nodes (e.g., N = 6 in
Figure 1) are retrieved and passed to a Large Language Model for generation.
MOEGAT consists of three synergistic modules that work together to enable structure-aware retrieval and reasoning. The first module constructs an Orthogonal Context Graph. This graph explicitly encodes the document’s intrinsic structure through two types of orthogonal edges: sequential adjacency edges that preserve the linear narrative flow and textual order, and semantic proximity edges that capture global thematic relationships across distant segments. This dual representation provides a comprehensive structural prior that mitigates the context loss in traditional chunk-based approaches. The second module is a query-aware Mixture-of-Experts Graph Attention Network. Given the query embedding, a router dynamically selects a sparse set of specialized experts. Each chosen expert operates along a distinct reasoning pathway—corresponding to a specific semantic subspace such as sequential discourse or long-range semantics—by performing query-modulated attention and neighborhood aggregation within that pathway. This enables adaptive, task-specific context propagation across the graph. The third module introduces self-supervised training objectives to address the supervision gap. We generate differentiable soft labels by computing BERT-Score-based [
33] semantic similarity between each node and the reference summary. Training is then formulated as a pairwise ranking task, supplemented by auxiliary objectives that promote balanced expert utilization and stable optimization.
Experiments on multiple public long-document summarization datasets show that MOEGAT achieves competitive performance. Our main contributions are summarized as follows:
- (1)
Orthogonal Context Graph Modeling: A dual-edge graph that jointly captures sequential discourse and global semantic relationships, providing a richer structural prior for retrieval.
- (2)
Hierarchical Graph Reasoning with Dynamic Expert Routing: A sparse Mixture-of-Experts GAT with query-guided expert selection and subspace-specific attention, enabling efficient and adaptive context aggregation.
- (3)
Self-supervised Multi-objective Training: A ranking-based paradigm with soft semantic labels and balance regularization, facilitating effective training without fine-grained annotations.
2. Related Work
While prior Graph-RAG approaches mainly improve local retrieval for question-answering tasks using pre-constructed graphs [
34,
35,
36], they treat documents as flat chunks, failing to preserve the sequential flow and global thematic structure needed for holistic long-document summarization. In contrast, MOEGAT introduces an Orthogonal Context Graph to disentangle local sequential and long-range semantic dependencies within the document, and pairs it with a query-aware Mixture-of-Experts Graph Attention Network for dynamic reasoning, thereby enabling effective holistic long-document summarization.
More broadly, Retrieval-Augmented Generation has emerged as a mainstream paradigm to address the inherent knowledge limitations of Large Language Models and to mitigate their generation of hallucinations. Conventional RAG methods primarily rely on vector-based retrieval over unstructured text corpora, which essentially treats knowledge as isolated information fragments, thereby struggling to effectively capture and utilize the complex structural relationships embedded within the text [
37]. To overcome this limitation, researchers have recently begun to explore the integration of graph structures into the RAG framework, leading to the development of Graph-Augmented RAG methods. Graph structures can explicitly model the semantic associations between entities and have been demonstrated to enhance model performance in various question-answering and reasoning tasks. Specifically, current research is advancing along two primary fronts: one focused on structured retrieval enhancement based on graph indexing, and the other dedicated to leveraging Graph Neural Networks for representation learning and reasoning enhancement.
Structured Retrieval Enhancement Based on Graph Indexing. The core idea of this direction is to index knowledge by constructing text into a graph structure, which enables the retrieval of semantically coherent knowledge subgraphs to mitigate semantic drift and provide LLMs with precise context. For instance, some studies construct domain-specific knowledge graphs to support complex fault diagnosis [
38], while others utilize graph structures to address the challenges of cross-document information extraction and fusion, thereby improving the accuracy and detail of information retrieval [
39]. Other work models question-answering as an optimization problem on the graph, seeking an optimal subgraph to balance retrieval relevance with generation efficiency [
34].
GNN-Driven Representation Learning and Reasoning Enhancement. In contrast to static graph indexing, this line of research aims to integrate the dynamic learning capabilities of Graph Neural Networks (GNNs) to mine the deep, implicit semantic information within the graph. Through its characteristic message-passing mechanism, a GNN aggregates features from neighboring nodes, allowing the vector representation of a node to encapsulate not only its own information but also the semantics of its contextual environment. This creates a deep complementarity with the text comprehension abilities of LLMs. This capability serves multiple roles within the RAG pipeline: during the retrieval phase, GNNs can enhance recall quality by modeling semantic relations between passages [
35]; in the post-retrieval stage, they can function as re-rankers, optimizing the organizational logic of candidate documents based on graph-based structural associations [
40]; furthermore, end-to-end GNN-RAG frameworks have been developed to tackle question-answering scenarios that require complex graph reasoning [
36].
Existing Graph-RAG research, though varied in its technical approaches, is typically based on pre-constructed graph structures, and its application paradigms have primarily centered on question-answering tasks, where the objective is to enhance the quality of retrieval and generation for locally relevant information [
34,
35]; even when applied to long-document summarization, these methods are often confined to query-focused summarization [
9]. However, for summarization tasks that require a holistic comprehension of long documents, such QA-centric paradigms face significant challenges due to their inability to effectively integrate global information. Therefore, this paper explores the application of graph structures to enhance global summarization capabilities for long documents, with a focus on identifying and integrating the core themes and logical threads that span the entire text to produce a summary that reflects the document’s overall content.
3. Methods
We designed a graph-augmented retrieval framework for the task of long-document summarization, the overall architecture of which is illustrated in
Figure 1. The framework first converts a long document into an Orthogonal Context Graph (Ortho-Graph). Subsequently, we employ a query-aware Mixture-of-Experts Graph Attention Network (MoeGAT) to rank nodes on the Orthogonal Context Graph and retrieve query-relevant information. The Mixture-of-Experts module uses a gating network to route the input query to a subset of specialized experts, each comprising a lightweight Graph Attention layer focused on one graph subspace (sequential or semantic). The activated experts collectively form a query-tailored reasoning pathway, a dynamic sub-network that aggregates information along sequential or semantic edges based on the query's topical demands. The method for graph construction and the specific architecture of our model are detailed in the following sections.
3.1. Construction of the Orthogonal Context Graph (Ortho-Graph)
Conventional RAG approaches often process long documents by flattening them into independent, sequential chunks. This process inherently discards the rich structural information vital for deep comprehension. To address this limitation, we propose the Orthogonal Context Graph (Ortho-Graph). The Ortho-Graph is designed to explicitly represent two fundamental dimensions of text understanding: local sequential coherence and global semantic associations. We treat these dimensions as conceptually orthogonal. This orthogonality is formally realized by separating the information flow into two mutually exclusive axes: a sequential axis, which preserves the linear narrative flow, and a semantic axis, which captures long-range logical dependencies independent of linear proximity.
The Ortho-Graph is constructed by first splitting the document into an ordered sequence of text chunks (chunk size $c$, overlap $o$). Each chunk corresponds to a node $v_i$, and a pre-trained embedding model maps each node to a representation vector $x_i \in \mathbb{R}^d$, forming the node feature matrix $X \in \mathbb{R}^{N \times d}$.
Graph connectivity is defined via an adjacency matrix $A \in \{0,1\}^{N \times N}$, which is decomposed into two structurally distinct components: $A = A_{\text{seq}} + A_{\text{sem}}$. An edge exists between nodes $v_i$ and $v_j$ if $A_{ij} \neq 0$.
The sequential adjacency matrix $A_{\text{seq}}$ captures local narrative flow by linking only consecutive chunks:

$(A_{\text{seq}})_{ij} = \delta_{|i-j|,\,1},$

where $\delta$ is the Kronecker delta (see Supplementary Section S1 for details).
The semantic adjacency matrix $A_{\text{sem}}$ connects nodes based on content similarity while deliberately excluding local sequential neighbors to maintain orthogonality. This is enforced by a neighborhood exclusion mask, ensuring disjoint supports ($A_{\text{seq}} \odot A_{\text{sem}} = 0$) and restricting semantic edges to long-range associations. Full details of the masking and top-k selection process are provided in Supplementary Section S2.
By integrating sequential coherence with global semantic relations, the Ortho-Graph provides a rich structural prior that enhances retrieval and supports more coherent summarization.
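To make the construction concrete, here is a minimal NumPy sketch. The function and variable names are our own, and cosine similarity with a one-hop exclusion window is an assumption; the exact masking and top-k details live in Supplementary Section S2.

```python
import numpy as np

def build_ortho_graph(X: np.ndarray, k: int = 5, exclusion_window: int = 1):
    """Construct sequential and semantic adjacency matrices for the Ortho-Graph.

    X: (N, d) matrix of chunk embeddings (e.g., from a pre-trained encoder).
    k: number of semantic neighbors per node.
    exclusion_window: sequential neighbors excluded from semantic edges.
    """
    N = X.shape[0]

    # Sequential axis: link only consecutive chunks.
    A_seq = np.zeros((N, N))
    for i in range(N - 1):
        A_seq[i, i + 1] = A_seq[i + 1, i] = 1.0

    # Cosine similarity between all chunk pairs.
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    sim = Xn @ Xn.T

    # Neighborhood exclusion mask: forbid self-loops and local neighbors
    # so that semantic edges stay long-range (disjoint from A_seq).
    for i in range(N):
        lo, hi = max(0, i - exclusion_window), min(N, i + exclusion_window + 1)
        sim[i, lo:hi] = -np.inf

    # Semantic axis: top-k most similar non-adjacent chunks per node.
    A_sem = np.zeros((N, N))
    for i in range(N):
        for j in np.argsort(sim[i])[-k:]:
            if np.isfinite(sim[i, j]):
                A_sem[i, j] = A_sem[j, i] = 1.0

    return A_seq, A_sem
```

Symmetrizing each semantic edge keeps both adjacency matrices undirected while preserving the disjoint-support property.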
3.2. Node-Adaptive Dynamic Routing
Nodes in the Ortho-Graph play different functional roles, yet standard Graph Neural Networks apply uniform message passing that ignores this heterogeneity. To address this, we introduce a node-adaptive dynamic routing mechanism that builds a customized reasoning pathway for each node from a sparse subset of expert modules.
The process works as follows. Given node features $x_i \in \mathbb{R}^d$, we first compute expert affinities:

$s_{i,e} = x_i^{\top} w_e, \quad e = 1, \dots, E,$

where $W_g = [w_1, \dots, w_E] \in \mathbb{R}^{d \times E}$ is learnable. Each $s_{i,e}$ measures how well node $v_i$ matches expert $e$. For efficiency and specialization, we apply sparse activation. For each node $v_i$, we select the top-k experts with the highest affinity (see Supplementary Section S3 for selection details):

$\mathcal{E}_i = \operatorname{TopK}\big(\{s_{i,e}\}_{e=1}^{E},\, k\big).$

We then normalize weights locally over the selected experts:

$g_{i,e} = \frac{\exp(s_{i,e})}{\sum_{e' \in \mathcal{E}_i} \exp(s_{i,e'})}, \quad e \in \mathcal{E}_i.$

This ensures $\sum_{e \in \mathcal{E}_i} g_{i,e} = 1$ for each node. After the experts process the graph in parallel, the final node representation is a weighted combination:

$h_i = \sum_{e \in \mathcal{E}_i} g_{i,e}\, h_i^{(e)}.$

Here, $h_i^{(e)}$ is the output of expert $e$ for node $v_i$. This design promotes expert specialization while keeping computation efficient through sparsity.
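The routing computation admits a compact NumPy sketch. Function names are hypothetical, and the gating matrix is random here purely for illustration; in the model it is learned jointly with the experts.

```python
import numpy as np

def route_nodes(X: np.ndarray, W_g: np.ndarray, k: int = 2):
    """Top-k expert routing with locally renormalized softmax weights.

    X:   (N, d) node features.
    W_g: (d, E) gating matrix (learnable in the model).
    Returns per-node expert indices (N, k) and gate weights (N, k).
    """
    scores = X @ W_g                                  # (N, E) affinities
    top = np.argsort(scores, axis=1)[:, -k:]          # top-k experts per node
    top_scores = np.take_along_axis(scores, top, axis=1)
    # Softmax restricted to the selected experts (sums to 1 per node).
    z = np.exp(top_scores - top_scores.max(axis=1, keepdims=True))
    gates = z / z.sum(axis=1, keepdims=True)
    return top, gates

def combine_experts(expert_outputs: np.ndarray, top: np.ndarray,
                    gates: np.ndarray):
    """Weighted combination over the selected experts for each node.

    expert_outputs: (E, N, d_out) outputs of every expert for every node.
    """
    N = top.shape[0]
    sel = expert_outputs[top, np.arange(N)[:, None]]  # (N, k, d_out)
    return (gates[..., None] * sel).sum(axis=1)       # (N, d_out)
```

In practice only the selected experts would be evaluated per node; computing all expert outputs here keeps the sketch short.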
3.3. Query-Modulated Graph Attention Expert
Each activated expert is a Query-Modulated Graph Attention Expert (QM-GAE). Its goal is query-aware relational reasoning on the Ortho-Graph.
Figure 2 illustrates the overall process, which has three steps. First, the input query is encoded as a dense vector $q \in \mathbb{R}^d$. Second, each node feature $x_i$ is made query-aware: we concatenate $x_i$ and $q$, then pass them through an MLP:

$\tilde{x}_i = \mathrm{MLP}\big([x_i \,\Vert\, q]\big).$

Third, multi-head graph attention aggregates information from neighbors. For head $h$, the attention coefficient is:

$\alpha_{ij}^{(h)} = \frac{\exp\big(\mathrm{LeakyReLU}\big(a^{(h)\top}\big[W^{(h)}\tilde{x}_i \,\Vert\, W^{(h)}\tilde{x}_j\big]\big)\big)}{\sum_{j' \in \mathcal{N}_i} \exp\big(\mathrm{LeakyReLU}\big(a^{(h)\top}\big[W^{(h)}\tilde{x}_i \,\Vert\, W^{(h)}\tilde{x}_{j'}\big]\big)\big)},$

where $\mathcal{N}_i$ includes node $v_i$ and its direct neighbors, $a^{(h)}$ is a shared weight vector, and $W^{(h)}$ is the weight matrix for the $h$-th attention head. The final node representation of the expert is the concatenation of all heads:

$h_i^{(e)} = \big\Vert_{h=1}^{H}\, \sigma\Big(\textstyle\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{(h)} W^{(h)} \tilde{x}_j\Big).$
By injecting query semantics into both feature projection and attention (as shown in
Figure 2), QM-GAE produces task-relevant node representations. See
Supplementary Section S4 for full notation and implementation details.
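A single query-modulated attention head can be sketched as below. This is an illustrative NumPy version with assumed shapes and names; the full expert uses multiple heads whose outputs are concatenated, as described above.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def qm_gat_head(X, q, A, W_mlp, W, a):
    """One query-modulated graph attention head (illustrative sketch).

    X: (N, d) node features;  q: (d,) query embedding;  A: (N, N) adjacency.
    W_mlp: (2d, d) fuses [x_i || q] into query-aware features.
    W: (d, d_out) head projection;  a: (2*d_out,) shared attention vector.
    """
    N, d = X.shape
    # Step 2: query-aware projection of every node via concat + MLP.
    Xq = leaky_relu(np.concatenate([X, np.tile(q, (N, 1))], axis=1) @ W_mlp)
    Z = Xq @ W                                           # (N, d_out)
    # Step 3: attention logits e_ij = a^T [z_i || z_j], vectorized.
    a_src, a_dst = a[: Z.shape[1]], a[Z.shape[1]:]
    logits = leaky_relu((Z @ a_src)[:, None] + (Z @ a_dst)[None, :])
    # Restrict attention to each node's neighborhood (self-loop included).
    mask = (A + np.eye(N)) > 0
    logits = np.where(mask, logits, -np.inf)
    alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return np.tanh(alpha @ Z)                            # (N, d_out)
```

Masking with negative infinity before the softmax is the standard way to keep attention weights zero outside $\mathcal{N}_i$.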
3.4. Node Ranking and Summary Generation
Upon completion of the query-aware graph reasoning by the expert modules, a final representation $h_i$ is generated for each node. This representation is rich in contextual information and aligned with the query's intent. The ultimate task of this stage is to convert these node representations into a coherent, natural-language summary that synthesizes the core information. This process is initiated by a Scoring Head, which quantifies the relevance of each node by projecting its final representation $h_i$ back into the semantic space of the query $q$. This is achieved by computing the inner product $r_i = h_i^{\top} q$, which yields a scalar relevance score $r_i$ for each node $v_i$. These scores provide the quantitative basis for information selection. Based on these scores, the top-N nodes with the highest values are selected; the set of indices of these nodes identifies the most salient content fragments within the document. Finally, this context is injected into an instruction template and synthesized into the final summary $S$ by a Large Language Model.
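The scoring-and-selection step admits a compact sketch. The helper names and the prompt template are illustrative assumptions; top-N = 6 follows our retrieval setting.

```python
import numpy as np

def select_top_nodes(H, q, chunks, top_n=6):
    """Score nodes by inner product with the query and keep the top-N.

    H: (N, d) final node representations; q: (d,) query embedding;
    chunks: list of N chunk texts.
    """
    scores = H @ q                                   # r_i = h_i . q
    idx = np.argsort(scores)[::-1][:top_n]           # highest scores first
    idx = np.sort(idx)                               # restore document order
    return idx, [chunks[i] for i in idx]

def build_prompt(query, selected_chunks):
    """Inject retrieved context into a simple instruction template
    (the exact template wording is an illustrative assumption)."""
    context = "\n\n".join(selected_chunks)
    return ("Summarize the following document excerpts with respect to the "
            f"query.\n\nQuery: {query}\n\nContext:\n{context}\n\nSummary:")
```

Re-sorting the selected indices into document order keeps the retrieved context readable for the generator.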
3.5. Self-Supervised Semantic Alignment and Expert Load Balancing
To train the model effectively, we employ a composite loss function that pursues two main objectives: learning a meaningful node-importance ranking and ensuring balanced utilization of experts. The total loss combines a ranking term and a load-balancing term:

$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{rank}} + \lambda_2 \mathcal{L}_{\text{bal}}.$
The ranking loss is self-supervised. It uses semantic relevance scores derived from BERTScore, computed with DeBERTa-xlarge-mnli embeddings between each node and the reference summary [
20]. Nodes whose relevance exceeds the mean form the positive set $\mathcal{V}^{+}$; the remainder form the negative set $\mathcal{V}^{-}$. A pairwise margin ranking loss is then applied over all positive-negative pairs:

$\mathcal{L}_{\text{rank}} = \frac{1}{|\mathcal{P}|} \sum_{(p,\,n) \in \mathcal{P}} \max\big(0,\; m - s_p + s_n\big),$

where $\mathcal{P} = \mathcal{V}^{+} \times \mathcal{V}^{-}$ represents the set of all positive-negative sample pairs in the graph, $m$ is a margin hyperparameter that controls the degree of separation between positive and negative samples, and $s_p$ and $s_n$ are the model's predicted scores for the positive and negative sample nodes, respectively. This encourages the model to assign higher scores to more relevant nodes without requiring manual annotations.
To prevent routing collapse and promote expert specialization, we introduce an auxiliary load-balancing loss. This term combines a utilization component, which encourages uniform assignment of nodes across experts, and a confidence component, which favors decisive routing decisions. The detailed formulation, including both components averaged over layers, is provided in
Supplementary Section S5. Together, these objectives enable stable training and high-quality summarization performance.
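The pairwise ranking objective can be sketched as follows. This is a NumPy illustration with an assumed margin value; during training the scores are differentiable model outputs and the relevance labels come from BERTScore against the reference summary.

```python
import numpy as np

def margin_ranking_loss(scores, relevance, margin=0.1):
    """Self-supervised pairwise margin ranking loss.

    scores:    (N,) model scores for each node.
    relevance: (N,) soft labels (e.g., BERTScore vs. the reference summary).
    Nodes above the mean relevance are positives; the rest are negatives.
    The margin value 0.1 is an illustrative assumption.
    """
    pos = scores[relevance > relevance.mean()]
    neg = scores[relevance <= relevance.mean()]
    if len(pos) == 0 or len(neg) == 0:
        return 0.0
    # All positive-negative pairs via broadcasting: max(0, m - s_p + s_n).
    losses = np.maximum(0.0, margin - pos[:, None] + neg[None, :])
    return float(losses.mean())
```

The loss is zero whenever every positive node already outscores every negative node by at least the margin.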
4. Results
The preceding sections have introduced our core contributions: an Orthogonal Context Graph that explicitly separates local discourse from global semantic associations, and a node-adaptive Mixture-of-Experts routing mechanism that forms query-tailored reasoning pathways under self-supervised objectives. We now turn to a comprehensive experimental evaluation on three public long-document summarization datasets to demonstrate the effectiveness of these designs.
Experimental Setup
Datasets: Experiments were conducted on three public long-text summarization datasets to ensure the robustness and generalizability of our evaluation. These datasets cover diverse domains and introduce distinct structural challenges. QMSum [
41] consists of meeting transcripts from various domains; in this study, we focused on its “general query” portion, which allows validation of the model’s ability to integrate global information in informal, multi-speaker conversational scenarios. BookSum [
42] provides long narrative texts from literary works, thereby testing the model’s effectiveness in capturing long-range dependencies within extremely long contexts and complex narrative structures. Finally, WCEP [
43] addresses multi-document summarization of news events, enabling assessment of cross-document information deduplication, fusion, and organization. Performance was quantified using the standard ROUGE metrics [
44], namely ROUGE-1, ROUGE-2, and ROUGE-L, which measure content coverage and accuracy through lexical overlap between generated and reference summaries.
Implementation Details: In our experiments, long documents were segmented into text chunks of 256 tokens with an overlap of 64 tokens. Initial embeddings were generated using the Contriever model. Using these text chunks as nodes, we constructed a base graph by connecting adjacent chunks to preserve the local narrative flow and by connecting each chunk to its 5 semantically closest non-adjacent chunks to establish global relationships. This graph was then input into a 3-layer Moe-GAT model, where the node and hidden-layer dimensions were both set to 768. Each layer contained 4 experts, and a Top-2 routing strategy was used to dynamically select experts for each node. The model was trained using the AdamW optimizer with an initial learning rate of 1 × 10⁻⁴. The composite loss function consisted of two components: a self-supervised ranking loss (with weight $\lambda_1$) and an expert load-balancing loss (with weight $\lambda_2$), ensuring that the model learned node importance while maintaining a balanced utilization of experts. During the final summary generation stage, our trained Moe-GAT model functioned as the retriever. The query and the complete Orthogonal Context Graph were fed into the model, which then computed relevance scores for all nodes in the graph relative to the query. The top 6 most relevant nodes were selected based on these scores, and their content was input into the LLaMA-2-7b-chat model to generate the final summary via a greedy decoding strategy.
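For reference, the hyperparameters above can be collected into a single configuration sketch. The key names are our own; the loss weights are omitted because their values are not specified in the text.

```python
# Hyperparameters from the implementation details above (key names are ours).
CONFIG = {
    "chunk_size": 256,          # tokens per chunk
    "chunk_overlap": 64,        # token overlap between consecutive chunks
    "embedder": "Contriever",   # initial node embeddings
    "semantic_neighbors": 5,    # non-adjacent semantic edges per node
    "num_layers": 3,            # Moe-GAT depth
    "hidden_dim": 768,          # node and hidden-layer dimensions
    "num_experts": 4,           # experts per layer
    "top_k_experts": 2,         # routing sparsity
    "optimizer": "AdamW",
    "learning_rate": 1e-4,      # initial learning rate
    "top_n_nodes": 6,           # retrieved chunks fed to the LLM
    "generator": "LLaMA-2-7b-chat",
    "decoding": "greedy",
}
```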
Baselines: To ensure a comprehensive evaluation, we compare our proposed Moe-GAT model with multiple representative baselines, including: (1) Sparse Retrievers (i.e., BM25 [
41] and TF-IDF [
42]), (2) Dense Retrievers (i.e., Contriever [
43], DPR [
44], SBERT [
45], and Dragon [
46]), (3) Full-context Modeling (i.e., directly processing entire documents without retrieval), and (4) Structured Retrieval Methods (i.e., Node2Vec [
47], GOR [
16], and Thought-R [
48]). This systematic comparison enables thorough validation of our model’s effectiveness across different technical approaches.
4.1. Main Results
We conducted comprehensive experiments on the QMSum, WCEP, and BookSum datasets, comparing our proposed method against several baselines to evaluate its long-context global summarization capabilities. The results are presented in
Table 1. Our proposed Moe-GAT consistently outperforms all baseline methods across nearly all evaluation metrics and datasets. Thanks to the mixture-of-experts architecture combined with graph attention mechanisms, our model can effectively capture and integrate complex semantic relationships across different text chunks, thereby enhancing the retrieval and representation of key information for summarization. Moreover, the structured modeling of document content enables more precise identification of salient content, which is crucial for generating high-quality summaries.
Moe-GAT shows clear superiority over traditional retrieval methods. While structured retrieval methods like GOR demonstrate competitive performance, our approach further advances the state-of-the-art by leveraging more sophisticated graph-based representations. Compared to full-context modeling, Moe-GAT achieves better performance with significantly shorter input lengths (approximately 1.5K tokens), demonstrating its efficiency in handling long documents without information loss from truncation.
Additional Findings. (1) Sparse retrievers (BM25 and TF-IDF) produce suboptimal results as they rely solely on lexical matching without capturing semantic meanings. (2) Dense retrievers show varying performance, with Contriever achieving relatively better results, though still inferior to our method due to limitations in modeling global document structure. (3) Node2Vec produces unsatisfactory results since its random walk-based embeddings cannot be effectively optimized for the summarization task. (4) Although Thought-R demonstrates competitive results, it is still inferior to Moe-GAT due to insufficient exploration of the hierarchical relationships between document segments. (5) The “Full Context” approach suffers from information loss when processing extremely long documents, resulting in suboptimal performance despite accessing the complete document content.
Overall, Moe-GAT achieves the best results compared with various baselines across all three datasets, demonstrating the effectiveness of our proposed method in addressing the challenges of long-context global summarization.
4.2. Generalization Capability of the Framework Across Different Large Language Models
To rigorously evaluate the generalization capability and model-agnostic nature of the proposed framework, a comprehensive experimental analysis was conducted on three representative text summarization datasets: QMSum, Booksum, and WCEP. The experiments utilized several different backbone language models with the aim of verifying that the performance improvements stem from the intrinsic advantages of the Moe-GAT architecture itself, rather than from a coincidental synergy with any specific large model. We selected a series of high-performing and representative open-source large language models as backbones and evaluated them under two settings: “LLM-Only” and “Ours + LLM”. By comparing the performance of each backbone model, both standalone and in conjunction with the Moe-GAT framework, on the ROUGE-L, ROUGE-1, and ROUGE-2 metrics (with results presented in
Table 2,
Table 3 and
Table 4), we have derived the following key findings.
The experimental results demonstrate that the Moe-GAT framework induces universal and robust performance gains across various summarization tasks. On the QMSum dataset, which is characterized by multi-turn dialogs with complex interlocutor dependencies, all tested backbone models achieved consistent improvements in their ROUGE-L and ROUGE-1 scores after the integration of Moe-GAT. This improvement stems primarily from the Orthogonal Context Graph, which effectively preserves sequential discourse flows across turns. Such flows are frequently disrupted during direct full-text summarization by standalone LLMs.
This positive trend was further validated on the Booksum and WCEP datasets. On the Booksum dataset, which features lengthy narrative texts with strong sequential coherence, all seven models exhibited clear performance improvements when combined with our framework. The gains for Hunyuan-MT-7B were especially significant, achieving substantial leaps of over 12 points in ROUGE-L and 24 points in ROUGE-1. These pronounced gains highlight the framework’s strength in modeling long-range sequential dependencies via its dedicated sequential axis. This specifically benefits models that struggle with processing extended contexts in full-text mode.
Similarly, a comprehensive and consistent performance enhancement was observed across all models on the WCEP dataset, a multi-document news clustering task rich in thematic overlaps. In this case, the semantic axis of the Orthogonal Context Graph plays a pivotal role by capturing cross-document thematic links. This enables more accurate consensus identification than direct full-text processing by standalone LLMs. Such systematic improvement across different domains and task difficulties provides strong evidence for the universal effectiveness of the Moe-GAT framework in enhancing summary generation quality.
Furthermore, the Moe-GAT framework possesses excellent versatility and broad enhancement capabilities. Evaluation across text summarization tasks confirms that the framework is effective for a diverse range of large language models with heterogeneous architectures, including those from the GPT, Gemini, and Qwen families. Regardless of the backbone model, the introduction of Moe-GAT consistently improved summary quality. This confirms that its success relies on universal design principles rather than the internal mechanisms of any specific model.
Further quantitative analysis reveals an important pattern: the magnitude of the performance gain is inversely correlated with the baseline capability of the backbone model. Stronger models, such as InternLM2.5-7B-Chat or GLM-4-9B, already possess relatively effective internal context modeling for full-text summarization, leaving less room for external structural augmentation. In contrast, weaker baselines suffer more severely from context limitations in direct full-text processing. Examples include Hunyuan-MT-7B on Booksum or various Gemini variants. For these models, the query-aware Moe-GAT and Orthogonal Context Graph provide substantial complementary structural signals, resulting in larger absolute gains. Specifically, the framework provides robust yet modest improvements for highly optimized models. Conversely, it yields more substantial performance leaps for models with weaker baseline capabilities. This phenomenon indicates that Moe-GAT can adaptively provide a commensurate level of optimization based on the potential of different models, thereby universally elevating summarization performance across a wide range of architectures.
The experimental results demonstrate that: (1) The Moe-GAT framework induces universal and robust performance gains across different types of summarization tasks. On the QMSum dataset (characterized by multi-turn dialogs with complex interlocutor dependencies), all tested backbone models achieved consistent improvements in their ROUGE-L and ROUGE-1 scores after the integration of Moe-GAT. This is primarily because the Orthogonal Context Graph effectively preserves sequential discourse flows across turns, which are frequently disrupted in direct full-text summarization by standalone LLMs. This positive trend was further validated on the Booksum and WCEP datasets. Particularly on the Booksum dataset (featuring lengthy narrative texts with strong sequential coherence), all seven models exhibited clear performance improvements when combined with our framework, with the improvement for Hunyuan-MT-7B being especially significant, achieving substantial leaps of over 12 points in ROUGE-L and 24 points in ROUGE-1. These pronounced gains on Booksum highlight the framework’s strength in modeling long-range sequential dependencies via its dedicated sequential axis, particularly benefiting models that struggle with processing extended contexts in full-text mode. On the WCEP dataset (a multi-document news clustering task rich in thematic overlaps), a comprehensive and consistent performance enhancement was also observed across all models. Here, the semantic axis of the Orthogonal Context Graph plays a pivotal role by capturing cross-document thematic links, enabling more accurate consensus identification than direct full-text processing by standalone LLMs. This systematic improvement, spanning different domains and task difficulties, provides strong evidence for the universal effectiveness of the Moe-GAT framework in enhancing the quality of summary generation. (2) The Moe-GAT framework possesses excellent versatility and broad enhancement capabilities. 
In the evaluation of text summarization tasks, the framework was validated on a diverse range of large language models with heterogeneous architectures, including those from the GPT, Gemini, and Qwen families. Regardless of the backbone, introducing MoE-GAT consistently improved the quality of the generated summaries, confirming that its design principles apply universally rather than depending on the internal mechanisms of any specific model. Further quantitative analysis reveals an important pattern: the magnitude of the performance gain is inversely correlated with the baseline capability of the backbone model. This pattern arises because stronger models (e.g., InternLM2.5-7B-Chat or GLM-4-9B) already possess relatively effective internal context modeling for full-text summarization, leaving less room for external structural augmentation. In contrast, weaker baselines (e.g., Hunyuan-MT-7B on BookSum or the Gemini 2.5 variants) suffer more severely from context limitations in direct full-text processing; the query-aware MoE-GAT and Orthogonal Context Graph therefore provide substantial complementary structural signals, resulting in larger absolute gains. Specifically, for highly optimized models such as InternLM2.5-7B-Chat, the framework provides robust yet modest improvements, whereas for models with weaker baseline capabilities it yields more substantial performance leaps. This indicates that MoE-GAT adaptively provides a commensurate level of optimization based on the potential of each model, thereby universally elevating summarization performance across a wide range of architectures.
4.3. Ablation Study
We performed a series of ablation experiments to deeply investigate the contribution of each core component in the proposed framework. The results appear in
Table 5. Our analysis reveals the distinct and complementary roles of each component, allowing us to quantify their relative importance to the overall performance.
(1) The MoE architecture is fundamental for modeling long and heterogeneous documents. We implemented the “w/o MoE” variant by setting the number of experts to 1, forcing all graph nodes to be processed by a single expert with identical parameters. The resulting performance degradation, consistent across all datasets, underscores that expert specialization is non-trivial. The impact is most pronounced on BookSum, which features the longest and most thematically diverse narratives, with a 15.1% relative drop in ROUGE-L. This indicates that the MoE framework contributes most significantly to handling the varied reasoning demands of lengthy texts.
(2) The orthogonal graph decomposition is crucial, with the semantic and sequential structures contributing differentially depending on the dataset characteristics. The “w/o Semantic” variant, which removes semantic adjacency, causes the most severe drops on datasets rich in thematic content; for instance, ROUGE-1 on BookSum plunges from 30.2 to 17.1 (a 43.3% relative drop), highlighting that the semantic graph is the primary driver for capturing long-range topical dependencies. Conversely, the “w/o Sequential” variant, which removes sequential links, more significantly harms coherence in structured dialogs or narratives, as seen in the notable ROUGE-L decrease on QMSum (from 22.1 to 18.3). Therefore, while both structures are essential, the semantic component contributes most to overall informativeness on content-rich datasets, whereas the sequential component is vital for maintaining local flow.
(3) The composite training objective ensures task alignment and training stability. The variant that removes the self-supervised ranking loss shows clear performance reductions across datasets, highlighting this loss’s role in producing discriminative node representations optimized for relevance ranking in summarization. The variant that removes the load-balancing loss leads to moderate but consistent declines; for example, ROUGE-1 drops from 30.2 to 21.2 on BookSum. The drop is larger here because longer documents generate more nodes and thus amplify load-imbalance issues. This outcome confirms that explicit load balancing prevents expert collapse and preserves the full representational power of the MoE architecture.
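The composite objective combines a self-supervised ranking term with a load-balancing term. The sketch below illustrates one plausible form of each; both are assumptions for illustration (a margin-based pairwise ranking loss over golden versus ordinary nodes, and a Switch-Transformer-style balancing penalty), and the placeholder weights `lambda_rank` and `lambda_bal` are not the paper’s coefficients.

```python
import numpy as np

def ranking_loss(scores, gold_mask, margin=1.0):
    """Margin ranking term (assumed form): every golden node should
    outscore every ordinary node by at least `margin`."""
    pos = scores[gold_mask]
    neg = scores[~gold_mask]
    # Pairwise hinge over all (golden, ordinary) pairs.
    diffs = margin - (pos[:, None] - neg[None, :])
    return np.maximum(diffs, 0.0).mean()

def load_balancing_loss(router_probs):
    """Switch-Transformer-style balancing term (assumed form):
    penalizes deviation of per-expert load from the uniform 1/E share.
    `router_probs` has shape (num_tokens, num_experts)."""
    n_experts = router_probs.shape[1]
    # Fraction of tokens hard-routed to each expert.
    frac_tokens = np.bincount(router_probs.argmax(1),
                              minlength=n_experts) / len(router_probs)
    # Mean routing probability mass per expert.
    frac_probs = router_probs.mean(axis=0)
    return n_experts * np.sum(frac_tokens * frac_probs)

def composite_objective(scores, gold_mask, router_probs,
                        lambda_rank=1.0, lambda_bal=0.01):
    # Placeholder weights; the paper's actual coefficients are not shown here.
    return (lambda_rank * ranking_loss(scores, gold_mask)
            + lambda_bal * load_balancing_loss(router_probs))
```

With perfectly separated scores the ranking term vanishes, while the balancing term reaches its minimum of 1.0 under uniform routing, so the two terms pull in complementary directions during training.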
4.4. Computational Efficiency Analysis
We address concerns regarding the computational complexity of MoE-GAT by distinguishing between the graph construction (preprocessing) and inference stages. Measurements are conducted on the WCEP dataset using a consistent LLM backbone, excluding the LLM generation time, which is identical across methods.
Graph construction entails splitting the document into overlapping chunks, encoding them with a pre-trained embedding model to generate node representations, and constructing the sparse sequential and masked top-k semantic adjacency matrices. The dominant cost lies in the embedding step and varies significantly with the embedding model. Subsequent steps—building the sparse sequential adjacency matrix (O(N)) and the masked top-k semantic adjacency matrix—are considerably lighter, as they operate on precomputed embeddings with efficient sparse operations. Graph construction is performed once per document and can be fully offline, making it amortizable in practical deployments.
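The two-axis construction described above can be sketched as follows, operating on precomputed chunk embeddings. The exact masking rules (excluding self-loops and sequential edges from the semantic axis) and the symmetric treatment of sequential links are illustrative assumptions, not the paper’s specification.

```python
import numpy as np

def build_orthogonal_adjacency(embeddings: np.ndarray, k: int = 5):
    """Build the sequential and masked top-k semantic adjacency matrices
    from an (N, d) matrix of chunk embeddings (illustrative sketch)."""
    n = embeddings.shape[0]

    # Sequential axis: link each chunk to its successor -- O(N) edges.
    seq_adj = np.zeros((n, n), dtype=np.float32)
    idx = np.arange(n - 1)
    seq_adj[idx, idx + 1] = 1.0
    seq_adj[idx + 1, idx] = 1.0

    # Semantic axis: cosine similarity, masked so that self-loops and
    # edges already covered by the sequential axis are excluded, then
    # sparsified to each node's top-k most similar neighbors.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)
    sim[seq_adj > 0] = -np.inf  # keep the two axes disjoint (assumption)
    sem_adj = np.zeros((n, n), dtype=np.float32)
    topk = np.argsort(-sim, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    sem_adj[rows, topk.ravel()] = 1.0
    return seq_adj, sem_adj
```

Because both matrices hold at most O(N) and O(kN) nonzeros respectively, they can be stored in sparse form and reused across all queries against the same document.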
During inference, MoE-GAT performs dynamic routing, query-aware Mixture-of-Experts graph attention propagation over the pre-constructed sparse Orthogonal Context Graph, and node ranking with the learned scoring mechanism. The high sparsity of the graph keeps attention efficient. As shown in
Table 6, MoE-GAT averages 1.2 s per query, versus 0.58 s for GoR, a modest increase of roughly 0.6 s that stems mainly from dynamic routing and the multi-expert layers. Given the substantial gains over GoR (+14.9% ROUGE-L, +18.1% ROUGE-1, +18.4% ROUGE-2), this minor overhead is well justified in practical applications.
5. Discussion
5.1. Efficacy of the Scoring Mechanism
This mechanism is a critical component of the framework; its core function is to rank nodes accurately. To evaluate it quantitatively, we computed the model-predicted score for each node on the test set and compared it against the corresponding ground-truth score, plotting predicted versus target scores to inspect their correlation. Results are visualized in
Figure 3.
Figure 3a shows the linear regression fit between model-predicted relevance scores and ground-truth scores for each node. We observe a strong positive linear correlation, evidenced by a Pearson correlation coefficient of r = 0.759 (
p < 0.001). Additionally, Spearman’s rank correlation coefficient of ρ = 0.714 confirms high monotonic consistency in ranking. The coefficient of determination, R² = 0.576, indicates that the model explains more than half of the variance in the ground-truth scores. Compared to typical relevance prediction performance reported in complex natural language understanding tasks [
53], these results demonstrate robust and reliable score prediction capabilities. This finding is further supported by the two-dimensional kernel density estimation in
Figure 3b, where the joint distribution reveals that most data points concentrate tightly around the regression line. This pattern confirms consistent performance across the majority of samples and validates that our scoring head successfully assigns higher relevance to critical content segments.
5.2. Sensitivity Analysis of the Number of Semantic Neighbors
We conducted a systematic study to examine the influence of
k, the number of semantic neighbors in the Orthogonal Context Graph, on ranking performance. Results for the WCEP and QMSum datasets are shown in
Figure 4. As
k increases from small values, performance on both datasets improves rapidly at first, reflecting the benefit of incorporating more long-range semantic connections. Performance then reaches a peak and subsequently declines or plateaus, indicating the onset of noise from overly distant or irrelevant neighbors. The optimal
k and the model’s sensitivity to this hyperparameter vary markedly between datasets. On the WCEP dataset, which features high content heterogeneity due to multi-document news clusters containing diverse topics, advertisements, and ancillary elements, the model achieves peak ranking performance at
k = 5. Beyond this point, further increases in
k lead to a steady decline, as additional semantic edges frequently connect core content to noisy or irrelevant segments. This noise dilutes the signal during graph propagation and reduces retrieval precision. In contrast, the QMSum dataset consists of meeting transcripts that are more semantically dense and thematically coherent within each document. Here, the model attains its highest performance at
k = 7, with only minor degradation even as
k continues to increase up to 10 or higher. This greater robustness suggests that extra semantic neighbors in QMSum are more likely to link genuinely related utterances across speakers or agenda items, providing useful supplementary context rather than noise.
These dataset-specific patterns underscore the value of tuning k based on document characteristics: lower values suit heterogeneous or noisy sources to minimize spurious connections, while higher values are preferable for focused, cohesive texts. Overall, the results confirm that our Orthogonal Context Graph construction is not only effective but also exhibits appropriate sensitivity to this key hyperparameter, enabling strong adaptability across diverse long-document scenarios in practice.
5.3. Analysis of the Impact of the Load-Balancing Loss on Architectural Stability
We performed a quantitative comparison between the full model, which incorporates the load-balancing loss, and an ablated variant without this loss (w/o load-balancing). Results are presented in
Figure 5 from two perspectives: the dynamic evolution of expert load during training (a) and the steady-state distribution after convergence (b). Without the load-balancing loss, the routing mechanism exhibits severe imbalance after convergence, as shown in
Figure 5b. One expert (expert 5) processes 45.9% of the input tokens—nearly four times the ideal uniform share of 12.5% for 8 experts—while experts 2, 3, 4, and 7 receive almost no tokens. This “winner-take-all” phenomenon causes the multi-expert system to degenerate into a structure dominated by a single expert, substantially reducing overall model capacity. The load balance score drops dramatically from 0.912 in the full model to 0.341 in the ablated version, confirming routing collapse.
In contrast, the full model with the load-balancing loss maintains healthy utilization.
Figure 5a illustrates that, despite initial fluctuations, the load on all eight experts rapidly converges and stabilizes around the ideal average. This demonstrates that the load-balancing loss serves as an effective regularizer, enforcing uniform input distribution and ensuring each expert receives sufficient training signal. By preserving expert diversity and capacity, this mechanism contributes directly to the superior ROUGE performance observed in the full model.
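A load balance score of the kind cited above can be computed, for example, as the normalized entropy of the per-expert load distribution; this is an assumed definition for illustration, since the paper’s exact metric is not reproduced here.

```python
import numpy as np

def load_balance_score(expert_loads):
    """Normalized entropy of the expert load distribution (assumed
    metric): 1.0 means perfectly uniform utilization, values near 0
    indicate winner-take-all collapse onto a single expert."""
    p = np.asarray(expert_loads, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # 0 * log 0 = 0 by convention
    entropy = -np.sum(p * np.log(p))
    # Normalize by the maximum entropy, log(E) for E experts.
    return entropy / np.log(len(expert_loads))
```

Under this definition, a perfectly uniform 8-expert load scores 1.0, while a heavily skewed distribution of the kind shown in Figure 5b scores far lower, mirroring the qualitative gap between the full and ablated models.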
5.4. Analysis of the Role of the Self-Supervised Ranking Loss in the Separability of the Representation Space
We investigated the role of the self-supervised ranking loss in representation learning through a comparative visualization study. Final node representations from the full model and an ablated variant without the ranking loss are projected into two dimensions using UMAP. Results are shown in
Figure 6, where red points represent golden nodes that are highly relevant to the reference summary and blue points represent ordinary nodes. In the ablated model (right panel), golden and ordinary nodes are heavily intermingled with no discernible separable structure, indicating that the main task objective alone provides insufficient signal for the model to distinguish nodes of varying importance and results in poor representation separability. The full model (left panel) behaves markedly differently: golden nodes form a compact, well-defined cluster clearly separated from the region occupied by ordinary nodes. This pronounced boundary demonstrates that the self-supervised ranking loss supplies a task-aligned supervisory signal that effectively guides the encoder to separate critical from secondary content. By serving as a carefully designed proxy task, the ranking loss imposes structure on the representation space, enabling better differentiation of node importance and contributing directly to improved summary quality.
5.5. Analysis of the Correlation Between Internal Decision Mechanisms and Summarization Performance
High-quality summarization depends critically on the model’s ability to accurately assess sentence importance. To examine this internal decision-making process and its relation to generation performance, we conducted a qualitative visual analysis on 34 test samples from the QMSum dataset. Results are presented in
Figure 7, with the top panel showing ROUGE-L scores for each sample and the bottom panel displaying a heatmap of model-assigned importance scores for sentences in the corresponding source document (darker colors indicate higher importance; selected summary sentences are bordered in dark gray). Based on this visual analysis, the following conclusions were drawn:
(1) Accurate Judgments and an Effective Scoring Mechanism: The heatmaps collectively indicate that the model can precisely identify salient sentences. In the vast majority of samples, the sentences ultimately selected for the summary (highlighted with dark gray borders) correspond closely to the most intensely colored regions of the heatmap, which represent the highest importance scores. For example, in samples 17, 20, and 34, the model exhibits a pronounced “peak” effect, assigning extremely high scores to a few sentences while the rest receive markedly lower ones. This demonstrates strong alignment between the model’s internal scoring and its summary generation behavior, confirming that the designed scoring architecture can effectively identify high-information-density content. (2) Overcoming Positional Bias with Global Information Awareness: The analysis reveals that the model successfully avoids the “lead bias” common in traditional summarization methods. Along the vertical axis of the heatmaps (normalized sentence position), high-scoring salient sentences are not concentrated at the beginning of the documents (the 0.0–0.2 range) but are distributed widely throughout the text. For instance, the key information in sample 20 is concentrated in the initial part of the text, whereas the salient sentences in samples 5, 15, and 33 are located primarily in the middle and latter sections. This indicates that the model possesses a global information-aware capability, enabling it to assess sentence importance based on semantic value rather than positional cues, which is particularly crucial for processing long documents with diverse structures.
(3) Positive Correlation between Scoring Confidence and Summary Quality: A comparison of the ROUGE-L scores and the heatmap patterns reveals a clear association between the two. In samples with high ROUGE-L scores (e.g., samples 17 and 34), the heatmaps exhibit high contrast, with a few sentences receiving scores significantly higher than the rest. Conversely, in samples with lower scores (e.g., samples 5 and 29), the color distribution is relatively uniform. This suggests that the more “confident” the model is in its judgments—that is, when the scores of salient sentences are highly prominent—the higher the quality of the generated summary tends to be. In summary, the model demonstrates high accuracy and global awareness in assessing sentence importance, providing a solution for long-document summarization that is both effective and interpretable.
5.6. Limitations
A primary limitation of the proposed method lies in its strong dependence on the quality of the graph construction. We front-load the complex semantic analysis task; while this reduces the computational burden on the model, it also shifts the performance bottleneck to the graph construction stage. This limitation is particularly pronounced on the BookSum dataset, where for extremely long and complex texts such as books, the current graph construction method struggles to fully capture the macroscopic narrative structure. This reveals a critical issue: even a model with a highly stable and load-balanced internal architecture (see
Figure 5) will have its ultimate efficacy constrained by the quality of its input representation (i.e., the graph structure). Another limitation stems from the implicit division of labor among the experts, which restricts the model’s controllability. While the load-balancing loss successfully ensures that all experts are activated, we did not explicitly guide their functional differentiation. As a result, the model functions as an effective “generalist” rather than a collection of interpretable and steerable “specialists”. Consequently, it lacks the flexibility required for applications that involve generating summaries from a specific perspective.