Article

Cognitive Chain-Based Dual Fusion Framework for Multi-Document Summarization

1 School of Cybersecurity, Zhongyuan University of Technology, Zhengzhou 450007, China
2 Central China Institute of Artificial Intelligence, Zhengzhou 450046, China
3 Henan Academy of Science, Zhengzhou 450046, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(22), 4545; https://doi.org/10.3390/electronics14224545
Submission received: 29 October 2025 / Revised: 19 November 2025 / Accepted: 19 November 2025 / Published: 20 November 2025

Abstract

Multi-Document Summarization (MDS) is a critical task in natural language processing that aims to condense document clusters into concise and comprehensive summaries. However, existing approaches based on large language models (LLMs) often lack structured quality monitoring and depth-refinement mechanisms. This opacity and lack of self-correction can compromise the reliability, depth, and controllability of the resulting summaries. To address these limitations, this paper introduces the Self-Optimizing Multi-Path Fusion Framework (MFOG), a novel conceptual architecture for MDS. MFOG treats MDS as a collaborative process and employs a dual-path architecture to balance summary depth and breadth. A depth-focused path, augmented by Retrieval-Augmented Generation (RAG), progressively enhances content depth and logical coherence. Concurrently, a breadth-first parallel path ensures comprehensive coverage. A final fusion module then performs a weighted integration of these outputs. We present an illustrative experimental study on benchmark datasets. On Multi-News, MFOG achieves ROUGE-1 and ROUGE-2 scores of 51.08 and 22.76, representing improvements of 1.23 and 1.11, respectively, over the strongest baselines. On DUC-2004, it achieves a ROUGE-1 score of 36.12 (a 1.30 improvement) and a BERTScore of 40.16 (a 1.14 improvement). This preliminary study validates the feasibility of the MFOG framework, demonstrating its potential to produce summaries that are both comprehensive and coherent.

1. Introduction

Large language models (LLMs) have demonstrated considerable capabilities in text comprehension and generation, significantly advancing the field of multi-document summarization (MDS). However, the direct application of LLMs to MDS tasks is hindered by challenges such as inconsistent quality and a tendency to hallucinate. These issues stem from two fundamental limitations in current methodologies. First, there is a lack of structured, multi-dimensional mechanisms for quality monitoring and diagnostics. This transforms the generation process into an opaque “black box” [1], complicating the real-time evaluation and attribution of summary quality. Second, existing models lack intelligent, closed-loop strategies for depth refinement. Consequently, even when deficiencies like missing or biased information are identified, the models cannot perform targeted, self-driven optimization [2]. These limitations create a significant bottleneck, undermining the reliability and practical applicability of contemporary MDS systems. To illustrate these limitations and the benefits of our approach, we present a case study in Figure 1. This figure visually contrasts an LLM-only generated summary with one enhanced by our MFOG framework, demonstrating the significant disparities in both content depth and human evaluation metrics.
To address these challenges, this paper introduces a Self-Optimizing Multi-Path Fusion Framework. The core idea is to break down the complex MDS task into a coordinated process that balances the depth and breadth of the resulting summary. The framework first decomposes the summarization task into a series of analytical dimensions. Based on this decomposition, an innovative dual-path generation architecture is employed. A depth-focused path follows the dimensional sequence, utilizing a closed-loop “Generate–Evaluate–Optimize” model. Within this path, content generated for each dimension is assessed by a dedicated evaluation model; if quality is insufficient, an optimization module is triggered to retrieve additional information and regenerate the content, thereby ensuring depth and logical coherence. Concurrently, a breadth-first parallel path generates content for all dimensions simultaneously and independently. This parallel approach ensures high processing efficiency and prevents the accumulation of errors common in serial processes, thus guaranteeing summary breadth and comprehensiveness. Finally, a fusion module performs a weighted integration of the summaries from both paths to produce a final output that optimally balances these critical qualities.
The primary contributions of this paper are as follows. First, we propose the Self-Optimizing Multi-Path Fusion Framework (MFOG), a new conceptual architecture for MDS that combines depth-focused refinement with breadth-focused parallel generation to balance summary depth and comprehensiveness. Second, we design a structured and self-correcting generation mechanism, which incorporates a “cognitive chain” for logical task decomposition and a closed-loop optimization system. This system is designed to enable a shift from uncontrolled generation to a diagnostic, self-correcting paradigm. Third, we provide an illustrative empirical validation through extensive experiments on multiple benchmark datasets. The results validate the potential of the MFOG architecture and confirm the distinct contributions of its core components through a thorough ablation analysis.

2. Related Work

2.1. Traditional Multi-Document Summarization Methods

MDS methods have evolved from traditional algorithmic approaches to the current dominance of large pre-trained language models. Early research in MDS was characterized by two primary approaches. Extractive methods, including seminal works like TextRank [3], focused on identifying and selecting salient sentences, often incorporating algorithms such as Maximal Marginal Relevance (MMR) [4] to improve content diversity. Concurrently, early abstractive methods, exemplified by the work of Barzilay and McKeown [5], sought to fuse information from multiple sources. However, these traditional approaches were fundamentally limited by shallow semantic understanding and difficulties in generating coherent, novel text.
The advent of deep learning marked a significant turning point. Neural sequence-to-sequence models, enhanced by innovations like the attention mechanism and pointer-generator networks [6], greatly improved the capacity for capturing document-level semantics. Subsequent advancements, including hierarchical architectures [7] and graph neural networks [8], further refined the ability to model complex inter-document relationships. To improve informational precision, other approaches operated at a sub-sentential level. ProCluster [9], for example, represents this proposition-based paradigm by using Open Information Extraction (OpenIE) to extract, cluster, and then fuse key propositions into summary statements, thereby mitigating redundancy.
Currently, the state of the art is dominated by Transformer-based [10] pre-trained language models (PLMs). Models such as BART [11], T5 [12], and PEGASUS [13] have established new performance benchmarks on standard datasets like Multi-News. Further specializing these architectures, PRIMERA [14] is an advanced generative model specifically pre-trained for MDS that employs an innovative pyramid-based masking strategy to enhance cross-document information aggregation. In addition to task-specific architectures, our work also aligns with a broader methodological tradition of integrating machine learning architectures with structured, domain-specific analytical pipelines. For example, recent work by Pazhouhan et al. [15] demonstrates a relevant framework combining ML modeling with structured feature engineering and domain ontologies, supporting the value of structured, domain-aware systems over generic models.

2.2. LLM-Based MDS Methods

The advent of large language models (LLMs) has significantly reshaped the landscape of MDS. Foundational models such as GPT-3 first demonstrated the powerful zero-shot learning capabilities of LLMs in complex multi-document contexts. Subsequent models, including GPT-3.5 and GPT-4, have shown that through sophisticated prompt engineering, they can effectively manage information redundancy and conflicts across multiple source documents.
To address the inherent limitations of LLMs, such as factual hallucinations and context length constraints, RAG [16] has emerged as a critical paradigm. RAG frameworks enhance summary quality by dynamically retrieving and incorporating relevant text passages, thereby mitigating information loss. Foundational work in this area, such as DPR [17], established efficient dense vector retrieval techniques that are instrumental in multi-document settings. Recent research has focused on augmenting RAG with more sophisticated mechanisms for reasoning and self-correction. A primary direction is the decomposition of complex tasks into logical steps. Chain-of-Thought (CoT) prompting [18], for example, enhances performance on complex MDS tasks by simulating a human-like reasoning process. In parallel with structuring the reasoning process, other approaches introduce structured representations of the source documents. RAPTOR [19], for instance, constructs a tree-based structure via recursive summarization and clustering to effectively integrate information from local details to global themes. Similarly, GraphRAG [20] also structures the source data, but by constructing a knowledge graph to guide retrieval. Another advanced technique, NEXUSSUM [21], utilizes a multi-stage, hierarchical LLM agent framework with a modular “preprocessing–summarization–compression” workflow to improve the coherence and quality of summaries for extensive documents. To enhance the reliability of such frameworks, another line of research has focused on self-correction. Self-RAG [22] trains a language model to perform on-demand retrieval and self-reflection, substantially improving content quality and factual accuracy. Complementing this, CRAG [23] incorporates an evidence-weighted correction mechanism to specifically address information conflicts in multi-document scenarios. The underlying principle of structured, progressive reasoning is further validated in other domains. For instance, Relation-R1 [24] demonstrated in visual understanding that decomposing a complex recognition task into a progressive sequence (e.g., entity identification → relation inference) significantly improves model performance. This cross-domain evidence provides strong theoretical support for applying similar chain-of-reasoning methodologies to enhance summary coherence and depth in MDS.

3. Cognitive Chain-Driven Generative Framework

This section details the architecture and components of the proposed Self-Optimizing Multi-Path Fusion Framework (MFOG), which is designed to deconstruct the complex MDS task into a coordinated, multi-strategy process.

3.1. Overall Framework Architecture

To implement our approach, we designed the MFOG framework, the architecture of which is illustrated in Figure 2. Given a document collection $D = \{d_1, d_2, \ldots, d_N\}$, the framework’s objective is to generate a concise, coherent, and comprehensive summary $S_{final}$. The core of the framework is an innovative dual-path generation module that balances summarization depth and breadth:
  • The depth-focused path follows the dimensional sequence, utilizing a closed-loop “Generate–Evaluate–Optimize” model to ensure logical coherence.
  • The breadth-first parallel path generates content for all dimensions simultaneously, ensuring high efficiency and preventing error accumulation.
Figure 2. The overall architecture of the proposed MFOG framework. The diagram illustrates the dual-path (depth and parallel) generation process, the self-optimization loop, and the final fusion stage.
Finally, a fusion module performs a weighted integration of the outputs from both paths, $W_1$ (Depth-Path-Generated Document) and $W_2$ (Breadth-Path-Generated Document), to produce $S_{final}$.

3.2. The Cognitive Chain

The core component of our framework is the Cognitive Chain. Inspired by the stepwise reasoning inherent in human cognition, we approach the MDS task by systematically deconstructing it into a sequence of progressive analytical dimensions. This approach moves beyond unstructured generation, imposing a logical pathway to guide the model’s summarization process.
We formally define the Cognitive Chain as an ordered sequence $CC$:

$$CC = \langle \dim_1, \dim_2, \ldots, \dim_M \rangle$$

A typical chain, $\langle$Background, Content, Impact, Trends$\rangle$, ensures a logical exploration. Figure 3 illustrates this deconstruction by contrasting a fragmented, LLM-only summary with the structured, multi-faceted output from our MFOG framework for a natural disaster event.

3.3. Implementation of the Cognitive Chain

The implementation of the Cognitive Chain maps each analytical dimension to a specific, engineered prompt. These prompts are structured templates that guide the LLM generation for each sub-task. This structured methodology constrains the model’s generative space, mitigating ambiguity and hallucination from overly complex, single-shot requests.
A critical aspect of this implementation is the fixed, hierarchical sequence of the dimensions, which follows a logical progression from basic understanding to detailed analysis. We define this sequence as $\langle$Background, Content, Impact, Trends$\rangle$. This fixed order is not arbitrary; it establishes a crucial dependency, ensuring that each step is predicated upon the solid foundation laid by the previous one. For example, a coherent analysis of Impact is only possible after the core Content has been established, and forecasting Trends requires a robust understanding of that impact. This sequential, dependency-aware process is key to building a comprehensive and logically sound summary. To provide full methodological transparency and address the core issue of reproducibility, Table 1 details the exact prompt templates used to implement this framework. These prompts are the critical implementation artifacts guiding the LLM. The table specifies the distinct prompts used for the sequential depth-focused path (which consumes output from the previous step) and the breadth-first parallel path.
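To make the sequential dependency concrete, the following minimal Python sketch shows one way the dimension-to-prompt mapping could be wired up. The template strings abbreviate the full prompts in Table 1, and `call_llm` is an assumed helper standing in for the underlying LLM API; this is an illustration, not our exact implementation.

```python
# Minimal sketch of the depth path's sequential prompting.
# Templates are abbreviated from Table 1; `call_llm` is an assumed helper.
COGNITIVE_CHAIN = ["Background", "Content", "Impact", "Trends"]

DEPTH_PROMPTS = {
    "Background": ("Summarize the foundational background (key entities, "
                   "timelines, locations) from: {docs}. Do not describe "
                   "the event itself, only the context."),
    "Content": ("Using the source documents {docs} and the previously "
                "generated output {prior}, synthesize the core facts and "
                "key developments."),
    "Impact": ("Based on the prior analysis {prior} and {docs}, analyze "
               "the immediate and long-term impacts."),
    "Trends": ("Based on the full analysis {prior} and {docs}, forecast "
               "future trends derived from the impact analysis."),
}

def run_depth_path(docs: str, call_llm) -> dict:
    """Generate each dimension in the fixed order, feeding the previous
    (optimized) output forward as prior knowledge."""
    outputs, prior = {}, ""
    for dim in COGNITIVE_CHAIN:
        prompt = DEPTH_PROMPTS[dim].format(docs=docs, prior=prior)
        outputs[dim] = call_llm(prompt)
        prior = outputs[dim]  # dependency: each step builds on the last
    return outputs
```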

3.4. Dual-Path Generation and Optimization

To effectively balance summary depth and breadth, the core of the MFOG framework is a dual-path generation mechanism comprising two complementary strategies: a depth-focused path and a breadth-first parallel path. The depth path is designed to ensure logical coherence and informational depth. It generates content sequentially, strictly adhering to the order defined by the Cognitive Chain, so that each step builds upon the optimized output of the preceding one. The content generated at the n-th attempt for the m-th dimension, $C_m^{(n)}$, can be expressed as

$$C_m^{(n)} = G_{LLM}(D, P_m, C_{m-1})$$

where $G_{LLM}$ is the generation function, $D$ is the document set, $P_m$ is the prompt for the current dimension, and $C_{m-1}$ is the optimized output from the previous dimension, serving as prior knowledge. A key component of this path is the self-optimization mechanism, $O$, which assesses the generated content against a predefined quality threshold, $\theta$:

$$O(C_m^{(n)}) = \begin{cases} C_m^{(n)}, & \text{if } \operatorname{Score}(C_m^{(n)}) \geq \theta \\ \operatorname{Regenerate}(C_m^{(n)}), & \text{otherwise} \end{cases}$$

To implement this, $\operatorname{Score}(C_m^{(n)})$ is obtained via dimension-specific evaluation models. These evaluators are realized by employing dedicated scoring prompts for each analytical dimension, guiding the foundational LLM to assess the content it has generated. In all experiments, the quality threshold $\theta$ was empirically set to 0.8.
It is crucial to clarify these two distinct evaluation mechanisms. The “dimension-specific evaluation models” (Section 3.4) are part of the internal generation loop, providing a rapid, automated quality heuristic (the [0, 1] score) to decide if regeneration is needed. In contrast, the “Overall Quality Assessment” (Section 3.5) and the ROUGE/BERTScore metrics (Section 4.2) are the external, formal evaluation frameworks used after generation is complete to measure the final summary’s quality for our experimental results. The internal heuristic does not directly participate in the final external evaluation.
Should the quality score, $\operatorname{Score}(C_m^{(n)})$, fall below the threshold, the mechanism triggers a “Generate–Evaluate–Optimize” closed loop. It is important to clarify the dual role of retrieval within MFOG. While an initial ‘Retriever’ (shown in Figure 2) is used in a pre-processing step to select the initial TopK documents (TopK Doc), the RAG module activated here serves a distinct, on-demand corrective function. As detailed in the ‘Optimizing Module’ diagram, an ‘LLM Diagnosis’ step first identifies the specific content deficiency and formulates a ‘New Query’. This targeted query is then used by the RAG module to retrieve additional information from the vector database to address the diagnosed flaw. This new information is then used to refine the generation prompt for content regeneration, creating a self-correcting cycle. The final output of the depth path is the aggregation of all optimized dimensional outputs, denoted as $S_{depth}$.
In contrast, the breadth-first parallel path is engineered for high efficiency and comprehensive coverage. It generates content for all M analytical dimensions in parallel. As illustrated in Figure 2, this path also incorporates an optimization step to enhance quality. However, to maintain high throughput and distinguish it from the depth path, this optimization is limited to a single pass (i.e., one round of optimization) and does not trigger the iterative RAG-based closed loop.
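As a concrete illustration, the two paths could be orchestrated along the following lines. This is a minimal sketch: the `generate`, `evaluate`, `regenerate`, and `refine` callables are hypothetical stand-ins for the prompt-driven components described above, θ = 0.8 follows the text, and the cap on regeneration rounds is our assumption (the paper does not state one).

```python
from concurrent.futures import ThreadPoolExecutor

THETA = 0.8       # quality threshold from Section 3.4
MAX_ROUNDS = 3    # assumed cap on the closed loop; not specified in the text

def depth_dimension(docs, dim, prior, generate, evaluate, regenerate):
    """Depth path: closed-loop Generate-Evaluate-Optimize for one dimension."""
    content = generate(docs, dim, prior)
    for _ in range(MAX_ROUNDS):
        if evaluate(content, dim, docs) >= THETA:
            break
        content = regenerate(content, dim, docs)  # RAG-backed correction
    return content

def run_parallel_path(docs, dims, generate, evaluate, refine):
    """Breadth path: all dimensions generated independently and concurrently,
    with at most one single-pass refinement (no iterative RAG loop)."""
    def one(dim):
        content = generate(docs, dim, None)
        if evaluate(content, dim, docs) < THETA:
            content = refine(content, dim, docs)  # single pass only
        return dim, content

    with ThreadPoolExecutor() as pool:
        return dict(pool.map(one, dims))
```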

3.4.1. Self-Optimization Logic and Evaluation Prompts

The core of the self-optimization loop is the Score(C) function, which is designed to transform a qualitative “scoring prompt” into a reliable numerical metric. To address the opacity of traditional ’black box’ generation, we implement a transparent, two-stage quantitative assessment mechanism that explicitly decouples qualitative critique from quantitative scoring. This approach fundamentally shifts the role of the base LLM from a simple generator to a dual-function evaluator.
In the first stage, the model operates as an expert critic, prompted to generate a detailed textual critique that identifies specific factual errors or omissions based on the source documents. Subsequently, in the second stage, the model functions as a strict scoring engine, converting this evidence-based critique into a normalized numerical score $S \in [0, 1]$. This structured progression ensures that the final score is not an arbitrary impression but is grounded in specific, interpretable observations. The precise logic of this “Generate–Evaluate–Optimize” cycle is detailed in Algorithm 1, and the structured prompts implementing each stage are provided in Table 2. As noted above, the quality threshold $\theta = 0.8$ was determined through preliminary testing to optimally balance summary quality with computational cost.
Algorithm 1 Dimension-specific evaluation logic (internal loop).
Require: Generated content C, dimension dim, source documents D
Ensure: Quality score S ∈ [0, 1]
 1: procedure Evaluate(C, dim, D)
 2:     S ← 0.0                                             ▹ Initialization
 3:     // Step 1: Generate qualitative critique via LLM
 4:     // Construct prompt with context and dimension-specific criteria
 5:     args_crit ← {T_critique, C, dim, D}
 6:     P_crit ← FillTemplate(args_crit)
 7:     // LLM acts as an expert evaluator to identify flaws
 8:     Critique ← LLM(P_crit)                              ▹ Identify factual errors/missing info
 9:     // Step 2: Quantitative scoring based on the critique
10:     // Transform the textual critique into a numerical score
11:     args_score ← {T_scoring, C, Critique}
12:     P_score ← FillTemplate(args_score)
13:     RawOutput ← LLM(P_score)                            ▹ e.g., “Score: 0.85” (above the threshold 0.8)
14:     // Step 3: Parse output and finalize
15:     S ← ParseFloat(RawOutput)
16:     return S
17: end procedure
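A minimal Python rendering of this two-stage logic might look as follows. The template names mirror Table 2, `call_llm` is an assumed helper, and the regex-based parsing (with a fail-closed default that forces regeneration) is our own defensive choice rather than something specified above.

```python
import re

def evaluate_content(content, dim, docs, call_llm, critique_tmpl, scoring_tmpl):
    """Two-stage evaluation (Algorithm 1): qualitative critique first, then a
    numerical score parsed from output such as 'Score: 0.85'."""
    # Stage 1: the LLM acts as an expert critic of its own output
    critique = call_llm(critique_tmpl.format(docs=docs, dim=dim, content=content))
    # Stage 2: the LLM converts the critique into a normalized score
    raw = call_llm(scoring_tmpl.format(content=content, critique=critique))
    match = re.search(r"Score:\s*([01](?:\.\d+)?)", raw)
    score = float(match.group(1)) if match else 0.0  # fail closed: regenerate
    return max(0.0, min(1.0, score)), critique
```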

3.4.2. On-Demand RAG and Query Formulation

When $\operatorname{Score}(C) < \theta$, the RAG module is triggered to perform targeted correction. As shown in Figure 2, this process is managed by an “LLM Diagnosis” step, which utilizes the text_critique from Section 3.4.1 to formulate a precise, context-aware search query. The query then retrieves additional information to facilitate content regeneration. The detailed procedural logic for this trigger mechanism is presented in Algorithm 2, and the specific prompts used for diagnosis and regeneration are detailed in Table 3.
Algorithm 2 Pseudo-code for RAG trigger and query formulation.
Require: Content C, critique Crit, source documents D, vector DB V
Ensure: Regenerated content C_new
 1: procedure Regenerate(C, Crit, D, V)
 2:     Query_new ← null                                    ▹ Initialization
 3:     // Step 1: Diagnostic query formulation
 4:     // Analyze the critique to identify missing information
 5:     args_diag ← {T_diagnosis, C, Crit}
 6:     P_diag ← FillTemplate(args_diag)
 7:     Query_new ← LLM(P_diag)                             ▹ e.g., “economic impact of typhoon”
 8:     // Step 2: Targeted retrieval
 9:     // Retrieve supplementary evidence from the vector database
10:     Docs_supp ← Retrieve(V, Query_new)
11:     // Step 3: Content regeneration
12:     // Rewrite content using the original context and new evidence
13:     args_regen ← {T_regen, C, Docs_supp}
14:     P_regen ← FillTemplate(args_regen)
15:     C_new ← LLM(P_regen)
16:     return C_new
17: end procedure
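The following sketch fills in Algorithm 2 with a concrete retrieval backend. FAISS matches the software stack listed in Section 4.3, but the `embed` helper and the template names are illustrative placeholders, not artifacts of our implementation.

```python
import numpy as np
import faiss  # vector index; FAISS is part of the paper's software stack

def regenerate_with_rag(content, critique, index: faiss.Index, passages,
                        embed, call_llm, diagnosis_tmpl, regen_tmpl, k=5):
    """On-demand corrective RAG (Algorithm 2): diagnose, retrieve, rewrite."""
    # Step 1: diagnostic query formulation from the critique
    query = call_llm(diagnosis_tmpl.format(content=content, critique=critique))
    # Step 2: targeted retrieval of supplementary evidence
    q_vec = np.asarray([embed(query)], dtype="float32")
    _, ids = index.search(q_vec, k)          # faiss.Index.search -> (D, I)
    evidence = "\n".join(passages[i] for i in ids[0] if i != -1)
    # Step 3: regeneration with the original draft, issues, and new evidence
    return call_llm(regen_tmpl.format(content=content, critique=critique,
                                      evidence=evidence))
```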

3.5. Overall Quality Assessment

To facilitate a granular and structured evaluation, we established a multi-dimensional quality assessment framework. This framework decomposes summary quality into four distinct analytical dimensions, each with a set of detailed scoring criteria.
The Background Information dimension is assessed based on three criteria. Completeness requires the inclusion of essential background (e.g., time, location, and subjects). Accuracy measures the fidelity of this information to the source documents. Finally, Clarity evaluates the conciseness and comprehensibility of the presentation.
For the Event Content, we evaluate four core elements. Completeness pertains to the coverage of core processes and key event details. Accuracy ensures factual consistency with the source material. Structure assesses the logical coherence of the narrative, while Significance measures the effective emphasis on the most critical information.
The Impact Analysis dimension is evaluated on four criteria. Depth requires analyzing underlying causal mechanisms rather than just superficial facts. Breadth covers influence across multiple domains (e.g., social and economic). Reasonableness is judged by evidential support and logical rigor, and Uniqueness focuses on the extraction of novel, unstated insights.
Finally, the Trend Forecasting dimension is assessed for its Reasonableness, which demands a logical and evidential basis for predictions. Foresight evaluates the insightful nature of the predictions, Diversity considers the presentation of multiple potential pathways, and Prudence examines the appropriate level of certainty in the language used. For practical implementation, the core criteria for each dimension are formulated into structured prompts to guide an LLM-based evaluator. The model then assigns a quality score from 0 to 10 for each dimension, enabling a quantitative assessment.
These dimensions are structured to follow a logical cognitive progression. Background Information establishes the necessary Background, upon which the factual narrative of the Event Content is constructed. Subsequently, Impact Analysis elevates this factual understanding to a deeper interpretation of the event’s significance. Finally, Trend Forecasting extends the analytical horizon to future trajectories, providing forward-looking insights that can support decision-making.

3.6. LLM Fusion Document Module

Following the generation of the depth-focused summary ($S_{depth}$) and the breadth-first parallel summary ($S_{parallel}$), a final integration step is required to synthesize these outputs. Mere concatenation is insufficient, as demonstrated in our ablation studies (Section 4.5), where simple concatenation degraded structural coherence. Instead, the “LLM Fusion Document” module executes a weighted integration process. Empirical parameter studies indicate that optimal performance is achieved by explicitly instructing the LLM to prioritize the logical flow of $S_{depth}$ while incorporating complementary details from $S_{parallel}$. This strategy corresponds to a functional 7:3 (depth-to-parallel) ratio, which is implemented via the prioritization instruction in the fusion prompt. The specific prompt used to achieve this weighted fusion is detailed in Table 4.
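A minimal sketch of how this prioritization instruction could be realized in code follows; the template paraphrases Table 4, and `call_llm` is again an assumed helper.

```python
# Illustrative fusion template paraphrasing Table 4 (not the verbatim prompt).
FUSION_TMPL = (
    "Role: Senior Editor.\n"
    "Draft 1 (depth-focused): {depth}\n"
    "Draft 2 (breadth-focused): {parallel}\n"
    "Task: Synthesize a final summary. Use Draft 1 as the primary structural "
    "foundation (approx. 70% weight) to preserve logical coherence, and "
    "integrate unique complementary details from Draft 2 (approx. 30% weight). "
    "Avoid redundancy. Output the final polished summary."
)

def fuse(s_depth: str, s_parallel: str, call_llm) -> str:
    """Weighted fusion via the prioritization instruction (cf. Table 4)."""
    return call_llm(FUSION_TMPL.format(depth=s_depth, parallel=s_parallel))
```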
Additionally, further validation confirmed the internal design choices of the Cognitive Chain. Ablation experiments affirmed the necessity of all four dimensions, with the ’Content’ dimension proving to be the most critical foundation, while all dimensions contribute collectively to the summary’s holistic quality. Finally, the framework’s robustness and model-agnostic nature were demonstrated by integrating it with various base LLMs (e.g., Llama, Gemma, and Mistral). These experiments confirmed that MFOG performance scales with the capability of the underlying model, validating its function as a universal enhancement framework for complex summarization tasks.

4. Experimental Evaluation

4.1. Dataset

The performance of the proposed MFOG framework was evaluated on three established MDS benchmark datasets: Multi-News, DUC-2004, and Multi-XScience. Each dataset provides document clusters and corresponding reference summaries, enabling a comprehensive assessment of our method across different domains and complexities.
  • Multi-News [25] is a large-scale dataset derived from the news aggregator newser.com, comprising 56,216 instances. Each instance consists of 2–10 related news articles and a human-authored summary. The documents, which span diverse domains such as politics, business, and sports, feature significant informational overlap and complementarity. This dataset is therefore ideal for assessing the ability of our four-dimensional Cognitive Chain to integrate complex, multifaceted information.
  • DUC-2004 [26] is a classic MDS benchmark consisting of 50 document clusters, each containing 10 news articles on a specific topic (e.g., political events and natural disasters). For each cluster, four human-written reference summaries are provided, from which one is used as the target in our experiments. As a gold standard in the field, DUC-2004 provides an authoritative platform for benchmarking the performance of our framework against prior extractive and abstractive work.
  • Multi-XScience [27] is a large-scale dataset of scientific articles sourced from ArXiv and the Microsoft Academic Graph, comprising over 40,000 instances. Each instance includes the abstract of a target paper and the abstracts of the related papers it cites. This dataset was selected to evaluate the cross-domain generalization of our framework and to test the adaptability of the Cognitive Chain to academic texts. Notably, we applied the same fixed Cognitive Chain to the Multi-XScience dataset without modification. We hypothesize that this generalization is effective because our defined chain aligns robustly with the standard structure of scientific communication:
    · Background maps naturally to the Introduction and Related Work, setting the research context.
    · Content corresponds to the Methodology and Results, detailing the core scientific contribution.
    · Impact aligns with the Discussion and Conclusion, interpreting the significance of the findings.
    · Trends relates to the Future Work sections, forecasting subsequent research directions.
    This alignment suggests that the proposed Cognitive Chain is not an arbitrary construct but a fundamental cognitive sequence for analyzing informational documents. The instruction-tuned Llama-2-7B-chat model demonstrates the flexibility to adapt these abstract dimensions to the specific structural norms of diverse domains, confirming the framework’s robust generalization capabilities.

4.2. Evaluation Metrics

To comprehensively evaluate the performance of our framework, we conducted both automated and human evaluations. This study utilizes ROUGE-1, ROUGE-2, and ROUGE-L from the ROUGE suite [28] to measure the lexical overlap between the generated and reference summaries at the word, phrase, and sentence levels, respectively. BERTScore [29] assesses semantic similarity by computing the cosine similarity between the contextual embeddings of tokens from the generated and reference summaries, leveraging the deep contextual understanding of BERT [30].
METEOR (Metric for Evaluation of Translation with Explicit ORdering) [31] provides a more flexible semantic alignment score by incorporating stemming, synonymy, and paraphrase matching, moving beyond exact lexical overlap. To complement the automated metrics, we conducted a manual evaluation with five graduate students specializing in natural language processing. These expert evaluators assessed the generated summaries on a 10-point scale across five standard criteria [32]: coherence, consistency, fluency, relevance, and overall quality.
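For the automated metrics, a computation along the following lines would suffice. The `rouge-score`, `bert-score`, and `nltk` packages are our assumed choices, as the paper does not name its metric implementations.

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from nltk.translate.meteor_score import meteor_score  # needs the NLTK 'wordnet' corpus

def score_summary(generated: str, reference: str) -> dict:
    """Compute the automated metrics used in Section 4.2 for one pair."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)
    rouge = scorer.score(reference, generated)  # (target, prediction) order
    _, _, f1 = bert_score([generated], [reference], lang="en")
    return {
        "rouge1": rouge["rouge1"].fmeasure,
        "rouge2": rouge["rouge2"].fmeasure,
        "rougeL": rouge["rougeL"].fmeasure,
        "bertscore_f1": f1.item(),
        # NLTK's METEOR expects pre-tokenized input
        "meteor": meteor_score([reference.split()], generated.split()),
    }
```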

4.3. Implementation Environment

To ensure a fair and reproducible comparison, the experimental setup was standardized across all LLM-based methods.
Foundational Model: Both our proposed MFOG framework and all state-of-the-art baselines (GraphRAG, RAPTOR, and NEXUSSUM) were implemented using Llama-2-7B-chat as the core generative model. We selected the Llama-2-7B-chat variant over the base Llama-2-7B model because it has undergone extensive instruction fine-tuning and reinforcement learning from human feedback (RLHF). This makes it more adept at interpreting the complex, multi-dimensional instructions and the structured guidance of our Cognitive Chain. This controlled variable ensures that observed performance differences are attributable to the frameworks themselves, rather than variations in the underlying models.
Experimental Environment: All experiments were conducted on a workstation equipped with two NVIDIA RTX 4090 (24 GB) GPUs. The software stack consisted of Ubuntu 22.04 LTS, Python 3.10, PyTorch 2.0.1, Transformers 4.30.2, PEFT 0.5.0, and FAISS 1.7.4.
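For reference, a generation helper consistent with this setup might be constructed as below; the Hugging Face model ID points to the public Llama-2-7B-chat checkpoint, and the decoding parameters are illustrative rather than taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # instruction-tuned chat variant

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",  # shards the model across the two RTX 4090s
)

def call_llm(prompt: str, max_new_tokens: int = 512) -> str:
    """Generate a completion and return only the newly generated text."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tok.decode(new_tokens, skip_special_tokens=True)
```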

4.4. Result Analysis

The selection of baselines was designed to cover the primary methods of multi-document summarization. TextRank was chosen as a classic, non-neural graph-based extractive method. ProCluster represents advanced proposition-based clustering methods. PRIMERA serves as a strong, pre-trained baseline specifically designed for MDS tasks. Finally, GraphRAG, RAPTOR, and NEXUSSUM were included, as they represent the current state of the art in LLM-based hierarchical and graph-based approaches, providing a robust benchmark for verifying the effectiveness of our dual-path architecture. We evaluated the MFOG framework against the six baseline models across the three benchmark datasets.
As shown in Table 5, MFOG consistently and significantly outperforms all baseline methods across all datasets and evaluation metrics. This demonstrates the superior and robust performance of our proposed multi-path fusion architecture.
The experimental results consistently demonstrate a clear performance hierarchy among the different classes of models. The conventional baselines, including TextRank and ProCluster, lag significantly behind all LLM-based approaches, underscoring the major advancement in semantic understanding and generative capabilities brought by large language models. PRIMERA, with its MDS-specific pre-training, occupies an intermediate position, validating its focused design but also highlighting its limitations compared to more advanced, instruction-tuned LLM frameworks.
Among the state-of-the-art baselines, the results reveal a pattern of specialized strengths rather than uniform superiority. Each framework excels on different facets of the summarization task, attributable to their distinct architectures. For instance, the NEXUSSUM multi-stage workflow appears optimized for detail retention, leading to its strong ROUGE-2 score on Multi-News. In contrast, the RAPTOR hierarchical tree structure excels at global content selection, securing the highest ROUGE-1 score in the same category. Similarly, the strength of GraphRAG on the DUC-2004 dataset can be linked to its knowledge graph-based approach for capturing global context. While these methods demonstrate impressive performance on specific metrics, none achieve consistent dominance across all evaluation criteria.
In contrast, the MFOG framework demonstrates consistent and robust superiority across all three datasets and all evaluation metrics. This superior performance is directly attributable to its unique dual-path fusion architecture, which combines two complementary strategies. The depth path, guided by the Cognitive Chain (Background → Content → Impact → Trends), ensures the summary possesses deep logical coherence and progressively builds a sophisticated understanding of the topic. Concurrently, the breadth-first parallel path guarantees comprehensive coverage and high efficiency by processing all analytical dimensions simultaneously. By intelligently integrating the outputs of these two paths, MFOG produces summaries that are simultaneously deep in reasoning and broad in scope, resolving the trade-offs that limit the performance of other specialized frameworks and establishing its state-of-the-art performance.
The results of the human evaluation, presented in Table 6, confirm the superiority of the MFOG framework across all five assessment criteria. On the Multi-News dataset, MFOG consistently achieves the highest average scores. For instance, in Coherence, MFOG (9.35) surpasses the strongest baseline, RAPTOR (8.65). This significant improvement is attributable to the Cognitive Chain-driven depth path, which ensures the preservation of logical connections across documents. Similarly, its leading score of 9.01 in consistency outperforms competitors like GraphRAG (8.82), highlighting the effectiveness of the dual-path fusion strategy in mitigating informational conflicts.
Beyond the average scores, a critical finding is the exceptional stability of the MFOG framework. MFOG consistently exhibits the lowest standard deviation (SD) across all dimensions (0.06–0.07), indicating a high degree of reliability and predictable performance. This enhanced stability validates the efficacy of the “Generate–Evaluate–Optimize” closed-loop mechanism in producing consistent, high-quality outputs.
The analysis also reveals the distinct performance profiles and inherent trade-offs of the baseline models. The high variance of extractive methods like TextRank and ProCluster (SD: 0.14–0.19) confirms their sensitivity to the quality of the input documents. Even advanced frameworks exhibit specialized strengths; for instance, GraphRAG achieves high scores in consistency (8.82) but is weaker in coherence, suggesting that while its knowledge graph preserves factual links, it may not fully capture the overall narrative flow. These results underscore the primary advantage of the MFOG framework: its ability to successfully integrate complementary strategies, thereby overcoming the specialized limitations of other approaches and achieving a state-of-the-art balance of quality, coherence, and stability.
Despite the strong performance of MFOG, failure analysis identifies two specific limitations. First, the model may synthesize unsupported figures when handling highly contradictory numerical data. Second, in extensive document clusters, the depth path occasionally exhibits “topic drifting” by focusing excessively on sub-topics. These findings indicate that while the Cognitive Chain enhances structure, future work should prioritize granular conflict resolution mechanisms.

4.5. Analysis of Ablation Studies

The results of our ablation studies, presented in Table 7, quantify the contribution of each core component within the MFOG framework. The significant performance gap between the Standard RAG baseline and all other variants immediately underscores the inadequacy of a simple, single-pass approach for complex MDS tasks, establishing the necessity of our structured architecture.
A direct comparison of the generation paths reveals that the depth-only variant (W/O Parallel Path) consistently outperforms the parallel-only variant (W/O Depth Path), indicating that the depth-focused process guided by the Cognitive Chain is the primary driver of summary quality. Furthermore, the study highlights the necessity of the weighted fusion module. While simple concatenation (W/O Fusion Module) improves upon single-path models, it yields inconsistent performance and can disrupt structural coherence (e.g., a lower ROUGE-L score than the depth-only model). In contrast, the full MFOG model’s weighted fusion combines the strengths of both paths to achieve superior scores across all metrics.
The analysis further isolates the distinct contributions of the foundational modules. Ablating the RAG module (W/O RAG) disproportionately impacts semantic metrics like BERTScore and METEOR, confirming its critical role in grounding the summary in factual evidence. Conversely, removing the Cognitive Chain (W/O Cognitive Chains) leads to a substantial degradation in the ROUGE-L score, which is strongly correlated with narrative coherence. This underscores the mechanism’s importance in imposing a logical structure and flow. Ultimately, the superior performance of the full MFOG model over all ablated variants indicates a powerful combined effect. The factual content supplied by RAG and the structural guidance from the Cognitive Chain are not merely additive; their integration is key to achieving a summary that is both factually accurate and logically coherent, driving the model to its state-of-the-art performance.
It is also important to acknowledge the computational trade-offs inherent in our architecture. The “Generate–Evaluate–Optimize” closed-loop, a core component of the depth path, necessarily incurs additional computational overhead compared to a single-pass approach like the Standard RAG baseline. This cost arises from the iterative evaluation and potential regeneration steps required to ensure high-quality, coherent outputs. As our ablation analysis demonstrates, the very mechanisms contributing to this overhead are indispensable for achieving the observed state-of-the-art performance. Therefore, we posit that this increase in computational demand is a justified trade-off for the significant and quantifiable improvements in summary coherence, factual accuracy, and overall quality, which are critical for building reliable and controllable MDS systems.

5. Discussion

While the results presented in Section 4.4 validate the potential of our conceptual framework, a deeper discussion of its implications, limitations, and practical utility is warranted.
Practical Implications and Workflow Transformation. The MFOG framework offers a practical solution for complex summarization tasks. Traditional workflows rely on human analysts to manually synthesize conflicting information. Our dual-path architecture automates this synthesis. The depth-path (guided by the Cognitive Chain) explicitly mimics the human analytical process, while the breadth-path provides rapid coverage. For organizations, this changes the human role from a “content generator” to a “reviewer”, significantly scaling throughput.
Limitations and Trade-offs. Several limitations must be acknowledged. First, as this was an illustrative study of a conceptual framework, we focused on metric-based performance improvements and did not conduct formal statistical significance tests (e.g., t-tests) with confidence intervals. While the gains are consistent, future work should include rigorous statistical validation to judge robustness. Second, there is a trade-off between performance and computational cost. The depth-focused path, with its iterative “Generate–Evaluate–Optimize” loop (Section 3.4), incurs higher latency than single-pass approaches. However, our ablation studies confirm this cost is justified by significant gains in coherence. Finally, the system relies on LLMs, which carry inherent risks in regulated environments. While our internal evaluation loop ($S < \theta$) acts as a monitoring mechanism, a human-in-the-loop (HITL) approach remains essential for risk-sensitive deployments.

6. Conclusions

This paper introduced MFOG, a novel Self-Optimizing Multi-Path Fusion Framework designed to address key limitations in multi-document summarization (MDS). Moving beyond conventional single-path paradigms, MFOG deconstructs the task using a dual-path architecture guided by a “Cognitive Chain”. A depth-focused path ensures informational depth through a RAG-enhanced closed loop, while a breadth-first parallel path guarantees comprehensive coverage.
Our illustrative experiments demonstrate that our framework consistently outperforms existing baselines, validating the feasibility of this approach. More importantly, by providing a detailed “conceptual replication package” (Algorithms 1 and 2 and Table 1, Table 2, Table 3 and Table 4), we offer a transparent and reproducible architecture. Future work will focus on rigorous statistical validation and adapting the Cognitive Chain to more diverse domains.

Author Contributions

Conceptualization, C.L., L.Z., J.Z. and Q.Z.; Methodology, C.L.; Software, C.L.; Validation, C.L.; Formal analysis, C.L.; Investigation, C.L.; Resources, C.L.; Data curation, C.L.; Writing—original draft, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Major Project, grant number 2023ZD0120603; and the Science Guidance Project of China National Textile and Apparel Council, grant number 2025047.

Data Availability Statement

The Multi-News, DUC-2004, and Multi-XScience datasets analyzed during this study are publicly available, with sources cited in the “Dataset” section (Section 4.1). To address the reproducibility of the proposed MFOG framework, we provide a comprehensive conceptual replication package within this manuscript. This package details the architectural logic, pseudo-code, and specific prompt templates necessary to reimplement the methodology. Furthermore, the full source code will be made publicly available on GitHub immediately upon acceptance. The critical implementation artifacts are located as follows:
1. Cognitive Chain Prompts: The specific prompts for the depth-focused and breadth-first parallel paths are provided in Table 1.
2. Self-Optimization Logic: The procedural logic for the internal evaluation loop is detailed in Algorithm 1, and the corresponding prompts for critique and scoring are listed in Table 2.
3. RAG-Trigger Mechanism: The logic for the on-demand retrieval trigger is presented in Algorithm 2, and the associated diagnosis and regeneration prompts are provided in Table 3.
4. Fusion Module: The logic and prioritization instruction for the final weighted fusion module are described in Section 3.6, with the specific prompt template provided in Table 4.
This detailed documentation provides all necessary components for researchers to replicate the proposed framework using standard LLM APIs.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual, 3–10 March 2021; pp. 610–623.
  2. Ma, C.; Zhang, W.E.; Guo, M.; Wang, H.; Sheng, Q.Z. Multi-document summarization via deep learning techniques: A survey. ACM Comput. Surv. 2022, 55, 1–37.
  3. Mihalcea, R.; Tarau, P. Textrank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 25–26 July 2004; pp. 404–411.
  4. Carbonell, J.; Goldstein, J. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 24–28 August 1998; pp. 335–336.
  5. Barzilay, R.; McKeown, K.R. Sentence fusion for multidocument news summarization. Comput. Linguist. 2005, 31, 297–328.
  6. See, A.; Liu, P.J.; Manning, C.D. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1073–1083.
  7. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489.
  8. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4–24.
  9. Ernst, O.; Caciularu, A.; Shapira, O.; Pasunuru, R.; Bansal, M.; Goldberger, J.; Dagan, I. Proposition-Level Clustering for Multi-Document Summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 1765–1779.
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
  11. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880.
  12. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
  13. Zhang, J.; Zhao, Y.; Saleh, M.; Liu, P. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 11328–11339.
  14. Xiao, W.; Beltagy, I.; Carenini, G.; Cohan, A. PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 5245–5263.
  15. Pazhouhan, M.; Karimi Mazraeshahi, A.; Jahanbakht, M.; Rezanejad, K.; Rohban, M.H. Wave and Tidal Energy: A Patent Landscape Study. J. Mar. Sci. Eng. 2024, 12, 1967.
  16. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474.
  17. Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.S.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.t. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, 16–20 November 2020; pp. 6769–6781.
  18. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
  19. Sarthi, P.; Abdullah, S.; Tuli, A.; Khanna, S.; Goldie, A.; Manning, C.D. Raptor: Recursive abstractive processing for tree-organized retrieval. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
  20. Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Larson, J. From local to global: A graph rag approach to query-focused summarization. arXiv 2024, arXiv:2404.16130.
  21. Kim, H.; Kim, B.H. NexusSum: Hierarchical LLM Agents for Long-Form Narrative Summarization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 10120–10157.
  22. Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-Rag: Learning to Retrieve, Generate, and Critique Through Self-Reflection. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024.
  23. Yan, S.Q.; Gu, J.C.; Zhu, Y.; Ling, Z.H. Corrective Retrieval Augmented Generation. arXiv 2024, arXiv:2401.15884.
  24. Li, L.; Chen, W.; Li, J.; Chen, L. Relation-r1: Cognitive chain-of-thought guided reinforcement learning for unified relational comprehension. arXiv 2025, arXiv:2504.14642.
  25. Fabbri, A.; Li, I.; She, T.; Li, S.; Radev, D. Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1074–1084.
  26. Over, P.; Yen, J. An Introduction to DUC-2004; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2004.
  27. Lu, Y.; Dong, Y.; Charlin, L. Multi-XScience: A large-scale dataset for extreme multi-document summarization of scientific articles. arXiv 2020, arXiv:2010.14235.
  28. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81.
  29. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv 2019, arXiv:1904.09675.
  30. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
  31. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72.
  32. Ryu, S.; Do, H.; Kim, Y.; Lee, G.; Ok, J. Multi-Dimensional Optimization for Text Summarization via Reinforcement Learning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 5858–5871.
Figure 1. A case study on the comparative effect of the MFOG framework in the tech news summarization task.
Figure 3. Comparative enhancement effect of the MFOG framework in major natural disaster report summarization. Red text highlights key details and specific information captured by each cognitive dimension.
Table 1. Prompt templates for the cognitive chain.
Table 1. Prompt templates for the cognitive chain.
Dimension | Path | Prompt Template
Background | Depth | Role: You are a professional summarization analyst. Task: Your goal is to provide the essential background and context. Input: Based on the provided documents: [Document Context]. Action: Summarize foundational background (key entities, timelines, locations). Do not describe the event itself, only the context. Produce a concise paragraph.
Background | Breadth | Role: You are a professional summarization analyst. Task: Your goal is to provide the essential background and context. Input: Based on the provided documents: [Document Context]. Action: Summarize foundational background (key entities, timelines, locations). Do not describe the event itself, only the context. Produce a concise paragraph.
Content | Depth | Role: You are a professional summarization analyst. Task: Your goal is to detail the core events, building upon established context. Input: You are given: 1. Source Documents: [Document Context]. 2. Previously Generated Background: [Generated Background]. Action: Using Source Documents, synthesize core facts/developments. Use 'Previously Generated Background' as a starting point and ensure logical flow. Focus only on the main event. Produce a concise paragraph.
Content | Breadth | Role: You are a professional summarization analyst. Task: Your goal is to independently detail the core events. Input: Based on the provided documents: [Document Context]. Action: Synthesize core facts and key developments from the documents. Focus only on the main event. Produce a concise paragraph.
Impact | Depth | Role: You are a professional summarization analyst. Task: Your goal is to analyze consequences based on prior analysis. Input: You are given prior analysis: 1. Background: [Generated Background]. 2. Core Content: [Generated Content]. 3. Source Documents: [Document Context]. Action: Based on prior analysis, analyze immediate and long-term impacts (e.g., social, economic) from Source Documents. Analysis must be a direct consequence of 'Core Content'. Produce a concise analytical paragraph.
Impact | Breadth | Role: You are a professional summarization analyst. Task: Your goal is to independently analyze the consequences of an event. Input: Based on the provided documents: [Document Context]. Action: Analyze immediate and long-term impacts of the event. Identify consequences across domains (e.g., social, economic) from the documents. Produce a concise analytical paragraph.
Trends | Depth | Role: You are a professional summarization analyst. Task: Your goal is to forecast future developments based on a complete analysis. Input: You are given the complete analysis: 1. Background: [Generated Background]. 2. Core Content: [Generated Content]. 3. Impact Analysis: [Generated Impact]. 4. Source Documents: [Document Context]. Action: Based on the full analysis, forecast future trends. Your forecast must be logically derived from the 'Impact Analysis'. Produce a concise forecast.
Trends | Breadth | Role: You are a professional summarization analyst. Task: Your goal is to independently forecast future developments. Input: Based on the provided documents: [Document Context]. Action: Forecast future trends and developments related to the event described in the documents. Produce a concise forecast.
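To make the two generation paths concrete, the sketch below shows one way the Table 1 templates could be chained. It is a minimal illustration rather than the paper's released implementation: the llm(prompt) completion helper is a hypothetical stand-in for any LLM client, and the prompt strings are abbreviated versions of the full templates above.

    DIMENSIONS = ["Background", "Content", "Impact", "Trends"]

    def depth_path(llm, docs: str) -> dict:
        """Sequential path: each dimension conditions on all prior outputs."""
        outputs = {}
        for dim in DIMENSIONS:
            prior = "\n".join(f"{k}: {v}" for k, v in outputs.items())
            prompt = (
                "Role: You are a professional summarization analyst.\n"
                f"Task: Summarize the {dim} dimension of the event.\n"
                f"Prior analysis:\n{prior or '(none)'}\n"
                f"Source documents:\n{docs}\n"
                "Produce a concise paragraph."
            )
            outputs[dim] = llm(prompt)
        return outputs

    def breadth_path(llm, docs: str) -> dict:
        """Parallel path: each dimension is generated independently."""
        return {
            dim: llm(
                "Role: You are a professional summarization analyst.\n"
                f"Task: Independently summarize the {dim} dimension.\n"
                f"Source documents:\n{docs}\n"
                "Produce a concise paragraph."
            )
            for dim in DIMENSIONS
        }

The only structural difference between the paths is the accumulated prior context: the depth path threads earlier outputs into each prompt, while the breadth path queries the sources fresh for every dimension.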
Table 2. Prompts for the internal evaluation model (Score(C)). These structured prompts transform the evaluation into a two-step quantitative process.
Prompt Name | Purpose | Prompt Template
CRITIQUE_TMPL | Generate Qualitative Critique | Role: You are an expert fact-checker and summarization evaluator. Input: 1. Source Documents: [Source Documents]. 2. Generated Summary for [Dimension]: [Content to Evaluate]. Criteria: Compare the Summary against the Source Documents. Check for: (a) Factual Accuracy: are there any hallucinations? (b) Completeness: is key information for this dimension missing? Action: Provide a brief, critical analysis of the summary's flaws. If it is perfect, state that.
SCORING_TMPL | Convert Critique to Score [0, 1] | Role: You are a strict scoring engine. Input: 1. Generated Summary: [Content to Evaluate]. 2. Expert Critique: [Text Critique]. Task: Based strictly on the critique, assign a quality score S ∈ [0.0, 1.0]. Scale: 1.0 = perfect, no errors; <0.8 = requires regeneration (contains errors or omissions). Output: Output only the score in this exact format: "Score: [value]".
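The two templates define a critique-then-score pipeline. A minimal sketch of how Score(C) could be wired together is shown below; llm() is again a hypothetical completion helper, and the regular expression is an assumption about the "Score: [value]" output format requested by SCORING_TMPL.

    import re

    REGENERATION_THRESHOLD = 0.8  # per Table 2: scores below 0.8 trigger regeneration

    def score_content(llm, docs: str, dimension: str, content: str):
        """Two-step Score(C): qualitative critique first, then a numeric score."""
        critique = llm(
            "Role: You are an expert fact-checker and summarization evaluator.\n"
            f"Source documents:\n{docs}\n"
            f"Generated summary for {dimension}:\n{content}\n"
            "Check factual accuracy and completeness; provide a brief critique."
        )
        raw = llm(
            "Role: You are a strict scoring engine.\n"
            f"Summary:\n{content}\nExpert critique:\n{critique}\n"
            "Output only the score in this exact format: Score: [value]"
        )
        # Parse "Score: 0.85" (or "Score: [0.85]"); treat unparseable output as failing.
        match = re.search(r"Score:\s*\[?([01](?:\.\d+)?)\]?", raw)
        score = float(match.group(1)) if match else 0.0
        return critique, score

Keeping the critique as free text and only then collapsing it to a number lets the score be grounded in an explicit rationale, and the critique itself can be reused downstream by the RAG-trigger prompts in Table 3.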
Table 3. Prompts for the RAG-trigger module. These prompts guide the diagnosis of deficiencies and the subsequent retrieval-augmented regeneration.
Prompt Name | Purpose | Prompt Template
DIAGNOSIS_PROMPT | Formulate Search Query | Role: You are a diagnostic assistant focused on information retrieval. Input: 1. Failed Summary Draft: [Failed Content]. 2. Expert Critique: [Text Critique]. Task: Analyze the critique to identify exactly what information is missing or factually incorrect. Formulate a specific, targeted search query to retrieve this missing evidence from the knowledge base. Output: Output only the search query string.
REGENERATION_PROMPT | Regenerate with Context | Role: You are a senior editor performing corrective summarization. Input: 1. Original Draft: [Failed Content]. 2. Identified Issues: [Text Critique]. 3. New Evidence: [Retrieved Snippets]. Task: Rewrite the draft to address the identified issues. You must incorporate the "New Evidence" to fix factual errors or fill information gaps. Maintain the original style but ensure accuracy and completeness. Output: The revised, optimized summary paragraph.
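Combining Tables 2 and 3, a deficient draft can be repaired in a single diagnose-retrieve-regenerate pass. The sketch below assumes a retrieve(query) helper backed by the knowledge base; both it and llm() are hypothetical placeholders, not names from the paper's code.

    def rag_repair(llm, retrieve, draft: str, critique: str) -> str:
        """Diagnose -> retrieve -> regenerate, following Table 3's two prompts."""
        # Step 1: turn the critique into a targeted search query (DIAGNOSIS_PROMPT).
        query = llm(
            "Role: You are a diagnostic assistant focused on information retrieval.\n"
            f"Failed summary draft:\n{draft}\nExpert critique:\n{critique}\n"
            "Output only a targeted search query for the missing or incorrect facts."
        )
        # Step 2: retrieve evidence and rewrite the draft (REGENERATION_PROMPT).
        snippets = "\n".join(retrieve(query))
        return llm(
            "Role: You are a senior editor performing corrective summarization.\n"
            f"Original draft:\n{draft}\nIdentified issues:\n{critique}\n"
            f"New evidence:\n{snippets}\n"
            "Rewrite the draft, incorporating the new evidence to fix the issues."
        )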
Table 4. Prompt template for the final fusion module. This prompt implements the weighted integration strategy.
Name | Purpose | Prompt Template
FUSION | Integration | Role: Senior Editor. Input: Draft 1 (Depth-Focused): [S_depth]; Draft 2 (Breadth-Focused): [S_parallel]. Task: Synthesize a final summary. Use Draft 1 as the primary structural foundation (approx. 70% weight) to ensure logical coherence. Integrate unique, complementary details from Draft 2 (approx. 30% weight) to enhance coverage without disrupting the narrative flow. Avoid redundancy. Output: The final polished summary.
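As a sketch, the fusion step reduces to a single prompt call over the two path outputs, with llm() again a hypothetical completion helper:

    def fuse(llm, s_depth: str, s_parallel: str) -> str:
        """Weighted fusion per Table 4: the depth draft supplies the ~70%
        structural backbone; the breadth draft adds ~30% complementary detail."""
        return llm(
            "Role: Senior Editor.\n"
            f"Draft 1 (depth-focused, approx. 70% weight):\n{s_depth}\n"
            f"Draft 2 (breadth-focused, approx. 30% weight):\n{s_parallel}\n"
            "Synthesize a final summary: use Draft 1 as the structural foundation "
            "and integrate unique, complementary details from Draft 2 without redundancy."
        )

Note that the 70/30 weighting is expressed as an instruction rather than a numeric interpolation, so the realized balance depends on how faithfully the underlying model follows the prompt.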
Table 5. Comparison of MFOG with baseline models on the Multi-News (M-N), DUC-2004 (DUC), and Multi-XScience (M-X) datasets.
Models | ROUGE-1 (M-N / DUC / M-X) | ROUGE-2 (M-N / DUC / M-X) | ROUGE-L (M-N / DUC / M-X) | BERTScore (M-N / DUC / M-X) | METEOR (M-N / DUC / M-X)
TextRank | 41.35 / 28.15 / 19.88 | 15.62 / 7.11 / 3.98 | 20.09 / 12.81 / 17.24 | 52.43 / 31.50 / 52.81 | 19.15 / 11.23 / 16.90
ProCluster | 45.83 / 32.47 / 24.61 | 18.07 / 9.69 / 5.81 | 23.03 / 14.94 / 21.60 | 54.72 / 35.95 / 58.24 | 22.70 / 13.16 / 20.95
PRIMERA | 48.70 / 34.20 / 26.10 | 20.50 / 11.50 / 6.80 | 25.80 / 15.40 / 22.50 | 56.61 / 38.10 / 59.10 | 24.50 / 15.40 / 22.10
GraphRAG | 49.55 / 34.82 / 26.85 | 21.28 / 11.91 / 7.15 | 27.10 / 15.73 / 22.93 | 57.31 / 38.95 / 59.85 | 26.20 / 15.84 / 22.82
RAPTOR | 49.85 / 34.71 / 27.85 | 21.50 / 11.82 / 7.03 | 27.55 / 15.63 / 22.81 | 57.15 / 39.02 / 59.68 | 26.35 / 16.05 / 22.69
NexusSum | 49.38 / 34.50 / 27.15 | 21.65 / 11.75 / 7.20 | 27.40 / 15.80 / 23.10 | 57.25 / 38.80 / 60.10 | 26.50 / 15.95 / 23.05
MFOG | 51.08 / 36.12 / 28.51 | 22.76 / 12.48 / 8.21 | 28.91 / 16.83 / 24.12 | 58.21 / 40.16 / 60.15 | 28.03 / 16.98 / 23.76
Table 6. Human evaluation results on the Multi-News dataset. Scores are reported as average (Avg) and standard deviation (SD) from five expert evaluators.
Models | Coherence (Avg / SD) | Consistency (Avg / SD) | Fluency (Avg / SD) | Relevance (Avg / SD) | Overall (Avg / SD)
TextRank | 6.75 / 0.18 | 6.82 / 0.16 | 6.78 / 0.19 | 6.79 / 0.17 | 6.77 / 0.18
ProCluster | 8.02 / 0.15 | 8.15 / 0.14 | 8.05 / 0.16 | 8.08 / 0.15 | 8.09 / 0.16
PRIMERA | 8.31 / 0.13 | 8.40 / 0.12 | 8.35 / 0.14 | 8.36 / 0.13 | 8.36 / 0.14
GraphRAG | 8.55 / 0.11 | 8.82 / 0.10 | 8.58 / 0.11 | 8.79 / 0.11 | 8.61 / 0.11
RAPTOR | 8.65 / 0.10 | 8.75 / 0.08 | 8.68 / 0.12 | 8.70 / 0.09 | 8.67 / 0.12
NexusSum | 8.40 / 0.08 | 8.80 / 0.08 | 8.30 / 0.07 | 8.85 / 0.08 | 8.86 / 0.08
MFOG | 9.35 / 0.06 | 9.01 / 0.06 | 9.20 / 0.07 | 9.31 / 0.06 | 9.16 / 0.07
Table 7. Results of ablation studies on the Multi-News dataset. The top section ablates core architectural components (dual-path and fusion). The bottom section ablates the RAG and Cognitive Chains modules. R-1, R-2, R-L, BS, and MT are abbreviations for ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and METEOR, respectively.
Models | R-1 | R-2 | R-L | BS | MT
Standard RAG | 47.95 | 20.25 | 25.10 | 55.42 | 24.30
W/O Depth | 48.80 | 21.05 | 26.20 | 56.10 | 25.05
W/O Parallel | 49.60 | 21.95 | 27.20 | 56.80 | 26.15
W/O Fusion | 50.15 | 22.40 | 27.05 | 57.55 | 27.00
W/O RAG | 48.85 | 20.85 | 26.80 | 56.15 | 25.10
W/O Cognitive Chains | 50.10 | 21.60 | 26.15 | 57.60 | 26.70
MFOG (Full) | 51.08 | 22.76 | 28.91 | 58.21 | 28.03