by Chenyang Li, Long Zhang * and Junshuai Zhang, et al.

Reviewer 1: Anonymous
Reviewer 2: Anonymous

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper introduces MFOG, a complex, multi-component architecture for document summarization.

The framework proposes several ideas, including a dual-path (depth and parallel) generation process, a "Cognitive Chain" to structure the summarization task, and a closed-loop self-optimization mechanism. The authors report that MFOG achieves new state-of-the-art (SOTA) results on public benchmarks (Multi-News, DUC-2004, and Multi-XScience).

While the conceptual framework is ambitious, this submission suffers from a critical flaw: a complete lack of reproducibility.

The authors make claims of superiority on public datasets but provide no source code, implementation artifacts, or even the specific prompts that are central to their methodology.

The framework's complexity makes it impossible to replicate from the textual description alone.

Furthermore, several of the key performance claims made in the abstract are not supported, and are in fact contradicted, by the data presented in the paper's own results tables.

Given these severe issues, the paper's claims are currently unverifiable and the work cannot be accepted in its present form.

A major revision, contingent upon the public release of a complete and functional replication package, is required.

Opaque and Complex Architecture: The MFOG framework's complexity cannot be overstated. As described, it involves:

  • An innovative dual-path generation architecture (depth-focused sequential and breadth-first parallel);
  • A "Cognitive Chain" (e.g., ⟨Background, Content, Impact, Trends⟩) that deconstructs the task;
  • A closed-loop "Generate-Evaluate-Optimize" self-optimization mechanism (Section 3.4);
  • "Dimension-specific evaluation models" that are "realized by employing dedicated scoring prompts" to provide a quality score, which is then compared to an "empirically set" threshold of 0.8;
  • An "on-demand" RAG module that triggers when quality falls below this threshold;
  • A final weighted fusion module (Section 4.6 mentions a 7:3 ratio).
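To make concrete how many design decisions this description leaves open, here is a minimal, purely hypothetical sketch of how such a loop could be wired together. Only the Cognitive Chain dimensions, the 0.8 threshold, and the 7:3 fusion ratio are taken from the manuscript; every function body, the query formulation, and the fusion semantics below are placeholder assumptions on my part, not the authors' method:

```python
# Hypothetical reconstruction of the described pipeline. Only the Cognitive Chain
# dimensions, the 0.8 threshold (Section 3.4), and the 7:3 fusion ratio
# (Section 4.6) come from the manuscript; all function bodies are placeholders.

QUALITY_THRESHOLD = 0.8
FUSION_WEIGHTS = (0.7, 0.3)
COGNITIVE_CHAIN = ("Background", "Content", "Impact", "Trends")

def generate(documents: list[str], dimension: str, mode: str) -> str:
    # Depth-focused sequential or breadth-first parallel generation prompt -- not provided.
    return f"[{mode} draft for {dimension}]"

def evaluate(draft: str, dimension: str) -> float:
    # "Dimension-specific evaluation model", i.e. a scoring prompt -> [0, 1] -- not provided.
    return 0.0  # placeholder value; always triggers the RAG branch below

def rag_retrieve(query: str) -> list[str]:
    # On-demand retrieval; how the "New Query" is formulated (Figure 2) is unspecified.
    return [f"[evidence retrieved for: {query}]"]

def fuse(depth_doc: str, parallel_doc: str, weights: tuple[float, float]) -> str:
    # "LLM Fusion Document" module; the meaning of a 7:3 weighting is unexplained.
    return f"[fusion of two drafts with weights {weights}]"

def run_path(documents: list[str], mode: str) -> str:
    sections = []
    for dim in COGNITIVE_CHAIN:
        draft = generate(documents, dim, mode)
        if evaluate(draft, dim) < QUALITY_THRESHOLD:       # closed-loop self-optimization
            new_query = f"improve {dim} coverage"          # stand-in for the "LLM Diagnosis" step
            draft = generate(documents + rag_retrieve(new_query), dim, mode)
        sections.append(draft)
    return "\n".join(sections)

def summarize(documents: list[str]) -> str:
    return fuse(run_path(documents, "depth"), run_path(documents, "parallel"), FUSION_WEIGHTS)
```

Even this trivial skeleton exposes the decisions the manuscript leaves unanswered: the prompts behind generation and evaluation, how a diagnosis becomes a retrieval query, and what a 7:3 weighting of two documents actually means.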

 

Missing Implementation Details: None of the critical implementation details for these components are provided. The scientific contribution of this paper rests entirely on these details. Specifically, the authors must provide:

The exact prompts used for each dimension of the Cognitive Chain (for both depth and parallel paths).

The exact prompts and logic of the "dimension-specific evaluation models." This is a core part of the method, and it is completely opaque. How is a "scoring prompt" turned into a reliable [0, 1] score? (See the illustrative sketch after this list.)

The logic for the RAG trigger. How is a "New Query" (Figure 2) formulated by the "LLM Diagnosis" step?

The precise algorithm for the "LLM Fusion Document" module.
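As an illustration of the level of detail requested for the evaluation models: one common way a scoring prompt's free-text reply could be turned into a [0, 1] score is to parse and normalize a numeric rating, as in the hypothetical snippet below. Whether MFOG does this, averages multiple samples, or does something else entirely is nowhere stated.

```python
import re

def parse_unit_score(llm_response: str, scale: float = 10.0) -> float:
    """Hypothetical post-processing: extract the first number in the model's
    reply and clamp/normalize it to [0, 1]. Nothing in the manuscript confirms
    that MFOG works this way or says how malformed replies are handled."""
    match = re.search(r"\d+(?:\.\d+)?", llm_response)
    if match is None:
        raise ValueError("no numeric score found in model output")
    return max(0.0, min(1.0, float(match.group()) / scale))

# e.g. parse_unit_score("Quality: 8.5 out of 10") -> 0.85
```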

Without the code and all associated prompts/configurations, this paper does not describe a scientific method but rather an advertisement for one.

 

Internal Inconsistencies and Misrepresented Results

The paper's claims of superiority are undermined by significant discrepancies between the performance gains highlighted in the abstract and the data presented in Table 2. This is a major concern.

This suggests a lack of carefulness at best, and an attempt to inflate the contribution at worst.

This must be corrected, and the authors must re-evaluate and state their contributions based on their actual results.

 

Vague Justification for Cognitive Chain Generalization.

Confusing Evaluation Mechanisms

 

The authors must provide the full, runnable source code, all model configurations, and the complete set of prompts (for generation, evaluation, RAG triggers, and fusion) necessary to precisely replicate the results reported in Table 2.

 

Until this methodological transparency is provided, the paper's claims of achieving state-of-the-art performance are unsubstantiated and cannot be accepted for publication.

The authors must also correct or explain the significant discrepancies between the abstract and Table 2.

Comments on the Quality of English Language

The English language quality is reasonable but could be improved.

While the paper is generally understandable, the text is often overly complex and relies heavily on jargon and buzzwords.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript introduces an LLM-based system for complex text analytics and decision-support. The architecture combines data ingestion, preprocessing, and task-specific prompt chains for classification, summarization, and risk detection. The framework is evaluated on a real-world dataset and benchmarked against baseline methods, showing performance gains on selected metrics. Overall, the study provides an applied, system-level demonstration of how LLMs can be integrated into an end-to-end operational workflow.

Research Framework and Contribution

The overall motivation of using LLMs to automate and scale complex knowledge work is timely. However, the scientific contribution is not clearly separated from the engineering implementation. It is still difficult to see what is new in terms of methods or conceptual framing beyond building a well-engineered pipeline around an existing API. Consider proposing, for example, a framework for prompt-chain design, an approach to encoding domain constraints, or an evaluation protocol, to help readers who want to adopt your method.

The claimed “zero-shot adaptability” and “interpretability” should be briefly defined and anchored in the existing literature. As it stands, the conceptual language is stronger than what the empirical evidence supports.

Methodology and Data

Some important details about the data and evaluation setup are missing or too high-level:

  • The dataset needs to be specified more precisely: size, sources, time period, how the training/validation/test splits (if any) were constructed, and the pre-processing steps.
  • It is not always clear which tasks are fully LLM-based vs. handled by traditional rules or ML. Could a small table or diagram be included explaining which parts are deterministic vs. generative?
  • The evaluation design should distinguish between system performance and interaction performance. Right now, the evaluation leans heavily on overall performance.
  • Can a targeted evaluation be provided, e.g., on a small human-rated subset, for the claim that prompt chaining reduces hallucinations or improves reliability?
  • Include more transparency on data provenance and the experimental protocol.

Results and Analysis

The results show that the proposed system outperforms the chosen baseline(s) on selected metrics, but the analysis feels somewhat thin:

  • Please justify the choice of baselines: rule-based systems, fine-tuned models, or other ML pipelines?
  • Please report statistical tests with confidence intervals to judge whether the differences are robust (a simple paired bootstrap, sketched after this list, would suffice).
  • Explain in which types of cases the LLM makes errors, generates misleading output, or requires human intervention.
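On the statistical point, a paired bootstrap over per-document metric differences would be sufficient. The sketch below is generic; the metric, dataset, and scores are placeholders rather than anything taken from the paper:

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for the mean per-document difference between
    two systems (e.g., per-document ROUGE of the proposed system vs. a baseline,
    scored on the same test set)."""
    assert len(scores_a) == len(scores_b), "scores must be paired per document"
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(len(diffs))] for _ in range(len(diffs))]
        means.append(sum(resample) / len(resample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / len(diffs), (lower, upper)

# Usage (with the authors' own per-document scores): if the resulting interval
# excludes 0, the reported improvement is unlikely to be sampling noise alone.
# mean_diff, (low, high) = paired_bootstrap_ci(system_scores, baseline_scores)
```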

Discussion and Practical Implications

The discussion currently repeats and recaps technical points explained earlier in the paper. My suggestion is to discuss implications such as:

  • Are there any hindrances to, or monitoring mechanisms needed for, organizations adopting and integrating this LLM-based component?
  • What this system changes in practice compared with traditional workflows.
  • The trade-offs
  • Are there any limitations to using this method in contexts such as risk-sensitive systems or regulated environments?
  • Under what conditions can this system be adapted to other domains? That will help position your research as a reusable pattern for implementation.

If the authors wish to situate their LLM-based system within a broader methodological tradition, they might consider engaging with prior work that integrates machine learning architectures with structured, domain-specific analytical pipelines. For example, Pazhouhan et al. (2024, JMSE) offer a relevant framework that combines ML modeling with domain ontologies and structured feature engineering.

Pazhouhan, M., Karimi Mazraeshahi, A., Jahanbakht, M., Rezanejad, K., & Rohban, M. H. (2024). Wave and tidal energy: A patent landscape study. Journal of Marine Science and Engineering, 12(11), 1967.

Comments on the Quality of English Language

A light language and structural edit is recommended: break up the long sentences, avoid excessive jargon where simpler wording would do (for readers who are not LLM specialists), use clearer signposting, and use consistent terminology for key concepts to avoid confusion.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

I thank the authors for their detailed response and for the effort put into revising the manuscript.

The authors have addressed some of the concerns regarding the complete lack of implementation details.

The inclusion of the Conceptual Replication Package improves the transparency of the work.

Providing prompts for the Cognitive Chain, the evaluation models, and the fusion module allows readers to understand some principles of the proposed framework, which was previously opaque.

I also acknowledge the correction of the data discrepancies in the Abstract.

The values now align with the tables.

Regarding reproducibility: While I would have strongly preferred the release of the actual source code during the review process, so that the results could be fully verified, I accept the authors' inclusion of the detailed prompt templates and algorithms as a compromise for this stage.

I note the "Data Availability Statement" in Section 6, where the authors commit to making the full source code available on GitHub immediately upon acceptance.

The revision has addressed the critical flaws that prevented the paper from being evaluated scientifically in the previous round.

Comments on the Quality of English Language

The English language quality is reasonable but could be improved. While the paper is generally understandable, the text is often overly complex and relies heavily on jargon and buzzwords.

Simplifying the sentence structures would make the contribution clearer.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

I thank the authors for their thorough and constructive revision. The updated manuscript clearly addresses the earlier concerns. One minor but optional suggestion:

The revised manuscript includes a helpful discussion of error patterns; tightening this into a more concise paragraph would improve the narrative flow.

Overall, the authors have significantly improved the clarity, rigor, and reproducibility of the work, and I recommend the manuscript for publication.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf