CoReaAgents: A Collaboration and Reasoning Framework Based on LLM-Powered Agents for Complex Reasoning Tasks

Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper provides a novel framework for planning and reasoning of AI agents. The framework uses three agents: one for planning, one to decide which tools should be used, and an agent to assist with reflection and collaboratively alter plans. The framework allows for both static and dynamic planning, and it uses Code and Comments of Thought (CCoT) as a way to communicate plans between agents. The paper demonstrates the utility of the framework on a number of benchmarks, where it shows good results. The ablation studies are very useful for isolating the importance of each agent. Agentic workflows are a very relevant topic. Though many of the individual concepts (a planning agent, a reflection agent, CCoT) are not novel, the authors do bring these concepts together to create an effective framework.
Some specific feedback:
1. Grammatical errors:
a. (Line 6) "professional tool user". Perhaps use a word like "proficient"; "professional" is the wrong word to use here. There are several uses of this word.
b. (Line 23) "researches". The typical plural is "research"; "researches" is an awkward word to use. There are several uses of this word.
2. Analytics:
a. The paper presents results without sample sizes or statistical significance. Without statistical tests, it is not clear whether the results could be due purely to chance. Adding them would make the paper better. All studies should report sample sizes and measures of statistical significance. Figure 6, as an example, has absolute numbers without stating the sample sizes.
b. Related to this, the sample sizes of the various tests conducted should be disclosed.
Novelty: Medium - many of the concepts have been covered elsewhere on their own; this framework brings them together to solve multiple problem sets.
Scope: Appropriate.
Significance: A good contribution to the field of AI agent workflows and frameworks.
Quality: Good and easy to read. Some grammatical errors that should be fixed and analytics that should be enhanced.
Scientific soundness: Statistical analysis should be added to the final paper.
Overall merit: Recommend accept with changes.
English level: Good.
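For context, the CCoT mechanism mentioned above presumably (judging by the name) expresses a plan as code annotated with natural-language comments that other agents can read and execute. A rough, hypothetical sketch of what such a plan could look like is given below; the paper's actual CCoT format is not reproduced in these reviews, so the step structure and the stand-in tools (`search`, `calculate`) are assumptions made purely for illustration.

```python
# A guess at the shape of a "Code and Comments of Thought" (CCoT) plan:
# the comments carry the reasoning, the code lines carry executable steps.
# The tool names below are stand-ins invented for this sketch; the paper's
# actual CCoT syntax and tool inventory are not visible in the reviews.

def search(query: str) -> str:
    """Stand-in for a retrieval tool (e.g., a wiki or API lookup)."""
    return "1994"

def calculate(expression: str) -> float:
    """Stand-in for a calculator tool (toy arithmetic only)."""
    return float(eval(expression))

# Step 1: look up the fact the question hinges on.
year = search("release year of the film in the question")

# Step 2: derive the quantity the question actually asks for.
age = calculate(f"2025 - {year}")

# Step 3: compose the final answer from the intermediate results.
print(f"Released in {year}, {age:.0f} years ago.")
```

The point of such a format is that the plan is both human-readable (via the comments) and directly checkable or executable by the downstream agents.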
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The paper deals with the current topic of AI multi-agent systems, proposing a new example of such a system. An update of the literature review including 2025 references may improve the Introduction if the authors mention newly published related systems and the advantages of the proposal in the paper. Some other key issues may be tackled in the review, such as strategies for more effective collaboration, scalability, and interpretability. The Conclusions may be improved with more detail on the experimental results obtained and on the literature review.
The authors should insert the required spaces before the references and also revise line 651.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This paper's research addresses the central question of how Large Language Model-powered agents can be organized into a collaborative, specialized, and reflective multi-agent framework to improve complex task execution, reasoning, and tool usage. It aims to overcome existing limitations in Large Language Model (LLM) agent-based reasoning, such as missing intermediate steps, confusion in tool selection, and inflexible planning, by proposing a novel multi-agent system called CoReaAgents. This framework mirrors the human-like social division of labor by assigning distinct roles to three agents: Plan, Tool, and Reflect.
The originality and relevance of the work lie in its conceptual innovation of dividing LLM agents into specialized roles to simulate human collaboration. It introduces mechanisms for inter-agent communication and negotiation, which are critical for both task execution and iterative reflection. The paper emphasizes feedback loops and reflective reasoning, a notable departure from traditional linear approaches such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT). Additionally, the Reflect Agent enables a hybrid of static and dynamic planning, enhancing adaptability while maintaining coherence—thus addressing a key gap in existing systems.
By focusing on reflective planning, multi-agent negotiation, and more effective tool use, the research tackles significant challenges in the field. It addresses the rigidity of static planning methods, the myopia of dynamic systems that lack overarching structure, the absence of reflection in existing frameworks, and the confusion that arises when multiple tools are involved in decision-making.
Compared to previous frameworks like MetaGPT and Toolformer, this paper introduces a novel contribution to the subject area by integrating specialization, inter-agent communication, and dynamic re-planning within a single cohesive framework, named CoReaAgents. Unlike earlier models, it bridges the gap between static and dynamic planning through iterative feedback loops, resulting in improved real-time adaptability. The system’s modular design also makes it versatile and applicable across various domains, including mathematical reasoning, tool usage, and question answering. Furthermore, its effectiveness is supported by empirical validation on five challenging datasets, where it consistently outperforms baseline models. By emulating human-like collaboration and reflective problem-solving, this research marks a meaningful advance toward AGI-style reasoning, offering a more sophisticated and realistic alternative to prior single-agent or rigidly structured systems.
The references in the paper generally cover the essential foundational works, including key methods such as Chain-of-Thought (CoT), Tree-of-Thought (ToT), MetaGPT, Toolformer, Deep Q-Networks (DQN), and AlphaGo, as well as the relevant datasets used for evaluation. These provide a solid grounding in symbolic reasoning, LLM-based planning, and agent tool use. However, the paper could benefit from including a few more recent multi-agent frameworks and prompt-engineering-based planners from 2024, particularly those that explore collaborative dynamics or role-based prompting strategies among agents, to ensure a more comprehensive and up-to-date coverage of the field.
The methodology presented in the paper is solid, but there are several areas where it could be further improved to strengthen the findings and enhance clarity. While the performance improvements are evident, more detailed ablation studies—such as removing the Reflect Agent or disabling multi-agent communication—would help isolate the individual contributions of each system component. Additionally, conducting a thorough error analysis could reveal the types of reasoning failures or edge cases the framework struggles with, offering insights into its current limitations. Improving the interpretability of each agent’s decisions, such as clarifying why a specific tool was selected, would also enhance trust in the system’s reliability and safety. To verify generalizability, scalability testing with larger and more diverse toolsets and task domains is recommended. Moreover, a direct comparison with ensemble-based agent systems, like AutoGPT-style frameworks, would help highlight CoReaAgents’ distinct advantages.
The conclusions drawn in the paper are generally consistent with the evidence and arguments presented. The main research questions are addressed through a set of well-chosen experiments: tool learning was evaluated using API-Bank and ToolBench, where the framework showed improved tool selection and usage accuracy; math reasoning was tested on FuncQA and SVAMP, demonstrating better step-wise problem solving; and multi-hop question answering was assessed on HotpotQA, where the model outperformed baselines in reasoning across multiple evidence sources. These results are supported by both quantitative metrics and qualitative explanations, reinforcing the paper’s narrative. Still, the inclusion of more detailed visualizations—such as agent interaction diagrams or annotated reasoning traces—would further clarify how the system operates and support its claims more vividly.
For the above-mentioned reasons, I recommend accepting the paper with minor revisions. The proposed framework makes a meaningful and timely contribution to the field by addressing key shortcomings in existing LLM agent systems through a novel, collaborative, and reflective multi-agent design. The empirical results are strong, and the conceptual advances are well-motivated. With the inclusion of more detailed ablation studies, improved visual explanations, and a deeper discussion of interpretability and limitations, the paper has the potential to become a significant reference in AGI-oriented research and the development of intelligent, tool-using language agents.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
The paper proposes a multi-agent framework powered by large language models (LLMs) that simulates social collaboration through three specialized agents: a Plan Agent, a Tool Agent, and a Reflect Agent. The framework is called CoReaAgents. The paper is interesting and well written. I have the following comments:
- It would be good to mention in the introduction other LLM-based systems such as [REF01].
- Please explain why achieving AGI is introduced as a framing goal if your experiments only evaluate LLM agent performance on specific tasks such as QA.
- How do you prove that performance improvements stem specifically from agent collaboration rather than the mere introduction of more structured prompts or task decomposition?
- Please provide a formal definition of your "Collaboration and Reasoning" paradigm and its theoretical difference from existing frameworks like ReAct or Reflexion. It would be good to dedicate a paragraph to this.
- What is the additional computational cost or latency of using three agents for distinct complex reasoning tasks?
- Why was a max of 12 reflection steps chosen?
- There are some typos; for example, "enables LLM-powered Agents" should be "enable LLM-powered Agents". Please revise the paper.
- Please include newer multi-agent frameworks such as CAMEL or Generative Agents, which are missing from the simulation results.
References
[REF01] Wu, Shela, Zubair Yacub, and Dennis Shasha. "DietNerd: A Nutrition Question-Answering System That Summarizes and Evaluates Peer-Reviewed Scientific Articles." Applied Sciences 14.19 (2024).
Comments on the Quality of English Language
The English could be improved to more clearly express the research.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 5 Report
Comments and Suggestions for Authors
The paper entitled "CoReaAgents: A Collaboration and Reasoning Framework based on LLM-powered Agents for Complex Reasoning Tasks" concerns the design and testing of an innovative multi-agent system that uses large language models (LLMs, e.g., GPT) to solve complex tasks requiring planning, tool use, and reflection. The proposed architecture, CoReaAgents, simulates the social division of labor and cooperation among three types of agents:
- Plan Agent - responsible for decomposing the task into subtasks and planning the sequence of actions
- Tool Agent - selects and operates the tools (e.g., APIs) needed to perform each subtask
- Reflect Agent - evaluates the correctness of the executed steps and updates the plan if necessary.
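To make this division of labor concrete, the following is a minimal illustrative sketch of how such a Plan/Tool/Reflect loop could be wired together. It is not the authors' implementation: the class interfaces, the generic `llm` callable, and the tool dictionary are assumptions made for illustration; only the idea of a bounded number of reflection steps (Reviewer 4 mentions a maximum of 12) is taken from the reviews.

```python
from typing import Callable, List

class PlanAgent:
    """Decomposes a task into an ordered list of subtasks (the static plan)."""
    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm
    def plan(self, task: str) -> List[str]:
        # In the real system the plan would be expressed as code and comments (CCoT);
        # here we simply treat each non-empty line of the LLM output as a subtask.
        return [s for s in self.llm(f"Decompose into subtasks:\n{task}").splitlines() if s.strip()]

class ToolAgent:
    """Selects and invokes a tool (e.g., an API) for a single subtask."""
    def __init__(self, llm: Callable[[str], str], tools: dict):
        self.llm, self.tools = llm, tools
    def execute(self, subtask: str) -> str:
        name = self.llm(f"Pick one tool from {list(self.tools)} for: {subtask}").strip()
        tool = self.tools.get(name, lambda q: f"[no tool matched: {q}]")
        return tool(subtask)

class ReflectAgent:
    """Checks an intermediate result and, if unsatisfied, triggers re-planning."""
    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm
    def ok(self, subtask: str, result: str) -> bool:
        verdict = self.llm(f"Is '{result}' a correct outcome for '{subtask}'? yes/no")
        return verdict.strip().lower().startswith("yes")

def solve(task: str, llm: Callable[[str], str], tools: dict, max_reflections: int = 12) -> List[str]:
    planner, executor, reflector = PlanAgent(llm), ToolAgent(llm, tools), ReflectAgent(llm)
    plan, results, reflections, i = planner.plan(task), [], 0, 0
    while i < len(plan):
        result = executor.execute(plan[i])
        if reflector.ok(plan[i], result) or reflections >= max_reflections:
            results.append(result)
            i += 1
        else:
            # Dynamic re-planning: the Reflect Agent's feedback replaces the remainder
            # of the static plan, mixing static and dynamic planning.
            plan = plan[:i] + planner.plan(f"{task}\nFailed step: {plan[i]} -> {result}")
            reflections += 1
    return results
```

With a stubbed `llm` callable and a dictionary of plain Python functions as `tools`, this sketch runs end to end. It also makes it easier to see where Reviewer 4's questions about cost and latency arise: each subtask incurs separate tool-selection and reflection calls on top of the planning (and any re-planning) calls.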
The system has been tested on three classes of tasks:
- Tool learning - using APIs and external tools
- Mathematical reasoning - solving complex mathematical problems
- Multi-hop QA - answering questions that require multi-step reasoning
The results showed that CoReaAgents outperformed existing methods (such as ReAct, Reflexion, ToolNet) both in terms of accuracy and planning flexibility.
The article presents an interesting and novel concept of using LLMs in a collaborative agent environment to solve complex computational problems. It introduces an innovative architecture with a clear simulation of role partitioning and reflective learning within an agent system inspired by social interactions. There is also a clear division of agent functions: the Plan, Tool, and Reflect agents have well-defined roles, which enhances the clarity and scalability of the solution. The authors conducted extensive validation, applying the system to different task types and demonstrating its versatility. They also performed an ablation study to analyze the contribution of each agent to the final outcome, which strengthens the credibility of the results.
The open approach to tools is also appreciated, as implementation in an environment with accessible APIs and tools shows potential for practical application.
However, the paper lacks a section on limitations. The authors do not point out potential weaknesses of the approach, such as vulnerability to bugs in the tools provided, delays in updating reflections, or incorrect decisions made by the agents. In addition, no runtime comparisons are provided - it is not stated how long it takes the multi-agent system to complete a task compared to a standard LLM. Is it more computationally expensive? The lack of practical applications is also striking. The paper focuses on benchmark tests, but no real-world application in an industrial setting (e.g. process automation, decision support systems) is shown. Furthermore, there is no discussion of ethics or transparency. Systems based on reflective agents may be unpredictable. Are there safeguards against bad decisions? What is the interpretability of their actions?
The article makes a significant contribution to research on the use of LLMs to solve complex tasks through an agent-based architecture. Although the work is primarily conceptual and experimental, it may have practical significance in the future. The paper is well written, clear, and logically organized. Figures (e.g. Fig. 1-5) effectively support the understanding of the architecture. No self-citations were detected, demonstrating the authors' integrity.
It is unclear whether all references listed in the bibliography are actually cited in the main text. The citation order appears to be inconsistent - neither alphabetical nor strictly numerical based on the order of appearance. For example, reference [36] appears early in the introduction before lower-numbered citations such as [2], [8], or [9], suggesting disorganized reference management. Although there are no explicit errors such as missing sources in the text, this inconsistency may indicate that not all listed references are actually cited, or that the in-text citations have been inserted in a random order. Authors should be encouraged by the editors to verify that all listed references are properly cited in the manuscript and that the numbering follows a clear and consistent scheme as outlined in the Applied Sciences submission guidelines:
References: References must be numbered in order of appearance in the text (including table captions and figure legends) and listed individually at the end of the manuscript. We recommend preparing the references with a bibliography software package, such as EndNote, ReferenceManager or Zotero to avoid typing mistakes and duplicated references. We encourage citations to data, computer code and other citable research material. If available online, you may use reference style 9. below.
Citations and References in Supplementary files are permitted provided that they also appear in the main text and in the reference list.
Look at: https://www.mdpi.com/journal/applsci/instructions
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 4 Report
Comments and Suggestions for Authors
The reviewer has no more comments.