Multi-Model Dialectical Evaluation of LLM Reasoning Chains: A Structured Framework with Dual Scoring Agents
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The following comments can be incorporated for improvement:
Improve caption of Figure 1
Include 2–3 recent citations (from 2023–2024) that specifically address evaluation frameworks
Only 10 prompts are used in the study covering a mix of domains. While the prompts are thematically varied, this small sample size may limit generalizability. Future work could expand to 30–50 prompts across clearly categorized domains (law, education, healthcare, etc.).
Neo4j is conceptually described, but lacks schema details and query examples. Include node relationships and a sample Cypher query for clarity.
The structure and intent of the prompts are described but no examples of actual prompts (or LLM outputs) are given. Include at least one full prompt–response chain for each stage (opinion, counter, synthesis) in an appendix or supplementary material.
Assigning specific LLMs (e.g., Gemma for opinion, Dolphin-Mistral for counterargument) is explained, but more empirical justification might be shown or referenced to showcase why these models performed best in their assigned stages.
Author Response
We sincerely thank the reviewers for their relevant and constructive comments. These suggestions have helped improve the quality of our paper and make it more informative and readable. We have carefully considered all comments, and made every effort to comply with the recommendations of the reviewers.
Improve caption of Figure 1
Thank you for the suggestion. We have revised the caption of Figure 1 to provide a more informative description of the reasoning stages, evaluation mechanisms, and graph-based storage integration.
Include 2–3 recent citations (from 2023–2024) that specifically address evaluation frameworks
We thank the reviewer for this suggestion. The cited works—PromptBench [19], RubricEval [20], and LLM-Rubric [21]—were already included in the reference list, but had not been explicitly discussed in the manuscript. We have now added a paragraph in the Related Work section to highlight their relevance and how our framework builds upon these recent evaluation paradigms.
Only 10 prompts are used in the study covering a mix of domains. While the prompts are thematically varied, this small sample size may limit generalizability. Future work could expand to 30–50 prompts across clearly categorized domains (law, education, healthcare, etc.).
We agree that the limited number of prompts may affect generalizability. In response, we have updated Section 2.6 to explicitly acknowledge this limitation. Additionally, we have included in Section 4.4 a concrete plan to expand the prompt set to at least 50–100 items, structured by domain (e.g., law, education, healthcare), to support more robust and granular evaluation.
Neo4j is conceptually described, but lacks schema details and query examples. Include node relationships and a sample Cypher query for clarity.
We appreciate the reviewer’s suggestion to clarify the Neo4j implementation. In response, we have added a logical schema (Figure 4) that illustrates the main node types and relationships used in the reasoning graph. To support practical understanding, we also included in Section 2.5 a representative Cypher query showing how synthesis nodes can be retrieved based on expressed values (e.g., empathy). These additions enhance both the conceptual transparency and the practical inspectability of the graph structure.
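For illustration, the sketch below shows the kind of retrieval described above, using the official neo4j Python driver. The node labels (:Synthesis, :Value), the relationship type [:EXPRESSES_VALUE], and the connection details are placeholders standing in for the schema of Figure 4, not the exact identifiers used in the revised manuscript.

```python
# Minimal sketch: retrieving synthesis nodes by an expressed value (e.g., "empathy").
# The labels, relationship type, and connection details below are illustrative
# placeholders; the actual schema is documented in Figure 4 of the revised manuscript.
from neo4j import GraphDatabase

QUERY = """
MATCH (s:Synthesis)-[:EXPRESSES_VALUE]->(v:Value {name: $value})
RETURN s.text AS synthesis, v.name AS value
"""

def find_syntheses_by_value(uri, user, password, value="empathy"):
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            result = session.run(QUERY, value=value)
            return [(record["synthesis"], record["value"]) for record in result]

if __name__ == "__main__":
    for text, value in find_syntheses_by_value("bolt://localhost:7687", "neo4j", "password"):
        print(f"[{value}] {text[:100]}")
```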
The structure and intent of the prompts are described but no examples of actual prompts (or LLM outputs) are given. Include at least one full prompt–response chain for each stage (opinion, counter, synthesis) in an appendix or supplementary material.
We agree with the reviewer that including a concrete example of a full prompt–response chain would enhance transparency and clarity. In response, we have added a complete reasoning chain—including the opinion, counterargument, and synthesis stages—for the prompt “Should freedom of speech include the right to spread misinformation?” in Appendix A.
Assigning specific LLMs (e.g., Gemma for opinion, Dolphin-Mistral for counterargument) is explained, but more empirical justification might be shown or referenced to showcase why these models performed best in their assigned stages.
We thank the reviewer for this valuable comment. We have now clarified in Section 2.4 that model-role assignments were based on qualitative observations made during system implementation. Each model was selected according to its generation tendencies and rhetorical alignment with the intended function of each reasoning stage, rather than based on formal benchmarking.
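To make the stage-to-model mapping concrete, the sketch below chains the three reasoning stages through a caller-supplied generation function. The prompt wording, the synthesis-stage model name, and the `generate` callable are placeholders, not the exact prompts or runtime used in the study (those are described in Sections 2.4 and 2.6).

```python
# Illustrative sketch of the three-stage dialectical chain with per-stage model roles.
# Prompt wording, the synthesis model name, and `generate` are placeholders; the
# actual prompts and role assignments are described in Sections 2.4 and 2.6.
ROLE_MODELS = {
    "opinion": "gemma",                    # opinion stage (Section 2.4)
    "counterargument": "dolphin-mistral",  # counterargument stage (Section 2.4)
    "synthesis": "<synthesis-model>",      # placeholder for the model assigned in the paper
}

def run_dialectical_chain(question: str, generate) -> dict:
    """Run opinion -> counterargument -> synthesis for a single prompt.

    `generate(model_name, prompt)` is any text-generation callable, e.g. a thin
    wrapper around a local LLM runtime.
    """
    opinion = generate(ROLE_MODELS["opinion"],
                       f"State and justify a clear opinion on: {question}")
    counter = generate(ROLE_MODELS["counterargument"],
                       f"Question: {question}\nOpinion: {opinion}\n"
                       "Present the strongest counterargument to this opinion.")
    synthesis = generate(ROLE_MODELS["synthesis"],
                         f"Question: {question}\nOpinion: {opinion}\n"
                         f"Counterargument: {counter}\n"
                         "Reconcile both positions into a balanced synthesis.")
    return {"opinion": opinion, "counterargument": counter, "synthesis": synthesis}
```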
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have tested many LLMs for dialect multimodal evaluation. The research is interesting; however, the part dealing with multimodal data and dialects is not clear. Some points need clarification.
Have the authors used any fusion techniques for multimodal data?
What language and dialects have they tested?
How did dealing with different dialects affect the results?
The figures in the manuscript need better visualization and higher resolution
Author Response
We sincerely thank the reviewers for their relevant and constructive comments. These suggestions have helped improve the quality of our paper and make it more informative and readable. We have carefully considered all comments, and made every effort to comply with the recommendations of the reviewers.
The authors have tested many LLMs for dialect multimodal evaluation. The research is interesting; however, the part dealing with multimodal data and dialects is not clear. Some points need clarification.
We thank the reviewer for raising this point. To avoid potential confusion, we have clarified in Section 1.1 that the term “dialectical” is used strictly in the philosophical and argumentative sense, referring to reasoning through opposing viewpoints (opinion, counterargument, synthesis). The study does not involve linguistic dialects or multimodal data, and the entire framework operates in a monolingual, text-based setting.
Have the authors used any fusion techniques for multimodal data?
We confirm that no fusion techniques were used, as the study does not involve multimodal data. The framework operates entirely on text-based inputs and outputs.
What language and dialects have they tested?
The evaluation was conducted exclusively in English. No dialectal or multilingual inputs were used, and all prompts and model outputs were in standard English.
How did dealing with different dialects affect the results?
As the study does not involve linguistic dialects, there was no variation across dialects to affect the results. All experiments were conducted in standard English using uniform prompts and evaluation criteria.
The figures in the manuscript need better visualization and higher resolution
Thank you for pointing this out. We have replaced all figures in the manuscript with high-resolution versions to improve readability and ensure visual clarity.
Reviewer 3 Report
Comments and Suggestions for Authors
This paper proposes a framework to evaluate the reasoning quality of large language models (LLMs). The framework integrates multi-stage reasoning, rubric-based evaluation, and semantic analysis. Experiments are conducted on four open-source LLMs.
Strengths:
- This work includes experiments across four LLMs. The discussion is carefully written and provides insightful observations on the differences in model outputs across various aspects.
- The rubric-based scoring is interesting and appears more reasonable than unguided evaluation.
Weaknesses:
- The scoring rubric may be challenging for LLMs to follow. LLM evaluators may not consistently adhere to the rubric, but the paper does not clarify whether the rubric was followed faithfully or consistently.
- The paper provides limited details on how the rubric-based evaluation was implemented. The prompts in Section 2.6 ("Prompt Set") do not explicitly align with the rubric criteria. The rule-based evaluation is also insufficiently described. These issues hinder reproducibility.
- Since LLMs tend to overrate each other’s outputs, using LLMs to evaluate other LLMs may introduce systematic bias. For a work focused on evaluation methodology, it is crucial to validate the fairness and objectivity of the proposed framework. Including a human annotation baseline or correlation with human judgments would strengthen the work.
Author Response
We sincerely thank the reviewers for their relevant and constructive comments. These suggestions have helped improve the quality of our paper and make it more informative and readable. We have carefully considered all comments, and made every effort to comply with the recommendations of the reviewers.
Strengths:
- This work includes experiments across four LLMs. The discussion is carefully written and provides insightful observations on the differences in model outputs across various aspects.
Thank you!
- The rubric-based scoring is interesting and appears more reasonable than unguided evaluation.
Thank you!
Weaknesses:
- The scoring rubric may be challenging for LLMs to follow. LLM evaluators may not consistently adhere to the rubric, but the paper does not clarify whether the rubric was followed faithfully or consistently.
We thank the reviewer for raising this important issue. To ensure that the rubric was followed consistently, each LLM evaluator received the full rubric and score definitions, along with an explicit directive discouraging inflated ratings. The models rated independently, without access to prior evaluations, and malformed outputs were automatically excluded. These clarifications have been added in Section 2.3.
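As an illustration of the malformed-output filtering mentioned above, the sketch below assumes the evaluators return JSON scores keyed by rubric criterion on a 1–5 scale; the criterion names here are placeholders, and the actual rubric and score anchors are defined in Section 2.3.

```python
# Minimal sketch: validating an evaluator's rubric output and excluding malformed
# responses. The criterion names and the 1-5 scale are placeholders; the actual
# rubric and score anchors are defined in Section 2.3.
import json

RUBRIC_CRITERIA = {"coherence", "relevance", "depth"}  # placeholder criteria
VALID_SCORES = set(range(1, 6))                        # assumed 1-5 anchors

def parse_rubric_scores(raw_output: str):
    """Return {criterion: score} if the evaluator output is well-formed, else None."""
    try:
        scores = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # not valid JSON -> excluded from aggregation
    if not isinstance(scores, dict) or set(scores) != RUBRIC_CRITERIA:
        return None  # missing or unexpected criteria -> excluded
    if not all(isinstance(v, int) and v in VALID_SCORES for v in scores.values()):
        return None  # scores outside the rubric scale -> excluded
    return scores
```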
- The paper provides limited details on how the rubric-based evaluation was implemented. The prompts in Section 2.6 ("Prompt Set") do not explicitly align with the rubric criteria. The rule-based evaluation is also insufficiently described. These issues hinder reproducibility.
We thank the reviewer for highlighting the need for greater clarity. In response, we have expanded Section 2.3 to describe the evaluation prompt in more detail, including the rubric anchors and the structure of the input. We have also clarified in Section 2.7 how the rule-based evaluation operates, including the use of regular expressions for anomaly detection and a curated keyword lexicon for value identification. These additions aim to improve transparency and reproducibility.
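A small sketch of the two rule-based mechanisms mentioned above follows; the regular-expression patterns and lexicon entries shown are examples only, while the curated versions are given in Section 2.7.

```python
# Illustrative sketch of the rule-based checks: regular expressions flag surface
# anomalies, and a keyword lexicon tags expressed values. The patterns and lexicon
# entries below are examples only; the curated versions appear in Section 2.7.
import re

ANOMALY_PATTERNS = {
    "repeated_token": re.compile(r"\b(\w+)(\s+\1\b){3,}", re.IGNORECASE),
    "empty_output": re.compile(r"^\s*$"),
    "truncated_ending": re.compile(r"[A-Za-z0-9,]\s*\Z"),  # ends without closing punctuation
}

VALUE_LEXICON = {
    "empathy": {"empathy", "compassion", "understanding"},
    "fairness": {"fairness", "equity", "justice"},
}

def rule_based_report(text: str) -> dict:
    anomalies = [name for name, pattern in ANOMALY_PATTERNS.items() if pattern.search(text)]
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    values = [value for value, keywords in VALUE_LEXICON.items() if tokens & keywords]
    return {"anomalies": anomalies, "values": values}
```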
- Since LLMs tend to overrate each other’s outputs, using LLMs to evaluate other LLMs may introduce systematic bias. For a work focused on evaluation methodology, it is crucial to validate the fairness and objectivity of the proposed framework. Including a human annotation baseline or correlation with human judgments would strengthen the work.
We fully agree with this concern. While our evaluation relies on LLMs, we mitigated potential scoring bias by using two independent evaluators with no shared context and strict rubric instructions. We have added a paragraph in Section 4.2 acknowledging this limitation and identifying human-based validation as an essential direction for future work.
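To make the dual-evaluator mitigation inspectable, the sketch below combines the two evaluators' scores and reports simple agreement statistics; the aggregation shown (averaging plus exact-agreement rate) is one possible choice and not necessarily the one used in the manuscript.

```python
# Sketch: combining two independent evaluators' {criterion: score} dicts and
# reporting simple agreement statistics. Averaging and exact-agreement rate are
# shown as one possible choice, not necessarily the aggregation used in the paper.
from statistics import mean

def aggregate_dual_scores(scores_a: dict, scores_b: dict) -> dict:
    shared = sorted(set(scores_a) & set(scores_b))
    combined = {c: (scores_a[c] + scores_b[c]) / 2 for c in shared}
    mean_abs_diff = mean(abs(scores_a[c] - scores_b[c]) for c in shared)
    exact_agreement = sum(scores_a[c] == scores_b[c] for c in shared) / len(shared)
    return {"combined": combined,
            "mean_abs_diff": mean_abs_diff,
            "exact_agreement": exact_agreement}

# Example usage with placeholder scores:
# aggregate_dual_scores({"coherence": 4, "depth": 3}, {"coherence": 5, "depth": 3})
```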
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
The authors clearly took the reviewer feedback into account and made a sincere effort to improve the manuscript. The expanded explanations in Sections 2.3, 2.6, 2.7, and 4.2 are appreciated.
To further improve the manuscript, adding a short "Limitations" subsection may be helpful.
Author Response
We sincerely thank the reviewers for their relevant and constructive comments. These suggestions have helped improve the quality of our paper and make it more informative and readable. We have carefully considered all comments, and made every effort to comply with the recommendations of the reviewers.
The authors clearly took the reviewer feedback into account and made a sincere effort to improve the manuscript. The expanded explanations in Sections 2.3, 2.6, 2.7, and 4.2 are appreciated.
Thank you!
To further improve the manuscript, adding a short "Limitations" subsection may be helpful.
We appreciate the suggestion. The existing limitations discussion has been clarified and retitled as a dedicated “Study Limitations” subsection (Section 4.3) to explicitly address this point.