Multi-Model Dialectical Evaluation of LLM Reasoning Chains: A Structured Framework with Dual Scoring Agents
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The following comments can be incorporated for improvement:
Improve caption of Figure 1
Include 2–3 recent citations (from 2023–2024) that specifically address evaluation frameworks
Only 10 prompts are used in the study covering a mix of domains. While the prompts are thematically varied, this small sample size may limit generalizability. Future work could expand to 30–50 prompts across clearly categorized domains (law, education, healthcare, etc.).
Neo4j is conceptually described, but lacks schema details and query examples. Include node relationships and a sample Cypher query for clarity.
The structure and intent of the prompts are described but no examples of actual prompts (or LLM outputs) are given. Include at least one full prompt–response chain for each stage (opinion, counter, synthesis) in an appendix or supplementary material.
Assigning specific LLMs (e.g., Gemma for opinion, Dolphin-Mistral for counterargument) is explained, but more empirical justification might be shown or referenced to showcase why these models performed best in their assigned stages.
Author Response
We sincerely thank the reviewers for their relevant and constructive comments. These suggestions have helped improve the quality of our paper and make it more informative and readable. We have carefully considered all comments, and made every effort to comply with the recommendations of the reviewers.
Improve caption of Figure 1
Thank you for the suggestion. We have revised the caption of Figure 1 to provide a more informative description of the reasoning stages, evaluation mechanisms, and graph-based storage integration.
Include 2–3 recent citations (from 2023–2024) that specifically address evaluation frameworks
We thank the reviewer for this suggestion. The cited works—PromptBench [19], RubricEval [20], and LLM-Rubric [21]—were already included in the reference list, but had not been explicitly discussed in the manuscript. We have now added a paragraph in the Related Work section to highlight their relevance and how our framework builds upon these recent evaluation paradigms.
Only 10 prompts are used in the study covering a mix of domains. While the prompts are thematically varied, this small sample size may limit generalizability. Future work could expand to 30–50 prompts across clearly categorized domains (law, education, healthcare, etc.).
We agree that the limited number of prompts may affect generalizability. In response, we have updated Section 2.6 to explicitly acknowledge this limitation. Additionally, we have included in Section 4.4 a concrete plan to expand the prompt set to at least 50–100 items, structured by domain (e.g., law, education, healthcare), to support more robust and granular evaluation.
Neo4j is conceptually described, but lacks schema details and query examples. Include node relationships and a sample Cypher query for clarity.
We appreciate the reviewer’s suggestion to clarify the Neo4j implementation. In response, we have added a logical schema (Figure 4) that illustrates the main node types and relationships used in the reasoning graph. To support practical understanding, we also included in Section 2.5 a representative Cypher query showing how synthesis nodes can be retrieved based on expressed values (e.g., empathy). These additions enhance both the conceptual transparency and the practical inspectability of the graph structure.
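For illustration, the sketch below shows the kind of retrieval described above, using the official neo4j Python driver. The node labels (:Synthesis, :Value), the relationship type [:EXPRESSES_VALUE], and the connection details are placeholders standing in for the schema of Figure 4, not the exact identifiers used in the revised manuscript.

```python
# Minimal sketch: retrieving synthesis nodes by an expressed value (e.g., "empathy").
# The labels, relationship type, and connection details below are illustrative
# placeholders; the actual schema is documented in Figure 4 of the revised manuscript.
from neo4j import GraphDatabase

QUERY = """
MATCH (s:Synthesis)-[:EXPRESSES_VALUE]->(v:Value {name: $value})
RETURN s.text AS synthesis, v.name AS value
"""

def find_syntheses_by_value(uri, user, password, value="empathy"):
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            result = session.run(QUERY, value=value)
            return [(record["synthesis"], record["value"]) for record in result]

if __name__ == "__main__":
    for text, value in find_syntheses_by_value("bolt://localhost:7687", "neo4j", "password"):
        print(f"[{value}] {text[:100]}")
```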
The structure and intent of the prompts are described but no examples of actual prompts (or LLM outputs) are given. Include at least one full prompt–response chain for each stage (opinion, counter, synthesis) in an appendix or supplementary material.
We agree with the reviewer that including a concrete example of a full prompt–response chain would enhance transparency and clarity. In response, we have added a complete reasoning chain—including the opinion, counterargument, and synthesis stages—for the prompt “Should freedom of speech include the right to spread misinformation?” in Appendix A.
Assigning specific LLMs (e.g., Gemma for opinion, Dolphin-Mistral for counterargument) is explained, but more empirical justification might be shown or referenced to showcase why these models performed best in their assigned stages.
We thank the reviewer for this valuable comment. We have now clarified in Section 2.4 that model-role assignments were based on qualitative observations made during system implementation. Each model was selected according to its generation tendencies and rhetorical alignment with the intended function of each reasoning stage, rather than based on formal benchmarking.
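To make the stage-to-model mapping concrete, the sketch below chains the three reasoning stages through a caller-supplied generation function. The prompt wording, the synthesis-stage model name, and the `generate` callable are placeholders, not the exact prompts or runtime used in the study (those are described in Sections 2.4 and 2.6).

```python
# Illustrative sketch of the three-stage dialectical chain with per-stage model roles.
# Prompt wording, the synthesis model name, and `generate` are placeholders; the
# actual prompts and role assignments are described in Sections 2.4 and 2.6.
ROLE_MODELS = {
    "opinion": "gemma",                    # opinion stage (Section 2.4)
    "counterargument": "dolphin-mistral",  # counterargument stage (Section 2.4)
    "synthesis": "<synthesis-model>",      # placeholder for the model assigned in the paper
}

def run_dialectical_chain(question: str, generate) -> dict:
    """Run opinion -> counterargument -> synthesis for a single prompt.

    `generate(model_name, prompt)` is any text-generation callable, e.g. a thin
    wrapper around a local LLM runtime.
    """
    opinion = generate(ROLE_MODELS["opinion"],
                       f"State and justify a clear opinion on: {question}")
    counter = generate(ROLE_MODELS["counterargument"],
                       f"Question: {question}\nOpinion: {opinion}\n"
                       "Present the strongest counterargument to this opinion.")
    synthesis = generate(ROLE_MODELS["synthesis"],
                         f"Question: {question}\nOpinion: {opinion}\n"
                         f"Counterargument: {counter}\n"
                         "Reconcile both positions into a balanced synthesis.")
    return {"opinion": opinion, "counterargument": counter, "synthesis": synthesis}
```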
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have tested many LLMs for dialect multimodal evaluation. The research is interesting; however, the part dealing with multimodal data and dialects is not clear. Some points need clarification.
Have the authors used any fusion techniques for multimodal data?
What language and dialects have they tested?
How did dealing with different dialects affect the results?
The figures in the manuscript need better visualization and higher resolution
Author Response
We sincerely thank the reviewers for their relevant and constructive comments. These suggestions have helped improve the quality of our paper and make it more informative and readable. We have carefully considered all comments, and made every effort to comply with the recommendations of the reviewers.
The authors have tested many LLMs for dialect multimodal evaluation. The research is interesting; however, the part dealing with multimodal data and dialects is not clear. Some points need clarification.
We thank the reviewer for raising this point. To avoid potential confusion, we have clarified in Section 1.1 that the term “dialectical” is used strictly in the philosophical and argumentative sense, referring to reasoning through opposing viewpoints (opinion, counterargument, synthesis). The study does not involve linguistic dialects or multimodal data, and the entire framework operates in a monolingual, text-based setting.
Have the authors used any fusion techniques for multimodal data?
We confirm that no fusion techniques were used, as the study does not involve multimodal data. The framework operates entirely on text-based inputs and outputs.
What language and dialects have they tested?
The evaluation was conducted exclusively in English. No dialectal or multilingual inputs were used, and all prompts and model outputs were in standard English.
How did dealing with different dialects affect the results?
As the study does not involve linguistic dialects, there was no variation across dialects to affect the results. All experiments were conducted in standard English using uniform prompts and evaluation criteria.
The figures in the manuscript need better visualization and higher resolution
Thank you for pointing this out. We have replaced all figures in the manuscript with high-resolution versions to improve readability and ensure visual clarity.
Reviewer 3 Report
Comments and Suggestions for Authors
This paper proposes a framework to evaluate the reasoning quality of large language models (LLMs). The framework integrates multi-stage reasoning, rubric-based evaluation, and semantic analysis. Experiments are conducted on four open-source LLMs.
Strengths:
- This work includes experiments across four LLMs. The discussion is carefully written and provides insightful observations on the differences in model outputs across various aspects.
- The rubric-based scoring is interesting and appears more reasonable than unguided evaluation.
Weaknesses:
- The scoring rubric may be challenging for LLMs to follow. LLM evaluators may not consistently adhere to the rubric, but the paper does not clarify whether the rubric was followed faithfully or consistently.
- The paper provides limited details on how the rubric-based evaluation was implemented. The prompts in Section 2.6 ("Prompt Set") do not explicitly align with the rubric criteria. The rule-based evaluation is also insufficiently described. These issues hinder reproducibility.
- Since LLMs tend to overrate each other’s outputs, using LLMs to evaluate other LLMs may introduce systematic bias. For a work focused on evaluation methodology, it is crucial to validate the fairness and objectivity of the proposed framework. Including a human annotation baseline or correlation with human judgments would strengthen the work.
Author Response
We sincerely thank the reviewers for their relevant and constructive comments. These suggestions have helped improve the quality of our paper and make it more informative and readable. We have carefully considered all comments, and made every effort to comply with the recommendations of the reviewers.
Strengths:
- This work includes experiments across four LLMs. The discussion is carefully written and provides insightful observations on the differences in model outputs across various aspects.
Thank you!
- The rubric-based scoring is interesting and appears more reasonable than unguided evaluation.
Thank you!
Weaknesses:
- The scoring rubric may be challenging for LLMs to follow. LLM evaluators may not consistently adhere to the rubric, but the paper does not clarify whether the rubric was followed faithfully or consistently.
We thank the reviewer for raising this important issue. To ensure that the rubric was followed consistently, each LLM evaluator received the full rubric and score definitions, along with an explicit directive discouraging inflated ratings. The models rated independently, without access to prior evaluations, and malformed outputs were automatically excluded. These clarifications have been added in Section 2.3.
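As an illustration of the malformed-output filtering mentioned above, the sketch below assumes the evaluators return JSON scores keyed by rubric criterion on a 1–5 scale; the criterion names here are placeholders, and the actual rubric and score anchors are defined in Section 2.3.

```python
# Minimal sketch: validating an evaluator's rubric output and excluding malformed
# responses. The criterion names and the 1-5 scale are placeholders; the actual
# rubric and score anchors are defined in Section 2.3.
import json

RUBRIC_CRITERIA = {"coherence", "relevance", "depth"}  # placeholder criteria
VALID_SCORES = set(range(1, 6))                        # assumed 1-5 anchors

def parse_rubric_scores(raw_output: str):
    """Return {criterion: score} if the evaluator output is well-formed, else None."""
    try:
        scores = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # not valid JSON -> excluded from aggregation
    if not isinstance(scores, dict) or set(scores) != RUBRIC_CRITERIA:
        return None  # missing or unexpected criteria -> excluded
    if not all(isinstance(v, int) and v in VALID_SCORES for v in scores.values()):
        return None  # scores outside the rubric scale -> excluded
    return scores
```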
- The paper provides limited details on how the rubric-based evaluation was implemented. The prompts in Section 2.6 ("Prompt Set") do not explicitly align with the rubric criteria. The rule-based evaluation is also insufficiently described. These issues hinder reproducibility.
We thank the reviewer for highlighting the need for greater clarity. In response, we have expanded Section 2.3 to describe the evaluation prompt in more detail, including the rubric anchors and the structure of the input. We have also clarified in Section 2.7 how the rule-based evaluation operates, including the use of regular expressions for anomaly detection and a curated keyword lexicon for value identification. These additions aim to improve transparency and reproducibility.
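A small sketch of the two rule-based mechanisms mentioned above follows; the regular-expression patterns and lexicon entries shown are examples only, while the curated versions are given in Section 2.7.

```python
# Illustrative sketch of the rule-based checks: regular expressions flag surface
# anomalies, and a keyword lexicon tags expressed values. The patterns and lexicon
# entries below are examples only; the curated versions appear in Section 2.7.
import re

ANOMALY_PATTERNS = {
    "repeated_token": re.compile(r"\b(\w+)(\s+\1\b){3,}", re.IGNORECASE),
    "empty_output": re.compile(r"^\s*$"),
    "truncated_ending": re.compile(r"[A-Za-z0-9,]\s*\Z"),  # ends without closing punctuation
}

VALUE_LEXICON = {
    "empathy": {"empathy", "compassion", "understanding"},
    "fairness": {"fairness", "equity", "justice"},
}

def rule_based_report(text: str) -> dict:
    anomalies = [name for name, pattern in ANOMALY_PATTERNS.items() if pattern.search(text)]
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    values = [value for value, keywords in VALUE_LEXICON.items() if tokens & keywords]
    return {"anomalies": anomalies, "values": values}
```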
- Since LLMs tend to overrate each other’s outputs, using LLMs to evaluate other LLMs may introduce systematic bias. For a work focused on evaluation methodology, it is crucial to validate the fairness and objectivity of the proposed framework. Including a human annotation baseline or correlation with human judgments would strengthen the work.
We fully agree with this concern. While our evaluation relies on LLMs, we mitigated potential scoring bias by using two independent evaluators with no shared context and strict rubric instructions. We have added a paragraph in Section 4.2 acknowledging this limitation and identifying human-based validation as an essential direction for future work.
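To make the dual-evaluator mitigation inspectable, the sketch below combines the two evaluators' scores and reports simple agreement statistics; the aggregation shown (averaging plus exact-agreement rate) is one possible choice and not necessarily the one used in the manuscript.

```python
# Sketch: combining two independent evaluators' {criterion: score} dicts and
# reporting simple agreement statistics. Averaging and exact-agreement rate are
# shown as one possible choice, not necessarily the aggregation used in the paper.
from statistics import mean

def aggregate_dual_scores(scores_a: dict, scores_b: dict) -> dict:
    shared = sorted(set(scores_a) & set(scores_b))
    combined = {c: (scores_a[c] + scores_b[c]) / 2 for c in shared}
    mean_abs_diff = mean(abs(scores_a[c] - scores_b[c]) for c in shared)
    exact_agreement = sum(scores_a[c] == scores_b[c] for c in shared) / len(shared)
    return {"combined": combined,
            "mean_abs_diff": mean_abs_diff,
            "exact_agreement": exact_agreement}

# Example usage with placeholder scores:
# aggregate_dual_scores({"coherence": 4, "depth": 3}, {"coherence": 5, "depth": 3})
```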
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
The authors clearly took the reviewer feedback into account and made a sincere effort to improve the manuscript. The expanded explanations in Sections 2.3, 2.6, 2.7, and 4.2 are appreciated.
To further improve the manuscript, adding a short "Limitations" subsection may be helpful.
Author Response
We sincerely thank the reviewers for their relevant and constructive comments. These suggestions have helped improve the quality of our paper and make it more informative and readable. We have carefully considered all comments, and made every effort to comply with the recommendations of the reviewers.
The authors clearly took the reviewer feedback into account and made a sincere effort to improve the manuscript. The expanded explanations in Sections 2.3, 2.6, 2.7, and 4.2 are appreciated.
Thank you!
To further improve the manuscript, adding a short "Limitations" subsection may be helpful.
We appreciate the suggestion. The existing limitations discussion has been clarified and retitled as a dedicated “Study Limitations” subsection (Section 4.3) to explicitly address this point.