Review Reports - A Benchmark for Evaluating Cognitive Reasoning in Modern Language Models

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

My peer review comments are attached.

Comments for author File: Comments.pdf

Author Response

Reviewer 1

This paper studies how to evaluate cognitive reasoning behavior in modern large language models, proposes a modular benchmark that combines factual, syntactic, and logical task dimensions, and implements a controlled test protocol across eight models with context reset and a three-level scoring rule that includes a single corrective attempt. However, the following issues remain:

Response:

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files.

Major revisions

The experimental protocol is not reproducible enough for either API models or local models because key details such as exact model snapshots, decoding settings, prompt templates, run counts, and environment versions are missing while the paper relies on manual session handling and potentially stochastic outputs.

Response 1:

Thank you for pointing this out. We agree that reproducibility is a crucial aspect of any benchmark-based study, and we have therefore designed the proposed benchmark and experimental protocol to be as transparent and reproducible as possible within the practical constraints of working with both open-source and API-based models.

First, all benchmark questions, together with the expected correct answers and task instructions, are explicitly defined and fixed. In the revised version of the manuscript, we will additionally provide these materials as a separate supplementary file, so that future researchers can directly reuse the same task set to evaluate new models or reproduce our experiments.

Second, for local (open-source) models we report the exact model variants and sources (e.g., Mistral 7B, LLaMA3:8B, LLaMA3.2), which can be downloaded from public repositories such as Hugging Face. These models were executed locally using the Ollama runtime environment, which is specified in the manuscript. Because local execution allows full control over model weights and configuration, this part of the protocol is fully reproducible.

Third, for all models we describe the prompting strategy and interaction procedure in detail. Each query is issued in a single-prompt mode with context reset after every interaction, and the same task-specific instruction templates are used across models. We will further clarify in the revised manuscript that default decoding settings of the respective environments were used, and that no additional sampling heuristics or prompt engineering beyond the described instructions were applied.

Fourth, regarding API-based models, we acknowledge the inherent limitation that exact model snapshots and backend updates are controlled by the providers and may change over time, even when a model name remains constant. This limitation is unavoidable in contemporary LLM research and affects virtually all studies involving closed commercial models. Nevertheless, we consider the inclusion of such models necessary, as they represent a significant portion of state-of-the-art systems used in practice. To mitigate this issue, we document the model names, access modality (API), and testing dates, and we emphasize that our benchmark is primarily intended as a comparative and diagnostic framework rather than a source of immutable absolute scores.

Finally, concerning run counts and stochasticity, the current version of the benchmark intentionally does not adopt a multi-run statistical evaluation. Instead, each task is queried once, with the possibility of a single corrective attempt after an incorrect response, following a clearly defined three-level scoring scheme (1 / 0.5 / 0). This design reflects our research goal: to approximate the typical user experience in which a user asks a question once and expects a correct answer, possibly after minimal feedback. We will clarify this motivation more explicitly in the revised manuscript.
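The three-level rule described above can be written down compactly. The following is a minimal sketch, not the evaluation code used in the study: the names `score_item` and `total_score` are hypothetical, and the correctness judgments are assumed to be supplied by the human evaluator.

```python
from typing import Iterable, Optional, Tuple


def score_item(first_correct: bool, second_correct: Optional[bool] = None) -> float:
    """Three-level score for one task.

    1.0 -> correct on the first attempt
    0.5 -> incorrect at first, correct after the single corrective attempt
    0.0 -> incorrect on both attempts (or no corrective attempt succeeded)
    """
    if first_correct:
        return 1.0
    if second_correct:
        return 0.5
    return 0.0


def total_score(results: Iterable[Tuple[bool, Optional[bool]]]) -> float:
    """Sum the per-item scores over (first_correct, second_correct) pairs."""
    return sum(score_item(first, second) for first, second in results)


# Hypothetical judgments for three tasks, for illustration only.
example = [(True, None), (False, True), (False, False)]
print(total_score(example))  # 1.5
```

Because the second judgment is consulted only when the first answer is wrong, a model cannot earn more than 0.5 from the corrective attempt.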

In summary, while we recognize that perfect reproducibility is more feasible for local models than for API-based systems, we believe that the benchmark structure, fixed task set, standardized instructions, and clearly described testing procedure together provide a level of reproducibility that is consistent with current best practices in LLM evaluation. We will revise the manuscript to make these aspects more explicit and to include supplementary materials with the full task set and expected answers.

The benchmark validity is not sufficiently justified because the manuscript claims to isolate reasoning from training knowledge yet provides limited evidence that tasks truly prevent memorization or pattern matching, and it lacks human baseline results, item difficulty analysis, and ambiguity checks.

Response 2:

Thank you for pointing this out. The primary goal of the proposed benchmark is not to claim a perfect separation between training knowledge and reasoning - an objective we explicitly acknowledge as unattainable - but rather to maximize the degree to which tasks require local, on-the-fly reasoning instead of recall of memorized facts. We have clarified this point in the revised manuscript (Section 2.2, Task Design).

First, with respect to isolating reasoning from prior knowledge, the benchmark was intentionally designed around newly constructed, artificial scenarios and closed fact bases. In particular, in the compliance-with-fact-base factual tasks and in all logical tasks, the model is instructed to rely exclusively on premises provided in the prompt and not on any external or general knowledge. These premises describe situations that are highly unlikely to occur verbatim in training corpora, as they were authored specifically for this benchmark. This design choice substantially reduces the possibility of direct memorization and shifts the burden toward reasoning over novel, local information.

Second, especially in the logical category (quantifier negation, De Morgan transformations, and diverse syllogistic forms), correct answers follow from the explicit application of classical logical rules rather than from semantic associations or world knowledge. Moreover, syllogisms were constructed so that no two items share identical surface structure, limiting the usefulness of shallow pattern matching and encouraging rule-based inference.
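To illustrate what "formally verifiable" means for this category, the sketch below checks De Morgan transformations by exhaustive truth table and checks quantifier negation over a finite domain. It is an illustration only; the helper names `equivalent` and `quantifier_negation_holds` are hypothetical and are not part of the benchmark materials.

```python
from itertools import product
from typing import Callable, Iterable


def equivalent(f: Callable[..., bool], g: Callable[..., bool], n_vars: int) -> bool:
    """Return True if f and g agree on every assignment of n_vars Boolean variables."""
    return all(
        f(*values) == g(*values)
        for values in product([False, True], repeat=n_vars)
    )


def quantifier_negation_holds(domain: Iterable, pred: Callable[[object], bool]) -> bool:
    """Check 'not (for all x: P(x))' == 'exists x: not P(x)' on a finite domain."""
    items = list(domain)
    return (not all(pred(x) for x in items)) == any(not pred(x) for x in items)


# De Morgan: not (p and q)  <=>  (not p) or (not q)
print(equivalent(lambda p, q: not (p and q), lambda p, q: (not p) or (not q), 2))   # True
# De Morgan: not (p or q)   <=>  (not p) and (not q)
print(equivalent(lambda p, q: not (p or q), lambda p, q: (not p) and (not q), 2))   # True
# A typical incorrect transformation is rejected by the same check.
print(equivalent(lambda p, q: not (p and q), lambda p, q: (not p) and (not q), 2))  # False
```

A reference answer for such an item can therefore be validated mechanically, independently of any world knowledge.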

Third, every task has a single, formally verifiable reference answer prepared by a human annotator, and the correctness criteria are grounded either in classical logic or in grammatical well-formedness. When a model response is unclear, the evaluator determines correctness by checking consistency with these predefined rules. We have added an explicit statement in the manuscript emphasizing this evaluation procedure and its role in minimizing ambiguity.

Regarding the absence of a human baseline: we agree that human performance would be a valuable point of comparison. Our current study focuses on establishing the benchmark and demonstrating its diagnostic potential across models of different scales. We now explicitly state this limitation and indicate human baseline evaluation as an important direction for future work.

Concerning item difficulty, tasks within each category were deliberately graded by increasing structural complexity (e.g., longer sentences, more premises, more logical operators). While we did not perform a formal psychometric item-response analysis, this controlled difficulty scaling allows us to observe systematic performance degradation as cognitive load increases. We have clarified this design rationale in the revised text.

In summary, although complete disentanglement of stored knowledge from reasoning is impossible, the proposed benchmark is constructed to maximize reliance on reasoning over novel premises and to minimize opportunities for direct recall. We have added additional methodological clarifications in the manuscript to make these assumptions and limitations explicit.

The scoring and analysis are methodologically weak because a single attempt plus one corrective prompt can confound capability with prompt sensitivity, and the paper should add repeated trials, variance reporting, and more rigorous adjudication procedures to support claims about model limitations.

Response 3:

Thank you for pointing this out. We agree that relying on a single attempt with one corrective prompt allows neither estimating variance across runs nor formally disentangling model capability from prompt sensitivity.

In the current version of the benchmark, the single-attempt protocol (with the possibility of one corrective prompt in case of an incorrect answer) was adopted intentionally. Our primary objective was to approximate a realistic usage scenario in which a typical user formulates a task once and, if necessary, slightly reformulates it after receiving an incorrect response. In this sense, the study was designed to evaluate practical task performance under standard user interaction conditions, rather than to provide a controlled analysis of prompt robustness or stochastic variability across multiple trials.

We fully acknowledge that large language models are sensitive to prompt formulation and that repeated trials with variance reporting would provide a more fine-grained assessment of model stability and intrinsic capability. However, conducting systematic multi-run experiments with prompt sensitivity analysis for each benchmark question would substantially increase the scope and computational complexity of the study, going beyond the current research design and resources.

Importantly, the benchmark includes a diverse set of tasks and problem types, which mitigates, at least partially, the risk that the conclusions are driven by idiosyncratic prompt effects in a small subset of questions. We believe that, even under the single-attempt protocol, the observed performance patterns and comparative results provide meaningful insight into the models’ practical limitations and strengths.

To improve transparency, we have clarified the rationale and limitations of the adopted methodology in the revised manuscript (Section 2.5.1, Course of the examination). We also explicitly acknowledge the absence of repeated trials, variance estimates, and formal adjudication procedures as a limitation of the present study and identify multi-run robustness analysis as an important direction for future versions of the benchmark.
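Should repeated trials be added in a future version, variance reporting could take roughly the following form. This is a sketch under the assumption that each run yields one total benchmark score; the function name `summarize_runs` and the numbers in the usage line are hypothetical and are not results from the study.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(run_totals: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of total scores across runs.

    With a single run the spread cannot be estimated, so 0.0 is returned
    for the standard deviation in that case.
    """
    if not run_totals:
        raise ValueError("at least one run is required")
    spread = stdev(run_totals) if len(run_totals) > 1 else 0.0
    return mean(run_totals), spread


# Hypothetical totals from three independent runs of the same model.
print(summarize_runs([8.0, 9.0, 10.0]))  # (9.0, 1.0)
```

Reporting the mean together with the spread would let readers judge whether a difference between two models exceeds run-to-run variation.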

Thank you again for highlighting this issue. Your comment has helped us better articulate the methodological boundaries of our work and outline directions for future refinement.

Minor revisions

The manuscript should remove or downplay parameter-count-based arguments for commercial models when those numbers are explicitly acknowledged as estimates from unofficial sources.

Response 1:

Thank you for pointing this out. We agree that the parameter counts reported for commercial models are based on publicly available estimates and unofficial sources, and therefore should not be treated as precise or authoritative values. We fully acknowledge this limitation in the manuscript.

At the same time, we believe that providing approximate parameter scales serves an informative comparative purpose. Even if the exact numbers are undisclosed by providers, the reported estimates reflect clear differences in orders of magnitude between models (e.g., billions vs. hundreds of billions vs. trillion-scale systems). Such order-of-magnitude distinctions offer a general quantitative reference point when discussing differences in model scale, especially in contrast to open-source models whose parameter counts are explicitly known.

Importantly, our analysis does not rely on parameter count as a causal explanatory variable, nor do we draw deterministic conclusions solely from model size. We explicitly note that performance differences may also result from architectural design, training data composition, optimization strategies, alignment procedures, and inference-time configurations. The parameter scale is presented as a basic quantitative descriptor rather than as a definitive predictor of cognitive capability.

In the revised manuscript, we have therefore softened the emphasis on exact parameter values for commercial models, clarified that these figures are estimates, and framed them explicitly as indicative of scale rather than precise measurements. We thank the reviewer for helping us improve the transparency and methodological caution of our comparative discussion.

The paper should publish the full benchmark question set, expected answers, and evaluation scripts as supplementary materials so others can audit leakage, replicate scoring, and extend the benchmark.

Response 2:

Thank you for pointing this out. We fully agree that publishing the complete benchmark materials is essential to ensure transparency, reproducibility, and independent auditability. In response to this comment, we have prepared comprehensive supplementary materials containing:

the full benchmark question set (all task categories and subtypes),
the corresponding expected reference answers,
detailed scoring criteria.

These materials enable independent verification of scoring procedures, facilitate replication studies, and allow other researchers to analyze potential data leakage, benchmark robustness, or extend the framework with additional task categories.

The supplementary files have been submitted together with the revised manuscript. We believe this addition substantially strengthens the methodological transparency and practical utility of the proposed benchmark.

Thank you again for emphasizing the importance of open evaluation resources.

Please add a transportation domain citation where the paper uses autonomous vehicles as an example statement in the factual task materials, and a suitable insertion point is the example sentence about artificial intelligence used in autonomous vehicles, where you can cite Peng J, Shangguan W, Chai L, et al., V2X Enabled Platoon Control for Aperiodic Congestion Mitigation via Moving Bottlenecks in Mixed Traffic Environments, IEEE Transactions on Vehicular Technology, 2025.

Response 3:

Thank you for pointing this out. We agree that the use of artificial intelligence in transportation systems, particularly in the context of autonomous vehicles and intelligent traffic control, constitutes an important and rapidly developing research domain. Although the example sentence in our benchmark serves purely as illustrative task material rather than as a substantive discussion of transportation systems, adding a domain-specific citation improves contextual accuracy and interdisciplinary grounding.

Following the reviewer’s recommendation, we have inserted the suggested reference (Peng J., Shangguan W., Chai L., et al., V2X Enabled Platoon Control for Aperiodic Congestion Mitigation via Moving Bottlenecks in Mixed Traffic Environments, IEEE Transactions on Vehicular Technology, 2025) at the point where artificial intelligence in autonomous vehicles is mentioned in the factual task example (Section 3.1, example sentence concerning AI applications in autonomous vehicles).

The citation is included to acknowledge established research in intelligent transportation systems and to provide readers with an authoritative reference illustrating real-world AI applications in this domain.

We appreciate the reviewer’s suggestion, which strengthens the interdisciplinary context of the manuscript.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

1. In factual tasks, the model's incorrect answers may stem from a lack of knowledge or from "illusion" (i.e., confidently fabricating false information). It is recommended to differentiate and statistically analyze these two scenarios in error analysis. This is crucial for understanding the model's behavioral patterns when faced with unknown information.

2. The paper mentions applying "difficulty grading" across various tasks, such as increasing sentence length or the number of facts. It is suggested to introduce a quantitative difficulty metric (e.g., the number of reasoning steps in logic tasks, information entropy in factual tasks) and plot the model's performance as a function of difficulty. This would make the impact of "cognitive load" on model performance more intuitive and measurable.

3. The experiment employs a strict strategy of "resetting the context after each interaction" to isolate single-inference capabilities. This is a reasonable method for controlling variables. However, it also excludes the possibility of the model self-correcting using dialogue history. It is recommended to analyze the advantages and disadvantages of this strategy in the discussion section: while it ensures the purity of the evaluation, it may underestimate the model's potential for progressive reasoning through multi-turn feedback in real-world interaction scenarios.

4. It is recommended to discuss and cite recent research on using pseudo-labels to improve model discriminative power, such as "Enhancing feature discrimination with pseudo-labels for foundation model in segmentation of 3D medical images." This involves using self-generated (potentially noisy) supervisory signals (pseudo-labels/feedback) to guide learning and correction, which is highly relevant to the phenomena observed in this paper.

5. The results clearly demonstrate a positive correlation between model size and performance. However, besides the number of parameters, there are significant differences between open-source models (such as Llama 3.2) and closed-source API models (such as the GPT series) in terms of training data and fine-tuning strategies (especially RLHF). It is recommended to analyze the impact of these factors (not just size) on the final results, particularly on the ability to "correct after feedback," in the discussion section.

Author Response

Reviewer 2

Response:

In factual tasks, the model's incorrect answers may stem from a lack of knowledge or from "illusion" (i.e., confidently fabricating false information). It is recommended to differentiate and statistically analyze these two scenarios in error analysis. This is crucial for understanding the model's behavioral patterns when faced with unknown information.

Response 1:

The distinction between errors caused by lack of knowledge and those resulting from hallucination is indeed an important and actively discussed issue in the evaluation of large language models. We fully agree that such differentiation can be valuable for understanding model behavior in open-ended factual tasks.

However, in practice, reliably determining whether an incorrect response stems from missing knowledge or from hallucination remains methodologically challenging, especially without direct access to the model’s internal representations or confidence estimates. In the context of the proposed benchmark, our primary objective is not to disentangle these two sources of error, but rather to assess whether a model is capable of producing a correct answer based on the cognitive processing required by the task.

Importantly, the benchmark tasks - particularly in the closed-context factual, syntactic, and logical categories - are designed so that no specialized factual knowledge from the training data is required. Correct performance depends solely on the information explicitly provided in the task description and on the model’s ability to perform appropriate cognitive operations, such as rule-based reasoning, inference over local premises, or structural transformation. Consequently, instead of analyzing whether an incorrect answer reflects missing knowledge or hallucination, the benchmark evaluates whether the model possesses sufficient cognitive competence to derive the correct answer under controlled informational conditions.

To clarify this design choice and its implications, we have added an additional discussion addressing this issue in Section 2.1.1 (Fact-based tasks), where we explicitly explain why error-type differentiation was not included in the current evaluation framework and outline it as a potential direction for future extensions of the benchmark.

The paper mentions applying "difficulty grading" across various tasks, such as increasing sentence length or the number of facts. It is suggested to introduce a quantitative difficulty metric (e.g., the number of reasoning steps in logic tasks, information entropy in factual tasks) and plot the model's performance as a function of difficulty. This would make the impact of "cognitive load" on model performance more intuitive and measurable.

Response 2:

Thank you for pointing this out. We agree that introducing an explicit quantitative difficulty metric and analyzing model performance as a function of difficulty would provide an intuitive and informative way to characterize the impact of cognitive load.

However, due to the heterogeneous nature of the proposed benchmark, defining a single, unified quantitative difficulty metric applicable across all task types is not straightforward. The benchmark intentionally integrates tasks that differ not only in surface complexity but also in the underlying cognitive operations they require. As a result, difficulty is not uniformly scalable across all categories.

In particular, graded difficulty is clearly applicable to some tasks, such as factual and syntactic tasks, where complexity can naturally increase with sentence length, the number of facts to be verified, or the amount of distracting information. In contrast, in certain logical tasks - most notably syllogistic reasoning - each item is formally constructed to be at a comparable level of logical complexity, as all tasks require the application of the same inference rules, regardless of linguistic length or surface variation. In these cases, difficulty does not meaningfully increase along a single quantitative dimension.

For these reasons, the current version of the benchmark implements difficulty grading locally and qualitatively within selected task types, rather than through a global quantitative metric. We fully agree that developing task-specific or formally grounded difficulty measures (e.g., number of inference steps or formal reasoning depth) and correlating them with model performance constitutes an important and promising direction for future work. Such an extension would be particularly suitable for a benchmark explicitly designed around controlled task generation with difficulty expressed in a well-defined quantitative metric.
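As an indication of how a task-specific difficulty measure could be related to performance, the sketch below averages item scores per difficulty level within one task type. It assumes that each item carries an integer difficulty label (for example, the number of premises); the function name `mean_score_by_difficulty` and the sample data are hypothetical and are not taken from the benchmark results.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mean_score_by_difficulty(items: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Group (difficulty_level, score) pairs and return the mean score per level."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for level, score in items:
        buckets[level].append(score)
    # Sort by level so the result reads as a performance-vs-difficulty curve.
    return {level: sum(scores) / len(scores) for level, scores in sorted(buckets.items())}


# Hypothetical scores (1 / 0.5 / 0) for items at three difficulty levels.
sample = [(1, 1.0), (1, 1.0), (2, 0.5), (2, 1.0), (3, 0.0)]
print(mean_score_by_difficulty(sample))  # {1: 1.0, 2: 0.75, 3: 0.0}
```

The resulting mapping can be plotted directly, with one curve per model, once a defensible difficulty label exists for the task type in question.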

The experiment employs a strict strategy of "resetting the context after each interaction" to isolate single-inference capabilities. This is a reasonable method for controlling variables. However, it also excludes the possibility of the model self-correcting using dialogue history. It is recommended to analyze the advantages and disadvantages of this strategy in the discussion section: while it ensures the purity of the evaluation, it may underestimate the model's potential for progressive reasoning through multi-turn feedback in real-world interaction scenarios.

Response 3:

Thank you for pointing this out. We agree that resetting the context after each interaction represents a methodological trade-off that has both advantages and limitations.

In the proposed benchmark, the decision to reset the context after every query was intentional. The primary objective of the evaluation was to assess whether a model is capable of producing a correct answer in a single inference step, without relying on dialogue history or incremental clarification. The benchmark was therefore not designed to measure progressive reasoning, multi-turn refinement, or self-correction through extended interaction, but rather to evaluate immediate cognitive competence under controlled conditions.

Introducing multi-turn dialogue history and allowing iterative self-correction would significantly complicate the evaluation protocol and, more importantly, alter the fundamental assumptions of the benchmark. In particular, it would reduce control over experimental variables and make it more difficult to compare models in a standardized and reproducible manner. A key advantage of the context-reset strategy is thus the high degree of experimental control and isolation of single-inference behavior.

At the same time, we acknowledge that this approach may underestimate the models’ potential performance in realistic, interactive scenarios, where users often provide feedback and models can refine their responses over multiple turns. To address this limitation, we have added an explicit discussion of the advantages and disadvantages of context resetting in Section 2.4 (Testing and evaluation procedure), clarifying the scope of the benchmark and situating it with respect to real-world multi-turn interaction settings. We view the systematic analysis of progressive reasoning and dialogue-based self-correction as an important direction for future extensions of this work.

It is recommended to discuss and cite recent research on using pseudo-labels to improve model discriminative power, such as "Enhancing feature discrimination with pseudo-labels for foundation model in segmentation of 3D medical images." This involves using self-generated (potentially noisy) supervisory signals (pseudo-labels/feedback) to guide learning and correction, which is highly relevant to the phenomena observed in this paper.

Response 4:

Thank you for pointing this out. We agree that recent research on pseudo-labeling and self-generated supervisory signals plays an increasingly important role in improving the discriminative and corrective capabilities of modern AI systems, including both language and vision-based foundation models.

The cited work on enhancing feature discrimination using pseudo-labels is particularly relevant in this context, as it demonstrates how models can leverage internally generated (and potentially noisy) feedback signals to guide learning and refinement. This perspective closely relates to the phenomena observed in our study, especially the limited but measurable corrective behavior exhibited by some models after receiving minimal feedback.

However, it is important to note that the proposed benchmark is designed as an evaluation framework rather than a training or adaptation mechanism. As such, it does not implement pseudo-label–based learning or iterative optimization during inference. Instead, the benchmark aims to diagnostically assess whether models are capable of correcting their responses when exposed to simple external feedback, without modifying model parameters.

To better situate our findings within the broader landscape of contemporary research, we have added a discussion and citation of recent work on pseudo-label-based supervision and self-guided correction mechanisms in Section 3 (Results and analysis of the obtained results). This additional discussion highlights conceptual connections between pseudo-labeling approaches and the observed corrective behavior, while clearly distinguishing evaluation-time feedback from training-time learning strategies. We consider a systematic integration of pseudo-label-inspired mechanisms into benchmark-driven model adaptation as a promising direction for future research.

The results clearly demonstrate a positive correlation between model size and performance. However, besides the number of parameters, there are significant differences between open-source models (such as Llama 3.2) and closed-source API models (such as the GPT series) in terms of training data and fine-tuning strategies (especially RLHF). It is recommended to analyze the impact of these factors (not just size) on the final results, particularly on the ability to "correct after feedback," in the discussion section.

Response 5:

Thank you for pointing this out. We fully agree that, beyond model size, differences in training data, fine-tuning strategies, and alignment procedures - particularly between open-source models and closed-source, API-based models - can substantially influence model performance, including the ability to correct responses after feedback.

To address this point, we have added an explicit discussion in Section 4 (Discussion) emphasizing that model size alone does not fully account for the observed performance differences. In particular, we highlight the role of training and alignment strategies (e.g., RLHF) as confounding factors, while also noting that a systematic quantitative analysis of these aspects is not feasible for commercial closed-source models due to the lack of publicly available details. For this reason, the number of parameters is used as a neutral and accessible reference point for cross-model comparison, although our results indicate that the evaluated cognitive capabilities do not scale monotonically with parameter count.

We believe this added discussion better contextualizes the results and clarifies the limitations of the analysis while remaining consistent with the diagnostic scope of the proposed benchmark.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

I have no other questions.