Article
Peer-Review Record

Applied with Caution: Extreme-Scenario Testing Reveals Significant Risks in Using LLMs for Humanities and Social Sciences Paper Evaluation

Appl. Sci. 2025, 15(19), 10696; https://doi.org/10.3390/app151910696
by Hua Liu 1, Ling Dai 2,* and Haozhe Jiang 3,*
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 25 August 2025 / Revised: 1 October 2025 / Accepted: 1 October 2025 / Published: 3 October 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper addresses a rather interesting problem, but has significant shortcomings.
1. The "Literature Review" section takes up a significant part of the article. It could be made more concentrated, perhaps by integrating some of the information directly into the introduction or discussion to avoid repetition.
2. The article presents many tables with the results of statistical analysis. However, for better perception, some data, especially the comparison of the effectiveness of different models (for example, the error detection rates from Tables 4 and 8), could be presented in the form of graphs or diagrams.
3. The paper generally lacks infographics. All information is presented in the form of text and tables. Only one figure, there are no formulas and algorithms. This creates conditions for difficult perception of the novelty of the article and the contribution of the authors.
4. The four-phase model of collaboration proposed in section 5.2.2 is a theoretical framework. Its practical effectiveness has not been tested within the framework of this study.
5. It is worth defining scientific novelty more clearly, since from what is presented, the studies only confirm known conclusions and the author's contribution is not clearly defined.

Author Response

1. Reviewer #1: The paper addresses a rather interesting problem, but has significant shortcomings.

Authors’ response: Thank you so much for your balanced and constructive comments. Your insights have been invaluable in helping us refine and enhance our work, and we are excited about the opportunity to address your suggestions to further strengthen our manuscript.

2. Reviewer #1: The "Literature Review" section takes up a significant part of the article. It could be made more focused, perhaps by integrating some of the information directly into the introduction or discussion to avoid repetition.

Authors’ response: Thank you for your valuable feedback regarding the "Literature Review" section. To make this section more focused, we have deleted the subsection 'Risk Assessment and Mitigation in LLM Paper Evaluation,' which was less closely related to the main body of the paper. Concurrently, we have integrated some of the information from this section directly into the introduction.

3. Reviewer #1: The article presents many tables with the results of statistical analysis. However, for better readability, some of the data, especially the comparison of the effectiveness of the different models (for example, the error detection rates from Tables 4 and 8), could be presented as graphs or diagrams.

Authors’ response: Thank you for your valuable feedback regarding the tables. We have converted Tables 4 and 8 into more vivid and easy-to-understand graphs.

4. Reviewer #1: The paper generally lacks infographics. All information is presented as text and tables; there is only one figure, and there are no formulas or algorithms. This makes it difficult to perceive the novelty of the article and the contribution of the authors.

Authors’ response: Thank you for your valuable feedback regarding the lack of infographics. We have drawn two graphs to illustrate the frequencies of the different types of scientific and logical flaws detected by the LLMs.

These two graphs are as follows:

Figure 2. Frequency of Detection of Different Scientific Flaw Types by LLMs

Figure 4. Frequency of Detection of Different Logical Flaw Types by LLMs

5. Reviewer #1: The four-phase model of collaboration proposed in Section 5.2.2 is a theoretical framework. Its practical effectiveness has not been tested within this study.

Authors’ response: Thank you for your valuable feedback regarding the four-phase model of collaboration, which indeed lacks empirical testing. Since this section is not a core part of the paper, we have deleted it.

6. Reviewer #1: The scientific novelty should be defined more clearly, since as presented the study only confirms known conclusions and the authors' contribution is not clearly defined.

Authors’ response: Thank you for your valuable feedback regarding clearly stating the scientific novelty and the authors' contribution. We have added a detailed description of the paper's novelty and contributions at the end of the Introduction section.

The added content is as follows:

This paper makes the following contributions:

  • A highly credible comparative experiment was designed, which not only effectively probes the lower bounds of LLM evaluation through extreme scenarios, but also addresses the challenge of measuring evaluation competence due to the highly subjective nature of paper assessment.
  • The study exposes fundamental limitations in LLMs’ paper evaluation capabilities, particularly in the domains of scientific and logical evaluation, which are typically masked under naturalistic conditions.
  • A novel interpretation is provided of the distinctive mechanisms underlying LLM-based paper evaluation, which differ fundamentally from those of human evaluators.

Furthermore, we have revised the paper's title to “Evaluating the Evaluator: Probing the Lower Limits and Risks of LLMs in Academic Paper Assessment through Extreme-Scenario Testing” and concurrently modified the abstract to better highlight the innovative contributions of the study.

The modified abstract is as follows:

The deployment of large language models (LLMs) in academic paper evaluation is increasingly widespread, yet their trustworthiness remains debated; to expose fundamental flaws often masked under conventional testing, this study employed extreme-scenario testing to probe the lower performance boundaries of LLMs in assessing scientific validity and logical coherence. Through a highly credible quasi-experiment, 40 high-quality Chinese papers from philosophy, sociology, education, and psychology were selected, for which domain experts created versions with implanted "scientific flaws" and "logical flaws". Three representative LLMs (GPT-4, DeepSeek, and Doubao) were evaluated against a baseline of 24 doctoral candidates, following a protocol progressing from ‘broad’ to ‘targeted’ prompts. Key findings reveal poor evaluation consistency, with significantly low intra-rater and inter-rater reliability for the LLMs, and limited flaw detection capability, as all models failed to distinguish between original and flawed papers under broad prompts, unlike human evaluators; although targeted prompts improved detection, LLM performance remained substantially inferior, particularly in tasks requiring deep empirical insight and logical reasoning. The study proposes that LLMs operate on a fundamentally different "task decomposition-semantic understanding" mechanism, relying on limited text extraction and shallow semantic comparison rather than the human process of "worldscape reconstruction → meaning construction and critique", resulting in a critical inability to assess argumentative plausibility and logical coherence. It concludes that current LLMs possess fundamental limitations in evaluations requiring depth and critical thinking, are not reliable independent evaluators, and that over-trusting them carries substantial risks, necessitating rational human-AI collaborative frameworks, enhanced model adaptation through downstream alignment techniques like prompt engineering and fine-tuning, and improvements in general capabilities such as logical reasoning.

For the detailed notes to reviewers, please also see the attachment. Thank you so much.

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

The paper "Applying Large Language Models for Academic Paper Evaluation: Capability Assessment and Risk Mitigation" presents the experience of integrating three LLMs (Doubao, DeepSeek-V3, and GPT-4.1) for automatic assessment of academic papers.

The research is based on zero-shot learning, which classifies samples without any knowledge of previously observed classes. Using this approach significantly reduces accuracy, although it can slightly mitigate bias.

I suggest the authors extend their approach to one-shot learning, using a set of predefined training examples for each label. Although a historically older concept, one-shot learning seems to be successful. Based on our recent experience with the same task, I strongly believe that one-shot learning will be fruitful and that the authors will be able to rely on the LLM-generated results.

A brief explanation of the variants of the intraclass correlation coefficient (ICC) is necessary to explain why the values obtained in the examination are as meaningful as presented.

Human annotation is a great asset of the research. A brief explanation of the human evaluators (their number, competence, engagement etc.) is crucial to understand the abilities and handicaps of the LLMs for paper evaluation.

Table 1 and Figure 1 should be remade. While the table is missing some punctuation marks, the figure is completely illegible.

Author Response

Reviewer #2: The paper "Applying Large Language Models for Academic Paper Evaluation: Capability Assessment and Risk Mitigation" presents the experience of integrating three LLMs (Doubao, DeepSeek-V3, and GPT-4.1) for automatic assessment of academic papers. The research is based on zero-shot learning, which classifies samples without any knowledge of previously observed classes. Using this approach significantly reduces accuracy, although it can slightly mitigate bias. I suggest the authors extend their approach to one-shot learning, using a set of predefined training examples for each label. Although a historically older concept, one-shot learning seems to be successful. Based on our recent experience with the same task, I strongly believe that one-shot learning will be fruitful and that the authors will be able to rely on the LLM-generated results.

Authors’ response: Thank you very much for your insightful suggestion. We agree that carefully designed samples corresponding to specific error types would be beneficial for improving the evaluation performance of LLMs. However, the primary aim of this study is to probe the lower bound of LLMs' evaluation capabilities, reveal their potential risks, and establish a baseline for enhancing their performance through techniques such as prompt engineering and fine-tuning. Therefore, we intentionally avoided providing excessive guidance to the LLMs during the testing phase. Based on this rationale, we have added the following explanation at the end of Section 5.2:

Furthermore, this study emphasizes probing the lower bound of LLMs' paper evaluation capability. This aims to reveal potential risks and establish a baseline for improving their evaluation performance through techniques like prompt engineering and fine-tuning. Consequently, the testing intentionally avoided providing excessive guidance to the LLMs.
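For illustration, the difference between the broad zero-shot setting used in the study and the one-shot setting suggested by the reviewer can be sketched as follows (a minimal Python sketch; the prompt wording and the exemplar are invented for illustration and are not the study's actual prompts, which appear in Appendix B):

# Minimal sketch contrasting a broad zero-shot prompt with a one-shot prompt
# that prepends a labelled flaw exemplar. All wording here is illustrative.

BROAD_ZERO_SHOT = (
    "You are an academic reviewer. Evaluate the scientific validity and "
    "logical coherence of the following paper and give an overall score "
    "from 1 to 10.\n\n{paper_text}"
)

ONE_SHOT_WITH_EXEMPLAR = (
    "You are an academic reviewer. Example of a flaw and how to report it:\n"
    "Flaw: causal conclusions drawn from a purely correlational survey design.\n"
    "Comment: 'The causal claim in Section 4 is not supported by the "
    "cross-sectional data.'\n\n"
    "Now evaluate the following paper in the same way and give an overall "
    "score from 1 to 10.\n\n{paper_text}"
)

def build_prompt(paper_text: str, one_shot: bool = False) -> str:
    """Return the evaluation prompt; one_shot prepends a labelled exemplar."""
    template = ONE_SHOT_WITH_EXEMPLAR if one_shot else BROAD_ZERO_SHOT
    return template.format(paper_text=paper_text)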

Reviewer #2: A brief explanation of the variants of the intraclass correlation coefficient (ICC) is necessary to explain why the values obtained in the examination are as meaningful as presented.

Authors’ response: Thank you for your valuable feedback regarding the explanation of the ICC variants, which helps make the meaning of the data clearer. We have revised Section 4.1 by adding explanations of intra-rater and inter-rater reliability, together with references for the two different ICC criteria applicable to this test.

Intra-rater reliability reflects the consistency of a specific rater's evaluation criteria over time, indicating the stability of the scoring results. This study adopts the criteria proposed by Koo and Li (2016) for interpreting the Intraclass Correlation Coefficient (ICC): ICC < 0.50 indicates poor consistency and is unacceptable; 0.50 ≤ ICC < 0.75 indicates moderate consistency and is tolerable; 0.75 ≤ ICC < 0.90 indicates good consistency and is acceptable; ICC ≥ 0.90 indicates excellent consistency, which is highly desirable. Analysis of the intra-rater reliability for all 120 papers (under unified, broad prompts) revealed that the two scoring attempts by the three large language models (LLMs) all exhibited poor consistency (with ICC values below 0.75). This suggests substantial random error in the models' scoring, casting doubt on the reliability of their evaluation results (in contrast, the human expert showed good consistency between two scoring attempts).

The revised section on “Evaluation Consistency” is as follows:

Inter-rater reliability measures the uniformity of scoring standards applied by different raters. This study refers to the evaluation standards established by Fleiss (1981) and Cicchetti (1994): ICC < 0.40 indicates poor consistency and is unacceptable; 0.40 ≤ ICC < 0.60 indicates fair/moderate consistency and is tolerable; 0.60 ≤ ICC < 0.75 indicates good consistency and is acceptable; 0.75 ≤ ICC < 0.90 indicates very good consistency, which is highly desirable; ICC ≥ 0.90 indicates excellent consistency. The relatively lower thresholds of these criteria account for the greater difficulty in controlling differences and subjectivity among individuals. Analysis of the inter-rater reliability between the LLMs themselves and between the LLMs and the human expert showed that only the consistency between DeepSeek and the human expert reached a fair level. The consistency between Doubao, GPT-4 and the human expert, as well as the consistency among the three LLMs themselves, was poor. This indicates a significant discrepancy in the practical application of scoring standards between large language models and human evaluators (see Table 1 for details).
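For illustration, a minimal Python sketch of how the two interpretation scales cited above map a numeric ICC to the qualitative bands; the function names are ours, and only the thresholds come from the criteria described in the text:

# Maps ICC values to the bands cited above (Koo & Li 2016 for intra-rater;
# Fleiss 1981 / Cicchetti 1994 for inter-rater).

def interpret_intra_rater_icc(icc: float) -> str:
    """Koo & Li (2016) interpretation used for intra-rater reliability."""
    if icc < 0.50:
        return "poor (unacceptable)"
    if icc < 0.75:
        return "moderate (tolerable)"
    if icc < 0.90:
        return "good (acceptable)"
    return "excellent (highly desirable)"

def interpret_inter_rater_icc(icc: float) -> str:
    """Fleiss (1981) / Cicchetti (1994) interpretation for inter-rater reliability."""
    if icc < 0.40:
        return "poor (unacceptable)"
    if icc < 0.60:
        return "fair/moderate (tolerable)"
    if icc < 0.75:
        return "good (acceptable)"
    if icc < 0.90:
        return "very good (highly desirable)"
    return "excellent"

# Example: an ICC of 0.55 is only "moderate" on the intra-rater scale but
# "fair/moderate" on the more lenient inter-rater scale.
print(interpret_intra_rater_icc(0.55), "|", interpret_inter_rater_icc(0.55))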

Reviewer #2: Human annotation is a great asset of the research. A brief explanation of the human evaluators (their number, competence, engagement etc.) is crucial to understand the abilities and handicaps of the LLMs for paper evaluation.

Authors’ response: Thank you for your valuable feedback regarding the explanation of the human evaluators. We have revised the relevant content in Section 3.1.2 to specify the number, competence, engagement, and field of study of the human evaluators. It is important to clarify that we did not provide the evaluators with specific training on paper assessment. Instead, we ensured that they possessed considerable reading and writing capabilities in their respective disciplines through the requirement of 'possessing at least one high-quality publication within the preceding three years.' The purpose of this approach was to make the abilities of the human evaluators more comparable to the zero-shot capabilities of the LLMs.

The revised section on the human evaluators is as follows:

The third phase established human baseline performance through evaluation by 24 doctoral candidates from philosophy, sociology, education, and psychology disciplines. These evaluators, each possessing at least one high-quality publication within the preceding three years, assessed the manuscript versions following the same evaluation framework. Each evaluator assessed 10 distinct papers, ensuring that each paper received two independent evaluations. Meanwhile, the versions of the papers were randomly distributed and kept confidential from the evaluators to prevent contamination effects.
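For illustration only, one allocation scheme consistent with this description (40 base papers × 3 versions, 24 evaluators, 10 manuscripts each, every version rated twice by different evaluators) can be sketched in Python as below. The constraint that no evaluator sees two versions of the same base paper is our assumption; the authors' actual allocation procedure is not specified here.

# Illustrative reconstruction of a balanced, blinded allocation; not the
# authors' actual allocation script.
import random

def assign_versions(base_papers=40,
                    versions=("original", "scientific_flaw", "logical_flaw"),
                    pool_size=12, seed=0):
    """Allocate 120 manuscript versions to 24 evaluators, 10 each."""
    rng = random.Random(seed)
    assignments = {e: [] for e in range(2 * pool_size)}
    for pool in range(2):  # each pool of 12 evaluators rates every version once
        order = list(range(base_papers))
        rng.shuffle(order)
        for idx, paper in enumerate(order):
            vers = list(versions)
            rng.shuffle(vers)  # randomize which evaluator gets which version
            for k, version in enumerate(vers):
                evaluator = pool * pool_size + (3 * idx + k) % pool_size
                assignments[evaluator].append((paper, version))
    return assignments

allocation = assign_versions()
assert all(len(v) == 10 for v in allocation.values())                  # 10 manuscripts each
assert all(len({p for p, _ in v}) == 10 for v in allocation.values())  # no repeated base paper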

Reviewer #2: Table 1 and Figure 1 should be remade. While the table is missing some punctuation marks, the figure is completely illegible.

Authors’ response: Thank you for your detailed review of Table 1 and Figure 1. We have revised Table 1 according to your feedback and deleted Figure 1 in light of adjustments to the content of the paper.

For the detailed notes to reviewers, please see the attachment. Thank you so much.

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

The paper presents a capability assessment of large language models for academic paper evaluation. The topic itself is controversial, as wide deployment of LLMs for such a sensitive task in the context of the "publish or perish" model prevalent in some scientific communities could lead to unjustified consequences for the academic careers of researchers. However, research in this area can help in understanding how LLMs can be used as a supporting tool for such tasks.

The Introduction is sound. However, it is barely supported by references and crucial findings from the open literature. Although a literature review is present, I expect the most important findings to be presented in the introduction, so as to give the landscape and current situation in the field. This is necessary in order to emphasize the novelty and contributions of the work. I suggest presenting the 2-3 most important contributions of the paper in a bullet-by-bullet manner.

In the literature review, I think it would be beneficial to discuss prompting techniques, as well as to emphasize typical academic fields. I doubt that the performance is similar for technical fields such as AI, natural science, or medicine. This can be explored or discussed in the discussion. From Section 3.1.1, I see that the study is leaning towards the social sciences. This should be discussed in the results, as it can be a bias of this research.

The choice of LLMs was not explained. Although GPT-4 and DeepSeek seem like a logical choice, Doubao is not well known nor explained. Please explain your choice and back it with references.

Lines 240-242. The "standardized protocols" for targeted modifications of the test papers should be described rather rigorously, as this is important for reproducibility. Additionally, it would be nice to publish an anonymized dataset.

Section 3.1.2. Prompt engineering in the testing protocols is not well explained. It is well known that performance can vary depending on the prompting techniques used, especially in guided scenarios. In that sense, the second phase of the protocol should be better described.

The results are rich in nature but lack a broader context. Conclusions and discussion are mixed in the last section. I think a separate discussion section should be introduced to put the results, limitations, and applicability in a broader context, together with a comparison with other results from the open literature. The conclusion and future work can be given in short form at the end.

Author Response

Reviewer #3: The paper presents a capability assessment of large language models for academic paper evaluation. The topic itself is controversial, as wide deployment of LLMs for such a sensitive task in the context of the "publish or perish" model prevalent in some scientific communities could lead to unjustified consequences for the academic careers of researchers. However, research in this area can help in understanding how LLMs can be used as a supporting tool for such tasks.

Authors’ response: We sincerely thank you for your recognition of this study's value and your accurate understanding of its research objectives. This encouraging feedback greatly strengthens our confidence in refining the manuscript to fully realize its potential academic impact.

Reviewer #3: The Introduction is sound. However, it is barely supported by references and crucial findings from the open literature. Although a literature review is present, I expect the most important findings to be presented in the introduction, so as to give the landscape and current situation in the field. This is necessary in order to emphasize the novelty and contributions of the work. I suggest presenting the 2-3 most important contributions of the paper in a bullet-by-bullet manner.

Authors’ response: Thank you for your highly professional suggestions for improving the Introduction section. We have added the relevant content.

The most important findings in the field are as follows:

When the evaluation task involves judging whether a paper adheres to specific rules (e.g., grammar, syntax, formatting conventions, or research norms), LLMs often excel. Conversely, their performance is often unsatisfactory when the task requires a deep understanding and critical assessment of the scholarly content. In everyday scenarios, with appropriate prompt guidance, LLMs can generally provide passable evaluations, albeit with issues such as generic responses and a lack of depth. However, these performances may obscure fundamental flaws in their evaluative capabilities. A limited number of meticulously designed studies have attempted to expose their true abilities through extreme-scenario testing.

The most important contributions of the paper are as follows:

This paper makes the following contributions:

A highly credible comparative experiment was designed, which not only effectively probes the lower bounds of LLM evaluation through extreme scenarios, but also addresses the challenge of measuring evaluation competence due to the highly subjective nature of paper assessment.

The study exposes fundamental limitations in LLMs’ paper evaluation capabilities, particularly in the domains of scientific and logical evaluation, which are typically masked under naturalistic conditions.

A novel interpretation is provided of the distinctive mechanisms underlying LLM-based paper evaluation, which differ fundamentally from those of human evaluators.

Reviewer #3: In the literature review, I think it would be beneficial to discuss prompting techniques, as well as to emphasize typical academic fields. I doubt that the performance is similar for technical fields such as AI, natural science, or medicine. This can be explored or discussed in the discussion. From Section 3.1.1, I see that the study is leaning towards the social sciences. This should be discussed in the results, as it can be a bias of this research.

Authors’ response: Thank you for your valuable feedback regarding literature review. We have expanded the Related Work section to include a review of literature on prompt techniques. Additionally, a comparative analysis discussing the different disciplinary foci between prior research and our present study has been added to the Discussion section.

The literature review on prompt techniques is presented below:

The design of prompts also significantly influences the manifestation of LLMs' evaluation capabilities. Du et al. (2024) employed extensive prompts—which included the ICLR review guidelines, randomly selected human-written reviews for both accepted and rejected papers, and a template for the ICLR 2024 review format—to generate LLM-based reviews. Separately, Zhou et al. (2024a) distilled the review criteria of a leading cell biology journal into key dimensions: originality, accuracy, conceptual advance, timeliness, and significance. These defined criteria were then used to prompt an LLM to evaluate a given PubMed ID paper on a three-star rating system, culminating in an overall assessment. Such meticulously crafted prompts facilitate the LLM's understanding of domain-specific evaluation requirements and conventions, thereby enabling improved performance.
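For illustration, the kind of criteria-based prompt described above might be sketched as follows; the wording is our paraphrase, not the actual prompt used by Zhou et al. (2024a):

# Sketch of a rubric-style prompt built from fixed review criteria and a
# three-star rating scale, as described in the passage above.

CRITERIA = ["originality", "accuracy", "conceptual advance", "timeliness", "significance"]

def rubric_prompt(paper_id: str) -> str:
    """Build a prompt asking for a 1-3 star rating on each fixed criterion."""
    lines = [f"Rate the paper {paper_id} on each criterion from 1 to 3 stars:"]
    lines += [f"- {c}" for c in CRITERIA]
    lines.append("Then give an overall assessment in two or three sentences.")
    return "\n".join(lines)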

The comparative analysis of disciplinary emphases is as follows:

In summary, this study focuses on the sensitivity and critical ability of LLMs regarding scientific and logical issues in humanities and social sciences papers, which differs from detecting scientific-technical errors or formal reasoning flaws in natural science or engineering papers. Nevertheless, the conclusions of these two types of studies can still be mutually verifying and complementary.

Reviewer #3: The choice of LLMs was not explained. Although GPT-4 and DeepSeek seem like a logical choice, Doubao is not well known nor explained. Please explain your choice and back it with references.

Authors’ response: Thank you very much for your insightful comment regarding the selection of Large Language Models (LLMs) in our study. We agree that providing a clear rationale is crucial for the methodological rigor of our work. Our selection of GPT-4, DeepSeek, and Doubao was strategically designed to ensure a comprehensive and representative sample of the current LLM landscape, particularly from the perspective of accessibility and regional relevance for evaluating Chinese humanities and social sciences (HSS) papers. We selected GPT-4 as a global frontier model, DeepSeek as a top-tier open-source model, and Doubao as a high-impact model within the Chinese domestic ecosystem.

The added section is as follows:

3.1.2. Selection of Large Language Models

To ensure the comprehensiveness and representativeness of our findings, we selected three distinct LLMs as evaluators: GPT-4, DeepSeek, and Doubao. This selection was strategic and based on the following rationale. GPT-4 (OpenAI) was included as a benchmark model due to its widely recognized state-of-the-art performance in general NLP tasks. DeepSeek (DeepSeek AI) was chosen as a leading powerful open-source alternative, known for its competitive capabilities often cited in comparative studies. Finally, Doubao (ByteDance) was selected to represent a category of highly accessible and influential LLMs within the Chinese digital ecosystem. Its inclusion is crucial for our objective to assess the capabilities of models that are readily available to a broad Chinese-speaking audience, thereby addressing the practical implications of LLM usage in evaluating Chinese HSS papers. This tripartite selection—covering a global frontier model, a top-tier open-source model, and a dominant domestic model—ensures a diverse and pragmatic analysis.

Reviewer #3: Lines 240-242. The "standardized protocols" for targeted modifications of the test papers should be described rather rigorously, as this is important for reproducibility. Additionally, it would be nice to publish an anonymized dataset.

Authors’ response: Thank you very much for your valuable suggestions regarding the description of the modifications to the test papers. Appendix A already provides a comprehensive presentation of the various error types along with corresponding examples. We also commit to publishing an anonymized dataset.

Reviewer #3: Section 3.1.2. Prompt engineering in the testing protocols is not well explained. It is well known that performance can vary depending on the prompting techniques used, especially in guided scenarios. In that sense, the second phase of the protocol should be better described.

Authors’ response: Thank you very much for your valuable suggestions regarding the description of the prompt engineering. Appendix B has been included specifically to detail both the general and tailored prompts employed during the testing phase.

Reviewer #3: The results are rich in nature but lack a broader context. Conclusions and discussion are mixed in the last section. I think a separate discussion section should be introduced to put the results, limitations, and applicability in a broader context, together with a comparison with other results from the open literature. The conclusion and future work can be given in short form at the end.

Authors’ response: Thank you very much for your valuable suggestions on discussion and conclusion. We have revised these two sections.

These sections are as follows:

Discussion

5.1 Reliability of Scoring and Prompt Optimization

This study revealed notably low internal consistency in the scores assigned by the models across multiple evaluations. This issue stems fundamentally from the inherent stochasticity of the LLM sampling mechanism. Consequently, enhancing the stability of LLM-based paper evaluation constitutes a complex systemic challenge, for which prompt optimization is a crucial measure. Our findings indicate that low scoring consistency is associated with insufficiently specific prompts. For instance, when evaluating a theoretical paper, an LLM might alternately apply paradigms suited for empirical research or theoretical study, leading to significant score discrepancies. Furthermore, even when prompts define evaluation dimensions and score distributions, LLMs tend to refine these criteria randomly. When these refined criteria include indicators of "deficiency" or factors unfavorable to the paper, the result is an undeservedly lower score. Therefore, providing LLMs with targeted, detailed evaluation standards, along with exemplars from specific academic domains, is anticipated to mitigate scoring inconsistency to a certain extent.
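For illustration, a simple stability check consistent with this analysis is to score the same paper repeatedly under a fixed prompt and report the spread of the scores (a minimal Python sketch; score_paper is a hypothetical wrapper around whichever model API is used, not part of the study's code):

import statistics

def scoring_spread(score_paper, paper_text, prompt, n_runs=5):
    """Mean and sample standard deviation over repeated scorings of one paper."""
    scores = [score_paper(prompt, paper_text) for _ in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)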

Research also demonstrates that targeted prompts moderately improved detection rates, reflecting LLMs' evaluation mechanisms. For lengthy texts, LLMs initially extract key sections (abstracts, headings) based on evaluation dimensions, then conduct semantic-level interpretation and assessment. Only under specific guidance do they perform detailed analysis of individual sections from particular perspectives. When prompted to attend to methodological rigor, LLMs carefully examine specific methodological applications. When directed toward logical continuity, they comprehensively examine inter-sectional connections. However, targeted prompts may induce hypercritical evaluation, flagging issues that human readers would not consider problematic.

5.2 Deficiencies in Evaluating Scientific Rigor and Logical Coherence

The study found that the three large language models (LLMs) did not exhibit a significant difference in their scores between the original papers and the versions containing scientific or logical flaws. Furthermore, the proportion of comments in which the LLMs accurately identified these errors was substantially lower than that of human scholars. Heterogeneity analysis and qualitative examination revealed that the detection of scientific errors was highly dependent on the type of error. For flaws that could be checked against explicit rules, such as violations of empirical research methodology, the LLMs were capable of accurate identification. However, for problems that transcended clear rules and required concrete, profound empirical insight, the LLMs often failed to make correct judgments. In assessing logical coherence, the LLMs primarily relied on structural features (e.g., section headings) and superficial semantic cues rather than the deep logical argumentation of the paper. Papers in the humanities and social sciences heavily employ fact-based logical reasoning (such as inductive, abductive, and practical reasoning), which takes objective facts, universal principles, or accepted common knowledge as premises and derives conclusions through rigorous logical relationships, emphasizing factual evidence and causal linkages over subjective conjecture or formal symbolic manipulation. The LLMs' lack of embodied cognition, and consequently of a unified world schema grounded in such cognition [30], underlies their incompetence in evaluating the logic of humanities and social sciences papers.

The findings of this study exhibit a complementary and corroborative relationship with the conclusions of two other studies that employed the method of deliberately implanted errors. Tyser et al. (2024) found that LLMs performed well in detecting "overclaiming," "citation issues," and "theoretical errors" but showed lower detection rates for "technical errors" and "ethical problems." This discrepancy may stem from the former having more distinct linguistic cues or referenceable standard norms, while the latter rely more heavily on contextual understanding. It should be noted that in their study, the errors implanted into the papers were primarily generated by LLMs themselves (except for ethical issues, which were manually added). This approach may have introduced a bias favoring LLM performance. Liu et al. (2023) reported that LLMs identified deliberately implanted errors in 7 out of 13 test papers, covering both mathematical and conceptual errors. Importantly, the test papers used were artificially constructed, short computer science papers, and three different prompting strategies were employed: Direct Prompting, One-Shot Prompting, and Part-Based Prompting. These experimental settings could influence the difficulty of error detection.

In summary, this study focuses on the sensitivity and critical ability of LLMs regarding scientific and logical issues in humanities and social sciences papers, which differs from detecting scientific-technical errors or formal reasoning flaws in natural science or engineering papers. Nevertheless, the conclusions of these two types of studies can still be mutually verifying and complementary. Furthermore, this study emphasizes probing the lower bound of LLMs' paper evaluation capability. This aims to reveal potential risks and establish a baseline for improving their evaluation performance through techniques like prompt engineering and fine-tuning. Consequently, the testing intentionally avoided providing excessive guidance to the LLMs.

5.3 The Distinctive Evaluation Mechanism of LLMs for Scholarly Papers

An examination of LLMs' practical performance reveals that they often automatically adopt a "task decomposition-semantic understanding" process when evaluating lengthy papers. Specifically, based on the prompts and the paper's key features, LLMs first decompose the overall task into smaller, potentially multi-layered evaluation subtasks. Each subtask encompasses specific evaluation angles, criteria, and particular sections or aspects of the paper. Subsequently, the LLM extracts limited information from the paper relevant to each subtask for relatively in-depth semantic understanding and comparison. However, the computational resources allocated to each subtask are minimal, as is the amount of textual information processed. Consequently, the scope and depth of the subsequent semantic understanding and comparison are severely constrained. Compounded by the inherent limitations of LLMs, such as their lack of embodied cognition and real-world experience, this leads to evaluations from each subtask that lack critical depth and comprehensiveness, particularly when the prompts are broad. This mechanistic limitation likely constitutes the fundamental reason for the low detection rates of issues related to argument plausibility and logical coherence observed in LLM testing.
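For illustration, the hypothesized process can be rendered schematically as follows (a Python sketch of the claimed mechanism, not actual model internals; extract_relevant_text and shallow_semantic_check are hypothetical stand-ins):

def evaluate_paper(paper, prompt, extract_relevant_text, shallow_semantic_check):
    """Schematic of the hypothesized "task decomposition-semantic understanding" process."""
    # 1. Decompose the overall task into narrow evaluation subtasks.
    subtasks = [
        ("methodological rigor", "methods and data sections"),
        ("logical continuity", "transitions between sections"),
        ("contribution and significance", "introduction and conclusion"),
    ]
    findings = []
    for criterion, focus in subtasks:
        # 2. Extract only a limited slice of text relevant to this subtask.
        snippet = extract_relevant_text(paper, focus)
        # 3. Shallow semantic comparison of the snippet against the criterion;
        #    no reconstruction of the paper's overall "worldscape".
        findings.append(shallow_semantic_check(snippet, criterion, prompt))
    # 4. Aggregate per-subtask judgements into the final evaluation.
    return findings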

Based on schema theory (e.g., Rumelhart, 1980), a competent evaluator does not directly proceed by filling out a detailed evaluation form item by item. Instead, they must first undergo an internal psychological process of "worldscape reconstruction → meaning construction and critique." First, using the language of the paper as cues, the human evaluator "reconstructs" the worldscape depicted within it (including the methods of its construction). This process of "reconstruction" is grounded in the evaluator's own knowledge and experience and incorporates their reflection and creativity. The evaluator keenly discerns which parts of this worldscape are crucial and which might be problematic. Subsequent meaning construction and critique are then conducted upon this reconstructed worldscape: judging the validity of viewpoints and their exposition becomes a process of mutual constitution between the whole and its parts, with this worldscape as the backdrop; assessing logical coherence equates to examining the structural soundness of the worldscape itself. Furthermore, the acute awareness of key and potentially problematic parts formed during "reconstruction" makes the evaluation more targeted and productive.

Therefore, to mitigate the risks of over-trusting and over-relying on LLMs, it is necessary to develop human-AI collaborative evaluation. Since the strength of human evaluators lies in deep textual understanding and critiquing through reflection from within the text, and the strength of LLMs lies in their vast knowledge, rich data, rapid retrieval, and comparison capabilities—excelling at objective analysis and comparison from outside the text—combining their advantages can balance evaluation efficiency with quality, and breadth with depth.

5.4 Limitations

The conclusions of this study are derived from evaluation tests conducted by LLMs on academic papers (in Chinese) in the humanities and social sciences, and thus cannot be directly generalized to their ability to evaluate papers in other disciplines. The prompts used in the study, including both broad and targeted prompts, were relatively simple. They neither provided specific evaluation criteria for particular types of papers nor included exemplars of paper evaluations. Consequently, the results reflect the lower bound of LLMs' evaluation capability rather than their best possible performance. Furthermore, the proposed explanation for the identified deficiencies—the "task decomposition-semantic understanding" evaluation mechanism—is primarily inferred from limited cases and external performance observations, and thus lacks sufficient empirical evidence.

Conclusion

In summary, this study exposes fundamental deficiencies in LLMs for paper evaluation, including low internal consistency in scoring and low detection rates for flaws in argument plausibility and logical coherence. It confirms a significant gap between LLM capabilities and human evaluators when assessing academic papers (in Chinese) within the humanities and social sciences. These findings highlight the substantial risks associated with over-trusting and over-relying on LLMs—the warning that "automatically generating reviews without thoroughly reading the manuscripts will undermine the rigorous assessment process fundamental to scientific progress" is not an exaggeration [25].

To mitigate these risks, a two-pronged approach is necessary. Firstly, it is essential to fully recognize the limitations and the unique operational mechanism of LLMs in paper evaluation, and on this basis, establish rational human-AI collaborative evaluation frameworks. Secondly, efforts should focus on enhancing the LLMs' adaptation to the paper evaluation task through downstream alignment techniques such as prompt engineering, supervised fine-tuning, and reinforcement learning. Concurrently, improving the general capabilities of LLMs is crucial. This could involve integrating reasoning models to shift LLMs from statistical pattern matching towards an intelligent review paradigm closer to human critical thinking [33]. Alternatively, enhancing logical reasoning capabilities, particularly fact-based reasoning, might be achieved by incorporating reasoning tasks (e.g., chain-of-thought prompting, counterfactual generation) during pre-training, or by explicitly separating factual memory from reasoning abilities within the model (e.g., through knowledge neuron localization).

Regarding future research on LLMs' paper evaluation capability, investigating their upper limit for evaluating humanities and social sciences papers could be pursued through complex prompt engineering, constructing paper evaluation agents (multi-agent systems), and model fine-tuning on high-quality paper evaluation datasets. Furthermore, to validate the proposed "task decomposition-semantic understanding" evaluation mechanism, future work could employ methods such as visualizing attention mechanisms or inserting classifiers at intermediate model layers. Another promising direction involves integrating graph-based reasoning tools, such as Graph Neural Networks (GNNs), to guide LLMs towards incorporating a human-like evaluation process of "worldscape reconstruction → meaning construction and critique".

For the detailed notes to reviewers, please see the attachment. Thank you so much!

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors added infographics, but it was a fairly quick job and could have been done better.

Author Response

Reviewer #1: The authors added infographics, but it was a fairly quick job and could have been done better.

Authors’ response: Thank you for this constructive suggestion. We have now redesigned and improved the infographics with greater detail and clarity to better meet the academic standards and enhance visual communication.

For the detailed Author's Notes to Reviewer, please see the attachment. Thank you very much again!

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

Thanks to the authors for addressing most of my concerns and making the paper significantly better. Although all the changes are good, there are still some minor concerns to be addressed:

- The Introduction is still not backed up/supported with references. It is necessary to support the motivation, rationale, and relevance of the paper with references.

- The Abstract, Introduction, and Discussion do not sufficiently emphasize the focus on the social sciences and humanities, rather than the technical sciences.

- I also suggest changing the title of the paper to better reflect the content.

Author Response

Reviewer #3: Thanks to the authors for addressing most of my concerns and making the paper significantly better. Although all the changes are good, there are still some minor concerns to be addressed.

Authors’ response: We sincerely appreciate your positive feedback and are glad to hear that the revisions have strengthened the manuscript. We have carefully addressed the remaining minor concerns as outlined below, and thank you once again for your constructive guidance throughout the review process.

Reviewer #3: The Introduction is still not backed up/supported with references. It is necessary to support the motivation, rationale, and relevance of the paper with references.

Authors’ response: We thank you for highlighting this important aspect. We have now thoroughly revised the Introduction section and added relevant references to properly support the motivation, rationale, and relevance of our study.

Reviewer #3: The Abstract, Introduction, and Discussion do not sufficiently emphasize the focus on the social sciences and humanities, rather than the technical sciences.

Authors’ response: We thank the reviewer for this valuable observation. We have revised the Abstract, Introduction, and Discussion sections to more clearly and consistently emphasize the focus and contributions of our work within the social sciences and humanities.

The key revisions to the Abstract are presented below:

...to expose fundamental flaws often masked under conventional testing, this study employed extreme-scenario testing to systematically probe the lower performance boundaries of LLMs in assessing the scientific validity and logical coherence of papers from the humanities and social sciences (HSS).

The key revisions to the Introduction are presented below:

...particularly in the context of HSS paper assessment, which relies heavily on contextual nuance, theoretical reasoning, and interpretive depth.

... specifically within HSS contexts.

...This paper makes the following contributions:

We designed a highly credible comparative experiment focused on HSS papers, which not only effectively probes the lower bounds of LLM evaluation through extreme scenarios but also addresses the challenge of measuring evaluation competence due to the highly subjective nature of HSS paper assessment.

The study exposes fundamental limitations in LLMs’ paper evaluation capabilities, particularly in the domains of scientific and logical evaluation, which are typically masked under naturalistic conditions.

We provide a novel interpretation of the distinctive mechanisms underlying LLM-based paper evaluation, which differ fundamentally from those of human evaluators and help explain their shortcomings in capturing fact-based logical reasoning and theoretical coherence in HSS writing.

In the Discussion section, in addition to the original explanation that attributed the low detection rate of logical coherence issues in LLMs to the fact-based logical reasoning inherent in HSS papers, the revised manuscript further emphasizes the context of HSS paper evaluation when comparing the distinct evaluation mechanisms of LLMs and humans, as well as when elaborating on human-AI collaborative assessment. For easy identification, all changes made to the manuscript are shown in bold.

Reviewer #3: I also suggest changing the title of the paper to better reflect the content.

Authors’ response: We sincerely thank the reviewer for this valuable suggestion. We have carefully considered the comment and revised the title to more accurately and clearly reflect the core focus and contributions of our study. The new title is:

Applied with Caution: Extreme-Scenario Testing Reveals Significant Risks in Using LLMs for Humanities and Social Sciences Paper Evaluation

We believe this revised title offers several key improvements:

- It immediately establishes the high-stakes context of applying LLMs in academic evaluation, specifically within the challenging domain of the Humanities and Social Sciences (HSS).

- It precisely describes our methodological approach ("Extreme-Scenario Testing"), signaling to readers the rigorous and probing nature of our investigation.

- It unambiguously communicates our central finding ("Reveals Significant Risks"), setting a clear expectation for the paper's critical perspective and its contribution to risk awareness.

- The main clause "Applied with Caution" serves as a concise and impactful takeaway, directly translating our research into a practical, actionable guideline for the academic community.

We are confident that this new title better captures the paper's essence and aligns more directly with its key arguments and evidence.

For the detailed Author's Notes to Reviewer, please see the attachment. Thank you very much! 

 

Author Response File: Author Response.docx
