Article
Peer-Review Record

Exploring the Potential of Anomaly Detection Through Reasoning with Large Language Models

by Sungjune Park * and Daeseon Choi *
Reviewer 1:
Reviewer 2: Anonymous
Appl. Sci. 2025, 15(19), 10384; https://doi.org/10.3390/app151910384
Submission received: 1 August 2025 / Revised: 5 September 2025 / Accepted: 18 September 2025 / Published: 24 September 2025
(This article belongs to the Special Issue AI-Enabled Next-Generation Computing and Its Applications)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript addresses how to perform anomaly detection (e.g., spam, fake news, toxic content) efficiently without task-specific retraining by leveraging Prompt Engineering techniques on transformer-based LLMs. It analyzes and compares zero-shot, few-shot, Chain-of-Thought (CoT), Self-Consistency (SC), and Tree-of-Thought (ToT) prompting, finding CoT and SC most effective, with ToT showing biases and limited gains in some contexts. It provides a comprehensive, evidence-backed synthesis of prompting strategies for some anomaly detection tasks, including an empirical ablation study across multiple datasets (SMS Spam, Fake News, Toxic) and models (GPT-3.5-Turbo, GPT-4o). It clarifies when advanced prompts yield the largest performance gains, highlights GPT-4o’s superior gains with CoT/SC, and outlines practical adaptive-prompting avenues (Self-Criticism, RAG, multi-agent debate, decomposition). Strengths include its breadth, actionable guidance on prompt selection, and integration of empirical results with existing literature.

Some weaknesses:
1. The related work section on anomaly detection is insufficiently comprehensive. Several categories (e.g., machine learning–based methods and deep learning–based methods) are represented by only a single reference, and key foundational features of these approaches are not adequately introduced. We recommend broadening citations within each category, distilling core principles (e.g., feature representations, model architectures, optimization paradigms), and highlighting how these baselines compare to prompting-based approaches.

2. The study relies exclusively on GPT-series LLMs. To strengthen claims about prompting methods, include a broader set of large language models (e.g., PaLM, LLaMA/LLAMA-2, Claude, Mistral, and open-source alternatives) and report results where feasible. A cross-model analysis would clarify whether observed gains generalize beyond GPT-family models and help identify model-dependent effects.

3. For each detection task, only a single monolingual dataset is used, which limits generalizability. To provide a more robust assessment, it is suggested to evaluate prompting-based methods across multiple datasets for each detection task and to compare them with established anomaly-detection results from the related literature. This broader empirical coverage is essential to illustrate the true performance and generalizability of prompting-based approaches.

Author Response

Comments 1: The related work section on anomaly detection is insufficiently comprehensive. Several categories (e.g., machine learning–based methods and deep learning–based methods) are represented by only a single reference, and key foundational features of these approaches are not adequately introduced. We recommend broadening citations within each category, distilling core principles (e.g., feature representations, model architectures, optimization paradigms), and highlighting how these baselines compare to prompting-based approaches.
Response 1: We are grateful for this constructive comment. In the revised manuscript, we have augmented the related work by citing additional literature for the main categories of anomaly detection (statistical, machine learning, and deep learning techniques), summarizing the key concepts of each approach (feature representation, model architecture, optimization methods, etc.), and highlighting the differences between these existing techniques and our prompting-based approach.

Comments 2: The study relies exclusively on GPT-series LLMs. To strengthen claims about prompting methods, include a broader set of large language models (e.g., PaLM, LLaMA/LLAMA-2, Claude, Mistral, and open-source alternatives) and report results where feasible. A cross-model analysis would clarify whether observed gains generalize beyond GPT-family models and help identify model-dependent effects.
Response 2: We appreciate the reviewer’s insightful observation. We acknowledge that including a wider variety of LLM families (e.g., PaLM, LLaMA-2, Claude, Mistral) could provide additional perspectives. However, the central focus of our study, as reflected in the title “Exploring the Potential of Anomaly Detection through Reasoning with Large Language Models”, is to examine the role of reasoning-oriented prompt engineering strategies rather than to benchmark model architectures.

To ensure a consistent and controlled environment, we chose the GPT family as a representative and widely accessible baseline, enabling us to isolate the effect of different reasoning-based prompting methods such as Zero-shot, Few-shot, Chain-of-Thought, Self-Consistency, and Tree-of-Thought.
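To make this controlled comparison concrete, the sketch below shows how a single input can be wrapped in different reasoning-oriented prompting strategies while the model and decoding settings stay fixed. It is a minimal illustration assuming the OpenAI Python chat completions API; the build_prompt helper, the prompt wording, and the sample message are hypothetical and are not the exact prompts used in the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set


def build_prompt(strategy: str, text: str) -> str:
    """Wrap one input text in a given prompting strategy (wording is illustrative)."""
    if strategy == "zero_shot":
        return f"Classify the following SMS as 'spam' or 'ham':\n{text}"
    if strategy == "few_shot":
        return (
            "Classify each SMS as 'spam' or 'ham'.\n"
            "SMS: 'Are we still meeting at 6?' -> ham\n"
            "SMS: 'FREE entry! Text WIN to 80086 now!' -> spam\n"
            f"SMS: '{text}' ->"
        )
    if strategy == "chain_of_thought":
        return (
            "Classify the following SMS as 'spam' or 'ham'. Reason step by step "
            "about the sender's intent, urgency cues, and requested actions, then "
            f"give the final label.\nSMS: {text}"
        )
    raise ValueError(f"unknown strategy: {strategy}")


def classify(strategy: str, text: str, model: str = "gpt-4o") -> str:
    """Query the same model with the same settings; only the prompt changes."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(strategy, text)}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    sample = "You have won a free prize! Call now to claim your reward."
    for strategy in ("zero_shot", "few_shot", "chain_of_thought"):
        print(strategy, "->", classify(strategy, sample))
```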




Comments 3: For each detection task, only a single monolingual dataset is used, which limits generalizability. To provide a more robust assessment, it is suggested to evaluate prompting-based methods across multiple datasets for each detection task and to compare them with established anomaly-detection results from the related literature. This broader empirical coverage is essential to illustrate the true performance and generalizability of prompting-based approaches.
Response 3: We thank the reviewer for raising this important point. We acknowledge that our evaluation is limited to a small number of publicly available datasets (SMS Spam, Fake News Corpus, Toxic Comment). This design choice was intentional, as our primary objective was to explore the potential of reasoning-based prompting strategies for anomaly detection rather than to provide an exhaustive empirical survey.

In the revised manuscript, we explicitly state this limitation in the Conclusion section, noting that future work should extend the evaluation to multilingual and multi-domain datasets and benchmark against broader anomaly detection results reported in the literature. By doing so, we make clear that the present study is a first step toward highlighting the promise of reasoning with LLMs for anomaly detection, while more comprehensive validation remains an important direction for future research.

Reviewer 2 Report

Comments and Suggestions for Authors

The article discusses the potential of large language models (LLMs) in anomaly detection using various prompt engineering techniques, such as Chain-of-Thought and Tree-of-Thought. While the paper presents interesting comparative results, its scientific value could be significantly enhanced with key methodological and structural improvements. The following review highlights the areas that require attention for the work to make a more meaningful contribution to the field.

Structural and methodological issues:
1. The paper fails to compare its results against the state-of-the-art (SOTA) on publicly available datasets. For instance, since the authors used the SMS Spam Collection dataset, they should have benchmarked their findings (e.g. 96% accuracy for GPT-4o) against other established methods. This omission makes it impossible to determine if their prompt engineering techniques truly advance the field.

2. The paper's structure lacks a logical flow. The discussion of the limitations of Tree-of-Thought (ToT) and the results of the ablation study are placed in the "Discussion" section (chapter 5), rather than in the "Results" section (chapter 4), where the presentation of data should be directly followed by its interpretation.

3. Section 6, which covers adaptive techniques, is misplaced. It appears after the discussion, which is illogical. Its theoretical nature suggests it should either be part of the introduction or a dedicated "Future Work" section placed after the conclusions.

4. The analysis of ToT is insufficient. Despite its poor performance, the authors do not provide a deeper investigation into why the technique failed. They instead rely on a single, incomplete example, rather than analyzing how the model arrived at incorrect conclusions in a broader set of cases.

Results presentation issues:
5. The paper lacks sufficient interpretation and specific examples. The experimental results are presented without a deeper analysis of their causes. There are no concrete examples to illustrate how Chain-of-Thought (CoT) reasoning leads to better results than the Few-shot technique. For instance, showing how CoT processes a message step-by-step versus how Few-shot makes an incorrect classification based on superficial similarities would be highly valuable.

6. The formatting is inconsistent. Accuracy is presented as a percentage (e.g. 91%), while the F1-score is given as a decimal value (e.g. 0.95). A more consistent standard (e.g. 0.91 and 0.95) would improve clarity and readability.

7. The ablation study is incomplete. The experiments should have been extended to determine if increasing the number of examples or responses continues to improve results until a saturation point is reached. For example, the authors could have tested variants with 40/10, 50/12, or even 60/14 Few-shot examples to confirm that the performance gains observed up to 30/8 examples do not continue indefinitely.

Suggestions (optional):
8. The authors should consider that to achieve optimal results, it is necessary to fine-tune model parameters, such as lowering the temperature, to reduce the randomness of responses.

9. The paper needs to move from a theoretical description of advanced techniques (e.g. RAG) to practical experiments to verify their true potential in anomaly detection.

Author Response

Comments 1: The paper fails to compare its results against the state-of-the-art (SOTA) on publicly available datasets. For instance, since the authors used the SMS Spam Collection dataset, they should have benchmarked their findings (e.g. 96% accuracy for GPT-4o) against other established methods. This omission makes it impossible to determine if their prompt engineering techniques truly advance the field.

Response 1: We agree that comparing our results with prior state-of-the-art (SOTA) methods provides valuable context. While our study does not introduce new baseline experiments, we have revised the manuscript to reference widely reported results from the literature on the datasets we used.

For the SMS Spam Collection dataset, traditional machine learning approaches such as SVM and ensemble methods typically report accuracy in the 90–95% range (Cormack, 2008), which is comparable to our best result of 96% with GPT-4o using reasoning-based prompting. In fake news detection, deep learning methods such as CNNs and LSTMs generally achieve accuracies in the 80–90% range (Shu et al., 2017), aligning with our results using CoT and Few-shot prompting. For toxic comment detection, neural network models such as CNN-GRU architectures report F1 scores around 0.80–0.85 (Zhang et al., 2018), consistent with the F1 scores of approximately 0.87–0.93 achieved in our prompting-based experiments.

We emphasize that the novelty of our work lies not in surpassing SOTA baselines through model-specific optimization, but in demonstrating that reasoning-oriented prompt engineering techniques can achieve competitive performance without any task-specific training or fine-tuning.

 

Cormack, G. V. (2008). “Email spam filtering: A systematic review.” Foundations and Trends in Information Retrieval, 1(4), 335–455.
Shu, K., Sliva, A., Wang, S., Tang, J., & Liu, H. (2017). “Fake News Detection on Social Media: A Data Mining Perspective.” SIGKDD Explorations, 19(1), 22–36.
Zhang, Z., Robinson, D., & Tepper, J. (2018). “Detecting Hate Speech on Twitter Using a Convolution-GRU Based Deep Neural Network.” In Proceedings of the European Semantic Web Conference (ESWC).

 

Comments 2: The paper's structure lacks a logical flow. The discussion of the limitations of Tree-of-Thought (ToT) and the results of the ablation study are placed in the "Discussion" section (chapter 5), rather than in the "Results" section (chapter 4), where the presentation of data should be directly followed by its interpretation.

Response 2: We agree that the interpretation of experimental results should directly follow their presentation. Accordingly, we have revised the manuscript structure so that the results of the ablation study and the analysis of Tree-of-Thought (ToT) are now integrated into Section 4 (Results), immediately after the corresponding quantitative results are presented. We believe this restructuring improves the logical flow of the manuscript and makes the connection between results and interpretation clearer.

 

Comments 3: Section 6, which covers adaptive techniques, is misplaced. It appears after the discussion, which is illogical. Its theoretical nature suggests it should either be part of the introduction or a dedicated "Future Work" section placed after the conclusions.

Response 3: We appreciate the reviewer’s observation. We agree that the placement of Section 6 could interrupt the logical flow of the manuscript. In the revised version, we have moved Section 6 (Adaptive Techniques for Next-Generation Anomaly Detection) to follow the Conclusion, presenting it as part of the Future Work section. This restructuring improves the manuscript’s organization by ensuring that results and discussion are presented sequentially, while theoretical and forward-looking perspectives are reserved for the closing sections.

 

 

Comments 4: The analysis of ToT is insufficient. Despite its poor performance, the authors do not provide a deeper investigation into why the technique failed. They instead rely on a single, incomplete example, rather than analyzing how the model arrived at incorrect conclusions in a broader set of cases.
Response 4: We agree that the initial analysis of Tree-of-Thought (ToT) was limited. While we do not add new case studies, we have revised the manuscript to expand the explanation beyond the single example originally provided. Specifically, we now discuss general patterns observed in ToT misclassifications, such as consensus bias among the simulated experts and over-reliance on keywords, both of which led to false positives. These broader observations help clarify why ToT underperformed compared to CoT and Self-Consistency. We also note this as a limitation in the Discussion section and emphasize that systematic exploration of alternative ToT implementations remains an important direction for future research.
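For illustration, a sketch of a Tree-of-Thought-style prompt in the "simulated experts" form is shown below; the wording and the TOT_PROMPT_TEMPLATE name are hypothetical and are not the prompt used in the study. Prompts of this form ask several imagined experts to reason in rounds and converge on a single label, which is where the consensus bias noted above can emerge.

```python
# Hypothetical Tree-of-Thought-style prompt template (not the authors' exact prompt).
TOT_PROMPT_TEMPLATE = (
    "Three experts in online content moderation will classify the comment below as "
    "'toxic' or 'non-toxic'. Each expert writes one step of reasoning, reads the "
    "other experts' steps, and may revise their view. After three rounds, report "
    "the single label the experts agree on.\n\nComment: {comment}"
)

print(TOT_PROMPT_TEMPLATE.format(comment="You people never get anything right."))
```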

 

Comments 5: The paper lacks sufficient interpretation and specific examples. The experimental results are presented without a deeper analysis of their causes. There are no concrete examples to illustrate how Chain-of-Thought (CoT) reasoning leads to better results than the Few-shot technique. For instance, showing how CoT processes a message step-by-step versus how Few-shot makes an incorrect classification based on superficial similarities would be highly valuable.

Response 5: While we do not add new case studies, we have strengthened the interpretation of the results in the revised manuscript. Specifically, we highlight why Chain-of-Thought (CoT) prompting outperforms Few-shot prompting in anomaly detection tasks. CoT encourages models to articulate step-by-step reasoning, which reduces the risk of relying solely on superficial similarities between examples and input data. In contrast, Few-shot prompting sometimes leads the model to overfit to patterns in the provided examples, resulting in misclassifications when the input text is semantically ambiguous. We have incorporated this explanation into the results section to make the performance differences clearer.

 

Comments 6: The formatting is inconsistent. Accuracy is presented as a percentage (e.g., 91%), while the F1-score is given as a decimal value (e.g., 0.95). A more consistent standard (e.g., 0.91 and 0.95) would improve clarity and readability.

Response 6: We thank the reviewer for pointing out this inconsistency. In the revised manuscript, we have standardized the reporting format for performance metrics. Both accuracy and F1-score are now presented in decimal form (e.g., 0.91 and 0.95) to ensure clarity, consistency, and readability across all tables and text descriptions.

 

Comments 7: The ablation study is incomplete. The experiments should have been extended to determine if increasing the number of examples or responses continues to improve results until a saturation point is reached. For example, the authors could have tested variants with 40/10, 50/12, or even 60/14 Few-shot examples to confirm that the performance gains observed up to 30/8 examples do not continue indefinitely.

Response 7: We acknowledge that our ablation study was limited to a maximum of 30/8 examples for Few-shot prompting and 7 responses for Self-Consistency. While we did not extend the experiments further due to computational constraints, we observed that performance improvements had already plateaued at these levels. For example, in both spam and toxic comment detection, accuracy and F1-scores stabilized between the 20/6 and 30/8 configurations, and additional SC responses beyond 5 yielded diminishing returns.

We have clarified this observation in the revised manuscript, emphasizing that the results indicate a saturation trend rather than a continued improvement.
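A minimal sketch of the Self-Consistency aggregation referred to here is given below. It assumes the OpenAI Python chat completions API; the prompt handling and the deliberately naive label parser are illustrative rather than the implementation used in the study. The n_samples parameter corresponds to the number of SC responses, whose gains plateaued beyond roughly five in the ablation above.

```python
from collections import Counter

from openai import OpenAI

client = OpenAI()


def self_consistency_label(prompt: str, n_samples: int = 5, model: str = "gpt-4o") -> str:
    """Sample several reasoning paths for the same prompt and return the majority label."""
    labels = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # sampling diversity is what makes the vote informative
        )
        text = response.choices[0].message.content.lower()
        labels.append("spam" if "spam" in text else "ham")  # naive parsing, for illustration only
    return Counter(labels).most_common(1)[0][0]
```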

 

Comments 8: The authors should consider that to achieve optimal results, it is necessary to fine-tune model parameters, such as lowering the temperature, to reduce the randomness of responses.

Response 8: In this study, we intentionally did not perform parameter tuning (e.g., adjusting temperature) in order to ensure consistency and comparability across all experiments. Our aim was to evaluate the effectiveness of reasoning-based prompt engineering strategies under standardized and widely used default configurations, rather than to optimize performance through model-specific hyperparameter adjustments. This design choice allowed us to isolate the contribution of prompting methods themselves. We have clarified this rationale in the revised manuscript and noted that parameter tuning may further enhance results, which we leave as a potential extension for future research.
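For reference, a hedged sketch of the parameter the reviewer mentions is shown below; it assumes the OpenAI Python chat completions API, and the model name and prompt are placeholders rather than the study's configuration, which deliberately kept default settings.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "Classify this SMS as 'spam' or 'ham': ..."}],
    temperature=0,  # near-deterministic decoding; the study used the default instead
)
print(response.choices[0].message.content)
```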

 

Comments 9: The paper needs to move from a theoretical description of advanced techniques (e.g. RAG) to practical experiments to verify their true potential in anomaly detection.
Response 9: We appreciate the reviewer’s insightful comment. We agree that empirical validation of adaptive prompting techniques such as Retrieval-Augmented Generation (RAG) and multi-agent debate frameworks would provide stronger evidence of their utility in anomaly detection. However, the current study was scoped as an exploratory investigation into reasoning-oriented prompting strategies, and thus we presented these advanced techniques as theoretical prospects rather than implemented experiments. In the revised manuscript, we have clarified this limitation and explicitly identified the empirical evaluation of adaptive prompting methods as a key direction for future work.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

Based on the review comments and the revised manuscript, the authors have successfully addressed all points of concern. They have added a comparison against state-of-the-art methods, restructured the paper to improve logical flow, and moved the discussion of advanced techniques to a new Future Work section. Furthermore, they provided a more in-depth analysis of the limitations of the ToT technique, clarified why CoT outperformed Few-shot prompting, and standardized the formatting of performance metrics. The authors also explained the computational constraints of their ablation study and confirmed their intentional choice not to tune model parameters.

The comprehensive nature of these revisions and their full alignment with the feedback demonstrate that the manuscript is now ready for publication.
