1. Introduction
In recent years, anomaly detection in digital environments has become a crucial research topic [1,2,3]. Anomalies such as spam messages and fake news can lead to serious issues, including breaches of privacy, social disruption, and the degradation of information reliability. While various approaches have been explored to address these challenges, traditional anomaly detection models are often constrained by their need to be specifically trained for individual tasks. This requirement consumes significant time and resources and limits their flexibility in addressing a diverse range of domains and emerging problems.
In this context, Attention mechanisms [4] and Transformer-based Large Language Models (LLMs) [5,6,7,8] have opened new possibilities. In particular, the application of Prompt Engineering [9] in LLMs holds potential for achieving superior performance and efficiency compared to traditional anomaly detection techniques. Prompt Engineering optimizes input data so that LLMs can better understand complex problems and generate appropriate responses. Moreover, because it requires no separate training process for each task, it significantly reduces the cost and time associated with training while offering greater flexibility and adaptability across a variety of tasks and scenarios. Although research on utilizing Prompt Engineering for anomaly detection is still in its early stages, its diverse implementations invite a survey-based analysis to identify best practices and emerging trends.
This paper provides a comprehensive survey of how Prompt Engineering techniques can enhance anomaly detection performance, particularly in the context of spam messages and fake news. Specifically, we compare and analyze various Prompt Engineering techniques, such as Zero-shot prompting [10], Few-shot prompting [6,11], Chain-of-Thought (CoT) prompting [12], Self-Consistency (SC) [13], and Tree-of-Thought (ToT) prompting [14,15,16], to evaluate how each technique contributes to anomaly detection. In addition, we provide a brief but comprehensive overview of advanced Prompt Engineering methods applicable to anomaly detection, including Self-Refine [17], Retrieval-Augmented Generation [18], and Judgment of Thought [19], which have the potential to further refine detection accuracy and robustness. Through this analysis, we aim to identify the potential and limitations of Prompt-Engineering-based anomaly detection and suggest directions for future research.
This study makes several significant contributions to the field of anomaly detection using Prompt Engineering as a transformative paradigm. First, by synthesizing and critically examining a wide range of Prompt Engineering techniques, it offers valuable guidance for researchers in selecting the most appropriate methods for specific anomaly detection scenarios. Second, by highlighting emerging advanced approaches, the survey broadens the applicability of generative AI beyond traditional text-based anomaly detection. Third, it lays a foundation for future research by identifying key challenges and potential improvements, thereby guiding the development of next-generation AI technologies for more effective anomaly detection.
3. Prompt Design for Anomaly Detection: Exploring Its Effectiveness
In this section, we describe how prompts were designed for each Prompt-Engineering method for our experiments. The prompts were crafted to align with the characteristics and objectives of each methodology, aiming to maximize their effectiveness in anomaly detection. The content within the angle brackets (<>) varies depending on the task (e.g., spam detection, fake news detection, harmful comment detection). All user prompts and outputs are as follows:
Zero-shot Prompting involves providing the model with a task description and generating responses without any additional task-specific training. For anomaly detection, the zero-shot prompt is designed to enable the model to identify anomalies in a general context. This prompt guides the model to determine whether a sentence is spam based on its characteristics and context. The strength of zero-shot prompting lies in the model’s reliance on pre-existing knowledge, allowing it to utilize a broad domain of knowledge. Our zero-shot prompt design for anomaly detection is as follows:
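As an illustrative sketch rather than the authors' verbatim prompt, the snippet below shows how such a zero-shot classification call could be issued. It assumes the OpenAI Python client, the system-prompt wording is a hypothetical spam-detection instantiation modeled on the templates given for the other methods in this section, and the decoding parameters mirror the defaults reported in the experimental setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Hypothetical zero-shot system prompt, modeled on the templates in this section.
SYSTEM_PROMPT = (
    "You are a helpful spam detection expert. Analyze the given text and respond "
    'with "spam" if the input is an anomaly or "normal" if it is normal.'
)

def zero_shot_classify(text: str, model: str = "gpt-4o") -> str:
    """Classify a single message with a zero-shot prompt (no in-context examples)."""
    response = client.chat.completions.create(
        model=model,
        temperature=1,        # default decoding settings reported in the experimental setup
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()

print(zero_shot_classify("WINNER!! Claim your free prize now by replying to this message."))
```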
Few-shot Prompting provides the model with a few examples to guide it in generating the correct responses. This approach is advantageous for enhancing the model’s accuracy by providing context and patterns. By providing specific examples, the model can better understand the task and improve detection accuracy. Our few-shot prompt design for anomaly detection is as follows:
System prompt: “You are a helpful <task> detection expert. Look at the given examples, analyze the given text, and respond with “<anomaly>” if the input is an anomaly or “<normal>” if it is normal.”
Example:
<example text>, Classification: <anomaly>/<normal>
<example text>, Classification: <anomaly>/<normal>
…
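To make the template above concrete, the sketch below serializes a handful of labeled examples into a single user message following the "text, Classification: label" layout; the example strings and the final "Input:" line are hypothetical additions for illustration, and the result would be sent together with the Few-shot system prompt.

```python
def build_few_shot_prompt(examples, text):
    """Serialize labeled examples in the 'text, Classification: label' layout above.

    `examples` is a list of (example_text, label) tuples; the strings used here
    are hypothetical illustrations, not items from the evaluated datasets.
    """
    lines = ["Example:"]
    for example_text, label in examples:
        lines.append(f"{example_text}, Classification: {label}")
    lines.append("")
    lines.append(f"Input: {text}")  # the 'Input:' prefix is an assumed convention
    return "\n".join(lines)

few_shot_examples = [
    ("Congratulations! You have won a $1000 gift card. Reply WIN to claim.", "spam"),
    ("Are we still meeting for lunch at noon?", "normal"),
]
user_prompt = build_few_shot_prompt(
    few_shot_examples, "Call now to claim your free vacation package!"
)
```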
CoT prompting encourages the model to work through problems step-by-step, which is useful for tasks requiring logical reasoning and deep analysis. By breaking the analysis process into explicit steps, the model can detect anomalies more accurately. Our CoT prompt design for anomaly detection is as follows:
System prompt: “You are a helpful <task> detection expert. Look at the given examples and reasoning, analyze the given text, and respond with “<anomaly>” if the input is an anomaly or “<normal>” if it is normal.”
Example:
<example text>, Classification: <anomaly>/<normal> Reasoning: <reasoning>
<example text>, Classification: <anomaly>/<normal> Reasoning: <reasoning>
…
SC improves reliability by generating multiple responses and selecting the most consistent one. This method reduces the impact of random errors and inconsistencies, allowing the model to provide a more trustworthy final response. The design for SC in anomaly detection uses the same prompt as CoT, repeated multiple times, and selects the most consistent response as the final output.
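In implementation terms, SC reduces to sampling the same CoT prompt several times and taking a majority vote over the returned labels. The following is a minimal sketch under that assumption; `classify_once` stands in for any single-prompt call, such as the zero-shot sketch above with a CoT-style system prompt, and is not a function defined by the paper.

```python
from collections import Counter

def self_consistent_classify(text, classify_once, n_samples=3):
    """Sample the same prompt `n_samples` times and return the most frequent label.

    `classify_once` is any callable that sends one prompt and returns a label
    string; 3 samples matches the SC setting reported in the experimental setup.
    """
    votes = [classify_once(text) for _ in range(n_samples)]
    label, _count = Counter(votes).most_common(1)[0]
    return label
```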
ToT prompting is a method that creates a tree structure to explore various reasoning paths. This method is particularly useful for tasks that require complex decision-making processes. To apply ToT to anomaly detection, we adopted the single-prompt ToT approach described by Hulbert et al. [16]. This approach guides the model to explore multiple reasoning paths, enhancing anomaly detection capabilities through comprehensive analysis. By designing prompts tailored to highlight anomalous patterns or behaviors, Prompt Engineering is expected to improve the anomaly detection performance of LLMs. Our ToT prompt design for anomaly detection is as follows:
System prompt: “Imagine three different <task> detection experts are answering the given input. In the first step, all three experts write down their thoughts and then share them with the group. Then, all the experts move on to the next step and discuss. If any expert realizes a mistake at any point, that expert leaves. Continue the discussion until all three experts agree on the input. If all experts finish the discussion, output the final decision.”
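Because the single-prompt ToT variant packs the entire multi-expert discussion into one completion, the only additional implementation step is extracting the final decision from the free-form discussion. The sketch below reuses the chat-call pattern from the zero-shot example, instantiates the template above for spam detection, and applies a simple last-line keyword check; this parsing heuristic is an assumption, since the paper does not specify how the final decision is extracted.

```python
TOT_SYSTEM_PROMPT = (
    "Imagine three different spam detection experts are answering the given input. "
    "In the first step, all three experts write down their thoughts and then share "
    "them with the group. Then, all the experts move on to the next step and discuss. "
    "If any expert realizes a mistake at any point, that expert leaves. Continue the "
    "discussion until all three experts agree on the input. If all experts finish the "
    "discussion, output the final decision."
)

def tot_classify(text, chat_fn):
    """Run the single-prompt ToT discussion and read a label off the final line.

    `chat_fn(system_prompt, user_prompt)` should return the model's full text
    output; the last-line keyword check is a hypothetical parsing heuristic.
    """
    discussion = chat_fn(TOT_SYSTEM_PROMPT, text)
    lines = discussion.strip().splitlines() or [""]
    final_line = lines[-1].lower()
    return "spam" if "spam" in final_line and "not spam" not in final_line else "normal"
```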
4. Experiment
4.1. Experimental Setting
In this section, we describe the experimental setup designed to evaluate the effectiveness of various Prompt-Engineering techniques for anomaly detection. The primary focus of the experiment is anomaly detection in textual data, such as spam detection and fake news detection. For the experiments, we used publicly available datasets related to each anomaly detection task. For spam detection, we utilized the SMS Spam dataset (SMS) [26], which contains labeled spam and legitimate (non-spam) messages. For fake news detection, we employed the Fake News Corpus dataset (Fake) [27], consisting of labeled fake and real news articles, and for toxic comment detection, the Toxic Comment dataset (Toxic) [28] was adopted.
The experiments were conducted using state-of-the-art large language models based on the Transformer architecture, specifically GPT-3.5-Turbo and GPT-4o (GPT-4 Omni). The model parameters were set to their default configurations: temperature and Top-p at 1, and Frequency penalty and Presence penalty at 0. Prompts were designed according to the characteristics and goals of each Prompt-Engineering technique evaluated: Zero-shot prompts, which provide only the task description without any in-context examples; Few-shot prompts, which provide the model with a few examples to guide correct responses; CoT prompts, which encourage the model to reason step-by-step; SC prompts, which generate multiple responses and select the most consistent one; and ToT prompts, which explore various reasoning paths through a tree structure. The number of examples used in Few-shot prompting and CoT prompting was 10 for the SMS Spam and Toxic Comment datasets and 4 for the Fake News Corpus dataset. The number of responses generated in SC was set to 3 for all datasets.
The evaluation procedure consisted of four stages: prompt design, model execution, response collection, and performance evaluation. Specific prompts were designed for each anomaly detection task and technique, and these were used to run the GPT-3.5-turbo and GPT-4o models on the preprocessed datasets. The models’ responses to each input text were collected and compared against the labeled data to assess performance. The performance of each Prompt-Engineering technique was measured using accuracy and F1 score. Accuracy represents the proportion of instances where the model’s prediction matches the actual label, indicating how often the model makes correct predictions across the entire dataset. F1 score, the harmonic mean of precision and recall, is particularly important for imbalanced datasets, which are common in anomaly detection tasks. It evaluates the balance between how precisely the model identifies the positive class (precision) and how many of the actual positive instances it detects (recall). In tasks such as spam detection or fake news detection, where missing certain instances can lead to significant issues, the F1 score plays a crucial role in providing a comprehensive assessment of the model’s performance.
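For reference, both metrics can be computed directly from the collected responses and the gold labels. The sketch below uses scikit-learn and maps the anomalous class (spam, fake, or toxic) to the positive label, which is an assumed labeling convention rather than a detail stated in the paper; the example arrays are placeholders.

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder arrays: 1 marks the anomalous class (spam / fake / toxic), 0 marks normal.
y_true = [1, 0, 1, 1, 0, 0, 1]   # labels from the dataset
y_pred = [1, 0, 1, 0, 0, 0, 1]   # labels parsed from the model's responses

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, pos_label=1)  # harmonic mean of precision and recall
print(f"Accuracy: {accuracy:.2f}, F1: {f1:.2f}")
```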
4.2. Experimental Results
In this section, we present the experimental results evaluating the effectiveness of various Prompt-Engineering techniques using the GPT-3.5-Turbo and GPT-4o models for tasks such as SMS spam detection, fake news detection, and toxic comment detection. The performance of each technique is assessed based on accuracy and F1 score. A summary of the performance of different Prompt-Engineering techniques across anomaly detection tasks is shown in Table 1. A comparative analysis of the results reveals that Prompt-Engineering techniques have the potential to significantly improve anomaly detection performance in both the GPT-3.5-Turbo and GPT-4o models.
In the SMS Spam detection dataset, a comparison between prompt techniques reveals that the CoT and SC techniques were particularly effective in boosting performance. While both models achieved similar accuracy and F1 scores in Zero-shot and Few-shot learning, the application of CoT and SC techniques resulted in a significant performance increase, with GPT-4o achieving 96% accuracy and an F1 score of 0.98. In contrast, GPT-3.5-Turbo, despite showing improvements, still lagged slightly behind, with a 94% accuracy and an F1 score of 0.96 for the same techniques. This suggests that although both models benefited from CoT and SC, GPT-4o was able to leverage these techniques more efficiently, demonstrating a greater performance boost, particularly in tasks requiring complex reasoning.
In the Fake News detection dataset, the performance gap between prompt techniques was also noticeable. In Zero-shot and Few-shot learning, the differences between GPT-3.5-Turbo and GPT-4o were minimal. However, when CoT and SC techniques were applied, GPT-4o demonstrated superior performance, achieving 90% accuracy and an F1 score of 0.91 with CoT and 86%/0.87 with SC. GPT-4o also excelled in Few-shot learning, achieving a 92% accuracy and an F1 score of 0.92, indicating that even with minimal training data, the model could deliver high-level performance. This suggests that CoT and SC techniques played a critical role in enhancing model performance in fake news detection tasks, with GPT-4o particularly benefiting from their application, resulting in more accurate predictions.
In the Toxic dataset, the differences between prompt techniques were even more pronounced. While GPT-4o already outperformed GPT-3.5-Turbo in Zero-shot and Few-shot learning, the performance gap widened significantly with the application of CoT and SC techniques. With CoT and SC, GPT-4o achieved 87% accuracy and an F1 score of 0.93, significantly outperforming GPT-3.5-Turbo’s 79% accuracy and 0.87 F1 score. This highlights GPT-4o’s ability to better handle complex reasoning tasks using CoT and SC techniques, particularly in tasks like toxic content detection. The SC technique, which generates multiple answers and selects the most consistent one, was especially effective in boosting performance in tasks requiring nuanced text analysis.
In contrast, the ToT (Tree-of-Thought) technique did not lead to performance improvements as significant as those of CoT and SC. In spam and fake news detection, the gains were less pronounced, with ToT showing only modest improvements, primarily on the Toxic dataset. While GPT-4o still outperformed GPT-3.5-Turbo with ToT, the substantial differences seen with CoT and SC were not replicated here.
In conclusion, CoT and SC techniques were instrumental in improving performance across both models, with GPT-4o showing the greatest benefit, especially in tasks requiring complex reasoning. While ToT led to more modest performance improvements, it still maintained consistent gains across datasets. These results highlight the importance of advanced prompt techniques in enhancing model performance, particularly in state-of-the-art models like GPT-4o.
To contextualize our findings, we compare them with reported state-of-the-art results from the literature. On the SMS Spam Collection dataset, traditional machine learning approaches such as SVM and ensemble methods have typically achieved 90–95% accuracy [29], which is comparable to our 96% accuracy with GPT-4o using reasoning-based prompting. In fake news detection, deep learning methods such as CNNs and LSTMs report accuracies in the 80–90% range [30], which aligns with our results using CoT and Few-shot prompting. For toxic comment detection, neural architectures like CNN-GRU models reach F1 scores of about 0.80–0.85 [31], consistent with the F1 scores of 0.87–0.93 achieved in our prompting-based experiments.
The superior performance of Chain-of-Thought (CoT) prompting compared to Few-shot prompting can be attributed to its reasoning process. By explicitly generating intermediate steps, CoT allows the model to consider contextual cues and logical consistency before reaching a decision. This step-by-step reasoning mitigates errors caused by surface-level similarities, which Few-shot prompting is more prone to. For example, in spam detection, Few-shot prompting occasionally misclassified benign messages containing promotional keywords, whereas CoT was able to reason about the overall intent and correctly classify them as non-spam.
Although our ablation study was limited to at most 30/8 examples for Few-shot prompting (30 for the SMS and Toxic datasets, 8 for the Fake dataset) and 7 responses for Self-Consistency, the results suggest a clear saturation trend. Performance gains became marginal beyond 20/6 examples in Few-shot prompting and 5 responses in SC, indicating that further increases would likely yield diminishing returns. A more extensive investigation of these saturation points remains an important task for future research.
4.3. Ablation Study
In the ablation study, summarized in Table 2, we explore the effect of varying the number of examples in Few-shot prompts and the number of responses generated in SC prompts on model performance. These experiments were conducted using the GPT-3.5-Turbo and GPT-4o models across tasks such as SMS spam detection, fake news detection, and toxic comment detection. The table illustrates how performance varied with changes in shot counts for Few-shot learning and the number of generated responses in the SC technique.
For Few-shot learning, we adjusted the number of shots per dataset using three configurations (10/4, 20/6, and 30/8, where the first value applies to the SMS and Toxic datasets and the second to the Fake dataset). In the SMS spam detection task, increasing the number of shots from 10 to 30 led to a notable improvement in both models. GPT-4o, in particular, improved from 95% accuracy and a 0.97 F1 score with 10 shots to 96% accuracy and a 0.98 F1 score with 30 shots. Similarly, GPT-3.5-Turbo also showed a slight improvement, increasing from 92%/0.94 to 93%/0.96 under the same conditions. This suggests that while both models benefit from additional Few-shot examples, GPT-4o demonstrates a more consistent and significant performance boost as the number of examples increases.
In the fake news detection task, however, the impact of increasing the number of Few-shot examples is less pronounced. GPT-4o’s performance peaked at 92% accuracy and a 0.92 F1 score with the smallest shot configuration but slightly decreased as more shots were introduced, stabilizing at 88% accuracy and a 0.89 F1 score at the largest configuration. This could indicate that for certain tasks like fake news detection, adding more Few-shot examples does not always correlate with improved performance. GPT-3.5-Turbo followed a similar trend, reaching its highest performance with the middle configuration (86% accuracy and 0.87 F1 score) and remaining stable as the number of shots increased further.
In the toxic comment detection task, the trend diverges more clearly between the two models. GPT-4o exhibited a substantial improvement as the number of Few-shot examples increased, with its accuracy rising from 81% to 88% and its F1 score improving from 0.89 to 0.93 as shots increased from 10 to 30. On the other hand, GPT-3.5-Turbo showed mixed results. Although its performance improved initially from 67% accuracy and 0.79 F1 score to 84%/0.91 with 20 shots, it declined slightly when the shot count was increased to 30, indicating that the model might struggle with overfitting or noise when provided with too many examples.
The SC technique, which varies the number of generated responses (3, 5, and 7), had a more consistent impact on both models. In the SMS spam detection task, GPT-4o maintained its peak performance (96% accuracy and 0.98 F1 score) regardless of the number of responses generated, while GPT-3.5-Turbo’s performance also remained stable at 95%/0.97. This suggests that for simpler tasks like SMS spam detection, generating more responses in the SC technique has a diminishing return after a certain point.
In the fake news detection task, GPT-4o showed a slight performance increase with more SC responses, rising from 88%/0.89 with 5 responses to 90%/0.91 with 7 responses. This pattern was not observed in GPT-3.5-Turbo, which remained constant at 86%/0.87 across all response configurations. For more nuanced tasks like fake news detection, GPT-4o appears to benefit more from the additional consistency checks in the SC approach compared to GPT-3.5-Turbo.
For toxic comment detection, the SC technique yielded more significant results. GPT-4o maintained its high performance (87% accuracy and 0.93 F1 score) with 5 and 7 responses, while GPT-3.5-Turbo exhibited a slight improvement with 5 responses, achieving 82% accuracy and 0.89 F1 score. These results indicate that the SC technique is particularly effective for more complex tasks, with both models benefiting from generating more responses and ensuring higher consistency, though GPT-4o consistently outperforms GPT-3.5-Turbo across configurations.
In summary, the ablation study reveals that increasing the number of Few-shot examples and SC responses can lead to improved performance, but the extent of improvement depends on the task and model. GPT-4o generally shows more resilience and gains with additional Few-shot examples and SC responses, while GPT-3.5-Turbo tends to experience diminishing returns or even slight declines with higher shot counts. The SC technique is particularly effective for both models in complex tasks, but its impact varies depending on the task’s complexity and the model used.
The results of this experiment demonstrate that adjusting the configuration of Few-shot and SC prompts can optimize model performance. Through a detailed analysis of each technique, we have closely examined how these adjustments influence the overall performance.
First, the analysis of the effect of the number of examples in Few-shot prompts reveals that the impact varies depending on the task. In the SMS spam detection task, both GPT-4o and GPT-3.5-Turbo showed steady improvements as the number of examples increased. Notably, GPT-4o’s performance continued to improve as the number of examples increased from 10 to 30, indicating that the model can better leverage additional examples to optimize its performance. In contrast, in the fake news detection task, increasing the number of examples led to a slight decline in performance. This suggests that for tasks like fake news detection, adding more examples does not necessarily correlate with improved performance. GPT-3.5-Turbo, for instance, saw little to no performance improvement beyond 20 shots, suggesting that exceeding a certain threshold of examples may not contribute to further gains and could potentially introduce noise or complexity that hinders performance.
In the toxic comment detection task, the trend diverged more clearly between the two models. GPT-4o exhibited substantial performance improvements as the number of Few-shot examples increased, with its accuracy and F1 score steadily rising until 30 shots. On the other hand, GPT-3.5-Turbo showed an initial improvement up to 20 examples, but its performance began to decline as the number of examples increased to 30. This suggests that an excessive number of examples may lead to overfitting or confusion in the model, particularly for GPT-3.5-Turbo, highlighting the importance of selecting an optimal number of examples for each task.
The analysis of the SC prompts showed that increasing the number of generated responses was especially effective for more complex tasks. In the SMS spam detection task, increasing the number of responses from 3 to 7 had little impact on performance for both models. However, for tasks such as fake news and toxic comment detection, the increase in responses resulted in noticeable performance gains. GPT-4o, in particular, exhibited significant improvements when the number of SC responses increased from 5 to 7, indicating that the SC technique helps the model produce more consistent and accurate outputs in complex tasks. In contrast, GPT-3.5-Turbo displayed more limited improvements in certain tasks, suggesting that its architecture may not fully capitalize on the benefits of the SC technique to the same extent as GPT-4o.
Overall, this study demonstrates that Few-shot and SC techniques are crucial for optimizing model performance, and their effectiveness varies depending on both the task and the model. GPT-4o, in particular, showed greater sensitivity to these techniques, resulting in more substantial performance gains, whereas GPT-3.5-Turbo exhibited more constrained improvements. These findings underscore the importance of fine-tuning advanced Prompt-Engineering techniques to maximize the performance of state-of-the-art models like GPT-4o, particularly in more complex tasks.
4.4. Limitation of Single Prompt ToT
The ToT prompt simulates a problem-solving process where multiple “experts” collaborate to reach a consensus decision. This method is designed to enhance the model’s reasoning capabilities by exploring multiple reasoning paths. However, the experimental results indicated that the performance of ToT, particularly in spam detection tasks, was somewhat lower than that of other prompt techniques. To explain this in detail, we examine an input example and its corresponding ToT-based output from Table 3.
The example illustrates a significant limitation of the Single Prompt ToT. In this case, the input message was actually a legitimate message, not spam, but the ToT process mistakenly classified it as spam. Several factors contributed to this misclassification. The experts in the ToT process primarily relied on specific keywords related to drug activities (“gram”, “eighth”, “second gram”). This reliance led to a biased interpretation of the message’s context, resulting in a false positive. Although ToT aims to strengthen reasoning by considering multiple perspectives, it can fall short in situations where a detailed understanding of context is crucial. The model’s experts failed to recognize that the terms could be used in a legitimate context.
Moreover, the collaborative nature of ToT can lead to consensus bias, where the experts reinforce the initial interpretation and do not sufficiently explore alternative explanations. In this example, all three experts quickly agreed on the drug-related interpretation, failing to consider other possibilities. The model’s experts automatically interpreted potentially cryptic language as an indicator of spam. However, in real-world applications, cryptic language can be used legitimately, and overgeneralization can lead to frequent misclassifications.
This example demonstrates that, while the ToT prompt attempts to leverage collaborative reasoning, it is vulnerable to biases and can misclassify messages when contextual cues are misinterpreted. Over-reliance on specific keywords and a lack of detailed contextual understanding can result in false-positive decisions. These findings highlight the need for enhancements to the ToT prompt that improve its contextual understanding and mitigate consensus bias, ensuring more accurate message classification.
Beyond the illustrative example provided, our observations indicate that the limitations of Tree-of-Thought (ToT) are not confined to isolated cases. In multiple instances, the simulated experts exhibited consensus bias, converging too quickly on an early interpretation without sufficient exploration of alternatives. We also observed keyword over-reliance, where certain trigger terms (e.g., “gram”, “deal”) led to misclassification even in benign contexts. These recurring patterns suggest that, while ToT aims to enhance reasoning through structured collaboration, in practice, it can amplify biases when contextual understanding is shallow. This explains its lower performance relative to CoT and Self-Consistency and highlights the need for improved mechanisms to diversify reasoning paths in future work.
5. Conclusions
In this study, we present a comprehensive survey of Prompt-Engineering techniques for anomaly detection. We review and synthesize insights from empirical evaluations using the GPT-3.5-Turbo and GPT-4o models, comparing methods such as Zero-shot prompting, Few-shot prompting, Chain-of-Thought prompting, Self-Consistency prompting, and Tree-of-Thought prompting. Additionally, we conduct an ablation study to investigate how variations in the number of examples in Few-shot prompting and the number of responses in Self-Consistency prompting influence performance.
Our survey not only aggregates performance trends from recent experiments but also contextualizes these findings within the broader landscape of anomaly detection research. By integrating empirical evidence with established literature, we highlight the strengths and limitations of each Prompt-Engineering strategy and identify key areas for further exploration.
We also acknowledge that our evaluation is limited to a small number of monolingual datasets, and broader validation across multilingual and multi-domain benchmarks remains an important direction for future research.
Future research should further investigate the scalability and adaptability of Prompt-Engineering techniques across diverse domains and anomaly detection tasks beyond text data. In particular, studies focusing on real-time data streams, large-scale datasets, and multilingual environments are essential to advance the field. Such efforts will be instrumental in elucidating how Prompt Engineering can effectively address practical challenges and guide the development of next-generation AI technologies.