Article

Telecom Fraud Detection Based on Large Language Models: A Multi-Role, Multi-Layer Prompting Strategy

School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 544; https://doi.org/10.3390/app16010544
Submission received: 9 December 2025 / Revised: 26 December 2025 / Accepted: 27 December 2025 / Published: 5 January 2026

Abstract

Telecom network fraud continues to evolve, and its textual expressions have become increasingly concealed, making automated detection more challenging. When combined with mainstream prompting strategies, large language models (LLMs) often exhibit unstable performance when handling diverse fraud texts, particularly for long-tail categories and confusing cases where consistent detection is difficult to maintain. To address this limitation, this study proposes a Multi-Role, Multi-Layer (MRML) prompting strategy. The strategy constructs three expert roles—text analysis, business process analysis, and security analysis—and adopts a conditional hierarchical reasoning mechanism to achieve a structured detection process that transitions from rapid binary screening to deep multi-class classification. This design systematically organizes the LLM’s inference steps and enhances its ability to distinguish different types of telecom fraud. Experiments conducted on two public datasets show that the proposed framework significantly outperforms mainstream prompting strategies and surpasses deep learning baselines such as BERT, TextCNN, and Transformer in terms of precision, recall, and F1-score, demonstrating superior performance and robustness. Overall, the results indicate that the proposed prompting strategy provides an effective and practically applicable solution for telecom fraud text detection in real-world scenarios.

1. Introduction

In recent years, the rapid development of Information and Communication Technologies (ICT) has profoundly reshaped modern society, bringing substantial economic and social benefits. However, this transformation has also been accompanied by a sharp global rise in telecom network fraud, which now poses a serious threat to socioeconomic stability and individual financial security. Telecom fraud has evolved from traditional SMS and voice-based scams to social media, emails, and diverse online channels, including phishing attacks, financial scams, and identity theft, causing significant economic losses worldwide [1]. The growing diversity and sophistication of these fraud schemes have made detection increasingly challenging, raising the need for more advanced, intelligent monitoring solutions.
To address these challenges, researchers and practitioners have explored various technical strategies. Traditional machine learning and deep learning methods have achieved notable success in spam and fraud detection, yet they exhibit inherent limitations. First, the adaptive and rapidly changing nature of fraud makes static pattern-based models slow to respond. Second, many deep learning models operate as “black boxes”, which undermines interpretability and trustworthiness in high-stakes applications. Third, emerging fraud patterns often lack sufficient high-quality labeled datasets, which hinders model training and generalization [2].
LLMs offer new opportunities to address these limitations. Trained on vast corpora, LLMs exhibit strong contextual reasoning and generalization capabilities, allowing them to identify unseen fraud patterns in zero-shot or few-shot settings. Moreover, techniques such as chain-of-thought reasoning enable more transparent and interpretable decision-making [3]. Nonetheless, conventional prompting approaches—whether zero-shot, few-shot, or chain-of-thought—often yield unstable performance in practical fraud detection tasks [4], leaving a gap between research advancement and real-world deployment.
To bridge this gap, this study proposes an MRML prompting strategy for telecom fraud detection. The proposed approach establishes a hierarchical, expert-inspired framework that structures the LLM’s reasoning process to enhance both detection accuracy and interpretability. The main contributions of this work are:
  • MRML prompting: This strategy simulates a human expert panel, enabling collaborative reasoning from low-level feature perception to high-level semantic judgment, thereby improving accuracy and robustness.
  • Transparent decision-making: By incorporating multiple expert roles for parallel information parsing and feature extraction, this framework provides explicit and traceable rationales for each decision.
  • Conditional triggered reasoning: This framework activates in-depth analysis only when initial screening indicates suspicious activity, improving computational efficiency without compromising performance.
Beyond its practical significance for anti-fraud efforts, this framework establishes a reusable methodology for deploying LLMs in security-critical decision-making scenarios. The remainder of the paper is organized as follows: Section 2 reviews related work in telecom fraud detection; Section 3 details the proposed model architecture and workflow; Section 4 presents experimental datasets, results, and analysis; Section 5 discusses the limitations of MRML; and Section 6 concludes with key findings and future research directions.

2. Related Work

The continuous evolution of telecommunications fraud, fueled by rapid advances in information technology, has in turn driven the progressive development of intelligent detection methods. This section surveys representative research in telecom fraud detection, paying special attention to recent advancements in applying LLMs within this domain.

2.1. Machine Learning-Based Fraud Detection Methods

Early research in fraud detection predominantly employed traditional machine learning methods, which relied heavily on manual feature engineering and conventional classification algorithms. These approaches typically converted textual data into numerical representations using either bag-of-words models or Term Frequency-Inverse Document Frequency (TF-IDF), followed by classification with algorithms such as Support Vector Machines (SVMs), Naive Bayes (NB), and Random Forests (RFs) [5]. For instance, SVM demonstrated superior performance in phishing email detection, achieving up to 97.3% accuracy, while RF attained 98% accuracy and a 99% F1-score on test sets. Owing to its capability to highlight distinctive terms within documents, TF-IDF also proved effective in fraud detection, with top-performing models reaching 98% in both F1-score and accuracy [6].
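The TF-IDF representation step described above can be sketched from scratch as follows. This is a toy illustration under our own assumptions (in practice, studies of this kind typically use a library implementation such as scikit-learn's `TfidfVectorizer` before feeding an SVM, NB, or RF classifier):

```python
import math
from collections import Counter

def tfidf(corpus):
    """Map each document to a dict of term -> tf-idf weight.

    tf  = term count / document length
    idf = log(N / document frequency of the term)
    """
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * idf[t] for t in tf})
    return vectors
```

The resulting sparse vectors downweight terms shared across many documents, which is why TF-IDF highlights distinctive (and often fraud-indicative) vocabulary.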
However, these methods exhibit pronounced limitations when processing semantically ambiguous fraudulent content. Their performance remains heavily dependent on feature quality, and they struggle to adapt to emerging fraud variants [7]. Although traditional approaches deliver competitive results on specific datasets, they face several critical constraints: feature engineering demands domain-specific expertise, model generalization remains limited, and they fail to effectively capture long-range textual dependencies or contextual semantics. Furthermore, studies indicate that traditional machine learning methods are particularly vulnerable to imbalanced data distributions and demonstrate inadequate performance for low-resource languages or novel fraud types. These inherent limitations subsequently motivated the development of deep learning approaches capable of automated feature learning.

2.2. Deep Learning-Based Fraud Detection Methods

With the advent of deep learning, neural-network-based approaches for fraud message identification and classification have significantly improved detection performance, overcoming the limitations of traditional methods through automatic feature learning and end-to-end training. Convolutional neural networks (CNNs) have been applied to extract local textual patterns; for example, in phishing email detection, CNNs capture keyword sequences through sliding windows and generate robust representations via pooling operations [8]. Recurrent neural networks (RNNs) and their variants, such as long short-term memory networks (LSTMs), are suited for processing sequential data.
With the emergence of the Transformer architecture, its self-attention mechanism enables global modeling of textual dependencies [9,10,11]. Transformer-based models such as BERT learn deep semantic representations through masked language modeling and have achieved accuracy as high as 99% in relevant text classification tasks, substantially outperforming traditional methods [6]. In addition, generative adversarial networks (GANs) have been introduced to enhance model robustness; for instance, GAN-enhanced text classification approaches combined with pretrained language models have achieved more than 90% accuracy in real-world applications [12]. Multi-head-attention-enhanced RoBERTa models with residual connections have also shown strong performance in Chinese telecom fraud detection. Li et al. proposed the RoBERTa-MHARC model by integrating RoBERTa with multi-head attention, residual connections, and dual loss functions, achieving an F1-score of up to 98% in multi-category telecom fraud text detection and outperforming traditional deep learning approaches [13].
Despite these advances, the “black-box” nature of deep learning models, their reliance on large-scale labeled datasets, and the challenges of generalization under data imbalance have driven research toward LLMs with stronger reasoning and generalization capabilities.

2.3. Large Language Model-Based Fraud Detection Methods

The emergence of LLMs, exemplified by the GPT series, has profoundly transformed the paradigm of fraud message detection and classification. Through pre-training, fine-tuning, and prompt engineering, LLMs achieve unprecedented levels of accuracy and adaptability [14]. A key advantage of LLMs lies in their massive parameter scales and extensive pre-training on large corpora, enabling them to comprehend complex semantic contexts and effectively identify fraudulent intent.
LLM-based approaches for fraud classification mainly fall into three categories: fine-tuning, prompt engineering, and retrieval-augmented generation (RAG) [15].
Fine-tuning directly adapts a pre-trained model to a downstream task. For instance, in Chinese fraud detection, Llama2-7B is efficiently fine-tuned via Quantized Low-Rank Adaptation (QLoRA), where fraud texts are organized in an instruction-tuning format and enhanced with Chain-of-Thought (CoT) reasoning, achieving strong performance on the CHIFRAUD dataset [16]. Similarly, Fraud-BERT fine-tunes BERT using domain-specific data and surpasses traditional methods in job-recruitment fraud detection, demonstrating the effectiveness of fine-tuning in low-resource scenarios [7].
Prompt engineering guides model outputs using natural language instructions, eliminating the need for large labeled datasets. In zero-shot phishing email detection, GPT-4 successfully performs classification through carefully crafted prompts; however, its stability remains sensitive to prompt design [17], indicating that simple prompt construction is insufficient to fully exploit LLM capabilities in complex fraud detection tasks. To further improve performance, few-shot learning is often incorporated, providing a small number of task examples to enhance model generalization [18].
Finally, RAG integrates external knowledge bases to enrich model context and mitigate hallucinations. For example, in cybersecurity threat detection, RAG retrieves relevant fraud patterns from a knowledge base to support more accurate classification outputs [19], demonstrating the potential of combining LLMs’ reasoning with external knowledge for robust fraud detection.

2.4. Summary

Traditional machine learning methods, while simple to implement and computationally efficient, rely heavily on manually crafted features. They struggle to capture deep semantic patterns and generalize poorly on imbalanced datasets. Deep learning methods, in contrast, significantly improve detection performance through end-to-end representation learning; however, they remain constrained by limited interpretability, the requirement for large labeled datasets, and computational overhead. LLMs demonstrate remarkable advantages in fraud detection owing to their powerful semantic understanding and few-shot generalization capabilities. Techniques such as prompt engineering and retrieval augmentation have further enhanced model adaptability and decision-making transparency.
However, the practical application of LLMs faces three key challenges: high fine-tuning costs, the insufficient stability and accuracy of conventional prompt engineering when handling ambiguous or adversarial texts, and limited adaptability to long-tail categories and low-resource scenarios.
To systematically address these limitations, this paper proposes an MRML prompt strategy. By implementing role-based division of tasks and phased decision-making, combined with CoT, role-based prompts, and small-sample examples, this approach aims to enhance the accuracy, robustness, and decision transparency in identifying easily confused and long-tail fraud types.

3. Methodology

To address the above limitations, this paper proposes an MRML prompt strategy. The design is grounded in the following considerations: accurate identification of telecom fraud texts relies on the collaborative analysis of multi-dimensional evidence spanning linguistic representations, business process logic, and security threat characteristics. Conventional single-prompt approaches lack the capacity to systematically organize such complex reasoning processes. To bridge this gap, the proposed strategy constructs three specialized roles—text analysis, business process, and security analysis—by discretizing expert knowledge into structured models, and further designs a condition-triggered hierarchical reasoning architecture. This systematically guides the LLMs to perform end-to-end analysis, progressing from local semantic perception to global risk assessment. The framework is designed to enhance the LLM’s discriminative capability and decision traceability in complex scenarios, thereby offering a reliable structured reasoning paradigm for high-stakes AI applications.
The overall architecture of this framework is shown in Figure 1, comprising two functionally distinct and tightly integrated layers:
Layer 1: Multi-Role Parallel Analysis and Preliminary Assessment
This initial screening layer employs three parallel expert modules (Text analysis 1, Business process 1, and Security 1) to perform a rapid risk assessment of the input text. Each module analyzes the text from its specialized perspective. The preliminary analyses from these three modules are aggregated into Integrated information 1, which provides crucial contextual input for the Final decision 1 module. Based on this Integrated information 1, Final decision 1 performs an efficient binary classification (fraudulent vs. non-fraudulent).
Layer 2: Conditional Triggering and Fine-Grained Classification
This layer is activated if and only if Layer 1 classifies the input as “Fraudulent information”. It then utilizes three enhanced expert modules (Text analysis 2, Business process 2, and Security 2) to perform a detailed, fine-grained analysis. Unlike the first-stage modules, which focus on binary assessment, these second-stage modules leverage both Integrated information 1 and domain-specific knowledge to categorize the fraudulent information into specific types. Their in-depth analyses are synthesized into Integrated information 2. Finally, the Final decision 2 module integrates this comprehensive information to deliver the final judgment, thereby achieving precise classification of the Types of Fraudulent Information.
The subsequent sections elaborate on the module design, prompt engineering techniques, and collaborative mechanisms employed in each layer.

3.1. Overview of the Overall Framework

The MRML prompt strategy proposed in this paper adopts a two-stage conditional-trigger architecture, with its core workflow illustrated in Figure 1. This architecture mimics the division of tasks and collaboration within professional anti-fraud teams, employing a two-phase processing flow (initial screening → fine-grained classification): the first phase concurrently extracts textual representations and rule-based evidence for binary screening; the second phase activates a fine-grained classifier and arbitration mechanism when the initial screening identifies suspicious cases, generating the final classification and explanations.
The framework receives raw text for detection. In the first phase, it simultaneously activates three specialized expert roles: text analysis, business process, and security analysis. These roles, powered by a unified LLM, are assigned distinct analytical perspectives and responsibilities through customized prompts. Each role independently performs preliminary text evaluation and generates initial analysis conclusions.
The output from the first stage is consolidated into a preliminary decision-making role. This role synthesizes input from three parties and generates a high-level binary decision (fraudulent vs. non-fraudulent). The conditional trigger mechanism here optimizes computational efficiency: if “non-fraudulent” is determined, the process terminates and returns the result directly; if “fraudulent information” is identified, the system automatically initiates a more complex second-stage analysis process.
In the second phase, the three expert roles are reactivated with enhanced prompts specifically designed for precise multi-category classification of fraudulent information. The framework processes the raw text together with the findings from the initial phase, performing identification against a predefined fraud category database. A comprehensive decision-making role then synthesizes the outputs of all second-phase participants to determine the final fraud category and its supporting rationale.
The framework systematically constructs a general LLM into a specialized fraud detection system with high coverage, high accuracy and high computational efficiency through role-based division of tasks, hierarchical analysis and condition-triggered mechanisms.
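The two-stage workflow above can be sketched as a minimal executable pipeline. Here `llm` is a hypothetical callable standing in for a real model call (e.g., to Qwen2.5-7B-Instruct), and the role prompts are placeholders rather than the paper's exact prompt texts:

```python
# Placeholder role prompts; the real prompts follow the
# "role definition—knowledge injection—task instruction—output guidance" structure.
ROLE_PROMPTS = {
    "text_analysis": "You are a text analysis expert for fraudulent information...",
    "business_process": "You are a business process analysis expert...",
    "security": "You are a security analysis expert...",
}

def detect(text, llm):
    # Stage 1: three expert roles assess the text (shown sequentially here for
    # clarity; the framework runs them in parallel).
    stage1 = {role: llm(prompt, text) for role, prompt in ROLE_PROMPTS.items()}
    integrated1 = "\n".join(f"{r}: {a}" for r, a in stage1.items())
    verdict = llm("Binary decision prompt (Final decision 1)...", integrated1)
    if "non-fraudulent" in verdict.lower():
        return {"label": "non-fraudulent"}  # early exit: Layer 2 is never triggered
    # Stage 2 (condition-triggered): enhanced roles receive both the raw text
    # and the Stage-1 integrated information.
    stage2 = {role: llm(prompt + " [enhanced]", text + "\n" + integrated1)
              for role, prompt in ROLE_PROMPTS.items()}
    integrated2 = "\n".join(f"{r}: {a}" for r, a in stage2.items())
    category = llm("Final decision 2 prompt...", integrated2)
    return {"label": "fraudulent", "category": category}
```

The early return makes the efficiency argument concrete: for texts screened as non-fraudulent, only four model calls are issued instead of eight.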

3.2. Layer 1: Multi-Role Parallel Analysis and Preliminary Judgment

The first layer, as the framework’s perception and filtering layer, is tasked with performing rapid initial risk screening of the input text through three parallel expert roles.
The prompt design for the text analysis role is shown in Table 1. The business process and security analysis roles follow the same modular structure of “role definition—knowledge injection—task instruction—output guidance”, so their designs are not repeated here.
This layer integrates three core analytical roles and one decision-making role. The text analysis role focuses on surface-level features and semantic content, identifying common psychological manipulation and linguistic inducement signals in fraudulent information. Its prompt instructs the LLM to act as a “text analysis expert for fraudulent information,” requiring step-by-step reasoning and emphasizing the identification of typical “non-fraudulent” characteristics, such as directing users to official channels and the absence of requests for financial operations. The business process role evaluates the rationality of commercial logic and operational procedures, focusing on whether the text describes transactions, claims, loans, or other processes that align with standard business practices. The security analysis role conducts risk assessments based on cybersecurity threat intelligence, examining whether the text contains high-risk signals such as malicious links, software downloads, or requests for sensitive information.
The analytical conclusions from the three aforementioned roles are relayed to the preliminary decision-making role. As the critical hub at this level, it consolidates and adjudicates the parallel analysis results. Its prompt instructs the LLM to function as a “comprehensive analysis expert,” incorporating key business rules, such as classifying information that contains only official prompts and no substantive risk operations as “risk-free.” Additionally, it ensures the standardization and parseability of results through strict output format directives (e.g., ‘Conclusion: [Fraudulent Information/non-fraudulent] Reason:…’). The framework parses the output of this role: if the conclusion is “non-fraudulent,” the process terminates; if “fraudulent information,” the second-layer in-depth analysis is triggered.
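Parsing the strict output format is what makes the conditional trigger machine-checkable. A minimal sketch of such a parser, assuming the ‘Conclusion: … Reason: …’ directive above (the regex and function name are our own illustration):

```python
import re

# Matches the Layer-1 decision format: "Conclusion: <label> Reason: <text>".
CONCLUSION_RE = re.compile(
    r"Conclusion:\s*(Fraudulent Information|non-fraudulent)\s*Reason:\s*(.*)",
    re.IGNORECASE | re.DOTALL,
)

def parse_decision(output):
    """Return (is_fraud, reason), or None if the LLM output is malformed."""
    m = CONCLUSION_RE.search(output)
    if m is None:
        return None  # malformed output: the caller may retry or fall back
    label, reason = m.groups()
    return (label.lower() == "fraudulent information", reason.strip())
```

Returning `None` on malformed output leaves room for a retry policy, since LLMs occasionally deviate from the requested format.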

3.3. Layer 2: Conditional Triggering and Fine-Grained Classification

The framework enters this layer if and only if the text is classified as “fraudulent information” by the first layer. The second layer aims to achieve precise, fine-grained classification of the fraud type. It reuses the three roles from the first layer but equips them with significantly enhanced prompts.
The enhanced prompt design for text analysis roles is shown in Table 2. Business process and security analysis roles are built under a unified design paradigm, with similar prompt structure modules. The only distinction lies in domain-specific content; detailed descriptions are omitted here to avoid redundancy.
The prompt designs for the three enhanced expert roles in the second layer share the following common features:
Task focus: The LLMs are explicitly required to select the most closely matching option from the predefined fraud categories.
Knowledge injection: Core fraud indicators are embedded directly into prompts. For example, for ‘false credit reporting,’ it details the typical process: threatening credit impact → requesting remote meeting tool download → inducing loans → transferring funds to a ‘safe account.’
Context Utilization: The input contains the original text and the integrated result of the first stage to achieve information relay.
Structured output: The LLMs are required to produce output in the format “Prediction category: <category> Reason: <description>”.
The outputs of the three enhanced roles are passed to the integrated decision-making role. As the decision terminal of the entire framework, this role is responsible for the final arbitration of potential disagreements. Its prompt instructs the LLM to integrate all of the deep analysis information, providing the final and most reliable classification result while ensuring a unified output format.
The specific prompt designed for the integrated decision-making role (Final decision 2) is as follows:
You are a highly experienced comprehensive analysis expert with years of expertise in anti-fraud. Please conduct the final judgment on the SMS content based on the following analysis results from the second-stage experts. Integrate all aspects of information to perform a multi-class classification. You may reason step by step and select the most appropriate category from the predefined list:
SMS Content: {text}
Second-stage Results: {analysis results}
Please output:
Prediction category: <category>
Reason: <brief explanation>
This prompt is characterized by two key design principles: First, it explicitly assigns the LLM the role of an “experienced expert,” leveraging its parametric knowledge to simulate seasoned judgment. Second, it mandates the integration of all preceding specialized analyses (text, process, and security), requiring the model to synthesize these inputs rather than rely on a single perspective, thereby emulating a comprehensive expert review process to reach a final, consolidated verdict.
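The Final decision 2 template above can be instantiated and its structured reply parsed as follows. This is an illustrative sketch: the template's `{analysis results}` field is rendered here as the Python format field `analysis_results`, and the reply-parsing regex is our own addition, not part of the paper:

```python
import re

FINAL_DECISION_2 = (
    "You are a highly experienced comprehensive analysis expert with years of "
    "expertise in anti-fraud. Please conduct the final judgment on the SMS "
    "content based on the following analysis results from the second-stage "
    "experts. Integrate all aspects of information to perform a multi-class "
    "classification. You may reason step by step and select the most "
    "appropriate category from the predefined list:\n"
    "SMS Content: {text}\n"
    "Second-stage Results: {analysis_results}\n"
    "Please output:\n"
    "Prediction category: <category>\n"
    "Reason: <brief explanation>"
)

def build_final_prompt(text, analysis_results):
    # Fill the two template slots with the raw SMS and the Stage-2 integrated info.
    return FINAL_DECISION_2.format(text=text, analysis_results=analysis_results)

def parse_final_reply(reply):
    """Parse 'Prediction category: ... Reason: ...'; return None if malformed."""
    m = re.search(r"Prediction category:\s*(.+?)\s*Reason:\s*(.*)", reply, re.DOTALL)
    return None if m is None else (m.group(1).strip(), m.group(2).strip())
```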

3.4. Summary of the Policy Process

The implementation process of this strategy can be precisely summarized into four core steps:
1. Input and parallel preliminary screening: The framework receives the text and simultaneously initiates three roles—text analysis, business process analysis, and security analysis—to conduct the first round of risk feature extraction and assessment.
2. Preliminary Decision and Routing: The preliminary decision role consolidates the results from three parties and outputs a binary decision of “fraudulent information” or “non-fraudulent”, thereby determining the process direction.
3. Condition Triggering and Deep Classification: When flagged as ‘fraudulent information’, the framework simultaneously activates three enhanced roles to perform granular fraud type classification.
4. Final Arbitration and Output: The integrated decision-making role synthesizes all predictions from the second layer to generate the final prediction category and its rationale.
Through systematic process design, this MRML prompt strategy transforms general LLMs into an efficient, accurate, and interpretable telecom fraud detection system. Its innovation lies in the organic integration of role division, conditional triggering, and prompt enhancement techniques, achieving a balance between performance and efficiency in complex classification tasks. This provides a reusable engineering paradigm for high-stakes applications based on LLMs.

4. Experiments and Results

4.1. Dataset

To evaluate the effectiveness and generalization of the proposed framework, we use two representative public datasets: one of telecom fraud messages and the other of Chinese web fraud texts.
(1)
CCL2023-FGRC-SCD fusion dataset
The first dataset used in this study is primarily based on the CCL2023 (available at: https://github.com/GJSeason/CCL2023-FCC, accessed on 29 March 2025) Telecommunication Network Fraud Case Classification Evaluation Dataset, with data sourced from real victim case records on the anti-fraud big data platform of public security departments, ensuring high authority and authenticity. The dataset covers 12 mainstream types of fraud, including fake order returns, impersonating e-commerce logistics customer service, and fraudulent online investment and financial management. However, the original data lacks normal (risk-free) samples. To address this issue, this study introduced 8000 normal samples from the FGRC-SCD (available at: https://aistudio.baidu.com/datasetdetail/215947, accessed on 25 April 2025) dataset, ultimately constructing a complete corpus of 110,762 samples suitable for fraud detection and classification tasks. As shown in Table 3, the category distribution of the dataset is highly imbalanced, with “fake order returns” samples totaling 35,459, while “impersonating military or police officers” has only 1092 samples. This faithfully reflects the distribution characteristics of telecommunication fraud in the real world, posing a challenge to the generalization ability of the model.
(2)
ChiFraud dataset
The second dataset used in this study is a large-scale corpus specifically built for Chinese web page fraudulent texts. The data comes from original texts continuously collected by professional crawlers from billions of web pages, all of which have been rigorously manually annotated by domain experts to ensure data quality. ChiFraud (available at: https://github.com/xuemingxxx/ChiFraud, accessed on 20 May 2025) covers 10 major fraud categories, including gambling, solicitation of prostitution, forgery of documents, and illegal drug transactions, along with a special “emerging fraud” category to test the model’s adaptability to unknown threats. As shown in Table 4, the dataset contains 411,434 samples, of which as many as 352,328 are normal samples, accurately simulating the “needle in a haystack” scenario of real-world online content governance.
To ensure experimental compliance and ethical standards, all data underwent rigorous privacy protection and anonymization at the source. The two datasets were divided in an 8:1:1 ratio into training, validation, and testing sets. Based on this partition, deep learning models (e.g., BERT, TextCNN) were trained and validated, with their final performance evaluated on the designated test set. All LLM-based methods—including the proposed strategy and other baseline prompt strategies—were also evaluated on this identical test set to ensure fair and consistent comparisons across all models.
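The 8:1:1 partition described above can be sketched as a plain random split (the paper does not state whether stratified sampling was used; the function name and seed are our own assumptions):

```python
import random

def split_811(samples, seed=42):
    """Shuffle and partition a sample list into 80% train, 10% validation, 10% test."""
    rng = random.Random(seed)  # fixed seed so the partition is reproducible
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```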

4.2. Model Selection

To comprehensively evaluate the proposed framework’s effectiveness, we conducted performance comparisons using multiple representative baseline models in our experiments. These models encompassed various deep learning architectures, including TextCNN, renowned for its efficient local feature extraction; Transformer, leveraging self-attention mechanisms to capture global dependencies; and BERT and ChineseBERT, as exemplary pre-trained language models with the latter being specifically optimized for the Chinese context. The implementations and configurations of these baseline models followed the standard practices in their respective original works or widely-used repositories. To ensure reproducibility and clarity, the code for these baseline models is based on the publicly available repository at: https://github.com/xuemingxxx/ChiFraud (accessed on 20 December 2025).
In selecting LLMs, to systematically verify the universality and effectiveness of the MRML prompt strategy proposed in this paper, we chose Alibaba’s Qwen2.5-7B-Instruct and ZhiPu’s glm-4-9b-chat as experimental models. This selection is primarily based on the following considerations: First, both are currently leading open-source models with excellent foundational capabilities, providing a reliable basis for validating the effectiveness of the prompt strategy while ensuring reproducibility and transparency of the research. Second, Qwen2.5 and glm-4-9b-chat originate from different technical approaches and training data distributions, which helps to fairly and comprehensively evaluate the adaptability and generalization capabilities of the proposed strategy across different model architectures, avoiding conclusions that are limited to a single model. More importantly, this study focuses on building a detection solution that combines high performance with practical value. The selected 7B-9B parameter scale maintains strong semantic understanding capabilities while significantly reducing computational overhead and deployment barriers. This simulates resource constraints in real-world application scenarios, aiming to demonstrate that our framework can achieve excellent performance without relying on LLMs with hundreds of billions of parameters, thus making it more practically valuable.

4.3. Experimental Results and Analysis

To comprehensively validate the effectiveness of the MRML Prompting strategy proposed in this paper, we conducted two experimental groups: First, we compared the performance of LLMs employing MRML with multiple mainstream deep learning baseline models; Second, we contrasted MRML with other typical prompting strategies to analyze its core advantages. All experiments were performed on the two datasets described in Section 4.1, ensuring comprehensive and fair evaluation.

4.3.1. Evaluation Indicators

To evaluate the model’s practical effectiveness in real-world telecom fraud scenarios, this study employs three core metrics: Precision, Recall, and F1-Score. These metrics comprehensively assess the model’s ability to balance false positives and false negatives. Notably, the F1-Score, which combines Precision and Recall, is particularly critical for fraud detection. Given the class imbalance in the dataset, all experimental results are presented as weighted averages to ensure robust evaluation outcomes.
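The weighted averaging above can be made explicit with a from-scratch computation (scikit-learn's `precision_recall_fscore_support(average="weighted")` yields the same quantities; this sketch is written without dependencies to show the class weighting):

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Weighted-average precision, recall, and F1 over all true classes."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    n = len(y_true)
    P = R = F = 0.0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        w = support[c] / n  # weight each class by its share of true samples
        P += w * prec
        R += w * rec
        F += w * f1
    return P, R, F
```

Weighting by class support means majority classes dominate the averages, which is why per-class recall on long-tail categories is also examined separately in Section 4.3.2.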

4.3.2. Comparison of Baseline Models

This experiment aims to compare the performance differences between the proposed method and traditional deep learning models in fraud detection tasks. As shown in Table 5 and Table 6, on the CCL2023-FGRC-SCD and ChiFraud datasets, the Qwen2.5-7B-Instruct and glm-4-9b-chat models equipped with the MRML strategy consistently and significantly outperform all baseline models in terms of precision, recall, and F1 score.
Specifically, on the CCL2023-FGRC-SCD dataset, whose category distribution is highly uneven, the proposed method achieves higher recall than the baseline models on typical long-tail category samples, indicating its strong potential for controlling false-negative risk. On the ChiFraud dataset, where normal samples dominate, the method maintains high precision while effectively suppressing false positives. Overall, these comparative results demonstrate that the MRML strategy exhibits stronger generalization and robustness when dealing with complex real-world data distributions and evolving fraud patterns.

4.3.3. Comparison of Prompting Strategies

To thoroughly evaluate the effectiveness of the MRML strategy, we conducted a comparative analysis against a series of representative prompting strategies:
  • Zero-Shot: basic prompting without any examples;
  • Few-Shot: prompting with a small number of labeled examples;
  • CoT (Chain-of-Thought): prompting that instructs the LLM to display its step-by-step reasoning process;
  • RP (Role-based Prompting): three specialized role prompts (text analysis, business process analysis, and security analysis), each evaluating the input from a different analytical perspective.
As shown in Table 7 and Table 8, the MRML strategy proposed in this study significantly outperforms all other prompting strategies across all metrics, on both datasets and both LLMs. This not only validates the necessity of multi-role collaboration and hierarchical decision-making, which provide a more comprehensive and structured analytical perspective and effectively overcome the limitations of single-role or single-phase prompts in complex scenarios, but also demonstrates the MRML strategy's strong adaptability to different LLM architectures.
The combined results from both experimental groups demonstrate that the MRML prompting strategy systematically unlocks the potential of LLMs in telecom fraud detection. Not only does it significantly outperform traditional deep learning models, but it also delivers superior and more stable performance compared with existing mainstream prompting methods. This establishes a robust technical foundation for developing efficient, reliable, and explainable fraud detection systems.

4.3.4. Ablation Experiments

To quantitatively evaluate the contributions of different components and the effectiveness of the hierarchical design in the MRML framework, we conducted systematic ablation studies from two perspectives: the importance of each expert role within the two-stage framework, and the necessity of the two-stage structure itself.
First, to isolate the contribution of each expert role, we sequentially removed one of the three expert roles—Text Role, Process Role, or Security Role—from the two-stage framework while keeping all other components and prompt structures unchanged. The performance drop observed after removing a specific role quantifies its importance within the integrated system.
Second, to validate the necessity of the proposed two-stage hierarchical architecture, we conducted a critical comparative experiment: we compared our full two-stage model (MRML) against a simplified, single-stage variant. In this ablated version, the input text bypasses the rapid screening of the first stage and is fed directly into the enhanced expert roles of the second stage (Text analysis_2, Business process_2, and Security_2) for comprehensive analysis.
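The contrast between the full two-stage flow and the single-stage ablation can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `call_llm`, the role names, and the `is_suspicious` parser are placeholders for a real model API and prompt set.

```python
# Hypothetical sketch of the MRML conditional two-stage flow.
from concurrent.futures import ThreadPoolExecutor

LAYER1_ROLES = ["text_analysis", "business_process", "security"]
LAYER2_ROLES = ["text_analysis_2", "business_process_2", "security_2"]

def call_llm(role: str, text: str, context: str = "") -> str:
    # Placeholder: a real implementation would send the role prompt,
    # the message text, and any first-stage context to an LLM.
    return f"{role} analysis of: {text[:20]}"

def run_layer(roles, text, context=""):
    # The three expert roles run concurrently (intra-layer parallelism),
    # then a single decision call aggregates their analyses.
    with ThreadPoolExecutor(max_workers=len(roles)) as pool:
        analyses = list(pool.map(lambda r: call_llm(r, text, context), roles))
    return call_llm("decision", text, context="\n".join(analyses))

def mrml_classify(text, is_suspicious) -> str:
    # Layer 1: rapid binary screening (4 LLM calls in total).
    screening = run_layer(LAYER1_ROLES, text)
    if not is_suspicious(screening):
        return "normal"
    # Layer 2 triggers only for suspicious inputs: fine-grained
    # multi-class classification conditioned on the Layer 1 result.
    return run_layer(LAYER2_ROLES, text, context=screening)
```

Under this sketch, the single-stage ablated variant corresponds to calling `run_layer(LAYER2_ROLES, text)` directly, skipping the screening step.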
The ablation results on the CCL2023-FGRC-SCD and ChiFraud datasets are presented in Table 9 and Table 10, respectively. For both the glm-4-9b-chat and Qwen2.5-7B-Instruct models, removing any single expert role, whether textual, procedural, or security-focused, leads to a consistent and significant decline across all key metrics: precision, recall, and F1-score. This finding clearly demonstrates the unique and complementary value contributed by each analytical perspective, confirming them as essential components of the framework.
A more critical test involved evaluating the hierarchical design itself by removing the entire first screening layer (Layer 1). The performance of this single-layer variant was significantly worse than that of the complete two-stage model. This key result demonstrates that the performance degradation is not simply due to the removal of individual components but stems from the absence of the entire conditional, coarse-to-fine reasoning mechanism. This mechanism is fundamental to the framework’s success, where Layer 1 serves as an efficient filter and Layer 2 acts as a precision discriminator.
Therefore, the synergy both among the multiple roles within each layer and between the two stages of the architecture is critical for achieving the high performance and robustness of the complete MRML strategy. These results collectively validate the necessity of the integrated multi-role design and the two-stage hierarchical architecture.

5. Limitations

In this study, we propose an MRML prompt strategy. Experiments demonstrate that it significantly improves the classification performance and explainability of LLMs in telecom fraud detection. Additionally, the conditional trigger mechanism maintains high accuracy while reducing redundant computations, effectively enhancing inference efficiency.
However, this approach has several limitations. First, the framework's performance depends heavily on prompt design and the LLM's comprehension capabilities, and it may struggle to handle complex fraud patterns involving cross-text or multi-round interactions. Second, this study relies primarily on text data, whereas real-world fraud often manifests through multimodal features such as images, audio, or links, which may limit the model's effectiveness in practical scenarios. Additionally, as fraud tactics evolve rapidly, the model's generalization ability and adaptability require continuous monitoring and updating.

6. Conclusions

This study proposes a novel prompt strategy that significantly enhances the detection capability of LLMs in identifying telecom fraudulent information. By integrating text analysis, business process analysis, and security analysis into a conditional trigger reasoning framework, the MRML prompt strategy achieves efficient and accurate fraud detection, providing an innovative technical pathway for intelligent anti-fraud systems. The framework is engineered for practical deployment, providing a scalable solution that enables telecom operators to filter fraudulent messages efficiently while offering security analysts interpretable reasoning traces.
The two-stage hierarchical design is inherently efficient, achieving low computational cost and latency, which is essential for real-time detection. This efficiency stems from two synergistic mechanisms. First, the conditional coarse-to-fine filtering reduces the average number of LLM calls. By leveraging the natural class distribution (e.g., 85.6% of messages are normal in the ChiFraud dataset), most inputs are processed by Layer 1, requiring only four LLM calls. Only a minority of suspicious cases trigger the full eight-call analysis in Layer 2, thereby significantly reducing the overall computational overhead. Second, to minimize latency, the framework employs intra-layer parallel processing. Within each layer, the three expert roles perform their analyses concurrently. Thus, although each layer involves four LLM calls (three for parallel analysis plus one for the decision role), the processing time per layer is equivalent to that of only two sequential calls (one for the parallel analysis phase and one for the decision phase). This parallelism, combined with the conditional execution of Layer 2, ensures that the overall response time is substantially lower than that of a sequential design requiring an equivalent number of calls. Together, these mechanisms—conditional filtering and intra-layer parallelism—enable the MRML framework to achieve high accuracy while meeting the stringent demands of large-scale, real-time fraud detection.
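This expected per-message cost can be worked out directly. The sketch below uses the ChiFraud class prior (85.6% normal) and assumes, purely for illustration, that Layer 2 triggers exactly for the fraudulent share of traffic; real trigger rates would also include false positives from the screening stage.

```python
# Expected LLM calls and latency for the two-stage conditional design,
# assuming (illustratively) that Layer 2 triggers for the fraud share only.
p_suspicious = 1 - 0.856   # ChiFraud: 85.6% of messages are normal
calls_per_layer = 4        # 3 parallel role calls + 1 decision call
latency_per_layer = 2      # parallel role calls count as one sequential step

expected_calls = calls_per_layer + p_suspicious * calls_per_layer
expected_latency = latency_per_layer + p_suspicious * latency_per_layer
# Roughly 4.58 calls and 2.29 sequential steps per message on average,
# versus 8 calls and 4 steps for an unconditional sequential design.
```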
Future research will prioritize advancing multimodal fraud detection technologies by integrating text, images, audio, and video data to combat increasingly sophisticated scams. We will also enhance computational efficiency to ensure system performance in resource-constrained environments. Furthermore, we will explore the integration of retrieval-augmented generation techniques to rapidly address emerging fraud tactics, thereby providing stronger support for real-world applications.

Author Contributions

Conceptualization, H.Z. and J.D.; methodology, J.D.; software, J.D.; validation, H.Z. and J.D.; formal analysis, H.Z.; investigation, J.D.; resources, H.Z.; data curation, J.D.; writing—original draft preparation, J.D.; writing—review and editing, H.Z. and J.D.; visualization, H.Z.; supervision, H.Z.; project administration, H.Z. and J.D.; funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC), grant number U22A2047.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data used in this article are provided within the article.

Acknowledgments

The authors would like to express their gratitude to all of the participants who contributed to this research and to the projects that provided funding support.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Multi-Role, Multi-Layer Prompting Strategy. Colors are used for visual distinction only and do not convey specific information.
Table 1. Text analysis expert prompts for the first stage (preliminary risk screening).

| Prompt Module | Prompt Content | Function Description |
|---|---|---|
| Role Definition | As a seasoned expert in analyzing fraudulent text, you are well-versed in the characteristics of various scam messages and can spot red flags in so-called 'risk-free' information. | Gives the LLM the identity of a domain expert, restricting its analysis perspective and focusing it on text-level features. |
| Knowledge Infusion | Be aware of typical signals of non-fraudulent information, such as: "Contact official customer service", "Through official channels", "Only providing information or advice", no fund transfer, no download of third-party apps, no urgent urging, etc. | Provides key domain knowledge, especially the features of counterexamples, to improve binary classification accuracy and reduce misjudgment. |
| Task Instruction | Please analyze the following information step by step to determine whether it contains characteristics of fraudulent information or is non-fraudulent: SMS content: {text} | Requires the LLM to perform CoT reasoning on a binary classification task, with formatted input. |
| Output Guide | Please give your judgment. | The LLM outputs the preliminary analysis result, which provides the basis for the next stage of decision-making. |
Table 2. Text analysis expert prompts for the second stage (fine-grained classification).

| Prompt Module | Prompt Content | Function Description |
|---|---|---|
| Role Definition | You are a highly experienced expert in analyzing fraudulent text messages, specializing in the precise classification of multi-category scam SMS. | Shifts the role from binary screening to the multi-category scenario, prioritizing the core objective of precise classification. |
| Task and Context | Please use the analysis results from the first stage {first_stage_results} to perform multi-category judgment on the following SMS content {text}, and select the most matching category from the predefined categories {','.join(LABELS)}. | Realizes hierarchical information transmission: the first-stage conclusion serves as important context, and the task is defined as a multiple-choice decision over a closed category set. |
| Knowledge Infusion | You can identify these fraud types by recognizing their core red flags, such as: phishing calls pretending to be financial or regulatory bodies, claiming credit issues; urging you to download and share a screen for remote meetings; encouraging loans on major platforms; asking you to transfer funds to a so-called "safe account"; using fake official documents to threaten you. | Equips the LLM with fine-grained domain knowledge, providing specific and actionable discriminative criteria. |
| Structured Output | Please strictly follow the output format below: Prediction category: <category>; Reason: <about 150 words, explaining the risk signals in the text, their match with the category features, and why the text is not misclassified into other categories> | The LLM outputs standardized, structured results that are easy to parse programmatically, and must also provide classification reasons. |
Table 3. Category distribution of the CCL2023-FGRC-SCD fusion dataset.

| Category Name | Sample Size | Percentage |
|---|---|---|
| Cashback for fake transactions | 35,459 | 32.0% |
| Pretending to be a customer service representative for e-commerce logistics | 13,772 | 12.4% |
| Fake online investment and financial management products | 11,836 | 10.7% |
| Loans, credit card applications, and related services | 11,105 | 10.0% |
| False credit reporting | 8464 | 7.6% |
| Fake shopping and services | 7058 | 6.4% |
| Pretending to be from the public security, procuratorial, judicial, or government agencies | 4563 | 4.1% |
| Pretending to be a superior or someone you know well | 4407 | 4.0% |
| Online game products involving fake transactions | 2155 | 1.9% |
| Online dating and social networking | 1654 | 1.5% |
| Shopping items for those who impersonate military or police officers | 1092 | 1.0% |
| Cases of cybercrimes | 1197 | 1.1% |
| No risk (FGRC-SCD) | 8000 | 7.2% |
| Total | 110,762 | 100% |
Table 4. Category distribution in the ChiFraud dataset.

| Category | Number | Percentage |
|---|---|---|
| New | 1063 | 0.3% |
| Loan | 1522 | 0.4% |
| SIM | 1405 | 0.3% |
| Certification | 7073 | 1.7% |
| Cash-out | 2817 | 0.7% |
| Drugs | 5128 | 1.2% |
| Bank | 1837 | 0.4% |
| Credentials | 1381 | 0.3% |
| Whoring | 26,187 | 6.4% |
| Gambling | 10,693 | 2.6% |
| Normal | 352,328 | 85.6% |
| Total | 411,434 | 100% |
Table 5. Experimental results of various models on the CCL2023-FGRC-SCD fusion dataset.

| Model | Precision | Recall | F1-Score |
|---|---|---|---|
| TextCNN | 85.2 | 83.8 | 83.4 |
| Transformer | 80.6 | 81.3 | 80.3 |
| Bert | 63.6 | 62.9 | 62.9 |
| ChineseBert | 86.9 | 87.0 | 86.5 |
| glm-4-9b-chat/MRML | 87.8 | 86.1 | 86.5 |
| Qwen2.5-7B-Instruct/MRML | 91.3 | 87.6 | 88.2 |
Table 6. Experimental results of various models on the ChiFraud dataset.

| Model | Precision | Recall | F1-Score |
|---|---|---|---|
| TextCNN | 81.1 | 63.0 | 66.9 |
| Transformer | 81.8 | 78.4 | 77.9 |
| Bert | 76.8 | 75.0 | 74.9 |
| ChineseBert | 83.1 | 81.0 | 80.5 |
| glm-4-9b-chat/MRML | 87.4 | 87.4 | 87.4 |
| Qwen2.5-7B-Instruct/MRML | 84.9 | 84.9 | 84.9 |
Table 7. Experimental results of the models on the CCL2023-FGRC-SCD fusion dataset under different prompt strategies.

| Model | Prompt Strategy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| glm-4-9b-chat | ZERO-SHOT | 76.4 | 68.7 | 66.1 |
| | FEW-SHOT | 82.4 | 77.5 | 77.4 |
| | COT | 74.8 | 66.6 | 64.2 |
| | RP | 78.0 | 70.2 | 67.4 |
| | MRML | 87.8 | 86.1 | 86.5 |
| Qwen2.5-7B-Instruct | ZERO-SHOT | 83.3 | 75.2 | 73.0 |
| | FEW-SHOT | 83.9 | 77.9 | 76.7 |
| | COT | 81.2 | 74.4 | 72.4 |
| | RP | 82.1 | 75.9 | 74.5 |
| | MRML | 91.3 | 87.6 | 88.2 |
Table 8. Experimental results of the models on the ChiFraud dataset under different prompt strategies.

| Model | Prompt Strategy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| glm-4-9b-chat | ZERO-SHOT | 78.8 | 75.8 | 76.2 |
| | FEW-SHOT | 77.7 | 74.3 | 74.3 |
| | COT | 79.3 | 75.3 | 75.7 |
| | RP | 80.1 | 78.0 | 78.1 |
| | MRML | 87.4 | 84.9 | 85.9 |
| Qwen2.5-7B-Instruct | ZERO-SHOT | 80.0 | 68.3 | 69.4 |
| | FEW-SHOT | 79.2 | 63.1 | 64.7 |
| | COT | 80.4 | 71.3 | 72.0 |
| | RP | 79.7 | 68.6 | 69.8 |
| | MRML | 87.8 | 82.7 | 84.1 |
Table 9. Performance comparison of MRML variants, including the single-layer architecture and scenarios with missing individual roles, on the CCL2023-FGRC-SCD dataset.

| Model | Variant | Precision | Recall | F1-Score |
|---|---|---|---|---|
| glm-4-9b-chat | Without Text Role | 87.3 | 85.2 | 85.9 |
| | Without Process Role | 87.7 | 85.4 | 86.0 |
| | Without Security Role | 87.7 | 85.8 | 86.3 |
| | Without Layer 1 | 75.9 | 74.8 | 72.9 |
| | MRML | 87.8 | 86.1 | 86.5 |
| Qwen2.5-7B-Instruct | Without Text Role | 89.0 | 82.4 | 83.3 |
| | Without Process Role | 88.4 | 83.6 | 84.0 |
| | Without Security Role | 88.3 | 82.9 | 83.5 |
| | Without Layer 1 | 86.3 | 80.8 | 81.8 |
| | MRML | 91.3 | 87.6 | 88.2 |
Table 10. Experimental results of the proposed model and its variants on the ChiFraud dataset, including the single-layer architecture and scenarios with a specific role absent.

| Model | Variant | Precision | Recall | F1-Score |
|---|---|---|---|---|
| glm-4-9b-chat | Without Text Role | 86.2 | 83.3 | 84.4 |
| | Without Process Role | 86.9 | 84.7 | 85.6 |
| | Without Security Role | 87.3 | 84.5 | 85.7 |
| | Without Layer 1 | 81.5 | 79.3 | 79.7 |
| | MRML | 87.4 | 84.9 | 85.9 |
| Qwen2.5-7B-Instruct | Without Text Role | 87.6 | 67.2 | 71.8 |
| | Without Process Role | 87.0 | 66.7 | 71.7 |
| | Without Security Role | 87.4 | 66.8 | 70.7 |
| | Without Layer 1 | 83.5 | 79.1 | 79.4 |
| | MRML | 87.8 | 82.7 | 84.1 |
Ding, J.; Zhou, H. Telecom Fraud Detection Based on Large Language Models: A Multi-Role, Multi-Layer Prompting Strategy. Appl. Sci. 2026, 16, 544. https://doi.org/10.3390/app16010544