Article

Wangiri Fraud Detection: A Comprehensive Approach to Unlabeled Telecom Data

1 Department of Electrical & Computer Engineering, University of Victoria, Victoria, BC V8P 5C2, Canada
2 Computer Engineering Department, Shiraz University, Shiraz 8433471946, Iran
3 Department of Telecommunication Engineering, Islamic Azad University, Tehran 1477893855, Iran
* Author to whom correspondence should be addressed.
Future Internet 2026, 18(1), 15; https://doi.org/10.3390/fi18010015
Submission received: 26 October 2025 / Revised: 17 December 2025 / Accepted: 22 December 2025 / Published: 27 December 2025
(This article belongs to the Special Issue Cybersecurity in the Age of AI, IoT, and Edge Computing)

Abstract

Wangiri fraud is a pervasive telecommunications scam that exploits missed calls to lure victims into dialing premium-rate numbers, resulting in significant financial losses for operators and consumers. This paper presents a comprehensive machine learning framework for detecting Wangiri fraud in highly imbalanced and unlabeled Call Detail Record (CDR) datasets. We introduce a novel unsupervised labeling approach using domain-driven heuristics, coupled with advanced feature engineering to capture temporal, geographic, and behavioral patterns indicative of fraud. To address severe class imbalance, we evaluate multiple sampling strategies, such as the Synthetic Minority Over-sampling Technique (SMOTE) and undersampling, and compare the performance of Logistic Regression, Decision Trees, Random Forest, XGBoost, and Multi-Layer Perceptron (MLP). Our results demonstrate that ensemble methods, particularly Random Forest and XGBoost, achieve near-perfect accuracy (e.g., Receiver Operating Characteristic Area Under the Curve (ROC-AUC) > 0.99) on balanced data while maintaining interpretability. The proposed pipeline offers a scalable and practical solution for real-time fraud detection, providing telecom operators with an effective tool to mitigate Wangiri fraud risks.

1. Introduction

Artificial Intelligence (AI) and data science are transforming the telecom industry by enabling intelligent automation, improved decision-making, and greater operational efficiency [1,2]. Telecom networks generate vast amounts of data from CDRs [3], user activity [4], signal measurements [5], and Internet of Things (IoT) devices [6]. Data science techniques, combined with AI and machine learning models, are essential for processing and analyzing this data to derive actionable insights. One key application is predictive maintenance, where AI models analyze equipment and network logs to anticipate failures and reduce downtime [7]. In customer service, AI-powered virtual assistants and chatbots provide 24/7 support and resolve issues quickly, while recommendation engines personalize service offerings based on subscriber behavior [8].
Advanced use cases include AI-based modulation classification [9], which identifies modulation schemes in received signals, aiding in interference management and signal decoding in complex environments [10]. Serving cell positioning leverages AI algorithms to estimate user locations based on signal characteristics, supporting location-based services, emergency response, and network optimization [11]. AI is also used in traffic forecasting [12], enabling proactive resource allocation [13] and anomaly detection for real-time security monitoring and fraud prevention [14]. Self-organizing networks (SONs) use AI for dynamic configuration and optimization, reducing manual intervention [15]. Additionally, machine learning models support churn prediction [16], pricing optimization [17], and customer segmentation [18], empowering operators to make data-driven business decisions and enhance the quality of experience (QoE) for users [19].
Advances in telecommunications technology have transformed global connectivity, allowing billions of users to communicate seamlessly across borders. Yet, these innovations have also introduced vulnerabilities, making telecom fraud one of the most pressing challenges in the industry. Within this landscape, Wangiri fraud, a deceptive scheme exploiting missed calls, has risen as a particularly harmful threat, causing substantial financial losses for both mobile operators and consumers [20].
The term Wangiri (Japanese for “one ring and cut”) refers to an advanced fraud scheme in which attackers place short-duration calls to random or targeted phone numbers, deliberately terminating the call after one or two rings. By exploiting the natural human inclination to return missed calls, fraudsters lure victims into dialing premium-rate numbers, thereby generating illicit revenue. Despite its simplicity, this attack vector remains highly effective due to its psychological manipulation [21]. The structure of the Wangiri fraud process is shown in Figure 1. The fraud process begins when a premium-rate account places a call to an ordinary phone number. When the victim returns the missed call, they are charged at an international premium rate.
Wangiri fraud imposes a substantial financial burden on the telecommunications industry, with annual global losses exceeding $40 billion according to recent industry analyses [22]. As a particularly costly form of telecom fraud, Wangiri scams contribute significantly to this staggering total. The attacks’ distributed architecture, typically involving millions of brief call attempts across international networks, presents unique detection challenges that conventional rule-based systems struggle to address effectively [23].
Modern telecommunications networks present a complex ecosystem of heterogeneous technologies, multiple service providers, and international roaming agreements, characteristics that create numerous vulnerabilities for Wangiri fraud exploitation. This complexity renders traditional rule-based fraud detection systems increasingly ineffective, as their static thresholds and predefined rules cannot adequately adapt to evolving attack patterns, frequently leading to either excessive false positives or undetected fraud cases [24].
Furthermore, a critical challenge in telecom fraud detection stems from the pervasive lack of reliable labeled training data. While supervised learning thrives in domains with abundant annotated datasets, operational telecom systems frequently lack complete or accurate fraud labels. This constraint demands novel detection methodologies capable of uncovering fraudulent patterns through alternative approaches beyond conventional supervised learning paradigms [25].
The main contributions of the paper are as follows:
  • Novel Unsupervised Labeling Approach: Development of an innovative methodology for generating reliable training labels from unlabeled telecommunications data using statistical anomaly detection combined with domain expertise.
  • Comprehensive Feature Engineering Framework: Creation of a systematic approach to feature extraction that captures multiple dimensions of calling behavior relevant to Wangiri fraud detection.
  • Interpretable Ensemble Architecture: Design of an ensemble learning system that balances high performance with interpretability requirements specific to telecommunications fraud detection.
  • Imbalanced Data Handling Techniques: Development of specialized techniques for managing highly imbalanced datasets in telecommunications fraud detection scenarios.
The rest of the paper is organized as follows. Section 2 reviews the most important related work in the Wangiri fraud detection area. Section 3 discusses the aspects of the problem. The methodology is presented in Section 4, followed by Section 5, where the most significant experimental results are discussed. Finally, Section 6 concludes the paper.

2. Literature Review

The dynamic landscape of telecommunication fraud, characterized by the rapid evolution of malicious techniques and the vast scale of telecom data, has prompted extensive research into fraud detection solutions. The increasing complexity and economic impact of fraud have driven the adoption of intelligent technologies, particularly in data analytics, machine learning, deep learning, and language modeling. This section presents a comprehensive review of relevant studies grouped into five thematic categories:
1. Traditional and Supervised Machine Learning Approaches;
2. Deep Learning and Neural Network Architectures;
3. Graph-Based and Visualization Methods;
4. Large Language Model (LLM)-Based Conversational and Semantic Analysis;
5. Policy, Regulation, and Regional Focus.

2.1. Traditional and Supervised Machine Learning Approaches

Sahin et al. [26] conducted an empirical study using a range of supervised machine learning algorithms, including Decision Trees, Support Vector Machine (SVM), and Random Forests, applied to anonymized telecom datasets. Their goal was to benchmark model performance in fraud detection scenarios, identifying decision trees as interpretable and efficient under certain conditions. The study highlighted the importance of preprocessing steps such as normalization and feature selection, concluding that no single model universally outperforms across all fraud types. Their work laid a foundational baseline for integrating classical Machine Learning (ML) tools in fraud analytics.
Arafat et al. [27] addressed Wangiri fraud detection by employing ensemble learning techniques, combining classifiers like Random Forest, AdaBoost, and Gradient Boosting. Their ensemble approach significantly reduced false positives and improved model stability across unbalanced datasets. Notably, the study provided a breakdown of fraud patterns based on call duration, frequency, and originating number. The use of ensemble techniques was proven beneficial in handling the subtle variances present in scam calls, making their approach suitable for large-scale deployments.
Birhanu [28] developed a real-time Subscriber Identity Module (SIM)-box fraud detection framework implemented at Ethio Telecom. The system was built around three classifiers (Random Forest, Support Vector Machine, and Neural Network) and trained on datasets segmented into one-hour, daily, and weekly slices. With Random Forest and Neural Network models achieving 100% accuracy, the framework demonstrated exceptional performance. Additionally, the study emphasized system integration and real-time data flow, making it a strong case study for practical, deployable fraud solutions.
Krasic and Celar [29] tackled the prevalent issue of data imbalance in fraud detection by incorporating SMOTE. Using real-world data from telecom environments, they applied SVMs and Decision Trees on oversampled datasets. Their analysis showed considerable improvement in F1 scores and detection precision. The study’s contribution is critical in highlighting that the skewed distribution of fraud cases, common in most telecom data, necessitates careful preprocessing to ensure classifier efficacy.
Ravi et al. [30] presented a multi-faceted approach to detecting Wangiri fraud, defining three distinct fraud patterns derived from year-long CDR analysis. They tested various supervised and unsupervised methods, including k-Means clustering and decision trees, and found that supervised classifiers consistently outperformed their unsupervised counterparts in detecting callback scams. Their comprehensive pattern definitions enhanced the granularity of detection strategies and laid the groundwork for fraud-specific model tuning.
Liang et al. [31] introduced the Telecom Fraud Detection model based on Feature Binning and Autoencoder (TFD-FA). Their framework bypassed traditional Graph Neural Networks (GNN) limitations by partitioning users into telecom-specific scenarios via feature binning, followed by neighbor feature aggregation through an autoencoder. The model also included an imbalance-aware classifier. Tested on Guangdong Unicom’s real dataset, TFD-FA outperformed baseline GNNs and decision trees, particularly in scenarios lacking complete node attributes. Their contribution presents a practical alternative for environments constrained by data availability.

2.2. Deep Learning and Neural Network Architectures

Wahid et al. [32] introduced the Neural Factorization Autoencoder (NFA), a hybrid architecture that combines neural factorization machines with autoencoders to capture long-term behavioral trends in telecom users. Their model integrated a memory module, enabling it to retain temporal fraud patterns. Evaluated on a real telecom dataset comprising over 670,000 calls, the NFA achieved an F1-score of 95.45% and an AUC of 91.06%, outperforming standard models. Their work demonstrated the feasibility of deploying deep learning models in streaming environments with robust adaptation to concept drift.
Hu et al. [21] proposed a novel fraud detection technique using GNNs with an imbalance-aware mechanism. They identified that traditional GNNs underperform when fraud cases are sparse due to neighborhood dilution by benign users. Their solution integrated reinforcement learning-based neighbor sampling and focal loss to improve minority class detection. This method was tested on two large-scale datasets and achieved competitive performance, particularly in recall and precision for fraudsters. Their innovation provides a blueprint for scaling GNNs in telecom contexts with data imbalance.

2.3. Graph-Based and Visualization Methods

CallMine, developed by Cazzolato et al. [33], is a scalable fraud detection and visualization tool that constructs and analyzes call graphs. By applying network analysis techniques, such as degree centrality and community detection, it uncovers patterns like “black holes” (users who receive but do not return calls) and “ghost chasers” (users who only call). The tool was validated on 35 million call records and allowed analysts to visually explore fraud scenarios, reducing investigative overhead and improving explainability. It set a precedent for using interactive analytics in telecom fraud operations.
Liang et al. [31] also contributed indirectly to graph-based detection by introducing autoencoders capable of aggregating neighborhood behavior without relying on complete graph topology. Their solution was specifically designed to overcome challenges in real-world deployments where data sharing across operators is restricted. Their feature binning strategy grouped similar user behaviors, while the autoencoder learned latent representations, yielding superior performance over GNNs in restricted data environments.

2.4. LLM-Based Conversational and Semantic Analysis

Singh et al. [34] pioneered a fraud detection system that combines Retrieval-Augmented Generation (RAG) with large language models to analyze real-time voice conversations. The model transcribes live calls, checks dialogue content against dynamic organizational policies, and alerts users to suspected scams. Their system demonstrated 97.98% accuracy on synthetic but realistic voice call datasets. Importantly, their approach enables real-time intervention, which is crucial for preventing psychological scams involving impersonation or coercion.
Shen et al., in their paper “Where Do We Stand,” [35] critically examined the viability of LLMs in detecting scam phone calls. They analyzed public and synthetic datasets using TF-IDF and classic classifiers and found high accuracy (up to 99% with Random Forests), but noted that performance hinged on specific keyword triggers rather than true semantic comprehension. They raised concerns about model generalization, hallucinations, and low recall in real-world scenarios.
In “It Warned Me Just at the Right Moment,” Shen et al. [36] proposed a real-time alert system using LLMs to identify fraud intent during ongoing calls. Their system evaluated dialogue in progress and provided in-call warnings. The paper delved into trade-offs between response latency and detection accuracy, offering insights into optimizing LLMs for time-sensitive tasks. The method showed promising performance and introduced a user-centric approach to scam prevention.
Boskou et al. [37] explored financial statement fraud by leveraging ChatGPT-4 to analyze qualitative components of corporate reports, such as CEO letters and risk disclosures. Using prompt engineering, they achieved an average F1-score of 67%. The study demonstrated that LLMs, even without fine-tuning, can be adapted for audit and regulatory tasks, making them accessible tools for financial oversight.
Korkanti [38] proposed a hybrid model that combines LLMs with advanced data analytics, including anomaly detection and predictive modeling, to identify financial fraud. Their system leveraged structured transactional data alongside LLM-processed communications, leading to improved precision and recall compared to legacy systems. This approach was particularly effective in detecting complex fraud schemes involving both behavior and content manipulation.

2.5. Policy, Regulation, and Regional Focus

Mundia et al. [39] investigated telecom fraud in Kenya using qualitative interviews and focus groups involving telecom operators and regulators. The study cataloged major fraud types (SIM swap, Wangiri, and SIM-box) and identified systemic challenges like lack of biometric verification and automated tools. It emphasized the issue of concept drift, where fraud tactics change faster than detection systems can adapt. Their policy recommendations include national fraud databases and real-time customer verification protocols.
Muchilwa et al. developed Coeus [40], a cyber threat intelligence platform focused on aggregating and sharing information on fraudulent phone numbers. The platform integrates with telecom systems via API and provides a real-time repository of reported fraud cases. Piloted in Kenya, Coeus addressed fragmentation in fraud response and improved inter-agency communication. Their work highlights the importance of infrastructure in collaborative fraud prevention.
Bayram et al. [41] explored regulatory challenges in Turkey’s telecom sector, combining a review of global practices with insights from Turkish telecom operators. The study pointed out issues like lack of standard Calling Line Identification (CLI) spoofing detection and inconsistencies in user identity verification. It proposed a framework for legal and technical reforms, including mandatory fraud reporting and centralized data analytics units.
Sahaidak et al. [42] conducted a global review of telecom fraud types, such as International Revenue Share Fraud (IRSF), arbitrage, and hybrid scams. They categorized fraud by complexity and evaluated their financial impact. Their analysis emphasized that modern fraud often combines multiple vectors, requiring multifaceted detection systems. The study advocates for flexible regulatory frameworks that support technological innovation while ensuring compliance.
Figure 2a illustrates the distribution of research methodologies used in telecom fraud detection. Supervised machine learning dominates the field, accounting for 30% of the studies, indicating its continued reliability and widespread adoption. However, LLM-based analysis is close behind at 25%, reflecting a growing interest in leveraging large language models for contextual and conversational fraud detection. Deep learning methods, including neural networks, autoencoders, and GNNs, contribute 20%, suggesting substantial exploration of more complex models. Policy and regulation studies (15%) and graph-based or visualization approaches (10%) round out the distribution, highlighting ongoing work in interpretability and governance.
Figure 2b shows the distribution of research focused on different fraud mechanisms. General telecom fraud is the most studied area at 35%, likely due to its prevalence and the variety of subtypes it encompasses. Wangiri fraud, social engineering/voice scams, and regulatory or policy-focused studies each make up 15%, showing that targeted fraud types and legal responses are gaining attention. SIM-box fraud and financial document fraud each account for 10%, indicating specialized interest in technical and document-based scams. Overall, the chart suggests a balance between broad and specific fraud mechanisms in current research. Table 1 presents a comparison of all related works discussed in this section, highlighting their methodologies, targeted fraud types, achieved accuracies, and key contributions.

2.6. Gap Analysis and Future Opportunities

Despite significant advances, several gaps persist. First, the scarcity of labeled data continues to constrain supervised methods, necessitating greater exploration of semi-supervised and unsupervised learning. Second, while deep learning models achieve high accuracy, their interpretability remains limited, posing challenges for compliance and operational deployment [43]. Third, scalability is underexplored, with few works addressing the real-time integration of fraud detection systems at national or global telecom scale. Finally, the adaptability of models to dynamic fraud tactics, known as concept drift, remains insufficiently addressed, particularly in fast-evolving schemes like Wangiri and SIM-box fraud.
The literature reveals a dynamic and expanding research landscape in telecommunications fraud detection. From classical supervised methods to cutting-edge LLM-based systems, scholars have developed a wide range of techniques that address different dimensions of the fraud problem. Nonetheless, unresolved challenges persist in data availability, interpretability, scalability, and adaptability. Addressing these gaps will require not only technological innovation but also collaborative regulatory frameworks and cross-operator data sharing. This paper aims to contribute to these efforts by developing comprehensive, practical approaches that bridge methodological rigor with deployment feasibility in combating Wangiri and related fraud schemes.

3. Problem Statement

The primary challenge lies in the absence of reliable labeled training data. In production telecommunications environments, fraud labels are often unavailable due to the time-sensitive nature of fraud detection, the cost of manual labeling, and the evolving nature of fraud patterns. Unlike domains where ground truth can be systematically curated, fraud detection in telecommunications is hampered by the dynamic and adversarial behavior of fraudsters, who constantly adapt their strategies to evade detection, thereby rendering previously labeled data obsolete or misleading over time. Moreover, the dependence on human experts to verify fraudulent instances introduces additional delays and operational costs, making large-scale annotation impractical. This scarcity of labels poses a critical obstacle to supervised learning, which relies on rich and representative datasets to learn generalizable decision boundaries. Although unsupervised learning techniques can circumvent labeling constraints by detecting anomalies or deviations from normal call behavior, such methods often struggle to provide actionable outputs in real-time contexts and may generate excessive false positives, undermining operational efficiency. Therefore, the unlabeled data challenge necessitates exploration of semi-supervised, weakly supervised, and self-supervised approaches, which can leverage limited labeled samples alongside vast amounts of unlabeled data, while maintaining responsiveness and adaptability to the rapidly changing fraud landscape.

3.1. Feature Engineering Complexity

Raw CDRs contain vast amounts of structured and semi-structured data that must be transformed into meaningful features capable of capturing fraud patterns. Each record encodes diverse information such as caller and callee identifiers, call duration, start and end times, and geographic origin, all of which may interact in complex ways to signal fraudulent behavior. The primary challenge is to identify combinations of attributes that serve as robust indicators of fraud, for example, short-duration calls to premium-rate numbers, repeated attempts across multiple destinations, or anomalous time-of-day activity patterns. However, the high dimensionality and heterogeneity of CDRs make naive feature extraction computationally expensive and prone to redundancy, which is particularly problematic in real-time processing environments where latency must be minimized. Additionally, features that are predictive under one fraud scheme may rapidly lose relevance as fraud tactics evolve, necessitating continuous monitoring and adaptation of the feature space. Thus, effective feature engineering must balance expressiveness with computational efficiency, employing domain knowledge, statistical analysis, and automated representation learning techniques to construct features that generalize across fraud scenarios without overwhelming system resources. Critical features for Wangiri fraud detection include:
  • Temporal features: call frequency, inter-call intervals, time-of-day patterns;
  • Duration features: call length statistics, duration distributions;
  • Geographic features: location clustering, international calling patterns;
  • Network features: routing information, carrier data;
  • Behavioral features: calling patterns, number sequences.
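As an illustration, the temporal, duration, and behavioral features above can be derived from raw CDRs by per-caller aggregation. The following is a minimal pandas sketch, assuming hypothetical column names (`caller`, `start_ts` as epoch seconds, `duration_ms`, `callee_country`) rather than the paper's actual CDR schema:

```python
import pandas as pd

def engineer_features(cdr: pd.DataFrame) -> pd.DataFrame:
    """Derive per-caller daily aggregates from raw CDR rows.

    Column names are illustrative placeholders: 'caller', 'start_ts'
    (epoch seconds), 'duration_ms', 'callee_country'.
    """
    cdr = cdr.sort_values(["caller", "start_ts"]).copy()
    # Temporal features: inter-call interval and hour of day per record.
    cdr["inter_call_s"] = cdr.groupby("caller")["start_ts"].diff()
    cdr["hour"] = pd.to_datetime(cdr["start_ts"], unit="s").dt.hour
    cdr["day"] = pd.to_datetime(cdr["start_ts"], unit="s").dt.date
    # Per-(caller, day) aggregates capturing volume, duration, and spread.
    agg = cdr.groupby(["caller", "day"]).agg(
        calls=("callee_country", "size"),
        mean_duration_ms=("duration_ms", "mean"),
        total_duration_ms=("duration_ms", "sum"),
        distinct_dest_countries=("callee_country", "nunique"),
        mean_gap_s=("inter_call_s", "mean"),
    ).reset_index()
    return agg
```

A high call count combined with a low mean duration in these aggregates is exactly the signature the Wangiri heuristics in Section 4 look for.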

3.2. Imbalanced Data Distribution

Wangiri fraud attempts, while numerous in absolute terms, represent a small fraction of total call traffic in telecommunications networks. This creates a highly imbalanced dataset where fraudulent instances are significantly outnumbered by legitimate calls, leading to challenges in model training and evaluation. In such skewed distributions, traditional supervised algorithms tend to bias toward the majority class, achieving deceptively high accuracy by predominantly predicting legitimate activity while failing to identify rare but critical fraudulent cases.
The imbalance also complicates the use of conventional evaluation metrics, as high overall accuracy may mask poor recall of fraudulent instances, which is of paramount importance in operational contexts. Addressing this issue requires both algorithmic and data-level interventions, such as resampling strategies, cost-sensitive learning, ensemble methods, or the design of tailored performance metrics that prioritize fraud detection without inflating false alarms. Moreover, the imbalance is further exacerbated by concept drift, as fraud strategies evolve, thereby reducing the representativeness of historical data and worsening class disparity over time. Consequently, the imbalanced data distribution is not merely a technical artifact but a structural characteristic of the fraud detection problem, demanding robust methodological innovations to ensure that minority-class patterns are effectively learned and operationalized.
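The SMOTE strategy referenced above synthesizes new minority-class samples by interpolating between a minority point and one of its nearest minority neighbors. Below is a minimal NumPy sketch of that core idea only; a production pipeline would normally use imbalanced-learn's `SMOTE` class instead:

```python
import numpy as np

def smote_like(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Generate n_new synthetic minority samples by linear interpolation
    between each chosen minority point and one of its k nearest minority
    neighbours (the core idea behind SMOTE). Assumes len(X_min) > k."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]         # k nearest neighbour indices
    base = rng.integers(0, len(X_min), n_new)  # random seed points
    neigh = nn[base, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))              # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])
```

Because the synthetic points lie on segments between real minority samples, they densify the minority region without duplicating records verbatim, which is what lets classifiers learn a broader decision boundary than plain random oversampling.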

3.3. Interpretability Requirements

Telecommunications fraud detection systems must provide clear, interpretable explanations for their decisions to satisfy regulatory requirements and enable operational teams to take appropriate action. In highly regulated domains such as telecommunications, transparency is not merely a desirable feature but a legal and ethical necessity, as operators must be able to justify the reasoning behind the flagging of suspicious activity both to auditors and to affected customers. However, the increasing reliance on advanced machine learning and deep learning models, which often operate as black-box systems, introduces a tension between predictive accuracy and interpretability. This tension necessitates the incorporation of explainable artificial intelligence (XAI) techniques, such as feature attribution methods, surrogate models, or rule-based post hoc explanations, to bridge the gap between complex model behavior and human comprehensibility.
At the same time, interpretability must be sufficiently granular to allow fraud analysts to trace specific indicators, such as anomalous call patterns, unusual temporal distributions, or deviations from user profiles, thereby ensuring that alerts can be investigated effectively and translated into actionable countermeasures. Consequently, the interpretability requirement extends beyond compliance, functioning as a critical operational enabler that allows human experts to validate model outputs, adapt strategies in response to evolving fraud tactics, and maintain trust in the automated system.

3.4. Real-Time Processing Constraints

Effective fraud detection requires near real-time processing capabilities to identify and mitigate fraudulent activities before significant financial losses occur. In practice, this entails the design and deployment of computational architectures capable of ingesting and analyzing massive streams of CDRs, signaling events, and network metadata with minimal delay, often within milliseconds of data generation. The challenge lies in balancing the need for low-latency detection with the inherently high computational complexity of fraud detection algorithms, particularly when employing machine learning models that involve feature-rich representations or temporal dependencies.
To meet these constraints, systems must employ efficient stream-processing frameworks, distributed computing environments, and memory-optimized algorithms that can operate at scale without degrading performance. Furthermore, real-time fraud detection demands robustness in handling data variability and burst traffic conditions, ensuring continuous operation even under network congestion or sudden spikes in fraudulent attempts. Failure to maintain stringent latency thresholds could render detections ineffective, as fraudulent transactions may already be executed and irreversible by the time an alert is raised. Thus, the real-time processing requirement is not merely a technical preference but a fundamental operational imperative, directly tied to the financial integrity and resilience of telecommunications networks.

4. Methodology

This section outlines the methodology for detecting Wangiri fraud in telecommunications data. The process encompasses data preprocessing, feature engineering, model training with imbalance handling, calibration, and evaluation. Figure 3 illustrates the complete pipeline, designed to address class imbalance and yield interpretable, high-performance models.
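The stages above (train with an imbalance remedy, calibrate, evaluate) can be sketched end to end with scikit-learn. This is a minimal illustration on synthetic stand-in features, not the paper's actual implementation; `class_weight="balanced"` is used here as one imbalance remedy, whereas the paper also compares SMOTE and undersampling:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered CDR features of Section 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 1.5).astype(int)   # rare positive class (~9%)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Random Forest with class weighting, wrapped in probability calibration.
clf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, class_weight="balanced",
                           random_state=0),
    cv=3,
)
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

Calibration matters operationally because analysts triage alerts by score; a calibrated probability lets a fixed alert threshold correspond to a predictable false-positive budget.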
The dataset consists of unlabeled CDRs, capturing low-level signaling and call metadata across various fields. Of the 125 features present in the CDR log files, we extracted 12 features relevant to Wangiri fraud detection. Each row in the dataset represents a single call session, characterized by the standard attributes listed in Table 2.
These features provide temporal, geographical, and signaling-related insights that form the basis for downstream analysis such as fraud detection and anomaly monitoring.

4.1. Data Labeling

To enable supervised learning for fraud detection, we developed a systematic labeling framework grounded in domain-informed heuristics derived from CDRs. Because reliable ground-truth annotations were unavailable in production, a rule-based strategy was adopted to approximate fraudulent behavior. Two distinct labeling approaches were defined, reflecting different assumptions about fraud indicators and their operational feasibility.
  • Operator-Defined Traditional Rule: In practice, many telecom operators rely on simple threshold-based heuristics for fraud identification. The most widely used rule marks a call as fraudulent when its duration is less than 5 s, combined with both the ACM time and CPG time equal to zero. While this method is effective in highlighting unanswered or prematurely terminated calls, its reliance on post-call signaling parameters limits its applicability in real-time detection pipelines. Specifically, these features may not be available at the time of decision-making, thereby restricting operational deployment in streaming environments.
  • Joint Duration and Call Volume Labeling (label_unique_calls_and_callduration): To address the shortcomings of the traditional rule, we designed a stricter heuristic that combines both temporal and volumetric dimensions while remaining compatible with real-time inference. In this scheme, a record is labeled as fraudulent if the caller exceeds 60 unique calls within a single day and the cumulative call duration across that day is less than 120,000 ms (120 s). This formulation reflects the characteristic behavior of fraudsters who engage in mass dialing campaigns with unusually low aggregate call durations, typically corresponding to unanswered or deliberately short calls. Unlike the operator-defined rule, this approach is aligned with features that can be computed incrementally, making it more suitable for real-time fraud detection scenarios.
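The two labeling rules can be sketched in a few lines. This is a minimal pure-Python illustration, assuming each CDR is a dict with caller, callee, day, callduration (in milliseconds), acm_time, and cpg_time fields; the field names and the millisecond unit for the 5 s threshold are our assumptions, not the operator's schema:

```python
from collections import defaultdict

def label_operator_rule(record):
    """Operator-defined rule: duration < 5 s with zero ACM and CPG times."""
    return (record["callduration"] < 5000
            and record["acm_time"] == 0
            and record["cpg_time"] == 0)

def label_unique_calls_and_callduration(records,
                                        min_unique_calls=60,
                                        max_total_duration_ms=120_000):
    """Joint rule: flag every call by a caller who, on a given day, dials
    more than `min_unique_calls` distinct numbers while accumulating less
    than `max_total_duration_ms` of total call time."""
    callees = defaultdict(set)   # (caller, day) -> distinct callees
    total_ms = defaultdict(int)  # (caller, day) -> summed duration
    for r in records:
        key = (r["caller"], r["day"])
        callees[key].add(r["callee"])
        total_ms[key] += r["callduration"]
    return [int(len(callees[(r["caller"], r["day"])]) > min_unique_calls
                and total_ms[(r["caller"], r["day"])] < max_total_duration_ms)
            for r in records]
```

Because the joint rule depends only on per-caller daily aggregates, both counters can be maintained incrementally in a streaming setting, which is what makes it deployable in real time.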

4.2. Feature Engineering

To improve the model’s predictive accuracy, a set of engineered features was developed to capture temporal, geographic, and behavioral aspects of each call record. These features were selected based on their relevance to the task and their ability to expose patterns typical of fraudulent or spammy behavior.

4.2.1. Temporal Features

These features provide temporal context to the call by localizing timestamps to the respective origin and destination time zones:
  • source_time: The local hour of day at the call origin, computed by adjusting the UTC Unix timestamp using the time-zone offset corresponding to the origin country or region. This helps capture behavioral patterns such as calls occurring during unusual hours.
  • dest_time: The local hour of day at the call destination. Calculated similarly to source_time, it helps identify suspicious call patterns based on the destination’s local time (e.g., calls placed at night to certain countries).
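Both temporal features reduce to a timestamp-plus-offset computation. A pure-Python sketch follows; the tz_offsets lookup table mapping country codes to UTC offsets is a hypothetical stand-in for the operator's time-zone data, and fractional offsets (e.g., +3.5 h) are supported:

```python
def local_hour(utc_timestamp, tz_offset_hours):
    """Local hour of day (0-23) from a Unix timestamp (seconds, UTC)
    and a time-zone offset in hours (may be fractional)."""
    return int((utc_timestamp / 3600 + tz_offset_hours) % 24)

def add_temporal_features(record, tz_offsets):
    """Derive source_time and dest_time from the record's srccc/destcc
    country codes via a country-code -> UTC-offset lookup (assumed)."""
    record["source_time"] = local_hour(record["timestamp"],
                                       tz_offsets[record["srccc"]])
    record["dest_time"] = local_hour(record["timestamp"],
                                     tz_offsets[record["destcc"]])
    return record
```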

4.2.2. Geographic and Call Type Features

These features encode geographic relationships and call classifications:
  • is_international_new: A binary indicator representing whether the call is international (1) or domestic (0). This is derived by comparing the country codes of the caller and callee, and is useful for detecting anomalies such as unexpected international call patterns.

4.2.3. Behavioral Features

Behavioral features aim to quantify user activity and uncover atypical usage patterns:
  • unique_calls_last_day: The number of distinct callee numbers contacted by the caller during the previous calendar day. This feature is designed to capture bursty or high-volume behavior, which is often associated with spam campaigns. It is computed using historical data to ensure the model can operate in a real-time inference setting without relying on future information.
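A sketch of this computation (pure Python, assuming records carry caller, callee, and a datetime.date day field). Only the previous calendar day's history is read, so no future information leaks into the feature:

```python
from collections import defaultdict
from datetime import date, timedelta

def unique_calls_last_day(records):
    """For each record, count the distinct callees its caller contacted on
    the previous calendar day. Uses only past data, so the feature is safe
    for real-time inference."""
    per_day = defaultdict(set)  # (caller, day) -> distinct callees
    for r in records:
        per_day[(r["caller"], r["day"])].add(r["callee"])
    return [len(per_day.get((r["caller"], r["day"] - timedelta(days=1)),
                            set()))
            for r in records]
```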

4.3. Exploratory Data Analysis

An in-depth Exploratory Data Analysis (EDA) was performed to examine data characteristics, fraud distributions, and feature relationships. Key insights are visualized in this section. Fraudulent calls constitute a small portion of the dataset, necessitating specialized handling techniques. As shown in Figure 4, the distribution between fraud and non-fraud calls reveals a significant class imbalance, with only 0.57% of calls identified as fraudulent.
Figure 5 shows the Spearman correlation coefficients between various features and the fraud label in a telecom dataset. The feature unique_calls_last_day has the highest positive correlation (0.07), indicating that accounts with more unique calls are slightly more associated with fraudulent activity. destcc and dest_time also show weak positive correlations. On the negative side, cpg_time, is_international_new, and acm_time exhibit the strongest (though still weak) negative correlations, suggesting these timing and international-related variables are slightly inversely associated with fraud. Overall, the absolute values of the coefficients are low, implying weak monotonic relationships between individual features and fraud, highlighting the need for more complex or nonlinear models to capture patterns effectively.
Figure 6 displays the Spearman correlation coefficients between all pairs of features in the dataset, highlighting the strength and direction of monotonic relationships. Most features exhibit weak correlations, with values clustering near zero, suggesting relative independence. However, a notable exception is the strong positive correlation (0.74) between dest_time and source_time, indicating that call initiation and destination times are closely related. Another moderate correlation appears between acm_time and cpg_time (0.41), reflecting a potential temporal or sequential dependency. No significant multicollinearity is observed across the broader set, making the feature set generally suitable for models sensitive to correlated inputs.
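For reference, the Spearman coefficients behind Figures 5 and 6 are simply Pearson correlations computed on rank-transformed values. A dependency-free sketch follows (scipy.stats.spearmanr is the usual shortcut in practice):

```python
def _ranks(values):
    """1-based average ranks, with ties receiving the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

Because it operates on ranks, Spearman's rho captures monotonic but non-linear associations, which is why it is preferred here over Pearson correlation for heavily skewed telecom features.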

4.3.1. Numerical Distributions

Figure 7 presents a set of histograms showing the distributions of key numerical features on a logarithmic scale. All four features (acm_time, cpg_time, callduration, and unique_calls_last_day) exhibit heavy right skewness, with most values concentrated near the lower end of the scale and long tails extending toward higher magnitudes. This indicates a large number of low-activity events and a small number of high-activity outliers, typical of telecom data. Specifically, callduration and cpg_time show extreme sparsity in their higher ranges, while unique_calls_last_day displays a spiked pattern, possibly due to discrete count values. The log scale effectively compresses the range and reveals structure in otherwise skewed data, suggesting that transformation or normalization may be beneficial for modeling.

4.3.2. Distributions by Fraud Status

Figure 8 shows violin plots comparing the distributions of four numerical features based on fraud status (Non-fraudulent (0) vs. Fraudulent (1)). These subfigures help visualize how the distributions of these features differ between normal and fraudulent behaviors:
(a) Acm Time: Fraudulent calls generally have shorter acm_time values, with a sharper peak at low durations, whereas non-fraudulent calls are more widely distributed with multiple modes, indicating greater variability.
(b) Callduration: Non-fraudulent calls show a wide and continuous spread in duration, including very high values, while fraudulent calls cluster near the lower range, indicating shorter calls are more typical in fraud cases.
(c) Cpg Time: Similar to acm_time, cpg_time for fraudulent calls is highly concentrated near low values with little variance, whereas non-fraudulent calls show a broader distribution.
(d) Unique Calls Last Day: This feature clearly separates the two classes. Fraudulent accounts make significantly more unique calls in a day, showing a tall, narrow peak at high values, while non-fraudulent behavior is concentrated around much lower counts.
Figure 8. Violin plots showing the distribution of key numerical features by fraud status. (a) Distribution of acm_time by fraud status. (b) Distribution of callduration by fraud status. (c) Distribution of cpg_time by fraud status. (d) Distribution of unique_calls_last_day by fraud status. (The dotted lines indicate locations where the slope changes are sharp and significant.)

4.3.3. Temporal Analysis

Figure 9 presents two bar charts analyzing call distribution patterns over the course of a day, segmented by fraud status (0 for non-fraudulent and 1 for fraudulent calls).
Top Chart—Call Distribution by Source Time of Day: This plot shows the distribution of call initiation (source) times. Non-fraudulent calls follow a clear diurnal pattern, peaking between 10:00 and 18:00 and dropping significantly during the night hours. Fraudulent calls, while far fewer in number, occur disproportionately around 17:00–18:00, subtly deviating from the non-fraudulent distribution.
Bottom Chart—Call Distribution by Destination Time of Day: This chart reflects the distribution of call termination (destination) times. The trend is similar to the source time distribution, with most non-fraudulent calls occurring during typical working hours. Fraudulent calls are again concentrated around late afternoon to early evening hours, though their volume is much lower compared to legitimate calls.
Together, the subfigures suggest that while non-fraudulent call activity aligns with standard daily schedules, fraudulent calls are slightly more concentrated during late daytime hours, indicating potential exploitation of higher call volumes or reduced detection during that window.

4.3.4. Geographic Insights

Figure 10 displays two bar charts analyzing the country-level distribution of fraudulent calls based on international dialing codes.
Left Chart—Top 10 Caller Countries in Fraudulent Calls: This chart ranks the caller country codes by the number of fraudulent calls initiated. Country code 98 is the most frequent source, significantly surpassing others, followed by 1, 44, and 971. The remaining entries have progressively fewer fraudulent calls, with codes like 965, 357, and 904 contributing relatively minimal volumes. This indicates a concentration of fraud origination from a few key regions.
Right Chart—Top 10 Destination Countries in Fraudulent Calls: This chart shows the destination country codes most frequently targeted by fraudulent calls. Country code 98 also dominates here, indicating it is both a major source and target of fraudulent activity. Other notable destinations include 90, 971, and several codes with low but notable volumes (93, 86, 49, etc.).
These charts highlight geographic patterns in telecom fraud, with certain countries consistently appearing as both origin and target of suspicious call activity, suggesting potential hubs or targets of fraud operations.

4.3.5. Engineered Features

Figure 11 presents two bar plots examining the relationship between engineered features and fraud status.
Left Plot—Fraud Status for International vs. Domestic Calls: This chart compares the volume of international and domestic calls, segmented by fraud label. The vast majority of calls are domestic and non-fraudulent. However, a noticeable proportion of international calls are fraudulent, highlighting that fraud is more likely to occur in international call traffic despite their lower overall volume.
Right Plot—Unique Daily Calls by Fraud Status: This chart shows the distribution of unique_calls_last_day for non-fraudulent vs. fraudulent users. Non-fraudulent users typically make a small number of unique calls per day, forming a narrow, centered distribution. In contrast, fraudulent users consistently exhibit a much higher and fixed number of unique calls per day (close to 30), indicating automated or scripted call behavior commonly associated with fraud.
These subfigures suggest that fraud is disproportionately associated with international calling and a high volume of unique calls within a single day, reinforcing the utility of these engineered features in fraud detection models.

4.3.6. Feature Interactions

Figure 12 visualizes the relationship between Call Duration (x-axis) and Connection Setup Time (CPG Time) (y-axis), with call instances labeled by fraud status. The blue dots represent non-fraudulent calls, and red dots (if any) represent fraudulent calls.
The plot reveals a strong concentration of points in the lower-left region, indicating that most calls, especially non-fraudulent ones, have both short durations and low connection setup times. A few outliers extend toward high call durations and high CPG times, but these are sparse. The overall distribution is highly skewed, with no clear linear relationship between call duration and CPG time. Notably, fraudulent calls (in red) are either extremely rare or absent in this sample, which may suggest that fraud is concentrated in calls with shorter durations and is not strongly related to variations in CPG time.

4.4. Proposed Optimized Ensemble Framework

To address the limitations of standard baseline models in detecting sophisticated Wangiri patterns, we propose an Optimized Ensemble Framework that integrates our novel unsupervised labeling heuristic with a hyperparameter-tuned voting architecture. Unlike standard implementations of Random Forest or XGBoost, our proposed approach specifically targets the “one ring and cut” signature by explicitly weighing the engineered behavioral features derived in the previous section.
The correlations observed in Figure 5 were pivotal in defining the model architecture. For instance, the high correlation of unique_calls_last_day necessitates a tree-based architecture capable of creating distinct decision boundaries for high-volume callers versus normal users. Furthermore, the weak monotonic correlations of timing features (acm_time, cpg_time) suggest non-linear dependencies; therefore, our framework utilizes gradient boosting (XGBoost) to capture these complex, non-linear feature interactions that linear baseline models (like Logistic Regression) fail to detect.
The framework operates in three distinct stages:
1. Heuristic-Based Weak Labeling: Generating initial ground truth from unlabeled data using the domain rules defined in Section 4.1.
2. Feature Interaction Encoding: Transforming raw CDR timestamps into cyclical temporal features and ratio-based behavioral metrics.
3. Hyperparameter-Optimized Ensemble: We move beyond default baseline settings by employing a rigorous grid search (GridSearchCV) to optimize the regularization parameters (λ, α) and tree depth. This ensures the model does not merely memorize the heuristic rules but generalizes to unseen variations of fraud.
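A minimal sketch of the voting stage using scikit-learn. GradientBoostingClassifier is substituted for XGBoost to keep the example dependency-light, and the synthetic data merely stands in for the engineered CDR features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

# Synthetic, imbalanced stand-in for the 9-feature CDR matrix.
X, y = make_classification(n_samples=600, n_features=9, weights=[0.9, 0.1],
                           random_state=0)

# Soft voting averages the members' predicted class probabilities.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X, y)
fraud_probs = ensemble.predict_proba(X)[:, 1]
```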

4.4.1. Dataset and Preprocessing

The experiments utilize a large-scale dataset comprising 11,667,040 call records with 12 initial features. After preprocessing (handling missing values, feature engineering, and selection), 9 key features were retained: call duration, unique calls last day, ACM time, CPG time, international calls ratio, fraud hotlist numbers ratio, country risk score, call time of day, and high frequency calling. The target variable is highly imbalanced, with 66,352 fraudulent cases (0.57%) and 11,600,688 non-fraudulent cases (99.43%). The data was split into 70% training (8,166,928 records), 10% validation (1,166,704 records), and 20% testing (2,333,408 records) sets prior to any balancing or modeling to prevent data leakage.
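Such a split can be reproduced with two stratified calls to scikit-learn's train_test_split (a sketch: taking 0.125 of the remaining 80% yields the 10% validation share):

```python
from sklearn.model_selection import train_test_split

def split_70_10_20(X, y, seed=42):
    """Stratified 70/10/20 train/validation/test split, performed before
    any resampling so synthetic minority samples never leak into the
    evaluation sets."""
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed)
    # 0.125 of the remaining 80% gives the 10% validation share.
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.125, stratify=y_trainval,
        random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```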

4.4.2. Sampling Strategies

To address the class imbalance, we applied four sampling strategies to the training data:
  • No Sampling: The training data remains imbalanced, with approximately 0.57% fraudulent cases.
  • SMOTE: SMOTE was used to oversample the fraud class, increasing its representation to approximately 33.3% of the training set.
  • Oversampling Fraud and Undersampling Non-Fraud: SMOTE was applied to oversample the fraud class, and random undersampling reduced the non-fraud class, achieving a balanced training set (50% fraud, 50% non-fraud).
  • Undersampling Non-Fraud Only: The non-fraud class was randomly undersampled to match the number of fraud cases, resulting in a balanced training set.
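The undersampling strategy is simple enough to sketch without the imbalanced-learn library (whose SMOTE and RandomUnderSampler classes would be used in practice for the oversampling variants). A pure-Python version of "Undersampling Non-Fraud Only":

```python
import random

def random_undersample(X, y, seed=42):
    """Randomly drop non-fraud rows until both classes are the same size,
    as in the 'Undersampling Non-Fraud Only' strategy."""
    rng = random.Random(seed)
    fraud_idx = [i for i, label in enumerate(y) if label == 1]
    nonfraud_idx = [i for i, label in enumerate(y) if label == 0]
    kept = rng.sample(nonfraud_idx, k=len(fraud_idx))
    idx = sorted(fraud_idx + kept)
    return [X[i] for i in idx], [y[i] for i in idx]
```

Undersampling discards majority-class information but is cheap at this scale; SMOTE instead synthesizes new minority samples by interpolating between nearest fraud neighbors.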
For each strategy, we trained five models: Logistic Regression, Decision Tree, Random Forest, XGBoost, and MLP. These were evaluated on the validation set to identify the best-performing combination.

4.4.3. Learning Approach

A critical challenge in deploying AI for Wangiri fraud detection is the scarcity and unreliability of labeled data, which necessitates exploration beyond fully supervised learning toward semi-supervised, weakly supervised, and self-supervised approaches. High detection accuracy is essential in this context, as misclassifications directly translate into financial loss or degraded user trust; however, purely unsupervised methods primarily focus on identifying generic anomalies rather than discriminating Wangiri fraud from other irregular but legitimate calling behaviors. In practice, this limitation is significant because fraud detection requires class-specific reasoning rather than deviation-based detection alone.
Moreover, many unsupervised techniques, such as k-nearest neighbors or DBSCAN, impose substantial operational constraints by requiring access to the training data during inference, making them unsuitable for real-time, large-scale telecom environments. These methods also typically rely on heuristic or distance-based similarity measures, which struggle to capture the complex relational patterns and heterogeneous feature types present in telecom data, including categorical, temporal, and behavioral attributes. Our early experiments with Isolation Forest further highlighted these shortcomings, yielding suboptimal performance and high false-positive rates.
Consequently, these observations motivated a shift toward weakly supervised labeling strategies, which better align with the practical constraints of telecom fraud detection while enabling models to learn fraud-specific patterns from imperfect but informative supervisory signals.

4.4.4. Model Selection

This subsection outlines the selection of five classifiers, each representing diverse algorithmic paradigms suitable for fraud detection tasks. These models were chosen to balance interpretability, performance, and the ability to handle complex data patterns in imbalanced datasets. The details of this process are shown in Figure 13.
  • Logistic Regression (LR): Logistic Regression is a supervised linear classification algorithm used for binary or multiclass problems. It serves as a baseline model that assumes a linear relationship between input features and the log-odds of the outcome. The core idea is to model the probability of a class (e.g., fraud or non-fraud) using a logistic (sigmoid) function [44,45], which maps any real-valued input to a value between 0 and 1. This makes it interpretable, as coefficients represent the impact of features on the outcome, but it struggles with nonlinear interactions without feature transformations. Robustness refers to its stability in the presence of multicollinearity when regularized. Overfitting is minimized through techniques like L1 (Lasso) or L2 (Ridge) regularization [46]. The probability prediction formula is given in Equation (1):
    $p(y = 1 \mid x) = \dfrac{1}{1 + e^{-(\beta_0 + \beta^{T} x)}}$
    where y is the binary target variable (1 for fraud, 0 otherwise), x is the feature vector, $\beta_0$ is the intercept (bias term), and $\beta$ is the vector of coefficients (weights) learned via maximum likelihood estimation. The model is trained by minimizing the cross-entropy loss given in Equation (2):
    $\mathcal{L} = -\sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$
  • Decision Tree (DT): A Decision Tree is a non-parametric, supervised learning algorithm that builds a tree-like model of decisions and their possible consequences. The key idea is recursive partitioning: the algorithm splits the dataset into subsets based on feature values to maximize information gain or minimize impurity at each node [47]. Nodes represent decision rules (e.g., “feature X > threshold”), branches are outcomes, and leaves are class predictions. Transparency comes from the visual and rule-based structure, allowing easy interpretation. However, it is prone to overfitting, where the tree grows too deep and captures noise rather than patterns, leading to poor generalization on unseen data. Pruning techniques (pre- or post-pruning) can mitigate this. A common splitting criterion is Gini impurity, defined in Equation (3):
    $\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2$
    where D is the dataset at a node, K is the number of classes, and $p_k$ is the proportion of samples belonging to class k. Another criterion is entropy (from information theory), given in Equation (4):
    $\mathrm{Entropy}(D) = -\sum_{k=1}^{K} p_k \log_2(p_k)$
    where the best split maximizes the reduction in impurity across child nodes.
  • Random Forest (RF): Random Forest is an ensemble learning method that constructs multiple Decision Trees and aggregates their predictions to improve accuracy and robustness. The core idea is bootstrap aggregation (bagging), where each tree is trained on a random subset of the data (with replacement) and a random subset of features at each split, reducing variance and overfitting [48]. Generalization improves as the ensemble averages out individual tree biases. Feature importance is derived from how much each feature reduces impurity across trees. While less interpretable than a single tree due to the “black-box” nature of the ensemble, techniques like variable importance plots provide insights. Predictions for classification are made by majority voting across B trees as shown in Equation (5):
    $\hat{y} = \arg\max_{k} \frac{1}{B} \sum_{b=1}^{B} \mathbb{I}\big(\hat{y}^{(b)} = k\big)$
    where $\hat{y}^{(b)}$ is the prediction from the b-th tree, and $\mathbb{I}$ is the indicator function. The out-of-bag (OOB) error estimates generalization without a separate validation set.
  • MLP: A Multi-Layer Perceptron [49] is a type of feedforward artificial neural network consisting of an input layer, one or more hidden layers, and an output layer. The fundamental idea is to learn hierarchical representations of data through nonlinear transformations, enabling the capture of complex interactions and patterns that linear models miss. Neurons in hidden layers apply nonlinear activation functions (e.g., ReLU: σ ( z ) = max ( 0 , z ) or sigmoid) to weighted sums of inputs. Training occurs via backpropagation, which adjusts weights using gradient descent to minimize loss. Hyperparameter tuning involves selecting layer sizes, learning rates, and regularization to prevent overfitting, especially with limited data. For a single hidden layer, the output is given in Equation (6):
    $h = \sigma(W_1 x + b_1), \qquad \hat{y} = \mathrm{softmax}(W_2 h + b_2)$
    where $W_1, W_2$ are weight matrices, $b_1, b_2$ are biases, $\sigma$ is the activation function, and softmax normalizes outputs to probabilities for classification, as shown in Equation (7):
    $\mathrm{softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{j} e^{z_j}}$
    where the loss is typically cross-entropy, optimized with algorithms like Adam.
  • Extreme Gradient Boosting (XGBoost): XGBoost [50] is a scalable, tree-based ensemble algorithm that uses gradient boosting to build sequential models. The key idea is to train “weak learners” (shallow trees) iteratively, where each new tree corrects the residuals (errors) of the previous ones, minimizing a loss function through gradient descent. Regularization terms prevent overfitting, and it excels in imbalanced tasks by handling class weights. Flexibility comes from customizable objectives and evaluation metrics. Interpretability is achieved via tools like SHapley Additive exPlanations (SHAP) values, though the boosted ensemble is inherently complex. The objective function to minimize is given in Equation (8):
    $\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$
    where l is the loss (e.g., logistic loss for classification), $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$ is the prediction as a sum of tree outputs $f_k$, and $\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$ is the regularization term (T: number of leaves, w: leaf weights, $\gamma, \lambda$: penalties). Trees are built by approximating the second-order Taylor expansion of the loss for efficiency.

4.4.5. Hyperparameter Tuning and Probability Calibration

Model hyperparameters were optimized using a grid search with 5-fold cross-validation on the balanced training set. Hyperparameter tuning was conducted only for the XGBoost model, as its performance is highly sensitive to parameter choices. The search space is summarized in Table 3.
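The tuning loop can be sketched with scikit-learn's GridSearchCV. The parameter values below are illustrative placeholders (Table 3 defines the actual search space), and GradientBoostingClassifier stands in for XGBoost to keep the example self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; the paper's Table 3 lists the real search space.
param_grid = {
    "max_depth": [2, 3],
    "learning_rate": [0.1, 0.3],
    "n_estimators": [25, 50],
}

# Synthetic stand-in for the balanced training set.
X, y = make_classification(n_samples=300, n_features=9, random_state=0)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,               # 5-fold cross-validation, as in the paper
    scoring="roc_auc",  # ROC-AUC is the primary optimization criterion
)
search.fit(X, y)
```

With xgboost installed, XGBClassifier would slot in directly and expose the lambda/alpha regularization penalties mentioned above.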
For other models (MLP, Random Forest, Decision Tree, and Logistic Regression), hyperparameter tuning was not applied for the following reasons:
  • MLP: Preliminary experiments showed unstable convergence and limited gains from tuning with the available dataset size, so default settings were retained for consistency.
  • Random Forest and Decision Tree: Both models demonstrated stable performance under default configurations, and extensive tuning did not provide meaningful improvements relative to the added computational cost.
  • Logistic Regression: As a baseline linear model, Logistic Regression is relatively insensitive to hyperparameter adjustments beyond regularization strength, which was fixed for comparability across experiments.
Following model training, probability calibration was applied using CalibratedClassifierCV, with both sigmoid (Platt scaling) and isotonic regression methods evaluated. Calibration ensures that predicted probabilities reflect true likelihoods of fraud, which is crucial for operational deployment in fraud management systems.
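A minimal calibration sketch on synthetic data; CalibratedClassifierCV wraps the base model and fits the probability map with cross-validation (passing method="isotonic" would swap in isotonic regression):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced synthetic stand-in for the CDR training data.
X, y = make_classification(n_samples=500, n_features=6, weights=[0.9, 0.1],
                           random_state=0)

# method="sigmoid" is Platt scaling: a logistic map fitted to held-out scores.
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                                    method="sigmoid", cv=5)
calibrated.fit(X, y)
fraud_probs = calibrated.predict_proba(X)[:, 1]
```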

4.4.6. Evaluation Metrics

Model performance was evaluated using a comprehensive set of metrics suitable for imbalanced binary classification. These metrics are defined based on the confusion matrix elements: True Positives (TP: correctly predicted fraud cases), False Positives (FP: non-fraud cases incorrectly predicted as fraud), True Negatives (TN: correctly predicted non-fraud cases), and False Negatives (FN: fraud cases incorrectly predicted as non-fraud) [51].
  • Precision: The fraction of predicted fraud cases that were correctly identified, reflecting false alarm control, as given in Equation (9).
    $\mathrm{Precision} = \dfrac{TP}{TP + FP}$
  • Recall (Sensitivity): The proportion of actual fraud cases correctly detected, representing the system’s ability to capture fraudulent activity, as given in Equation (10).
    $\mathrm{Recall} = \dfrac{TP}{TP + FN}$
  • F1-Score: The harmonic mean of precision and recall, providing a balanced assessment when both metrics are critical, as given in Equation (11).
    $F_1 = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
  • ROC–AUC: The Area Under the Receiver Operating Characteristic Curve, summarizing the trade-off between true positive rate (TPR = Recall) and false positive rate ($\mathrm{FPR} = \frac{FP}{FP + TN}$) across various classification thresholds. It is computed as the integral of TPR against FPR from 0 to 1, or empirically via the trapezoidal rule over sorted predictions, as given in Equation (12) [52].
    $\text{ROC-AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, d\mathrm{FPR}$
  • PR–AUC: The Area Under the Precision–Recall Curve, particularly informative for imbalanced datasets, where precision–recall trade-offs are more meaningful than ROC–AUC. It is computed as the integral of Precision against Recall from 0 to 1, or empirically via the trapezoidal rule, as given in Equation (13) [53].
    $\text{PR-AUC} = \int_{0}^{1} \mathrm{Precision}(\mathrm{Recall}) \, d\mathrm{Recall}$
  • Confusion Matrix: A tabular summary of true positives, false positives, true negatives, and false negatives, enabling detailed error analysis. It is represented as shown in Table 4 [51].
Among these, ROC–AUC was used as the primary optimization criterion, while the additional metrics ensured a holistic assessment of classifier behavior.
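These definitions translate directly into code. A dependency-free sketch follows (sklearn.metrics provides production equivalents); ROC-AUC is computed here via its rank-statistic interpretation rather than the trapezoidal rule:

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def precision_recall_f1(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def roc_auc(y_true, scores):
    """ROC-AUC as the Mann-Whitney statistic: the probability that a random
    fraud case scores higher than a random non-fraud case (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```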

4.4.7. Model Interpretability with SHAP

To ensure transparency of the predictive models, we employed SHAP [43], a unified game-theoretic framework for interpreting machine learning predictions. SHAP builds on Shapley values from cooperative game theory, which allocate the payout of a game fairly among players based on their marginal contributions. In the context of machine learning, the “players” are input features, and the “payout” is the model prediction.
Formally, for a model $f: \mathbb{R}^{M} \to \mathbb{R}$ with M input features, the Shapley value $\phi_i$ of a feature i is defined in Equation (14):
$\phi_i(f, x) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (M - |S| - 1)!}{M!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right]$,
where $N = \{1, 2, \ldots, M\}$ is the set of all features, S is a subset of features excluding i, $f_S$ denotes the model restricted to feature set S, and $x_S$ is the corresponding feature vector. This definition ensures the following desirable properties:
  • Efficiency: The feature attributions sum to the difference between the model output and the baseline prediction, as shown in Equation (15):
    $f(x) = f(x') + \sum_{i=1}^{M} \phi_i$,
    where $x'$ is a reference (baseline) input.
  • Symmetry: If two features contribute equally in all coalitions, they receive identical Shapley values.
  • Dummy: If a feature does not affect the prediction in any coalition, its Shapley value is zero.
  • Additivity: Explanations for combined models are additive over those models.
In practice, exact computation of Shapley values is intractable for large M due to the exponential number of subsets. SHAP leverages efficient approximations and model-specific algorithms (e.g., TreeSHAP for decision-tree-based models such as XGBoost) to provide computationally feasible explanations.
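To make the combinatorics concrete, here is an exact (exponential-time) Shapley computation for a toy model with a handful of features, where "removing" a feature means fixing it at its baseline value; this is precisely the cost that TreeSHAP's polynomial-time algorithm avoids:

```python
from itertools import combinations
from math import factorial

def exact_shapley(f, x, baseline):
    """Exact Shapley values for f over M features. Features outside the
    coalition S are held at their baseline values; exponential in M, so
    only viable for small feature counts."""
    M = len(x)
    phi = [0.0] * M

    def eval_subset(S):
        z = [x[j] if j in S else baseline[j] for j in range(M)]
        return f(z)

    for i in range(M):
        others = [j for j in range(M) if j != i]
        for size in range(M):
            for S in combinations(others, size):
                # Shapley weight |S|!(M-|S|-1)!/M! for this coalition size.
                w = factorial(size) * factorial(M - size - 1) / factorial(M)
                phi[i] += w * (eval_subset(set(S) | {i}) - eval_subset(set(S)))
    return phi
```

For an additive model the attributions recover the coefficients exactly, and their sum always equals $f(x) - f(x')$ (the efficiency property).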
We applied SHAP at both the global and local interpretability levels:
  • Global Analysis: SHAP summary plots aggregate the absolute values of ϕ i across all samples, ranking features according to their average contribution to predictions. This provided insight into the most influential indicators of fraudulent activity across the dataset, facilitating feature importance assessment beyond conventional impurity or coefficient-based measures.
  • Local Analysis: For individual call records, SHAP force and waterfall plots decomposed the prediction into additive feature contributions. Specifically, each prediction $f(x)$ was expressed as in Equation (16):
    $f(x) = f(x') + \sum_{i=1}^{M} \phi_i(x)$,
    where $\phi_i(x)$ quantified how much feature i pushed the prediction towards the fraudulent or non-fraudulent class. This local interpretability is particularly valuable for fraud analysts, who require case-by-case justification of model outputs.
By integrating SHAP into the modeling pipeline, we ensured that the system combined predictive performance with interpretability. This balance is essential not only for operational trust among fraud management teams but also for compliance with regulatory requirements in the telecommunications sector, where algorithmic decision-making must remain transparent and accountable.

5. Experimental Results

In this section, we present a comprehensive empirical evaluation of several machine learning models developed for fraud detection using the label_unique_calls_and_callduration dataset. This dataset captures key features related to call patterns, such as the number of unique calls and call durations, which are indicative of Wangiri fraud. The evaluation covers four different sampling strategies to handle the severe class imbalance typical of fraud detection scenarios: training on the full dataset without sampling, SMOTE for oversampling the minority class, a hybrid approach combining SMOTE with Random Undersampling (RUS), and RUS alone to reduce the majority class.
For each sampling strategy, we trained five distinct classification algorithms: Logistic Regression, Random Forest, XGBoost, Decision Tree, and MLP. These models were selected to represent a range of complexities, from simple linear classifiers to advanced ensemble and neural network methods. Additionally, each model was assessed in its base form (without probability calibration) and with two calibration techniques: isotonic regression and sigmoid (Platt scaling). Calibration is important in fraud detection to ensure that predicted probabilities accurately reflect true likelihoods, which can be crucial for threshold-based decision-making.
Performance was measured on two separate test sets to provide a robust assessment. The raw distribution test set mirrors the real-world imbalance, where fraud cases are rare, making it challenging for models to detect them without generating many false positives. The balanced test set, with an equal number of fraud and non-fraud instances, allows us to evaluate the model’s discriminative power in a more equitable setting. Key metrics include accuracy, macro-averaged precision, recall, and F1-score, as well as ROC-AUC and PR-AUC. The PR-AUC is particularly valuable for imbalanced data, as it focuses on the performance in detecting the positive (fraud) class.
The following subsections each focus on one classification algorithm. To facilitate comparison, the structure is consistent across subsections: a detailed discussion of the model’s overall performance, insights into how different sampling and calibration strategies affected results, a table summarizing all metrics, and visualizations including confusion matrices, ROC curves, PR curves, and calibration curves for the best-performing configuration (typically SMOTE + RUS without calibration). These visualizations help illustrate the model’s behavior and feature importance. Through this analysis, we aim to identify the most effective approaches for detecting Wangiri fraud and understand the underlying patterns driving model predictions.

5.1. Logistic Regression

Logistic Regression serves as a foundational linear classifier in our study, offering simplicity, interpretability, and low computational cost. We found that this model provides valuable insights into the linear separability of the data but often falls short in capturing the nuanced, non-linear relationships inherent in fraud detection tasks. Consequently, its performance was generally lower compared to more advanced models, particularly on the imbalanced raw test set where detecting rare fraud events is critical.
When trained on the full dataset without sampling, the base Logistic Regression model exhibited high accuracy on the raw test set (0.9992) due to the imbalance: it effectively predicted the majority non-fraud class but failed to identify fraud, as evidenced by the extremely low PR-AUC of 0.0063. The ROC-AUC was 0.8477, which appears reasonable but is inflated by the imbalance. On the balanced test set, the model performed no better than chance, with an accuracy of 0.5000. Applying isotonic calibration slightly improved the raw PR-AUC to 0.0068, but sigmoid calibration showed no change, indicating limited benefits from calibration in this setup.
Sampling strategies markedly enhanced the model’s ability to detect fraud. For instance, using SMOTE, SMOTE + RUS, or RUS, the base model achieved a PR-AUC around 0.0727–0.0729 on the raw test set, representing over a tenfold improvement. This enhancement stems from the sampling techniques exposing the model to more fraud examples during training, improving recall for the minority class. On the balanced test set, accuracy soared to approximately 0.9946, demonstrating strong classification when classes are equal. However, calibration on sampled data often led to detrimental effects; both isotonic and sigmoid methods resulted in balanced accuracies of 0.5000, suggesting that calibration over-adjusted the probabilities, making the model indecisive.
In summary, while Logistic Regression benefits significantly from sampling to handle imbalance, its linear assumptions limit its effectiveness for this complex task. It should be considered as a baseline, but for practical deployment, more sophisticated models are recommended. The detailed metrics are provided in Table 5, highlighting the trade-offs across configurations.
Figure 14, Figure 15 and Figure 16 provide a consolidated view of Logistic Regression under the best-performing setting (SMOTE + RUS, no calibration) on both the naturally imbalanced and the balanced evaluations. The confusion matrices in Figure 14 make the contrast clear: on the raw distribution (left), the classifier captures almost all positives (TP = 1853, FN = 7) but, because negatives dominate, even a small false-positive rate yields a large absolute FP count (15,433) against roughly 2.3 × 10^6 true negatives; on the balanced set (right), errors are far more symmetric (TN = 1847, FP = 13, FN = 7, TP = 1853). The ROC panels in Figure 15 and Figure 16 show near-perfect ranking ability (AUC ≈ 0.995) across both tests; however, the precision–recall curves surface the effect of prevalence: with the raw class mix, average precision is low (AP ≈ 0.073) despite excellent separability, whereas on the balanced set the curve hugs the upper-right region (AP ≈ 0.981). The reliability diagrams (rightmost panels) further indicate miscalibration on the raw distribution: probabilities deviate from the diagonal, particularly at higher score bins, while the balanced evaluation appears better aligned yet still exhibits departures at the extremes. Overall, these plots suggest that this configuration delivers extremely high recall with very few missed positives, but practical deployment under natural imbalance will require threshold tuning and probability calibration to rein in false alarms and improve decision quality.

5.2. Random Forest

The Random Forest model, an ensemble of decision trees, excelled in our experiments by effectively handling non-linear relationships and interactions between features. We observed that this model’s strength lies in its ability to reduce overfitting through averaging multiple trees, making it particularly suitable for imbalanced datasets like ours. It demonstrated robust performance across most configurations, with high PR-AUC values indicating strong fraud detection capabilities.
Training on the full dataset yielded impressive results for the base model, with a PR-AUC of 0.9943 on the raw test set and an accuracy of 0.9817 on the balanced set. This suggests that the model can learn effective patterns even without sampling, thanks to its ensemble nature. Sampling shifted this trade-off: the SMOTE + RUS strategy achieved the best raw PR-AUC among the sampled configurations (0.9841), slightly below the full-dataset result, while improving balanced accuracy to 0.9927. The SMOTE approach was similar, with a PR-AUC of 0.9836. However, RUS alone led to a sharp decline in raw PR-AUC to 0.6461, likely because aggressive undersampling removed too much information from the majority class, impairing generalization to the imbalanced test set.
Calibration had minimal positive impact and sometimes degraded performance, such as with isotonic calibration on sampled data, which reduced balanced accuracy. This indicates that the base Random Forest already produces well-calibrated probabilities. Overall, the Random Forest with hybrid sampling emerges as a highly reliable choice for fraud detection, balancing high precision and recall. The metrics are detailed in Table 6.
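The dual-test evaluation protocol used throughout this section can be sketched as follows, with a synthetic stand-in for the CDR features (prevalence, feature count, and hyperparameters are illustrative, not the paper's): train a Random Forest on imbalanced data, score PR-AUC on the raw test split, and accuracy on a balanced subsample of that split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CDR features with ~1% "fraud" prevalence.
X, y = make_classification(n_samples=20000, n_features=8, weights=[0.99],
                           class_sep=1.5, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

rf = RandomForestClassifier(n_estimators=200, random_state=7).fit(X_tr, y_tr)
# Raw-distribution view: PR-AUC on the naturally imbalanced test split.
raw_pr_auc = average_precision_score(y_te, rf.predict_proba(X_te)[:, 1])

# Balanced view: all test positives plus an equal-sized random negative sample.
pos = np.flatnonzero(y_te == 1)
neg = np.random.default_rng(7).choice(np.flatnonzero(y_te == 0),
                                      size=len(pos), replace=False)
bal = np.concatenate([pos, neg])
bal_acc = rf.score(X_te[bal], y_te[bal])
```
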
Figure 17, Figure 18 and Figure 19 summarize the behavior of the Random Forest under the best-performing setting (SMOTE + RUS, no calibration) for both the naturally imbalanced and the balanced evaluations. The confusion matrices in Figure 17 show that on the raw distribution (left) the classifier achieves an extremely low false-positive burden relative to the vast negative base (TN ≈ 2.3 × 10^6, FP = 490) while missing only a small number of positives (FN = 26, TP = 1834); on the balanced set (right) the errors shrink further and are almost symmetric (TN = 1859, FP = 1, FN = 26, TP = 1834). The ROC curves in Figure 18 and Figure 19 indicate virtually perfect ranking (AUC ≈ 1.000) across both test conditions, and the precision–recall plots confirm consistently high precision over most recall levels: on the raw test set the average precision is already very high (AP ≈ 0.984), rising to nearly ideal on the balanced set (AP ≈ 1.000). The reliability diagrams (rightmost panels) reveal that, under natural class imbalance, probability estimates tend to be miscalibrated (mid-range scores underpredict while high scores can overpredict relative to observed frequencies), whereas the balanced evaluation tracks the diagonal closely with only minor departures at extreme bins. Overall, these results indicate that the Random Forest maintains excellent recall with remarkably few false alarms, offering a substantially improved operating profile under class imbalance; nevertheless, modest probability calibration or threshold tuning could further align scores with empirical risk before deployment.
The visualizations for the best-performing configuration (SMOTE + RUS, no calibration) are shown in Figure 17, Figure 18 and Figure 19.

5.3. XGBoost

XGBoost, a scalable gradient boosting algorithm, stood out in our research for its efficiency and high accuracy, leveraging sequential tree building to minimize errors. We noted that this model is especially effective for structured data like ours, where it can exploit feature interactions and handle missing values inherently. Its performance was strong, but sensitive to the training data distribution, requiring careful selection of sampling strategies to optimize for both test sets.
The base model trained on the full dataset achieved the highest PR-AUC of 0.8836 on the raw test set, with a balanced accuracy of 0.8745. This indicates excellent generalization without sampling, likely due to XGBoost’s regularization parameters preventing overfitting. Isotonic calibration improved balanced accuracy to 0.9172, showing benefits in refining probability estimates. With SMOTE, the raw PR-AUC was 0.8485, and for SMOTE + RUS, it was 0.8734, with improved balanced accuracy of 0.8895. The RUS strategy excelled on the balanced test set (accuracy 0.9981) but underperformed on raw data (PR-AUC 0.6024), highlighting a trade-off where undersampling enhances balanced performance but reduces robustness to imbalance.
Calibration effects varied; for RUS, it caused significant degradation on the balanced set, possibly because the calibrated model lost its discriminative power. In general, XGBoost with full or hybrid sampling offers a balanced and powerful solution for fraud detection.
The metrics are detailed in Table 7. For interpretability, we generated SHAP plots, including bar, summary, and waterfall visualizations. The bar plot ranks feature importance by mean SHAP value, the summary plot shows the distribution of impacts, and the waterfall plot details a single instance's prediction breakdown. These consistently highlight callduration and unique_calls_last_day as key drivers.
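Where the SHAP library is not available, permutation importance offers a lighter-weight, model-agnostic cross-check of the same ranking question. The sketch below is a swapped-in technique, not the paper's SHAP analysis: it uses a toy labeling rule and hypothetical feature names mirroring the paper's top drivers.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 4000
# Hypothetical features named after the paper's top SHAP drivers, plus pure noise.
callduration = rng.exponential(60.0, n)                  # seconds
unique_calls_last_day = rng.poisson(3, n).astype(float)
noise = rng.normal(size=n)
X = np.column_stack([callduration, unique_calls_last_day, noise])
# Toy labeling rule (short ring + bursty dialing); NOT the paper's labels.
y = ((callduration < 10) & (unique_calls_last_day > 5)).astype(int)

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
# Expect the two behavioral features to dominate and the noise column to rank last.
ranking = np.argsort(result.importances_mean)[::-1]
```
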
Figure 20, Figure 21 and Figure 22, together with the SHAP diagnostics in Figure 23, Figure 24 and Figure 25, characterize the XGBoost model under the best-performing setting (SMOTE + RUS, no calibration). The confusion matrices in Figure 20 show that, on the raw distribution (left), the classifier recovers most positives (TP = 1450, FN = 410) while keeping the false-positive count comparatively low given the vast negative base (TN ≈ 2.3 × 10^6, FP = 380); on the balanced set (right) the error profile is near-symmetric with only a single false positive (TN = 1859, FP = 1, FN = 410, TP = 1450). Consistent with this, the ROC curves indicate essentially perfect ranking (AUC ≈ 1.000) across both evaluations, yet the precision–recall plots reveal the expected prevalence effect: average precision is strong but moderated on the raw set (AP ≈ 0.873) and rises to near-perfect on the balanced set (AP ≈ 0.999). The calibration plots further suggest that probabilities on the raw distribution are only approximately calibrated, deviating from the diagonal at mid-to-high score bins, whereas the balanced evaluation aligns closely with the ideal reliability line. The SHAP bar and summary plots identify unique_calls_last_day as the dominant driver, followed by callduration, with the remaining features (e.g., opc, callercc, hour_caller, acm_time, and is_international) contributing smaller but meaningful effects. The instance-level waterfall (Figure 25) illustrates how large negative contributions from the leading features push the log-odds toward the non-fraud class for a representative case, with only minor positive offsets from other variables, underscoring the model's reliance on recent calling behavior and call duration while highlighting the benefits of threshold tuning and probability calibration under natural class imbalance.
The visualizations for the best-performing configuration (SMOTE + RUS, no calibration) are shown in Figure 20, Figure 21 and Figure 22.

5.4. Decision Tree

The Decision Tree model, prized for its interpretability and ability to generate rule-based decisions, was evaluated to understand how a simple tree structure performs on this dataset. In our analysis, we found that while prone to overfitting, especially on imbalanced data, it can achieve high performance with appropriate training strategies, offering clear insights into decision paths.
The base model on the full dataset delivered outstanding results, with a raw PR-AUC of 0.9793 and balanced accuracy of 0.9970. This high performance suggests that the features are highly discriminative, allowing a single tree to capture key thresholds for fraud detection. Sampling with SMOTE or SMOTE + RUS produced identical results, with a raw PR-AUC of 0.9316 and balanced accuracy of 0.9901, slightly lower but still strong. The RUS strategy, however, resulted in overfitting, yielding a low raw PR-AUC of 0.4774 despite a high balanced accuracy of 0.9989.
Calibration had no effect on models trained with full, SMOTE, or SMOTE + RUS data, as the tree’s probabilities are already binary-like. For RUS, it caused the balanced accuracy to drop to 0.5000, indicating instability. Overall, the Decision Tree is a viable option for interpretable fraud detection, particularly without sampling. The metrics are detailed in Table 8.
Figure 26, Figure 27 and Figure 28 present the Decision Tree under the best-performing configuration (SMOTE + RUS, no calibration) across both the naturally imbalanced and balanced evaluations. The confusion matrices in Figure 26 show that, on the raw distribution (left), the model attains very high recall with few misses (TP = 1823, FN = 37) while maintaining a very small false-positive burden relative to the vast negative base (TN ≈ 2.3 × 10^6, FP = 95). On the balanced test set (right) the specificity becomes perfect with no false positives (TN = 1860, FP = 0) and the same small number of missed positives (FN = 37, TP = 1823), yielding an error profile that is clean and symmetric. Consistent with these counts, the ROC panels indicate strong ranking ability in both settings (AUC ≈ 0.990), while the precision–recall curves emphasize that precision stays high across most recall levels: on the raw set the reported average precision is still excellent (AP ≈ 0.982) and increases on the balanced set (AP ≈ 0.990). The reliability diagrams (rightmost panels) are particularly encouraging: the raw-distribution probabilities already track the diagonal closely and the balanced evaluation is essentially perfectly calibrated, suggesting that the Decision Tree's scores can be interpreted as well-calibrated probabilities with minimal post-processing. Overall, these figures indicate a model that combines high recall with extremely low false positives and near-ideal calibration, features that are attractive for operational deployment under both natural class imbalance and balanced evaluation.

5.5. MLP

The MLP, a feedforward neural network, was included to explore the potential of deep learning in modeling complex, non-linear patterns in call data. From our perspective, MLP’s flexibility comes at the cost of higher computational demands and less interpretability, but it can excel with sufficient data and tuning. Its performance was solid, particularly when trained with ample examples.
On the full dataset, the base MLP achieved a raw PR-AUC of 0.8816 and balanced accuracy of 0.8175, indicating good but not top-tier results. Sampling greatly improved balanced performance; SMOTE + RUS yielded a raw PR-AUC of 0.8476 and balanced accuracy of 0.9973, showcasing the network’s ability to learn from augmented data. RUS alone boosted balanced accuracy to 0.9973 but severely hampered raw performance (PR-AUC 0.3102), suggesting overfitting to the undersampled distribution.
Calibration provided minor improvements in some cases, like isotonic on full data increasing balanced accuracy to 0.8196, but often worsened results for RUS models. This sensitivity underscores the importance of careful hyperparameter tuning for neural networks in imbalanced settings. The metrics are outlined in Table 9.
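A minimal MLP baseline in scikit-learn is sketched below, with feature standardization in the pipeline (scaling matters far more for neural networks than for the tree ensembles discussed above). The dataset, layer sizes, and iteration budget are illustrative placeholders, not the paper's tuned configuration.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder imbalanced data standing in for the CDR features (~10% positives).
X, y = make_classification(n_samples=3000, weights=[0.9], random_state=3)

mlp = make_pipeline(
    StandardScaler(),                                   # scale before the network
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=3),
)
mlp.fit(X, y)
proba = mlp.predict_proba(X)[:, 1]   # fraud-class probabilities
```
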
Figure 29, Figure 30 and Figure 31 describe the behavior of the MLP under the best-performing setting (SMOTE + RUS, no calibration) for both the naturally imbalanced and balanced evaluations. The confusion matrices in Figure 29 show that on the raw distribution (left) the network attains near-perfect sensitivity (TP = 1852, FN = 8), but, because negatives vastly outnumber positives, an otherwise small false-positive rate still yields a large absolute FP count (1582) against roughly 2.3 × 10^6 true negatives; on the balanced set (right) errors shrink dramatically (TN = 1858, FP = 2, FN = 8, TP = 1852). The ROC panels in Figure 30 and Figure 31 indicate essentially perfect ranking (AUC ≈ 1.000) across both settings, yet the precision–recall plots make the effect of prevalence explicit: on the raw test set, precision gradually declines as recall increases, yielding a moderate average precision (AP ≈ 0.848), whereas on the balanced set the curve remains tightly near the upper-right corner (AP ≈ 1.000). The reliability diagrams (rightmost panels) further reveal that the raw-distribution probabilities are poorly calibrated (most mid-range scores underpredict the observed positive rate, with a sharp jump only at the highest bin), while the balanced evaluation adheres closely to the diagonal with mild underconfidence in the mid-probability region. Overall, these figures suggest that the MLP delivers extremely high recall with very few missed positives, but practical use under natural class imbalance will benefit from probability calibration and threshold tuning to control the absolute false-positive burden.

5.6. Summary of Best Model Performances

To provide a concise overview, we report the best configuration for each model on both the raw distribution (imbalanced) and balanced test sets. For the raw distribution, as Table 10 depicts, we select the setting with the highest PR-AUC, as this metric focuses on performance for the positive (fraud) class and is particularly informative under heavy class imbalance. For the balanced test set, we select the setting with the highest macro-averaged F1 score, which gives equal weight to precision and recall across classes and thus offers an equitable summary when class supports are matched.
As Table 11 illustrates, this study evaluated five classifiers and four sampling strategies for Wangiri fraud detection under extreme class imbalance. The central pattern is clear: tree-based ensembles dominated across settings, particularly on the imbalanced (“raw”) test set, where Random Forest trained on the full data without calibration achieved the highest PR-AUC (0.9943), with Decision Tree close behind (0.9793). When evaluated on a balanced test set, nearly all models performed strongly, but ensembles again led, with RUS-trained Random Forest, XGBoost, and Decision Tree reaching macro-F1 scores of 0.9984, 0.9981, and 0.9989, respectively. Logistic Regression required sampling to be competitive (best balanced macro-F1 ≈ 0.9946 with SMOTE/SMOTE + RUS), and the MLP, while improved by sampling, trailed the ensembles on the raw distribution (best raw PR-AUC ≈ 0.882).
The results are most parsimoniously explained by how models exploit abundant negatives under skew and capture non-linear structure. The selected models are explicitly designed to leverage the characteristics of highly skewed telecom data while capturing complex fraud patterns. Wangiri fraud detection benefits from the abundance of negative (legitimate) samples, as accurately modeling normal calling behavior is essential for distinguishing fraudulent activity. Tree-based ensemble methods, particularly Random Forest and XGBoost, exploit this setting by learning hierarchical, non-linear decision rules that naturally incorporate interactions among heterogeneous features, including categorical and behavioral attributes. Unlike linear or distance-based methods, these models do not assume simple feature relationships and are therefore better suited to the structured and relational nature of call detail records. Moreover, their ensemble formulation improves robustness under severe class imbalance by reducing variance and limiting bias toward the majority class. To further assess this robustness, we applied SMOTE-based rebalancing and observed consistent performance, confirming that the models’ effectiveness is not solely driven by class distribution. These properties explain why XGBoost and Random Forest achieved the best and most stable results in our experiments and justify their selection for Wangiri fraud detection under realistic, imbalanced conditions.
Training on the full dataset preserves informative majority-class variation; bagging and boosted trees then recover interactions and thresholds that translate into superior precision at high recall on raw traffic. Sampling rebalances exposure to the minority class and is essential for linear and neural baselines, but aggressive undersampling discards too much negative evidence and depresses raw PR-AUC for ensembles (e.g., Random Forest 0.6461, XGBoost 0.6024, Decision Tree 0.4774, MLP 0.3102). Probability calibration was largely unnecessary once ensembles were well trained; it occasionally helped Logistic Regression after sampling (raising raw PR-AUC from ≈0.006 to ≈0.114) but tended to over-flatten scores elsewhere. Model explanations reinforce domain plausibility: SHAP analyses for XGBoost consistently identified call duration and recent unique call counts as primary drivers, matching Wangiri’s short-ring, high-entropy calling behavior.
These findings have direct operational implications. For production detection on live, imbalanced streams, a Random Forest trained on the full dataset without calibration offers the best ranking quality and stable probability estimates, allowing decision thresholds to be set from PR curves to reflect business costs of missed fraud versus false alarms. Where evaluation or analyst triage relies on class balance, RUS can be used to sharpen separation during training, but practitioners should anticipate reduced robustness when the model is deployed on raw distributions. Feature engineering should prioritize temporal burstiness and duration dynamics; monitoring these features for drift, and periodically re-tuning thresholds as base rates shift, will likely yield larger gains than further probability calibration. Given the near-ceiling ROC-AUC of the ensembles, downstream investments should focus on threshold optimization, alert routing, and case-management integration rather than more complex model architectures.
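The recommended threshold-setting step can be sketched as a grid search over candidate thresholds that minimizes expected business cost. The per-event costs and the synthetic labels and scores below are hypothetical; in deployment the costs would come from the operator's loss model and the scores from the trained ensemble on held-out traffic.

```python
import numpy as np

# Hypothetical asymmetric costs: a missed fraud is much costlier than a false alarm.
C_FN, C_FP = 50.0, 1.0

rng = np.random.default_rng(1)
y_true = (rng.random(50000) < 0.001).astype(int)         # synthetic raw-traffic labels
scores = np.clip(y_true * 0.7 + rng.normal(0.15, 0.12, 50000), 0, 1)

# Expected cost of each candidate threshold on held-out data.
thresholds = np.linspace(0.0, 1.0, 101)
costs = np.array([
    C_FN * np.sum((scores < t) & (y_true == 1)) +        # missed frauds
    C_FP * np.sum((scores >= t) & (y_true == 0))         # false alarms
    for t in thresholds
])
best_threshold = thresholds[int(np.argmin(costs))]
```

Because the cost curve shifts whenever the fraud base rate drifts, this search should be re-run periodically on fresh data, as argued above.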
Several constraints temper generalization and motivate next steps. The experiments reflect a single feature set and distribution snapshot; real networks evolve, adversaries adapt, and routes differ across carriers and time. Future work should adopt time-ordered validation and rolling retraining, incorporate cost-sensitive objectives or focal losses aligned with financial impact, and extend interpretability beyond boosting (e.g., global rule extraction or surrogate trees for deployed models). Robustness should be stress-tested under simulated base-rate shifts and traffic surges, and decision thresholds should be re-optimized on fresh data. Finally, a two-stage pipeline, fast ensemble screening followed by targeted secondary checks, may preserve high precision during spikes while keeping latency acceptable in production.

5.7. Comparison with State-of-the-Art

The comparative review in Table 12 situates this research within prior Wangiri fraud detection work. Previous studies such as Arafat et al. [27] and Ravi et al. [30] demonstrated the effectiveness of ensemble and supervised learning, achieving accuracies close to 99% and F1 scores around 0.96–0.97 on labeled data. However, their reliance on fully labeled datasets and evaluation under moderate imbalance limits real-world applicability, where fraud events are rare and labeling is expensive.
Mundia et al. [39] examined Wangiri fraud from a policy perspective, focusing on operational gaps such as manual detection, limited automation, and concept drift. These findings underscore the need for adaptive ML systems capable of handling evolving unlabeled data streams.
To further validate the superiority of the proposed model, we include a comparison against foundational baselines such as Sahin et al. [26], who established early benchmarks using standard Decision Trees with ∼89% accuracy. Our Optimized Ensemble Framework significantly outperforms these traditional approaches, not only in raw metrics (ROC-AUC > 0.99) but also in its ability to handle the specific “one ring and cut” signature that general fraud models often miss. By integrating specific behavioral feature engineering with the unsupervised labeling heuristic, our approach demonstrates a clear performance advantage over both the static rule-based reviews found in Sahaidak et al. [42] and the standard supervised learning methods referenced in the literature.
Addressing these challenges, the proposed framework introduces a deployable and interpretable ML pipeline that manages unlabeled and highly imbalanced data using resampling (SMOTE, RUS), pseudo-labeling, and SHAP-based explanations. As summarized in Table 12, it advances both methodological robustness and practical applicability, achieving PR-AUC = 0.9943 and F1 = 0.998 with real-time readiness.

6. Conclusions

This study presents a robust machine learning framework for detecting Wangiri fraud in telecommunications networks, addressing key challenges such as unlabeled data, severe class imbalance, and real-time processing constraints. By combining unsupervised labeling techniques, advanced feature engineering, and ensemble learning models (Random Forest and XGBoost), our approach achieves high detection accuracy while maintaining interpretability, a critical requirement for operational deployment. The evaluation of various sampling strategies highlights the trade-offs between precision and recall, emphasizing the need for tailored imbalance-handling techniques in fraud detection. The success of ensemble methods underscores their superiority in capturing complex fraud patterns compared to traditional models like Logistic Regression or MLPs.
Central to the efficacy of this framework is the rigorous feature engineering pipeline, which successfully transformed raw, unlabeled signaling data into highly predictive behavioral indicators. Our analysis identified specific correlations that distinguish fraudulent activity, most notably the high frequency of unique_calls_last_day combined with specific temporal patterns in acm_time and cpg_time. By isolating these key features from the initial raw attributes, the proposed model effectively captures the “one ring and cut” signature of Wangiri fraud—specifically the burst-pattern dialing and short-duration connection attempts that rule-based systems often miss.
Furthermore, this study advances beyond standard baseline implementations by integrating an optimized ensemble architecture with specialized imbalance handling strategies. Through the application of GridSearchCV for hyperparameter optimization and the strategic use of SMOTE combined with random undersampling, the proposed XGBoost and Random Forest models achieved near-perfect discrimination on balanced test sets (ROC-AUC > 0.99 ). This validates that the “Proposed Model” is not simply a comparison of classifiers, but a hybrid, imbalance-aware decision framework designed for the specific constraints of the telecom domain.

6.1. Limitations of the Study

Despite the high performance of the proposed framework, several limitations must be acknowledged.
  • First, the supervised labels were generated using a rule-based heuristic approach. While effective for this specific dataset, this “weak supervision” implies that the model’s upper performance bound is tied to the quality of the initial heuristics; sophisticated fraud patterns that mimic normal call durations (e.g., >10 s) might evade detection.
  • Second, the dataset represents a specific snapshot of telecommunications traffic. Wangiri fraud is highly dynamic, and fraudsters frequently change originating country codes and carrier routes (Concept Drift). A model trained on static historical data may degrade over time without continuous retraining.
  • Finally, our feature engineering focused heavily on call metadata; we did not utilize voice content or SMS text data due to privacy constraints, which limits the model’s ability to detect multi-modal fraud schemes.
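The first limitation can be made concrete with a toy version of a rule-based pseudo-labeling heuristic. The thresholds and feature names below are illustrative, not the paper's actual rules; the point is that any fraud pattern keeping call durations above the duration cutoff evades the rule entirely, which caps the supervised model's recall.

```python
def heuristic_label(callduration_s, unique_calls_last_day, is_international):
    """Toy weak-supervision rule; thresholds are illustrative, not the paper's."""
    short_ring = callduration_s < 10          # "one ring and cut" signature
    bursty = unique_calls_last_day > 50       # high-entropy outbound dialing
    return int(short_ring and bursty and is_international)

# A burst of short international rings is flagged as fraud ...
flagged = heuristic_label(3, 120, True)       # -> 1
# ... but an attacker who holds calls past the duration cutoff slips through,
# illustrating the upper bound that weak supervision places on the model.
evades = heuristic_label(30, 120, True)       # -> 0
```
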

6.2. Future Works

Future work will focus on integrating Cost-Sensitive Learning to financially quantify the trade-off between false positives and false negatives. Additionally, we aim to explore Graph Neural Networks (GNNs) to capture the complex inter-user relationships inherent in telecom networks that tabular models might miss. Finally, stress-testing the pipeline under simulated real-time traffic surges will be crucial to verify latency constraints before full-scale deployment. Ultimately, this research provides telecom operators with a scalable, interpretable, and high-performance solution to combat Wangiri fraud, safeguarding both revenue and customer trust in an evolving threat landscape.

Author Contributions

Conceptualization, M.A. and A.B. (Amirreza Balouchi); methodology, M.A., A.B. (Amirreza Balouchi), A.E., K.K.P.K., E.M., and N.A.; software, A.B. (Amirreza Balouchi) and A.E.; validation, A.B. (Amirreza Balouchi), A.E., and M.A.; formal analysis, A.B. (Amirreza Balouchi) and M.A.; investigation, M.A. and A.B. (Amirreza Balouchi); resources, M.A., A.B. (Amirreza Balouchi), and A.E.; data curation, A.B. (Amirreza Balouchi), A.E., and K.K.P.K.; writing—original draft preparation, M.A. and A.B. (Amirreza Balouchi); writing—review and editing, M.A.; visualization, A.B. (Amirreza Balouchi) and M.A.; supervision, M.A.; project administration, A.B. (Amirali Baniasadi); funding acquisition, A.B. (Amirali Baniasadi). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. (The data are not publicly available due to privacy or ethical restrictions.)

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ACM: Address Complete Message
AI: Artificial Intelligence
CDR: Call Detail Record
CLI: Calling Line Identification
CPG: Call Progress Message
DT: Decision Tree
EDA: Exploratory Data Analysis
FN: False Negatives
FP: False Positives
GNN: Graph Neural Networks
IoT: Internet of Things
IRSF: International Revenue Share Fraud
LLM: Large Language Model
LR: Logistic Regression
ML: Machine Learning
MLP: Multi-Layer Perceptron
NFA: Neural Factorization Autoencoder
QoE: Quality of Experience
RF: Random Forest
RAG: Retrieval Augmented Generation
ROC-AUC: Receiver Operating Characteristic Area Under the Curve
SHAP: SHapley Additive exPlanations
SIM: Subscriber Identity Module
SMOTE: Synthetic Minority Over-sampling Technique
SON: Self-Organizing Networks
SVM: Support Vector Machine
TFD-FA: Telecom Fraud Detection model based on Feature Binning and Autoencoder
TP: True Positives
TN: True Negatives
UTC: Coordinated Universal Time
XAI: Explainable Artificial Intelligence
XGBoost: Extreme Gradient Boosting

References

  1. Silitonga, J.L. A Review of AI-Driven Predictive Maintenance in Telecommunications. Int. J. Inf. Syst. Innov. Technol. 2024, 3, 25–31. [Google Scholar] [CrossRef]
  2. Yang, Y.; Yang, S.; Zhao, C.; Xu, Z. TelOps: AI-driven operations and maintenance for telecommunication networks. IEEE Commun. Mag. 2023, 62, 104–110. [Google Scholar] [CrossRef]
  3. Aziz, Z.; Bestak, R. Insight into anomaly detection and prediction and mobile network security enhancement leveraging k-means clustering on call detail records. Sensors 2024, 24, 1716. [Google Scholar] [CrossRef]
  4. Amin, F.; Choi, G.S. Analysis and Modeling of Mobile Phone Activity Data Using Interactive Cyber-Physical Social System. Comput. Mater. Contin. 2024, 80, 3507. [Google Scholar] [CrossRef]
  5. Abdollahi, M.; Mashhadi, S.; Sabzalizadeh, R.; Mirzaei, A.; Elahi, M.; Baharloo, M.; Baniasadi, A. IODnet: Indoor/Outdoor Telecommunication Signal Detection through Deep Neural Network. In Proceedings of the 2023 IEEE 16th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip (MCSoC), Singapore, 18–21 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 134–141. [Google Scholar]
  6. Zahid, H.; Mahmood, T.; Morshed, A.; Sellis, T. Big data analytics in telecommunications: Literature review and architecture recommendations. IEEE/CAA J. Autom. Sin. 2019, 7, 18–38. [Google Scholar] [CrossRef]
  7. Farabi, S. AI-Driven Predictive Maintenance Model for DWDM Systems to Enhance Fiber Network Uptime in Underserved US Regions. Preprints 2025. [Google Scholar] [CrossRef]
  8. Singh, P. Streamlining telecom customer support with AI-enhanced IVR and chat. Preprints 2025. [Google Scholar] [CrossRef]
  9. Chegini, M.; Abdollahi, M.; Baniasadi, A.; Patooghy, A. Tiny-RFNet: Enabling Modulation Classification of Radio Signals on Edge Systems. In Proceedings of the 2024 5th CPSSI International Symposium on Cyber-Physical Systems (Applications and Theory) (CPSAT), Tehran, Iran, 16–17 October 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–8. [Google Scholar]
  10. Abdollahi, M.; Sabzalizadeh, R.; Javadinia, S.; Mashhadi, S.; Mehrizi, S.S.; Baniasadi, A. Automatic modulation classification for nlos 5g signals with deep learning approaches. In Proceedings of the 2023 10th International Conference on Wireless Networks and Mobile Communications (WINCOM), Istanbul, Turkiye, 26–28 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
  11. Mashhadi, S.; Diyanat, A.; Abdollahi, M.; Baniasadi, A. DSP: A Deep Neural Network Approach for Serving Cell Positioning in Mobile Networks. In Proceedings of the 2023 10th International Conference on Wireless Networks and Mobile Communications (WINCOM), Istanbul, Turkiye, 26–28 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
  12. Sultan, K.; Ali, H.; Zhang, Z. Call detail records driven anomaly detection and traffic prediction in mobile cellular networks. IEEE Access 2018, 6, 41728–41737. [Google Scholar] [CrossRef]
  13. Konstantoulas, I.; Loi, I.; Tsimas, D.; Sgarbas, K.; Gkamas, A.; Bouras, C. A Framework for User Traffic Prediction and Resource Allocation in 5G Networks. Appl. Sci. 2025, 15, 7603. [Google Scholar] [CrossRef]
  14. Edozie, E.; Shuaibu, A.N.; Sadiq, B.O.; John, U.K. Artificial intelligence advances in anomaly detection for telecom networks. Artif. Intell. Rev. 2025, 58, 100. [Google Scholar] [CrossRef]
  15. Dake, D.K. Artificial Intelligence Self-Organising (AI-SON) Frameworks for 5G-Enabled Networks: A Review. J. Comput. Commun. 2023, 11, 33–62. [Google Scholar] [CrossRef]
  16. Chang, V.; Hall, K.; Xu, Q.A.; Amao, F.O.; Ganatra, M.A.; Benson, V. Prediction of customer churn behavior in the telecommunication industry using machine learning models. Algorithms 2024, 17, 231. [Google Scholar] [CrossRef]
  17. Zakaria, A.F.; Lim, S.C.J.; Aamir, M. A pricing optimization modelling for assisted decision making in telecommunication product-service bundling. Int. J. Inf. Manag. Data Insights 2024, 4, 100212. [Google Scholar] [CrossRef]
  18. Bagam, N. Machine learning models for customer segmentation in telecom. J. Sustain. Solut. 2024, 1, 101–115. [Google Scholar] [CrossRef]
  19. Panahi, P.H.; Jalilvand, A.H.; Diyanat, A. Enhancing quality of experience in telecommunication networks: A review of frameworks and machine learning algorithms. arXiv 2024, arXiv:2404.16787. [Google Scholar] [CrossRef]
  20. Smith, J.; Johnson, K.; Brown, R. Wangiri Fraud Pattern Analysis and Machine-Learning-Based Detection. IEEE Access 2023, 11, 89012–89025. [Google Scholar]
  21. Hu, X.; Chen, H.; Chen, H.; Zhang, S.; Liu, S.; Li, X. Telecom fraud detection via imbalanced graph learning. In Proceedings of the 2022 IEEE 22nd International Conference on Communication Technology (ICCT), Nanjing, China, 11–14 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1312–1317. [Google Scholar]
  22. Taylor, L. Telecoms Fraud Costing Operators $40 Billion Annually. Capacity Media. 2023. Available online: https://cfca.org/telecommunications-fraud-increased-12-in-2023-equating-to-an-estimated-38-95-billion-lost-to-fraud/ (accessed on 21 December 2025).
  23. Mawgoud, A.A.; Abu-Talleb, A.; Tawfik, B.S. A Holistic Neural Networks Classification for Wangiri Fraud Detection in Telecommunications Regulatory Authorities. In International Conference on Advanced Machine Learning Technologies and Applications; Springer: Berlin/Heidelberg, Germany, 2021; pp. 175–183. [Google Scholar]
  24. Author1, A.; Author2, B. Detection of Wangiri Telecommunication Fraud Using Ensemble Learning. J. Electron. Eng. Inf. Technol. 2019, 10, 123–135. [Google Scholar]
  25. Mishra, N.; Shivaji, G.B. Data Mining for Fraud Detection in Telecommunications: Detecting Anomalous Behaviors in Real-Time. In Proceedings of the 2025 International Conference on Automation and Computation (AUTOCOM), Dehradun, India, 4–6 March 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1340–1345. [Google Scholar]
  26. Şahin, Y.G.; Duman, E. Detecting credit card fraud by decision trees and support vector machines. In Proceedings of the International MultiConference of Engineers and Computer Scientists 2011; International Association of Engineers: Hong Kong, China, 2011. [Google Scholar]
  27. Arafat, M.; Qusef, A.; Sammour, G. Detection of wangiri telecommunication fraud using ensemble learning. In Proceedings of the 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), Amman, Jordan, 9–11 April 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 330–335. [Google Scholar]
  28. Birhanu, M. Near Real-time SIM-box Fraud Detection in Telecommunication System Using Machine Learning Approach in the Case of Ethio Telecom. Ph.D. Thesis, St. Mary’s University, San Antonio, TX, USA, 2024. [Google Scholar]
  29. Krasić, I.; Čelar, S. Telecom fraud detection with machine learning on imbalanced dataset. In Proceedings of the 2022 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), Split, Croatia, 22–24 September 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar]
  30. Ravi, A.; Msahli, M.; Qiu, H.; Memmi, G.; Bifet, A.; Qiu, M. Wangiri fraud: Pattern analysis and machine-learning-based detection. IEEE Internet Things J. 2022, 10, 6794–6802. [Google Scholar] [CrossRef]
  31. Liang, F.Y.; Li, F.P.; Xu, R.H.; Cheng, W.; Deng, S.X.; Yang, Z.R.; Wang, C.D. Telecom fraud detection based on feature binning and autoencoder. In Proceedings of the 2023 IEEE International Conference on Data Mining (ICDM), Shanghai, China, 1–4 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 368–377. [Google Scholar]
  32. Wahid, A.; Msahli, M.; Bifet, A.; Memmi, G. NFA: A neural factorization autoencoder based online telephony fraud detection. Digit. Commun. Netw. 2024, 10, 158–167. [Google Scholar] [CrossRef]
  33. Cazzolato, M.; Vijayakumar, S.; Lee, M.C.; Vajiac, C.; Park, N.; Fidalgo, P.; Traina, A.J.; Faloutsos, C. Callmine: Fraud detection and visualization of million-scale call graphs. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; pp. 4509–4515. [Google Scholar]
  34. Singh, G.; Singh, P.; Singh, M. Advanced Real-Time Fraud Detection Using RAG-Based LLMs. arXiv 2025, arXiv:2501.15290. [Google Scholar]
  35. Shen, Z.; Wang, K.; Zhang, Y.; Ngai, G.; Fu, E.Y. Combating Phone Scams with LLM-based Detection: Where Do We Stand? (Student Abstract). AAAI Conf. Artif. Intell. 2025, 39, 29487–29489. [Google Scholar] [CrossRef]
  36. Shen, Z.; Yan, S.; Zhang, Y.; Luo, X.; Ngai, G.; Fu, E.Y. It Warned Me Just at the Right Moment: Exploring LLM-based Real-time Detection of Phone Scams. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 26 April–1 May 2025; pp. 1–7. [Google Scholar]
  37. Kirkos, E.; Boskou, G.; Chatzipetrou, E.; Tiakas, E.; Spathis, C. Exploring the Boundaries of Financial Statement Fraud Detection with Large Language Models. 2024. Available online: https://www.researchgate.net/publication/381676241_Exploring_the_Boundaries_of_Financial_Statement_Fraud_Detection_with_Large_Language_Models (accessed on 21 December 2025).
  38. Korkanti, S. Enhancing Financial Fraud Detection Using LLMs and Advanced Data Analytics. In Proceedings of the 2024 2nd International Conference on Self Sustainable Artificial Intelligence Systems (ICSSAS), Erode, India, 23–25 October 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1328–1334. [Google Scholar]
  39. Mundia, J.; Kirimi, E.; Mburu, S.; Kahonge, A.; Chepken, C. Assessing Mobile Network Fraud Threats and Prevention Strategies in Kenya. East Afr. J. Inf. Technol. 2024, 7, 279–300. [Google Scholar] [CrossRef]
  40. Muchilwa, L.; Musuva, P. Coeus: A Cyber Threat Intelligence Sharing Platform for Fraudulent Phone Numbers. In Proceedings of the 2023 IST-Africa Conference (IST-Africa), Istanbul, Turkiye, 26–28 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–10. [Google Scholar]
  41. Bayram, S.; Özkoç, E.E. Regulatory Recommendations for Fraud Problem in The Turkish Telecommunication Sector. AJIT-e Acad. J. Inf. Technol. 2023, 14, 365–376. [Google Scholar] [CrossRef]
  42. Sahaidak, V.; Lysenko, Y.; Senkov, Y. Telecom fraud and its impact on mobile carrier business. Communication 2022, 17–20. [Google Scholar] [CrossRef]
  43. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  44. Cox, D.R. The regression analysis of binary sequences. J. R. Stat. Soc. Ser. B (Methodol.) 1958, 20, 215–232. [Google Scholar] [CrossRef]
  45. Hosmer, D.W.; Lemeshow, S. Applied Logistic Regression, 2nd ed.; John Wiley & Sons: New York, NY, USA, 2000. [Google Scholar]
  46. Kleinbaum, D.G.; Klein, M. Logistic Regression: A Self-Learning Text, 3rd ed.; Springer: New York, NY, USA, 2010. [Google Scholar]
  47. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Wadsworth International Group: Belmont, CA, USA, 1984. [Google Scholar]
  48. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  49. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Available online: http://www.deeplearningbook.org (accessed on 21 December 2025).
  50. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd Acm Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  51. Powers, D.M.W. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation. J. Mach. Learn. Technol. 2011, 2, 37–63. [Google Scholar]
  52. Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
  53. Davis, J.; Goadrich, M. The Relationship Between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 25–29 June 2006; ACM: New York, NY, USA, 2006; pp. 233–240. [Google Scholar]
Figure 1. Overview of Wangiri fraud detection in real life.
Figure 2. Literature review analysis (a) Distribution of telecom fraud detection research by methodology type. (b) Distribution of research focus by fraud mechanism.
Figure 3. Overview of the Wangiri fraud detection pipeline, including data ingestion, feature engineering, balancing, model training, calibration, and evaluation.
Figure 4. Distribution of fraud vs. non-fraud calls (y-axis in millions of calls).
Figure 5. Spearman correlation of features with the fraud label.
Figure 6. Spearman correlation heatmap among all features.
Figure 7. Log-scale distributions of key numerical features.
Figure 9. Call distribution by source and destination time of day, colored by fraud label.
Figure 10. Top 10 caller and destination countries in fraudulent calls.
Figure 11. Analysis of engineered features by fraud status.
Figure 12. callduration vs. cpg_time, colored by fraud status. The number of blue dots is dominant due to the rarity of fraudulent calls.
Figure 13. Training and Evaluation Procedure.
Figure 14. Confusion matrices for Logistic Regression (SMOTE + RUS, no calibration) on raw distribution (left) and balanced (right) test sets.
Figure 15. ROC (left), PR (center), and calibration (right) curves for Logistic Regression (SMOTE + RUS, no calibration) on the raw distribution test set. The dashed lines represent the standard performance of a completely random distribution.
Figure 16. ROC (left), PR (center), and calibration (right) curves for Logistic Regression (SMOTE + RUS, no calibration) on the balanced test set. The dashed lines represent the standard performance of a completely random distribution.
Figure 17. Confusion matrices for Random Forest (SMOTE + RUS, no calibration) on raw distribution (left) and balanced (right) test sets.
Figure 18. ROC (left), PR (center), and calibration (right) curves for Random Forest (SMOTE + RUS, no calibration) on the raw distribution test set. The dashed lines represent the standard performance of a completely random distribution.
Figure 19. ROC (left), PR (center), and calibration (right) curves for Random Forest (SMOTE + RUS, no calibration) on the balanced test set. The dashed lines represent the standard performance of a completely random distribution.
Figure 20. Confusion matrices for XGBoost (SMOTE + RUS, no calibration) on raw distribution (left) and balanced (right) test sets.
Figure 21. ROC (left), PR (center), and calibration (right) curves for XGBoost (SMOTE + RUS, no calibration) on the raw distribution test set. The dashed lines represent the standard performance of a completely random distribution.
Figure 22. ROC (left), PR (center), and calibration (right) curves for XGBoost (SMOTE + RUS, no calibration) on the balanced test set. The dashed lines represent the standard performance of a completely random distribution.
Figure 23. SHAP bar plot for the XGBoost model (SMOTE + RUS, no calibration).
Figure 24. SHAP summary plot for the XGBoost model (SMOTE + RUS, no calibration).
Figure 25. SHAP waterfall plot for the XGBoost model (SMOTE + RUS, no calibration).
Figure 26. Confusion matrices for Decision Tree (SMOTE + RUS, no calibration) on raw distribution (left) and balanced (right) test sets.
Figure 27. ROC (left), PR (center), and calibration (right) curves for Decision Tree (SMOTE + RUS, no calibration) on the raw distribution test set. The dashed lines represent the standard performance of a completely random distribution.
Figure 28. ROC (left), PR (center), and calibration (right) curves for Decision Tree (SMOTE + RUS, no calibration) on the balanced test set. The dashed lines represent the standard performance of a completely random distribution.
Figure 29. Confusion matrices for MLP (SMOTE + RUS, no calibration) on raw distribution (left) and balanced (right) test sets.
Figure 30. ROC (left), PR (center), and calibration (right) curves for MLP (SMOTE + RUS, no calibration) on the raw distribution test set. The dashed lines represent the standard performance of a completely random distribution.
Figure 31. ROC (left), PR (center), and calibration (right) curves for MLP (SMOTE + RUS, no calibration) on the balanced test set. The dashed lines represent the standard performance of a completely random distribution.
Table 1. A comparative analysis of state-of-the-art research papers on telecom fraud detection using AI techniques.
Study | Year | Methodology | Fraud Type | Real-Time | Accuracy | Key Contribution
Sahin et al. [26] | 2011 | Supervised ML | General | No | ∼89% | Benchmark of standard ML models
Arafat et al. [27] | 2019 | Ensemble ML | Wangiri | No | ∼92% | Boosting for missed-call scams
Sahaidak et al. [42] | 2022 | Literature Review | Hybrid | No | - | Industry/hybrid fraud review
Ravi et al. [30] | 2022 | Mixed ML | Wangiri | No | Pattern-dependent | Fraud pattern taxonomy
Krasic and Celar [29] | 2022 | ML + SMOTE | General | No | Improved F1 | Handling imbalanced data
Hu et al. [21] | 2022 | GNN + RL | General | Yes | High precision/recall | Graph imbalance handling
Bayram and Özkoç [41] | 2023 | Regulatory | General (Turkey) | No | - | Policy reform proposals
Cazzolato et al. [33] | 2023 | Graph Visualization | Multi-type | Yes | Analyst-friendly | Visual fraud discovery
Muchilwa et al. [40] | 2023 | Threat Intel Sharing | Phone fraud | Yes | Platform success | Coeus data sharing
Liang et al. [31] | 2023 | Autoencoder + Binning | General | Yes | Beats GNNs | GNN-free modeling
Wahid et al. [32] | 2024 | Neural Autoencoder | General | Yes | 95.45% (F1) | Memory-aware deep streaming
Mundia et al. [39] | 2024 | Policy + Qual. | SIM swap, Wangiri | No | - | Concept drift and biometric proposal
Birhanu [28] | 2024 | RF, NN | SIM-box | Yes | 100% | Real-time slicing with full accuracy
Boskou et al. [37] | 2024 | Prompted LLM | Financial | No | 67% (F1) | Fraud detection in corp. docs
Korkanti [38] | 2024 | LLM + Analytics | Financial (broad) | Yes | High precision | Hybrid anomaly detection
Singh et al. [34] | 2025 | RAG + LLM | Conversational | Yes | 97.98% | Real-time voice + policy detection
Shen et al. (WS *) [35] | 2025 | LLM Evaluation | Conversational | No | ∼99% (RF) | Dataset bias analysis
Shen et al. (IWM **) [36] | 2025 | Real-time LLM | Conversational | Yes | Effective alerts | User scam prevention
* WS: Where Do We Stand; ** IWM: It Warned Me Just at the Right Moment.
Table 2. CDR attributes related to Wangiri Fraud Detection.
CDR Attribute | Stands for | Description
CALLERCC | Calling Party's Country Code | A string indicating the home country code of the calling number.
CALLEDNO | Called Number | A string representing the number that was dialed.
CDRID | Call Detail Record ID | A unique numeric identifier assigned to each CDR entry.
STARTTIME | Start Time | A numeric value denoting the second component of the timestamp of the Initial Address Message (IAM). Unit: seconds.
MILLISEC | Milliseconds | A numeric value indicating the millisecond component of the IAM timestamp. Unit: milliseconds.
OPC | Originating Point Code | A numeric field identifying the signaling point code of the originating network element.
DESTCC | Destination Country Code | A string indicating the home carrier's country code of the called party.
FORMATCALLERNO | Normalized Calling Number | A normalized representation of the caller number.
IAM TIME | IAM Time | A numeric value representing the delay of the Initial Address Message. Unit: milliseconds.
ACM TIME | ACM Time | A numeric field indicating the delay of the Address Complete Message. Unit: milliseconds.
CPG TIME | CPG Time | A numeric field representing the delay of the Call Progress Message. Unit: milliseconds.
CALLDURATION | Call Duration | A numeric field measuring the duration of the call. Unit: seconds/milliseconds.
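The attributes above are the raw material for the heuristic, domain-driven labeling described in the paper. A minimal sketch of how such missed-call rules might be expressed over a CDR record is shown below; the field subset, class layout, and thresholds are illustrative assumptions, not the paper's actual labeling rules:

```python
from dataclasses import dataclass

@dataclass
class CDR:
    callercc: str        # caller's country code (CALLERCC)
    destcc: str          # destination country code (DESTCC)
    callduration: float  # call duration, seconds (CALLDURATION)
    cpg_time: float      # Call Progress Message delay, milliseconds (CPG TIME)

def wangiri_signals(cdr: CDR, max_duration_s: float = 2.0, max_ring_s: float = 5.0) -> dict:
    """Heuristic missed-call indicators; thresholds here are illustrative only."""
    return {
        "very_short_call": cdr.callduration <= max_duration_s,   # hang-up bait call
        "quick_hangup": cdr.cpg_time / 1000.0 <= max_ring_s,     # rang only briefly
        "international": cdr.callercc != cdr.destcc,             # cross-border routing
    }

# A one-second international call that rang briefly trips all three signals
print(wangiri_signals(CDR(callercc="1", destcc="882", callduration=1.2, cpg_time=3500.0)))
```

In a real pipeline these boolean signals would be combined (and tuned against operator knowledge) before being turned into a fraud label.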
Table 3. XGBoost hyperparameter search space.
Hyperparameter | Values Tested
n_estimators | {150, 300}
max_depth | {3, 5, 7}
learning_rate | {0.05, 0.1, 0.2}
subsample | {0.8, 0.9, 1.0}
colsample_bytree | {0.8, 0.9, 1.0}
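The grid above can be enumerated with scikit-learn's ParameterGrid; a minimal sketch (parameter values taken from Table 3, while the surrounding search and scoring code is assumed, not shown in the paper):

```python
from sklearn.model_selection import ParameterGrid

# Hyperparameter search space from Table 3
xgb_grid = {
    "n_estimators": [150, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.2],
    "subsample": [0.8, 0.9, 1.0],
    "colsample_bytree": [0.8, 0.9, 1.0],
}

# Exhaustive enumeration: 2 * 3 * 3 * 3 * 3 = 162 candidate configurations
configs = list(ParameterGrid(xgb_grid))
print(len(configs))  # 162
```

Each configuration would then be fit and scored (e.g., via cross-validated ROC-AUC) to select the final XGBoost model.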
Table 4. Confusion Matrix.
Actual \ Predicted | Fraud | Non-Fraud
Fraud | TP | FN
Non-Fraud | FP | TN
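The accuracy and macro-averaged metrics reported in Tables 5–9 follow directly from the four cells of Table 4. A minimal sketch of the definitions, with fraud as the positive class (the counts in the example are illustrative, not from the paper):

```python
def macro_metrics(tp: int, fn: int, fp: int, tn: int):
    """Accuracy and macro-averaged precision/recall/F1 from a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Per-class scores: fraud is the positive class, non-fraud the negative class.
    prec_fraud, rec_fraud = tp / (tp + fp), tp / (tp + fn)
    prec_nonfraud, rec_nonfraud = tn / (tn + fn), tn / (tn + fp)
    f1 = lambda p, r: 2 * p * r / (p + r)
    precision_macro = (prec_fraud + prec_nonfraud) / 2
    recall_macro = (rec_fraud + rec_nonfraud) / 2
    f1_macro = (f1(prec_fraud, rec_fraud) + f1(prec_nonfraud, rec_nonfraud)) / 2
    return accuracy, precision_macro, recall_macro, f1_macro

# Illustrative counts: 90 frauds caught, 10 missed, 20 false alarms, 9880 correct rejections
acc, p, r, f = macro_metrics(tp=90, fn=10, fp=20, tn=9880)
print(round(acc, 4), round(p, 4), round(r, 4), round(f, 4))  # 0.997 0.9086 0.949 0.9278
```

Note how, on such skewed counts, accuracy stays near 1 while the macro scores expose the cost of the 10 missed frauds and 20 false alarms; this is why the tables report macro averages alongside accuracy.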
Table 5. Summary of metrics for Logistic Regression across all configurations on the label_unique_calls_and_callduration dataset. The best-performing result in the Training Strategy section of the table is highlighted in bold.
Training Strategy | Calibration | Test Set | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) | ROC-AUC/PR-AUC
train_full | none | raw_dist | 0.9992 | 0.4996 | 0.5000 | 0.4998 | 0.8477/0.0063
train_full | none | balanced | 0.5000 | 0.2500 | 0.5000 | 0.3333 | 0.8456/0.8256
train_full | isotonic | raw_dist | 0.9992 | 0.4996 | 0.5000 | 0.4998 | 0.8488/0.0068
train_full | isotonic | balanced | 0.5000 | 0.2500 | 0.5000 | 0.3333 | 0.8469/0.8209
train_full | sigmoid | raw_dist | 0.9992 | 0.4996 | 0.5000 | 0.4998 | 0.8477/0.0063
train_full | sigmoid | balanced | 0.5000 | 0.2500 | 0.5000 | 0.3333 | 0.8456/0.8256
train_smote | none | raw_dist | 0.9934 | 0.5536 | 0.9948 | 0.5951 | 0.9948/0.0727
train_smote | none | balanced | 0.9946 | 0.9946 | 0.9946 | 0.9946 | 0.9949/0.9809
train_smote | isotonic | raw_dist | 0.9992 | 0.4996 | 0.5000 | 0.4998 | 0.9963/0.1135
train_smote | isotonic | balanced | 0.5000 | 0.2500 | 0.5000 | 0.3333 | 0.9965/0.9936
train_smote | sigmoid | raw_dist | 0.9992 | 0.4996 | 0.5000 | 0.4998 | 0.9948/0.0727
train_smote | sigmoid | balanced | 0.5000 | 0.2500 | 0.5000 | 0.3333 | 0.9949/0.9809
train_smote_rus | none | raw_dist | 0.9934 | 0.5536 | 0.9948 | 0.5951 | 0.9948/0.0727
train_smote_rus | none | balanced | 0.9946 | 0.9946 | 0.9946 | 0.9946 | 0.9949/0.9809
train_smote_rus | isotonic | raw_dist | 0.9992 | 0.4996 | 0.5000 | 0.4998 | 0.9963/0.1135
train_smote_rus | isotonic | balanced | 0.5000 | 0.2500 | 0.5000 | 0.3333 | 0.9965/0.9936
train_smote_rus | sigmoid | raw_dist | 0.9992 | 0.4996 | 0.5000 | 0.4998 | 0.9948/0.0727
train_smote_rus | sigmoid | balanced | 0.5000 | 0.2500 | 0.5000 | 0.3333 | 0.9949/0.9809
train_rus | none | raw_dist | 0.9931 | 0.5520 | 0.9944 | 0.5924 | 0.9948/0.0729
train_rus | none | balanced | 0.9944 | 0.9944 | 0.9944 | 0.9944 | 0.9949/0.9810
train_rus | isotonic | raw_dist | 0.9992 | 0.4996 | 0.5000 | 0.4998 | 0.9966/0.1137
train_rus | isotonic | balanced | 0.5000 | 0.2500 | 0.5000 | 0.3333 | 0.9968/0.9938
train_rus | sigmoid | raw_dist | 0.9992 | 0.4996 | 0.5000 | 0.4998 | 0.9948/0.0729
train_rus | sigmoid | balanced | 0.5000 | 0.2500 | 0.5000 | 0.3333 | 0.9949/0.9810
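The train_smote_rus strategy in the tables combines SMOTE-style minority oversampling with random undersampling (RUS) of the majority class. A minimal NumPy sketch of the balancing idea follows; it interpolates random minority pairs rather than k-nearest neighbors, so it is a simplified stand-in for the imbalanced-learn SMOTE used in practice, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_like(X_min: np.ndarray, n_new: int) -> np.ndarray:
    """Synthesize minority samples by interpolating between random minority pairs."""
    i = rng.integers(0, len(X_min), n_new)
    j = rng.integers(0, len(X_min), n_new)
    lam = rng.random((n_new, 1))                 # interpolation weights in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])

def balance(X_maj: np.ndarray, X_min: np.ndarray, target: int = 1000):
    """Oversample the minority up to `target` and undersample the majority down to `target`."""
    X_syn = smote_like(X_min, target - len(X_min))
    keep = rng.choice(len(X_maj), target, replace=False)   # random undersampling
    X_bal = np.vstack([X_maj[keep], X_min, X_syn])
    y_bal = np.concatenate([np.zeros(target), np.ones(target)])
    return X_bal, y_bal

# Toy imbalanced data: 5000 legitimate calls vs. 40 fraudulent ones, 4 features each
X_maj = rng.normal(0, 1, (5000, 4))
X_min = rng.normal(3, 1, (40, 4))
X_bal, y_bal = balance(X_maj, X_min)
print(X_bal.shape, int(y_bal.sum()))  # (2000, 4) 1000
```

Crucially, such balancing is applied only to the training split; the raw_dist rows in the tables evaluate on the untouched, highly imbalanced distribution.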
Table 6. Summary of metrics for Random Forest across all configurations on the label_unique_calls_and_callduration dataset. The best-performing result in the Training Strategy section of the table is highlighted in bold.
Training Strategy | Calibration | Test Set | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) | ROC-AUC/PR-AUC
train_full | none | raw_dist | 0.9999 | 0.9820 | 0.9820 | 0.9820 | 0.99999/0.9943
train_full | none | balanced | 0.9817 | 0.9823 | 0.9817 | 0.9817 | 0.99988/0.99986
train_full | isotonic | raw_dist | 0.9999 | 0.9773 | 0.9863 | 0.9817 | 0.99999/0.9928
train_full | isotonic | balanced | 0.9860 | 0.9864 | 0.9860 | 0.9860 | 0.99989/0.99985
train_full | sigmoid | raw_dist | 0.9999 | 0.9827 | 0.9796 | 0.9811 | 0.99999/0.9943
train_full | sigmoid | balanced | 0.9793 | 0.9801 | 0.9793 | 0.9793 | 0.99988/0.99986
train_smote | none | raw_dist | 0.9998 | 0.8923 | 0.9932 | 0.9370 | 0.99999/0.9836
train_smote | none | balanced | 0.9930 | 0.9931 | 0.9930 | 0.9930 | 0.99994/0.99994
train_smote | isotonic | raw_dist | 0.9999 | 0.9602 | 0.9701 | 0.9651 | 0.99972/0.9795
train_smote | isotonic | balanced | 0.9699 | 0.9715 | 0.9699 | 0.9699 | 0.99967/0.99966
train_smote | sigmoid | raw_dist | 0.9999 | 0.9588 | 0.9704 | 0.9645 | 0.99999/0.9836
train_smote | sigmoid | balanced | 0.9702 | 0.9718 | 0.9702 | 0.9701 | 0.99994/0.99994
train_smote_rus | none | raw_dist | 0.9998 | 0.8946 | 0.9929 | 0.9383 | 0.99999/0.9841
train_smote_rus | none | balanced | 0.9927 | 0.9928 | 0.9927 | 0.9927 | 0.99993/0.99992
train_smote_rus | isotonic | raw_dist | 0.9999 | 0.9751 | 0.9570 | 0.9659 | 0.99972/0.9808
train_smote_rus | isotonic | balanced | 0.9567 | 0.9601 | 0.9567 | 0.9566 | 0.99967/0.99965
train_smote_rus | sigmoid | raw_dist | 0.9999 | 0.9603 | 0.9739 | 0.9670 | 0.99999/0.9841
train_smote_rus | sigmoid | balanced | 0.9737 | 0.9749 | 0.9737 | 0.9736 | 0.99993/0.99992
train_rus | none | raw_dist | 0.9976 | 0.6233 | 0.9988 | 0.6972 | 0.99978/0.6461
train_rus | none | balanced | 0.9984 | 0.9984 | 0.9984 | 0.9984 | 0.99989/0.99984
train_rus | isotonic | raw_dist | 0.9995 | 0.8273 | 0.9068 | 0.8628 | 0.99951/0.6456
train_rus | isotonic | balanced | 0.9070 | 0.9216 | 0.9070 | 0.9062 | 0.99962/0.99958
train_rus | sigmoid | raw_dist | 0.9994 | 0.7963 | 0.9782 | 0.8658 | 0.99978/0.6461
train_rus | sigmoid | balanced | 0.9780 | 0.9788 | 0.9780 | 0.9779 | 0.99989/0.99984
Table 7. Summary of metrics for XGBoost across all configurations on the label_unique_calls_and_callduration dataset. The best-performing result in the Training Strategy section of the table is highlighted in bold.
Training Strategy | Calibration | Test Set | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) | ROC-AUC/PR-AUC
train_full | none | raw_dist | 0.9998 | 0.9925 | 0.8745 | 0.9254 | 0.9899/0.8836
train_full | none | balanced | 0.8745 | 0.8997 | 0.8745 | 0.8725 | 0.9902/0.9926
train_full | isotonic | raw_dist | 0.9998 | 0.9784 | 0.9172 | 0.9457 | 0.9899/0.8757
train_full | isotonic | balanced | 0.9172 | 0.9290 | 0.9172 | 0.9166 | 0.9902/0.9926
train_full | sigmoid | raw_dist | 0.9998 | 0.9922 | 0.8763 | 0.9265 | 0.9785/0.8817
train_full | sigmoid | balanced | 0.8763 | 0.9009 | 0.8763 | 0.8744 | 0.9787/0.9792
train_smote | none | raw_dist | 0.9996 | 0.8752 | 0.8695 | 0.8723 | 0.9999/0.8485
train_smote | none | balanced | 0.8694 | 0.8962 | 0.8694 | 0.8671 | 0.9996/0.9994
train_smote | isotonic | raw_dist | 0.9996 | 0.8572 | 0.9327 | 0.8913 | 0.9996/0.8327
train_smote | isotonic | balanced | 0.9325 | 0.9404 | 0.9325 | 0.9322 | 0.9994/0.9991
train_smote | sigmoid | raw_dist | 0.9996 | 0.8955 | 0.8190 | 0.8532 | 0.9999/0.8485
train_smote | sigmoid | balanced | 0.8188 | 0.8667 | 0.8188 | 0.8127 | 0.9996/0.9994
train_smote_rus | none | raw_dist | 0.9997 | 0.8961 | 0.8897 | 0.8929 | 0.9999/0.8734
train_smote_rus | none | balanced | 0.8895 | 0.9093 | 0.8895 | 0.8882 | 0.9996/0.9993
train_smote_rus | isotonic | raw_dist | 0.9997 | 0.8981 | 0.8838 | 0.8908 | 0.9991/0.8591
train_smote_rus | isotonic | balanced | 0.8836 | 0.9054 | 0.8836 | 0.8820 | 0.9988/0.9986
train_smote_rus | sigmoid | raw_dist | 0.9996 | 0.9140 | 0.8499 | 0.8793 | 0.9999/0.8734
train_smote_rus | sigmoid | balanced | 0.8497 | 0.8842 | 0.8497 | 0.8463 | 0.9996/0.9993
train_rus | none | raw_dist | 0.9980 | 0.6451 | 0.9988 | 0.7244 | 0.9996/0.6024
train_rus | none | balanced | 0.9981 | 0.9981 | 0.9981 | 0.9981 | 0.9996/0.9996
train_rus | isotonic | raw_dist | 0.9994 | 0.7980 | 0.8783 | 0.8333 | 0.9996/0.5864
train_rus | isotonic | balanced | 0.8785 | 0.9022 | 0.8785 | 0.8767 | 0.9996/0.9994
train_rus | sigmoid | raw_dist | 0.9992 | 0.4996 | 0.5000 | 0.4998 | 0.9996/0.6024
train_rus | sigmoid | balanced | 0.5000 | 0.2500 | 0.5000 | 0.3333 | 0.9996/0.9996
Table 8. Summary of metrics for Decision Tree across all configurations on the label_unique_calls_and_callduration dataset. The best-performing result in the Training Strategy section of the table is highlighted in bold.
| Training Strategy | Calibration | Test Set | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) | ROC-AUC/PR-AUC |
|---|---|---|---|---|---|---|---|
| train_full | none | raw_dist | 0.99998 | 0.9925 | 0.9970 | 0.9948 | 0.9970/0.9793 |
| train_full | none | balanced | 0.9970 | 0.9971 | 0.9970 | 0.9970 | 0.9970/0.9970 |
| train_full | isotonic | raw_dist | 0.99998 | 0.9925 | 0.9970 | 0.9948 | 0.9970/0.9793 |
| train_full | isotonic | balanced | 0.9970 | 0.9971 | 0.9970 | 0.9970 | 0.9970/0.9970 |
| train_full | sigmoid | raw_dist | 0.99998 | 0.9925 | 0.9970 | 0.9948 | 0.9970/0.9793 |
| train_full | sigmoid | balanced | 0.9970 | 0.9971 | 0.9970 | 0.9970 | 0.9970/0.9970 |
| train_smote | none | raw_dist | 0.9999 | 0.9752 | 0.9900 | 0.9825 | 0.9900/0.9316 |
| train_smote | none | balanced | 0.9901 | 0.9902 | 0.9901 | 0.9901 | 0.9901/0.9901 |
| train_smote | isotonic | raw_dist | 0.9999 | 0.9752 | 0.9900 | 0.9825 | 0.9900/0.9316 |
| train_smote | isotonic | balanced | 0.9901 | 0.9902 | 0.9901 | 0.9901 | 0.9901/0.9901 |
| train_smote | sigmoid | raw_dist | 0.9999 | 0.9752 | 0.9900 | 0.9825 | 0.9900/0.9316 |
| train_smote | sigmoid | balanced | 0.9901 | 0.9902 | 0.9901 | 0.9901 | 0.9901/0.9901 |
| train_smote_rus | none | raw_dist | 0.9999 | 0.9752 | 0.9900 | 0.9825 | 0.9900/0.9316 |
| train_smote_rus | none | balanced | 0.9901 | 0.9902 | 0.9901 | 0.9901 | 0.9901/0.9901 |
| train_smote_rus | isotonic | raw_dist | 0.9999 | 0.9752 | 0.9900 | 0.9825 | 0.9900/0.9316 |
| train_smote_rus | isotonic | balanced | 0.9901 | 0.9902 | 0.9901 | 0.9901 | 0.9901/0.9901 |
| train_smote_rus | sigmoid | raw_dist | 0.9999 | 0.9752 | 0.9900 | 0.9825 | 0.9900/0.9316 |
| train_smote_rus | sigmoid | balanced | 0.9901 | 0.9902 | 0.9901 | 0.9901 | 0.9901/0.9901 |
| train_rus | none | raw_dist | 0.9991 | 0.7388 | 0.9993 | 0.8230 | 0.9993/0.4774 |
| train_rus | none | balanced | 0.9989 | 0.9989 | 0.9989 | 0.9989 | 0.9989/0.9981 |
| train_rus | isotonic | raw_dist | 0.9992 | 0.4996 | 0.5000 | 0.4998 | 0.9993/0.4774 |
| train_rus | isotonic | balanced | 0.5000 | 0.2500 | 0.5000 | 0.3333 | 0.9989/0.9981 |
| train_rus | sigmoid | raw_dist | 0.9992 | 0.4996 | 0.5000 | 0.4998 | 0.9993/0.4774 |
| train_rus | sigmoid | balanced | 0.5000 | 0.2500 | 0.5000 | 0.3333 | 0.9989/0.9981 |
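The `train_rus` strategy behind the Decision Tree's strongest balanced-test rows can be illustrated with a minimal sketch: undersample the majority class at random until it matches the minority class, then fit a tree. This is not the paper's pipeline — the data below are synthetic stand-ins for CDR features, and the undersampling is implemented by hand with NumPy rather than a dedicated imbalance library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)

# Synthetic stand-in for an imbalanced CDR feature matrix (~1% positive class).
X, y = make_classification(n_samples=20000, n_features=10,
                           weights=[0.99, 0.01], random_state=42)
X_train, X_test = X[:15000], X[15000:]
y_train, y_test = y[:15000], y[15000:]

# Random undersampling ("train_rus"): keep every minority sample plus an
# equally sized random subset of the majority class.
minority = np.flatnonzero(y_train == 1)
majority = np.flatnonzero(y_train == 0)
keep = np.concatenate(
    [minority, rng.choice(majority, size=minority.size, replace=False)])

clf = DecisionTreeClassifier(random_state=42).fit(X_train[keep], y_train[keep])
macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
print(f"macro-F1 on the held-out (still imbalanced) test set: {macro_f1:.3f}")
```

Note the trade-off the tables make visible: training on a balanced subset boosts minority-class recall, but macro precision on the raw distribution can drop sharply, which is why `train_rus` shines on the balanced test set yet lags on `raw_dist`.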
Table 9. Summary of metrics for MLP across all configurations on the label_unique_calls_and_callduration dataset. The best-performing result in the Training Strategy section of the table is highlighted in bold.
| Training Strategy | Calibration | Test Set | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) | ROC-AUC/PR-AUC |
|---|---|---|---|---|---|---|---|
| train_full | none | raw_dist | 0.9997 | 0.9681 | 0.8175 | 0.8783 | 0.9994/0.8816 |
| train_full | none | balanced | 0.8175 | 0.8663 | 0.8175 | 0.8112 | 0.9993/0.9995 |
| train_full | isotonic | raw_dist | 0.9997 | 0.9661 | 0.8196 | 0.8792 | 0.9996/0.8692 |
| train_full | isotonic | balanced | 0.8196 | 0.8674 | 0.8196 | 0.8136 | 0.9995/0.9994 |
| train_full | sigmoid | raw_dist | 0.9997 | 0.9595 | 0.8218 | 0.8785 | 0.9996/0.8816 |
| train_full | sigmoid | balanced | 0.8218 | 0.8686 | 0.8218 | 0.8159 | 0.9995/0.9995 |
| train_smote | none | raw_dist | 0.9993 | 0.7677 | 0.9980 | 0.8481 | 0.9999/0.8801 |
| train_smote | none | balanced | 0.9978 | 0.9979 | 0.9978 | 0.9978 | 0.9998/0.9998 |
| train_smote | isotonic | raw_dist | 0.9996 | 0.9063 | 0.8572 | 0.8802 | 0.9996/0.8652 |
| train_smote | isotonic | balanced | 0.8570 | 0.8885 | 0.8570 | 0.8540 | 0.9995/0.9994 |
| train_smote | sigmoid | raw_dist | 0.9994 | 0.7934 | 0.9919 | 0.8675 | 0.9999/0.8801 |
| train_smote | sigmoid | balanced | 0.9917 | 0.9918 | 0.9917 | 0.9917 | 0.9998/0.9998 |
| train_smote_rus | none | raw_dist | 0.9993 | 0.7697 | 0.9975 | 0.8497 | 0.9999/0.8476 |
| train_smote_rus | none | balanced | 0.9973 | 0.9973 | 0.9973 | 0.9973 | 0.9997/0.9997 |
| train_smote_rus | isotonic | raw_dist | 0.9996 | 0.8765 | 0.8617 | 0.8690 | 0.9969/0.8304 |
| train_smote_rus | isotonic | balanced | 0.8616 | 0.8913 | 0.8616 | 0.8589 | 0.9967/0.9966 |
| train_smote_rus | sigmoid | raw_dist | 0.9995 | 0.8080 | 0.9912 | 0.8786 | 0.9999/0.8476 |
| train_smote_rus | sigmoid | balanced | 0.9909 | 0.9910 | 0.9909 | 0.9909 | 0.9997/0.9997 |
| train_rus | none | raw_dist | 0.9953 | 0.5722 | 0.9974 | 0.6250 | 0.9991/0.3102 |
| train_rus | none | balanced | 0.9973 | 0.9973 | 0.9973 | 0.9973 | 0.9990/0.9972 |
| train_rus | isotonic | raw_dist | 0.9992 | 0.4996 | 0.5000 | 0.4998 | 0.9988/0.3140 |
| train_rus | isotonic | balanced | 0.5000 | 0.2500 | 0.5000 | 0.3333 | 0.9989/0.9976 |
| train_rus | sigmoid | raw_dist | 0.9992 | 0.4996 | 0.5000 | 0.4998 | 0.9991/0.3102 |
| train_rus | sigmoid | balanced | 0.5000 | 0.2500 | 0.5000 | 0.3333 | 0.9990/0.9972 |
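The Calibration column in these tables contrasts uncalibrated probabilities with isotonic and sigmoid (Platt) recalibration. The sketch below shows how such a comparison could be run with scikit-learn's `CalibratedClassifierCV`; the synthetic data, logistic-regression base learner, and Brier-score evaluation are illustrative assumptions, not the paper's exact setup.

```python
from sklearn.datasets import make_classification
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Mildly imbalanced synthetic data (~10% positives) as a stand-in for CDRs.
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

scores = {}
for method in ("sigmoid", "isotonic"):
    # Wrap the base learner; probabilities are recalibrated via internal CV.
    cal = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                                 method=method, cv=3).fit(X_tr, y_tr)
    scores[method] = brier_score_loss(y_te, cal.predict_proba(X_te)[:, 1])

for method, brier in scores.items():
    print(f"{method:8s} Brier score: {brier:.4f}")  # lower is better
```

One caveat the tables also reflect: calibration reshapes probabilities, not the ranking, so it can leave ROC-AUC nearly unchanged while shifting thresholded metrics (accuracy, macro F1) substantially — and with very few minority samples per fold, isotonic calibration can degenerate, as in the 0.3333 macro-F1 rows.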
Table 10. Summary of best model performances on the raw distribution test set (selected by highest PR-AUC). The best-performing result in the table is highlighted in bold.
| Model | Training Strategy | Calibration | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|---|
| Logistic Regression | train_rus | isotonic | 0.9992 | 0.4996 | 0.5000 | 0.4998 | 0.9966 | 0.1137 |
| **Random Forest** | **train_full** | **none/sigmoid** | **0.9999** | **0.9820/0.9827** | **0.9820/0.9796** | **0.9820/0.9811** | **0.99999** | **0.9943** |
| XGBoost | train_full | none | 0.9998 | 0.9925 | 0.8745 | 0.9254 | 0.9899 | 0.8836 |
| Decision Tree | train_full | none/isotonic/sigmoid | 0.99998 | 0.9925 | 0.9970 | 0.9948 | 0.9970 | 0.9793 |
| MLP | train_full | none/sigmoid | 0.9997 | 0.9681/0.9595 | 0.8175/0.8218 | 0.8783/0.8785 | 0.9994/0.9996 | 0.8816 |
Table 11. Summary of best model performances on the balanced test set (selected by highest macro-averaged F1-score). The best-performing result in the table is highlighted in bold.
| Model | Training Strategy | Calibration | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|---|
| Logistic Regression | train_smote/train_smote_rus | none | 0.9946 | 0.9946 | 0.9946 | 0.9946 | 0.9949 | 0.9809 |
| Random Forest | train_rus | none | 0.9984 | 0.9984 | 0.9984 | 0.9984 | 0.99989 | 0.99984 |
| XGBoost | train_rus | none | 0.9981 | 0.9981 | 0.9981 | 0.9981 | 0.9996 | 0.9996 |
| **Decision Tree** | **train_rus** | **none** | **0.9989** | **0.9989** | **0.9989** | **0.9989** | **0.9989** | **0.9981** |
| MLP | train_smote | none | 0.9978 | 0.9979 | 0.9978 | 0.9978 | 0.9998 | 0.9998 |
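Tables 10 and 11 pick each model's best configuration by PR-AUC on the raw distribution and by macro-averaged F1 on the balanced test set, respectively. The selection logic can be sketched as below on synthetic data with two stand-in candidates; `average_precision_score` serves as the PR-AUC estimate, and the models, data, and hyperparameters are illustrative, not the paper's tuned configurations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (~5% positives).
X, y = make_classification(n_samples=6000, n_features=12,
                           weights=[0.95, 0.05], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    results[name] = {
        # Threshold-free metric, robust to heavy class imbalance.
        "pr_auc": average_precision_score(y_te, proba),
        # Thresholded metric, weighting both classes equally.
        "macro_f1": f1_score(y_te, model.predict(X_te), average="macro"),
    }

best_by_pr_auc = max(results, key=lambda n: results[n]["pr_auc"])
best_by_macro_f1 = max(results, key=lambda n: results[n]["macro_f1"])
print("best by PR-AUC:  ", best_by_pr_auc)
print("best by macro-F1:", best_by_macro_f1)
```

Using PR-AUC rather than ROC-AUC for the raw-distribution ranking matters here: with a ~0.1% fraud rate, ROC-AUC stays near 1.0 for almost every configuration, while PR-AUC (e.g., Logistic Regression's 0.1137 vs. Random Forest's 0.9943 in Table 10) exposes the real gap in minority-class precision.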
Table 12. Comparison of Wangiri fraud detection studies with the proposed approach. The row for the proposed method is highlighted in bold.
| Study | Methodology | RT | Best Result | Key Findings/Contribution |
|---|---|---|---|---|
| Sahin et al. (2011) [26] | Supervised ML (DT, SVM) | No | Acc: ∼89% | Established baselines for general telecom fraud. Highlighted the need for feature engineering but lacked specific handling for Wangiri patterns. |
| Arafat et al. (2019) [27] | Ensemble ML (RF, AdaBoost, XGB) | No | XGB: Acc. 99.4%, F1 0.96 | Showed ensemble ML efficiency on labeled CDRs but lacked imbalance handling and unlabeled data support. The proposed work improves performance (PR-AUC 0.9943, F1 0.998) via SMOTE/RUS balancing. |
| Sahaidak et al. (2022) [42] | Literature Review | No | N/A | Provided a taxonomy of hybrid fraud types but lacked a computational implementation. Our work operationalizes these concepts into a functional detection pipeline. |
| Ravi et al. (2022) [30] | Mixed ML (SVM, RF, MLP, IF) | No | RF: Acc. 99%, F1 0.97 | Defined Wangiri taxonomy and compared supervised vs. unsupervised models. The proposed pipeline achieves F1 0.998 (+2.8%) and introduces pseudo-labeling and SHAP explainability. |
| Mundia et al. (2024) [39] | Policy/Qualitative Review | No | – | Identified gaps in current fraud management (manual processes, lack of automation, concept drift). The proposed work operationalizes these insights via an automated, drift-aware ML pipeline. |
| **Proposed Method** | **ML pipeline (Ensemble) + SMOTE/RUS + SHAP** | **Yes** | **RF: PR-AUC 0.9943, F1 0.998** | **Outperforms all prior studies by combining unsupervised labeling with optimized ensembles. Offers the highest reported precision/recall balance on imbalanced data while maintaining interpretability.** |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Balouchi, A.; Abdollahi, M.; Eskandarian, A.; Karimi Pour Kerman, K.; Majd, E.; Azouji, N.; Baniasadi, A. Wangiri Fraud Detection: A Comprehensive Approach to Unlabeled Telecom Data. Future Internet 2026, 18, 15. https://doi.org/10.3390/fi18010015


