1. Introduction
Rising voice scams, especially against the elderly, endanger public safety and digital trust. Anti-fraud agencies report billions in annual losses from voice scams. Traditional security measures struggle against these real-time, persuasive scams. Many current fraud detection systems rely on cloud-based post-call analysis, which makes them too slow to intervene during a call. Deploying such systems in sensitive areas is also difficult due to privacy concerns. Rule-based and keyword matching approaches are fragile against paraphrasing, language variation, and evolving scam tactics.
To address these limitations, we propose a novel segment-based voice scam detection system designed for on-device deployment using edge AI and offline NLP techniques. The system continuously captures short audio segments (e.g., 3 s), transcribes them offline with a Whisper Tiny model, and semantically analyzes each segment using DeepSeek-V3 Embedding. By comparing the embedded segments with a curated library of scam-intent phrases, the system evaluates threat likelihood through a voting-based consensus mechanism across recent segments. This design targets early scam detection with few false positives while maintaining real-time responsiveness and fully local data processing. The main contributions of this work are as follows. The resource-efficient design allows use on ESP32 and smartphones. Incremental analysis of short speech segments improves alert speed and detail. A multi-segment voting system enhances robustness by mitigating false alarms from ambiguous phrases. Offline processing protects user privacy. A schematic diagram of the system operation is shown in Figure 1.
2. Related Work
The detection of fraudulent calls using AI technologies has attracted substantial attention in recent years, particularly with the rise of real-time telecommunication scams targeting vulnerable populations. Previous research spans multiple sub-domains, including automatic speech recognition (ASR), voice activity detection (VAD), semantic analysis, and on-device deployment [1, 2, 3]. We combine segment-based voice analysis with lightweight semantic voting on edge AI systems, achieving a novel balance between inference accuracy, latency, and power efficiency.
Researchers have addressed the problem of real-time scam detection in telephony. In ScamDetector [4], Nicholas and Ng fine-tuned large language models (LLMs) such as GPT-2 and LLaMA-3 on synthesized scam dialogues for fraud detection. Malhotra et al. [5] proposed a hybrid support vector machine–recurrent neural network (SVM-RNN) architecture for classifying fraudulent calls using both acoustic and metadata features. Similarly, Hong et al. [6] created a pipeline integrating Google Speech-to-Text and LSTM classifiers on a custom speech corpus. In contrast, our system infers at the segment level, independent of full-call context, and runs speech recognition locally for real-time, privacy-focused operation.
The TeleAntiFraud-28k dataset [7] provides more than 28,000 audio–text pairs for slow-thinking reasoning. It incorporates real-world call data, LLM-generated augmentations, and adversarial dialogues. While comprehensive, its models require considerable computing power and are designed for cloud inference. For low-power systems, we instead use a lightweight phrase library and a voting mechanism for practical deployment.
Rao et al. [8] demonstrated a super learner ensemble for phishing detection on mobile devices, emphasizing fast response time and compact models. Reddy and Pallerla [9] presented a mobile AI framework using behavioral and semantic indicators to detect suspicious messages. In contrast, we use voice-focused inference with cumulative scoring for on-device deployment. Tan et al. proposed rVAD [10], a segment-based, unsupervised VAD method designed for robustness in real-world conditions. Their findings reinforce the value of robust segmentation for inference. We incorporate fixed-length segmentation (e.g., 3 s) to ensure latency control while maintaining the semantic completeness of each utterance.
A knowledge-infused semantic graph system for scam detection was proposed in [5], offering explainability and ontology integration. Ma et al. [7] introduced AntiFraud-Qwen2Audio with structured multi-agent reasoning. In contrast, our system avoids symbolic reasoning overhead by implementing segment-level semantic embedding and a voting logic that approximates consensus with minimal resources. Oļeiņiks and Solodovņikova [11] presented a real-time fraud detection system using Whisper and Gemini 1.5, with cloud dependencies and high accuracy. Our system achieves similar goals while ensuring offline capability, operating independently of the internet infrastructure.
Building on these prior results, we developed a unified fraud detection system. Whisper’s ASR ensures consistent input, and DeepSeek detects scam phrases early via semantic embedding. The design pairs an ESP32 with an offline mobile app. This integration enables cost-effective, real-time scam detection while maintaining privacy, interpretability, and user trust.
3. System Architecture
A lightweight edge AI system that combines speech analysis and semantic voting detects voice scams in low-resource environments. A hybrid pipeline acquires audio segments, transmits them over BLE, transcribes them with ASR, and classifies the result. The architecture is privacy-focused and intended for offline embedded deployment.
3.1. Hardware Component
We chose the ESP32-WROOM-32U (Espressif Systems, Shanghai, China) for its low power consumption, Wi-Fi/Bluetooth connectivity, and edge features. The ESP32 receives digital audio directly from an INMP441 MEMS I2S microphone (InvenSense, San Jose, CA, USA). The BLE-enabled system runs on a 3.7 V LiPo battery. A mobile phone executes both the Whisper ASR model and the semantic inference engine. The schematic of the system architecture is presented in Figure 2.
Key hardware components include the following.
ESP32-WROOM-32U: Edge controller for audio recording and BLE transmission.
INMP441: High-sensitivity I2S microphone for real-time digital audio capture.
BLE Module: Transmits brief audio to mobile device.
Mobile Device: Hosts Whisper ASR model and semantic inference engine.
Optional: LED indicator and speaker for user alerts and feedback.
3.1.1. Audio Segmentation and Data Transmission
Audio data is continuously captured using the ESP32’s I2S interface and stored in buffers for a predetermined duration. This segmentation strategy guarantees predictable latency and permits the independent semantic analysis of each segment. Once a segment is complete, it is immediately transmitted via Bluetooth Low Energy (BLE) to a paired mobile device for further processing. The design of the segmentation and transmission pipeline allows for the use of either overlapping or non-overlapping windows, a feature that provides flexibility for future adjustments to the balance between sensitivity and redundancy.
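As a minimal sketch of the windowing described above, the following Python function splits a buffer of samples into fixed-length segments, with an optional hop size that produces overlapping windows. The function name `segment_samples`, its parameters, and the toy sample rate are illustrative assumptions and do not reproduce the ESP32 firmware; trailing samples that do not fill a whole window are simply dropped in this sketch.

```python
from typing import List, Optional, Sequence


def segment_samples(
    samples: Sequence[int],
    sample_rate: int,
    window_s: float,
    hop_s: Optional[float] = None,
) -> List[List[int]]:
    """Split samples into fixed-length windows.

    hop_s=None gives non-overlapping windows; hop_s < window_s gives
    overlapping windows. All names here are hypothetical.
    """
    window = int(round(window_s * sample_rate))
    hop = window if hop_s is None else int(round(hop_s * sample_rate))
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must cover at least one sample")

    segments = []
    # Stop once a full window no longer fits in the remaining samples.
    for start in range(0, len(samples) - window + 1, hop):
        segments.append(list(samples[start:start + window]))
    return segments


# Toy input: 10 samples at 1 sample per second (not a realistic audio rate).
toy = list(range(10))
non_overlapping = segment_samples(toy, sample_rate=1, window_s=4)
overlapping = segment_samples(toy, sample_rate=1, window_s=4, hop_s=2)
```

When `hop_s` is omitted the windows do not overlap; a smaller hop trades extra transmissions for a lower chance of splitting a phrase across a segment boundary.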
3.1.2. ASR and Semantic Inference on Mobile Devices
After receiving a segment, the mobile device processes it in a fixed sequence of steps. The audio segment is first transcribed using a local Whisper model, whose size (base, tiny, etc.) can be varied. This ensures that speech recognition operates independently of internet connectivity. After transcription, a vector representation of the text is generated using the DeepSeek Embedding method. The segment vector is then compared against a pre-compiled scam phrase library through cosine similarity measurements. If the comparison produces a similarity score above the predefined threshold (e.g., 0.80), we flag and store the result. If multiple segments surpass this threshold, the system activates a fraud alert, as shown in Figure 3.
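The comparison step can be illustrated with a short, self-contained sketch. It assumes that embeddings are already available as plain lists of floats; the 2-D vectors below are toy values, not real DeepSeek embeddings, and the helper names `cosine_similarity`, `best_match`, and `is_suspicious` are hypothetical. The 0.80 threshold follows the example value given in the text, and the comparison here is inclusive (at or above the threshold).

```python
import math
from typing import Iterable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def best_match(segment_vec: Vector, library: Iterable[Vector]) -> float:
    """Highest similarity between the segment and any library phrase."""
    return max((cosine_similarity(segment_vec, v) for v in library), default=0.0)


def is_suspicious(segment_vec: Vector, library: Iterable[Vector],
                  threshold: float = 0.80) -> bool:
    """Flag the segment if its best library match reaches the threshold."""
    return best_match(segment_vec, library) >= threshold


# Toy library of two "scam phrase" vectors (illustrative values only).
toy_library = [[1.0, 0.0], [0.0, 1.0]]
close_segment = [1.0, 0.1]   # nearly parallel to the first library vector
mixed_segment = [1.0, 1.0]   # 45 degrees from both library vectors
```

A segment is flagged when it is close to at least one library entry, so the library size affects cost linearly but not the decision rule.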
3.2. Scam Phrase Library and Embedding Strategy
The scam detection module leverages a refined library of authentic Chinese-language scam transcripts encompassing financial fraud, phishing, loan scams, and identity theft schemes. All phrases within the library utilize the same DeepSeek model for embedding and inference, thereby guaranteeing vector space comparison consistency and compatibility. A segment is identified as potentially fraudulent if its embedding similarity surpasses a predefined threshold when compared to any element within the library. The sensitivity and false positive rate may be modulated through adjustment of voting thresholds (for example, a minimum of three out of five segments).
3.3. Offline and On-Device Design
The entire system is meticulously designed for offline operation, ensuring complete independence from cloud-based solutions. The audio acquisition and BLE transmission tasks are handled by the ESP32, while the Whisper ASR and semantic similarity analysis are executed locally on the mobile device. Importantly, the system refrains from using cloud APIs, persistent data storage, or any form of user identity tracking, thereby safeguarding user privacy. This makes the system highly applicable to privacy-critical applications, such as fraud prevention for elderly users or in areas with limited connectivity.
4. Detection Strategy
To balance inference speed, detection accuracy, and energy efficiency, we evaluate four distinct strategies for voice scam detection. Each strategy operates on transcribed voice content but differs in segmentation granularity, inference timing, and semantic judgment mechanisms. These strategies range from whole-conversation classification to segment-wise voting with semantic embedding.
4.1. Full-Transcript Semantic Classification (A1)
In this baseline strategy, the entire conversation is transcribed using Whisper, and scam detection is performed by applying a single semantic comparison between the full transcript and a curated scam phrase library. The inference mode for this approach operates post-conversation. By utilizing the embedding vector of the full transcript and comparing it with scam phrases, this strategy presents high context completeness with minimal risks associated with fragmentation. However, it comes with notable limitations, including high latency and unsuitability for real-time alerting. This method is most effectively applied in scenarios such as cloud-based analysis or forensic fraud auditing.
4.2. Segment-Level Keyword Matching (A2)
This strategy processes fixed 5 s segments individually by identifying recognized keywords from a predefined scam keyword list. The inference mode operates in real time, analyzing each segment as it is processed. The similarity method relies on token-based keyword matching, targeting specific terms such as “loan,” “verify,” or “install app.” The primary advantage of this approach is its speed and minimal computational requirements, making it the fastest inference method available. However, it has notable limitations, including a high false positive rate and low adaptability to paraphrased or nuanced language. This method is best suited for low-computing, rule-based systems where moderate accuracy is acceptable.
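A minimal sketch of token-based keyword matching is shown below, using the three English example terms from the text. The tokenizer and the function name `keyword_hits` are illustrative assumptions; the Mandarin transcripts used in this study would need a word segmenter in place of the simple regular expression used here.

```python
import re
from typing import List, Sequence

# Example terms quoted in the text; a real list would be much longer.
SCAM_KEYWORDS = ("loan", "verify", "install app")


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens (a simplification for English text)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_hits(transcript: str,
                 keywords: Sequence[str] = SCAM_KEYWORDS) -> List[str]:
    """Return the keywords whose tokens appear contiguously in the transcript."""
    tokens = tokenize(transcript)
    hits = []
    for keyword in keywords:
        kw_tokens = tokenize(keyword)
        n = len(kw_tokens)
        if n and any(tokens[i:i + n] == kw_tokens
                     for i in range(len(tokens) - n + 1)):
            hits.append(keyword)
    return hits


exact = keyword_hits("Please install app now to verify your account")
paraphrased = keyword_hits("Please install the app to confirm your account")
```

The second call returns no hits even though the intent is unchanged, which illustrates the low adaptability to paraphrased language noted above.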
4.3. Segment-Level Semantic Similarity (A3)
Each 5 s segment is transcribed and semantically embedded using DeepSeek, and the segment vector is compared with those in a scam phrase embedding library via cosine similarity. Operating in real time, this method is robust against variations in phrasing and offers improved precision. However, it may sometimes overreact to isolated segments that are semantically similar but benign. This approach is particularly suited for intermediate deployment on mobile or embedded systems.
4.4. Semantic Voting Across Segments (A4)
This approach builds on the third method (A3) by consolidating similarity metrics across consecutive segments, thereby establishing a voting system for identifying scams, as shown in Figure 4.
A scam is flagged only when a threshold number of segments, such as three out of five, surpasses the semantic similarity benchmark, for instance, a cosine similarity of greater than 0.80. Operating in real time, this approach combines segmental and cumulative inference modes to suppress isolated false positives while facilitating early and precise warnings. It is computationally lightweight, interpretable, and particularly effective for embedded real-time scam detection. However, it requires careful tuning of both the segment count and the voting threshold to ensure optimal performance.
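The voting rule can be sketched as a sliding window over per-segment similarity scores. The class name `SegmentVoter` and its interface are hypothetical; the defaults (three of the five most recent segments, similarity of at least 0.80) follow the example values in the text, and the per-segment score is assumed to be computed elsewhere.

```python
from collections import deque
from typing import Deque


class SegmentVoter:
    """Raise an alert when enough recent segments exceed the threshold."""

    def __init__(self, window: int = 5, min_votes: int = 3,
                 threshold: float = 0.80) -> None:
        if not 0 < min_votes <= window:
            raise ValueError("min_votes must be between 1 and window")
        self.min_votes = min_votes
        self.threshold = threshold
        # deque(maxlen=...) discards the oldest vote automatically.
        self.votes: Deque[bool] = deque(maxlen=window)

    def update(self, similarity: float) -> bool:
        """Record one segment score and return True if an alert should fire."""
        self.votes.append(similarity >= self.threshold)
        return sum(self.votes) >= self.min_votes


# Toy score sequences (illustrative values, not measured data).
voter = SegmentVoter()
scam_like = [voter.update(s) for s in [0.85, 0.40, 0.90, 0.30, 0.82]]

voter = SegmentVoter()
isolated_spike = [voter.update(s) for s in [0.90, 0.10, 0.10, 0.10, 0.10, 0.10]]
```

In the first sequence the alert fires only on the fifth segment, once three votes have accumulated; in the second, a single high-scoring segment never triggers an alert, which is the false-positive suppression the strategy is designed for.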
All strategies rely on Whisper for ASR and DeepSeek for semantic embedding. A2 is the only strategy that does not require vector computation. Strategies A3 and A4 require a small in-memory scam vector database, typically under 2 MB, making them feasible for mobile or lightweight embedded deployment. A4 may provide a good balance between response speed and stability. The voting mechanism significantly reduces false positives caused by one-off ambiguous phrases, and its cumulative logic aligns closely with how human listeners recognize manipulation across evolving speech patterns.
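The memory figure can be related to library size with simple arithmetic. The sketch below assumes 4096-dimensional embeddings (the dimension reported in the experiment configuration) stored as 32-bit floats and reads "2 MB" as 2 MiB; the storage precision and the number of library phrases are not stated in the text, so both are assumptions, and the function names are hypothetical.

```python
def library_bytes(num_phrases: int, dim: int = 4096,
                  bytes_per_value: int = 4) -> int:
    """Raw size of an embedding library without any index overhead."""
    return num_phrases * dim * bytes_per_value


def max_phrases(budget_bytes: int, dim: int = 4096,
                bytes_per_value: int = 4) -> int:
    """Largest number of phrase vectors that fits in the given budget."""
    return budget_bytes // (dim * bytes_per_value)


BUDGET = 2 * 1024 * 1024  # assumption: "2 MB" interpreted as 2 MiB

per_vector = library_bytes(1)                            # one float32 vector
fits_float32 = max_phrases(BUDGET)                       # 4 bytes per value
fits_float16 = max_phrases(BUDGET, bytes_per_value=2)    # hypothetical half precision
```

Under these assumptions each vector occupies 16 KiB, so a 2 MiB budget holds 128 float32 phrase vectors, or twice as many at half precision.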
Table 1 presents a summary of the four methods.
5. Results
To assess the effectiveness and practicality of the proposed segment-based voice scam detection strategies, we designed an experiment suite focusing on real-time performance, semantic accuracy, and energy-aware deployment feasibility. The four detection strategies are comparatively evaluated across several dimensions, including precision, latency, and segment sensitivity.
5.1. Experiment Configuration
The configuration of the experiment is as follows.
Hardware: ESP32-WROOM-32U for segment capture; Android phone with Snapdragon 778G for Whisper and DeepSeek inference.
Segmentation: Non-overlapping fixed-length windows of 2, 3, and 5 s.
ASR: Whisper base (int8 quantized, offline).
Embedding: DeepSeek with LLaMA v2, 4096-dim.
Voting Threshold (A4): ≥3 segments out of 5 with similarity ≥ 0.80.
Testing was performed offline to approximate real-world deployment conditions. We adopted the deepseek-llm-7b-chat.Q8_0.gguf model for sentence-level semantic embedding, which is based on the LLaMA v2 architecture. The resulting embeddings are 4096-dimensional floating-point vectors.
5.2. Dataset
To evaluate the system’s performance, we constructed a pseudo-realistic Mandarin-language dialogue dataset designed to simulate real-world phone conversations under both fraudulent and benign conditions. Given the legal and privacy constraints associated with collecting actual scam call recordings, we adopted a semi-synthetic data generation strategy, drawing from publicly available, domain-relevant resources and LLMs.
5.2.1. Scam Dialogue Construction
Sixty scam dialogues were generated from the following sources.
Published scam scripts from government agencies, such as Taiwan’s Anti-Fraud Center (165.gov.tw), China’s Ministry of Public Security, and cybersecurity advisories.
Text excerpts from news reports and public awareness campaigns describing real fraud cases.
Semantic augmentation using LLMs, including DeepSeek-V3 and ChatGPT-4o, where prompts were used to generate paraphrased or regionally adapted scam variants based on seed sentences from the TeleAntiFraud-28k dataset.
Dialogues were written to reflect typical scam structures (e.g., phishing, fake investment, impersonation of authorities) and reviewed to maintain linguistic and contextual plausibility.
5.2.2. Benign Dialogue Construction
We added 60 legitimate call dialogues. These included:
Simulated customer service calls, such as banking inquiries, government helplines, and e-commerce support.
Scripted role-play conversations performed by volunteers to emulate general-purpose, non-threatening communication.
Translated or adapted open-domain spoken corpora, such as service-oriented dialogues from public NLP datasets.
All dialogues were segmented into 5 s audio clips, with transcripts generated via Whisper or manual annotation. Segments containing scam-relevant content were labeled accordingly to facilitate evaluation of segment-based strategies (A2–A4).
5.3. Evaluation Metrics
To rigorously evaluate the effectiveness and real-world usability of the proposed scam detection strategies (A1–A4), we adopted a suite of standard classification and performance metrics. These metrics are tailored to reflect both semantic accuracy and temporal responsiveness, which are critical in voice-based fraud prevention systems.
We define the key components for evaluation as follows. $S$: the set of all segmented speech units extracted from voice data. $\hat{y}_i$: the predicted label for an individual segment $s_i \in S$; a value of 1 indicates that the segment is predicted as a scam, while 0 indicates it is predicted as benign. $y_i$: the ground truth label for segment $s_i$, i.e., its actual, verified nature (1 for scam, 0 for benign).
Based on these labels, we define the following metrics.
Precision (P) denotes the proportion of predicted scam segments that are authentic scams. High precision leads to fewer false alarms. In scam detection, higher precision implies more user trust, because false alarms may cause users to ignore valid warnings.
Recall (R) quantifies the proportion of genuine scam segments accurately identified. A high recall rate is essential for minimizing the omission of fraudulent activity. Recall is particularly important in security applications, because failing to detect fraudulent activity (false negatives) presents a substantial risk of financial loss.
The F1 score, a harmonic mean of precision and recall, provides a balanced performance assessment. A high F1 score ensures the system avoids bias toward precision or recall. It is the primary performance target in our experiments.
The false positive rate (FPR) represents the proportion of benign segments incorrectly identified as fraudulent. A low FPR is essential in real-time systems to avoid user fatigue or unnecessary disruption.
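The four classification metrics above follow their standard definitions in terms of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The sketch below computes them from binary segment labels; the function name `segment_metrics`, the choice to return 0.0 for an empty denominator, and the toy labels are illustrative assumptions rather than the evaluation code used in this study.

```python
from typing import Dict, Sequence


def segment_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1, and FPR for binary labels (1 = scam, 0 = benign)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def ratio(num: float, den: float) -> float:
        # Convention chosen for this sketch: undefined ratios become 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)          # TP / (TP + FP)
    recall = ratio(tp, tp + fn)             # TP / (TP + FN)
    f1 = ratio(2 * precision * recall, precision + recall)
    fpr = ratio(fp, fp + tn)                # FP / (FP + TN)
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Toy labels for eight segments (illustrative values, not study data).
truth = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 0, 1]
metrics = segment_metrics(truth, predicted)
```

With these toy labels there are three true positives, one false positive, one false negative, and three true negatives.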
Latency is the average time from segment acquisition until a scam alert is generated. Measured in seconds, it reflects the system’s reactiveness. For segment-wise voting (e.g., A4), latency includes multiple segment delays before a conclusive vote is triggered.
In Strategy A1, metrics are calculated for each call using the complete transcript as the unit of analysis. For Strategies A2 to A4, metrics are computed per segment and then aggregated across the dataset. Strategy A4 employs a voting mechanism; a call is identified as fraudulent when a similarity threshold is surpassed in a minimum of three of the five most recent segments. This method employs a two-tiered evaluation process: segment-level assessment of true and false positives for individual units, and session-level determination of real-time call blocking.
5.4. Segment Duration Sensitivity
To determine the optimal segment length for real-time voice scam detection, we conducted a comparative study using three different segmentation window durations: 2, 3, and 5 s. Strategy A4 (semantic voting) was exclusively employed for this evaluation, given its demonstrably superior detection performance.
Observations revealed that 2 s segments often lacked adequate context, which resulted in truncated scam phrases and an increased rate of false negatives, despite their quick alerting capabilities. On the other hand, 5 s segments enhanced context completeness but introduced higher latency, making them less effective in real-time scenarios by increasing the likelihood of detecting scam phrases only after they had already been spoken entirely.
Figure 5 illustrates the impact of different voice segment lengths on the F1 score and inference latency. Experimental results indicate that while the 3 s segment has comparable latency to the 5 s segment (approximately 2.7 s), its F1 score significantly drops to 0, demonstrating its inability to consistently capture meaningful features in semantic recognition. In contrast, both the 2 s and 5 s segments achieve an F1 score of 1.0, showcasing the system’s strong recognition performance in both very short and longer segments. Considering the balance between accuracy and real-time responsiveness, a 5 s segment is recommended as the optimal solution. This duration maximized the F1 score and maintained response latency within acceptable limits for real-time applications.
6. Discussion
We analyze the experimental results across the four detection strategies (A1–A4), highlighting the trade-offs between accuracy, latency, and reliability under real-time deployment constraints. We also discuss the impact of segment duration, model performance stability, and error characteristics to provide a comprehensive assessment of system behavior in practice.
Table 2 summarizes the evaluation metrics across the four implemented strategies.
A1 (whole-transcript embedding) is semantically strong but slow and not applicable in real time. A2 (keyword matching) is the fastest but yields many false positives. Segment-wise embedding in A3 performs well but over-alerts on isolated segments. The preferred strategy (A4) balances reliability, early detection, and low false positives. The results confirm that multi-segment voting improves accuracy.
To evaluate the influence of segment granularity, Strategy A4 was tested under 2, 3, and 5 s non-overlapping windows. Results are shown in Table 3. Short segments of 2 s improve reactivity but truncate semantic context, resulting in lower recall and F1 scores. In contrast, longer segments of 5 s slightly enhance accuracy but introduce delays in alerts and reduce user response time. A 5 s window emerged as the optimal balance, effectively preserving semantic completeness while maintaining low latency.
An analysis of 30 misclassified cases across all strategies revealed three dominant sources of error, as shown in Figure 6. First, false positives often occurred due to neutral phrases being mistakenly flagged, such as benign financial inquiries like “download our form.” Second, false negatives resulted from vague or subtle scam threats being overlooked, exemplified by passive warnings such as “you’ll be responsible if ignored.” Lastly, semantic drift across segments posed challenges, with partially split scam phrases losing clarity due to segmentation boundaries. Our A4 voting logic reduces the impact of neutral-phrase false positives and semantic drift by accumulating similarity scores, which supports its robustness in fragmented dialogue.
The system achieves segment-wise alert generation with a mean response latency of <4 s under Strategy A4. The timeline in Figure 7 illustrates the time-sequenced process of audio segmentation, transcription, inference, and alert triggering. This real-time responsiveness meets operational demands for on-device, offline deployment in elder-protection or low-connectivity scenarios.
Semantic voting across segments, as implemented in Strategy A4, demonstrates exceptional robustness in handling phrase ambiguity and contextual fragmentation. The use of a 5 s segmentation window achieves an optimal balance between accuracy and latency, particularly for detecting scam patterns in Chinese speech. By utilizing low-power hardware and operating free from cloud dependency, the design facilitates real-time detection, making it highly suitable for deployment in cost-sensitive environments with strict privacy requirements.
7. Conclusions
This study presents a real-time voice scam detection method for low-power embedded devices. The system combines audio capture, ASR, semantic embedding, and voting to classify speech segments. It shows high detection accuracy (F1 = 0.90) and a low false positive rate (2.5%), and it triggers real-time alerts in under four seconds. The system is a standalone, offline unit that runs on an ESP32 and a mobile device, and its segment-level design improves the reliability of fraud recognition. The best F1 score (0.800) came from a 5 s window, but the 1.222 s latency may affect user response. Real-time response can be enhanced via overlapping segmentation and VAD. Future work will aim to improve system reliability by adding dynamic segmentation, multi-language support, and sentiment analysis, and to validate fraud protection for vulnerable groups.
Author Contributions
Conceptualization, S.-Y.L.; methodology, S.-Y.L.; software, S.-Y.L.; validation, W.-P.C.; formal analysis, S.-Y.L.; writing—original draft preparation, S.-Y.L.; writing—review and editing, W.-P.C.; supervision, W.-P.C. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are available on request from the corresponding author.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Zhao, Q.; Chen, K.; Li, T.; Yang, Y.; Wang, X. Detecting telecommunication fraud by understanding the contents of a call. Cybersecurity 2018, 1, 8.
- Boulieris, P.; Pavlopoulos, J.; Xenos, A.; Vassalos, V. Fraud detection with natural language processing. Mach. Learn. 2024, 113, 5087–5108.
- Chang, Y.-C.; Aïmeur, E. Chat or Trap? Detecting Scams in Messaging Applications with Large Language Models. In Proceedings of the 2024 8th Cyber Security in Networking Conference (CSNet), Paris, France, 4–6 December 2024; IEEE: New York, NY, USA, 2024; pp. 92–99.
- Nicholas, P.Y.J.; Ng, P.C. ScamDetector: Leveraging Fine-Tuned Language Models for Improved Fraudulent Call Detection. In Proceedings of the TENCON 2024—2024 IEEE Region 10 Conference (TENCON), Singapore, 1–4 December 2024; IEEE: New York, NY, USA, 2024; pp. 422–425.
- Malhotra, S.; Arora, G.; Bathla, R. Detection and Analysis of Fraud Phone Calls using Artificial Intelligence. In Proceedings of the 2023 International Conference on Recent Advances in Electrical, Electronics & Digital Healthcare Technologies (REEDCON), New Delhi, India, 1–3 May 2023; IEEE: New York, NY, USA, 2023; pp. 592–595.
- Hong, B.; Connie, T.; Goh, M.K.O. Scam Calls Detection Using Machine Learning Approaches. In Proceedings of the 2023 11th International Conference on Information and Communication Technology (ICoICT), Melaka, Malaysia, 23–24 August 2023; IEEE: New York, NY, USA, 2023; pp. 442–447.
- Ma, Z.; Wang, P.; Huang, M.; Wang, J.; Wu, K.; Lv, X.; Pang, Y.; Yang, Y.; Tang, W.; Kang, Y. TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection. arXiv 2025, arXiv:2503.24115.
- Rao, R.S.; Kondaiah, C.; Pais, A.R.; Lee, B. A hybrid super learner ensemble for phishing detection on mobile devices. Sci. Rep. 2025, 15, 16839.
- Reddy, M.; Pallerla, R. Using AI to Detect and Classify Suspicious Mobile Messages in Real Time. In Proceedings of the 2025 3rd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT), Bengaluru, India, 5–7 February 2025; IEEE: New York, NY, USA, 2025; pp. 1772–1777.
- Tan, Z.-H.; Sarkar, A.K.; Dehak, N. rVAD: An Unsupervised Segment-Based Robust Voice Activity Detection Method. arXiv 2022, arXiv:1906.03588.
- Oļeiņiks, R. Real-Time Fraud Detection and Prevention Based on Artificial Intelligence Tools. Balt. J. Mod. Comput. 2025, 13, 252–289.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.