A Hybrid Machine Learning Approach for Classifying Indonesian Cybercrime Discourse Using a Localized Threat Taxonomy
Abstract
1. Introduction
1. High-Dimensional and Sparse Vocabulary: The language used on Indonesian social media is informal, rich in slang, and contains many out-of-vocabulary terms not found in standard NLP corpora.
2. Significant Label Noise: The inherent ambiguity of informal language makes it difficult to assign clear, unambiguous labels, even for human annotators.
3. Severe Class Imbalance: Reports of some threat types (e.g., WhatsApp Phishing) are far more prevalent in the public discourse than others (e.g., Ransomware), leading to a highly imbalanced dataset.
4. Interpretability vs. Performance Trade-Off: To be useful for threat analysts and policymakers, the system must not only be accurate but also interpretable.
- Design of the Indonesian Cybercrime Threat Taxonomy (ICTT): The ICTT is a novel, five-dimensional framework tailored to the nuances of Indonesian cybercrime discourse, bridging formal policy language with informal citizen language.
- An End-to-End OSINT and Machine Learning Pipeline: This study offers a complete system for collecting, preprocessing, weakly labeling, and classifying Indonesian cybercrime content, incorporating state-of-the-art NLP techniques.
- Comparative Analysis of Classification Approaches: This study conducted a rigorous evaluation of rule-based, transformer-only, and hybrid classification models, demonstrating the effectiveness of combining interpretability with deep learning performance.
2. Related Work
2.1. Cybercrime Taxonomies and Threat Classification
2.2. Natural Language Processing for Cybersecurity
2.3. Weak Supervision and Noisy Label Learning
2.4. Hybrid Rule-Based and Machine Learning Systems
3. Methodology
3.1. The Indonesian Cybercrime Threat Taxonomy (ICTT)
- Threat Type: the nature and category of the malicious act. The ICTT enumerates 10 major threat categories (Phishing & Social Engineering, Malware & Malicious Software, Fraud & Online Scams, Data Breach & Identity Theft, Hacking & System Intrusion, Financial & Payment Attacks, Online Child Exploitation, Cyber Harassment & Online Abuse, Cyber-Enabled Traditional Crime, and Emerging Threats) with over 60 specific subcategories. New categories will be added when 50 or more validated samples describe a distinct, operationally relevant threat type.
- Attack Vector: the delivery mechanism or channel through which the threat is executed. The ICTT identifies five primary attack vectors (Messaging/Apps, such as WhatsApp and Telegram; Social Platforms, such as X, Instagram, Facebook, TikTok, and YouTube; Email/SMS/Voice; Web & Apps; and Network/Infrastructure). The prominence of messaging and social platform vectors reflects their centrality in Indonesian online communication and cybercrime reporting.
- Threat Actor: the entity or group responsible for perpetrating the cybercrime. The ICTT classifies threat actors into three categories (Non-State actors, including cybercriminal groups and hacktivists; Internal/Partner actors, such as insiders and contractors; and State-Linked actors, including Advanced Persistent Threats and state-sponsored operations).
- Victim: the target or affected entity. The ICTT enumerates four prevalent victim categories (Private Sector, including SMEs and e-commerce platforms; Finance, including banks and fintech companies; Public Sector, including government agencies and educational institutions; and Individuals, including citizens and minors). Although the Indonesian Cybersecurity Agency (BSSN) 2024 Cybersecurity Landscape report [3] identifies additional victim categories in critical infrastructure incidents (e.g., IoT/CCTV systems, healthcare institutions), these are rarely referenced in current social media discourse and may emerge in future analyses.
- Impact: the consequences or harm resulting from the incident. The ICTT classifies incident impact into four dimensions (Confidentiality impacts, such as Privacy/Data Protection; Integrity/Availability impacts, such as Service Disruption/Outage; Financial/Legal impacts, such as Financial Loss and Regulatory Exposure; and Reputation/Societal impacts, such as Reputation Damage and Public Safety/National Security concerns).
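The five dimensions above can be read as the fields of a single annotation record. The following is a minimal sketch of that structure, not the authors' implementation: the class and field names (`ICTTRecord`, `AttackVector`, and so on) and the helper `qualifies_as_new_category` are hypothetical, while the enumerated values and the 50-sample threshold are taken from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class AttackVector(Enum):
    MESSAGING_APPS = "Messaging/Apps"
    SOCIAL_PLATFORMS = "Social Platforms"
    EMAIL_SMS_VOICE = "Email/SMS/Voice"
    WEB_APPS = "Web & Apps"
    NETWORK_INFRASTRUCTURE = "Network/Infrastructure"


class ThreatActor(Enum):
    NON_STATE = "Non-State"
    INTERNAL_PARTNER = "Internal/Partner"
    STATE_LINKED = "State-Linked"


class Victim(Enum):
    PRIVATE_SECTOR = "Private Sector"
    FINANCE = "Finance"
    PUBLIC_SECTOR = "Public Sector"
    INDIVIDUALS = "Individuals"


class Impact(Enum):
    CONFIDENTIALITY = "Confidentiality"
    INTEGRITY_AVAILABILITY = "Integrity/Availability"
    FINANCIAL_LEGAL = "Financial/Legal"
    REPUTATION_SOCIETAL = "Reputation/Societal"


@dataclass(frozen=True)
class ICTTRecord:
    """One annotated item described along the five ICTT dimensions."""
    threat_type: str   # one of the 10 major categories, kept as free text here
    subcategory: str   # e.g. "WhatsApp Phishing"
    attack_vector: AttackVector
    threat_actor: ThreatActor
    victim: Victim
    impact: Impact


# Threshold stated in the taxonomy description: 50 or more validated samples.
MIN_SAMPLES_FOR_NEW_CATEGORY = 50


def qualifies_as_new_category(validated_samples: int) -> bool:
    """Sample-count check only; the 'distinct, operationally relevant'
    judgement is a human decision and is not modelled here."""
    return validated_samples >= MIN_SAMPLES_FOR_NEW_CATEGORY


# Illustrative record; the dimension values chosen here are an assumption.
example = ICTTRecord(
    threat_type="Phishing & Social Engineering",
    subcategory="WhatsApp Phishing",
    attack_vector=AttackVector.MESSAGING_APPS,
    threat_actor=ThreatActor.NON_STATE,
    victim=Victim.INDIVIDUALS,
    impact=Impact.CONFIDENTIALITY,
)
```

Keeping the threat type as free text while enumerating the other four dimensions is a design choice of this sketch: the threat type list is expected to grow, whereas the other dimensions are small closed sets.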
3.2. Data Collection and Preprocessing
- File-based labeling: Samples were retrieved from X and YouTube using threat-specific keyword queries. For instance, WhatsApp Phishing samples were collected using Indonesian terms such as “wa kena hack” (WhatsApp hacked), “link WA palsu” (fake WhatsApp link), and “hadiah wa” (WhatsApp prize). Labels were assigned based on the keyword set used for retrieval. While this ensures broad coverage, it assumes that keyword-based mining accurately reflects the threat type, a premise often complicated by semantic ambiguity in the Indonesian language.
- Rule-based labeling: To improve label precision, a set of 12 core regular expression (regex) patterns was developed for the four most prevalent threat categories (WhatsApp Phishing, Email Phishing, Deepfake Scams, and Ransomware). These patterns encode linguistic markers specific to Indonesian cybercrime discourse and were applied to all 2344 mined samples. Samples matching one or more rules received a rule-based label; samples with no matches retained only their file-based label or remained unlabeled.
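The two weak-labeling stages above can be sketched as follows. This is an illustration under stated assumptions: the regex patterns are hypothetical stand-ins built from the example keywords quoted above (the study's 12 core patterns are not reproduced here), and returning the first matching category when several rules fire is an assumption of this sketch rather than a documented tie-breaking rule.

```python
import re
from typing import Dict, List, Optional, Pattern

# Hypothetical patterns for illustration only.
RULES: Dict[str, List[Pattern[str]]] = {
    "whatsapp_phishing": [
        re.compile(r"\b(?:wa|whatsapp)\b.*\b(?:hack|palsu|hadiah)\b", re.IGNORECASE),
        re.compile(r"\b(?:hack|palsu|hadiah)\b.*\b(?:wa|whatsapp)\b", re.IGNORECASE),
    ],
    "ransomware": [
        re.compile(r"\bransomware\b", re.IGNORECASE),
    ],
}


def rule_label(text: str) -> Optional[str]:
    """Return the first category whose patterns match, or None if no rule fires."""
    for label, patterns in RULES.items():
        if any(p.search(text) for p in patterns):
            return label
    return None


def weak_label(text: str, file_label: Optional[str]) -> Dict[str, Optional[str]]:
    """Keep both weak labels side by side: the file-based label comes from the
    keyword query that retrieved the sample, the rule-based label from regex."""
    return {"file_label": file_label, "rule_label": rule_label(text)}


# Usage with the example phrases quoted in the text.
labelled = [
    weak_label("wa kena hack", "whatsapp_phishing"),
    weak_label("dapat hadiah wa dari nomor asing", "whatsapp_phishing"),
    weak_label("selamat pagi semua", None),
]
```

Storing the two labels separately, rather than overwriting one with the other, makes it possible to compare their precision against gold labels later.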
3.3. Data Characteristics and Representational Biases
3.3.1. Class Distribution and Imbalance
3.3.2. Platform Representation Bias
3.3.3. User Representation Bias
3.3.4. Implications for Model Fairness and Deployment
3.4. Classification Framework
3.4.1. Rule-Based Classifier
3.4.2. Transformer-Based Classifier (IndoBERT)
- Training used the indolem/indobert-base-uncased checkpoint with a maximum sequence length of 160 tokens, a batch size of 16, and three training epochs.
- A classification head (a linear layer with softmax activation) was added on top of the IndoBERT base model.
- To mitigate class imbalance, inverse-frequency class weighting was applied via a weighted cross-entropy loss function. This approach avoids the risk of creating linguistically implausible synthetic examples inherent in the Synthetic Minority Over-Sampling Technique (SMOTE) [31].
- The learning rate was set to 5 × 10⁻⁵ with a weight decay of 0.01, and 10% of the data were reserved as a validation split.
- The AdamW optimizer was used, and a random seed of 42 was fixed to ensure reproducible splits and initialization.
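The hyperparameters and the class-weighted loss described above can be sketched in plain Python. The `CONFIG` dictionary simply restates the values listed above; its key names are hypothetical. The weighting formula `n / (k * count)` is one common inverse-frequency scheme and is an assumption here, since the exact normalization is not stated. The loss is normalized by the sum of the per-sample weights, which is also an assumption of this sketch.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

# Values restated from the training description; key names are illustrative.
CONFIG = {
    "checkpoint": "indolem/indobert-base-uncased",
    "max_seq_length": 160,
    "batch_size": 16,
    "epochs": 3,
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
    "validation_fraction": 0.10,
    "optimizer": "AdamW",
    "seed": 42,
}


def inverse_frequency_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class by n / (k * count), so rarer classes weigh more.
    Assumed normalization; other inverse-frequency variants exist."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(
    probs: List[Dict[str, float]],
    labels: Sequence[str],
    weights: Dict[str, float],
) -> float:
    """Weighted mean of -log p(true class), normalized by the total weight."""
    numerator = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator


# Toy example: three majority samples and one minority sample.
toy_labels = ["whatsapp_phishing"] * 3 + ["ransomware"]
toy_weights = inverse_frequency_weights(toy_labels)
toy_probs = [{"whatsapp_phishing": 0.5, "ransomware": 0.5}] * 4
toy_loss = weighted_cross_entropy(toy_probs, toy_labels, toy_weights)
```

In the toy example the minority class receives a weight of 2.0 against roughly 0.67 for the majority class, so a misclassified ransomware sample contributes three times as much to the loss.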
3.4.3. Hybrid Model
3.5. Evaluation Framework
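- Accuracy, the proportion of correctly classified samples, is calculated as $\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} I(\hat{y}_i = y_i)$, where $I(\cdot)$ is the indicator function, $N$ is the number of evaluated samples, $y_i$ is the gold label of sample $i$, and $\hat{y}_i$ is its predicted label.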
- Precision, Recall, and F1-Score were calculated on a per-class basis to assess performance on individual threat categories.
- Macro-Averaged F1-Score, the unweighted mean of the F1-scores across all classes, provides a balanced view of performance on both majority and minority classes.
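The metrics above can be computed directly from gold and predicted labels. The sketch below uses only the standard definitions; the function names and the toy labels are illustrative and not taken from the study.

```python
from typing import Dict, Sequence, Tuple


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Proportion of samples whose predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Dict[str, Tuple[float, float, float]]:
    """Return {class: (precision, recall, f1)} for every class seen."""
    result = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        result[cls] = (precision, recall, f1)
    return result


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_metrics(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Toy example with three classes.
gold = ["a", "a", "b", "b", "c"]
pred = ["a", "b", "b", "b", "c"]
toy_accuracy = accuracy(gold, pred)      # 4 of 5 correct
toy_scores = per_class_metrics(gold, pred)
toy_macro_f1 = macro_f1(gold, pred)
```

Because the macro average is unweighted, a class with very few samples influences the score as much as a class with many, which is why it is informative under class imbalance.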
4. Results
4.1. Descriptive Power of ICTT
- AI-generated voice/video fraud (deepfakes) from conventional social engineering;
- WhatsApp OTP hijacking from email phishing;
- Platform-specific e-wallet fraud (OVO, GoPay, DANA, LinkAja, and QRIS) from generic payment fraud;
- Illegal online loans from conventional fraud.
4.2. Overall Model Performance
4.3. Per-Class Performance Analysis
4.4. Platform-Specific Performance
4.5. The Role of the ‘Other’ Category
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- APJII. Survei Penetrasi dan Perilaku Internet Indonesia 2025 [Internet Penetration and Behavior Survey of Indonesia 2025]; Asosiasi Penyelenggara Jasa Internet Indonesia: Jakarta, Indonesia, 2025.
- Dhanya, D. Indonesia’s BSSN Records 3.64 Billion Cyberattacks in First Half of 2025. Tempo English. 2025. Available online: https://en.tempo.co/read/2037469/indonesias-bssn-records-3-64-billion-cyberattacks-in-first-half-of-2025 (accessed on 22 September 2025).
- BSSN. Lanskap Keamanan Siber Indonesia 2024; Badan Siber dan Sandi Negara (BSSN): Jakarta, Indonesia, 2024. Available online: https://www.scribd.com/document/834167154/LANSKAP-KEAMANAN-SIBER-2024 (accessed on 22 September 2025).
- OJK. OJK Performance Report 2024; OJK: Jakarta, Indonesia, 2024.
- Marinos, L. ENISA Threat Taxonomy: A Tool for Structuring Threat Information. Heraklion. 2016. Available online: https://www.enisa.europa.eu (accessed on 22 September 2025).
- Europol. Internet Organised Crime Threat Assessment (IOCTA) 2023; Publications Office of the European Union: Luxembourg, 2023.
- Verizon. DBIR 2023 Data Breach Investigations Report; Verizon: New York, NY, USA, 2023.
- MITRE Corporation. MITRE ATT&CK: A Knowledge Base of Adversary Tactics and Techniques; MITRE Corporation: McLean, VA, USA, 2018. Available online: https://attack.mitre.org/ (accessed on 20 December 2025).
- Chandra, A.; Snowe, M.J. A taxonomy of cybercrime: Theory and design. Int. J. Account. Inf. Syst. 2020, 38, 100467.
- Agrafiotis, I.; Nurse, J.R.C.; Goldsmith, M.; Creese, S.; Upton, D. A taxonomy of cyber-harms: Defining the impacts of cyber-attacks and understanding how they propagate. J. Cybersecur. 2018, 4, tyy006.
- Malavasi, M.; Peters, G.W.; Trück, S.; Shevchenko, P.V.; Jang, J.; Sofronov, G. Cyber risk taxonomies: Statistical analysis of cybersecurity risk classifications. Insur. Math. Econ. 2026, 126, 103167.
- Arazzi, M.; Arikkat, D.R.; Nicolazzo, S.; Nocera, A.; K.A., R.R.; P., V.; Conti, M. NLP-based techniques for cyber threat intelligence. Comput. Sci. Rev. 2025, 58, 100765.
- Albarrak, M.; Salonitis, K.; Jagtap, S. Natural language processing (NLP)-based frameworks for cyber threat intelligence and early prediction of cyberattacks in Industry 4.0: A systematic literature review. Appl. Sci. 2026, 16, 619.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019.
- Bayer, M.; Kuehn, P.; Shanehsaz, R.; Reuter, C. CySecBERT: A domain-adapted language model for the cybersecurity domain. ACM Trans. Priv. Secur. 2024, 27, 18.
- Aghaei, E.; Niu, X.; Shadid, W.; Al-Shaer, E. SecureBERT: A domain-specific language model for cybersecurity. In Security and Privacy in Communication Networks (SecureComm 2022); Li, F., Liang, K., Lin, Z., Katsikas, S.K., Eds.; Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering; Springer: Cham, Switzerland, 2023; Volume 462.
- Koto, F.; Rahimi, A.; Lau, J.H.; Baldwin, T. IndoLEM and IndoBERT: A benchmark dataset and pre-trained language model for Indonesian NLP. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 757–770.
- Wilie, B.; Vincentio, K.; Winata, G.I.; Cahyawijaya, S.; Li, X.; Lim, Z.Y.; Soleman, S.; Mahendra, R.; Fung, P.; Bahar, S.; et al. IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Suzhou, China, 4–7 December 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 843–857.
- Ratner, A.; Bach, S.H.; Ehrenberg, H.; Fries, J.; Wu, S.; Ré, C. Snorkel: Rapid Training Data Creation with Weak Supervision. Proc. VLDB Endow. 2017, 11, 269–282.
- Zhu, D.; Hedderich, M.A.; Zhai, F.; Adelani, D.I.; Klakow, D. Is BERT robust to label noise? A study on learning with noisy labels in text classification. In Proceedings of the Third Workshop on Insights from Negative Results in NLP, Dublin, Ireland, 26 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 62–67.
- Villena-Román, J.; Collada-Pérez, S.; Lana-Serrano, S.; González, J.C. Hybrid approach combining machine learning and a rule-based expert system for text categorization. In Proceedings of the Florida Artificial Intelligence Research Society Conference (FLAIRS), Palm Beach, FL, USA, 18–20 May 2011.
- Li, X.; Cui, M.; Li, J.; Bai, R.; Lu, Z.; Aickelin, U. A hybrid medical text classification framework: Integrating attentive rule construction and neural network. Neurocomputing 2021, 443, 345–355.
- Samtani, S.; Chinn, R.; Chen, H.; Nunamaker, J.F., Jr. Exploring Emerging Hacker Assets and Key Hackers for Proactive Cyber Threat Intelligence. J. Manag. Inf. Syst. 2017, 34, 1023–1053.
- Shaukat, K.; Luo, S.; Chen, S.; Liu, D. Cyber Threat Detection Using Machine Learning Techniques: A Performance Evaluation Perspective. In Proceedings of the 1st Annual International Conference on Cyber Warfare and Security, Virtual, 20–21 October 2020; ICCWS 2020-Proceedings; IEEE: New York, NY, USA, 2020.
- Sarabi, A.; Huang, Z.; Wang, C.; Karir, T.; Liu, M. The Ransomware Decade: The Creation of a Fine-Grained Dataset and a Longitudinal Study. In Proceedings of the 34th USENIX Security Symposium, Seattle, WA, USA, 13–15 August 2025.
- Demirol, D.; Das, R.; Hanbay, D. A Novel Approach for Cyber Threat Analysis Systems Using BERT Model from Cyber Threat Intelligence Data. Symmetry 2025, 17, 587.
- Wirth, R.; Hipp, J. CRISP-DM: Towards a Standard Process Model for Data Mining. In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, Manchester, UK, 11–13 April 2000; pp. 29–40.
- Schröer, C.; Kruse, F.; Gómez, J.M. A Systematic Literature Review on Applying CRISP-DM Process Model. Procedia Comput. Sci. 2021, 181, 526–534.
- Martínez-Plumed, F.; Contreras-Ochando, L.; Ferri, C.; Hernández-Orallo, J.; Kull, M.; Lachiche, N.; Ramírez-Quintana, M.J.; Flach, P. CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories. IEEE Trans. Knowl. Data Eng. 2019, 33, 3048–3061.
- Achuthan, K.; Khobragade, S.; Kowalski, R. Cybercrime through the public lens: A longitudinal analysis. Humanit. Soc. Sci. Commun. 2025, 12, 282.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- Calderon, N.; Reichart, R.; Dror, R. The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Vienna, Austria, 27 July–1 August 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; Volume 1, pp. 16051–16081.
- Beck, J. Quality aspects of annotated data: A research synthesis. AStA Wirtsch. Sozialstatistisches Arch. 2023, 17, 331–353.
- Al-garadi, M.A.; Varathan, K.D.; Ravana, S.D. Cybercrime detection in online communications: The experimental case of cyberbullying detection in the Twitter network. Comput. Hum. Behav. 2016, 63, 433–443.
- Karteris, A.; Tzanos, G.; Papadopoulos, L.; Soudris, D. Detection of Cyber Security Threats through Social Media Platforms. In 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), St. Petersburg, FL, USA, 15–19 May 2023; IEEE: New York, NY, USA, 2023; pp. 820–823.








| ICTT Dimension | ICTT Subcategory | ENISA Equivalent | MITRE ATT&CK Equivalent | Indonesian-Specific? |
|---|---|---|---|---|
| Phishing & Social Engineering | Email Phishing | Phishing/Social Engineering | T1566 Phishing | No |
| Phishing & Social Engineering | WhatsApp Phishing | Phishing (platform-specific) | T1566 (partial) | Yes |
| Phishing & Social Engineering | SMS/Phone (Smishing/Vishing) | Phishing/Social Engineering | T1566 (partial) | Yes |
| Phishing & Social Engineering | Social Media/DM Phishing | Phishing/Social Engineering | T1566 (partial) | No |
| Phishing & Social Engineering | Spear Phishing/Whaling | Phishing/Social Engineering | T1566 Phishing | No |
| Phishing & Social Engineering | Romance Scam | Social Engineering/Fraud | T1566 (partial) | No |
| Malware & Malicious Software | Ransomware | Malware/Ransomware | T1486 Data Encrypted for Impact | No |
| Malware & Malicious Software | Banking Trojan/Keylogger | Malware | T1056 Input Capture | No |
| Malware & Malicious Software | RAT/Spyware | Malware | T1005 Data from Local System | No |
| Malware & Malicious Software | Cryptojacking | Malware | T1496 Resource Hijacking | No |
| Fraud & Online Scams | Loan Scam (Illegal Online Loan) | Fraud | No direct mapping | Yes |
| Fraud & Online Scams | E-commerce/Marketplace Fraud | Fraud | No direct mapping | No |
| Fraud & Online Scams | Investment/Crypto/Robot Trading | Fraud | No direct mapping | No |
| Fraud & Online Scams | Charity/Donation/Lottery | Fraud | No direct mapping | No |
| Data Breach & Identity Theft | Credential Theft/Stuffing | Identity Theft/Credential Compromise | T1110 Brute Force | No |
| Data Breach & Identity Theft | Account Takeover (ATO) | Identity Theft | T1110 Brute Force | No |
| Data Breach & Identity Theft | Personal Data Leaks/Sale | Information Disclosure | T1041 Exfiltration | No |
| Data Breach & Identity Theft | SIM Swap | Identity Theft (emerging) | No direct mapping | Yes |
| Hacking & System Intrusion | Website Defacement | Web-Based Attacks | T1491 Defacement | No |
| Hacking & System Intrusion | Exploit-Based Intrusion (SQLi/XSS/RCE) | Web-Based Attacks | T1190 Exploit Public-Facing Application | No |
| Hacking & System Intrusion | DDoS/Botnet | Denial of Service | T1498 Network Denial of Service | No |
| Financial & Payment Attacks | Carding/ATM Skimming | Fraud/Payment Fraud | No direct mapping | No |
| Financial & Payment Attacks | E-wallet Fraud (OVO/Gopay/DANA/LinkAja/QRIS) | Fraud (platform-specific) | No direct mapping | Yes |
| Financial & Payment Attacks | Crypto Theft/Scam | Fraud | No direct mapping | No |
| Online Child Exploitation (CSAM) | CSAM Distribution/Creation | Information Disclosure/Abuse | No direct mapping | No |
| Online Child Exploitation (CSAM) | Online Grooming | Social Engineering/Abuse | No direct mapping | No |
| Cyber Harassment & Online Abuse | Cyberbullying/Doxing | Information Disclosure/Abuse | No direct mapping | No |
| Cyber Harassment & Online Abuse | Revenge Porn/Sextortion | Information Disclosure/Extortion | No direct mapping | No |
| Cyber-Enabled Traditional Crime | Online Gambling | Fraud | No direct mapping | Yes |
| Cyber-Enabled Traditional Crime | Narcotics/Weapons Trade | Fraud/Illegal Activity | No direct mapping | Yes |
| Emerging Threats | Deepfake Scams (Voice/Video) | Social Engineering (indirect) | No direct mapping | Yes |
| Emerging Threats | IoT/CCTV/ICS Attacks | System Failures/Malware | T1200 Hardware Additions | No |

| ICTT Subcategory | Optimal Window Size (Characters) | Precision | Recall | F1-Score |
|---|---|---|---|---|
| deepfake_scams | 20 | 0.471 | 0.800 | 0.593 |
| email_phishing | 10 | 0.352 | 0.833 | 0.495 |
| ransomware | 40 | 0.118 | 0.889 | 0.208 |
| whatsapp_phishing | 10 | 0.425 | 0.978 | 0.592 |

| ICTT Subcategory | Rule-Labeled Samples (n) | Rule Precision | File-Labeled Samples (n) | File Precision |
|---|---|---|---|---|
| deepfake_scams | 66 | 45.5% (30/66) | 81 | 8.6% (7/81) |
| email_phishing | 177 | 44.6% (79/177) | 110 | 0.9% (1/110) |
| whatsapp_phishing | 156 | 28.8% (45/156) | 52 | 1.9% (1/52) |
| ransomware | 60 | 8.3% (5/60) | 88 | 1.1% (1/88) |
| Overall | 459 | 34.6% (159/459) | 322 | 3.1% (10/322) |

| Noise Type | Rule-Labeled (n = 459) | File-Labeled (n = 322) |
|---|---|---|
| True Positive (weak label = gold label, gold ≠ “other”) | 159 (34.6%) | 10 (3.1%) |
| Type 1: Wrong Category (weak ≠ gold, gold ≠ “other”) | 7 (1.5%) | 0 (0.0%) |
| Type 2: False Positive (gold = “other”) | 293 (63.8%) | 312 (96.9%) |
| Total | 459 | 322 |
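The three noise types in the table above follow mechanically from comparing each weak label with its gold label. The sketch below implements those definitions; the function names are hypothetical, and the label strings inside the synthetic pairs are placeholders. Only the counts (159, 7, 293) come from the rule-labeled column.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def noise_type(weak: str, gold: str) -> str:
    """Classify one (weak label, gold label) pair using the table's definitions."""
    if gold == "other":
        return "type2_false_positive"   # gold = "other"
    if weak == gold:
        return "true_positive"          # weak = gold, gold != "other"
    return "type1_wrong_category"       # weak != gold, gold != "other"


def noise_breakdown(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Tuple[int, float]]:
    """Return {noise type: (count, percentage rounded to one decimal)}."""
    counts = Counter(noise_type(weak, gold) for weak, gold in pairs)
    total = sum(counts.values())
    return {k: (v, round(100 * v / total, 1)) for k, v in counts.items()}


# Synthetic pairs reproducing the rule-labeled column; labels are placeholders.
rule_pairs = (
    [("ransomware", "ransomware")] * 159
    + [("ransomware", "email_phishing")] * 7
    + [("ransomware", "other")] * 293
)
rule_breakdown = noise_breakdown(rule_pairs)
# true_positive -> (159, 34.6), type1_wrong_category -> (7, 1.5),
# type2_false_positive -> (293, 63.8)
```

The breakdown makes the dominant failure mode visible: most weak-label errors are off-topic samples (gold = "other") rather than confusions between threat categories.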

| Subcategory | τ* | Rule F1 | Hybrid F1 (τ*) |
|---|---|---|---|
| deepfake_scams | 0.00 | 0.97 | 0.99 |
| email_phishing | 0.63 | 0.98 | 0.98 |
| ransomware | 1.00 | 0.80 | 0.80 |
| whatsapp_phishing | 0.31 | 0.96 | 0.96 |
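One way to read the threshold τ* in the table above is as a confidence gate between the two classifiers. The sketch below is an assumption about how such a gate could work, not a description of the study's hybrid model: it keeps the transformer prediction when its confidence reaches τ and otherwise falls back to the rule label. The function names and the use of "other" when no rule fires are hypothetical.

```python
from typing import List, Optional, Tuple


def hybrid_predict(
    rule_label: Optional[str],
    model_label: str,
    model_confidence: float,
    tau: float,
) -> Tuple[str, str]:
    """Return (label, source). Assumed gating: the model wins when its
    confidence is at least tau; otherwise the rule label is used."""
    if model_confidence >= tau:
        return model_label, "model"
    if rule_label is not None:
        return rule_label, "rule"
    return "other", "rule"  # assumed default when no rule matches


def fallback_rate(decisions: List[Tuple[str, str]]) -> float:
    """Share of decisions that were taken by the rule fallback."""
    return sum(source == "rule" for _, source in decisions) / len(decisions)


# Illustrative inputs: (rule label, model label, model confidence).
samples = [
    ("ransomware", "other", 0.40),
    (None, "email_phishing", 0.99),
    ("whatsapp_phishing", "whatsapp_phishing", 0.97),
]
decisions = [hybrid_predict(r, m, c, tau=0.63) for r, m, c in samples]
```

Under this reading, τ = 0 always trusts the model, while τ = 1 defers to the rules unless the model is fully confident.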

| Metric | Value |
|---|---|
| ICTT evaluation samples | 124 |
| IndoBERT accuracy | 96.8% |
| Hybrid fallback count | 0 |
| Hybrid fallback rate | 0% |
| Mean confidence | 0.986 |
| Median confidence | 0.996 |
| ECE (10 bins) | 0.026 |
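The ECE entry in the table above refers to expected calibration error, which compares average confidence with accuracy inside confidence bins. A minimal sketch follows, assuming ten equal-width bins over [0, 1] with the top bin including 1.0; the binning details used in the study are not stated, and the toy inputs are illustrative.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Sum over bins of (bin size / N) * |bin accuracy - bin mean confidence|."""
    n = len(confidences)
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(acc - mean_conf)
    return ece


# Toy example: four predictions at 0.9 confidence, three of them correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

In the toy example the model claims 90% confidence but is right 75% of the time, giving an ECE of 0.15. Values near zero indicate that confidence scores can be used as thresholds with reasonable trust.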

| Subcategory | Support | Mean Conf. | Median Conf. |
|---|---|---|---|
| deepfake_scams | 39 | 0.972 | 0.996 |
| email_phishing | 30 | 0.995 | 0.996 |
| ransomware | 9 | 0.980 | 0.996 |
| whatsapp_phishing | 46 | 0.992 | 0.995 |

| Metric | ICTT | ENISA |
|---|---|---|
| Unambiguous Classification | 124 (20.7%) | 85 (14.2%) |
| Ambiguous/Multiple Mappings | Not available | 39 (6.5%) |
| Rejected/“Other” | 476 (79.3%) | 476 (79.3%) |
| Indonesian-Specific Threats Covered | 85 samples (68.5% of valid) | Not covered |
| Deepfake Scams | 39 samples (explicit category) | Mapped to Social Engineering (indirect) |
| WhatsApp Phishing | 46 samples (explicit category) | Mapped to Phishing (email-centric; indirect) |

| Model | Accuracy | Macro-Precision | Macro-Recall | Macro-F1 | Samples Evaluated |
|---|---|---|---|---|---|
| Rule (Baseline) | 0.618 | 0.454 | 0.765 | 0.503 | 600 |
| Rule (Tuned) | 0.667 | 0.473 | 0.771 | 0.531 | 600 |
| IndoBERT | 0.968 | 0.977 | 0.910 | 0.935 | 124 * |
| Hybrid (rule + IndoBERT) | 0.968 | 0.977 | 0.910 | 0.935 | 124 * |

| Threat Category | Rule (Baseline) Precision/Recall/F1 | Rule (Tuned) Precision/Recall/F1 | Support |
|---|---|---|---|
| Deepfake Scams | 0.470/0.775/0.585 | 0.463/0.775/0.579 | 40 |
| Email Phishing | 0.420/0.967/0.586 | 0.431/0.933/0.589 | 30 |
| Ransomware | 0.085/0.556/0.147 | 0.083/0.556/0.145 | 9 |
| WhatsApp Phishing | 0.333/0.978/0.497 | 0.425/0.978/0.592 | 46 |
| Other | 0.963/0.549/0.700 | 0.964/0.613/0.749 | 475 |
| Macro Avg | 0.454/0.765/0.503 | 0.473/0.771/0.531 | 600 |

| Threat Category | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Deepfake Scams | 1.000 | 0.974 | 0.987 | 40 |
| Email Phishing | 0.968 | 1.000 | 0.984 | 30 |
| Ransomware | 1.000 | 0.667 | 0.800 | 9 |
| WhatsApp Phishing | 0.939 | 1.000 | 0.968 | 46 |
| Macro Avg | 0.977 | 0.910 | 0.935 | 600 |

| Model | X Accuracy | X Macro-F1 | YouTube Accuracy | YouTube Macro-F1 | X Samples | YouTube Samples |
|---|---|---|---|---|---|---|
| Rule (Tuned) | 0.982 | 0.966 | 0.813 | 0.803 | 109 | 16 |
| IndoBERT | 0.991 | 0.973 | 0.813 | 0.803 | 109 | 16 |
| Hybrid (Rule + IndoBERT) | 0.991 | 0.973 | 0.813 | 0.803 | 109 | 16 |

| Threat Category | Rule (Tuned) X/YT | IndoBERT X/YT | Hybrid X/YT | X Support | YT Support |
|---|---|---|---|---|---|
| Deepfake Scams | 0.986/0.889 | 1.000/0.889 | 1.000/0.889 | 35 | 5 |
| Email Phishing | 0.983/1.000 | 0.983/1.000 | 0.983/1.000 | 29 | 1 |
| Ransomware | 0.909/0.500 | 0.909/0.500 | 0.909/0.500 | 6 | 3 |
| WhatsApp Phishing | 0.987/0.824 | 1.000/0.824 | 1.000/0.824 | 39 | 7 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Arifman, F.; Mantoro, T.; Handayani, D.O.D. A Hybrid Machine Learning Approach for Classifying Indonesian Cybercrime Discourse Using a Localized Threat Taxonomy. Information 2026, 17, 301. https://doi.org/10.3390/info17030301

