Automated Cyber Threat Intelligence Extraction from Distributed Honeypots: A Hybrid Machine Learning Approach
Abstract
1. Introduction
1.1. Background and Motivation
1.2. Problem Statement
- Limitations of Supervised Learning: Most existing studies focus heavily on supervised classification for Intrusion Detection Systems (IDS), identifying known attack signatures [6]. However, these models struggle to detect novel, evolving threat campaigns (zero-day attacks) that lack labeled historical data.
- Lack of Behavioral Attribution: Researchers have placed less emphasis on predicting the source infrastructure and grouping attacks based on multidimensional behaviors. Traditional analysis often treats alerts in isolation rather than correlating them to identify broader campaigns, such as region-specific botnets or coordinated credential stuffing [7].
- Latency in Incident Response: Manual triage processes are prohibitively time-consuming. Montasari et al. [3] emphasized that without automated mechanisms to cluster and prioritize threats, the window of exposure to cyber risks significantly increases.
1.3. Contributions
- Real-World Distributed Evaluation Dataset: Unlike studies relying on synthetic or outdated traffic, this study validates the proposed framework using a dataset collected directly from a globally distributed honeypot infrastructure, capturing authentic adversarial behaviors and diverse attack vectors from live threat landscapes. The dataset is available on request from the corresponding author, subject to applicable data sharing agreements and institutional data governance policies.
- Hybrid Learning Framework for Known and Unknown Threats: We propose a dual-stage architecture that combines supervised classification for known attack techniques with unsupervised clustering for anomaly discovery. This hybrid approach ensures both high-precision categorization of established threats and the proactive identification of emerging, previously unseen attack patterns.
- Optimized Feature Engineering for CTI: We developed a specialized feature engineering pipeline that incorporates cyclical temporal encoding and geographical metadata. Our results demonstrate that gradient boosting classifiers, leveraging these enriched features, achieve competitive stability in addressing the severe class imbalances inherent in real-world CTI data. Among the evaluated models, CatBoost achieves the highest Balanced Accuracy of 0.7895 on Dataset 1, while XGBoost and LightGBM demonstrate superior F1-Macro performance across other configurations. Classifier selection should be guided by the specific dataset and operational requirements of the target SOC environment.
- Behavioral Pattern and Campaign Discovery: By implementing density-based clustering, we successfully identified coordinated threat campaigns and hidden behavioral groups. This module correlates agent behaviors with source attribution, enabling the discovery of region-specific botnets and automated attack infrastructures that bypass traditional rule-based detection.
- Operational Efficiency in SOC Environments: We validated the framework’s suitability for real-time triage, achieving sub-minute inference latency across large-scale datasets. This ensures that the proposed system can be integrated into Security Operations Centers (SOCs) to reduce analyst workload and improve the mean time to detect (MTTD).
2. Literature Review
2.1. CTI
2.1.1. CTI Platforms and Sharing
2.1.2. Data Sources for CTI
2.1.3. CTI Benefits
2.1.4. Honeypots
2.2. Machine Learning in Cybersecurity
2.2.1. Supervised Learning Approaches
- Phishing: Alam et al. emphasized the value of ML in thwarting social engineering threats, achieving the best accuracy of 97% using Random Forest for phishing attack detection [19].
- IoT Security: Al-Hawawreh et al. offered a deep learning-based threat intelligence algorithm (DLTI) designed for complex IoT networks, comparing it with K-Nearest Neighbors (KNN), Naïve Bayes (NB), and Logistic Regression (LR) [20]. Mishra et al. used several models for IoT CTI, reporting high accuracies of 99.94% and 95.67% for Random Forest and KNN, respectively [21].
- Framework Vulnerabilities: Khurana et al. investigated poisoning attacks on AI-based threat intelligence systems. They used an ensembled semi-supervised approach combining an embedding model and an SVM model, achieving 71.73% accuracy [22].
- Data Gathering: Koloveas et al. advanced threat intelligence gathering by introducing INTIME, an ML-based framework for gathering and leveraging web data for CTI [23].
- Deep Learning: Lee et al. emphasized the use of artificial neural networks for cyber threat detection, highlighting ML’s significance in identifying intricate attack patterns [24].
- Attribution: Noel introduced “RedAI,” utilizing Naïve Bayes, Logistic Regression, Linear SVM, and Random Forest. The highest accuracy achieved was 93.65% using Linear SVM [25]. Noor et al. achieved 95% accuracy in attributing cybercrime threats using ANNs, demonstrating the effectiveness of high-level IoCs over low-level IoCs [26].
- Mobile Security: Tahtaci and Canbay used Random Forest and Decision Trees to detect malware on Android, achieving an accuracy of 95% with Random Forest [27].
2.2.2. Unsupervised Learning Approaches
2.3. Summary and Implications
3. Data Description
3.1. Feature Engineering and Preprocessing
- UTC Standardization: All timestamps from the distributed agents (US and Saudi Arabia) were normalized to Coordinated Universal Time (UTC) to align event sequences across different time zones.
- Cyclical Time Encoding: Treating the hour as a linear integer (0–23) creates an artificial discontinuity (e.g., 23:00 is numerically far from 00:00 but temporally adjacent). To resolve this, we transformed time features into cyclical coordinates using sine and cosine functions: $h_{\sin} = \sin(2\pi h / 24)$ and $h_{\cos} = \cos(2\pi h / 24)$, where $h$ is the event hour. This ensures that the model correctly interprets the continuity of time.
- Categorical Encoding: To support high-dimensional feature spaces required for gradient boosting classifiers and HDBSCAN, categorical variables such as agent.name and rule.description were transformed using One-Hot Encoding. This prevents the models from inferring incorrect ordinal relationships while preserving the distinct identity of each feature.
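The cyclical and one-hot encodings described in this subsection can be illustrated with a minimal sketch. It assumes a 24-hour period for the hour feature; the function names and the ordering of the agent vocabulary are illustrative choices and not taken from the study's code.

```python
import math


def encode_hour(hour: int) -> tuple[float, float]:
    """Map an hour in 0..23 onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * hour / 24.0
    return math.sin(angle), math.cos(angle)


def circle_distance(h1: int, h2: int) -> float:
    """Euclidean distance between two cyclically encoded hours."""
    s1, c1 = encode_hour(h1)
    s2, c2 = encode_hour(h2)
    return math.hypot(s1 - s2, c1 - c2)


def one_hot(value: str, vocabulary: list[str]) -> list[int]:
    """Return a 0/1 indicator vector; unseen values map to all zeros."""
    return [1 if value == item else 0 for item in vocabulary]


# Agent names come from the data collection section; the ordering is an
# arbitrary assumption made for this example.
AGENTS = ["wazuhOT", "deepfake_v2_us", "deepfake_v1", "koko"]

if __name__ == "__main__":
    # 23:00 and 00:00 end up as close as any other pair of adjacent hours.
    print(circle_distance(23, 0), circle_distance(0, 1))
    # Opposite hours are the farthest apart on the circle.
    print(circle_distance(12, 0))
    print(one_hot("koko", AGENTS))
```

With this encoding, the distance between 23:00 and 00:00 equals the distance between any two consecutive hours, which is the property the linear integer representation lacks.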
3.2. Data Collection Environment
- United States Zone (US Zone): This region hosted two agents, wazuhOT and deepfake_v2_us. These agents were deployed on public network segments (IP range 199.168.x.x), exposing them directly to the open internet. Consequently, this zone attracted the highest volume of automated scanning and brute-force attacks.
- Saudi Arabia Zone (Riyadh): This region included agents deepfake_v1 and koko. Unlike the US zone, these agents were deployed within a private network environment (IP range 192.168.x.x) in Riyadh. Despite sitting behind a firewall/NAT, these nodes exhibited a distinct attack profile and were targeted by persistent threat actors originating from specific regions such as Russia.
3.3. Distributed Threat Stream Overview
- Phase 1: 24–26 May 2025 (10,000 events).
- Phase 2: 1–2 June 2025 (10,000 events).
- Phase 3: 1–2 July 2025 (10,000 events).
3.4. Operational Triage Gap and Research Motivation
4. Methodology
4.1. Phase 1: Data Pre-Processing and Sampling
4.2. Phase 2: Hybrid Feature Engineering and Encoding
- Cyclical Temporal Transformation: To preserve the continuous nature of time, the event hour $h$ is mapped to sine and cosine coordinates ($\sin(2\pi h / 24)$, $\cos(2\pi h / 24)$). This mathematical transformation ensures that 23:00 and 00:00 are treated as temporally adjacent.
- Numerical Identity Mapping: Categorical metadata, including destination username and agent name, are transformed into numerical vectors using label encoding or hashing. This allows the model to process non-numeric identity context.
- Network Infrastructure Representation: Source IP addresses are converted to their integer equivalents to preserve implicit subnet proximity in the feature space. Future work will explore AS-level embeddings as a richer network infrastructure representation.
- Behavioral and Geo-Spatial Contextualization: We quantified the recurrence of alerts via rule.firedtimes and derived geographical origin metadata (source country) from IP-to-location mapping.
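The identity and network encodings listed above can be sketched as follows. This is a hedged illustration only: the event field names (`srcip`, `dstuser`, `agent`, `firedtimes`) and the `LabelEncoder` helper are hypothetical, and the example addresses come from a documentation range rather than from the dataset.

```python
import ipaddress


def ip_to_int(ip: str) -> int:
    """Convert a dotted-quad IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


class LabelEncoder:
    """Assign consecutive integer ids to categorical values as they appear."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}

    def encode(self, value: str) -> int:
        # A new value receives the next free id; a known value keeps its id.
        return self._ids.setdefault(value, len(self._ids))


def build_row(event: dict, users: LabelEncoder, agents: LabelEncoder) -> list[int]:
    """Turn one alert (hypothetical field names) into a numeric feature row."""
    return [
        ip_to_int(event["srcip"]),
        users.encode(event["dstuser"]),
        agents.encode(event["agent"]),
        int(event["firedtimes"]),
    ]


if __name__ == "__main__":
    users, agents = LabelEncoder(), LabelEncoder()
    event = {"srcip": "203.0.113.5", "dstuser": "root", "agent": "koko", "firedtimes": 3}
    print(build_row(event, users, agents))
    # Addresses in the same subnet map to nearby integers.
    print(abs(ip_to_int("203.0.113.5") - ip_to_int("203.0.113.9")))
```

Integer conversion keeps addresses from the same subnet numerically close, but it also imposes an ordering between unrelated networks, which is one reason richer representations are listed as future work.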
4.3. Experimental Feature Configurations
- Minimal (IP + Time): Represents the baseline telemetry focusing on the fundamental “when” and “where” (at the network level). It utilizes only the integer-converted source IP and cyclical temporal coordinates.
- Contextual: Enhances the baseline by adding the “who” and “how” of an event. This set includes target user identity, reporting agent profiles, and rule firing frequency to capture the behavioral footprint of the adversary.
- Contextual + Geo: The most comprehensive configuration, incorporating geographical intelligence. By adding encoded country-level data, the model can correlate behavioral patterns with regional anomalies.
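The three configurations are nested, which can be made explicit with a small sketch. The column names below are hypothetical placeholders for the engineered features; only the grouping into Minimal, Contextual, and Contextual + Geo follows the text.

```python
# Column names are illustrative assumptions; the grouping mirrors the
# three feature configurations described in the text.
MINIMAL = ["src_ip_int", "hour_sin", "hour_cos"]
CONTEXTUAL = MINIMAL + ["dst_user", "agent_name", "rule_firedtimes"]
CONTEXTUAL_GEO = CONTEXTUAL + ["country"]

FEATURE_SETS = {
    "minimal": MINIMAL,
    "contextual": CONTEXTUAL,
    "contextual_geo": CONTEXTUAL_GEO,
}


def select_features(row: dict, config: str) -> list:
    """Project one encoded event onto the columns of a configuration."""
    return [row[name] for name in FEATURE_SETS[config]]


if __name__ == "__main__":
    row = {
        "src_ip_int": 3405803781, "hour_sin": 0.0, "hour_cos": 1.0,
        "dst_user": 0, "agent_name": 2, "rule_firedtimes": 5, "country": 7,
    }
    for name in FEATURE_SETS:
        print(name, select_features(row, name))
```

Because each configuration extends the previous one, any performance difference between them can be attributed to the added columns rather than to a change in the baseline features.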
4.4. Phase 3: Hybrid Modeling Framework (Track 1 & Track 2)
4.4.1. Track 1: High-Performance Classification
4.4.2. Track 2: Density-Based Pattern Discovery
4.5. Validation Framework
4.5.1. Algorithm Selection Rationale
4.5.2. Decision-Level Integration
5. Experimental Results
5.1. Supervised Learning Results: Classification and Attribution
5.1.1. Experimental Setup and Iterations
- Rule Description: To evaluate the model’s ability to classify raw SIEM alert semantics.
- MITRE ATT&CK Tactic: To assess high-level adversarial goals.
- MITRE ATT&CK Technique: To validate granular-level classification of specific attack methods.
5.1.2. Operational Efficiency and Real-Time Feasibility
5.2. Unsupervised Learning Results
5.2.1. Clustering Scenarios
- Scenario 1 (Comprehensive): Uses all available features (Attack Technique, Agent, Country, User, Time).
- Scenario 2 (Temporal Patterns): Uses Attack Technique, Attacker IP, and Time to find time-synchronized attacks.
- Scenario 3 (Infrastructure Focus): Uses Attack Technique, Agent, and Country to identify infrastructure-based campaigns.
- Scenario 4 (Victim Targeting): Uses Attack Technique, User, and Country to detect credential stuffing campaigns.
- Scenario 5 (Time-Agent Correlation): Uses Attack Technique, Agent, and Time.
5.2.2. Unsupervised Learning Results: Pattern Discovery via K-Means
- Scenario 3 (Agent + Country): This scenario achieved the highest overall performance, with a maximum Silhouette Score of 0.9253 on dataset 3, as further illustrated by the Silhouette Score analysis in Figure 5, which confirms the optimal cluster granularity at K = 40. The scores were consistently high across all datasets (0.8755, 0.8189, 0.9253 for Datasets 1, 2, and 3, respectively), indicating that grouping attacks by their technique, target agent, and source country creates the most distinct and interpretable clusters.
- Scenario 4 (User + Country): This scenario also performed exceptionally well, with Silhouette Scores of 0.8603, 0.8805, and 0.9054 on Datasets 1, 2, and 3, respectively. This suggests that attacks targeting specific user accounts from specific regions form very cohesive groups.
- Scenario 5 (Agent + Time): While achieving a high score of 0.9187 on dataset 1, the performance was slightly less stable than Scenario 3.
- Scenarios 1 & 2: Scenarios involving raw timestamps or high-dimensional combinations (Scenario 1) showed significantly lower scores (0.54–0.75). This implies that raw time data introduces high variance, reducing cluster cohesion [29].
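For readers unfamiliar with the metric used to compare the scenarios, the following sketch computes the mean silhouette coefficient from scratch with Euclidean distance. It is a didactic illustration on toy points, not the evaluation code used in the study, and it follows the common convention that singleton clusters contribute a score of zero.

```python
import math


def silhouette_score(points, labels) -> float:
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters: dict = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette requires at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scale = max(a, b)
        if scale > 0:
            total += (b - a) / scale
    return total / len(points)


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
    print(silhouette_score(pts, [0, 0, 1, 1]))  # compact, well-separated groups
    print(silhouette_score(pts, [0, 1, 0, 1]))  # groups that cut across the gap
```

Scores close to 1 indicate compact, well-separated clusters, while scores near or below 0 indicate overlapping or mis-assigned groups, which is how the scenario scores above should be read.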
5.2.3. Unsupervised Learning Results: Pattern Discovery via HDBSCAN
5.2.4. Unsupervised Learning Results: Pattern Discovery and Baseline Comparison
5.2.5. Comparative Analysis of Clustering Performance
6. Discussion
6.1. Supervised Classification and Balanced Attribution
- Gradient Boosting vs. Random Forest: The three gradient boosting models (XGBoost, LightGBM, and CatBoost) outperform Random Forest in most dataset-metric combinations, as shown in Table 4. Random Forest ranked last in 9 out of 12 metric-dataset combinations, indicating that gradient boosting methods are better suited to high-dimensional, class-imbalanced CTI data. Real-world CTI data is characterized by extreme class imbalances, where routine events (e.g., SSH authentication failures) dwarf rare but critical attack techniques. As shown in Table 3, while raw accuracy remained around 0.60–0.72, the Balanced Accuracy of gradient boosting models reached 0.7895 on dataset 1 and up to 0.8178 on dataset 2 (both CatBoost), indicating resilience against severe class imbalance.
- Dataset-Dependent Performance: No single gradient boosting model dominates across all configurations. CatBoost achieves the highest Balanced Accuracy across all three datasets (0.7895, 0.8178, 0.7693), while XGBoost leads in F1-Macro in dataset 2 (0.7183) and LightGBM performs best in dataset 3 across multiple metrics. This variability reflects the inherent diversity of real-world threat landscapes, where log structure, attack distribution, and class imbalance characteristics differ across SOC environments. Practitioners are therefore encouraged to evaluate all gradient boosting models on their own SIEM data and select the most suitable classifier for their operational context.
- Feature Synergy (Temporal and Geo-Context): Our feature importance analysis (Figure 4) reveals that the integration of cyclical temporal encoding (sine and cosine of the event hour) and geographical metadata was the primary driver of classification success across all gradient boosting models. The ability to correlate “when” an attack occurs with “where” it originates allows for more nuanced attribution than simple IP-based filtering.
- Computational Scalability: The framework processed the entire test set in under 30 s. The computational efficiency of gradient boosting models ensures that the proposed framework can be deployed in live SIEM pipelines to provide near-real-time labels for incoming threat streams without introducing significant latency.
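The gap between raw accuracy and Balanced Accuracy discussed above can be reproduced on a toy example. The sketch below implements the standard definitions (Balanced Accuracy as the mean of per-class recall, F1-Macro as the unweighted mean of per-class F1); the class labels and the 95/5 split are hypothetical and chosen only to mimic a heavily imbalanced alert stream.

```python
from collections import Counter


def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred) -> float:
    """Mean of per-class recall, so every class counts equally."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[c] / support[c] for c in support) / len(support)


def f1_macro(y_true, y_pred) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical stream: 95 routine alerts, 5 rare technique alerts,
    # and a degenerate model that always predicts the majority class.
    y_true = ["ssh_auth_failure"] * 95 + ["rare_technique"] * 5
    y_pred = ["ssh_auth_failure"] * 100
    print(accuracy(y_true, y_pred))
    print(balanced_accuracy(y_true, y_pred))
    print(f1_macro(y_true, y_pred))
```

A model that ignores the minority class entirely still scores 0.95 on raw accuracy in this example but only 0.5 on Balanced Accuracy, which is why the discussion relies on the latter.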
6.2. Interpretation of Density-Based Clustering (HDBSCAN)
6.2.1. Automated Noise Suppression
6.2.2. Discovery of Coordinated Campaigns (Scenario 3)
6.3. Practical Application and Case Study Interpretation
- Rapid Attribution (Supervised Track): The gradient boosting classifier (XGBoost in this study) serves as the first line of defense, categorizing approximately 95% of incoming telemetry, such as routine SSH authentication failures, with sub-second inference latency per feature configuration. This rapid labeling allows SOC analysts to skip manual review of established threat patterns and focus on anomalies that lack predefined signatures.
- Intelligent Campaign Discovery (Unsupervised Track): Following initial classification, HDBSCAN processes the remaining data to identify coordinated adversarial behaviors. By correlating temporal bursts with agent-specific telemetry, the model successfully isolated high-density clusters representing region-specific botnet infrastructures, achieving a peak Silhouette Score of 0.8584. This transition from analyzing individual alerts to investigating “attack campaigns” enables security teams to implement bulk-blocking strategies based on behavioral signatures rather than volatile IP addresses.
6.4. Strategic Noise Suppression and Operational Impact
6.5. Methodological Analysis and Limitations
- IP-to-Location Accuracy: Although the model achieved high attribution stability, geographical metadata derived from IP-to-location mapping should be treated as a probabilistic contextual signal rather than a definitive attribution mechanism. Future iterations should integrate AS reputation scoring and cross-session behavioral fingerprinting as secondary validation signals.
- Dynamic Feature Cardinality: The high number of clusters discovered reflects the diversity of the threat landscape but also suggests high feature cardinality. Future work should explore embedding techniques (e.g., Word2Vec for logs) to represent these features in a lower-dimensional, semantic space.
- Class Imbalance and Under-sampling: One limitation of this study is the use of random under-sampling to address severe class imbalance, as SSH authentication attacks accounted for more than 95% of the collected telemetry. While this approach was necessary to ensure adequate representation of minority attack classes, it reduced the preservation of real-world traffic distributions and may have resulted in optimistic performance estimates. In practical deployments, high-volume SSH brute-force traffic can be filtered prior to classification, allowing the model to focus on more diverse threat activities. Future work will investigate cost-sensitive learning and synthetic over-sampling techniques, such as SMOTE-NC, to better preserve real-world traffic characteristics while maintaining minority-class detection performance.
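Random under-sampling, as discussed in the limitation above, can be sketched in a few lines. The per-class cap, the seed, and the toy labels are assumptions made for illustration; the study does not state the exact sampling ratio it used.

```python
import random
from collections import defaultdict


def random_undersample(rows, label_of, cap: int, seed: int = 42) -> list:
    """Keep at most `cap` randomly chosen rows per class.

    `label_of` extracts the class label from a row. Minority classes with
    fewer than `cap` rows are kept in full.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    sampled = []
    for members in by_class.values():
        if len(members) > cap:
            members = rng.sample(members, cap)
        sampled.extend(members)
    rng.shuffle(sampled)  # avoid leaving the classes grouped together
    return sampled


if __name__ == "__main__":
    # Hypothetical stream dominated by one class, similar in spirit to
    # the SSH-heavy telemetry described in the text.
    rows = [("ssh_auth", i) for i in range(950)] + [("port_scan", i) for i in range(50)]
    balanced = random_undersample(rows, label_of=lambda r: r[0], cap=50)
    print(len(balanced))
```

The sampled set no longer reflects the real traffic distribution, which is the trade-off the limitation describes: minority classes become learnable, but metrics measured on such data may be optimistic.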
7. Conclusions
7.1. Summary of Research Findings
7.2. Future Research Directions
- Transition to Real-time Streaming: Future iterations will implement online learning versions of XGBoost, LightGBM, CatBoost and HDBSCAN to update threat clusters dynamically as data flows directly from SIEM streams, further minimizing the detection window.
- Automated Model Selection: Future work will investigate automated classifier selection mechanisms that dynamically identify the optimal gradient boosting model based on the characteristics of incoming SIEM data, such as class distribution, feature dimensionality, and attack diversity, enabling fully adaptive CTI pipelines across diverse SOC environments.
- Cascaded Pipeline Integration: While this study evaluated the supervised and unsupervised tracks in parallel to establish baseline efficacies, future work will focus on developing a fully cascaded integration architecture. This will involve implementing dynamic confidence thresholds where high-confidence predictions from the gradient boosting classifier are filtered out, and only anomalous or low-confidence traffic is forwarded to HDBSCAN. This sequential approach will optimize computational efficiency for large-scale, real-time SOC deployments.
- Explainable AI (XAI) for CTI: To enhance trust in automated triage, we plan to integrate SHAP (SHapley Additive exPlanations) or Large Language Models (LLMs) to generate human-readable explanations for why specific clusters or alerts were flagged as high-risk.
- Advanced Log Embeddings: We aim to explore semantic feature engineering using Transformer-based log embeddings to capture the deeper contextual relationships between heterogeneous Indicators of Compromise (IoCs) more effectively than traditional encoding methods.
- STIX Integration: Future work will explore ways to serialize extracted IoC and cluster intelligence into STIX format bundles to enable seamless integration with threat sharing platforms such as MISP or OpenCTI.
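The cascaded integration proposed above can be outlined as a simple routing step. This is only a sketch of the idea under stated assumptions: the 0.9 threshold, the event identifiers, and the shape of the probability output (one dictionary of class probabilities per event) are hypothetical, and the clustering stage itself is left out.

```python
def route_events(events, probabilities, threshold: float = 0.9):
    """Split events by classifier confidence.

    Returns (labeled, forwarded): events whose top class probability meets
    the threshold are labeled directly; the rest are forwarded to the
    unsupervised clustering stage.
    """
    labeled, forwarded = [], []
    for event, probs in zip(events, probabilities):
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            labeled.append((event, label, confidence))
        else:
            forwarded.append(event)
    return labeled, forwarded


if __name__ == "__main__":
    events = ["evt-1", "evt-2", "evt-3"]
    probabilities = [
        {"ssh_auth_failure": 0.97, "port_scan": 0.03},
        {"ssh_auth_failure": 0.55, "port_scan": 0.45},
        {"ssh_auth_failure": 0.10, "port_scan": 0.90},
    ]
    labeled, forwarded = route_events(events, probabilities)
    print(labeled)
    print(forwarded)
```

In practice the threshold would need to be tuned per deployment, since a value that is too low hides novel activity behind confident but wrong labels, while a value that is too high forwards most traffic to the slower clustering stage.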
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Wagner, T.D.; Mahbub, K.; Palomar, E.; Abdallah, A.E. Cyber threat intelligence sharing: Survey and research directions. Comput. Secur. 2019, 87, 101589.
2. Brown, R.; Lee, R.M. The Evolution of Cyber Threat Intelligence (CTI): 2019 SANS CTI Survey. SANS Institute White Paper. 2019. Available online: https://www.sans.org/white-papers/38790/ (accessed on 19 April 2026).
3. Montasari, R.; Carroll, F.; Macdonald, S.; Jahankhani, H.; Hosseinian-Far, A.; Daneshkhah, A. Application of artificial intelligence and machine learning in producing actionable cyber threat intelligence. In Digital Forensic Investigation of Internet of Things (IoT) Devices; Springer: Berlin/Heidelberg, Germany, 2020; pp. 47–64.
4. Guo, R.; Li, A.; Liu, H. An Adversarial Attack Detection Method Based on Bidirectional Consistency Discrimination for Deep Learning-Based Soft Sensors. In Proceedings of the 2025 CAA Symposium on Fault Detection, Supervision, and Safety for Technical Processes (SAFEPROCESS), Urumqi, China, 22–24 August 2025; pp. 1–6.
5. Spyros, A.; Koritsas, I.; Papoutsis, A.; Panagiotou, P.; Chatzakou, D.; Kavallieros, D.; Tsikrika, T.; Vrochidis, S.; Kompatsiaris, I. AI-based holistic framework for cyber threat intelligence management. IEEE Access 2025, 13, 20820–20846.
6. Alqahtani, H.; Sarker, I.H.; Kalim, A.; Hossain, S.M.M.; Ikhlaq, S.; Hossain, S. Cyber intrusion detection using machine learning classification techniques. In Computing Science, Communication and Security: First International Conference, COMS2 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 121–131.
7. Landauer, M.; Skopik, F.; Wurzenberger, M.; Rauber, A. System log clustering approaches for cyber security applications: A survey. Comput. Secur. 2020, 92, 101739.
8. Alzahrani, I.Y.; Lee, S.; Kim, K. Enhancing Cyber-Threat Intelligence in the Arab World: Leveraging IoC and MISP Integration. Electronics 2024, 13, 2526.
9. Korte, K. Measuring the Quality of Open Source Cyber Threat Intelligence Feeds. Master’s Thesis, Jyväskylä University of Applied Sciences, Jyväskylä, Finland, 2021. Available online: https://www.theseus.fi/handle/10024/500534 (accessed on 19 April 2026).
10. Bautista, W. Practical Cyber Intelligence: How Action-Based Intelligence Can Be an Effective Response to Incidents; Packt Publishing Ltd.: Mumbai, India, 2018.
11. Pouget, F.; Dacier, M. Honeypot-based forensics. In Proceedings of the AusCERT Asia Pacific Information Technology Security Conference, Gold Coast, Australia, 23–27 May 2004.
12. Mairh, A.; Barik, D.; Verma, K.; Jena, D. Honeypot in network security: A survey. In Proceedings of the 2011 International Conference on Communication, Computing & Security, Odisha, India, 12–14 February 2011; pp. 600–605.
13. Franco, J.; Aris, A.; Canberk, B.; Uluagac, A.S. A Survey of Honeypots and Honeynets for Internet of Things, Industrial Internet of Things, and Cyber-Physical Systems. IEEE Commun. Surv. Tutor. 2021, 23, 2351–2383.
14. Vetterl, A.; Clayton, R. Honware: A virtual honeypot framework for capturing CPE and IoT zero days. In Proceedings of the 2019 APWG Symposium on Electronic Crime Research (eCrime), Pittsburgh, PA, USA, 13–15 November 2019; pp. 1–13.
15. El Kouari, O.; Lazaar, S.; Achoughi, T. Fortifying industrial cybersecurity: A novel industrial internet of things architecture enhanced by honeypot integration. Int. J. Electr. Comput. Eng. 2025, 15, 1089.
16. Al-Mhiqani, M.N.; Ahmad, R.; Zainal Abidin, Z.; Yassin, W.; Hassan, A.; Abdulkareem, K.H.; Ali, N.S.; Yunos, Z. A review of insider threat detection: Classification, machine learning techniques, datasets, open challenges, and recommendations. Appl. Sci. 2020, 10, 5208.
17. Abu Al-Haija, Q. Top-down machine learning-based architecture for cyberattacks identification and classification in IoT communication networks. Front. Big Data 2022, 4, 782902.
18. Asif, M.; Abbas, S.; Khan, M.; Fatima, A.; Khan, M.A.; Lee, S.W. MapReduce based intelligent model for intrusion detection using machine learning technique. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 9723–9731.
19. Alam, M.N.; Sarma, D.; Lima, F.F.; Saha, I.; Ulfath, R.-E.; Hossain, S. Phishing attacks detection using machine learning approach. In Proceedings of the 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 20–22 August 2020; pp. 1173–1179.
20. Al-Hawawreh, M.; Moustafa, N.; Garg, S.; Hossain, M.S. Deep learning-enabled threat intelligence scheme in the Internet of Things networks. IEEE Trans. Netw. Sci. Eng. 2020, 8, 2968–2981.
21. Mishra, S.; Albarakati, A.; Sharma, S.K. Cyber threat intelligence for IoT using machine learning. Processes 2022, 10, 2673.
22. Khurana, N.; Mittal, S.; Piplai, A.; Joshi, A. Preventing poisoning attacks on AI based threat intelligence systems. In Proceedings of the 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), Pittsburgh, PA, USA, 13–16 October 2019; pp. 1–6.
23. Koloveas, P.; Chantzios, T.; Alevizopoulou, S.; Skiadopoulos, S.; Tryfonopoulos, C. INTIME: A machine learning-based framework for gathering and leveraging web data to cyber-threat intelligence. Electronics 2021, 10, 818.
24. Lee, J.; Kim, J.; Kim, I.; Han, K. Cyber threat detection based on artificial neural networks using event profiles. IEEE Access 2019, 7, 165607–165626.
25. Noel, L. RedAI: A Machine Learning Approach to Cyber Threat Intelligence. Master’s Thesis, James Madison University, Harrisonburg, VA, USA, 2021.
26. Noor, U.; Shahid, S.; Kanwal, R.; Rashid, Z. A Machine Learning Based Empirical Evaluation of Cyber Threat Actors High Level Attack Patterns Over Low Level Attack Patterns in Attributing Attacks. arXiv 2023, arXiv:2307.10252.
27. Tahtaci, B.; Canbay, B. Android malware detection using machine learning. In Proceedings of the 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), Istanbul, Turkey, 15–17 October 2020; pp. 1–6.
28. Riadi, I.; Istiyanto, J.; Ashari, A.; Saleh, S.S. Log Analysis Techniques Using Clustering in Network Forensics. arXiv 2013, arXiv:1307.0072.
29. Suyal, M.; Sharma, S. A Review on Analysis of K-Means Clustering Machine Learning Algorithm based on Unsupervised Learning. J. Artif. Intell. Syst. 2024, 6, 85–95.
30. Sinaga, K.P.; Yang, M.S. Unsupervised K-means Clustering Algorithm. IEEE Access 2020, 8, 80716–80727.
31. Mustafa, Z.; Amin, R.; Aldabbas, H.; Ahmed, N. Intrusion detection systems for software-defined networks: A comprehensive study on machine learning-based techniques. Clust. Comput. 2024, 27, 9635–9661.
32. Yoga, C.A.; Rodrigues, A.J.; Abeka, S.O. Hybrid Machine Learning Approach for Attack Classification and Clustering in Network Security. Int. J. Comput. Appl. 2023, 185, 45–51.
33. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
34. Vielberth, M.; Böhm, F.; Fichtinger, I.; Pernul, G. Security operations center: A systematic study and open challenges. IEEE Access 2020, 8, 227756–227779.
35. Magán-Carrión, R.; Urda, D.; Díaz-Cano, I.; Dorronsoro, B. Towards a reliable comparison and evaluation of network intrusion detection systems based on machine learning approaches. Appl. Sci. 2020, 10, 1775.
36. Strom, B.E.; Applebaum, A.; Miller, D.P.; Nickels, K.C.; Pennington, A.G.; Thomas, C.B. MITRE ATT&CK: Design and Philosophy; Technical report; The MITRE Corporation: McLean, VA, USA, 2018.
37. Campello, R.J.; Moulavi, D.; Sander, J. Density-based clustering based on hierarchical density estimates. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining; Springer: Berlin/Heidelberg, Germany, 2013; pp. 160–172.
38. Humaira, H.; Rasyidah, R. Determining the appropiate cluster number using elbow method for k-means algorithm. In Proceedings of the 2nd Workshop on Multidisciplinary and Applications (WMA), Padang, Indonesia, 24–25 January 2018; pp. 1–8.

| Feature Name | Minimal | Contextual | Contextual + Geo |
|---|---|---|---|
| Source IP (integer-encoded) | ✔ | ✔ | ✔ |
| Temporal (hour sine/cosine) | ✔ | ✔ | ✔ |
| Destination User | | ✔ | ✔ |
| Agent Name | | ✔ | ✔ |
| Rule Fired Times | | ✔ | ✔ |
| Geo Location (Country) | | | ✔ |

| Model | Parameter | Value |
|---|---|---|
| XGBoost | n_estimators | 500 |
| | max_depth | 6 |
| | learning_rate | 0.05 |
| | subsample | 0.9 |
| | colsample_bytree | 0.9 |
| | reg_lambda | 1.0 |
| | objective | multi:softprob/binary:logistic |
| LightGBM | n_estimators | 500 |
| | max_depth | 6 |
| | learning_rate | 0.05 |
| | subsample | 0.9 |
| | colsample_bytree | 0.9 |
| | class_weight | Inverse-frequency weighted (class_w) |
| | objective | multiclass/binary |
| CatBoost | iterations | 500 |
| | depth | 6 |
| | learning_rate | 0.05 |
| | bootstrap_type | Bernoulli |
| | subsample | 0.9 |
| | auto_class_weights | Balanced |
| | objective | MultiClass/Logloss |
| HDBSCAN | min_cluster_size | 10 |
| | min_samples | 5 |
| | metric | euclidean |
| | cluster_selection_method | eom |

| Dataset | Model | Accuracy | Balanced Acc. | Recall | F1-Score |
|---|---|---|---|---|---|
| Dataset 1 (Geo) | Random Forest | 0.5560 | 0.6348 | 0.6242 | 0.5534 |
| | XGBoost | 0.6015 | 0.7712 | 0.6849 | 0.6005 |
| Dataset 2 (Geo) | Random Forest | 0.7269 | 0.6712 | 0.6357 | 0.7248 |
| | XGBoost | 0.7289 | 0.7696 | 0.6447 | 0.7354 |
| Dataset 3 (Context) | Random Forest | 0.5555 | 0.5952 | 0.5997 | 0.5558 |
| | XGBoost | 0.5860 | 0.7261 | 0.6501 | 0.5863 |

| Dataset | Model | Accuracy | Balanced Acc. | F1-Macro | F1-Weighted |
|---|---|---|---|---|---|
| Dataset 1 (Geo) | Random Forest | 0.5560 | 0.6348 | 0.6242 | 0.5534 |
| | XGBoost | 0.5975 | 0.7628 | 0.6806 | 0.5965 |
| | LightGBM | 0.5995 | 0.7593 | 0.6756 | 0.5994 |
| | CatBoost | 0.6000 | 0.7895 | 0.6571 | 0.5931 |
| Dataset 2 (Geo) | Random Forest | 0.7271 | 0.6478 | 0.6563 | 0.7242 |
| | XGBoost | 0.7326 | 0.7653 | 0.7183 | 0.7384 |
| | LightGBM | 0.7284 | 0.7221 | 0.6653 | 0.7339 |
| | CatBoost | 0.6893 | 0.8178 | 0.6044 | 0.7003 |
| Dataset 3 (Context) | Random Forest | 0.5540 | 0.5897 | 0.5898 | 0.5538 |
| | XGBoost | 0.5770 | 0.7212 | 0.6506 | 0.5769 |
| | LightGBM | 0.5885 | 0.7255 | 0.6522 | 0.5891 |
| | CatBoost | 0.5815 | 0.7693 | 0.6470 | 0.5788 |

| Scenario | Dataset | Optimal k | Silhouette Score | Time (s) |
|---|---|---|---|---|
| Scenario 1 | Dataset 1 | 38 | 0.4437 | 0.21 |
| | Dataset 2 | 39 | 0.3660 | 0.25 |
| | Dataset 3 | 39 | 0.5050 | 0.52 |
| Scenario 2 | Dataset 1 | 40 | 0.7175 | 0.18 |
| | Dataset 2 | 39 | 0.6624 | 0.17 |
| | Dataset 3 | 40 | 0.6973 | 0.18 |
| Scenario 3 | Dataset 1 | 40 | 0.8755 | 0.20 |
| | Dataset 2 | 40 | 0.8189 | 0.19 |
| | Dataset 3 | 40 | 0.9253 | 0.20 |
| Scenario 4 | Dataset 1 | 40 | 0.8603 | 0.18 |
| | Dataset 2 | 40 | 0.8805 | 0.20 |
| | Dataset 3 | 40 | 0.9054 | 0.19 |
| Scenario 5 | Dataset 1 | 40 | 0.9187 | 0.16 |
| | Dataset 2 | 40 | 0.6632 | 0.20 |
| | Dataset 3 | 39 | 0.7323 | 0.17 |

| Scenario | Dataset | Clusters | Silhouette Score | Time (s) |
|---|---|---|---|---|
| Scenario 1 | Dataset 1 | 36 | 0.5492 | 8.66 |
| | Dataset 2 | 42 | 0.4802 | 7.76 |
| | Dataset 3 | 32 | 0.7883 | 10.34 |
| Scenario 2 | Dataset 1 | 43 | 0.7905 | 7.92 |
| | Dataset 2 | 44 | 0.7749 | 7.91 |
| | Dataset 3 | 37 | 0.7562 | 8.19 |
| Scenario 3 | Dataset 1 | 25 | 0.6801 | 8.96 |
| | Dataset 2 | 38 | 0.8365 | 7.78 |
| | Dataset 3 | 25 | 0.8170 | 8.57 |
| Scenario 4 | Dataset 1 | 23 | 0.5619 | 8.63 |
| | Dataset 2 | 28 | 0.8047 | 7.84 |
| | Dataset 3 | 24 | 0.8111 | 7.28 |
| Scenario 5 | Dataset 1 | 23 | 0.8584 | 7.42 |
| | Dataset 2 | 38 | 0.7603 | 7.80 |
| | Dataset 3 | 36 | 0.8562 | 7.14 |

| Scenario | Dataset | K-Means Sil. Score | K-Means Time (s) | HDBSCAN Sil. Score | HDBSCAN Noise (%) | HDBSCAN Time (s) |
|---|---|---|---|---|---|---|
| Scenario 3 (Infra Focus) | Dataset 1 | 0.8755 | 0.20 | 0.6801 | 9.60 | 8.96 |
| | Dataset 2 | 0.8189 | 0.19 | 0.8365 | 7.94 | 7.78 |
| | Dataset 3 | 0.9253 | 0.20 | 0.8170 | 4.65 | 8.57 |
| Scenario 5 (Time-Agent) | Dataset 1 | 0.9187 | 0.16 | 0.8584 | 4.37 | 7.42 |
| | Dataset 2 | 0.6632 | 0.20 | 0.7603 | 11.33 | 7.80 |
| | Dataset 3 | 0.7323 | 0.17 | 0.8562 | 8.00 | 7.14 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
AlJuhaiman, H.A.; Emad-ul-Haq, Q.; Kim, K.; Lee, S. Automated Cyber Threat Intelligence Extraction from Distributed Honeypots: A Hybrid Machine Learning Approach. Electronics 2026, 15, 2900. https://doi.org/10.3390/electronics15132900

