Bridging the Data Gap in ML-Based NIDS: An Automated Honeynet Platform for Generating Real-World Malware Traffic Datasets †
Abstract
1. Introduction
2. Materials and Methods
2.1. Automated Workflow and Data Processing
- Traffic Capture (SENSOR): Upon reboot, the node automatically launches the honeypots and initiates tcpdump, capturing all incoming traffic for a set period (e.g., 24 h).
- Data Compression and Transfer (SENSOR): Before the scheduled reboot, a script stops the capture, consolidates the data (PCAPs, logs, binaries) into a single folder, compresses it, and transfers the archive to the IMPORT node via SCP over the VPN.
- Data Decompression and Organization (IMPORT): The received archive is decompressed into a date-stamped directory for processing.
- Malicious Traffic Filtering and Analysis (IMPORT): This is the key processing step. Using the malicious IP addresses recorded in the honeypot logs, a script filters the large PCAP to extract only the flows involving those addresses, producing smaller, curated PCAPs. Concurrently, the hashes of the captured malware binaries are queried against VirusTotal and VxAPI to obtain threat-intelligence labels.
- Dataset Export and Storage (IMPORT): The filtered PCAPs, enriched logs, and binaries are compressed into a final dataset archive and uploaded to private cloud storage. The SENSOR node reboots, clearing its state for the next cycle.
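The filtering and hashing steps described above can be illustrated with a minimal Python sketch. It is not the authors' IMPORT code: the function names (`build_bpf_filter`, `tcpdump_filter_command`, `sha256_of`) are hypothetical, and it assumes that the malicious IPs have already been extracted from the honeypot logs and that filtering is delegated to tcpdump through its standard `-r` (read file) and `-w` (write file) options. The sketch only builds the command and computes the hash that a threat-intelligence lookup would use; it does not call VirusTotal or VxAPI.

```python
import hashlib
import ipaddress


def build_bpf_filter(ips):
    """Return a BPF expression matching traffic to or from any given IP.

    Addresses are validated and de-duplicated; a malformed address raises
    ValueError so that a corrupt log entry is noticed early.
    """
    unique = {ipaddress.ip_address(ip.strip()) for ip in ips}
    # Sort by (version, numeric value) so IPv4 and IPv6 can be mixed safely.
    ordered = sorted(unique, key=lambda a: (a.version, int(a)))
    return " or ".join(f"host {ip}" for ip in ordered)


def tcpdump_filter_command(src_pcap, dst_pcap, ips):
    """Build (without running) a tcpdump command that copies only the
    packets involving the given IPs from src_pcap into dst_pcap."""
    return [
        "tcpdump",
        "-r", str(src_pcap),   # read packets from the full capture
        "-w", str(dst_pcap),   # write matching packets to the curated PCAP
        build_bpf_filter(ips),
    ]


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 of a file in chunks, so large binaries are
    never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a host where tcpdump is installed. For very long IP lists, writing the expression to a file and passing it with tcpdump's `-F` option may be more practical than a single command-line argument.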
2.2. Implementation
3. Results
3.1. Dataset Overview and Threat Analysis
3.2. External Validation Using TrafficColab Platform
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| NIDS | Network Intrusion Detection System |
| ML | Machine Learning |
| DevSecOps | Development, Security, and Operations |
| PCAP | Packet Capture |
| VPN | Virtual Private Network |
References
- Ji, I.H.; Lee, J.H.; Kang, M.J.; Park, W.J.; Jeon, S.H.; Seo, J.T. Artificial Intelligence-Based Anomaly Detection Technology over Encrypted Traffic: A Systematic Literature Review. Sensors 2024, 24, 898.
- Ulloa-Cano, G. Diseño e Implementación de una Plataforma para la Generación de Datasets de Malware en Tráfico Real Mediante una Honeynet. [Master’s Thesis, National Polytechnic Institute]. Repositorio Digital Institucional (RDI). 2023. Available online: https://www.repositoriodigital.ipn.mx/ (accessed on 14 November 2024).
- Velarde-Alvarado, P.; Gonzalez, H.; Martínez-Peláez, R.; Mena, L.J.; Ochoa-Brust, A.; Moreno-García, E.; Félix, V.G.; Ostos, R. A Novel Framework for Generating Personalized Network Datasets for NIDS Based on Traffic Aggregation. Sensors 2022, 22, 1847.
- pfSense® Software Configuration Recipes: WireGuard Site-to-Site VPN Configuration Example | pfSense Documentation. Available online: https://docs.netgate.com/pfsense/en/latest/recipes/wireguard-s2s.html (accessed on 29 November 2023).
- Proxmox VE Administration Guide. Available online: https://pve.proxmox.com/pve-docs/pve-admin-guide.html (accessed on 28 November 2023).
- Gabriel-Ulloa/SENSOR: A Honeypot System Capturing Suspicious Network Traffic in a T-Pot Installation. Available online: https://github.com/Gabriel-Ulloa/SENSOR (accessed on 28 October 2025).
- Gabriel-Ulloa/IMPORT: Processes and Filters Malicious Traffic Data Captured by SENSOR. Available online: https://github.com/Gabriel-Ulloa/IMPORT (accessed on 28 October 2025).
- Rodriguez-Adame, O.G. Generación de Data Sets Etiquetados para Diseño de Sistemas de Detección de Intrusos en Red. [Master’s Thesis, National Polytechnic Institute]. Repositorio Digital Institucional (RDI). 2025. Available online: https://www.repositoriodigital.ipn.mx/ (accessed on 25 October 2025).
- NSL-KDD | Datasets | Research | Canadian Institute for Cybersecurity | UNB. Available online: https://www.unb.ca/cic/datasets/nsl.html (accessed on 5 December 2023).
- Intrusion Detection Evaluation Dataset (CIC-IDS2017). Available online: https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 5 December 2023).


| Capture ID | Duration (h) | Packets (k) | Size (MB) | Primary Threats |
|---|---|---|---|---|
| 11-29 | 9.00 | 584 | 292 | WannaCry, Downloaders |
| 11-30 | 22.98 | 1123 | 438 | Mirai, Miners |
| 12-01 | 22.98 | 1135 | 458 | Mirai variants |
| … | … | … | … | … |
| 12-14 | 22.98 | 1459 | 563 | Aidra, Gafgyt |
| Total/Avg. | 23.0 | 15,800 | 5,900 | |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Ulloa Cano, G.; Sánchez Pérez, G.; Portillo-Portillo, J.; Toscano Medina, L.K.; Hernández Suárez, A.; Olivares Mercado, J.; Pérez Meana, H.M.; García Villalba, L.J.; Velarde Alvarado, P. Bridging the Data Gap in ML-Based NIDS: An Automated Honeynet Platform for Generating Real-World Malware Traffic Datasets. Eng. Proc. 2026, 123, 36. https://doi.org/10.3390/engproc2026123036

