Hybrid AI-Based Framework for Generating Realistic Attack-Related Network Flow Data for Cybersecurity Digital Twins
Abstract
1. Introduction
1.1. Research Contribution
- Hybrid multivariate temporal generation method: We introduce a modular hybrid framework that leverages LSTM networks to model temporal features, capturing sequential dependencies across network flows and ensuring temporal coherence in attack patterns. For non-temporal features, the framework employs complementary generative models, with the best-performing technique selected based on dataset characteristics. Experimental results demonstrate that TVAE achieves the best performance on the CICFlowMeter dataset, while Gaussian Copula performs best on OCPPFlowMeter, both outperforming CTGAN. This adaptive strategy enables the joint preservation of temporal dynamics and multivariate statistical distributions.
- Decoupled temporal and static feature modeling: A key innovation of the proposed method is its architectural decoupling of sequence-aware pattern modeling from the modeling of static or non-temporal feature values. By isolating the temporal dimension and assigning it to a dedicated LSTM-based module, the framework ensures that the sequential nature of network flows is learned independently of the statistical modeling of other features. This modular design enhances scalability and flexibility, allowing for targeted optimization of each component. It also enables the system to be easily adapted to different network environments and attack scenarios.
- Comprehensive experimental validation: The proposed framework is evaluated using multiple network flow datasets, including CICFlowMeter [8] and OCPPFlowMeter [9], both containing labeled traffic data for various cyberattack types. The evaluation assesses the statistical fidelity, temporal realism, and overall utility of the generated data. Quantitative metrics, including Column Shapes, Column Pair Trends, an Overall Score, and Train-on-Real–Test-on-Synthetic (TRTS) classification scores, measure the realism and generalization capacity of the synthetic data. The results show that the hybrid approach generates high-quality, temporally coherent synthetic network flows suitable for training, testing, and validating AI-driven cybersecurity systems, including intrusion detection system (IDS) solutions.
1.2. Article Structure
2. Background: Synthetic Data Generation Techniques
2.1. Long Short-Term Memory (LSTM)-Based Models
2.2. Generative Adversarial Networks (GANs)
2.3. Tabular Variational Autoencoder (TVAE)
2.4. Statistical Methods: Gaussian Copula Models
3. Related Work
4. Method for Multivariate Temporal Synthetic Network Flow Data Generation
4.1. Limitations of Existing Approaches and the Rationale of Our Method
- LSTM networks are employed to model and generate temporal features, capturing sequential dependencies across flows and ensuring temporal coherence of attack patterns;
- Complementary generative models are selectively applied to non-temporal features, choosing the best-performing technique among CTGAN, TVAE, and Gaussian Copula according to dataset characteristics. Experimental results show that TVAE achieves the best performance for the CICFlowMeter dataset, while Gaussian Copula performs best for OCPPFlowMeter, both outperforming CTGAN;
- This adaptive hybrid strategy combines LSTM-based temporal modeling with data-driven selection of statistical generators, enabling the joint preservation of temporal dynamics and multivariate statistical distributions across heterogeneous feature types.
4.2. Method Formulation
5. Experimental Validation
5.1. Infrastructure
5.2. Validation Methodology
5.2.1. Sequence Modeling Evaluation
5.2.2. Synthesizers Evaluation
- Gaussian Copula, which models the joint distribution of the variables by fitting a marginal distribution to each feature and linking the marginals through a Gaussian copula, capturing statistical dependencies between features.
- CTGAN, which uses a conditional GAN architecture tailored for tabular data.
- TVAE, which leverages a variational autoencoder to learn and sample from the latent distribution of mixed-type tabular data.
- Column Shapes (CS): Measures how well the distribution of each individual column in the synthetic data matches the real data.
- Column Pair Trends (CPT): Assesses the preservation of relationships between pairs of columns.
- Overall Score (OS): The average of the Column Shapes and Column Pair Trends scores.
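As a quick check of how the three scores relate, the sketch below recomputes OS from CS and CPT for the Flooding Heartbeat attack on CICFlowMeter, using the values reported in the results table. The function name `overall_score` is illustrative and not taken from the original work.

```python
def overall_score(column_shapes: float, column_pair_trends: float) -> float:
    """Average the two fidelity scores into a single Overall Score."""
    return (column_shapes + column_pair_trends) / 2


# (CS, CPT) pairs reported for Flooding Heartbeat on CICFlowMeter.
reported = {
    "CTGAN": (0.6442, 0.8799),
    "TVAE": (0.8017, 0.9205),
    "Gaussian Copula": (0.7001, 0.9164),
}

scores = {name: overall_score(cs, cpt) for name, (cs, cpt) in reported.items()}
for name, score in scores.items():
    print(f"{name}: OS = {score:.4f}")
```

The recomputed values agree with the reported OS column to within rounding of the fourth decimal.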
5.2.3. Train-on-Real–Test-on-Synthetic (TRTS) Evaluation
5.2.4. Evaluation Strategy
5.3. Results
6. Discussion
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. LSTM Model Architecture and Hyperparameters
- B denotes the batch size, which is the number of sequences processed in parallel during training or inference. In the experiments, a batch size of 64 was used.
- W denotes the sequence length or window size, which is the number of time steps in each input sequence. The values of W are specified per attack type in Table A2.
| Layer | Input/Output Size | Function | 
|---|---|---|
| Input Layer | (B, W) | Accepts label indices as input | 
| Embedding Layer | (B, W, 32) | Maps labels to dense vectors of size 32 | 
| LSTM Layer | (B, W, 256) | Captures temporal dependencies | 
| Dropout Layer | (B, 256) | Regularization to reduce overfitting | 
| Dense Output Layer | (B, n) | Outputs class probabilities over n classes | 
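To give a sense of the model size implied by the layer table, the following sketch counts trainable parameters per layer. It rests on assumptions not stated in the source: the embedding vocabulary equals the number of classes n, the LSTM uses one bias vector per gate (the Keras convention; PyTorch keeps two), and n = 5 is used only as an example value.

```python
def lstm_param_count(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights, and one bias."""
    return 4 * (units * (input_dim + units) + units)


def model_param_count(num_labels: int, embed_dim: int = 32, lstm_units: int = 256) -> dict:
    """Trainable parameters for embedding -> LSTM -> dense softmax.

    Dropout has no trainable parameters, so it is omitted.
    Assumes the embedding vocabulary size equals num_labels.
    """
    counts = {
        "embedding": num_labels * embed_dim,
        "lstm": lstm_param_count(embed_dim, lstm_units),
        "dense": lstm_units * num_labels + num_labels,
    }
    counts["total"] = sum(counts.values())
    return counts


# Example value only: five labels (four attack classes plus normal).
counts = model_param_count(num_labels=5)
for layer, value in counts.items():
    print(f"{layer}: {value}")
```

Under these assumptions the LSTM layer dominates the parameter budget, while the embedding and output layers grow only linearly with n.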
| Attack | Dataset Type | Window | Stride | 
|---|---|---|---|
| Flooding Heartbeat | CICFlowMeter | 28 | 0.1 | 
| Flooding Heartbeat | OCPPFlowMeter | 16 | 0.2 | 
| DoCharge | CICFlowMeter | 16 | 0.3 | 
| DoCharge | OCPPFlowMeter | 16 | 0.4 | 
| Charging Profile Manipulation | CICFlowMeter | 16 | 0.3 | 
| Charging Profile Manipulation | OCPPFlowMeter | 32 | 0.01 | 
| Balanced | CICFlowMeter | 32 | 0.01 | 
| Balanced | OCPPFlowMeter | 32 | 0.06 | 
Appendix B. Generative Model Training Parameters
- CTGAN
  - Epochs: 400–700 (with model checkpoints saved periodically)
  - Batch size: 500
  - Embedding size: 64
  - Generator network dimensions: [128, 128]
  - Discriminator network dimensions: [128, 128]
- TVAE
  - Epochs: 400–700 (with model checkpoints saved periodically)
  - Batch size: 500
  - Embedding size: 64
- Gaussian Copula: the gaussian_kde distribution (kernel density estimation with a Gaussian kernel) was used.
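The parameters above could be expressed as constructor arguments along the following lines. This is a sketch that assumes the SDV single-table synthesizers were used, which the source does not confirm here. The mapping of "embedding size" to `embedding_dim` and the choice of 400 epochs (the lower bound of the reported range) are also assumptions.

```python
# Keyword arguments mirroring the listed hyperparameters.
# Assumption: "embedding size" corresponds to SDV's `embedding_dim`.
CTGAN_KWARGS = {
    "epochs": 400,  # reported range is 400-700; lower bound chosen here
    "batch_size": 500,
    "embedding_dim": 64,
    "generator_dim": (128, 128),
    "discriminator_dim": (128, 128),
}
TVAE_KWARGS = {
    "epochs": 400,  # reported range is 400-700; lower bound chosen here
    "batch_size": 500,
    "embedding_dim": 64,
}
GAUSSIAN_COPULA_KWARGS = {"default_distribution": "gaussian_kde"}


def build_synthesizers(metadata):
    """Instantiate the three synthesizers for a given SDV metadata object.

    The import is local so the configuration can be inspected without
    SDV installed; calling this function requires the `sdv` package.
    """
    from sdv.single_table import (
        CTGANSynthesizer,
        GaussianCopulaSynthesizer,
        TVAESynthesizer,
    )

    return {
        "CTGAN": CTGANSynthesizer(metadata, **CTGAN_KWARGS),
        "TVAE": TVAESynthesizer(metadata, **TVAE_KWARGS),
        "Gaussian Copula": GaussianCopulaSynthesizer(metadata, **GAUSSIAN_COPULA_KWARGS),
    }
```

Each returned synthesizer would then be fitted on the real non-temporal features and sampled to produce synthetic rows.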
References
- Li, G.; Jung, J.J. Deep learning for anomaly detection in multivariate time series: Approaches, applications, and challenges. Inf. Fusion 2023, 91, 93–102.
- Schmidl, S.; Wenig, P.; Papenbrock, T. Anomaly detection in time series: A comprehensive evaluation. Proc. VLDB Endow. 2022, 15, 1779–1797.
- Pokhrel, A.; Katta, V.; Colomo-Palacios, R. Digital twin for cybersecurity incident prediction: A multivocal literature review. In Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, Seoul, Republic of Korea, 27 June–19 July 2020; pp. 671–678.
- Yoon, J.; Jarrett, D.; Van der Schaar, M. Time-series generative adversarial networks. Adv. Neural Inf. Process. Syst. 2019, 32.
- Alzahrani, N.; Cała, J.; Missier, P. Experience: A comparative analysis of multivariate time-series generative models: A case study on human activity data. ACM J. Data Inf. Qual. 2024, 16, 18.
- Brophy, E.; Wang, Z.; She, Q.; Ward, T. Generative adversarial networks in time series: A survey and taxonomy. arXiv 2021, arXiv:2107.11098.
- Empl, P.; Koch, D.; Dietz, M.; Pernul, G. Digital twins in security operations: State of the art and future perspectives. ACM Comput. Surv. 2024, 58, 18.
- Engelen, G.; Rimmer, V.; Joosen, W. Troubleshooting an intrusion detection dataset: The CICIDS2017 case study. In Proceedings of the 2021 IEEE Security and Privacy Workshops (SPW), Piscataway, NJ, USA, 27–27 May 2021; IEEE: New York, NY, USA, 2021; pp. 7–12.
- Dalamagkas, C.; Radoglou-Grammatikis, P.; Bouzinis, P.; Papadopoulos, I.; Lagkas, T.; Argyriou, V.; Goudos, S.; Margounakis, D.; Fountoukidis, E.; Sarigiannidis, P. Federated detection of open charge point protocol 1.6 cyberattacks. arXiv 2025, arXiv:2502.01569.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27.
- Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
- Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling tabular data using conditional gan. Adv. Neural Inf. Process. Syst. 2019, 32.
- Nelsen, R.B. An Introduction to Copulas; Springer: Berlin/Heidelberg, Germany, 2006.
- Patki, N.; Wedge, R.; Veeramachaneni, K. The synthetic data vault. In Proceedings of the 2016 IEEE international conference on data science and advanced analytics (DSAA), Montreal, QC, Canada, 17–19 October 2016; IEEE: New York, NY, USA, 2016; pp. 399–410.
- Faleiro, R.; Pan, L.; Pokhrel, S.R.; Doss, R. Digital twin for cybersecurity: Towards enhancing cyber resilience. In Proceedings of the International Conference on Broadband Communications, Networks and Systems, Melbourne, Australia, 28–29 October 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 57–76.
- Homaei, M.; Mogollón-Gutiérrez, Ó.; Sancho, J.C.; Ávila, M.; Caro, A. A review of digital twins and their application in cybersecurity based on artificial intelligence. Artif. Intell. Rev. 2024, 57, 201.
- Epiphaniou, G.; Hammoudeh, M.; Yuan, H.; Maple, C.; Ani, U. Digital twins in cyber effects modelling of IoT/CPS points of low resilience. Simul. Model. Pract. Theory 2023, 125, 102744.
- Dietz, M.; Pernul, G. Unleashing the digital twin’s potential for ics security. IEEE Secur. Priv. 2020, 18, 20–27.
- Acquaah, Y.; Roy, K. Realistic synthetic dataset generation for cyber-physical systems: A performance evaluation. Discov. Appl. Sci. 2025, 7, 719.
- Gatta, F.; Giampaolo, F.; Prezioso, E.; Mei, G.; Cuomo, S.; Piccialli, F. Neural networks generative models for time series. J. King Saud-Univ.-Comput. Inf. Sci. 2022, 34, 7920–7939.
- Xu, S.; Marwah, M.; Arlitt, M.; Ramakrishnan, N. Stan: Synthetic network traffic generation with generative neural models. In Proceedings of the International Workshop on Deployable Machine Learning for Security Defense, Virtual Event, 15 August 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 3–29.
- Li, D.; Chen, D.; Jin, B.; Shi, L.; Goh, J.; Ng, S.K. MAD-GAN: Multivariate anomaly detection for time series data with generative adversarial networks. In Proceedings of the International Conference on Artificial Neural Networks, Munich, Germany, 17–19 September 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 703–716.
- Ammara, D.A.; Ding, J.; Tutschku, K. Synthetic Network Traffic Data Generation: A Comparative Study. arXiv 2024, arXiv:2410.16326.
- Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSP 2018, 1, 108–116.
- Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A.A. A detailed analysis of the KDD CUP 99 data set. In Proceedings of the 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, Ottawa, ON, Canada, 8–10 July 2009; IEEE: New York, NY, USA, 2009; pp. 1–6.
- Landauer, M.; Skopik, F.; Stojanović, B.; Flatscher, A.; Ullrich, T. A review of time-series analysis for cyber security analytics: From intrusion detection to attack prediction. Int. J. Inf. Secur. 2025, 24, 3.
- KATEA Blue Prints. Available online: https://katea.digital.tecnalia.dev/docs/hpc/overview/ (accessed on 23 September 2025).
- Dalamagkas, C.; Radoglou-Grammatikis, P.; Bouzinis, P.; Papadopoulos, I.; Lagkas, T.; Argyriou, V.; Sarigiannidis, P. Federated OCPP 1.6 Intrusion Detection Dataset. IEEE Dataport 2025.
- Yujian, L.; Bo, L. A normalized Levenshtein distance metric. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1091–1095.
- Navarro, G. A guided tour to approximate string matching. ACM Comput. Surv. (CSUR) 2001, 33, 31–88.
- Miletic, M.; Sariyar, M. Challenges of using synthetic data generation methods for tabular microdata. Appl. Sci. 2024, 14, 5975.
- DataCebo, Inc. Synthetic Data Metrics; DataCebo, Inc.: Boston, MA, USA, 2016.
- Esteban, C.; Hyland, S.L.; Rätsch, G. Real-valued (medical) time series generation with recurrent conditional gans. arXiv 2017, arXiv:1706.02633.
- Stenger, M.; Leppich, R.; Foster, I.; Kounev, S.; Bauer, A. Evaluation is key: A survey on evaluation measures for synthetic time series. J. Big Data 2024, 11, 66.


| Attack | Data Type | CTGAN OS | CTGAN CS | CTGAN CPT | TVAE OS | TVAE CS | TVAE CPT | Gaussian Copula OS | Gaussian Copula CS | Gaussian Copula CPT | TLS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Flooding Heartbeat | CICFlowMeter | 0.7620 | 0.6442 | 0.8799 | 0.8611 | 0.8017 | 0.9205 | 0.8082 | 0.7001 | 0.9164 | 0.8798 |
| Flooding Heartbeat | OCPPFlowMeter | 0.8000 | 0.8447 | 0.7554 | 0.8545 | 0.9080 | 0.8010 | 0.8856 | 0.9041 | 0.8670 | 0.9680 |
| Denial of Charge | CICFlowMeter | 0.7997 | 0.7301 | 0.8693 | 0.8518 | 0.7975 | 0.9061 | 0.8567 | 0.7741 | 0.9394 | 0.8936 |
| Denial of Charge | OCPPFlowMeter | 0.8171 | 0.8420 | 0.7922 | 0.8076 | 0.8811 | 0.7341 | 0.9136 | 0.9439 | 0.8832 | 0.8801 |
| Charging Profile Manipulation | CICFlowMeter | 0.8086 | 0.7239 | 0.8933 | 0.8604 | 0.7974 | 0.9234 | 0.8595 | 0.7727 | 0.9462 | 0.9442 |
| Charging Profile Manipulation | OCPPFlowMeter | 0.8556 | 0.8538 | 0.8573 | 0.8588 | 0.9035 | 0.8142 | 0.9030 | 0.9001 | 0.9059 | 0.5410 |
| Balanced | CICFlowMeter | 0.8382 | 0.7636 | 0.9127 | 0.8736 | 0.8270 | 0.9201 | 0.8646 | 0.7934 | 0.9358 | 0.2686 |
| Balanced | OCPPFlowMeter | 0.8654 | 0.8640 | 0.8669 | 0.8651 | 0.8904 | 0.8399 | 0.8910 | 0.8638 | 0.9183 | 0.2146 |
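The per-dataset choice of synthesizer can be reproduced from the OS columns of the table above. The sketch below averages OS over the four attack scenarios and picks the highest mean. Averaging across scenarios is an assumed selection rule, and `best_synthesizer` is a hypothetical helper name; the source only states which model performed best on each dataset.

```python
from statistics import mean

# Overall Score (OS) per scenario, in the order: Flooding Heartbeat,
# Denial of Charge, Charging Profile Manipulation, Balanced.
OS = {
    "CICFlowMeter": {
        "CTGAN": [0.7620, 0.7997, 0.8086, 0.8382],
        "TVAE": [0.8611, 0.8518, 0.8604, 0.8736],
        "Gaussian Copula": [0.8082, 0.8567, 0.8595, 0.8646],
    },
    "OCPPFlowMeter": {
        "CTGAN": [0.8000, 0.8171, 0.8556, 0.8654],
        "TVAE": [0.8545, 0.8076, 0.8588, 0.8651],
        "Gaussian Copula": [0.8856, 0.9136, 0.9030, 0.8910],
    },
}


def best_synthesizer(scores_by_model):
    """Return the model with the highest mean OS, plus all the means."""
    means = {model: mean(values) for model, values in scores_by_model.items()}
    return max(means, key=means.get), means


best = {}
for dataset, scores_by_model in OS.items():
    best[dataset], means = best_synthesizer(scores_by_model)
    rounded = {model: round(value, 4) for model, value in means.items()}
    print(dataset, "->", best[dataset], rounded)
```

Under this rule the selection matches the reported outcome: TVAE for CICFlowMeter and Gaussian Copula for OCPPFlowMeter, with CTGAN last on both.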
| Attack | Class Name | CTGAN OS | CTGAN CS | CTGAN CPT | TVAE OS | TVAE CS | TVAE CPT | Gaussian Copula OS | Gaussian Copula CS | Gaussian Copula CPT | Mean best OS | TLS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flooding Heartbeat | cyberattack_ocpp16_dos_flooding_heartbeat | 0.8836 | 0.8427 | 0.9245 | 0.9191 | 0.9100 | 0.9282 | 0.8968 | 0.8555 | 0.9382 | 0.8808 | 0.8934 |
| Flooding Heartbeat | normal | 0.7823 | 0.7723 | 0.7923 | 0.8303 | 0.7747 | 0.8860 | 0.8425 | 0.7644 | 0.9207 | | |
| Denial of Charge | cyberattack_ocpp16_doc_idtag | 0.8469 | 0.8116 | 0.8821 | 0.8943 | 0.9018 | 0.8869 | 0.8891 | 0.8452 | 0.9330 | 0.8747 | 0.9044 |
| Denial of Charge | normal | 0.7658 | 0.6835 | 0.8482 | 0.8413 | 0.7819 | 0.9006 | 0.8551 | 0.7691 | 0.9410 | | |
| Charging Profile Manipulation | cyberattack_ocpp16_fdi_chargingprofile | 0.8129 | 0.7499 | 0.8759 | 0.8848 | 0.8661 | 0.9034 | 0.8777 | 0.8128 | 0.9426 | 0.8700 | 0.9266 |
| Charging Profile Manipulation | normal | 0.7824 | 0.7010 | 0.8637 | 0.8552 | 0.7855 | 0.9250 | 0.8538 | 0.7627 | 0.9449 | | |
| Balanced | cyberattack_ocpp16_doc_idtag | 0.8390 | 0.8284 | 0.8497 | 0.8757 | 0.8983 | 0.8531 | 0.8862 | 0.8461 | 0.9263 | 0.8879 | 0.6340 |
| Balanced | cyberattack_ocpp16_dos_flooding_heartbeat | 0.8457 | 0.8290 | 0.8624 | 0.8753 | 0.8892 | 0.8614 | 0.8829 | 0.8346 | 0.9312 | | |
| Balanced | cyberattack_ocpp16_fdi_chargingprofile | 0.8020 | 0.7234 | 0.8807 | 0.8781 | 0.8603 | 0.8958 | 0.8741 | 0.8101 | 0.9382 | | |
| Balanced | cyberattack_ocpp16_unauthorized_access | 0.8815 | 0.9599 | 0.8032 | 0.8818 | 0.9662 | 0.7975 | 0.9542 | 0.9796 | 0.9289 | | |
| Balanced | normal | 0.7842 | 0.7200 | 0.8485 | 0.8197 | 0.7574 | 0.8820 | 0.8382 | 0.7498 | 0.9266 | | |
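The second-to-last column of the table above appears consistent with taking, for each class, the highest OS among the three synthesizers and averaging those maxima per attack scenario. This reading is inferred from the numbers rather than stated in the source, and `mean_best_os` is a hypothetical helper name. The sketch checks it for two scenarios.

```python
from statistics import mean

# OS per class as (CTGAN, TVAE, Gaussian Copula), copied from the table above.
PER_CLASS_OS = {
    "Flooding Heartbeat": {
        "cyberattack_ocpp16_dos_flooding_heartbeat": (0.8836, 0.9191, 0.8968),
        "normal": (0.7823, 0.8303, 0.8425),
    },
    "Denial of Charge": {
        "cyberattack_ocpp16_doc_idtag": (0.8469, 0.8943, 0.8891),
        "normal": (0.7658, 0.8413, 0.8551),
    },
}


def mean_best_os(per_class):
    """Average, over classes, the best OS achieved by any synthesizer."""
    return mean(max(scores) for scores in per_class.values())


results = {attack: mean_best_os(classes) for attack, classes in PER_CLASS_OS.items()}
for attack, value in results.items():
    print(f"{attack}: {value:.4f}")
```

If this reading is right, the column reflects what a per-class choice of synthesizer would achieve, which is in line with the adaptive selection strategy described in Section 4.1.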
| Attack | Class Name | CTGAN OS | CTGAN CS | CTGAN CPT | TVAE OS | TVAE CS | TVAE CPT | Gaussian Copula OS | Gaussian Copula CS | Gaussian Copula CPT | Mean best OS | TLS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flooding Heartbeat | cyberattack_ocpp16_dos_flooding_heartbeat | 0.8182 | 0.8564 | 0.7800 | 0.8396 | 0.8979 | 0.7812 | 0.9247 | 0.9344 | 0.9149 | 0.9183 | 0.9835 |
| Flooding Heartbeat | normal | 0.8541 | 0.8866 | 0.8216 | 0.8217 | 0.8930 | 0.7504 | 0.9119 | 0.9213 | 0.9025 | | |
| Denial of Charge | cyberattack_ocpp16_doc_idtag | 0.8147 | 0.8551 | 0.7742 | 0.7921 | 0.8979 | 0.6863 | 0.9363 | 0.9587 | 0.9139 | 0.9101 | 0.9324 |
| Denial of Charge | normal | 0.8024 | 0.8300 | 0.7747 | 0.7979 | 0.8552 | 0.7405 | 0.8839 | 0.8884 | 0.8793 | | |
| Charging Profile Manipulation | cyberattack_ocpp16_fdi_chargingprofile | 0.8161 | 0.8278 | 0.8044 | 0.8585 | 0.9052 | 0.8118 | 0.9037 | 0.8959 | 0.9116 | 0.8965 | 0.9199 |
| Charging Profile Manipulation | normal | 0.7654 | 0.7748 | 0.7559 | 0.8224 | 0.8827 | 0.7620 | 0.8894 | 0.8862 | 0.8927 | | |
| Balanced | cyberattack_ocpp16_doc_idtag | 0.7931 | 0.8576 | 0.7286 | 0.7747 | 0.8745 | 0.6750 | 0.9379 | 0.9566 | 0.9192 | 0.9152 | 0.7109 |
| Balanced | cyberattack_ocpp16_dos_flooding_heartbeat | 0.7939 | 0.8530 | 0.7349 | 0.7571 | 0.8915 | 0.6228 | 0.9005 | 0.9196 | 0.8814 | | |
| Balanced | cyberattack_ocpp16_fdi_chargingprofile | 0.8134 | 0.8367 | 0.7901 | 0.8347 | 0.8898 | 0.7796 | 0.9044 | 0.8967 | 0.9121 | | |
| Balanced | cyberattack_ocpp16_unauthorized_access | 0.9438 | 0.9890 | 0.8985 | 0.8475 | 0.9643 | 0.7307 | 0.9394 | 0.9903 | 0.8886 | | |
| Balanced | normal | 0.7799 | 0.7821 | 0.7778 | 0.8190 | 0.8656 | 0.7723 | 0.8894 | 0.8605 | 0.9183 | | |
| Attack | Data Type | Accuracy | Precision | Recall | F1-Score | 
|---|---|---|---|---|---|
| Flooding Heartbeat | CICFlowMeter | 0.9997 | 0.9999 | 0.9997 | 0.9998 | 
| Flooding Heartbeat | OCPPFlowMeter | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 
| DoCharge | CICFlowMeter | 0.9995 | 1.0000 | 0.9995 | 0.9998 | 
| DoCharge | OCPPFlowMeter | 0.9944 | 1.0000 | 0.9944 | 0.9972 | 
| Charging Profile Manipulation | CICFlowMeter | 0.9955 | 0.9955 | 0.9955 | 0.9954 | 
| Charging Profile Manipulation | OCPPFlowMeter | 0.9981 | 0.9981 | 0.9981 | 0.9981 | 
| Balanced | CICFlowMeter | 0.9900 | 0.9904 | 0.9900 | 0.9900 | 
| Balanced | OCPPFlowMeter | 0.9987 | 0.9987 | 0.9987 | 0.9987 | 
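The classification metrics in the table above can be computed as sketched below, given the true and predicted labels of a test set (in a TRTS setting, predictions on synthetic flows from a classifier trained on real flows). The Accuracy and Recall columns coincide in every row, which is what support-weighted averaging produces, but the use of weighted averaging is an assumption. The labels in the example are hypothetical toy data.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1."""
    total = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / total

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        p_l = tp / (tp + fp) if tp + fp else 0.0
        r_l = tp / count
        f_l = 2 * p_l * r_l / (p_l + r_l) if p_l + r_l else 0.0
        weight = count / total  # class share of the test set
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy labels, not taken from the datasets.
y_true = ["normal"] * 6 + ["attack"] * 4
y_pred = ["normal"] * 5 + ["attack"] * 4 + ["normal"]
metrics = weighted_scores(y_true, y_pred)
print({name: round(value, 4) for name, value in metrics.items()})
```

With support weighting, recall always equals accuracy, because each class's recall is weighted by its share of the test samples.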
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Iturbe, E.; Arcas, J.; Gaminde, G.; Rios, E.; Toledo, N. Hybrid AI-Based Framework for Generating Realistic Attack-Related Network Flow Data for Cybersecurity Digital Twins. Appl. Sci. 2025, 15, 11574. https://doi.org/10.3390/app152111574