A Hybrid Federated Learning Framework for Privacy-Preserving Near-Real-Time Intrusion Detection in IoT Environments
Abstract
1. Introduction
Related Works
2. Materials and Methods
2.1. Databricks
2.2. Google Cloud Storage
2.3. Fivetran
2.4. Flower Federated Learning Framework
2.5. Data, Preprocessing, and Integration with Hugging Face
2.6. Federated Learning Implementation Workflow
- Data Storage: The original dataset was uploaded to Google Cloud Storage.
- Data Ingestion: Fivetran was used to automate data transfer from Google Cloud Storage to Databricks, ensuring a streamlined data pipeline.
- Cluster Configuration: A Databricks cluster was provisioned, and appropriate computing resources (CPUs and GPUs) were selected based on workload requirements.
- FL Execution: Using Apache Spark and its machine learning library MLlib, we implemented an FL training process in which each node processed its local dataset without sharing raw data.
- Model Aggregation: After local training, updates were aggregated centrally to refine the global model while preserving data privacy.
- Evaluation: The final model was assessed to compare its performance against a traditionally centralized training approach.
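The model-aggregation step above follows the federated averaging (FedAvg) pattern: local parameter vectors are averaged, weighted by each node's data volume. A minimal, framework-independent sketch (the node weights and sample counts below are hypothetical, for illustration only):

```python
def fedavg(local_weights, sample_counts):
    """Weighted average of local model parameters (one list per node)."""
    total = sum(sample_counts)
    dim = len(local_weights[0])
    global_w = [0.0] * dim
    for w, n in zip(local_weights, sample_counts):
        share = n / total  # node's contribution weight
        for i in range(dim):
            global_w[i] += share * w[i]
    return global_w

# Three hypothetical nodes; the third holds twice as much data,
# so its parameters count twice as much in the global model.
w_global = fedavg([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                  sample_counts=[100, 100, 200])
print(w_global)  # [3.5, 4.5]
```

Only these parameter vectors cross the network; the raw local datasets never leave their nodes.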
2.7. Evaluation Parameters
3. Results
3.1. Logistic Regression
3.2. Support Vector Machines Using Stochastic Gradient Descent
3.3. Random Forest
- Local Training: Each node trains a subset of decision trees on its local data.
- Model Aggregation: The central server aggregates the trees from all nodes to create a global RF model.
- Model Distribution: The global model is sent back to the nodes for further training or inference.
- Bootstrap Sampling: Each node performs bootstrap sampling on its local data to create subsets for training individual trees, ensuring diversity among the trees.
- Feature Selection: At each split in a decision tree, a random subset of features is considered. This randomness helps reduce overfitting and improves generalization.
- Tree Construction: Based on the local data, each node constructs its trees using standard decision tree algorithms (e.g., CART).
- Communication Overhead: RF models can be large, especially when the number of trees is high. Transmitting these models between nodes and the central server can incur significant communication overhead. Compression techniques (e.g., quantization, sparsification) can mitigate this issue but may affect model accuracy.
- Tree Alignment: Aggregating trees from different nodes requires aligning their structures, which is computationally expensive. Mismatched tree structures can lead to suboptimal global models.
- Data Heterogeneity: In FL, data across nodes are often non-IID (non-independent and identically distributed). This heterogeneity can lead to biased or inconsistent trees, reducing the global model’s performance.
- Privacy Concerns: While FL preserves data privacy by not sharing raw data, model updates (e.g., tree structures) can still leak information. Techniques like differential privacy can mitigate this risk but may degrade model performance.
- Scalability: As the number of nodes increases, the complexity of aggregating and managing the global RF grows. Scalability becomes a significant concern, especially for large-scale deployments.
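The local-training and aggregation steps above can be sketched with scikit-learn. This is an illustration of plain tree aggregation (the server concatenates the trees from all nodes), not the paper's Spark/Flower implementation; the data is synthetic, and merging by extending `estimators_` assumes every node sees the same set of classes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def train_local_forest(X, y, n_trees, seed):
    # Each node trains its own forest on local data only.
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(X, y)
    return rf

def aggregate_forests(forests):
    # Tree aggregation: the global forest is the union of all local
    # trees; no raw data is exchanged, only fitted tree structures.
    global_rf = forests[0]
    for rf in forests[1:]:
        global_rf.estimators_ += rf.estimators_
    global_rf.n_estimators = len(global_rf.estimators_)
    return global_rf

# Synthetic stand-in data, split into 3 disjoint node-local shards.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
shards = [(X[i::3], y[i::3]) for i in range(3)]
local_models = [train_local_forest(Xs, ys, n_trees=10, seed=i)
                for i, (Xs, ys) in enumerate(shards)]
global_model = aggregate_forests(local_models)
print(global_model.n_estimators)  # 30
```

Note how this sketch also exhibits the communication-overhead issue listed above: the object shipped to the server grows linearly with the number of trees per node.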
3.4. Implementation Repository
4. Discussion
- Faster Training Times: Parallel processing across multiple nodes reduces the time required for model training.
- Scalability: The same code can be run on any number of virtual machines, allowing us to scale experiments effortlessly.
- Cost Efficiency: Databricks’ pay-as-you-go model ensures that we only pay for the resources we use, making it a cost-effective solution for large-scale experiments.
- Flexibility: The ability to modify only the configuration file (e.g., pyproject.toml) without changing the core code simplifies the process of scaling and adapting the system to different use cases.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
AI | Artificial intelligence |
DP | Differential privacy |
FLWR | Flower |
FL | Federated learning |
GCS | Google Cloud Storage |
GDPR | General Data Protection Regulation |
HE | Homomorphic encryption |
IoT | Internet of Things |
LR | Logistic regression |
RF | Random Forest |
SMPC | Secure multi-party computation |
SGD | Stochastic Gradient Descent |
SVM | Support Vector Machine |
Appendix A. Federated Frameworks
Appendix A.1. TensorFlow Federated (TFF)
Appendix A.2. PySyft
Appendix A.3. Federated AI Technology Enabler (FATE)
Appendix A.4. Flower (FLWR)
Appendix B. Federated Random Forest Aggregation Techniques
Appendix B.1. Tree Aggregation
Appendix B.2. Weighted Aggregation
Appendix B.3. Pruning-Based Aggregation
Appendix B.4. Federated Averaging for Random Forest
Appendix C. Federated Random Forest Pruning Techniques
Appendix C.1. Cost-Complexity Pruning
Appendix C.2. Error-Based Pruning
Appendix C.3. Feature Importance Pruning
Appendix D. Advanced Techniques for Federated Random Forest
Appendix D.1. Hybrid Federated Learning
Appendix D.2. Secure Multi-Party Computation (SMPC)
Appendix D.3. Distributed Pruning
Appendix D.4. Adaptive Tree Construction
Appendix E. Distributed Environment Software Setup
- Update pyproject.toml: This file must be modified to include the IP addresses of all client machines participating in the FL process. This ensures that the server can communicate with each client effectively.
- Cluster Configuration: In Databricks, a cluster must be configured with the appropriate number of worker nodes. Each node will act as a client, running its own instance of the training process.
- Resource Allocation: Ensure that each node has sufficient computational resources (CPU, GPU, and memory) to handle its share of the workload.
- Network Configuration: Verify that all nodes can communicate with each other and with the server without network bottlenecks.
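As a concrete illustration of the first step, a pyproject.toml fragment of the kind Flower reads might look like the following. The exact keys depend on the installed Flower version, and the federation names, placeholder IP address, and supernode count are assumptions for illustration, not the actual experimental configuration:

```toml
# Hypothetical Flower configuration; the table layout follows Flower's
# documented pyproject.toml structure, adapt to the installed version.
[tool.flwr.federations]
default = "databricks-cluster"

# Remote federation: point the client machines at the server.
# 192.0.2.10 is a documentation-range placeholder IP address.
[tool.flwr.federations.databricks-cluster]
address = "192.0.2.10:9093"
insecure = true

# Local-simulation alternative: run 25 supernodes on one machine,
# matching the supernode count used in the experiments.
[tool.flwr.federations.local-simulation]
options.num-supernodes = 25
```

Because only this file changes between a local simulation and a distributed Databricks run, the training code itself stays untouched when scaling out.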
References
- Morfino, V.; Rampone, S. Towards Near-Real-Time Intrusion Detection for IoT Devices using Supervised Learning and Apache Spark. Electronics 2020, 9, 444. [Google Scholar] [CrossRef]
- Konečný, J.; McMahan, H.B.; Yu, F.X.; Richtárik, P.; Suresh, A.T.; Bacon, D. Federated Learning: Strategies for Improving Communication Efficiency. arXiv 2016, arXiv:1610.05492. [Google Scholar]
- Hard, A.; Rao, K.; Mathews, R.; Beaufays, F.; Augenstein, S.; Eichner, H.; Kiddon, C.; Ramage, D. Federated Learning for Mobile Keyboard Prediction. arXiv 2018, arXiv:1811.03604. [Google Scholar]
- Nguyen, D.C.; Ding, M.; Pathirana, P.N.; Seneviratne, A.; Li, J.; Poor, H.V. Federated Learning for Internet of Things: A Comprehensive Survey. IEEE Commun. Surv. Tutor. 2021, 23, 1622–1658. [Google Scholar]
- Goddard, M. The EU general data protection regulation (GDPR): European regulation that has a global impact. Int. J. Mark. Res. 2017, 59, 703–705. [Google Scholar]
- Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 308–318. [Google Scholar]
- McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; Aguera y Arcas, B. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, 20–22 April 2017; Volume 54, pp. 1273–1282. [Google Scholar]
- Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. In Proceedings of the Machine Learning and Systems 2 (MLSys 2020), Austin, TX, USA, 2–4 March 2020. [Google Scholar]
- Yang, Q.; Liu, Y.; Chen, T.; Tong, Y. Federated Machine Learning: Concept and Applications. ACM Trans. Intell. Syst. Technol. 2019, 10, 1–19. [Google Scholar]
- Mirsky, Y.; Doitshman, T.; Elovici, Y.; Shabtai, A. Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection. In Proceedings of the Network and Distributed System Security Symposium (NDSS), San Diego, CA, USA, 18–21 February 2018; Available online: https://archive.ics.uci.edu/dataset/516/kitsune+network+attack+dataset (accessed on 1 March 2025).
- Li, Y.; Li, Z.; Li, M. A comprehensive survey on intrusion detection algorithms. Comput. Electr. Eng. 2025, 121, 109863. [Google Scholar] [CrossRef]
- Zhou, W.; Xia, C.; Wang, T.; Liang, X.; Lin, W.; Li, X.; Zhang, S. HIDIM: A novel framework of network intrusion detection for hierarchical dependency and class imbalance. Comput. Secur. 2025, 148, 104155. [Google Scholar] [CrossRef]
- Lin, W.; Xia, C.; Wang, T.; Zhao, Y.; Xi, L.; Zhang, S. Input and Output Matter: Malicious Traffic Detection with Explainability. IEEE Netw. 2024, 39, 259–267. [Google Scholar]
- Najafimehr, M.; Zarifzadeh, S.; Mostafavi, S. DDoS attacks and machine-learning-based detection methods: A survey and taxonomy. Eng. Rep. 2023, 5, e12697. [Google Scholar] [CrossRef]
- Alqudhaibi, A.; Albarrak, M.; Jagtap, S.; Williams, N.; Salonitis, K. Securing industry 4.0: Assessing cybersecurity challenges and proposing strategies for manufacturing management. Cyber Secur. Appl. 2025, 3, 100067. [Google Scholar] [CrossRef]
- Bebortta, S.; Barik, S.C.; Sahoo, L.K.; Mohapatra, S.S.; Kaiwartya, O.; Senapati, D. Hybrid Machine Learning Framework for Network Intrusion Detection in IoT-Based Environments. Lect. Notes Netw. Syst. 2024, 1, 573–585. [Google Scholar] [CrossRef]
- Konečný, J.; McMahan, B.; Ramage, D.; Richtárik, P. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv 2016, arXiv:1610.02527. [Google Scholar]
- Taha, K. Machine learning in biomedical and health big data: A comprehensive survey with empirical and experimental insights. J. Big Data 2025, 12, 61. [Google Scholar] [CrossRef]
- Hamdi, N. A hybrid learning technique for intrusion detection system for smart grid. Sustain. Comput. Inform. Syst. 2025, 46, 101102. [Google Scholar] [CrossRef]
- Cao, S.; Liu, S.; Yang, Y.; Du, W.; Zhan, Z.; Wang, D.; Zhang, W. A hybrid and efficient Federated Learning for privacy preservation in IoT devices. Ad Hoc Netw. 2025, 170, 103761. [Google Scholar] [CrossRef]
- Gu, Y.; Wang, J.; Zhao, S. HT-FL: Hybrid Training Federated Learning for Heterogeneous Edge-Based IoT Networks. IEEE Trans. Mob. Comput. 2025, 24, 2817–2831. [Google Scholar] [CrossRef]
- Albogami, N.N. Intelligent deep federated learning model for enhancing security in internet of things enabled edge computing environment. Sci. Rep. 2025, 15, 4041. [Google Scholar] [CrossRef]
- Databricks Data Intelligence Platform. Available online: https://www.databricks.com/ (accessed on 1 March 2025).
- Google Cloud Storage. Available online: https://cloud.google.com/storage?hl=en (accessed on 1 March 2025).
- Fivetran Data Integration. Available online: https://www.fivetran.com/learn/data-integration (accessed on 1 March 2025).
- Apache Spark. Available online: https://spark.apache.org/ (accessed on 1 March 2025).
- Flower Federated Learning Framework. Available online: https://flower.ai/ (accessed on 1 March 2025).
- Hugging Face. Available online: https://huggingface.co/ (accessed on 1 March 2025).
- Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression, 3rd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2013. [Google Scholar]
- Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28. [Google Scholar] [CrossRef]
- Bottou, L. Stochastic Gradient Learning in Neural Networks. Proc. Neuro-Nîmes 1991, 91, 12. [Google Scholar]
- Bodynek, M.; Leiser, F.; Thiebes, S.; Sunyaev, A. Applying Random Forests in Federated Learning: A Synthesis of Aggregation Techniques. In Proceedings of the Wirtschaftsinformatik, Paderborn, Germany, 18–21 September 2023; Volume 46. Available online: https://aisel.aisnet.org/wi2023/46 (accessed on 1 March 2025).
- Implementation Code Repository. Available online: https://github.com/n3pt7un/Federated-Learning-LR_RF (accessed on 26 March 2025).
- Bonawitz, K.; Ivanov, V.; Kreuter, B.; Marcedone, A.; McMahan, H.B.; Patel, S.; Ramage, D.; Segal, A.; Seth, K. Practical Secure Aggregation for Privacy-Preserving Machine Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS ’17), Dallas, TX, USA, 30 October–3 November 2017; pp. 1175–1191. [Google Scholar] [CrossRef]
- Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and Open Problems in Federated Learning. Found. Trends Mach. Learn. 2021, 14, 1–210. [Google Scholar]
- Phong, L.T.; Aono, Y.; Hayashi, T.; Wang, L.; Moriai, S. Privacy-Preserving Deep Learning via Additively Homomorphic Encryption. IEEE Trans. Inf. Forensics Secur. 2018, 13, 1333–1345. [Google Scholar] [CrossRef]
| Centralized Learning | Federated Learning |
|---|---|
| Trained on centralized data | Trained on distributed data |
| Data reside on the cloud or a centralized server | Data reside at the various nodes in the network |
| Training takes place primarily in the cloud | Training happens primarily at the edge |
| Nodes/edge devices share local data | Nodes/edge devices share a local version of the model |
| Cannot operate on heterogeneous data | Can operate on heterogeneous data |
| Low user data privacy | High user data privacy |
| Parameter | Setup |
|---|---|
| Classes | 2 |
| Features | 115 |
| Supernodes | 25 |
| Epochs | 5 |
| Penalty | L2 |
| Server rounds | 3 |
| Parameter | Setup |
|---|---|
| Classes | 2 |
| Features | 115 |
| Loss function | hinge |
| Penalty | L1 (Lasso), L2 |
| Learning rate | ‘optimal’ |
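The SVM setup in the table (hinge loss, L1/L2 penalty, 'optimal' learning-rate schedule) corresponds to a linear SVM trained by stochastic gradient descent. A minimal single-node sketch with scikit-learn, using synthetic data as a stand-in for the 115-feature, 2-class intrusion dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for the 115-feature binary intrusion data.
X, y = make_classification(n_samples=1000, n_features=115, random_state=0)

# Linear SVM via SGD, matching the table: hinge loss,
# L2 penalty, and the 'optimal' learning-rate schedule.
clf = SGDClassifier(loss="hinge", penalty="l2",
                    learning_rate="optimal", random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

In the federated setting, each supernode runs this local fit and only the coefficient vector `clf.coef_` is sent for aggregation.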
| Metric | Logistic Regression | SVM & SGD | RF |
|---|---|---|---|
| Accuracy | 0.9783 | 0.9746 | 1.0000 |
| Precision | 0.9586 | 0.9599 | 1.0000 |
| Recall | 0.9999 | 0.9641 | 1.0000 |
| F1-score | 0.9788 | 0.9620 | 1.0000 |
| False positive rate | 0.0432 | 0.0201 | 0.0000 |
| False negative rate | 0.0001 | 0.0359 | 0.0001 |
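All six metrics in the table derive from the binary confusion-matrix counts. A minimal sketch of the definitions, using hypothetical counts (not the experiment's actual confusion matrix):

```python
def detection_metrics(tp, fp, tn, fn):
    """Binary detection metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)               # false positive rate
    fnr = fn / (fn + tp)               # false negative rate
    return accuracy, precision, recall, f1, fpr, fnr

# Hypothetical counts, for illustration only.
acc, prec, rec, f1, fpr, fnr = detection_metrics(tp=90, fp=10, tn=95, fn=5)
print(round(acc, 3), round(fpr, 3))  # 0.925 0.095
```

For intrusion detection, the false negative rate is the operationally critical figure, since it counts attacks that pass undetected.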
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Rampone, G.; Ivaniv, T.; Rampone, S. A Hybrid Federated Learning Framework for Privacy-Preserving Near-Real-Time Intrusion Detection in IoT Environments. Electronics 2025, 14, 1430. https://doi.org/10.3390/electronics14071430