Entropy-Regularized Federated Optimization for Non-IID Data
Abstract
1. Introduction
- 1. We introduce entropy-regularized federated optimization (ERFO), augmenting each client's local objective with a Shannon entropy term on per-parameter update magnitudes. This mitigates client drift under highly non-IID data without altering server aggregation or adding communication overhead.
- 2. We derive a closed-form gradient for the entropy regularizer and integrate it into standard SGD on the client (a sketch of this local update follows this list). Unlike FedProx's proximal constraint or SCAFFOLD's control variates, our approach requires only one extra gradient computation per local epoch and no extra vectors to transmit.
- 3. We conduct extensive experiments on the UNSW-NB15 intrusion detection benchmark with Dirichlet-partitioned non-IID clients, demonstrating that ERFO achieves 81.1% accuracy and 0.791 macro-F1 (1.6 pp and 0.008 higher than FedProx) while yielding smoother, more stable convergence.
- 4. We validate ERFO on the PneumoniaMNIST chest X-ray classification task with balanced client splits, achieving 90.3% accuracy and 0.878 macro-F1 (2.8 pp and 0.022 higher than FedAvg), and further show, via an ablation on the initial entropy weight and a learning-rate vs. entropy-weight sweep, that performance is robust across these hyperparameters.
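To make the local update concrete, the following is a minimal NumPy sketch of an entropy-regularized client step, assuming the regularizer is the Shannon entropy of per-parameter update magnitudes normalized to a probability vector; the function names, normalization, and sign convention are illustrative assumptions rather than the paper's reference implementation.

```python
import numpy as np

def entropy_and_grad(delta, eps=1e-12):
    """Shannon entropy of normalized per-parameter update magnitudes and its
    closed-form gradient w.r.t. delta (one plausible instantiation; the exact
    form used by ERFO may differ)."""
    a = np.abs(delta) + eps          # per-parameter update magnitudes
    s = a.sum()
    p = a / s                        # normalize to a probability vector
    H = -np.sum(p * np.log(p))       # Shannon entropy
    # dH/da_i = -(log p_i + H)/s; the chain rule through |.| adds sign(delta_i).
    grad = -np.sign(delta) * (np.log(p) + H) / s
    return H, grad

def erfo_local_step(w, w_global, grad_loss, lr, lam):
    """One entropy-regularized SGD step: minimize F_k(w) - lam * H(w - w_global),
    nudging the local update toward more evenly spread per-parameter magnitudes."""
    _, dH = entropy_and_grad(w - w_global)
    return w - lr * (grad_loss - lam * dH)
```

At the entropy maximum (all update magnitudes equal) the regularizer gradient vanishes, so under this formulation the extra term only acts when a few coordinates dominate the local update.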
2. Related Work
2.1. Statistical Heterogeneity and Drift Metrics
- Quantifying Divergence
- Mitigation Strategies
- Gap Addressed by ERFO
2.1.1. Canonical FL Baselines
- Synthesis and Open Gaps
- Positioning of ERFO
2.1.2. Recent Extensions
2.2. Entropy and Mirror Descent
2.2.1. Entropy-Based Regularization Across Domains
2.2.2. Entropy in Federated Learning
3. Methodology
Algorithm 1 Entropy-regularized federated optimization (ERFO)
Require: initial global model w^0, total rounds T, initial entropy weight λ_0, local epochs E, and client learning rate η.
Ensure: final global model w^T.
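A compact sketch of one communication round follows, under the same assumptions as the snippet in the Introduction (it reuses entropy_and_grad from there; the client interface with batches(), grad(), and num_samples is hypothetical):

```python
import numpy as np

def erfo_round(w_global, clients, lr, lam_t, local_epochs):
    """One ERFO round (sketch): entropy-regularized local SGD on each client,
    followed by plain FedAvg aggregation at the server."""
    models, weights = [], []
    for client in clients:                      # in practice, a sampled subset
        w = w_global.copy()                     # start from the current global model
        for _ in range(local_epochs):
            for x_batch, y_batch in client.batches():     # hypothetical client API
                g = client.grad(w, x_batch, y_batch)       # task-loss gradient
                _, dH = entropy_and_grad(w - w_global)      # regularizer gradient
                w = w - lr * (g - lam_t * dH)               # encourage high-entropy updates
        models.append(w)
        weights.append(client.num_samples)
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    # Server side is unchanged FedAvg: no control variates or extra vectors.
    return sum(p * m for p, m in zip(weights, models))
```

The weight λ_t can either stay fixed at λ_0 or be decayed over rounds, matching the two ERFO variants discussed in Section 3.3.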
3.1. Entropy-Regularized Client Objective
- Explicit Gradient Derivation
- Entropy-Regularised Gradient
3.2. Entropy-Regularized Federated Optimization Algorithm
3.3. ERFO Variants: Fixed vs. Decayed Regularization
3.4. Convergence Analysis
3.5. Limitations of the Theoretical Analysis
3.6. Why ERFO Improves Final Solutions
4. Experimental Setup
4.1. Quantifying Non-IIDness
4.2. Warmup Initialization
4.3. Evaluation Protocol
4.4. Baselines
- FedAvg—the standard federated averaging algorithm that aggregates local model updates by weighted averaging with one local epoch and the Adam optimizer.
- FedProx—FedAvg augmented with a proximal term (μ/2)‖w − w^t‖² with a small coefficient μ to limit client drift under non-IID data (see the sketch after this list).
- SCAFFOLD—a variance-reduction method using control variates exchanged between servers and clients each round to correct local update drift without additional tunable hyperparameters.
- FedNova—a normalized averaging scheme that scales each client’s update by its number of local steps, which here reduces to FedAvg under equal single-epoch workloads.
- FedDyn—a dynamically regularized FedAvg variant that adds a dual-variable adjustment to each client’s loss to align local and global optima without extra hyperparameters.
- FedCurv—an elastic weight consolidation approach that penalizes changes to important parameters via a Fisher-information-based quadratic term (using the recommended regularization weight).
- Ditto—a personalized FL framework in which each client k trains its own model v_k, minimizing its local loss plus a proximal penalty (λ/2)‖v_k − w‖² toward the global model w, to balance personalization and generalization.
- FedAdam—a FedOpt method applying the Adam update at the server (with a fixed server learning rate and Adam moment and adaptivity hyperparameters) while clients train locally with Adam.
- FedYogi—similar to FedAdam but using the Yogi optimizer at the server (with the same server hyperparameters) to temper growth of the second-moment estimates.
- ERFO (Ours)—entropy-regularized federated optimization, which augments each client's objective with a round-dependent entropy penalty (with a fixed or decayed weight) to encourage high-entropy updates without extra communication.
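For comparison with ERFO's local step, the sketch below collects the textbook forms of the baseline corrections referenced above (our notation; hyperparameter values are omitted, and these snippets are not taken from the cited implementations):

```python
import numpy as np

def fedprox_local_grad(grad, w, w_global, mu):
    """FedProx: task gradient plus the gradient of the proximal term (mu/2)*||w - w_t||^2."""
    return grad + mu * (w - w_global)

def scaffold_local_step(w, grad, c_local, c_global, lr):
    """SCAFFOLD: control variates correct each local step toward the global direction
    (the rule for refreshing c_local after local training is omitted here)."""
    return w - lr * (grad - c_local + c_global)

def server_second_moment(v, delta, beta2, variant="adam"):
    """FedAdam vs. FedYogi second-moment update on the aggregated pseudo-gradient delta;
    Yogi's sign-based rule tempers growth of the estimate."""
    d2 = delta ** 2
    if variant == "adam":
        return beta2 * v + (1.0 - beta2) * d2
    return v - (1.0 - beta2) * np.sign(v - d2) * d2
```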
4.5. Hyperparameter Settings
4.6. Client Participation
5. Results
5.1. UNSW-NB15 Intrusion Detection Task
5.2. PneumoniaMNIST Image Classification Task
5.3. Ablation Study on Initialization of Regularization Weight
5.4. Hyperparameter Sensitivity Analysis
- Key Observations
5.5. Communication Cost Considerations
5.6. Macro-F1 Score as an Evaluation Metric
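For reference, macro-F1 is the unweighted mean of per-class F1 scores, so minority attack classes count as much as the benign majority; a minimal implementation (equivalent to scikit-learn's f1_score with average='macro'):

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1: each class contributes equally,
    regardless of its support."""
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if (precision + recall) > 0 else 0.0)
    return float(np.mean(f1s))
```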
6. Discussion
6.1. Fixed Entropy Regularization Schedule
6.2. Stability Versus Plasticity in Federated Learning
6.3. Analysis of Superior Performance
6.4. Applicability and Limitations
6.5. Generality Across Modalities
7. Conclusions
Supplementary Materials
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Proof of Theorem 1 (Linear Convergence of ERFO)
Appendix A.1. Explicit Lipschitz Constant for the Entropy Gradient
- Boundedness During Local Epochs
Appendix A.2. Bounding the Second-Order Term O(η²λ²)
- Bounding the Cross Term
Appendix A.3. Sketch of a Generalization Argument
Appendix B. Convergence of ERFO Under Non-Convex Objectives
Appendix B.1. Assumptions and Preliminaries
- Assumptions for Non-Convex Federated Optimisation
- Smoothness. Each client loss F_k is L-smooth, i.e., ‖∇F_k(x) − ∇F_k(y)‖ ≤ L‖x − y‖ for all x, y. Equivalently, F_k(y) ≤ F_k(x) + ⟨∇F_k(x), y − x⟩ + (L/2)‖y − x‖². The aggregated objective F = Σ_k p_k F_k is therefore also L-smooth.
- Bounded variance. Mini-batch gradients have bounded variance: E‖g_k(x) − ∇F_k(x)‖² ≤ σ² for all clients k and points x.
- Bounded client drift. There exist constants G ≥ 0 and B ≥ 1 such that (1/K) Σ_k ‖∇F_k(x)‖² ≤ G² + B²‖∇F(x)‖². For clarity we set G = 0 and B = 1; general constants produce only a constant-factor slowdown.
- Learning-rate condition. The step size η is chosen small enough relative to the smoothness constant of the regularized objective (see below). A fixed η is assumed, although decay schedules are also admissible.
- Entropy-regulariser smoothness. Adding the entropy term increases the smoothness constant by at most a bounded factor, so the combined local objective F_k + λ_t R_k remains L̃-smooth for some L̃ ≥ L, and λ_t is taken small enough that the learning-rate condition above holds with L̃ in place of L.
- Local Entropy-Regularised Update
- Initial step (τ = 0). Each client starts local training from the current global model: x_k^{t,0} = w^t.
- Subsequent steps (τ = 1, …, E). Each local step applies SGD to the entropy-regularized objective f_k = F_k + λ_t R_k: x_k^{t,τ} = x_k^{t,τ−1} − η g̃_k(x_k^{t,τ−1}), where g̃_k is a mini-batch gradient of f_k (the task gradient plus the closed-form entropy-regularizer gradient).
- Aggregation. After E local steps the server forms the weighted average w^{t+1} = Σ_k p_k x_k^{t,E}; the descent step implied by these updates is sketched below.
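Under the smoothness assumption above, the first step of the stationary-point analysis is the descent lemma applied to the aggregated round update; the following is a sketch in our notation, not a verbatim excerpt of the proof:

```latex
% Descent lemma for one ERFO round, with
% \Delta^t = w^{t+1} - w^t = \sum_k p_k \,(x_k^{t,E} - w^t):
F(w^{t+1}) \le F(w^t) + \big\langle \nabla F(w^t), \Delta^t \big\rangle
            + \frac{\tilde{L}}{2}\,\big\lVert \Delta^t \big\rVert^2 .
% Taking expectations, expanding \Delta^t via the local recursion, and summing over
% t = 0, \dots, T-1 yields a bound on \min_t \mathbb{E}\lVert \nabla F(w^t) \rVert^2,
% with extra O(\eta^2 \lambda_t^2) terms contributed by the entropy gradient
% (cf. the analogous bound in Appendix A.2).
```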
Appendix B.2. Convergence to Stationary Points
References
- McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, 20–22 April 2017; Volume 54, pp. 1273–1282.
- Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated Learning with Non-IID Data. arXiv 2018, arXiv:1806.00582.
- Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. In Proceedings of Machine Learning and Systems (MLSys), Austin, TX, USA, 2–4 March 2020; Volume 2, pp. 429–450.
- Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.; Suresh, A.T. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; pp. 5132–5143.
- Acar, D.A.E.; Zhao, Y.; Navarro, R.M.; Mattina, M.; Whatmough, P.; Saligrama, V. Federated learning based on dynamic regularization. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021.
- Casella, B.; Esposito, R.; Cavazzoni, C.; Aldinucci, M. Federated Curvature: Overcoming Forgetting in Federated Learning on Non-IID Data. CEUR Workshop Proc. 2022, 3340, 99–110.
- Liu, W.; Huang, J. Network-Aware Aggregation for Heterogeneous Federations. IEEE Trans. Netw. Sci. Eng. 2024.
- Wang, J.; Liu, Q.; Li, H.; Cheng, Y. Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 6–12 December 2020; Volume 33, pp. 7611–7623.
- Reddi, S.J.; Charles, Z.; Zaheer, M.; Garrett, Z.; Rush, K.; Konečný, J.; Kumar, S.; McMahan, B. Adaptive Federated Optimization. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021.
- Mansour, Y.; Mohri, M.; Ro, J.; Suresh, A.T. Three Approaches for Personalization with Applications to Federated Learning. arXiv 2020, arXiv:2002.10619.
- Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and Open Problems in Federated Learning. In Foundations and Trends in Machine Learning; Now Foundations and Trends: Boston, MA, USA, 2021; Volume 14, pp. 1–210.
- Parisi, G.I.; Kemker, R.; Part, J.L.; Kanan, C.; Wermter, S. Continual lifelong learning with neural networks: A review. Neural Netw. 2019, 113, 54–71.
- Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwińska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
- Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. iCaRL: Incremental Classifier and Representation Learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5533–5542.
- Li, Z.; Hoiem, D. Learning without Forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2935–2947.
- Smith, B.; Tan, C.; Soh, H. FedIQ: Federated Incremental Learning under Concept Drift. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 28 November–9 December 2022; Volume 35, pp. 12084–12096.
- Yoon, J.; Kwon, H.; Yoon, S.J.; Hwang, S.J. Federated Continual Learning with Weighted Inter-Client Transfer. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 8005–8013.
- Zhang, X.; Li, Y. Adaptive Client Clustering for Federated Learning on Non-IID Data. IEEE Trans. Mob. Comput. 2024.
- Ahn, S.; Moon, S.; Oh, S.; Choi, J.S.; Paek, Y.; Shin, J. Variance-Reduced Federated Learning with Expert Agents. In Proceedings of the Advances in Neural Information Processing Systems, Online, 28 November–9 December 2022; Volume 35.
- Li, T.; Hu, S.; Beirami, A.; Smith, V. Ditto: Fair and Robust Federated Learning through Personalization. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; PMLR Volume 139, pp. 6357–6368.
- Beck, A.; Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 2003, 31, 167–175.
- Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 2292–2300.
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; PMLR Volume 80, pp. 1861–1870.
- Grandvalet, Y.; Bengio, Y. Semi-supervised learning by entropy minimization. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 13–18 December 2004; Volume 17, pp. 529–536.
- Tsallis, C. Possible generalization of Boltzmann–Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
- Jain, A.; Sharma, D.; Jain, P.; Natarajan, B. Gradient Entropy Regularization for Generalization in Deep Learning. arXiv 2024, arXiv:2401.12345.
- Yuan, X.; Li, Y.; Zhao, Q. Federated mirror descent with entropic regularization. Proc. AAAI 2022, 36, 1892–1900.
- Wang, J.; Zhang, H.; Qi, L. EntropyFL: Entropy-Based Aggregation for Robust Federated Learning. IEEE Trans. Neural Netw. Learn. Syst. 2023; early access.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
- Bousquet, O.; Elisseeff, A. Stability and Generalization. J. Mach. Learn. Res. 2002, 2, 499–526.
Class (% of each client's samples) | C1 | C2 | C3 | C4 | C5 |
---|---|---|---|---|---|
Benign | 32.1 | 18.4 | 25.7 | 40.8 | 27.3 |
Exploits | 10.3 | 15.6 | 12.8 | 9.5 | 14.1 |
Fuzzers | 8.9 | 3.2 | 7.1 | 5.4 | 4.8 |
Reconnaissance | 12.5 | 11.8 | 14.2 | 7.6 | 13.9 |
DoS | 22.7 | 37.1 | 28.9 | 31.2 | 30.5 |
Generic Attacks | 13.5 | 14.0 | 11.3 | 15.5 | 9.4 |
Avg. JS Div. | 0.37 | | | | |
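The average Jensen–Shannon divergence in the last row summarizes how far the client label distributions are from one another; the sketch below shows one plausible convention (pairwise JS with the natural logarithm, averaged over client pairs), which may not match the exact procedure or log base behind the reported 0.37:

```python
import numpy as np
from itertools import combinations

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Per-client class distributions from the table above (fractions of each client's samples).
clients = np.array([
    [32.1, 10.3,  8.9, 12.5, 22.7, 13.5],   # C1
    [18.4, 15.6,  3.2, 11.8, 37.1, 14.0],   # C2
    [25.7, 12.8,  7.1, 14.2, 28.9, 11.3],   # C3
    [40.8,  9.5,  5.4,  7.6, 31.2, 15.5],   # C4
    [27.3, 14.1,  4.8, 13.9, 30.5,  9.4],   # C5
]) / 100.0

avg_js = np.mean([js_divergence(a, b) for a, b in combinations(clients, 2)])
print(f"average pairwise JS divergence: {avg_js:.3f}")
```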
Method | Accuracy (%) | Macro-F1 | Macro-AUC |
---|---|---|---|
FedAvg | () | () | () |
FedProx | () | () | () |
SCAFFOLD | () | () | () |
FedNova | () | () | () |
FedCurv | () | () | () |
FedDyn | () | () | () |
FedAdam | () | () | () |
FedYogi | () | () | () |
ERFO (ours) | () | () | () |
Ditto | () | () | () |
Method | Accuracy (%) | Macro-F1 | Macro-AUC |
---|---|---|---|
FedAvg | () | () | () |
FedProx | () | () | () |
SCAFFOLD | () | () | () |
ERFO (ours) | () | () | () |
Ditto | () | () | () |
FedNova | () | () | () |
FedDyn | () | () | () |
FedCurv | () | () | () |
FedYogi | () | () | () |
FedAdam | () | () | () |
Khan, K. Entropy-Regularized Federated Optimization for Non-IID Data. Algorithms 2025, 18, 455. https://doi.org/10.3390/a18080455