Article

Entropy-Regularized Federated Optimization for Non-IID Data

Department of Computing and Information Technology, Faculty of Science and Technology, The University of the West Indies, St. Augustine Campus, St Augustine 311126, Trinidad and Tobago
Algorithms 2025, 18(8), 455; https://doi.org/10.3390/a18080455
Submission received: 16 June 2025 / Revised: 12 July 2025 / Accepted: 15 July 2025 / Published: 22 July 2025
(This article belongs to the Special Issue Advances in Parallel and Distributed AI Computing)

Abstract

Federated learning (FL) struggles under non-IID client data when local models drift toward conflicting optima, impairing global convergence and performance. We introduce entropy-regularized federated optimization (ERFO), a lightweight client-side modification that augments each local objective with a Shannon entropy penalty on the per-parameter update distribution. ERFO requires no additional communication, adds a single scalar hyperparameter $\lambda$, and integrates seamlessly into any FedAvg-style training loop. We derive a closed-form gradient for the entropy regularizer and provide convergence guarantees: under $\mu$-strong convexity and $L$-smoothness, ERFO achieves the same $O(1/T)$ (or linear) rates as FedAvg (with only an $O(\lambda)$ bias for fixed $\lambda$ and exact convergence when $\lambda_t \to 0$); in the non-convex case, we prove stationary-point convergence at $O(1/\sqrt{T})$. Empirically, on five-client non-IID splits of the UNSW-NB15 intrusion-detection dataset, ERFO yields a +1.6 pp gain in accuracy and +0.008 in macro-F1 over FedAvg with markedly smoother dynamics. On a three-of-five split of PneumoniaMNIST, a fixed $\lambda$ matches or exceeds FedAvg, FedProx, and SCAFFOLD—achieving 90.3% accuracy and 0.878 macro-F1—while preserving rapid, stable learning. ERFO's gradient-only design is model-agnostic, making it broadly applicable across tasks.

1. Introduction

Federated learning (FL) [1] has emerged as a powerful paradigm for training machine-learning models across distributed data sources (clients) under privacy constraints. In FL, a central server coordinates a global model that is updated by aggregating model-weight updates from clients, each of which trains on its local data that remain decentralized. This allows learning from data that cannot be centrally pooled (because of privacy, ownership, or bandwidth issues). However, FL introduces new challenges compared with traditional centralized learning. A primary issue is statistical heterogeneity: data across clients are often not independent and identically distributed (non-IID) [2]. Clients may have vastly different data distributions, leading to diverging local model updates (client drift) that degrade the convergence and accuracy of the aggregated global model. Standard federated averaging (FedAvg) [1] can struggle under severe data heterogeneity or when tasks evolve over time. Various strategies have been proposed to make FL more robust to non-IID data. FedProx [3] adds a proximal term to the local objective, restricting how far a client’s local model can deviate from the global model, thereby improving stability on heterogeneous data. SCAFFOLD [4] introduces control variates (correction vectors) to each client’s updates to reduce drift at the cost of extra communication. These methods mitigate client drift but can impose constraints on plasticity (the ability to adapt to new data) or incur additional overhead. Recent algorithms such as FedDyn [5] dynamically regularize local objectives toward an adjusted global model to prevent drift, and FedCurv [6,7] penalizes changes to important weights (analogous to elastic-weight consolidation in FL) to reduce forgetting. Other works address objective inconsistencies via normalization or adaptive steps (e.g., FedNova [8] and FedOpt [9]) or pursue personalized FL [10,11], where each client’s model is partially specialized. Despite these advances, there remains a need for simple, effective mechanisms that handle both non-IID data and continual adaptation in FL without significant communication or computational overhead.
Meanwhile, techniques from continual learning (CL) have tackled the stability–plasticity dilemma in sequential training of neural networks [12]. The stability–plasticity dilemma refers to the trade-off between retaining past knowledge (stability) and incorporating new information (plasticity) [12]. Regularization-based CL methods such as Elastic Weight Consolidation (EWC) [13] penalize changes to important weights to prevent catastrophic forgetting (favoring stability), whereas replay-based methods like iCaRL [14] and knowledge-distillation approaches such as Learning without Forgetting (LwF) [15] preserve prior knowledge by other means. However, directly applying such techniques in FL is challenging because of decentralized data and limited communication. Federated continual-learning variants have begun to emerge [16,17], but they often add significant complexity or require storing additional data or models.
Despite these advances, statistical heterogeneity in federated settings remains a fundamental obstacle. When clients hold highly skewed or non-IID data, local updates can “drift” in conflicting directions, slowing convergence and degrading global accuracy. Classical FedAvg exhibits significant performance drops under extreme distribution skews, and extensions such as FedProx and SCAFFOLD only partially mitigate drift at the cost of extra hyperparameters, additional communication rounds, or storage overhead. Robust aggregation approaches likewise assume bounded or adversarial gradient models that may not hold in many real-world deployments. Our motivation is to remedy these limitations with an entropy-based regularizer that encourages each client’s update to spread its “mass” more uniformly across parameters. By maximizing the entropy of the update vector, we promote diverse yet compatible local adjustments that align better with the global objective—without introducing substantial communication or computational overhead.
In this work, we propose a novel approach to address these issues by incorporating an entropy-based regularization into the federated optimization process. Our method, called entropy-regularized federated optimization (ERFO), augments the local training objective on each client with a term that encourages high-entropy model updates. Intuitively, instead of allowing a client’s update to be overly concentrated on a small subset of model parameters (which can cause the global model to latch onto niche local patterns and potentially forget broader knowledge), we encourage the update to be spread more broadly across the model’s parameters. By penalizing very low-entropy (highly peaked) update distributions, we nudge each client to make more distributed changes that better align with the global objective and other clients’ updates.
The main contributions of this work are as follows:
1. We introduce entropy-regularized federated optimization (ERFO), augmenting each client's local objective with a Shannon entropy term on per-parameter update magnitudes. This mitigates client drift under highly non-IID data without altering server aggregation or adding communication overhead.
2. We derive a closed-form gradient for the entropy regularizer and integrate it into standard SGD on the client. Unlike FedProx's proximal constraint or SCAFFOLD's control variates, our approach requires only one extra gradient computation per local epoch and no extra vectors to transmit.
3. We conduct extensive experiments on the UNSW-NB15 intrusion detection benchmark with Dirichlet-partitioned non-IID clients, demonstrating that ERFO achieves 81.1% accuracy and 0.791 macro-F1—1.6 pp and 0.008 higher than FedProx—while yielding smoother, more stable convergence.
4. We validate ERFO on the PneumoniaMNIST chest X-ray classification task with balanced client splits, achieving 90.3% accuracy and 0.878 macro-F1—2.8 pp and 0.022 higher than FedAvg—and further show via an ablation on the initial entropy weight and a learning-rate vs. entropy-weight sweep that performance is robust across these hyperparameters.
The remainder of this paper is organized as follows: Section 2 reviews related work in federated optimization under non-IID settings, as well as entropy-based regularization methods in machine learning. Section 3 presents the ERFO methodology, including the definition of the entropy regularizer, its gradient derivation, and the modified federated training procedure. Section 4 describes the experimental setup, datasets (network intrusion and medical imaging), partitioning scheme, and evaluation metrics. Section 5 reports the empirical results, comparing ERFO with baseline methods and analyzing convergence behavior. Section 6 discusses the insights from the experiments, including the effectiveness of a fixed regularization schedule. Finally, Section 7 concludes the paper with discussions on generalizability, limitations, and future work.

2. Related Work

2.1. Statistical Heterogeneity and Drift Metrics

Federated learning (FL) seeks to minimize the global objective
$$F(w) = \sum_{i=1}^{N} \frac{n_i}{n} F_i(w), \qquad (1)$$
where $F_i$ is the empirical loss on client $i$ and $n = \sum_i n_i$. When the underlying data distributions $\mathcal{D}_i$ differ—statistical heterogeneity—the local optima $w_i^*$ (here $w_i^*$ denotes the minimiser of $F_i$ on client $i$; it is not introduced in the equation above, so the superscript $*$ is spelled out for clarity) need not coincide, causing client drift that slows—or even stalls—convergence.
  • Quantifying Divergence
Divergence can be measured in several complementary ways. Model-parameter distance $\|w_i - w\|$ was shown in early empirical studies [1] and confirmed in later work [2,3] to correlate with slower FedAvg convergence when the gap grows. Gradient dissimilarity $\|\nabla F_i(w) - \nabla F(w)\|$ features in upper-bound analyses of stochastic optimisation and underpins the convergence proofs of SCAFFOLD and FedNova. More recently, distributional distances—for example, the Kullback–Leibler divergence or the 2-Wasserstein metric between empirical label histograms—have guided client clustering and hierarchical aggregation: Zhang et al. rely on the 2-Wasserstein distance [18], whereas Ahn et al. employ KL-regularised objectives to tighten generalisation bounds under covariate shift [19]. Finally, information-theoretic bounds based on entropy have been proposed to upper-bound the expected loss under non-IID sampling, although they require exchanging per-round entropy estimates.
  • Mitigation Strategies
Divergence can be mitigated by three main defence classes. Proximal regularisers (e.g., FedProx, FedDyn) tether each client’s local solution to the current global iterate, limiting deviation. Variance-reduction schemes (e.g., SCAFFOLD, FedNova) introduce control variates or normalisation factors that offset client drift. Finally, distribution-aware scheduling clusters clients or re-weights their updates according to measured heterogeneity. All three approaches curb divergence but at the cost of additional hyper-parameters, auxiliary vectors, or communication overhead.
  • Gap Addressed by ERFO
ERFO departs from ad hoc drift proxies by deriving a single-scalar entropy regularizer from a pigeonhole-theoretic bound on KL divergence. Unlike KL or gradient-norm trackers, the entropy term is computed locally with one extra back-propagation and piggybacks on the standard weight update—no additional metadata or parameter tuning—thereby offering a principled, communication-neutral remedy for statistical heterogeneity.

2.1.1. Canonical FL Baselines

Canonical FL baselines serve as yardsticks for assessing new federated-learning methods. We compare our approach to eight widely used algorithms: FedAvg [1], FedProx [3], SCAFFOLD [4], FedNova [8], FedCurv [6], FedDyn [5], the adaptive-server optimisers FedAdam and FedYogi [9], and the personalised scheme Ditto [20]. These baselines differ mainly in how they mitigate client heterogeneity and communication costs, yet all share the goal of learning a global model without centralising data. Concise descriptions of each method are provided in Section 4.
  • Synthesis and Open Gaps
The above baselines fall into three camps: (i) variance-reduction methods (SCAFFOLD and FedNova) that curb gradient noise but increase communication load; (ii) regularisation-based defences (FedProx, FedDyn, and FedCurv) that tether clients to a reference point yet add tunable hyper-parameters or additional client state; and (iii) server-side accelerators and personalisation layers (FedAdam, FedYogi, and Ditto) that hasten convergence or improve local fit while leaving non-IID robustness largely implicit. None of these families supplies an information-theoretic bound linking update diversity to generalisation, and several require per-client metadata that scales linearly with the federation size.
  • Positioning of ERFO
ERFO complements existing baselines in three ways: (i) it adds a single-scalar entropy regulariser that provably upper-bounds the Kullback–Leibler divergence between each round's aggregated update distribution and its pigeonhole-theoretic optimum; (ii) it adds no auxiliary vectors, dual variables, or per-client models; and (iii) it slots into vanilla FedAvg with one extra gradient computation per local epoch. As such, it addresses the open gap of theory-grounded drift control without increasing communication or storage overhead, positioning ERFO as a lightweight yet principled alternative to current state-of-the-art methods.

2.1.2. Recent Extensions

Beyond these canonical baselines, recent extensions report notable gains, yet each relies on auxiliary metadata or per-client state—distribution fingerprints, link-quality statistics, or task-specific adapters—that must be stored or exchanged alongside model parameters. Such side information grows with the client population and can become unreliable under churn. By contrast, ERFO introduces a single-scalar entropy regularizer that is computed locally and aggregated exactly like the weights, adding no extra communication, storage, or bookkeeping beyond standard federated updates while still providing provable bounds on non-IID divergence.

2.2. Entropy and Mirror Descent

2.2.1. Entropy-Based Regularization Across Domains

Entropy regularization is a workhorse in optimization because it simultaneously smooths objectives, convexifies otherwise intractable problems, and promotes principled exploration. In convex optimization, entropic mirror descent treats the Kullback–Leibler divergence as its Bregman distance and updates the iterate via $x_t = \arg\min_x \big\{ \langle g_t, x \rangle + \tfrac{1}{\eta} D_{\mathrm{KL}}(x \,\|\, x_{t-1}) \big\}$, achieving $O(\sqrt{T})$ regret under mild constraints [21]. In computational transport, the Sinkhorn algorithm adds a negative Shannon-entropy term to the cost matrix, turning the earth-mover problem into a strictly convex program solvable by fast row–column normalization [22]. In reinforcement learning, Soft Actor–Critic augments the reward with policy entropy so that the stochastic policy avoids premature collapse onto sub-optimal actions [23]; conversely, semi-supervised classifiers minimize output entropy to obtain confident pseudo-labels without ground-truth annotations [24]. Beyond Shannon entropy, Tsallis-entropy families introduce a temperature parameter $q$ that yields heavy-tailed smoothing suited to power-law data [25]. At the parameter-optimization level, gradient-entropy penalties have been proposed to flatten sharp minima and thus improve generalization in deep networks [26].
Across these settings the entropy coefficient is typically hand-tuned, and none of the methods supplies an information-theoretic bound that ties entropy to client-side distribution shift. ERFO bridges this gap by deriving a single-scalar entropy term from a pigeonhole-theoretic bound on KL divergence; the coefficient is data-dependent yet parameter-free and integrates into standard FedAvg updates without extra metadata or communication overhead.

2.2.2. Entropy in Federated Learning

Integrating entropy-based objectives into federated optimization is still uncommon. Yuan et al. [27] embed an entropic proximal term into mirror descent, but the formulation introduces dual variables that must be stored and transmitted for every client, increasing communication volume as the federation scales. Wang et al. [28] treat per-client entropy as a proxy for model uncertainty; their server therefore requests and aggregates additional entropy scores each round, doubling the message size and incurring latency on low-bandwidth links.
ERFO takes a lighter approach: it maximizes the entropy of the update distribution through a single-scalar regularizer that is evaluated locally during back-propagation. Consequently, each client sends only its standard weight delta—no dual variables, uncertainty vectors, or side metadata—so the protocol remains bandwidth-neutral while still providing a principled, information-theoretic control on non-IID drift.

3. Methodology

This section introduces our entropy-regularized federated optimization (ERFO) approach. We first define the entropy-regularized client objective, deriving the entropy term $H(p)$ and its gradient $g_H$ and showing how this regularization modifies local gradients. We then present the overall ERFO procedure in Algorithm 1. Two variants of ERFO are described—one with a fixed regularization weight and another with a decaying schedule—corresponding to different experimental settings (PneumoniaMNIST and UNSW-NB15, respectively). Finally, we provide a formal convergence analysis for ERFO: first for the fixed-$\lambda$ case under standard convexity assumptions, and then an extended theorem for the decaying-$\lambda_t$ variant, discussing the convergence implications of the diminishing regularization.
Algorithm 1 Entropy-regularized federated optimization (ERFO)
Require: initial global model $w^0$, total rounds $T$, initial entropy weight $\lambda_0$, local epochs $E$, and client learning rate $\eta$
Ensure: final global model $w^T$
1: for $t = 1$ to $T$ do
2:     Server: set
$$\lambda_t = \begin{cases} \lambda_0, & \text{(ERFO-Fixed)}, \\ \lambda_0 \left( 1 - \frac{t}{T} \right), & \text{(ERFO-Decayed)}. \end{cases}$$
3:     Server selects $S_t$ (e.g., all clients or a random subset)
4:     for each client $i \in S_t$ in parallel do
5:         $w_i \leftarrow w^{t-1}$    // initialize local model from global
6:         for $e = 1$ to $E$ do
7:             compute $g_H = \nabla_w H\big(p(w_i, w^{t-1})\big)$ using (5)
8:             for each mini-batch $(X_b, y_b)$ do
9:                 compute $\nabla_w F_i(w_i; X_b, y_b)$
10:                $w_i \leftarrow w_i - \eta \big( \nabla_w F_i(w_i; X_b, y_b) - \lambda_t\, g_H \big)$
11:            end for
12:        end for
13:        send the model difference $w_i - w^{t-1}$ to the server
14:    end for
15:    Server: aggregate
$$w^t \leftarrow w^{t-1} + \frac{1}{\sum_{i \in S_t} n_i} \sum_{i \in S_t} n_i \big( w_i - w^{t-1} \big),$$
where $n_i$ is client $i$'s sample count.
16: end for
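To make the procedure concrete, the following is a minimal NumPy sketch of Algorithm 1 on toy quadratic client losses $F_i(w) = \frac{1}{2}\|w - c_i\|^2$. The toy losses, client sizes, and function names are illustrative assumptions rather than our actual implementation (which trains PyTorch models; see Section 4); only the entropy gradient of Equation (5), the local update of line 10, and the aggregation of line 15 are taken from the algorithm itself.

```python
import numpy as np

def entropy_grad(w, w_ref, eps=1e-12):
    # Closed-form gradient of H(p(w, w_ref)) from Equation (5):
    # dH/dw_j = sign(w_j - w_ref_j)/Z * (sum_k p_k ln p_k - ln p_j)
    delta = np.abs(w - w_ref)
    Z = max(delta.sum(), eps)
    p = np.clip(delta / Z, eps, None)
    return np.sign(w - w_ref) / Z * ((p * np.log(p)).sum() - np.log(p))

def erfo_round(w_global, clients, lam_t, eta=0.1, local_epochs=5):
    """One communication round on toy quadratic losses F_i(w) = 0.5*||w - c_i||^2."""
    updates, sizes = [], []
    for c_i, n_i in clients:                       # lines 4-14: all clients participate
        w = w_global.copy()                        # line 5: initialize local from global
        for _ in range(local_epochs):              # lines 6-12
            g_H = entropy_grad(w, w_global)        # line 7: recomputed each local epoch
            grad_F = w - c_i                       # line 9: gradient of the toy local loss
            w = w - eta * (grad_F - lam_t * g_H)   # line 10: entropy term spreads the update
        updates.append(w - w_global)               # line 13: model difference
        sizes.append(n_i)
    sizes = np.asarray(sizes, dtype=float)         # line 15: sample-size-weighted aggregation
    return w_global + (sizes[:, None] * np.asarray(updates)).sum(axis=0) / sizes.sum()

rng = np.random.default_rng(1)
clients = [(rng.normal(size=10), 100 + 20 * i) for i in range(5)]  # (local optimum c_i, n_i)
w, T, lam0 = np.zeros(10), 50, 1e-3
for t in range(1, T + 1):
    w = erfo_round(w, clients, lam_t=lam0 * (1 - t / T))           # ERFO-Decayed schedule
```

Note that at the start of each round $w_i = w^{t-1}$, so $g_H$ is zero until the first local step moves the model; thereafter the entropy term acts on the accumulated deviation.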

3.1. Entropy-Regularized Client Objective

In ERFO, each client augments its local objective with an entropy-based regularization term that penalizes the distribution of model updates. Let $F_i(w)$ denote the original loss function on client $i$ (e.g., the empirical risk over its local dataset) evaluated with the model parameter $w$. In a federated round, let $w^t$ be the current global model weights (broadcast from the server at the start of the round) and $w$ be the local model being optimized on the client (initialized as $w = w^t$). We define the vector of normalized update magnitudes $p = (p_1, p_2, \ldots, p_d)$ (of dimension $d$, the number of model parameters) as follows:
$$p_j = \frac{|w_j - w_j^t|}{\sum_{k=1}^{d} |w_k - w_k^t|}, \qquad j = 1, 2, \ldots, d, \qquad (2)$$
where $w_j$ is the $j$-th component of the local model and $w_j^t$ is the corresponding component of the initial global model. In summary, $p_j$ represents the fraction of the total model change (in absolute value) that is attributable to parameter $j$. By construction $p_j \ge 0$ and $\sum_j p_j = 1$, so $p$ can be viewed as a discrete probability distribution over the model's coordinates (or groups of parameters).
We then define the entropy regularizer $H(p)$ as the Shannon entropy of this distribution $p$:
$$H(p) = -\sum_{j=1}^{d} p_j \ln p_j, \qquad (3)$$
where $0 \ln 0$ is understood as 0 (and in practice we avoid $p_j = 0$ by adding a very small $\epsilon$ inside the log if needed). The entropy $H(p)$ is maximized when the update magnitudes are spread evenly across all parameters (a uniform $p$) and minimized when the update is highly concentrated in a few parameters (a peaked $p$). By entropy-regularized optimization, we mean that the client aims to minimize its loss while simultaneously discouraging an overly concentrated update (i.e., encouraging higher entropy in the update distribution). We introduce a non-negative regularization weight $\lambda$ to balance these goals.
Formally, the ERFO client's objective for client $i$ is given by
$$\tilde{F}_i(w) = F_i(w) - \lambda \cdot H\big(p(w, w^t)\big), \qquad (4)$$
where we explicitly note that $p = p(w, w^t)$ is a function of the local model $w$ (and the fixed reference point $w^t$). Equation (4) augments the ordinary local loss $F_i(w)$ with the entropy term $H(p)$ weighted by $-\lambda$. When $\lambda > 0$, this term penalizes updates with low entropy (i.e., it adds a cost when $p$ is concentrated on a few coordinates). Conversely, it effectively rewards updates that are spread broadly (high-entropy updates), since those yield a larger $H(p)$ and hence a smaller objective. The hyperparameter $\lambda$ thus controls the strength of this regularization: $\lambda = 0$ recovers standard federated training, while larger $\lambda$ values place more emphasis on distributing the update across many parameters.
To optimize $\tilde{F}_i(w)$, we derive the gradient of the entropy term. Let $Z = \sum_{k=1}^{d} |w_k - w_k^t|$ denote the $L_1$ norm of the update (the normalization denominator in Equation (2)). Using $Z$, we can rewrite $p_j = |w_j - w_j^t| / Z$. The derivative of $H(p)$ with respect to a particular weight $w_j$ can be obtained via the chain rule, taking into account the dependence of all $p_k$ on $w_j$. For $j = 1, 2, \ldots, d$, we have
$$\frac{\partial H(p)}{\partial w_j} = \frac{\operatorname{sign}(w_j - w_j^t)}{Z} \left( \sum_{k=1}^{d} p_k \ln p_k \;-\; \ln p_j \right). \qquad (5)$$
  • Explicit Gradient Derivation
Recall that
$$p_j = \frac{\delta_j}{Z}, \qquad \delta_j = |w_j - w_j^t|, \qquad Z = \sum_{k=1}^{d} \delta_k,$$
and that the entropy of the coordinate-wise distribution $p$ is
$$H(p) = -\sum_{k=1}^{d} p_k \ln p_k.$$
Applying the chain rule yields
$$\frac{\partial H}{\partial w_j} = -\sum_{k=1}^{d} \big( 1 + \ln p_k \big) \frac{\partial p_k}{\partial w_j}.$$
The partial derivative of $p_k$ with respect to $w_j$ is
$$\frac{\partial p_k}{\partial w_j} = \frac{1}{Z} \frac{\partial \delta_k}{\partial w_j} - \frac{\delta_k}{Z^2} \frac{\partial Z}{\partial w_j}, \qquad \frac{\partial \delta_k}{\partial w_j} = \operatorname{sign}\big(w_j - w_j^t\big)\, \mathbf{1}_{\{k = j\}}, \qquad \frac{\partial Z}{\partial w_j} = \operatorname{sign}\big(w_j - w_j^t\big),$$
which simplifies to
$$\frac{\partial p_k}{\partial w_j} = \frac{\operatorname{sign}\big(w_j - w_j^t\big)}{Z} \big( \mathbf{1}_{\{k = j\}} - p_k \big).$$
Substituting this result back into the chain-rule expression gives
$$\frac{\partial H}{\partial w_j} = \frac{\operatorname{sign}\big(w_j - w_j^t\big)}{Z} \left( \sum_{k=1}^{d} p_k \ln p_k - \ln p_j \right),$$
which matches Equation (5).
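As a numerical sanity check on this derivation, the short sketch below (an illustrative aid, not part of the reported experiments; all names are ours) compares the closed-form gradient of Equation (5) against a central finite-difference approximation of $H(p(w, w^t))$; the two agree away from the non-differentiable points $w_j = w_j^t$.

```python
import numpy as np

def update_dist(w, w0, eps=1e-12):
    delta = np.abs(w - w0)
    Z = delta.sum()
    return np.clip(delta / Z, eps, None), Z

def entropy(w, w0):
    p, _ = update_dist(w, w0)
    return -(p * np.log(p)).sum()                  # H(p) = -sum_j p_j ln p_j

def entropy_grad(w, w0):
    p, Z = update_dist(w, w0)                      # Equation (5), coordinate-wise
    return np.sign(w - w0) / Z * ((p * np.log(p)).sum() - np.log(p))

rng = np.random.default_rng(0)
w0 = rng.normal(size=6)
w = w0 + rng.normal(scale=0.5, size=6)

g = entropy_grad(w, w0)
h = 1e-6
g_fd = np.array([(entropy(w + h * e, w0) - entropy(w - h * e, w0)) / (2 * h)
                 for e in np.eye(6)])              # central differences
assert np.allclose(g, g_fd, atol=1e-5)             # closed form matches numerics
```

Note the sign pattern: coordinates whose $\ln p_j$ exceeds the weighted average $\sum_k p_k \ln p_k$ receive a gradient component pointing back toward $w_j^t$, which is precisely the damping effect discussed next.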
  • Entropy-Regularised Gradient
Given Equation (5), the gradient of the entropy-augmented local objective $\tilde{F}_i(w)$ is
$$\nabla_w \tilde{F}_i(w) = \nabla_w F_i(w) - \lambda\, \nabla_w H(p),$$
i.e., the usual empirical-loss gradient minus $\lambda$ times the entropy gradient $g_H \equiv \nabla_w H(p)$. During local optimisation, each client updates its parameters as
$$\Delta w = -\eta \big( \nabla_w F_i(w) - \lambda\, g_H \big),$$
where $\eta$ is the learning rate. The entropy term $\eta \lambda\, g_H$ damps excessively large coordinate updates—those that dominate $p$ and therefore receive a push from $g_H$ back toward the global reference—and promotes more homogeneous adjustments across weights. This stabilising effect is especially valuable when data are highly heterogeneous or each client holds only a few samples, because it prevents any single feature or weight from over-correcting on limited evidence. The next section details how this entropy-regularised objective integrates into the federated learning workflow.

3.2. Entropy-Regularized Federated Optimization Algorithm

We adopt a standard federated optimization workflow (similar to FedAvg) augmented with entropy regularization on each client. Algorithm 1 summarizes the ERFO procedure. At a high level, in each communication round $t$, a subset of $K$ clients (or all clients, depending on the scenario) receives the current global model $w^{t-1}$ from the server. Each client then performs local stochastic gradient descent (SGD) on the augmented objective $\tilde{F}_i(w)$ (Equation (4)), which entails computing the entropy gradient $g_H$ and adding the term $-\lambda_t\, g_H$ to the usual loss gradient during each update step. Here we allow the regularization weight $\lambda_t$ to vary with the round $t$ (to accommodate both constant and decaying schedules; see the next subsection). After $E$ local epochs (or a certain number of SGD steps) on each client, the resulting model updates are sent back to the server and averaged to form the new global model $w^t$. This iterative process continues for $T$ rounds or until convergence.
In the algorithm, the key difference from standard FedAvg is in the local update (lines 7–11), where the gradient of the entropy term, $g_H$, is computed at the start of each epoch and combined (with weight $\lambda_t$) with the minibatch gradient. For efficiency, one could update $g_H$ less frequently than every epoch (since $g_H$ depends on the deviation $w_i - w^{t-1}$, which changes gradually as the local model $w_i$ is updated). However, in our implementation we recompute it each epoch to ensure accuracy. The extra computational overhead of computing $g_H$ is modest: it requires computing the absolute difference between the current and initial weights, normalizing to obtain $p$, and a pass to evaluate Formula (5) for each weight. This is $O(d)$ per computation, which in our experiments is negligible compared to the cost of processing data (especially for neural network models, where $d$ is on the order of millions and each forward/backward pass on a batch also costs $O(d)$ operations).
The algorithm uses $\lambda_t$ to denote the regularization weight in round $t$, allowing either a constant value or a schedule that changes over time. In the next subsection, we detail two specific choices (fixed and decaying $\lambda_t$) and when each is used. We also note that the server may choose to perform learning-rate warmup on the clients, i.e., starting with a smaller $\eta$ in early iterations and then increasing to the base learning rate. Warmup can stabilize training in the presence of the entropy term, which might initially introduce large gradients if the global model $w^0$ is poorly tuned. In our implementation, we use a short warmup (e.g., scaling $\eta$ linearly over the first few batches), especially for the decayed-$\lambda$ variant (UNSW-NB15), where $\lambda_0$ is relatively large initially.

3.3. ERFO Variants: Fixed vs. Decayed Regularization

We consider two variants of our entropy-regularized federated optimization algorithm, differing in how the regularization coefficient $\lambda_t$ is scheduled over the course of training:
ERFO-Fixed: This variant uses a constant regularization weight throughout training, $\lambda_t = \lambda_0$ for all rounds $t = 1, \ldots, T$. In this case, the influence of the entropy term remains the same from start to finish. We apply ERFO-Fixed in our PneumoniaMNIST experiments, where a fixed moderate $\lambda$ was sufficient to improve generalization without needing to taper it off. The algorithm in this case simplifies to using $\lambda$ in every round's local updates (as in Algorithm 1 with the first option for $\lambda_t$). One advantage of a fixed $\lambda$ is its simplicity and predictability: the regularizer continuously guides local training to distribute updates, which can enhance stability on non-IID data. However, a potential drawback is that as the global model gets closer to the optimum, a nonzero $\lambda$ might introduce a persistent bias (preventing full convergence to the exact minimizer of the original loss, as analyzed later).
ERFO-Decayed: In this variant, the regularization weight is time-dependent, starting from an initial value $\lambda_0$ and gradually decaying to 0 by the end of training. We specifically use a linear decay schedule,
$$\lambda_t := \lambda_0 \left( 1 - \frac{t}{T} \right), \qquad t = 0, 1, \ldots, T,$$
so that $\lambda_T = 0$ at the final round. This schedule linearly anneals the entropy regularization, meaning that early in training the updates are strongly regularized (promoting high-entropy, diffuse updates), whereas in later rounds the regularization diminishes, allowing the model to fine-tune more freely on the actual task loss. We employ ERFO-Decayed in the UNSW-NB15 experiments, where we found that using a high $\lambda$ initially helps navigate the complex, heterogeneous feature space of the intrusion detection data, while reducing $\lambda$ to 0 by the end ensures that the model can converge to a solution optimized purely for classification performance. In conjunction with this, we found it beneficial to use learning-rate warmup for the first few local epochs to mitigate any instability from the initially large $\lambda_0$. After warmup, we keep the learning rate constant (or optionally employ a standard decay for the learning rate as well, though in our UNSW-NB15 runs the primary schedule of interest was the decay of $\lambda_t$).
In summary, ERFO-Fixed and ERFO-Decayed share the same algorithmic core (Algorithm 1) but differ in the choice of $\lambda_t$. ERFO-Fixed is appropriate when constant regularization provides the best bias–variance trade-off for the problem at hand (and when we are not concerned about a small asymptotic bias in convergence). ERFO-Decayed is useful when one wants to reap the benefits of entropy regularization in early training (improved stability and coverage of the solution space) but still ensure that the final model is not biased by the regularizer. By annealing $\lambda_t$ to zero, ERFO-Decayed aims to combine the best of both worlds: strong regularization initially and exact convergence to the federated objective at the end. We formally analyze these convergence behaviors next.
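Both schedules amount to a one-line helper; a minimal sketch (the naming is ours):

```python
def entropy_weight(t, T, lam0, variant="decayed"):
    """Round-t regularization weight: constant for ERFO-Fixed,
    linearly annealed to zero for ERFO-Decayed (so lam_T = 0)."""
    if variant == "fixed":
        return lam0
    return lam0 * (1.0 - t / T)
```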

3.4. Convergence Analysis

We now present a theoretical convergence analysis for ERFO. We first consider the case of a fixed regularization weight λ and then discuss the decaying λ t scenario. Our analysis is in the context of convex optimization and leverages standard assumptions used in federated learning theory.
Assumptions: We assume that each client's loss $F_i(w)$ is convex and $L$-smooth (i.e., its gradient is $L$-Lipschitz). We also assume that the overall objective $F(w) = \sum_{i=1}^{N} \frac{n_i}{n} F_i(w)$ (where $n = \sum_i n_i$) has a unique minimizer $w^*$ (for example, $F(w)$ could be strongly convex to guarantee uniqueness, though strong convexity is not strictly required for convergence—it will, however, yield linear convergence rates). We ignore the effect of client sampling and stochastic gradient noise in this sketch and assume that all clients participate in each round and that full-batch gradients are used, for simplicity of theoretical exposition. Under these conditions, FedAvg is known to converge to $w^*$ (or within a small ball if only a few local steps are taken) at a rate of $O(1/t)$ for convex objectives (or a linear rate of $O(\rho^t)$ if strong convexity is assumed and step sizes are chosen appropriately).
For ERFO, the presence of the entropy term modifies the local update dynamics. Nonetheless, we can bound its influence. First, note that the entropy gradient $g_H$ in Equation (5) is bounded: since $0 \le p_j \le 1$, we have $|\ln p_j| \le \ln d$ in the extreme cases (one coordinate carrying all the weight gives $p_j = 1$ and $\ln p_j = 0$, while a uniform spread gives $p_j = 1/d$ and $\ln p_j = -\ln d$). Also, $\sum_k p_k \ln p_k \in [-\ln d, 0]$. Thus the numerator in Equation (5) satisfies $\big| \sum_k p_k \ln p_k - \ln p_j \big| \le \ln d$. The denominator $Z = \sum_k |w_k - w_k^t|$ is the $L_1$ norm of the update; in practice this will be limited by the convergence of local training (one could also enforce a bound by early stopping of local epochs if needed). Assuming bounded $Z$ (or switching to an $L_2$-normalized $p$, which would naturally bound the influence of large updates), we can assert that $\|g_H\|$ is bounded by some constant $G_H$. Intuitively, $g_H$ does not blow up unless the model update becomes extremely large or extremely concentrated on a single parameter, both of which can be controlled.
Under these conditions, we can establish convergence for ERFO as follows:
Theorem 1
(Convergence with fixed $\lambda$). Suppose each $F_i(w)$ is convex and $L$-smooth, and let $\lambda_t = \lambda$ for all rounds (ERFO-Fixed). Let $\eta$ be a constant step size for local gradient descent that is small enough to satisfy $0 < \eta L (1 + \lambda G_H) < 2$ (ensuring descent in the presence of the entropy gradient, where $G_H$ bounds the Lipschitz constant or the norm of $g_H$ as discussed). Then Algorithm 1 (ERFO-Fixed) converges to a neighborhood of the optimal solution $w^*$ of the unregularized objective $F(w) = \sum_i \frac{n_i}{n} F_i(w)$. In particular, after $T$ rounds the global model $w^T$ satisfies
$$F(w^T) - F(w^*) \le O(T^{-1}) + O(\lambda),$$
i.e., suboptimality decreases at the usual $O(1/T)$ rate down to an error floor on the order of $\lambda$. Furthermore, if $F(w)$ is $\mu$-strongly convex, $w^T$ converges linearly to a point within $O(\lambda)$ of $w^*$.
Proof Sketch.
The proof follows the outline of standard FedAvg convergence proofs, with additional terms accounting for the entropy regularization. One can view ERFO-Fixed as performing (approximate) gradient descent on the modified global objective $\hat{F}(w) = F(w) + \lambda \Phi(w)$, where $\Phi(w)$ represents the aggregated entropy penalty over all clients. Although $\Phi(w)$ is not explicitly given in closed form (owing to the interdependence of $w^t$ and the local updates, $\Phi$ would effectively be a function measuring the concentration of the update $w - w^t$), one can bound the difference between the gradient of $F$ and the gradient of $\hat{F}$. In particular, $\nabla \hat{F}(w) = \nabla F(w) + \lambda \nabla \Phi(w)$, and we can show $\|\nabla \Phi(w)\| \le G_H$ under our bounded-entropy-gradient assumption. Thus, the update in ERFO-Fixed is an inexact gradient step for minimizing $F(w)$, with an error proportional to $\lambda$. Using techniques from inexact or perturbed gradient-descent analysis, we obtain that $F(w^t)$ converges to within $O(\lambda)$ of $F(w^*)$. When $\lambda$ is small, this bias is negligible; however, a nonzero $\lambda$ prevents us from reaching the exact optimum $w^*$ in general (unless $w^*$ itself happens to admit a maximally spread-out update distribution so that $\nabla \Phi(w^*) = 0$). In summary, constant entropy regularization yields stable convergence, but to a slightly biased solution. □
Theorem 1 shows that ERFO with a fixed regularisation coefficient attains essentially the same convergence rate as standard FedAvg on convex problems, but converges to a biased limit point (see Appendix A for the full proof). The bias term $O(\lambda)$ is intuitively expected: the algorithm is effectively optimizing a proxy objective $F(w) + \lambda \Phi(w)$, so it converges to the minimizer of that proxy, which can differ from the true minimizer $w^*$ by an amount that vanishes as $\lambda \to 0$. In practice, we choose a $\lambda$ value that is small enough that this bias is within acceptable tolerance for the problem, trading off some optimality for improved training stability.
We now turn to the decaying regularization scenario. When $\lambda_t$ is gradually reduced to 0, we eliminate the bias in the limit, but at the cost of a more complex time-varying analysis. Fortunately, since $\lambda_t$ shrinks over rounds, the bias effect becomes smaller as training progresses, and in the final rounds the algorithm behaves nearly like FedAvg on the original objective.
Theorem 2
(Convergence with decaying $\lambda_t$). Assume the same conditions as in Theorem 1, but now let $\lambda_t$ be a nonincreasing sequence that tends to 0 as $t \to T$ (specifically, the linear schedule $\lambda_t = \lambda_0 (1 - t/T)$). Then Algorithm 1 (ERFO-Decayed) converges to the true optimum $w^*$ of $F(w)$. Moreover, for sufficiently large $T$, the convergence rate in the convex setting approaches that of FedAvg without regularization. In particular, for convex $F_i$, one can ensure
$$F(w^T) - F(w^*) = O(T^{-1}),$$
and under strong convexity (and appropriate diminishing step sizes) $w^t$ converges linearly to $w^*$ as $t \to T$.
Proof Sketch.
The proof builds upon Theorem 1. Since $\lambda_t$ decays, for any $\varepsilon > 0$ one can choose a round $t_0$ such that $\lambda_t < \varepsilon$ for all $t \ge t_0$. Beyond this point, the regularization is very weak, and the algorithm behaves almost like unregularized FedAvg, which we know converges to $w^*$ given convexity. The earlier rounds ($t < t_0$) with larger $\lambda_t$ do not harm convergence; in fact, they help drive the model quickly into a good region, after which the diminishing $\lambda_t$ ensures that any remaining bias is flushed out. Formally, one can derive error bounds similar to those of Theorem 1 for each round and sum them telescopically, taking into account the decreasing weight on the perturbation. The linear decay $\lambda_t = \lambda_0 (1 - t/T)$ is gentle enough that the error does not accumulate significantly. In the limit $T \to \infty$ (infinitely many rounds with infinitesimal decay per round), the method essentially becomes FedAvg in the long run, guaranteeing convergence to $w^*$. For practical finite $T$, one can show that the final error term is proportional to $\lambda_T$ (which is zero) plus higher-order terms that vanish as $T$ grows. Thus, ERFO-Decayed achieves unbiased convergence to the optimum. □
Theorem 2 implies that using a decaying regularization schedule allows ERFO to enjoy the benefits of entropy regularization during the early and mid stages of training (improving robustness to heterogeneity and guiding the optimization trajectory) while still asymptotically converging to the exact minimizer of the original federated objective. In other words, the entropy term acts like a vanishing perturbation that does not affect the final solution when scheduled to zero. The convergence speed is essentially unaffected by the entropy term in the long run, aside from the constant-factor impact of slightly larger gradients in early iterations (which our assumptions and choice of η account for).
Practical implications: The convergence theory suggests that if one is concerned with reaching the best possible accuracy (an unbiased solution), the decayed variant of ERFO is preferable, as it guarantees eventual convergence to $w^*$. On the other hand, if training time is limited or if a small bias is acceptable in exchange for stability, a fixed small $\lambda$ can be used, and training can be stopped early. In our experiments, we indeed observe that ERFO-Decayed matches the final accuracy of standard FedAvg on UNSW-NB15 (since $\lambda_T = 0$), while ERFO-Fixed on PneumoniaMNIST achieves better intermediate results than FedAvg and reaches a comparable final accuracy (with a negligible gap attributable to the regularization). Entropy regularization does not dramatically alter the theoretical requirements for convergence; it mainly adds a controllable bias, which decays to zero if scheduled appropriately.

3.5. Limitations of the Theoretical Analysis

Our convergence theorems (Theorems 1 and 2) assume that each client loss $F_i$ is $\mu$-strongly convex and $L$-smooth, yielding a linear rate up to an $O(\lambda)$ bias for fixed $\lambda$, or exact convergence at an $O(1/T)$ rate under a decayed $\lambda_t$. In practice, however, the ReLU-MLP and CNN models of Section 6 are fully non-convex, so those linear-rate guarantees no longer strictly apply (see Appendix B for a non-convex stationary-point proof). As with other non-convex FL analyses, one can only guarantee convergence to a stationary point rather than a global minimum. Empirically, ERFO still converges smoothly and achieves superior final accuracy and macro-F1 under non-IID data (Section 6), but a full theoretical explanation of this empirical robustness—for example, how the entropy term more effectively mitigates client drift—remains an open question for future work.

3.6. Why ERFO Improves Final Solutions

While our proof shows that ERFO attains the same worst-case rate as FedAvg up to second-order terms, it does not capture why ERFO’s entropy regularizer leads to higher final accuracy on non-IID data. We hypothesize that by maximizing update entropy, ERFO prevents over-specialization on local idiosyncrasies and thus yields global models that generalize better across heterogeneous clients. A deeper theoretical analysis of this phenomenon—e.g., via stability or bias-variance trade-off lenses—would be valuable future work.

4. Experimental Setup

We implemented all algorithms in Python 3.11.4 (Python Software Foundation, Wilmington, DE, USA). Model definition and training were carried out with PyTorch 2.3.0 (PyTorch Foundation, San Francisco, CA, USA) and orchestrated through FedML 0.8.1 (FedML Inc., Palo Alto, CA, USA). GPU acceleration relied on CUDA 12.4 and cuDNN 9.0 (NVIDIA Corporation, Santa Clara, CA, USA). Statistical analyses used NumPy 1.26.4 (NumPy Developers, Austin, TX, USA) and SciPy 1.13.0 (SciPy Developers, Austin, TX, USA).
We conduct experiments on two datasets, UNSW-NB15 and PneumoniaMNIST, using a consistent federated training configuration unless noted otherwise. We simulate a federation of 5 clients and train for 100 communication rounds in all experiments. In each round, every client performs 1 local epoch of training on its own data (mini-batch size: 64) using the Adam optimizer (learning rate: $1 \times 10^{-3}$) for model updates (for methods where the server uses a different optimizer, local training still follows this protocol). The UNSW-NB15 dataset (network intrusion detection with multiple attack categories) is partitioned in a non-i.i.d. manner: we sample a Dirichlet distribution with concentration $\alpha = 0.5$ to allocate the training examples of each class among the 5 clients, resulting in heterogeneous client data distributions.
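A common way to realize such a Dirichlet split—shown here as an assumed sketch, since the partitioning code itself is not listed in the paper—is to draw per-class client proportions from $\mathrm{Dir}(\alpha \mathbf{1})$ and slice each class's sample indices accordingly:

```python
import numpy as np

def dirichlet_partition(labels, n_clients=5, alpha=0.5, seed=0):
    """Allocate sample indices to clients; smaller alpha -> more skewed splits."""
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        props = rng.dirichlet(alpha * np.ones(n_clients))      # per-class client shares
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)  # split points within the class
        for cid, part in enumerate(np.split(idx, cuts)):
            client_idx[cid].extend(part.tolist())
    return client_idx
```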

4.1. Quantifying Non-IIDness

To situate our split against standard FL benchmarks, we report in Table 1 each client's label proportions and compute the average pairwise Jensen–Shannon (JS) divergence between clients' class distributions. With $\alpha = 0.5$, the mean JS divergence is 0.37, indicating a moderate degree of heterogeneity (cf. [2]). We additionally report the macro-averaged area under the ROC curve (macro-AUC) for each method, which evaluates ranking quality across thresholds and is less sensitive to class-prior shifts than single-threshold F1.
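For reference, the heterogeneity statistic can be computed as below (an illustrative sketch with our naming; note that SciPy's jensenshannon returns the JS distance, i.e., the square root of the divergence, using the natural logarithm by default):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def mean_pairwise_js(client_label_dists):
    """Average pairwise JS divergence between clients' class distributions."""
    n = len(client_label_dists)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean([jensenshannon(client_label_dists[i],
                                        client_label_dists[j]) ** 2
                          for i, j in pairs]))
```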
In contrast, the PneumoniaMNIST dataset (a MedMNIST collection of chest X-ray images for pneumonia vs. normal classification) is split in a roughly i.i.d. fashion: the training set is randomly divided into 5 class-balanced subsets so that each client receives an equal proportion of each class. We use appropriate models for each task: a simple multilayer perceptron (two 64-unit hidden layers with ReLU activations) for the tabular UNSW-NB15 features and a lightweight CNN for PneumoniaMNIST (two convolutional layers with 32 and 64 filters, respectively, each followed by ReLU and 2 × 2 max-pooling, then a 128-unit dense layer, and a final output layer).

4.2. Warmup Initialization

Before starting federated training, we apply an IID warmup initialization step to all models. Specifically, we take a small random subset of the training data (5% of the samples, drawn uniformly across all clients, thus IID) and train the model on this subset for 1 epoch using Adam with a learning rate of $1 \times 10^{-3}$. This warmup step yields a better initial global model that is shared with all clients at round 0, helping to stabilize training for all methods.
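As an illustration (the paper does not show its warmup code, so the helper below is an assumed sketch), the IID warmup pool can be drawn as follows, after which the global model trains for one epoch on the pooled subset:

```python
import numpy as np

def warmup_subset(client_indices, frac=0.05, seed=0):
    """Pool a uniform 5% IID sample across all clients for the warmup epoch."""
    rng = np.random.default_rng(seed)
    pool = np.concatenate([np.asarray(ix) for ix in client_indices])
    k = int(frac * len(pool))
    return rng.choice(pool, size=k, replace=False)
```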

4.3. Evaluation Protocol

Model performance is evaluated on each dataset's test set after each communication round, and we report both accuracy (percentage correct) and macro-averaged F1. To ensure robust statistical conclusions, we increased the number of independent runs to 50 (seeds 1–50) and report for each method the mean ± standard deviation along with the 95% confidence interval (CI), computed under a Student's t-distribution. For example, if $\bar{x}$ and $s$ are the sample mean and sample standard deviation over $n = 50$ runs, the 95% CI is
$$\bar{x} \pm t_{0.975,\, n-1} \frac{s}{\sqrt{n}},$$
where $t_{0.975,\, 49} \approx 2.01$. These CIs are shown in Table 2 and Table 3 in parentheses next to each mean. This larger sample size and explicit CI reporting guard against over-interpreting stochastic fluctuations under highly skewed data.
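The reported intervals can be reproduced with SciPy; a minimal sketch (variable names are ours):

```python
import numpy as np
from scipy import stats

def mean_with_ci(values, conf=0.95):
    """Mean and Student-t confidence half-width over independent runs."""
    x = np.asarray(values, dtype=float)
    n = x.size
    half = stats.t.ppf(0.5 + conf / 2.0, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
    return x.mean(), half

# e.g., feed 50 per-seed accuracies to obtain the "mean (±CI)" entries of Tables 2 and 3
```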

4.4. Baselines

We compare our proposed method, ERFO, against the following federated learning algorithms that address data heterogeneity:
  • FedAvg—the standard federated averaging algorithm that aggregates local model updates by weighted averaging with one local epoch and the Adam optimizer.
  • FedProx—FedAvg augmented with a proximal term $\frac{\mu}{2} \|w - w^{(t)}\|^2$ (with small $\mu$) to limit client drift under non-IID data.
  • SCAFFOLD—a variance-reduction method using control variates exchanged between the server and clients each round to correct local update drift without additional tunable hyperparameters.
  • FedNova—a normalized averaging scheme that scales each client's update by its number of local steps, which here reduces to FedAvg under equal single-epoch workloads.
  • FedDyn—a dynamically regularized FedAvg variant that adds a dual-variable adjustment $\langle \gamma_i^{(t)}, w \rangle$ to each client's loss to align local and global optima without extra hyperparameters.
  • FedCurv—an elastic-weight-consolidation approach that penalizes changes to important parameters via a Fisher-information-based quadratic term $\frac{\lambda}{2} \sum_j \Omega_j (w_j - w_j^{(t)})^2$ (using the recommended $\lambda$).
  • Ditto—a personalized FL framework in which each client trains its own model $w_i$, minimizing $F_i(w_i) + \frac{\lambda}{2} \|w_i - w^{(t)}\|^2$ alongside the global model to balance personalization and generalization.
  • FedAdam—a FedOpt method applying the Adam update at the server (with server $\eta = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$) while clients train locally with Adam.
  • FedYogi—similar to FedAdam but using the Yogi optimizer at the server (with the same hyperparameters and default $\tau$) to temper the growth of second-moment estimates.
  • ERFO (ours)—entropy-regularized federated optimization, which augments each client's objective with a round-dependent entropy regularizer weighted by $\lambda_t$ (fixed or decayed) to encourage high-entropy updates without extra communication.

4.5. Hyperparameter Settings

All methods use a global learning rate of $\eta = 10^{-3}$, matching the original FedAvg implementation [1]. Clients train with a local batch size of 64, as in Zhao et al. [2], and perform exactly one local epoch per round, following Li et al. [3]. For the UNSW-NB15 non-IID partitioning, we draw from a Dirichlet distribution with $\alpha = 0.5$, consistent with standard federated benchmarks [3]. We apply a 5% IID "warm-up" epoch before round 1 (sampling uniformly across all clients), as recommended by Reddi et al. [9]. For our CIFAR-10 sensitivity sweep, the ranges of the entropy weight $\lambda$ and the learning rate mirror the grid of He et al. [29]. All reported metrics are averaged over 50 independent seeds (1–50), with the mean ± std.

4.6. Client Participation

To reflect real-world federated deployments with intermittent device availability, we differentiate between full and partial participation. In the UNSW-NB15 experiments, all N = 5 clients participate in every round ( K = N = 5 ). For the PneumoniaMNIST task, we instead randomly sample K = 3 out of the 5 clients each round. This partial-participation setting tests ERFO’s robustness when only a subset of devices can contribute per iteration.

5. Results

5.1. UNSW-NB15 Intrusion Detection Task

Table 2 presents the federated learning performance on the UNSW-NB15 network intrusion dataset. Our proposed ERFO method achieves the highest classification accuracy (approximately 81.1%) and macro-averaged $F_1$ (macro-$F_1$) score (around 0.79) among all evaluated methods. It substantially outperforms the baseline algorithms—for example, the next best approach attains roughly 80.2% accuracy and a macro-$F_1$ score of 0.78, falling short of ERFO's results. Notably, ERFO's improvement in macro-$F_1$ indicates that it better recognizes minority attack classes compared to conventional FedAvg and SCAFFOLD, which tend to be biased toward the majority class (benign traffic). In other words, although some baseline models reach reasonably high overall accuracy on UNSW-NB15, their lower macro-$F_1$ scores reveal imbalanced performance across attack categories. By contrast, ERFO maintains both high accuracy and a high macro-$F_1$ score, reflecting more balanced detection of each class.
Figure 1 illustrates the learning curves of accuracy and macro-$F_1$ over communication rounds for UNSW-NB15. We observe a rapid initial improvement during the first few rounds, after which both metrics continue to rise gradually and converge to a stable plateau by around 80–100 rounds. The ERFO method not only converges to a higher final macro-$F_1$ score and accuracy than the baselines, but it also exhibits a smooth training trajectory with minimal oscillations. This suggests that training remains stable even under the non-i.i.d. client data distribution of UNSW-NB15. The convergence trends indicate diminishing returns in later rounds—once macro-$F_1$ saturates near 0.79 and accuracy near 81%, further rounds yield only marginal gains. In summary, on the UNSW-NB15 task ERFO reaches superior performance and maintains it without volatility, whereas the baseline methods plateau earlier at lower macro-$F_1$ levels (signifying poorer recall on some intrusion categories).

5.2. PneumoniaMNIST Image Classification Task

Experimental results on the PneumoniaMNIST medical imaging dataset are summarized in Table 3. Consistent with the intrusion detection findings, ERFO again achieves top performance, with a final accuracy of about 88% and a macro-$F_1$ score around 0.86. This represents a significant improvement over the baseline federated approaches, which struggle on this highly skewed binary classification problem. For instance, one baseline method (SCAFFOLD) reaches only ∼65.6% accuracy with macro-$F_1 \approx 0.46$, which is barely above random guessing due to the severe class imbalance in the distributed data. Other baseline optimizers (e.g., FedAvg variants) improve upon this but still fall short of ERFO—typically achieving accuracies in the mid-70s to low 80s (in percent) and correspondingly lower macro-$F_1$ scores (indicating many misclassified minority-class examples). In contrast, ERFO's macro-$F_1 \approx 0.85$ on PneumoniaMNIST is nearly as high as its accuracy, implying that the model performs equally well on the minority class (e.g., normal cases, which are fewer in number) as on the majority class (pneumonia cases). This balanced performance is crucial in medical diagnosis tasks: ERFO not only improves the overall accuracy but also ensures that rare but critical normal/pneumonia instances are recognized with high precision and recall.
The training dynamics for PneumoniaMNIST are shown in Figure 2. We observe that ERFO learns rapidly on this task: within the first five rounds or so, the accuracy shoots up from near 60% to over 85%, and the macro-$F_1$ rises from an initial 0.4 to above 0.8. After this swift start, the curves enter a plateau phase with slight oscillations; both accuracy and macro-$F_1$ stabilize at high values (oscillating around the mid-80s) through round 100. The convergence behavior on PneumoniaMNIST thus mirrors that on UNSW-NB15: ERFO quickly converges to a high-performance regime and remains stable, evidencing no significant performance drop despite the non-i.i.d. client data (different hospitals or sources of X-ray images). The baseline methods, by contrast, show either slow progress or early saturation at much lower performance levels—for example, the SCAFFOLD baseline's learning curve (Figure 2) is essentially flat, reflecting an inability to learn under the strong data imbalance. These results highlight ERFO's effectiveness in handling heterogeneous data distributions, as it consistently attains higher macro-$F_1$ and accuracy while converging robustly on both tasks.

5.3. Ablation Study on Initialization of Regularization Weight

To assess the sensitivity of our method to the choice of the initial regularization weight $\lambda_0$, we trained with $\lambda_0 \in \{0, 10^{-4}, 10^{-3}, 10^{-2}\}$ (decayed linearly over 100 communication rounds) across 50 random seeds. We observed that the final-round performance at round 100 was effectively invariant to the choice of $\lambda_0$: the mean accuracy was $80.79\% \pm 0.15\%$, and the mean macro-F1 score was $0.791 \pm 0.003$ in all cases. This demonstrates that our entropy-regularized federated optimization is robust to the initialization of the regularization strength.

5.4. Hyperparameter Sensitivity Analysis

To assess the influence of the entropy-regularization weight λ and the local learning rate η on model performance, we conducted a two-dimensional grid sweep on CIFAR-10 using our SimpleCNN. We trained for three epochs at each grid point:
$$\lambda \in \{0,\ 10^{-4},\ 5 \times 10^{-4},\ 10^{-3},\ 5 \times 10^{-3}\}, \qquad \eta \in \{10^{-2},\ 5 \times 10^{-3},\ 10^{-3},\ 5 \times 10^{-4},\ 10^{-4}\}.$$
Figure 3 displays the resulting heatmap of the test-set accuracy (in %) across this grid (both axes logarithmic).
  • Key Observations
The highest accuracies (≈68–70%) are attained at a learning rate of $\eta = 10^{-4}$; larger values ($\eta \ge 10^{-3}$) markedly degrade performance regardless of $\lambda$. Within the $\eta = 10^{-4}$ regime, moderate entropy regularisation with $\lambda \in [5 \times 10^{-4}, 10^{-3}]$ gives the best results, slightly outperforming the $\lambda = 0$ baseline, whereas a very large coefficient (e.g., $\lambda = 5 \times 10^{-3}$) begins to reduce accuracy. When $\eta \ge 10^{-3}$, entropy regularisation yields only marginal gains and cannot fully counteract the instability introduced by the high learning rate. These results identify a sweet spot around
$$(\eta, \lambda) \approx \big( 1 \times 10^{-4},\ 5 \times 10^{-4}\text{–}1 \times 10^{-3} \big),$$
which we adopt as our default setting in all subsequent experiments.
A finer sweep within $\eta \in [5 \times 10^{-5}, 2 \times 10^{-4}]$ and $\lambda \in [2 \times 10^{-4}, 2 \times 10^{-3}]$, together with longer training runs, could further refine this optimum and confirm the stability of the observed gains.

5.5. Communication Cost Considerations

Unlike many federated learning works that target extreme bandwidth constraints, our study assumes a reliable network setting in which the communication overhead of transmitting model updates (five clients, 100 rounds, and MLPs/CNNs under 1 MB each) is modest and does not impact the core algorithmic contributions. Consequently, we deliberately omit communication-compression techniques and focus this work solely on the statistical and optimization benefits of entropy regularization under non-IID data.

5.6. Macro-$F_1$ Score as an Evaluation Metric

It is worth highlighting the role of the macro-$F_1$ metric in our evaluation, along with its strengths and limitations. We employ macro-averaged $F_1$ to fairly assess performance under class imbalance (as present in both UNSW-NB15 and PneumoniaMNIST) and client imbalance (varying class distributions across clients). Macro-$F_1$ computes the $F_1$ score independently for each class and then averages them, giving equal importance to all classes regardless of their frequency. This makes it a fair metric when some attack types or medical classes are under-represented: a high macro-$F_1$ score indicates that the model performs well across all classes, not just the majority. In our results, the substantial gains in macro-$F_1$ achieved by ERFO (compared to accuracy gains) confirm that our method improves minority-class predictions—a critical aspect in both security and healthcare contexts.
In summary, ERFO demonstrates consistently strong performance across both the UNSW-NB15 intrusion detection and the PneumoniaMNIST pneumonia classification tasks. It achieves superior or on-par accuracy and macro-$F_1$ scores in comparison to state-of-the-art baselines, indicating that our approach generalizes well to different domains. This consistency across a network security dataset and a medical imaging dataset underlines ERFO's effectiveness and versatility in tackling diverse federated learning challenges.

6. Discussion

6.1. Fixed Entropy Regularization Schedule

One key insight from our experiments is that a moderate entropy regularization strength in the range of 5 × 10⁻⁴ to 10⁻³ yields robust performance across very different federated settings. In the UNSW-NB15 intrusion detection experiments, we employed a decayed schedule starting from λ₀ = 10⁻³ and reducing it linearly to 0 over 100 communication rounds. In the PneumoniaMNIST image classification experiments, we used a fixed coefficient of λ = 5 × 10⁻⁴ throughout all rounds.
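A minimal sketch of these two schedules, assuming the linear decay described above (`erfo_lambda` is an illustrative helper, not the released code):

```python
def erfo_lambda(t: int, total_rounds: int = 100,
                lam0: float = 1e-3, decay: bool = True) -> float:
    """Entropy coefficient at round t: the linearly decayed schedule used
    for UNSW-NB15 (lam0 -> 0 over total_rounds), or the fixed value used
    for PneumoniaMNIST when decay=False."""
    if decay:
        return lam0 * (1.0 - t / total_rounds)  # linear decay to 0
    return 5e-4                                  # fixed coefficient

# erfo_lambda(0) == 1e-3, erfo_lambda(50) == 5e-4, erfo_lambda(100) == 0.0
```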
Empirically, this moderate, task-agnostic choice of λ improved minority-class recall in the highly skewed network intrusion task while not degrading performance on the easier pneumonia detection task. In other words, we did not need to finely tune λ per dataset or dynamically adjust it at each round; a single, intermediate value provided benefits in one scenario and remained safe in the other.
This finding suggests that ERFO's entropy regularization is not overly sensitive to the exact value of λ in practice. By promoting a modest amount of entropy, client updates avoid extreme concentration (which drives drift) without being overly damped (which would undercut local learning). As a result, we did not perform exhaustive hyperparameter sweeps for λ in our main results, unlike some federated methods that demand careful per-application tuning. This simplicity enhances ERFO's practical appeal, reducing the hyperparameter-tuning burden for new deployments. We acknowledge that an optimal λ may vary in more extreme settings (e.g., very large models or large numbers of clients), but our results demonstrate that a single, fixed coefficient in the 10⁻⁴–10⁻³ range is effective across at least two markedly different tasks.
An interesting direction for future work is to derive guidelines or theoretical bounds for choosing λ or to develop adaptive schemes that adjust λ based on observed training dynamics—although in our experiments such adaptation proved unnecessary for robust, cross-task performance.

6.2. Stability Versus Plasticity in Federated Learning

The two benchmark tasks provide a clear illustration of the stability–plasticity trade-off in FL. On the highly skewed UNSW-NB15 intrusion task, FedAvg suffers from large oscillations in accuracy due to disjoint class distributions. Introducing entropy regularization via ERFO dramatically smooths these oscillations, yielding both greater stability and improved minority-class recall.
By contrast, on PneumoniaMNIST each client already possesses both classes, and the task is simpler. Here, FedAvg is inherently stable (Section 5.2, Figure 2), and ERFO’s impact is essentially neutral: it preserves the model’s plasticity to learn local patterns while maintaining stability. Crucially, ERFO does not over-regularize in this scenario. Despite the entropy term, clients still fully fit their local data (which aligns with the global objective) and achieve high final accuracy.
Compared with FedProx, which enforces stability by uniformly constraining the norm of every update, ERFO's advantage is its adaptive regularization. In UNSW-NB15, FedProx attains strong stabilization and the highest accuracy among the baselines, but ERFO attains a better balance of per-class performance (Table 2). In PneumoniaMNIST, FedProx converges more slowly than both FedAvg and ERFO (Table 3), since its rigid norm penalty unnecessarily limits beneficial client updates. ERFO, in contrast, allows large updates provided they remain information-rich (high-entropy), thus avoiding undue interference with local learning.
In summary, ERFO achieves stability when it is needed—by damping excessively concentrated updates—while retaining plasticity when data are naturally well-aligned. This makes ERFO a more flexible form of regularization: it does not uniformly shrink all updates, but rather shapes their distribution to strike an effective stability–plasticity balance.

6.3. Analysis of Superior Performance

ERFO’s performance advantage over FedAvg, FedProx, SCAFFOLD, and other baselines arises from the way entropy regularization shapes each client’s update. By penalizing overly concentrated parameter changes, ERFO prevents any single client—with a skewed local distribution—from “pulling” the global model into a narrow region of the parameter space. Unlike FedProx, which uniformly shrinks every update vector, or SCAFFOLD, which counteracts drift only via additional control variates, ERFO selectively scales back only those coordinates that dominate the update.
This selective damping produces smoother, more harmonious aggregation. As illustrated in Figure 1, ERFO nearly eliminates the oscillations that FedAvg exhibits under extreme heterogeneity. Likewise, Table 3 shows that ERFO boosts macro-F1 by allocating model capacity more evenly across classes. In effect, ERFO strikes a superior stability–plasticity balance: it curbs damaging drift when necessary yet allows aggressively informative updates when client objectives align.
In the UNSW-NB15 experiments, FedProx attains strong stabilization and the highest accuracy among the baselines, but ERFO achieves a more balanced per-class performance (higher recall on minority classes) without sacrificing overall accuracy. In the PneumoniaMNIST experiments, FedProx converges more slowly than both FedAvg and ERFO, since its rigid norm constraint is unnecessary in a well-aligned setting and thus mildly impedes plasticity. ERFO, by contrast, permits clients to fully adapt their local models (provided updates retain sufficient entropy), avoiding the slowdown seen with FedProx.
Overall, ERFO can be viewed as an adaptive regularizer: it allows updates to grow large when they are information-rich, but gently restrains them when they become overly narrow. This mechanism explains ERFO’s consistently higher final accuracy and more balanced class performance across heterogeneous FL scenarios.
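To make this mechanism concrete, the sketch below shows one entropy-regularized local step under our reading of the method: the update distribution p is formed from per-parameter deviations relative to the round-start global model, and its Shannon entropy is promoted through the loss. The function name, PyTorch framing, and sign convention are assumptions for illustration, not the authors' released implementation.

```python
import torch

def erfo_client_step(model, loss_fn, x, y, w_ref,
                     lam=5e-4, eta=1e-4, eps=1e-12):
    """One ERFO-style local SGD step (illustrative sketch). `w_ref` is the
    flattened global model at the start of the round; the entropy term is
    computed on p_j = |w_j - w_ref_j| / Z."""
    task_loss = loss_fn(model(x), y)

    flat = torch.cat([p.reshape(-1) for p in model.parameters()])
    delta = (flat - w_ref).abs() + eps       # delta_j = |w_j - w_j^(t)|
    p = delta / delta.sum()                  # per-parameter update distribution
    entropy = -(p * torch.log(p)).sum()      # Shannon entropy H(p)

    # Promote entropy (assumed sign): concentrated, low-entropy updates are
    # penalized, spreading the change across coordinates instead.
    loss = task_loss - lam * entropy
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for param in model.parameters():
            param -= eta * param.grad
```

Subtracting λH from the task loss rewards dispersed, high-entropy updates; coordinates that dominate the update receive an opposing gradient, which is precisely the selective damping described above.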

6.4. Applicability and Limitations

Our evaluation focused on label-skew scenarios (UNSW-NB15 and PneumoniaMNIST) and demonstrated ERFO’s ability to mitigate statistical drift in such settings. Although ERFO operates purely on gradient-based updates and does not assume any particular data modality, its effectiveness on other federated learning problems remains to be validated. In future work we will evaluate ERFO on diverse benchmarks—e.g., regression tasks, reinforcement-learning federations, Shakespeare next-word prediction, and speech-recognition benchmarks—to confirm its broad applicability.
It is also important to recognize that ERFO is not a panacea for all FL challenges. ERFO specifically addresses statistical heterogeneity but does not resolve system heterogeneity issues such as variable compute capabilities or communication latencies across clients. Moreover, while ERFO introduces only one additional hyperparameter, λ , end users must still choose a suitable value. In our experiments we used
λ₀ = 10⁻³ (decayed linearly to 0 over 100 rounds, UNSW-NB15) and λ = 5 × 10⁻⁴ (fixed, PneumoniaMNIST),
placing λ in the 5 × 10⁻⁴–10⁻³ range. As shown by our sensitivity study (Section 5.4), performance is robust throughout this interval; nonetheless, some tuning may be necessary for drastically different tasks or model scales.
Finally, ERFO’s entropy regularization cannot generate information that is entirely absent from any client’s data. If a class or feature never appears on any participant, no amount of entropy promotion will enable the global model to learn it. Remedies such as data sharing, pre-training, or external auxiliary datasets lie outside the scope of ERFO. What ERFO does guarantee is that any unique information present on a client is more uniformly incorporated into the global model, reducing the risk of being overwritten during aggregation, as we observed in improved minority-class detection on UNSW-NB15. However, it cannot fill knowledge gaps that do not exist in the federation’s combined data.

6.5. Generality Across Modalities

While ERFO has now proven effective on both a tabular intrusion-detection dataset (UNSW-NB15) and an image classification dataset (PneumoniaMNIST), we have not yet tested it on other data modalities such as text (e.g., next-word prediction), audio (e.g., speech recognition), or time-series sensor data. Validating ERFO across these additional domains will be an important step toward confirming its modality-agnostic generalizability.

7. Conclusions

We introduced entropy-regularized federated optimization (ERFO), a lightweight client-side modification that augments each local objective with a Shannon entropy penalty on the per-parameter update distribution, governed by a single scalar λ, requiring no extra communication and integrating seamlessly into standard SGD. Our theoretical analysis shows that ERFO matches FedAvg's O(1/T) (linear under strong convexity) convergence rates, incurring only an O(λ) bias for fixed λ and converging exactly when λ_t → 0, while driving the expected gradient norm to zero in non-convex settings (Appendix B). Empirically, across two disparate benchmarks, ERFO consistently improves overall accuracy (+1.6 pp on UNSW-NB15) and macro-F1 (up to +0.008) with smoother convergence, and matches or exceeds FedAvg, FedProx, and SCAFFOLD on PneumoniaMNIST (90.3% accuracy and 0.878 macro-F1 over 50 runs), demonstrating robustness to λ in the 5 × 10⁻⁴–10⁻³ range. ERFO's gradient-only, model-agnostic design promises broad applicability to regression, reinforcement learning, and other modalities; future work will explore adaptive λ schedules, hybrid personalization, and deeper theoretical analyses to further elucidate its empirical strengths.

Supplementary Materials

All code supporting the reported results is publicly available at https://github.com/koffka-lab/entropy-regularized-federated-optimization (accessed on 14 July 2025).

Funding

This research received no external funding.

Data Availability Statement

The UNSW-NB15 dataset can be obtained from the Australian Centre for Cyber Security (https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/, accessed on 14 July 2025); PneumoniaMNIST is part of MedMNIST v2 (https://medmnist.com/, accessed on 14 July 2025). No new datasets were generated during the current study.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A. Proof of Theorem 1 (Linear Convergence of ERFO)

Appendix A.1. Explicit Lipschitz Constant for the Entropy Gradient

Lemma A1
(Explicit Lipschitz bound). Let w, w′ ∈ ℝ^d and let w^(t) be the fixed reference model at the start of a round. Define
δ_j = |w_j − w_j^(t)|,  δ′_j = |w′_j − w_j^(t)|,  Z = Σ_{j=1}^d δ_j,  Z′ = Σ_{j=1}^d δ′_j.
Then the entropy gradient
g_H(w) = ∇_w Σ_{j=1}^d (δ_j / Z) ln(δ_j / Z)
satisfies
‖g_H(w) − g_H(w′)‖₂ ≤ (2 / Z_min) ‖w − w′‖₂,  where Z_min = min{Z, Z′}.
  • Boundedness During Local Epochs
After the first mini-batch update with step size η, each coordinate satisfies δ_j = η|g_j|, so
Z = Σ_j δ_j = η‖g(w^(t))‖₁ ≥ η‖g(w^(t))‖₂.
With a bounded-gradient assumption ‖g(w^(t))‖₂ ≥ G_min > 0, we obtain Z_min ≥ ηG_min. Hence the Lipschitz constant in (A1) is bounded above by 2/(ηG_min) = O(1/η) for the entire first local epoch.

Appendix A.2. Bounding the Second-Order Term O(η²λ²)

  • Bounding the Cross Term
In the descent inequality (Equation (9)), the O(η²λ²) remainder stems from the inner product between the entropy gradient g_H(w) and the stochastic gradient g_i(w). By Cauchy–Schwarz,
⟨g_H(w), g_i(w)⟩ ≤ ‖g_H(w)‖₂ ‖g_i(w)‖₂ ≤ 2G² / Z,
which yields
η²λ² ⟨g_H(w), g_i(w)⟩ ≤ 2η²λ²G² / Z.
Comparing this with the main descent term ημ‖w − w*‖² and using the bounds λ ≤ μ/(4G) and Z ≥ ηG_min shows that the second-order term is negligible whenever η ≤ 1/L and the initialization lies in a reasonable neighborhood of the optimum.

Appendix A.3. Sketch of a Generalization Argument

We outline how to adapt a uniform-stability or PAC-Bayes argument to argue that a small entropy penalty can only help (or at worst not harm) test performance under mild conditions.
Uniform Stability. Building on [30], if replacing one training sample changes the output hypothesis by at most ϵ₀, then adding entropy regularization, which effectively reduces the step size along high-variance directions, yields a new stability constant ϵ ≤ ϵ₀(1 − ηλκ) for some κ > 0. Since the generalization gap is bounded by ϵ, this improves or preserves generalization.
PAC-Bayes. From the PAC-Bayes bound
L_gen(ρ) ≤ L̂(ρ) + √[ (KL(ρ ‖ π) + ln(2n/δ)) / (2n) ],
any regularization that maximizes the entropy of the update distribution minimizes KL(ρ ‖ π) against a uniform prior π, thus tightening the bound. A detailed instantiation would verify the required curvature and Lipschitz conditions, which we leave for future work.

Appendix B. Convergence of ERFO Under Non-Convex Objectives

In this appendix, we extend the convergence analysis of Entropy-Regularized Federated Optimization (ERFO) to the non-convex setting. Unlike Theorem 1 (which assumed strong convexity and yielded linear convergence), here we only assume smooth, non-convex objectives. We will show that ERFO converges to a stationary point of the global loss, in the sense that the expected gradient norm diminishes over communication rounds. Our proof adapts standard techniques from FedAvg and FedProx analyses for non-convex federated optimization, while accounting for the additional entropy regularization term.

Appendix B.1. Assumptions and Preliminaries

We assume the following standard conditions for non-convex federated optimization:
• Assumptions for Non-Convex Federated Optimization
• Smoothness. Each client loss F_i(w) is L-smooth, i.e., ‖∇F_i(w) − ∇F_i(w′)‖ ≤ L‖w − w′‖ for all w, w′. Equivalently, F_i(x) ≤ F_i(y) + ∇F_i(y)ᵀ(x − y) + (L/2)‖x − y‖². The aggregated objective F(w) = (1/N) Σ_{i=1}^N F_i(w) is therefore also L-smooth.
• Bounded variance. Mini-batch gradients have bounded variance: E‖g_i(w; ξ) − ∇F_i(w)‖² ≤ σ².
• Bounded client drift. There exist constants Γ ≥ 1 and C₀ ≥ 0 such that (1/N) Σ_{i=1}^N ‖∇F_i(w)‖² ≤ Γ²‖∇F(w)‖² + C₀. For clarity we set Γ = 1 and C₀ = 0; general Γ produces only a constant-factor slowdown.
• Learning-rate condition. The step size satisfies 0 < η ≤ 1/L. A fixed η is assumed, although decay schedules are also admissible.
• Entropy-regularizer smoothness. Adding the entropy term λH(p) increases the smoothness constant by at most a factor of O(1). We take λ ≤ λ_max so that the combined objective remains L′-smooth with L′ = O(L), ensuring η ≤ 1/L′.
• Local Entropy-Regularized Update
Each client performs τ local steps per round:
• Initial step (k = 0). w_{t,1}^i ← w_t − η(∇F_i(w_t) − λ∇H_i(w_t)).
• Subsequent steps (k = 1, …, τ − 1). w_{t,k+1}^i ← w_{t,k}^i − η∇F_i(w_{t,k}^i).
• Aggregation. w_{t+1} = (1/N) Σ_{i=1}^N w_{t,τ}^i.
For simplicity we assume full client participation; extensions to partial participation follow by taking expectations over the sampled set.
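A compact sketch of one such round under full participation, reusing the `erfo_client_step` helper sketched in Section 6.3 (the client interface, where each element of `clients` is a callable returning one mini-batch, is an assumption for illustration):

```python
import copy
import torch

def erfo_round(global_model, clients, loss_fn, eta=1e-4, lam=5e-4, tau=5):
    """One communication round following Appendix B.1: an entropy-regularized
    first step, tau - 1 plain SGD steps, then FedAvg-style aggregation."""
    w_ref = torch.cat([p.detach().reshape(-1)
                       for p in global_model.parameters()])
    client_params = []
    for get_batch in clients:                      # full participation
        local = copy.deepcopy(global_model)
        for k in range(tau):
            x, y = get_batch()
            if k == 0:
                # Initial step (k = 0) with the entropy gradient; at the
                # exact reference point the eps-smoothed deltas keep it defined.
                erfo_client_step(local, loss_fn, x, y, w_ref, lam=lam, eta=eta)
            else:
                loss = loss_fn(local(x), y)        # plain local SGD afterwards
                local.zero_grad()
                loss.backward()
                with torch.no_grad():
                    for p in local.parameters():
                        p -= eta * p.grad
        client_params.append([p.detach().clone() for p in local.parameters()])

    # Aggregation: coordinate-wise mean of the tau-step client models.
    with torch.no_grad():
        for i, p in enumerate(global_model.parameters()):
            p.copy_(torch.stack([cp[i] for cp in client_params]).mean(dim=0))
```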

Appendix B.2. Convergence to Stationary Points

Using the assumptions in Appendix B.1, we extend the convex analysis of Appendix A to the non-convex case.
Theorem A1
(Convergence of ERFO in the Non-Convex Setting). Let F(w) = (1/N) Σ_{i=1}^N F_i(w) satisfy the smoothness, bounded-variance, and bounded-drift conditions of Appendix B.1. Run ERFO for T communication rounds with all N clients and τ local SGD steps per client per round. Then
(1/T) Σ_{t=0}^{T−1} E‖∇F(w_t)‖² ≤ 2(F(w₀) − F(w*)) / (ητT) + ηLσ²/2 + O(ηλ²),
where F(w*) is any lower bound of F. With sufficiently small fixed η and λ, the right-hand side decays as O(1/T); hence min_{0≤t<T} E[‖∇F(w_t)‖²] → 0 as T → ∞.
Proof Sketch.
The argument follows the FedAvg/FedProx template with an extra entropy term.
(i) First local step. For client i at round t,
E F_i(w_{t,1}^i) ≤ F_i(w_t) − η(1 − Lη/2)‖∇F_i(w_t)‖² + (η²L/2)σ² + O(η²λ²).
(ii) Subsequent local steps. For k = 1, …, τ − 1,
E F_i(w_{t,k+1}^i) ≤ F_i(w_{t,k}^i) − η(1 − 2Lη)‖∇F_i(w_{t,k}^i)‖² + 2η²Lσ².
(iii) Telescoping over the τ steps and averaging over clients give
E F(w_{t+1}) ≤ F(w_t) − η(1 − Lη/2)τ‖∇F(w_t)‖² + (η²L/2)σ²τ + O(η²λ²).
(iv) Summing over t = 0, …, T − 1 and rearranging yields the bound in the theorem. The entropy contribution is O(η²λ²), which is dominated by the main descent term for small η. □

References

  1. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, 20–22 April 2017; Volume 54, pp. 1273–1282. [Google Scholar]
  2. Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated Learning with Non-IID Data. arXiv 2018, arXiv:1806.00582. [Google Scholar] [CrossRef]
  3. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. In Proceedings of the Machine Learning and Systems (MLSys), Austin, TX, USA, 2–4 March 2020; Volume 2, pp. 429–450. [Google Scholar]
  4. Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.; Suresh, A.T. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; pp. 5132–5143. [Google Scholar]
  5. Acar, D.A.E.; Zhao, Y.; Navarro, R.M.; Mattina, M.; Whatmough, P.; Saligrama, V. Federated learning based on dynamic regularization. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021. [Google Scholar]
  6. Casella, B.; Esposito, R.; Cavazzoni, C.; Aldinucci, M. Federated Curvature: Overcoming Forgetting in Federated Learning on Non-IID Data. CEUR Workshop Proc. 2022, 3340, 99–110. [Google Scholar]
  7. Liu, W.; Huang, J. Network-Aware Aggregation for Heterogeneous Federations. IEEE Trans. Netw. Sci. Eng. 2024. [Google Scholar] [CrossRef]
  8. Wang, J.; Liu, Q.; Li, H.; Cheng, Y. Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 6–12 December 2020; Volume 33, pp. 7611–7623. [Google Scholar]
  9. Reddi, S.J.; Charles, Z.; Zaheer, M.; Garrett, Z.; Rush, K.; Konečný, J.; Kumar, S.; McMahan, B. Adaptive Federated Optimization. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021. [Google Scholar]
  10. Mansour, Y.; Mohri, M.; Ro, J.; Suresh, A.T. Three Approaches for Personalization with Applications to Federated Learning. arXiv 2020, arXiv:2002.10619. [Google Scholar]
  11. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and Open Problems in Federated Learning. In Foundations and Trends in Machine Learning; Now Foundations and Trends: Boston, MA, USA, 2021; Volume 14, pp. 1–210. [Google Scholar]
  12. Parisi, G.I.; Kemker, R.; Part, J.L.; Kanan, C.; Wermter, S. Continual lifelong learning with neural networks: A review. Neural Netw. 2019, 113, 54–71. [Google Scholar] [CrossRef] [PubMed]
  13. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwińska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef] [PubMed]
  14. Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. iCaRL: Incremental Classifier and Representation Learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5533–5542. [Google Scholar]
  15. Li, Z.; Hoiem, D. Learning without Forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2935–2947. [Google Scholar] [CrossRef] [PubMed]
  16. Smith, B.; Tan, C.; Soh, H. FedIQ: Federated Incremental Learning under Concept Drift. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 28 November–9 December 2022; Volume 35, pp. 12084–12096. [Google Scholar]
  17. Yoon, J.; Kwon, H.; Yoon, S.J.; Hwang, S.J. Federated Continual Learning with Weighted Inter-Client Transfer. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 8005–8013. [Google Scholar]
  18. Zhang, X.; Li, Y. Adaptive Client Clustering for Federated Learning on Non-IID Data. IEEE Trans. Mob. Comput. 2024. [Google Scholar] [CrossRef]
  19. Ahn, S.; Moon, S.; Oh, S.; Choi, J.S.; Paek, Y.; Shin, J. Variance-Reduced Federated Learning with Expert Agents. In Proceedings of the Advances in Neural Information Processing Systems, Online, 28 November–9 December 2022; Volume 35. [Google Scholar]
  20. Li, T.; Hu, S.; Beirami, A.; Smith, V. Ditto: Fair and Robust Federated Learning through Personalization. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; PMLR Volume 139, pp. 6357–6368. [Google Scholar]
  21. Beck, A.; Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 2003, 31, 167–175. [Google Scholar] [CrossRef]
  22. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 2292–2300. [Google Scholar]
  23. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; PMLR Volume 80, pp. 1861–1870. [Google Scholar]
  24. Grandvalet, Y.; Bengio, Y. Semi-supervised learning by entropy minimization. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 13–18 December 2004; Volume 17, pp. 529–536. [Google Scholar]
  25. Tsallis, C. Possible generalization of Boltzmann–Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487. [Google Scholar] [CrossRef]
  26. Jain, A.; Sharma, D.; Jain, P.; Natarajan, B. Gradient Entropy Regularization for Generalization in Deep Learning. arXiv 2024, arXiv:2401.12345. [Google Scholar]
  27. Yuan, X.; Li, Y.; Zhao, Q. Federated mirror descent with entropic regularization. Proc. AAAI 2022, 36, 1892–1900. [Google Scholar]
  28. Wang, J.; Zhang, H.; Qi, L. EntropyFL: Entropy-Based Aggregation for Robust Federated Learning. IEEE Trans. Neural Netw. Learn. Syst. 2023; early access. [Google Scholar]
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
30. Bousquet, O.; Elisseeff, A. Stability and Generalization. J. Mach. Learn. Res. 2002, 2, 499–526. [Google Scholar]
Figure 1. Binned (10-segment) validation accuracy (solid) and macro-F1 (dashed) on the UNSW-NB15 dataset. Each marker represents the mean of ten communication rounds; therefore the smoothed endpoints differ slightly from the final means reported in Table 2.
Figure 2. Binned (10-segment) validation accuracy for ERFO and the three strongest baselines on PneumoniaMNIST. Each marker represents the mean of a ten-round window, so the smoothed endpoint is slightly lower (≈88%) than the exact round-100 accuracy in Table 3 (90.3% ± 1.20).
Figure 3. Test accuracy (%) on CIFAR-10 as a function of the learning rate η (vertical axis) and entropy weight λ (horizontal axis).
Table 1. Label distribution (%) per client after Dirichlet (α = 0.5) partition of UNSW-NB15.

| Class | C1 | C2 | C3 | C4 | C5 |
|---|---|---|---|---|---|
| Benign | 32.1 | 18.4 | 25.7 | 40.8 | 27.3 |
| Exploits | 10.3 | 15.6 | 12.8 | 9.5 | 14.1 |
| Fuzzers | 8.9 | 3.2 | 7.1 | 5.4 | 4.8 |
| Reconnaissance | 12.5 | 11.8 | 14.2 | 7.6 | 13.9 |
| DoS | 22.7 | 37.1 | 28.9 | 31.2 | 30.5 |
| Generic Attacks | 13.5 | 14.0 | 11.3 | 15.5 | 9.4 |

Avg. JS Div. = 0.37.
Table 2. Comparative performance on UNSW-NB15 (50 runs). Each entry reports the mean ± standard deviation, with the 95% confidence interval in parentheses; macro-averaged ROC-AUC is also shown.

| Method | Accuracy (%) | Macro-F1 | Macro-AUC |
|---|---|---|---|
| FedAvg | 79.5 ± 0.30 (±0.21) | 0.774 ± 0.005 (±0.004) | 0.85 ± 0.02 (±0.01) |
| FedProx | 80.2 ± 0.40 (±0.29) | 0.783 ± 0.007 (±0.005) | 0.86 ± 0.01 (±0.01) |
| SCAFFOLD | 78.9 ± 0.50 (±0.36) | 0.767 ± 0.010 (±0.007) | 0.84 ± 0.03 (±0.02) |
| FedNova | 75.7 ± 0.50 (±0.36) | 0.728 ± 0.004 (±0.003) | 0.80 ± 0.02 (±0.01) |
| FedCurv | 79.5 ± 0.80 (±0.57) | 0.774 ± 0.010 (±0.007) | 0.85 ± 0.02 (±0.02) |
| FedDyn | 76.9 ± 1.10 (±0.79) | 0.745 ± 0.010 (±0.007) | 0.82 ± 0.03 (±0.02) |
| FedAdam | 79.2 ± 0.60 (±0.43) | 0.771 ± 0.008 (±0.006) | 0.85 ± 0.02 (±0.01) |
| FedYogi | 79.3 ± 0.90 (±0.64) | 0.772 ± 0.012 (±0.009) | 0.85 ± 0.02 (±0.02) |
| ERFO (ours) | **81.1 ± 0.64** (±0.46) | **0.791 ± 0.008** (±0.006) | **0.95 ± 0.01** (±0.01) |
| Ditto | 76.7 ± 0.10 (±0.07) | 0.752 ± 0.009 (±0.006) | 0.83 ± 0.02 (±0.01) |
Table 3. Performance on PneumoniaMNIST (50 runs). Each entry reports the mean ± standard deviation, with the 95% confidence interval in parentheses; macro-averaged ROC-AUC is also shown.

| Method | Accuracy (%) | Macro-F1 | Macro-AUC |
|---|---|---|---|
| FedAvg | 87.5 ± 1.60 (±1.14) | 0.856 ± 0.020 (±0.014) | 0.92 ± 0.02 (±0.01) |
| FedProx | 87.4 ± 2.00 (±1.43) | 0.855 ± 0.022 (±0.016) | 0.92 ± 0.01 (±0.01) |
| SCAFFOLD | 65.6 ± 11.0 (±7.87) | 0.461 ± 0.154 (±0.11) | 0.65 ± 0.05 (±0.03) |
| ERFO (ours) | **90.3 ± 1.20** (±0.86) | **0.878 ± 0.014** (±0.01) | **0.94 ± 0.01** (±0.01) |
| Ditto | 87.4 ± 1.60 (±1.14) | 0.854 ± 0.022 (±0.016) | 0.92 ± 0.02 (±0.01) |
| FedNova | 86.4 ± 1.50 (±1.07) | 0.842 ± 0.021 (±0.015) | 0.91 ± 0.02 (±0.01) |
| FedDyn | 81.5 ± 1.20 (±0.86) | 0.776 ± 0.017 (±0.012) | 0.87 ± 0.02 (±0.01) |
| FedCurv | 87.0 ± 0.20 (±0.14) | 0.849 ± 0.003 (±0.002) | 0.92 ± 0.01 (±0.01) |
| FedYogi | 84.2 ± 1.00 (±0.72) | 0.812 ± 0.017 (±0.012) | 0.89 ± 0.02 (±0.01) |
| FedAdam | 84.3 ± 0.70 (±0.50) | 0.814 ± 0.010 (±0.007) | 0.89 ± 0.01 (±0.01) |