Improved Information-Theoretic Generalization Bounds for Distributed, Federated, and Iterative Learning

We consider information-theoretic bounds on the expected generalization error for statistical learning problems in a network setting. In this setting, there are K nodes, each with its own independent dataset, and the models from the K nodes have to be aggregated into a final centralized model. We consider both simple averaging of the models as well as more complicated multi-round algorithms. We give upper bounds on the expected generalization error for a variety of problems, such as those with Bregman divergence or Lipschitz continuous losses, that demonstrate an improved dependence of 1/K on the number of nodes. These “per node” bounds are in terms of the mutual information between the training dataset and the trained weights at each node and are therefore useful in describing the generalization properties inherent to having communication or privacy constraints at each node.


I. INTRODUCTION
A key property of machine learning systems is their ability to generalize to new and unknown data. Such a system is trained on a particular set of data, but must then perform well even on new datapoints that have not previously been considered. This ability, deemed generalization, can be formulated in the language of statistical learning theory by considering the generalization error of an algorithm, i.e., the difference between the population risk of a model trained on a particular dataset and the empirical risk for the same model and dataset. We say that a model generalizes well if it has a small generalization error, and because models are often trained by minimizing empirical risk or some regularized version of it, a small generalization error also implies a small population risk, which is the average loss over new samples taken randomly from the population. It is therefore of interest to upper bound generalization error and understand which quantities control it, so that we can quantify the generalization properties of a machine learning system and offer guarantees about how well it will perform.
In recent years, it has been shown that information theoretic quantities such as mutual information can be used to bound generalization error under assumptions on the tail of the distribution of the loss function [1], [2], [3]. In particular, when the loss function is sub-Gaussian, the expected generalization error can scale at most with the square root of the mutual information between the training dataset and the model weights [2]. These bounds offer an intuitive explanation for generalization and overfitting: if an algorithm uses only limited information from its training data, then this will bound the expected generalization error and prevent overfitting. Conversely, if a training algorithm uses all of the information from its training data in the sense that the model is a deterministic function of the training data, then this mutual information can be infinite and there is the possibility of unbounded generalization error and thus overfitting.
Another modern focus of machine learning systems has been that of distributed and federated learning [4], [5], [6]. In these systems, data is generated and processed in a distributed network of machines. The main differences between the distributed and centralized settings are the information constraints imposed by the network. There has been considerable interest in understanding the impact of both communication constraints [7], [8] and privacy constraints [9], [10], [11], [12] on the performance of machine learning systems, and in designing protocols that efficiently train systems under these constraints.
Since both communication and local differential privacy constraints can be thought of as special cases of mutual information constraints, they should pair naturally with some form of information theoretic generalization bound in order to induce control over the generalization error of the distributed machine learning system. The information constraints inherent to the network can themselves give rise to tighter bounds on generalization error and thus provide better guarantees against overfitting. Along these lines, in recent work [13], a subset of the present authors introduced the framework of using information theoretic quantities to bound both expected generalization error and a measure of privacy leakage in distributed and federated learning systems. The generalization bounds in this work, however, are essentially the same as those obtained by thinking of the entire system, from the data at each node in the network to the final aggregated model, as a single centralized algorithm. Any improved generalization guarantees from these bounds would remain implicit in the mutual information terms involved.
In this work, we develop improved bounds on expected generalization error for distributed and federated learning systems. Instead of leaving the differences between these systems and their centralized counterparts implicit in the mutual information terms, we bring analysis of the structure of the systems directly into the bounds. By working with the contribution from each node separately, we are able to derive upper bounds on expected generalization error that scale with $O(1/K)$ in the number of nodes $K$ instead of $O(1/\sqrt{K})$. This improvement is shown to be tight for certain examples such as learning the mean of a Gaussian with squared $\ell_2$ loss. We develop bounds that apply to distributed systems in which the submodels from each one of $K$ different nodes are averaged together, as well as bounds that apply to more complicated multi-round stochastic gradient descent (SGD) algorithms such as in federated learning. For linear models with Bregman divergence losses, these "per node" bounds are in terms of the mutual information between the training dataset and the trained weights at each node, and are therefore useful in describing the generalization properties inherent to having communication or privacy constraints at each node. For arbitrary nonlinear models that have Lipschitz continuous losses, the improved dependence of $O(1/K)$ can still be recovered, but without a description in terms of mutual information. We demonstrate the improvements given by our bounds over the existing information theoretic generalization bounds via simulation of a distributed linear regression example.
arXiv:2202.02423v2 [cs.IT] 15 Jan 2024

A. Technical Preliminaries
Suppose we have independent and identically distributed (i.i.d.) data $Z_i \sim \pi$ for $i = 1, \ldots, n$ and let $S = (Z_1, \ldots, Z_n)$. Suppose further that $W = A(S)$ is the output of a potentially stochastic algorithm. Let $\ell(W, Z)$ be a real-valued loss function and define
$$L(w) = \mathbb{E}_{Z \sim \pi}[\ell(w, Z)]$$
to be the population risk for weights (or model) $w$. We similarly define
$$L(w, s) = \frac{1}{n}\sum_{i=1}^{n} \ell(w, z_i)$$
to be the empirical risk on dataset $s = (z_1, \ldots, z_n)$ for model $w$. The generalization error for dataset $s$ is then
$$\Delta_A(s) = L(A(s)) - L(A(s), s),$$
and the expected generalization error is
$$\mathbb{E}[\Delta_A(S)] = \mathbb{E}[L(A(S)) - L(A(S), S)],$$
where the expectation is also over any randomness in the algorithm. Below we present some standard results on the expected generalization error that will be needed.
In many of the results in this paper, we will use one of the two following assumptions.
where $\bar{W}$, $\bar{Z}$ are taken independently from the marginals for $W$, $Z$, respectively.
The next assumption is a special case of the previous one with
Theorem 2 (Theorem 2 in [3]). Under Assumption 1,
where
Recall that for a continuously differentiable and strictly convex function $F : \mathbb{R}^m \to \mathbb{R}$, we define the associated Bregman divergence [15] between two points $p, q \in \mathbb{R}^m$ to be
$$D_F(p, q) = F(p) - F(q) - \langle \nabla F(q), p - q \rangle,$$
where $\langle \cdot, \cdot \rangle$ denotes the usual inner product.
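As a quick numerical illustration of the Bregman divergence, the following minimal sketch evaluates $D_F(p, q) = F(p) - F(q) - \langle \nabla F(q), p - q \rangle$ for a user-supplied $F$ and its gradient. The function names (`bregman_divergence`, `sq_norm`, `sq_norm_grad`) and the two points are illustrative assumptions, not part of the paper. The choice $F(p) = \|p\|_2^2$ is the one for which the divergence reduces to the squared Euclidean distance.

```python
import numpy as np


def bregman_divergence(F, grad_F, p, q):
    """Return D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return F(p) - F(q) - float(np.dot(grad_F(q), p - q))


# Illustrative choice: F(p) = ||p||_2^2, whose gradient is 2p.
def sq_norm(p):
    return float(np.dot(p, p))


def sq_norm_grad(p):
    return 2.0 * p


# Hypothetical points used only to exercise the function.
p = np.array([1.0, -2.0, 0.5])
q = np.array([0.0, 1.0, 2.0])

d_F = bregman_divergence(sq_norm, sq_norm_grad, p, q)
squared_distance = float(np.sum((p - q) ** 2))
print(d_F, squared_distance)  # the two values agree for this choice of F
```

Other strictly convex choices of $F$ can be passed in the same way, provided the matching gradient is supplied.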

II. DISTRIBUTED LEARNING AND MODEL AGGREGATION
Now suppose that there are $K$ nodes each with $n$ samples. Each node $k = 1, \ldots, K$ has dataset $S_k = (Z_{1,k}, \ldots, Z_{n,k})$ with $Z_{i,k}$ taken i.i.d. from $\pi$. We use $S = (S_1, \ldots, S_K)$ to denote the entire dataset of size $nK$. Each node locally trains a model $W_k = A_k(S_k)$ with algorithm $A_k$. After each node locally trains its model, the models $W_k$ are then combined to form the final model $W$ using an aggregation algorithm $W = A(W_1, \ldots, W_K)$. See Figure 1. In this section we will assume that $W_k \in \mathbb{R}^d$ and that the aggregation is done by simple averaging, i.e.,
$$W = \frac{1}{K}\sum_{k=1}^{K} W_k.$$
Define $A$ to be the total algorithm from data $S$ to the final weights $W$ so that $W = A(S)$.
Fig. 1. The distributed learning setting with model aggregation.
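Theorem 3. Suppose that $\ell(\cdot, z)$ is a convex function of $w \in \mathbb{R}^d$ for each $z$ and that $A_k$ represents the empirical risk minimization algorithm on local dataset $S_k$ in the sense that
$$W_k = A_k(S_k) = \arg\min_{w} \frac{1}{n}\sum_{i=1}^{n} \ell(w, Z_{i,k}).$$
Then
$$\mathbb{E}[\Delta_A(S)] \le \frac{1}{K}\sum_{k=1}^{K} \mathbb{E}[\Delta_{A_k}(S_k)].$$
Proof.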
In the above display, line (4) follows by the convexity of ℓ via Jensen's inequality, and line (5) follows by minimizing the empirical risk over each node's local dataset, which exactly corresponds to what each node's local algorithm A k does.
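As a rough sketch of the two steps (a reconstruction under assumed notation, not the authors' exact display), write $L(w)$ for the population risk, $L(w, s)$ for the empirical risk on dataset $s$, and $W = \frac{1}{K}\sum_k W_k$ for the averaged model. Convexity controls the population risk from above, and local risk minimization controls the empirical risk from below:

```latex
\begin{align*}
\mathbb{E}[L(W)]
  &= \mathbb{E}\Big[L\Big(\tfrac{1}{K}\sum_{k=1}^{K} W_k\Big)\Big]
   \le \frac{1}{K}\sum_{k=1}^{K} \mathbb{E}[L(W_k)]
  && \text{(Jensen, convexity of } \ell(\cdot, z)\text{)} \\
\mathbb{E}[L(W, S)]
  &= \frac{1}{K}\sum_{k=1}^{K} \mathbb{E}[L(W, S_k)]
   \ge \frac{1}{K}\sum_{k=1}^{K} \mathbb{E}[L(W_k, S_k)]
  && \text{(} W_k \text{ minimizes } L(\cdot, S_k)\text{)}
\end{align*}
```

Subtracting the second line from the first bounds the expected generalization error of the averaged model by the average of the per-node expected generalization errors.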
While Theorem 3 seems like a nice characterization of generalization bounds for the aggregate model, in that the aggregate generalization error cannot be any larger than the average of the generalization errors over the nodes, it does not offer any improvement in the expected generalization error that one might expect given $nK$ total samples instead of just $n$ samples. A naive application of the information theoretic generalization bounds from Theorem 2, followed by the data processing inequality $I(W; Z_{i,k}) \le I(W_k; Z_{i,k})$, runs into the same problem.

A. Improved Bounds
In this subsection, we prove bounds on expected generalization error that remedy the above shortcomings. In particular, we would like the following two properties.
(a) The bound should decay with the number of nodes $K$ in order to take advantage of the total dataset from all $K$ nodes.
(b) The bound should be in terms of the information theoretic quantities $I(W_k; S_k)$, which can represent (or be upper bounded by) the capacities of the channels that the nodes are communicating over. This can, for example, represent a communication or local differential privacy constraint for each node.
At a high level, we will improve on the bound from Theorem 3 by taking into account the fact that a small change in $S_k$ will only change $W$ by a fraction $1/K$ of the amount that it will change $W_k$. In the case that $W$ is a linear or location model, and the loss $\ell$ is a Bregman divergence, we can obtain an upper bound on expected generalization error that satisfies both properties (a) and (b) as follows.
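The scaling argument can be checked numerically in a simple case. The sketch below assumes, purely for illustration, that each node's model is the sample mean of its local data; all names, sizes, and the perturbed index are hypothetical. Replacing a single sample at one node moves the averaged model by exactly $1/K$ of the amount that it moves that node's own model.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, d = 5, 20, 3                       # nodes, samples per node, dimension
data = rng.normal(size=(K, n, d))        # data[k] plays the role of S_k


def local_models(data):
    """Illustrative local algorithm: each node returns its sample mean."""
    return data.mean(axis=1)             # shape (K, d)


def aggregate(models):
    """Simple averaging of the K submodels."""
    return models.mean(axis=0)           # shape (d,)


# Replace one sample Z_{i,k} at node k = 2 by an independent copy.
perturbed = data.copy()
perturbed[2, 7] = rng.normal(size=d)

delta_local = local_models(perturbed)[2] - local_models(data)[2]
delta_global = aggregate(local_models(perturbed)) - aggregate(local_models(data))

# The aggregate moves by 1/K of the local change.
print(np.allclose(delta_global, delta_local / K))
```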
Assumption 3. When $Z = (X, Y)$ are labeled pairs and for loss functions of type (i) in Theorem 4 below, we assume that
Whether or not this assumption holds true will depend on the distributions involved, the training algorithms, and the function $F$. For least squares regression examples similar to those discussed in the last section of this paper, we have verified, through Monte Carlo simulation, that this assumption appears to hold for all parameter values that we tested. It remains an interesting open problem to understand when this holds true.

Theorem 4 (Linear or Location Models with Bregman Loss).
Suppose that Assumption 1 holds for each node. Consider the following two cases:
and
Proof. Here we restrict our attention to case (ii), but the two cases have nearly identical proofs. Using Theorem 1,
In (7), we use k ). Line (6) follows by the linearity of the inner product and by canceling the higher order terms $F(A(S))$ and $F(A(S^{(i,k)}))$, which have the same expected values. The key step (7) then follows by noting that $A(S^{(i,k)})$ only differs from $A(S)$ in the submodel coming from node $k$, which is multiplied by a factor of $1/K$ when averaging all of the submodels. By backing out step (6) and re-adding the appropriate canceled terms we get
By applying Theorem 2,
Then, by noting that $\psi^{*-1}$ is non-decreasing and concave,
And using
we have
The result in Theorem 4 is general enough to apply to many problems of interest. For example, if $F(p) = \|p\|_2^2$, then the Bregman divergence $D_F$ gives the ubiquitous squared $\ell_2$ loss, i.e., $D_F(p, q) = \|p - q\|_2^2$. For a comprehensive list of realizable loss functions, the interested reader is referred to [16]. Using the above $F$, Theorem 4 can apply to ordinary least squares regression, which we will look at in more detail in Section IV. Other regression models such as logistic regression have a loss function that cannot be described with a Bregman divergence without the inclusion of an additional nonlinearity. However, the result in Theorem 4 is agnostic to the algorithm that each node uses to fit its individual model. In this way, each node could be fitting a logistic model to its data, and the total aggregate model would then be an average over these logistic models. Theorem 4 would still control the expected generalization error for the aggregate model with the extra $1/K$ factor. Critically, however, the upper bound would only be for generalization error with respect to a loss of the form $D_F(\langle x, w \rangle, y)$, such as squared $\ell_2$ loss.
In order to show that the dependence on the number of nodes $K$ from Theorem 4 is tight for certain problems, consider the following example from [3]. Suppose that $Z \sim \pi = \mathcal{N}(\mu, \sigma^2 I_d)$ and $\ell(w, z) = \|w - z\|_2^2$ so that we are trying to learn the mean $\mu$ of the Gaussian. An obvious algorithm for each node to use is simple averaging of its dataset: $w_k = \frac{1}{n}\sum_{i=1}^{n} z_{i,k}$. For this algorithm, it can be shown that
(see Section IV.A. in [3]). If we apply the existing information theoretic bounds from Theorem 2 in an end-to-end way, such as would be the approach from [13], we would get
However, for this choice of algorithm at each node, the true expected generalization error can be computed to be $\frac{2\sigma^2 d}{nK}$. Applying our new bound from Theorem 4, we get
which recovers the correct dependence on $K$ and improves upon the $O(1/\sqrt{K})$ result from previous information theoretic methods.
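The expected generalization error in this Gaussian mean example can also be estimated by simulation. The following sketch assumes $\mu = 0$ (the generalization error is unaffected by translation), uses the closed-form population risk $\|w - \mu\|_2^2 + \sigma^2 d$ in place of a second Monte Carlo average, and compares the estimate with the value $2\sigma^2 d/(nK)$ that a direct calculation gives. Function names and parameter values are illustrative assumptions.

```python
import numpy as np


def closed_form_gap(sigma, d, n, K):
    """Expected generalization error of the averaged sample means."""
    return 2.0 * sigma**2 * d / (n * K)


def monte_carlo_gap(sigma, d, n, K, trials=20000, seed=0):
    """Monte Carlo estimate of E[population risk - empirical risk]."""
    rng = np.random.default_rng(seed)
    # z[r, k, i] is sample i at node k in trial r; mu = 0 by assumption.
    z = rng.normal(0.0, sigma, size=(trials, K, n, d))
    w_k = z.mean(axis=2)                 # local sample means, (trials, K, d)
    w = w_k.mean(axis=1)                 # aggregate model, (trials, d)
    # Empirical risk of w on the full dataset of n*K points.
    emp = ((z - w[:, None, None, :]) ** 2).sum(axis=-1).mean(axis=(1, 2))
    # Population risk E||w - Z||^2 = ||w - mu||^2 + sigma^2 * d with mu = 0.
    pop = (w ** 2).sum(axis=-1) + sigma**2 * d
    return float((pop - emp).mean())


estimate = monte_carlo_gap(sigma=1.0, d=2, n=5, K=4)
exact = closed_form_gap(sigma=1.0, d=2, n=5, K=4)
print(estimate, exact)
```

Doubling `K` in both calls should roughly halve both numbers, which is the $1/K$ dependence discussed above.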

B. General Models and Losses
In this section we briefly describe some results that hold for more general classes of models and loss functions, such as deep neural networks and other nonlinear models.
Theorem 5 (Lipschitz Continuous Loss). Suppose that $\ell(w, z)$ is Lipschitz continuous as a function of $w$ in the sense that
for any $z$, and that
Proof. Starting with Theorem 1,
Equation (8) follows due to Lipschitz continuity, equation (9) uses the triangle inequality, and equation (10) is by assumption.
The bound in Theorem 5 is not in terms of the information theoretic quantities $I(W_k; S_k)$, but it does show that the $O(1/K)$ upper bound can be obtained for much more general loss functions and arbitrary nonlinear models.

C. Privacy and Communication Constraints
Both communication constraints and local differential privacy constraints can be thought of as special cases of mutual information constraints. Motivated by this observation, Theorem 4 immediately implies corollaries for these types of systems.
Corollary 1 (Privacy Constraints). Suppose each node's algorithm $A_k$ is an $\varepsilon$-local differentially private mechanism in the sense that $\frac{p(w_k \mid s_k)}{p(w_k \mid s'_k)} \le e^{\varepsilon}$ for each $w_k, s_k, s'_k$. Then for losses $\ell$ of the form in Theorem 4, and under Assumption 2,
Corollary 2 (Communication Constraints). Suppose each node can only transmit $B$ bits of information to the model aggregator, meaning that each $W_k$ can only take $2^B$ distinct possible values. Then for losses $\ell$ of the form in Theorem 4, and under Assumption 2,
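The link between these constraints and mutual information can be made concrete with two standard facts: an $\varepsilon$-local differentially private mechanism satisfies $I(W_k; S_k) \le \varepsilon$ nats, and a $B$-bit message satisfies $I(W_k; S_k) \le H(W_k) \le B \ln 2$ nats. The sketch below checks both numerically under illustrative assumptions: binary randomized response on a uniform bit as the private mechanism, and a uniform scalar quantizer on $[-1, 1]$ as the $B$-bit encoder. Neither mechanism is taken from the paper, and all function names are hypothetical.

```python
import numpy as np


def randomized_response_mi(eps):
    """Mutual information (nats) between a uniform bit and its
    eps-LDP randomized response (assumed mechanism)."""
    p = np.exp(eps) / (1.0 + np.exp(eps))   # probability of reporting the true bit
    h = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))  # H(W | S)
    return float(np.log(2.0) - h)           # H(W) = ln 2 for a uniform input


def quantize(w, bits, lo=-1.0, hi=1.0):
    """Map each value in [lo, hi] to one of 2**bits integer codes."""
    levels = 2 ** bits
    idx = np.floor((np.asarray(w) - lo) / (hi - lo) * levels)
    return np.clip(idx, 0, levels - 1).astype(int)


def empirical_entropy_nats(idx):
    """Entropy (nats) of the empirical distribution of the codes."""
    _, counts = np.unique(idx, return_counts=True)
    q = counts / counts.sum()
    return float(-(q * np.log(q)).sum())


def max_information_nats(bits):
    """Largest possible entropy of a message with 2**bits values."""
    return bits * np.log(2.0)


rng = np.random.default_rng(0)
codes = quantize(rng.uniform(-1.0, 1.0, size=10000), bits=3)
print(randomized_response_mi(1.0), empirical_entropy_nats(codes))
```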

III. ITERATIVE ALGORITHMS
We now turn to considering more complicated multi-round and iterative algorithms. In this setup, after $T$ rounds there is a sequence of weights $W^{(T)} = (W^1, \ldots, W^T)$ and the final model $W^T = f_T(W^{(T)})$ is a function of that sequence, where $f_T$ gives a linear combination of the $T$ vectors $W^1, \ldots, W^T$. The function $f_T$ could represent, for example, averaging over the $T$ iterates, picking out the last iterate $W^T$, or some weighted average over the iterates. On each round $t$, each node $k$ produces an updated model $W_k^t$ based on its local dataset $S_k$ and the previous timestep's global model $W^{t-1}$. The global model is then updated via an average over all $K$ updated submodels:
$$W^t = \frac{1}{K}\sum_{k=1}^{K} W_k^t.$$
The particular example that we will consider is that of distributed SGD, where each node constructs its updated model $W_k^t$ by taking one or more gradient steps starting from $W^{t-1}$ with respect to random minibatches of its local data. Our model is general enough to account for multiple local gradient steps, as used in so-called Federated Learning [4], [5], [6], as well as noisy versions of SGD such as in [17], [18]. If only one local gradient step is taken on each iteration, then the update rule for this particular example could be written as
$$W_k^t = W^{t-1} - \eta_t \nabla_w \ell(W^{t-1}, Z_{t,k}) + \xi_t,$$
where $Z_{t,k}$ is a data point (or minibatch) sampled from $S_k$ on timestep $t$, $\eta_t$ is the learning rate, and $\xi_t$ is some potential added noise. We assume that the data points $Z_{t,k}$ are sampled without replacement so that the samples are distinct across different values of $t$.
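A minimal sketch of the single-gradient-step update is given below. It assumes a scalar linear model with squared loss $(wx - y)^2$, one sample per node per round drawn without replacement (so that $T = n$), a constant learning rate, and optional Gaussian noise $\xi_t$ shared across nodes. The function name, default values, and data layout are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def distributed_sgd(x, y, lr=0.05, noise_std=0.0, seed=0):
    """Run T = n rounds of averaged one-step SGD.

    x, y: arrays of shape (K, n) holding each node's local dataset.
    Returns the sequence of global iterates W^1, ..., W^T.
    """
    rng = np.random.default_rng(seed)
    K, n = x.shape
    # One permutation per node: each local sample is used exactly once.
    order = np.stack([rng.permutation(n) for _ in range(K)])
    nodes = np.arange(K)
    w = 0.0                                   # initial global model W^0
    iterates = []
    for t in range(n):
        xt = x[nodes, order[:, t]]            # Z_{t,k} for every node k
        yt = y[nodes, order[:, t]]
        grad = 2.0 * (w * xt - yt) * xt       # gradient of (w x - y)^2 at W^{t-1}
        xi = noise_std * rng.normal()         # optional added noise xi_t
        w_local = w - lr * grad + xi          # W_k^t, one entry per node
        w = float(w_local.mean())             # W^t: average of the K submodels
        iterates.append(w)
    return np.array(iterates)


# Hypothetical data: K = 8 nodes, n = 200 samples each, true weight 1.5.
data_rng = np.random.default_rng(1)
x_demo = data_rng.normal(size=(8, 200))
y_demo = 1.5 * x_demo + 0.1 * data_rng.normal(size=(8, 200))
iterates_demo = distributed_sgd(x_demo, y_demo)
print(iterates_demo[-1])
```

A function $f_T$ such as the average of the iterates can then be applied to the returned array, for example `iterates_demo.mean()`.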
For this type of iterative algorithm, we will consider the following timestep averaged empirical risk quantity:
and the corresponding generalization error
Note that the quantity in (12) is slightly different than the end-to-end generalization error that we would get considering the final model $W^T$ and whole dataset $S$. It is instead an average over the generalization error we would get from each model stopping at iteration $t$. We do this so that when we apply the leave-one-out expansion from Theorem 1, we do not have to account for the dependence of $W_k^t$ on past samples $Z_{t',k'}$ for $t' < t$ and $k' \ne k$. Since we expect the generalization error to decrease as we use more samples, this quantity should result in a more conservative upper bound and be a reasonable surrogate object to study. The following bound follows as a corollary to Theorem 4.

IV. SIMULATIONS
We simulated a distributed linear regression example in order to demonstrate the improvement in our bounds over the existing information theoretic bounds. To do this, we generated $n = 10$ synthetic datapoints at each of $K$ different nodes for various values of $K$. Each datapoint consisted of a pair $(x, y)$ where $y = x w_0 + n$ with $x, n \sim \mathcal{N}(0, 1)$, and $w_0 \sim \mathcal{N}(0, 1)$ was the randomly generated true weight that was common to all datapoints. Each node constructed an estimate $w_k$ of $w_0$ using the well-known normal equations which minimize the $\ell_2$ loss, i.e., $w_k = \arg\min_w \sum_{i=1}^{n} (w x_{i,k} - y_{i,k})^2$. The aggregate model was then the average $w = \frac{1}{K}\sum_{k=1}^{K} w_k$. In order to estimate the old and new information theoretic generalization bounds (i.e., the bounds from Theorems 2 and 4, respectively), this procedure was repeated $M = 10^6$ times and the datapoint and model values were binned in order to estimate the mutual information quantities. The value of $M$ was increased until the mutual information estimates were no longer particularly sensitive to the number and widths of the bins. In order to estimate the true generalization error, the expectations for both the population risk and the dataset were estimated by Monte Carlo with $10^4$ trials each. The results can be seen in Figure 2, where it is evident that the new information theoretic bound is much closer to the true expected generalization error, and decays with an improved rate as a function of $K$.
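A scaled-down version of this experiment can be sketched as follows. It estimates only the true expected generalization error, not the mutual information bounds, and it uses the closed-form population risk $1 + (w - w_0)^2$ (valid because $x$ and the additive noise are independent standard normals) instead of a Monte Carlo estimate. The function name, trial count, and seed are illustrative assumptions.

```python
import numpy as np


def simulated_gap(K, n=10, trials=20000, seed=0):
    """Estimate the expected generalization error of the averaged
    least squares model for K nodes with n samples each."""
    rng = np.random.default_rng(seed)
    w0 = rng.normal(size=trials)                       # true weight per trial
    x = rng.normal(size=(trials, K, n))
    y = x * w0[:, None, None] + rng.normal(size=(trials, K, n))
    # Scalar normal equations at each node: w_k = sum(x y) / sum(x^2).
    w_k = (x * y).sum(axis=2) / (x * x).sum(axis=2)    # shape (trials, K)
    w = w_k.mean(axis=1)                               # aggregate model
    # Empirical risk of the aggregate model on all n*K datapoints.
    emp = ((y - x * w[:, None, None]) ** 2).mean(axis=(1, 2))
    # Population risk E[(y - x w)^2] = 1 + (w - w0)^2 for this data model.
    pop = 1.0 + (w - w0) ** 2
    return float((pop - emp).mean())


gap_single = simulated_gap(K=1)
gap_many = simulated_gap(K=8)
print(gap_single, gap_many)
```

Evaluating `simulated_gap` over a range of `K` and plotting on a log scale gives a curve that can be compared against the decay rate described above.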

Corollary 3. For losses $\ell$ of the form in Theorem 4, and under Assumption 2,
$$\mathbb{E}[\Delta_{\mathrm{sgd}}(S)] \le \frac{1}{K^2 T}\sum_{t=1}^{T}\sum_{k=1}^{K} \sqrt{2R^2\, I(W_k^t; Z_{t,k})}.$$

Fig. 2. Information theoretic upper bounds and expected generalization error for a simulated linear regression example in linear (top) and log (bottom) scales.