Abstract
We consider information-theoretic bounds on the expected generalization error for statistical learning problems in a network setting. In this setting, there are K nodes, each with its own independent dataset, and the models from the K nodes have to be aggregated into a final centralized model. We consider both simple averaging of the models as well as more complicated multi-round algorithms. We give upper bounds on the expected generalization error for a variety of problems, such as those with Bregman divergence or Lipschitz continuous losses, that demonstrate an improved dependence of $1/K$ on the number of nodes. These “per node” bounds are in terms of the mutual information between the training dataset and the trained weights at each node and are therefore useful in describing the generalization properties inherent to having communication or privacy constraints at each node.
1. Introduction
A key feature of machine learning systems is their ability to generalize to new and unknown data. Such a system is trained on a particular set of data but must then perform well even on new data points that have not previously been seen. This ability, termed generalization, can be formulated in the language of statistical learning theory by considering the generalization error of an algorithm (i.e., the difference between the population risk of a model trained on a particular dataset and the empirical risk for the same model and dataset). We say that a model generalizes well if it has a small generalization error, and because models are often trained by minimizing empirical risk or some regularized version of it, a small generalization error also implies a small population risk, which is the average loss over new samples taken randomly from the population. It is therefore of interest to find an upper bound on the generalization error and understand which quantities control it so that we can quantify the generalization properties of a machine learning system and offer guarantees about its performance.
In recent years, it has been shown that information-theoretic measures such as mutual information can be used for generalization error bounds under assumptions on the tail of the distribution of the loss function [1,2,3,4]. In particular, when the loss function is sub-Gaussian, the expected generalization error can scale at most with the square root of the mutual information between the training dataset and the model weights, i.e., with $\sqrt{2\sigma^2 I(S; W)/n}$ [2]. Such bounds offer an intuitive explanation for generalization and overfitting: if an algorithm uses only limited information from its training data, then this will bound the expected generalization error and prevent overfitting. Conversely, if an algorithm uses all of the information from its training set, in the sense that the model is a deterministic function of the training set, then this mutual information can be infinite, and there is the possibility of overfitting.
Another modern focus of machine learning systems has been that of distributed and federated learning [5,6,7]. In these systems, data are generated and processed in a distributed network of machines. The main differences between the distributed and centralized settings are the information constraints imposed by the network. There has been considerable interest in understanding the impact of both communication constraints [8,9] and privacy constraints [10,11,12,13] on the performance of machine learning systems, as well as designing protocols that efficiently train the systems under these constraints.
Since both communication and local differential privacy constraints can be thought of as special cases of mutual information constraints, they should pair naturally with some form of information theoretic generalization bounding in order to induce control over the generalization error of the distributed machine learning system. The information constraints inherent to the network can themselves give rise to tighter bounds on generalization error and thus provide better guarantees against overfitting. Along these lines, in a recent work [14], a subset of the present authors introduced the framework of using information theoretic quantities for bounding both the expected generalization error and a measure of privacy leakage in distributed and federated learning systems. The generalization bounds in this work, however, are essentially the same as those obtained by thinking of the entire system, from the data at each node in the network to the final aggregated model, as a single, centralized algorithm. Any improved generalization guarantees from these bounds would remain implicit in the mutual information terms involved.
In this work, we develop improved bounds on the expected generalization error for distributed and federated learning systems. Instead of leaving the differences between these systems and their centralized counterparts implicit in the mutual information terms, we bring analysis of the structure of the systems directly to the bounds. By working with the contribution from each node separately, we are able to derive upper bounds on the expected generalization error that scale with the number of nodes K as $1/K$ instead of $1/\sqrt{K}$. This improvement is shown to be tight for certain examples, such as learning the mean of a Gaussian distribution with quadratic loss. We develop bounds that apply to distributed systems in which the submodels from K different nodes are averaged together, as well as bounds that apply to more complicated multi-round stochastic gradient descent (SGD) algorithms, such as in federated learning. For linear models with Bregman divergence losses, these “per node” bounds are in terms of the mutual information between the training dataset and the trained weights at each node and are therefore useful in describing the generalization properties inherent to having communication or privacy constraints at each node. For arbitrary nonlinear models that have Lipschitz continuous losses, the improved dependence of $1/K$ can still be recovered, but without a description in terms of mutual information. We demonstrate the improvements given by our bounds over the existing information theoretic generalization bounds via simulation of a distributed linear regression example. A preliminary conference version of this paper was presented in [15]. The present paper completes the work by including all of the missing proof details as well as providing new bounds for noisy SGD in Corollary 4.
1.1. Technical Preliminaries
Suppose we have independent and identically distributed (i.i.d.) data $Z_i \sim \mu$ for $i = 1, \ldots, n$, and let $S = (Z_1, \ldots, Z_n)$. Suppose further that $W = \mathcal{A}(S)$ is the output of a potentially stochastic algorithm $\mathcal{A}$. Let $\ell(w, z)$ be a real-valued loss function and define
$$L_\mu(w) = \mathbb{E}_{Z \sim \mu}\left[\ell(w, Z)\right]$$
to be the population risk for weights (or model) $w$. We similarly define
$$L_s(w) = \frac{1}{n}\sum_{i=1}^n \ell(w, z_i)$$
to be the empirical risk on dataset $s = (z_1, \ldots, z_n)$ for model $w$. The generalization error for dataset $s$ is then
$$\mathrm{gen}(s) = L_\mu(\mathcal{A}(s)) - L_s(\mathcal{A}(s)).$$
In addition, the expected generalization error is
$$\overline{\mathrm{gen}}(\mu, \mathcal{A}) = \mathbb{E}\left[L_\mu(W) - L_S(W)\right],$$
where the expectation is also over any randomness in the algorithm. Below, we present some standard results for the expected generalization error that will be needed:
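To make these definitions concrete, the following minimal sketch (an editorial illustration, not part of the original analysis) estimates the expected generalization error of the sample-mean algorithm under an assumed quadratic loss and Gaussian data; the comparison value $2\sigma^2/n$ is the known result for this example from [3]:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_risk(w, s):
    # L_s(w): average quadratic loss (w - z)^2 over the dataset s
    return np.mean((w - s) ** 2)

def population_risk(w, sigma=1.0):
    # L_mu(w) = E_{Z ~ N(0, sigma^2)}[(w - Z)^2] = w^2 + sigma^2 (closed form)
    return w ** 2 + sigma ** 2

# Estimate E[L_mu(W) - L_S(W)] for the sample-mean algorithm W = A(S)
n, trials = 10, 100_000
gaps = np.empty(trials)
for t in range(trials):
    s = rng.normal(0.0, 1.0, size=n)   # S = (Z_1, ..., Z_n), Z_i i.i.d. from mu
    w = s.mean()                       # W = A(S): the empirical risk minimizer
    gaps[t] = population_risk(w) - empirical_risk(w, s)

print(f"estimated gen error: {gaps.mean():.4f}  (known value 2*sigma^2/n = {2/n:.4f})")
```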
Theorem 1
(Leave-One-Out Expansion; Lemma 11 in [16]). Let $S^{(i)}$ be a version of $S$ with $Z_i$ replaced by an i.i.d. copy $\tilde{Z}_i$. Denote $W^{(i)} = \mathcal{A}\left(S^{(i)}\right)$. Then, we have
$$\mathbb{E}\left[L_\mu(W) - L_S(W)\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\ell\left(W^{(i)}, Z_i\right) - \ell\left(W, Z_i\right)\right].$$
Proof. Since $W^{(i)}$ is independent of $Z_i$ and has the same distribution as $W$, we have $\mathbb{E}\left[\ell\left(W^{(i)}, Z_i\right)\right] = \mathbb{E}\left[L_\mu(W)\right]$ for each $i$, while $\mathbb{E}\left[L_S(W)\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\ell\left(W, Z_i\right)\right]$. Averaging the first identity over $i$ and subtracting the second gives the result. □
In many of the results in this paper, we will use one of the two following assumptions:
Assumption 1.
The loss function satisfies
$$\Lambda_{\ell(\bar{W}, \bar{Z})}(\lambda) \triangleq \log \mathbb{E}\left[e^{\lambda\left(\ell(\bar{W}, \bar{Z}) - \mathbb{E}\left[\ell(\bar{W}, \bar{Z})\right]\right)}\right] \le \psi(\lambda)$$
for $\lambda \in [0, b)$, $b > 0$, where $\bar{W}$ and $\bar{Z}$ are taken independently from the marginals for $W$ and $Z$, respectively, and $\psi$ is a convex function with $\psi(0) = \psi'(0) = 0$.
The next assumption is a special case of the previous one with $\psi(\lambda) = \frac{\lambda^2 \sigma^2}{2}$:
Assumption 2.
The loss function is sub-Gaussian with parameter $\sigma$ in the sense that
$$\log \mathbb{E}\left[e^{\lambda\left(\ell(\bar{W}, \bar{Z}) - \mathbb{E}\left[\ell(\bar{W}, \bar{Z})\right]\right)}\right] \le \frac{\lambda^2 \sigma^2}{2} \quad \text{for all } \lambda \in \mathbb{R}.$$
Theorem 2
(Theorem 2 in [3]). Under Assumption 1, we have
$$\left|\overline{\mathrm{gen}}(\mu, \mathcal{A})\right| \le \frac{1}{n}\sum_{i=1}^n \psi^{*-1}\left(I\left(W; Z_i\right)\right),$$
where
$$\psi^{*-1}(y) = \inf_{\lambda \in (0, b)} \frac{y + \psi(\lambda)}{\lambda}.$$
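For example, under Assumption 2 we can take $\psi(\lambda) = \frac{\lambda^2 \sigma^2}{2}$, for which the infimum is achieved at $\lambda = \sqrt{2y/\sigma^2}$, giving
$$\psi^{*-1}(y) = \sqrt{2\sigma^2 y} \qquad \text{and hence} \qquad \left|\overline{\mathrm{gen}}(\mu, \mathcal{A})\right| \le \frac{1}{n}\sum_{i=1}^n \sqrt{2\sigma^2 I\left(W; Z_i\right)},$$
which is the individual-sample form of the bound from [2,3].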
For a continuously differentiable and strictly convex function $F$, we define the associated Bregman divergence [17,18] between two points $w_1$ and $w_2$ to be
$$D_F\left(w_1, w_2\right) = F\left(w_1\right) - F\left(w_2\right) - \left\langle \nabla F\left(w_2\right), w_1 - w_2 \right\rangle,$$
where $\langle \cdot, \cdot \rangle$ denotes the usual inner product.
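As a quick numerical illustration (ours, not from the original text), the following snippet evaluates $D_F$ directly from this definition and checks the squared-loss special case $F(w) = \|w\|^2$ discussed in Section 2.1:

```python
import numpy as np

def bregman_divergence(F, grad_F, w1, w2):
    # D_F(w1, w2) = F(w1) - F(w2) - <grad F(w2), w1 - w2>
    return F(w1) - F(w2) - np.dot(grad_F(w2), w1 - w2)

# F(w) = ||w||^2 recovers the squared loss: D_F(w1, w2) = ||w1 - w2||^2
F = lambda w: float(np.dot(w, w))
grad_F = lambda w: 2.0 * w
w1, w2 = np.array([1.0, 2.0]), np.array([0.0, 1.0])
assert np.isclose(bregman_divergence(F, grad_F, w1, w2), np.sum((w1 - w2) ** 2))
```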
2. Distributed Learning and Model Aggregation
Now suppose that there are $K$ nodes, each having $n$ samples. Each node $k$ has a dataset $S_k = \left(Z_{k,1}, \ldots, Z_{k,n}\right)$, with each $Z_{k,i}$ taken i.i.d. from $\mu$. We use $S = \left(S_1, \ldots, S_K\right)$ to denote the entire dataset of size $nK$. Each node locally trains a model $W_k = \mathcal{A}_k\left(S_k\right)$ with algorithm $\mathcal{A}_k$. After each node locally trains its model, the models $W_1, \ldots, W_K$ are then combined to form the final model $W$ using an aggregation algorithm (see Figure 1). In this section, we will assume that $W_k \in \mathbb{R}^d$ and that the aggregation is performed by simple averaging (i.e., $W = \frac{1}{K}\sum_{k=1}^K W_k$). Define $\mathcal{A}$ to be the total algorithm from the data $S$ to the final weights such that $W = \mathcal{A}(S)$. In this section, if we say that Assumption 1 or 2 holds, we mean that it holds for each algorithm $\mathcal{A}_k$. As in Theorem 1, we use $S^{(k,i)}$ to denote the entire dataset $S$ with sample $Z_{k,i}$ replaced by an independent copy $\tilde{Z}_{k,i}$, and similarly, we use $S_k^{(i)}$ to refer to the sub-dataset at node $k$, with sample $Z_{k,i}$ replaced by an independent copy $\tilde{Z}_{k,i}$.
Figure 1.
The distributed learning setting with model aggregation.
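A minimal sketch of this one-shot pipeline follows (an illustration under assumed Gaussian data, with the sample mean standing in for a generic local algorithm $\mathcal{A}_k$):

```python
import numpy as np

rng = np.random.default_rng(1)

def local_algorithm(S_k):
    # A_k: each node trains on its local data only; here the ERM for the
    # Gaussian-mean problem (the sample mean) stands in for a generic learner.
    return S_k.mean(axis=0)

def aggregate(local_models):
    # One-shot aggregation by simple averaging: W = (1/K) sum_k W_k
    return np.mean(local_models, axis=0)

K, n, d = 10, 50, 3
S = rng.normal(0.0, 1.0, size=(K, n, d))   # node k holds the dataset S[k]
W = aggregate([local_algorithm(S[k]) for k in range(K)])
```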
Theorem 3.
Suppose that $\ell(w, z)$ is a convex function of $w$ for each $z$ and that $\mathcal{A}_k$ represents the empirical risk minimization algorithm on local dataset $S_k$ in the sense that
$$W_k = \mathcal{A}_k\left(S_k\right) = \operatorname*{arg\,min}_{w} L_{S_k}(w).$$
Then, we have
$$\mathbb{E}\left[L_\mu(W) - L_S(W)\right] \le \frac{1}{K}\sum_{k=1}^K \mathbb{E}\left[L_\mu\left(W_k\right) - L_{S_k}\left(W_k\right)\right].$$
Proof.
$$\begin{aligned} \mathbb{E}\left[L_\mu(W) - L_S(W)\right] &= \mathbb{E}\left[L_\mu\left(\frac{1}{K}\sum_{k=1}^K W_k\right)\right] - \frac{1}{K}\sum_{k=1}^K \mathbb{E}\left[L_{S_k}(W)\right] \\ &\le \frac{1}{K}\sum_{k=1}^K \mathbb{E}\left[L_\mu\left(W_k\right)\right] - \frac{1}{K}\sum_{k=1}^K \mathbb{E}\left[L_{S_k}(W)\right] && (4) \\ &\le \frac{1}{K}\sum_{k=1}^K \mathbb{E}\left[L_\mu\left(W_k\right) - L_{S_k}\left(W_k\right)\right]. && (5) \end{aligned}$$
In the above display, Equation (4) follows by the convexity of ℓ via Jensen’s inequality, and Equation (5) follows by minimizing the empirical risk over each node’s local dataset, which exactly corresponds to what each node’s local algorithm does. □
While Theorem 3 seems to be a nice characterization of the generalization bounds for the aggregate model (in that the aggregate generalization error cannot be any larger than the average of the generalization errors over each node), it does not offer any improvement in the expected generalization error that one might expect when given $nK$ total samples instead of just $n$ samples. A naive application of the generalization bounds from Theorem 2, followed by the data processing inequality $I\left(W; Z_{k,i}\right) \le I\left(W_k; Z_{k,i}\right)$, runs into the same problem.
2.1. Improved Bounds
In this subsection, we demonstrate bounds on the expected generalization error that remedy the above shortcomings. In particular, we would like to demonstrate the following two properties:
- (1)
- The bound should decay with the number of nodes K in order to take advantage of the total dataset from all K nodes.
- (2)
- The bound should be in terms of the information theoretic quantities $I\left(W_k; S_k\right)$, which can represent (or be bounded from above by) the capacities of the channels over which the nodes are communicating. This can, for example, represent a communication or local differential privacy constraint for each node.
At a high level, we will improve on the bound from Theorem 3 by taking into account the fact that a small change in $S_k$ will only change $W$ by a fraction $\frac{1}{K}$ of the amount that it will change $W_k$. In the case where W is a linear or location model, and the loss ℓ is a Bregman divergence, we can obtain an upper bound on the expected generalization error that satisfies properties (1) and (2) as follows:
Theorem 4
(Linear or Location Models with Bregman Loss). Suppose the loss ℓ takes the form of one of the following:
- (i)
- $\ell(w, z) = D_F(z, w)$ (a location model);
- (ii)
- $\ell(w, (x, y)) = D_F\left(y, \langle w, x \rangle\right)$ with data $z = (x, y)$ (a linear model).
In addition, assume that Assumption 1 holds. Then, we have
$$\left|\mathbb{E}\left[L_\mu(W) - L_S(W)\right]\right| \le \frac{1}{K^2}\sum_{k=1}^K \frac{1}{n}\sum_{i=1}^n \psi^{*-1}\left(I\left(W_k; Z_{k,i}\right)\right)$$
and
$$\left|\mathbb{E}\left[L_\mu(W) - L_S(W)\right]\right| \le \frac{1}{K^2}\sum_{k=1}^K \psi^{*-1}\left(\frac{I\left(W_k; S_k\right)}{n}\right).$$
Proof.
Here, we restrict our attention to case (ii), but the two cases have nearly identical proofs. Using Theorem 1, we have
In Equation (7), we use $W_k^{(i)}$ to denote $\mathcal{A}_k\left(S_k^{(i)}\right)$. Equation (6) follows from the linearity of the inner product and cancels the higher order terms, which have the same expected values. The key step in Equation (7) then follows by noting that $W^{(k,i)}$ only differs from $W$ in the submodel coming from node $k$, which is multiplied by a factor of $\frac{1}{K}$ when averaging all of the submodels. By backing out of Equation (6) and re-adding the appropriate canceled terms, we get
$$\mathbb{E}\left[L_\mu(W) - L_S(W)\right] = \frac{1}{K}\cdot\frac{1}{K}\sum_{k=1}^K \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\ell\left(W_k^{(i)}, Z_{k,i}\right) - \ell\left(W_k, Z_{k,i}\right)\right].$$
By applying Theorem 2, this yields
$$\left|\mathbb{E}\left[L_\mu(W) - L_S(W)\right]\right| \le \frac{1}{K^2}\sum_{k=1}^K \frac{1}{n}\sum_{i=1}^n \psi^{*-1}\left(I\left(W_k; Z_{k,i}\right)\right).$$
Then, by noting that $\psi^{*-1}$ is non-decreasing and concave, we have
$$\frac{1}{n}\sum_{i=1}^n \psi^{*-1}\left(I\left(W_k; Z_{k,i}\right)\right) \le \psi^{*-1}\left(\frac{1}{n}\sum_{i=1}^n I\left(W_k; Z_{k,i}\right)\right).$$
Using the property that conditioning decreases entropy yields
$$\sum_{i=1}^n I\left(W_k; Z_{k,i}\right) \le I\left(W_k; S_k\right),$$
and we have
$$\left|\mathbb{E}\left[L_\mu(W) - L_S(W)\right]\right| \le \frac{1}{K^2}\sum_{k=1}^K \psi^{*-1}\left(\frac{I\left(W_k; S_k\right)}{n}\right),$$
as desired. □
The result in Theorem 4 is general enough to apply to many problems of interest. For example, if $F(w) = \|w\|^2$, then the Bregman divergence gives the ubiquitous squared loss (i.e., $D_F\left(w_1, w_2\right) = \left\|w_1 - w_2\right\|^2$). For a comprehensive list of realizable loss functions, the interested reader is referred to [19]. Using this $F$, Theorem 4 can be applied to ordinary least squares regression, which we will examine in greater detail in Section 4. Other regression models such as logistic regression have loss functions that cannot be described with a Bregman divergence without the inclusion of additional nonlinearity. However, the result in Theorem 4 is agnostic to the algorithm that each node uses to fit its individual model. In this way, each node could fit a logistic model to its data, and the total aggregate model would then be an average over these logistic models. Theorem 4 would still control the expected generalization error for the aggregate model with the extra $\frac{1}{K}$ factor. However, critically, the upper bound would only be for the generalization error that is with respect to a loss of the form in Theorem 4, such as quadratic loss.
In order to show that the dependence on the number of nodes $K$ from Theorem 4 is tight for certain problems, consider the following example from [3]. Suppose that $Z \sim \mathcal{N}\left(\theta, \sigma^2\right)$ and $\ell(w, z) = (w - z)^2$, so that we are trying to learn the mean of a Gaussian distribution. An obvious algorithm for each node to use is simple averaging of its dataset:
$$W_k = \frac{1}{n}\sum_{i=1}^n Z_{k,i}.$$
For this algorithm, it can be shown that
$$I\left(W_k; Z_{k,i}\right) = \frac{1}{2}\log\left(\frac{n}{n-1}\right),$$
and that the loss satisfies Assumption 1 with a sub-exponential bound $\psi$ on its cumulant generating function.
See Section IV.A in [3] for further details. If we apply the existing information theoretic bounds from Theorem 2 in an end-to-end way, such as in the approach from [14], we would get a bound on the expected generalization error that is on the order of $1/\sqrt{nK}$.
However, for this choice of algorithm at each node, the true expected generalization error can be computed to be
$$\mathbb{E}\left[L_\mu(W) - L_S(W)\right] = \frac{2\sigma^2}{nK}.$$
By applying our new bound from Theorem 4, we get a bound on the order of $\frac{1}{K\sqrt{n}}$, which shows the correct dependence on $K$ and improves upon the result from prior information theoretic methods.
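A small Monte Carlo check of this example (our illustration, using the scalar instance with $\theta = 0$ and $\sigma = 1$) confirms that the true expected generalization error of the aggregated model decays like $1/K$:

```python
import numpy as np

rng = np.random.default_rng(2)

def gen_error(K, n, sigma=1.0, trials=20_000):
    # Monte Carlo estimate of E[L_mu(W) - L_S(W)] for the aggregated model
    # W = (1/K) sum_k W_k, with W_k the sample mean at node k.
    gaps = np.empty(trials)
    for t in range(trials):
        S = rng.normal(0.0, sigma, size=(K, n))
        W = S.mean()                       # equals the average of the K node means
        emp = np.mean((W - S) ** 2)        # empirical risk over all nK samples
        pop = W ** 2 + sigma ** 2          # E[(W - Z)^2] in closed form
        gaps[t] = pop - emp
    return gaps.mean()

for K in (1, 2, 4, 8):
    print(f"K={K}: estimated gen ~ {gen_error(K, n=10):.4f}, "
          f"2*sigma^2/(nK) = {2 / (10 * K):.4f}")
```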
2.2. General Models and Losses
In this section, we briefly describe some results that hold for more general classes of models and loss functions, such as deep neural networks and other nonlinear models:
Theorem 5
(Lipschitz Continuous Loss). Suppose that $\ell(w, z)$ is Lipschitz continuous as a function of $w$ in the sense that
$$\left|\ell\left(w_1, z\right) - \ell\left(w_2, z\right)\right| \le \rho\left\|w_1 - w_2\right\|$$
for any $z$ and that $\mathbb{E}\left[\left\|W_k\right\|\right] \le B$ for each $k$. Then, we have
$$\left|\mathbb{E}\left[L_\mu(W) - L_S(W)\right]\right| \le \frac{2\rho B}{K}.$$
Proof.
Starting with Theorem 1, we have
$$\begin{aligned} \left|\mathbb{E}\left[L_\mu(W) - L_S(W)\right]\right| &= \left|\frac{1}{nK}\sum_{k=1}^K\sum_{i=1}^n \mathbb{E}\left[\ell\left(W^{(k,i)}, Z_{k,i}\right) - \ell\left(W, Z_{k,i}\right)\right]\right| \\ &\le \frac{\rho}{nK}\sum_{k=1}^K\sum_{i=1}^n \mathbb{E}\left\|W^{(k,i)} - W\right\| && (8) \\ &= \frac{\rho}{nK^2}\sum_{k=1}^K\sum_{i=1}^n \mathbb{E}\left\|W_k^{(i)} - W_k\right\| \\ &\le \frac{\rho}{nK^2}\sum_{k=1}^K\sum_{i=1}^n \left(\mathbb{E}\left\|W_k^{(i)}\right\| + \mathbb{E}\left\|W_k\right\|\right) && (9) \\ &\le \frac{2\rho B}{K}, && (10) \end{aligned}$$
where Equation (8) follows from Lipschitz continuity, Equation (9) uses the triangle inequality, and Equation (10) is assumed. □
The bound in Theorem 5 is not in terms of the information theoretic quantities $I\left(W_k; S_k\right)$, but it does show that an upper bound with the improved $1/K$ dependence can be obtained for much more general loss functions and arbitrary nonlinear models.
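As a simple illustration (ours, not from the original text), the norm loss $\ell(w, z) = \|w - z\|$ satisfies this Lipschitz condition with $\rho = 1$, since by the triangle inequality
$$\left|\,\left\|w_1 - z\right\| - \left\|w_2 - z\right\|\,\right| \le \left\|w_1 - w_2\right\|$$
for every $z$.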
2.3. Privacy and Communication Constraints
Both communication and local differential privacy constraints can be thought of as special cases of mutual information constraints. Motivated by this observation, Theorem 4 immediately implies corollaries for these types of systems:
Corollary 1
(Privacy Constraints). Suppose each node’s algorithm is an ε-local differentially private mechanism in the sense that
$$\sup_{A}\,\sup_{s_k, s_k'} \frac{\Pr\left(W_k \in A \mid S_k = s_k\right)}{\Pr\left(W_k \in A \mid S_k = s_k'\right)} \le e^{\varepsilon}$$
for each $k$. Then, for losses ℓ of the form in Theorem 4, and under Assumption 2, we have
$$\left|\mathbb{E}\left[L_\mu(W) - L_S(W)\right]\right| \le \frac{1}{K}\sqrt{\frac{2\sigma^2 \min\left(\varepsilon, 2\varepsilon^2\right)}{n}}.$$
Proof.
Note that
$$I\left(W_k; S_k\right) \le \max_{s_k, s_k'} D\left(P_{W_k \mid S_k = s_k} \,\middle\|\, P_{W_k \mid S_k = s_k'}\right) \le \varepsilon,$$
since the privacy condition bounds the log-likelihood ratio pointwise by $\varepsilon$. Similarly, it is true that
$$I\left(W_k; S_k\right) \le \varepsilon\left(e^{\varepsilon} - 1\right) \le 2\varepsilon^2,$$
where the last inequality is only true for $\varepsilon \le 1$. Putting these two displays together gives $I\left(W_k; S_k\right) \le \min\left(\varepsilon, 2\varepsilon^2\right)$, and the result follows from Theorem 4. □
Corollary 2
(Communication Constraints). Suppose each node can only transmit $B$ bits of information to the model aggregator, meaning that each $W_k$ can only take $2^B$ distinct possible values. Then, for losses ℓ of the form in Theorem 4, and under Assumption 2, this yields
$$\left|\mathbb{E}\left[L_\mu(W) - L_S(W)\right]\right| \le \frac{1}{K}\sqrt{\frac{2\sigma^2 B \log 2}{n}}.$$
Proof.
The corollary follows immediately from Theorem 4 and
$$I\left(W_k; S_k\right) \le H\left(W_k\right) \le B \log 2.$$
□
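Taking the two corollaries as reconstructed above at face value, the bounds are straightforward to evaluate numerically; the following helper functions (hypothetical names, assuming Assumption 2 with parameter $\sigma^2$ and mutual information measured in nats) make the dependence on $K$, $n$, $B$, and $\varepsilon$ explicit:

```python
import numpy as np

def gen_bound_communication(K, n, sigma2, B):
    # Corollary 2 as stated above: I(W_k; S_k) <= H(W_k) <= B*log(2) nats,
    # and under Assumption 2, psi^{*-1}(y) = sqrt(2 * sigma2 * y).
    return np.sqrt(2.0 * sigma2 * B * np.log(2.0) / n) / K

def gen_bound_privacy(K, n, sigma2, eps):
    # Corollary 1 as stated above: I(W_k; S_k) <= min(eps, 2*eps^2) nats.
    return np.sqrt(2.0 * sigma2 * min(eps, 2.0 * eps ** 2) / n) / K

print(gen_bound_communication(K=10, n=100, sigma2=1.0, B=32))
print(gen_bound_privacy(K=10, n=100, sigma2=1.0, eps=0.5))
```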
3. Iterative Algorithms
We now turn to considering more complicated multi-round and iterative algorithms. In this setting, after $T$ rounds, there is a sequence of weights $W^1, \ldots, W^T$, and the final model $W = f\left(W^1, \ldots, W^T\right)$ is a function of that sequence, where $f$ gives a linear combination of the $T$ vectors $W^1, \ldots, W^T$. The function $f$ could represent, for example, averaging over the $T$ iterates, choosing the last iterate, or some weighted average over the iterates. For each round $t$, each node $k$ produces an updated model $W_k^t$ based on its local dataset $S_k$ and the previous timestep’s global model $W^{t-1}$. The global model is then updated via an average over all $K$ updated submodels:
$$W^t = \frac{1}{K}\sum_{k=1}^K W_k^t.$$
The particular example that we will consider is that of a distributed SGD, where each node constructs its updated model by taking one or more gradient steps starting from $W^{t-1}$ with respect to random minibatches of its local data. Our model is general enough to account for multiple local gradient steps, as are used in so-called federated learning [5,6,7], as well as noisy versions of SGDs, such as in [20,21]. If only one local gradient step is taken for each iteration, then the update rule for this example could be written as
$$W_k^t = W^{t-1} - \eta_t \nabla_w \ell\left(W^{t-1}, Z_{k,t}\right) + \xi_t, \qquad (11)$$
where $Z_{k,t}$ is a data point (or minibatch) sampled from $S_k$ on timestep $t$, $\eta_t$ is the learning rate, and $\xi_t$ is some potential added noise. We assume that the data points are sampled without replacement so that the samples $Z_{k,t}$ are distinct across different values of $t$. We will also assume, for notational simplicity, that each minibatch consists of a single data point, although the more general result follows in a straightforward manner.
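The following sketch (an illustration; the stepsize schedule, quadratic loss, and sample indexing are assumptions made for the example) implements the update rule above, including optional multiple local steps and added Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(3)

def noisy_distributed_sgd(S, grad, T, eta, noise_std, local_steps=1):
    # At round t, each node k starts from the previous global model W^{t-1},
    # takes one or more local gradient steps on its own data (optionally with
    # added Gaussian noise), and the server averages the K updated submodels.
    K, n, d = S.shape
    W = np.zeros(d)
    iterates = []
    for t in range(T):
        local_models = []
        for k in range(K):
            w = W.copy()
            for j in range(local_steps):
                # samples stay distinct across rounds while T*local_steps <= n
                z = S[k, (t * local_steps + j) % n]
                w = w - eta(t) * grad(w, z) + noise_std * rng.standard_normal(d)
            local_models.append(w)
        W = np.mean(local_models, axis=0)    # W^t = (1/K) sum_k W_k^t
        iterates.append(W)
    return iterates

grad = lambda w, z: 2.0 * (w - z)            # gradient of the quadratic loss
S = rng.normal(1.0, 1.0, size=(5, 40, 2))    # K=5 nodes, n=40 samples, d=2
iterates = noisy_distributed_sgd(S, grad, T=20, eta=lambda t: 0.5 / (t + 1),
                                 noise_std=0.01)
W_final = np.mean(iterates, axis=0)          # f: here, an average over the T iterates
```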
For this type of iterative algorithm, we will consider the following timestep-averaged empirical risk quantity:
$$\frac{1}{T}\sum_{t=1}^T L_{S^t}\left(W^t\right),$$
where $S^t$ denotes the samples used up through timestep $t$, and the corresponding generalization error, expressed as
$$\overline{\mathrm{gen}}_T(\mu, \mathcal{A}) = \frac{1}{T}\sum_{t=1}^T \mathbb{E}\left[L_\mu\left(W^t\right) - L_{S^t}\left(W^t\right)\right]. \qquad (12)$$
Note that Equation (12) is slightly different from the end-to-end generalization error that we would get from considering the final model $W$ and the whole dataset $S$. It is instead an average over the generalization error we would get from each model $W^t$, stopping at iteration $t$. We perform this so that when we apply the leave-one-out expansion from Theorem 1, we do not have to account for the dependence of $W^t$ on the samples used at earlier timesteps. Since we expect the generalization error to decrease as we use more samples, this quantity should result in a more conservative upper bound and be a reasonable surrogate object to study. The next bound follows as a corollary to Theorem 4:
Corollary 3.
For losses ℓ of the form in Theorem 4, and under Assumption 2 (for each timestep $t$ and node $k$), we have
$$\left|\overline{\mathrm{gen}}_T(\mu, \mathcal{A})\right| \le \frac{1}{T}\sum_{t=1}^T \frac{1}{K^2}\sum_{k=1}^K \sqrt{\frac{2\sigma^2\, I\left(W_k^t; S_k^t\right)}{n_t}},$$
where $S_k^t$ denotes the samples at node $k$ used up through timestep $t$ and $n_t$ is the number of such samples.
In the particular example described in Equation (11), where Gaussian noise is added to each iterate, Corollary 3 yields the following. As in [20], we assume that the updates are magnitude-bounded (i.e., $\left\|\nabla_w \ell(w, z)\right\| \le L$), the stepsizes satisfy $\eta_t \le \frac{c}{t}$ for a constant $c$, and that $\xi_t \sim \mathcal{N}\left(0, \sigma_t^2 I_d\right)$ with $\sigma_t^2 = \eta_t$:
Corollary 4.
Under the assumptions above, we have
$$\left|\overline{\mathrm{gen}}_T(\mu, \mathcal{A})\right| \le \frac{\sigma L}{TK}\sum_{t=1}^T \sqrt{\frac{c\left(1 + \log t\right)}{t}} = O\left(\frac{\sigma L \sqrt{c \log T}}{K\sqrt{T}}\right).$$
Proof.
The mutual information terms in Corollary 3 satisfy
$$\begin{aligned} I\left(W_k^t; S_k^t\right) &\le I\left(W_k^1, \ldots, W_k^t; S_k^t\right) && (13) \\ &= \sum_{t'=1}^t I\left(W_k^{t'}; S_k^t \,\middle|\, W_k^1, \ldots, W_k^{t'-1}\right) && (14) \\ &= \sum_{t'=1}^t I\left(W_k^{t'}; Z_{k,t'} \,\middle|\, W_k^1, \ldots, W_k^{t'-1}\right) && (15) \\ &\le \sum_{t'=1}^t \frac{d}{2}\log\left(1 + \frac{\eta_{t'}^2 L^2}{d\,\sigma_{t'}^2}\right) && (16) \\ &\le \sum_{t'=1}^t \frac{\eta_{t'}^2 L^2}{2\sigma_{t'}^2}. && (17) \end{aligned}$$
Equation (13) follows from the data-processing inequality, Equation (14) is the chain rule for mutual information, and Equation (15) follows from the independence of the samples $Z_{k,t'}$ across timesteps. Equation (16) is due to the capacity of the additive white Gaussian noise channel, and Equation (17) just uses the approximation $\log(1 + x) \le x$. Thus, we have
$$I\left(W_k^t; S_k^t\right) \le \sum_{t'=1}^t \frac{\eta_{t'} L^2}{2} \le \frac{cL^2}{2}\left(1 + \log t\right).$$
□
4. Simulations
We simulated a distributed linear regression example in order to demonstrate the improvement in our bounds over the existing information-theoretic bounds. To accomplish this, we generated $n$ synthetic datapoints at each of $K$ different nodes for various values of $K$. Each datapoint consisted of a pair $\left(x_i, y_i\right)$, where $y_i = \left\langle w^*, x_i \right\rangle + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}\left(0, \sigma^2\right)$, and $w^*$ was the randomly generated true weight that was common to all datapoints. Each node constructed an estimate of $w^*$ using the well-known normal equations which minimize the quadratic loss (i.e., $W_k = \left(X_k^\top X_k\right)^{-1} X_k^\top y_k$). The aggregate model was then the average $W = \frac{1}{K}\sum_{k=1}^K W_k$. In order to estimate the old and new information-theoretic generalization bounds (i.e., the bounds from Theorems 2 and 4, respectively), this procedure was repeated $M$ times, and the datapoint and model values were binned in order to estimate the mutual information quantities. The value of $M$ was increased until the mutual information estimates were no longer particularly sensitive to the number and widths of the bins. In order to estimate the true generalization error, the expectations for both the population risk and the dataset were estimated by Monte Carlo experimentation. The results can be seen in Figure 2, where it is evident that the new information theoretic bound is much closer to the true expected generalization error and decays with an improved rate as a function of $K$.
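For reference, a condensed version of one repetition of this experiment might look as follows (a sketch under assumed distributions for $x_i$, $\epsilon_i$, and $w^*$; the binned mutual information estimation over $M$ repetitions is omitted):

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_once(K, n, d=5, noise=0.5):
    # One repetition: draw a true weight, give each node n (x, y) pairs,
    # fit per-node OLS via the normal equations, then average the K models.
    w_star = rng.normal(size=d)                  # assumed prior for w*
    W_locals = []
    for _ in range(K):
        X = rng.normal(size=(n, d))
        y = X @ w_star + noise * rng.normal(size=n)
        W_locals.append(np.linalg.solve(X.T @ X, X.T @ y))   # normal equations
    return np.mean(W_locals, axis=0), w_star

W, w_star = simulate_once(K=8, n=50)
print(np.linalg.norm(W - w_star))
```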

Figure 2.
Information-theoretic upper bounds and expected generalization error for a simulated linear regression example in linear (top) and log (bottom) scales.
Author Contributions
Conceptualization, L.P.B., A.D. and H.V.P.; Formal analysis, L.P.B., A.D. and H.V.P.; Investigation, L.P.B., A.D. and H.V.P.; Methodology, L.P.B., A.D. and H.V.P.; Supervision, H.V.P.; Writing—original draft, L.P.B.; Writing—review & editing, L.P.B., A.D. and H.V.P. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by National Science Foundation grant number CCF-1908308.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Russo, D.; Zou, J. How Much Does Your Data Exploration Overfit? Controlling Bias via Information Usage. IEEE Trans. Inf. Theory 2020, 66, 302–323. [Google Scholar] [CrossRef]
- Xu, A.; Raginsky, M. Information-Theoretic Analysis of Generalization Capability of Learning Algorithms. Adv. Neural Inf. Process. Syst. 2017, 30, 2521–2530. [Google Scholar]
- Bu, Y.; Zou, S.; Veeravalli, V.V. Tightening Mutual Information-Based Bounds on Generalization Error. IEEE J. Sel. Areas Inf. Theory 2020, 1, 121–130. [Google Scholar] [CrossRef]
- Aminian, G.; Bu, Y.; Wornell, G.W.; Rodrigues, M.R. Tighter Expected Generalization Error Bounds via Convexity of Information Measures. In Proceedings of the 2022 IEEE International Symposium on Information Theory (ISIT), Espoo, Finland, 26 June–1 July 2022. [Google Scholar]
- McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017. [Google Scholar]
- Konecný, J.; McMahan, H.B.; Ramage, D.; Richtárik, P. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv 2016, arXiv:1610.02527. [Google Scholar]
- Konečný, J.; McMahan, H.B.; Yu, F.X.; Richtarik, P.; Suresh, A.T.; Bacon, D. Federated Learning: Strategies for Improving Communication Efficiency. arXiv 2016, arXiv:1610.05492. [Google Scholar]
- Lin, Y.; Han, S.; Mao, H.; Wang, Y.; Dally, W.J. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Barnes, L.P.; Inan, H.A.; Isik, B.; Ozgur, A. rTop-k: A Statistical Estimation Approach to Distributed SGD. IEEE J. Sel. Areas Inf. Theory 2020, 1, 897–907. [Google Scholar] [CrossRef]
- Warner, S.L. Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias. J. Am. Stat. Assoc. 1965, 60, 63–69. [Google Scholar] [CrossRef] [PubMed]
- Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography Conference; Halevi, S., Rabin, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
- Kasiviswanathan, S.P.; Lee, H.K.; Nissim, K.; Raskhodnikova, S.; Smith, A. What Can We Learn Privately? SIAM J. Comput. 2011, 40, 793–826. [Google Scholar] [CrossRef]
- Cuff, P.; Yu, L. Differential Privacy as a Mutual Information Constraint. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 43–54. [Google Scholar]
- Yagli, S.; Dytso, A.; Poor, H.V. Information-Theoretic Bounds on the Generalization Error and Privacy Leakage in Federated Learning. In Proceedings of the 2020 IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Atlanta, GA, USA, 26–29 May 2020; pp. 1–5. [Google Scholar] [CrossRef]
- Barnes, L.P.; Dytso, A.; Poor, H.V. Improved Information Theoretic Generalization Bounds for Distributed and Federated Learning. arXiv 2022, arXiv:2202.02423. [Google Scholar]
- Shalev-Shwartz, S.; Shamir, O.; Srebro, N.; Sridharan, K. Learnability, Stability and Uniform Convergence. J. Mach. Learn. Res. 2010, 11, 2635–2670. [Google Scholar]
- Bregman, L.M. The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
- Dytso, A.; Fauß, M.; Poor, H.V. Bayesian Risk With Bregman Loss: A Cramér–Rao Type Bound and Linear Estimation. IEEE Trans. Inf. Theory 2022, 68, 1985–2000. [Google Scholar] [CrossRef]
- Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J.; Lafferty, J. Clustering with Bregman Divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
- Pensia, A.; Jog, V.; Loh, P.L. Generalization Error Bounds for Noisy, Iterative Algorithms. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 546–550. [Google Scholar]
- Wang, H.; Gao, R.; Calmon, F.P. Generalization Bounds for Noisy Iterative Algorithms Using Properties of Additive Noise Channels. arXiv 2021, arXiv:2102.02976. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).