## 1. Introduction

Nowadays, more and more intelligent devices, such as smartphones, wearable devices and autonomous vehicles, are widely used [1,2], generating a wealth of data. The generated data can be used to develop deep learning models powering applications such as speech recognition, face detection and text entry. With the increasing amount of generated data and the increasing computing power of smart devices [3], recent studies have explored distributed training of models at these edge devices [4,5]. Considering that centralized storage of data in a central server, e.g., in the cloud, is often not feasible due to the privacy of consumer data [6], federated learning [7] has been advocated as an alternative setting.

Federated learning [7] can be viewed as an extension of conventional distributed deep learning [8], as it aims to train a high-quality shared model while keeping data distributed over clients. That is, each client computes an updated model based on its own locally collected data (which is not shared with others). After the model parameters are computed locally, they are sent to the server, where an updated global model is computed by combining the received local model parameters.

Federated learning plays a critical role in various privacy-sensitive scenarios, such as powering intelligent applications like keyboard prediction and emoji prediction on smartphones [9,10], optimizing patients' healthcare in hospitals [11], and helping Internet of Things (IoT) systems adapt to environmental changes [12]. Specifically, assume there are $Q$ participating clients over which the data is partitioned, with $P_q$ being the set of indices of data points at client $q$, and $n_q = |P_q|$. A federated learning system optimizes a global model via the following steps:

- (1)
All participating clients download the latest global model parameter vector $w'$.

- (2)
Each client improves the downloaded model based on its local data using, e.g., stochastic gradient descent (SGD) [13] with a fixed learning rate $\eta$: $w_q = w' - \eta g_q$, where $g_q = \nabla F_q(w')$ and $F_q(w)$ is the local objective function of the $q$th device.

- (3)
The improved model parameters $w_q$ are sent from each client to the server.

- (4)
The server aggregates all updated parameters to construct an enhanced global model $w^*$.

Current studies on federated learning and its variants address a series of challenges. In [14], the model is compressed in both the upstream and downstream links to improve communication efficiency. In [15], user-level differential privacy is demonstrated in large recurrent language models. In [16], algorithms are proposed that achieve both communication efficiency and differential privacy. In [17], a decentralized training framework is proposed to overcome issues related to heterogeneous devices. In [18], a general oracle-based framework for parallel stochastic optimization is suggested. In [19], a pluralistic solution is investigated to address cyclic patterns in the data samples during federated training. In these state-of-the-art studies, at step (4) above, the updated model parameters are calculated by directly averaging the models received from all clients via $w^* = \sum_{q=1}^{Q}\frac{n_q}{n}w_q$, where $n = \sum_{q=1}^{Q}n_q$. However, after each client computes $w_q$ by optimizing its local objective function $F_q$, there is no evidence that $w^*$ will be close to the minimizer of the global objective function $\sum_{q}\frac{n_q}{n}F_q(w)$. This raises a question: is averaging model parameters a reasonable approach in federated learning? With this question we begin our research.
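The weighted averaging used in step (4) can be sketched in a few lines of NumPy. This is a minimal illustration of $w^* = \sum_{q=1}^{Q}\frac{n_q}{n}w_q$ with toy parameter vectors and data-set sizes, not the paper's experimental setup:

```python
import numpy as np

def fedavg_aggregate(client_params, client_sizes):
    """Weighted average of client parameter vectors: w* = sum_q (n_q / n) * w_q."""
    n = sum(client_sizes)  # n = total number of data points across all clients
    return sum((n_q / n) * w_q for w_q, n_q in zip(client_params, client_sizes))

# Toy example: Q = 3 clients holding 10, 30 and 60 data points.
params = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [10, 30, 60]
w_star = fedavg_aggregate(params, sizes)  # -> array([4., 5.])
```

Clients with more local data receive proportionally more weight, which is exactly the property whose justification the rest of this paper examines.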

Mutual information (MI) is a powerful statistic for measuring the degree of relatedness [20]. MI can detect any kind of relationship between random variables, even relationships that do not manifest themselves in the covariance [21], and it has a straightforward interpretation as the amount of shared information between datasets (measured in, for example, bits) [22]. In this paper, we treat each client's computed parameter vector $w_q$ as a random vector because of the stochastic nature of SGD, and repeat each client's training within one training epoch many times to obtain a series of parameter samples. We first calculate the MI between different clients' parameters in two learning tasks: (1) a convolutional neural network (CNN) [23] designed for a classification task, and (2) a recurrent neural network (RNN) [24] on a regression task. We estimate MI at different training phases using two methods: one based on the closed-form MI formula under a multi-dimensional Gaussian assumption, and the other a continuous-continuous MI estimator based on k-nearest-neighbor (KNN) distances described in [21]. The results confirm the correlation between the model parameters at different clients and show an increasing trend of MI. We further calculate the distance variation between different clients' parameters using three standard distance metrics: Euclidean distance [25], Manhattan distance [26], and Chebyshev distance [27]. By comparing the distance variation with the MI variation, we show that the parameters become more correlated while not getting closer. A proposition derived in Section 3 implies that averaging parameters is not supported by theory and requires further scrutiny.
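For reference, the three metrics are the $\ell_2$, $\ell_1$ and $\ell_\infty$ norms of the difference between two parameter vectors; a small sketch with made-up vectors:

```python
import numpy as np

# Two hypothetical client parameter vectors (toy values, not trained weights).
w_a = np.array([1.0, -2.0, 3.0])
w_b = np.array([2.0, 0.0, 3.0])

diff = w_a - w_b
euclidean = np.linalg.norm(diff, ord=2)       # sqrt(1 + 4 + 0)
manhattan = np.linalg.norm(diff, ord=1)       # 1 + 2 + 0 = 3
chebyshev = np.linalg.norm(diff, ord=np.inf)  # max(1, 2, 0) = 2
```

Tracking all three over training gives complementary views of whether local models actually converge toward one another.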

The rest of the paper is organized as follows. In Section 2, we provide an overview of related work. In Section 3, we describe the details of estimating MI. Section 4 presents the experimental results. Conclusions and future work are discussed in Section 5.

## 2. Related Work

In [28], the authors represent deep neural networks (DNNs) as a Markov chain and propose to use information-theoretic tools, in particular the Information Plane (IP), i.e., the plane of the MI values of any other variable with the input variable (input data) and the desired output variable (label), to study and visualize DNNs. In [29], the authors extend the approach of [28], demonstrating the effectiveness of the Information Plane visualization of DNNs. In particular, the authors of [29] suggest that the training process of DNNs is composed of an initial fitting phase and a representation compression phase, where the goal of the network is to optimize the Information Bottleneck (IB) tradeoff between compression and prediction. These results were later questioned by [30], who argue that the compression phase is not general in the DNN training process. However, [31] propose the ensemble dependency graph estimator (EDGE) for MI, supporting the claim of a compression phase. Following this line of work, [32] propose a method for performing IP analysis on arbitrarily distributed discrete and/or continuous variables, while [33] use matrix-based Rényi's entropy to provide a comprehensive IP analysis of large DNNs.

There are a few other attempts to study federated learning from an information-theoretic perspective. In [34], the authors apply information theory to understand the aggregation process of DNNs in the federated learning setup. They use EDGE [31] to measure the MI between the representation and the inputs, as well as between the representation and the labels, in each local model, and compare them to the respective information contained in the representation of the averaged model. The authors of [34] leave the theoretical analysis of when and how to aggregate local models as an open question. In contrast to [34] and the other work reviewed above, in this paper we treat each local model as a continuous random vector and directly measure the MI between different local models throughout the training phase using two methods. By doing so, we explore the correlation between the local models. By further comparing with the distance variation between local models, we propose that averaging may not be the optimal way of aggregating trained parameters.

## 3. Estimating the Mutual Information

In this section, we present two methods to estimate the MI between the model parameters at two clients. To do so, we need to perform training at each client many times to acquire enough samples of each parameter vector. Generally speaking, let $X$ and $Y$ be the model parameters computed at two different clients, $X,Y\in {\mathbb{R}}^{t}$. Let $Z=(X,Y),Z\in {\mathbb{R}}^{2t}$ be the joint variable of $X$ and $Y$. After training at the two clients $N$ times, we obtain $N$ samples of $X$ and $Y$, denoted by ${x}_{1},{x}_{2},\dots ,{x}_{N}$ and ${y}_{1},{y}_{2},\dots ,{y}_{N}$, and also $N$ samples of $Z$: ${z}_{1},{z}_{2},\dots ,{z}_{N}$, with ${z}_{i}=({x}_{i},{y}_{i})$.
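The notation above can be fixed with a short sketch. Here the $N$ repeated-training samples are replaced by synthetic correlated Gaussian draws purely to illustrate the shapes; $N$, $t$ and the coupling factor are arbitrary choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, t = 500, 4                          # N training repetitions, t parameters per client
X = rng.normal(size=(N, t))            # stand-in for client 1's samples x_1, ..., x_N
Y = 0.5 * X + rng.normal(size=(N, t))  # stand-in for client 2's (correlated) samples
Z = np.concatenate([X, Y], axis=1)     # joint samples z_i = (x_i, y_i) in R^{2t}
```

Both estimators below consume exactly these three arrays: the marginal samples of $X$ and $Y$ and the joint samples of $Z$.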

#### 3.1. MI Formula under the Multi-Dimensional Gaussian Assumption

In the first case, we assume that $X$ and $Y$ are jointly multi-dimensional Gaussian, i.e., $X\sim \mathcal{N}({\mu}_{x},{\Sigma}_{x})$, $Y\sim \mathcal{N}({\mu}_{y},{\Sigma}_{y})$, and the joint variable $Z$ also follows a multi-dimensional Gaussian distribution $Z\sim \mathcal{N}({\mu}_{z},{\Sigma}_{z})$.

Firstly, we calculate $Z$'s covariance matrix $\Sigma_z$ from the $N$ samples:

$$\Sigma_z = \begin{pmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{pmatrix}, \qquad \Sigma_{xy}[i,j] = \mathrm{cov}(X^i, Y^j), \qquad (1)$$

where $X^i, Y^j$, $i,j = 1,2,\dots,t$, denote the $i$th and $j$th dimensions of $X$ and $Y$, and $\mathrm{cov}(X^i, Y^j) = E\left[(X^i - E[X^i])(Y^j - E[Y^j])\right]$ denotes the covariance between $X^i$ and $Y^j$; the diagonal blocks collect the covariances within $X$ and within $Y$, and $\Sigma_{yx} = \Sigma_{xy}^{\top}$. The mean of each variable is estimated from the collected samples.

Then, to calculate $H(Z) = H(X,Y)$, we use the entropy formula for the multi-dimensional Gaussian distribution [35]:

$$H(Z) = \frac{1}{2}\ln\left[(2\pi e)^{2t}\det \Sigma_z\right]. \qquad (2)$$

Since $Z$ is the joint variable of $X$ and $Y$, from Equation (1) we can infer that $\Sigma_x = \Sigma_z[1:t,\,1:t]$ and $\Sigma_y = \Sigma_z[t+1:2t,\,t+1:2t]$, and through Equation (2) we then calculate $H(X)$ and $H(Y)$ as:

$$H(X) = \frac{1}{2}\ln\left[(2\pi e)^{t}\det \Sigma_x\right], \qquad H(Y) = \frac{1}{2}\ln\left[(2\pi e)^{t}\det \Sigma_y\right].$$

Finally, through the relationship between entropy and MI [35],

$$I(X;Y) = H(X) + H(Y) - H(X,Y),$$

we calculate the mutual information between $X$ and $Y$.
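Combining Equations (1) and (2) with the entropy–MI relationship, the Gaussian estimate reduces to log-determinants of the sample covariance blocks. A minimal sketch (the function name and test data are our own, not from the paper):

```python
import numpy as np

def gaussian_mi(X, Y):
    """MI estimate (in nats) under the joint-Gaussian assumption:
    I(X;Y) = H(X) + H(Y) - H(X,Y) = 1/2 * ln(det(Sx) * det(Sy) / det(Sz))."""
    t = X.shape[1]
    Z = np.concatenate([X, Y], axis=1)  # joint samples z_i = (x_i, y_i)
    Sz = np.cov(Z, rowvar=False)        # (2t x 2t) sample covariance, as in Eq. (1)
    Sx = Sz[:t, :t]                     # Sigma_x = Sigma_z[1:t, 1:t]
    Sy = Sz[t:, t:]                     # Sigma_y = Sigma_z[t+1:2t, t+1:2t]
    _, ld_x = np.linalg.slogdet(Sx)     # log-determinants for numerical stability
    _, ld_y = np.linalg.slogdet(Sy)
    _, ld_z = np.linalg.slogdet(Sz)
    return 0.5 * (ld_x + ld_y - ld_z)
```

Note that the $(2\pi e)^t$ factors in Equation (2) cancel in the difference $H(X) + H(Y) - H(X,Y)$, so only the covariance determinants matter.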

#### 3.2. KNN Discretization Estimator

In the previous subsection, we estimated MI under a multi-dimensional Gaussian assumption. In this subsection, we show how to estimate MI when the distribution is unknown, using average $k$-nearest-neighbor distances [21], one of the most commonly used continuous-continuous MI estimators. Note that there are other methods to estimate MI, such as EDGE [31]; however, the EDGE method [31] is better suited to estimating the MI of a continuous-discrete mixture.

Specifically, let us denote by $\epsilon(i)/2$ the (maximum-norm) distance from $z_i$ to its $k$th nearest neighbor. We count the number $n_x(i)$ of points $x_j$ whose distance to $x_i$ is strictly less than $\epsilon(i)/2$, and similarly for $Y$. The estimate of the MI is then:

$$\hat{I}(X;Y) = \psi(k) + \psi(N) - \frac{1}{N}\sum_{i=1}^{N}\left[\psi\left(n_x(i)+1\right) + \psi\left(n_y(i)+1\right)\right],$$

where $\psi(\cdot)$ is the digamma function and $k$ is a hyperparameter; in this paper we set $k$ to 3 after tuning. The method uses the probability of points falling within the $k$th-nearest-neighbor distance to approximate the underlying probability density, and it has the advantage of vastly reducing systematic errors. For more details, please refer to [21].
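The estimator can be sketched compactly as follows (Kraskov et al.'s algorithm with the maximum norm; the brute-force $O(N^2)$ distance computation is for illustration only, whereas a KD-tree would be used at scale):

```python
import numpy as np
from scipy.special import digamma

def ksg_mi(X, Y, k=3):
    """KNN-based MI estimate (in nats) from N paired samples of X and Y."""
    N = X.shape[0]
    Z = np.concatenate([X, Y], axis=1)
    # Pairwise max-norm distances in the joint, X, and Y spaces.
    dz = np.max(np.abs(Z[:, None, :] - Z[None, :, :]), axis=-1)
    dx = np.max(np.abs(X[:, None, :] - X[None, :, :]), axis=-1)
    dy = np.max(np.abs(Y[:, None, :] - Y[None, :, :]), axis=-1)
    np.fill_diagonal(dz, np.inf)                 # a point is not its own neighbor
    eps_half = np.sort(dz, axis=1)[:, k - 1]     # distance from z_i to its kth neighbor
    # Count marginal points strictly within eps(i)/2, excluding the point itself.
    n_x = np.sum(dx < eps_half[:, None], axis=1) - 1
    n_y = np.sum(dy < eps_half[:, None], axis=1) - 1
    return digamma(k) + digamma(N) - np.mean(digamma(n_x + 1) + digamma(n_y + 1))
```

Unlike the Gaussian formula, this estimator makes no parametric assumption, at the cost of noticeably higher computational expense per evaluation.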