Article

Lazy Aggregation for Heterogeneous Federated Learning

Gang Xu, De-Lun Kong, Xiu-Bo Chen and Xin Liu
1 School of Information Science and Technology, North China University of Technology, Beijing 100144, China
2 Information Security Center, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
3 School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou 014010, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2022, 12(17), 8515; https://doi.org/10.3390/app12178515
Submission received: 30 July 2022 / Revised: 19 August 2022 / Accepted: 20 August 2022 / Published: 25 August 2022
(This article belongs to the Special Issue Federated and Transfer Learning Applications)

Abstract

Federated learning (FL) is a distributed neural network training paradigm with privacy protection. Under the premise that local data are never disclosed, multiple devices cooperate to train a model and improve its generalization. Unlike centralized training, FL is susceptible to heterogeneous data: biased gradient estimates hinder convergence of the global model, and traditional sampling techniques cannot be applied to FL because of privacy constraints. Therefore, this paper proposes a novel FL framework, federated lazy aggregation (FedLA), which reduces the aggregation frequency to obtain high-quality gradients and improve robustness in non-IID settings. To judge the aggregation timing, the change rate of the models' weight divergence (WDR) is introduced to FL. Furthermore, the collected gradients also help FL walk out of saddle points without extra communication. The cross-device momentum (CDM) mechanism can significantly raise the upper performance limit of the global model in non-IID settings. We evaluate several popular algorithms together with FedLA and FedLA with momentum (FedLAM). The results show that FedLAM achieves the best performance in most scenarios, and the performance of the global model can also be improved in IID scenarios.

1. Introduction

With the development of information technology, the amount of data generated by human activity is unprecedented. To mine the potential value of these data, a series of emerging information technologies has been promoted (for example, the Internet of Things, Big Data, and Data Mining). Nevertheless, industry competition, monopolies, and data protection laws make it difficult to share data. Thus, it is necessary to discuss how to break down "data silos" and achieve secure data sharing. In 2016, Google proposed a privacy-constrained distributed neural network training paradigm named FL. Different from centralized training, FL aggregates the model weights of devices to integrate the data knowledge of all parties without disclosing local data, improving the performance of the model on unknown data. FL thus realizes the efficient and safe sharing of data knowledge among multiple parties and has been widely explored and applied in data-sensitive fields such as medical treatment, personalized recommendation, and monetary risk assessment.
Unfortunately, FL faces several challenges [1,2]. First of all, its open distributed architecture requires network transmission for synchronous training. In practice, the local training delay is much smaller than the network transmission delay, especially at large scale and with large models, so network quality directly determines training efficiency. Secondly, the central server cannot access the participants' local data or data statistics, which rules out data preprocessing operations (e.g., data cleaning, deduplication, and augmentation). In addition, different distributions and data sparsity greatly weaken the final effect and efficiency, and can even prevent convergence, which has become a dark cloud over FL.
To resolve the issues of communication and data heterogeneity in FL, researchers have carried out a series of explorations. McMahan et al. [3] proposed the vanilla FL framework, federated averaging (FedAvg), whose optimization focuses on increasing local epochs to reduce the synchronization frequency, with only a subset of devices chosen in each round. Under Independent and Identically Distributed (IID) data, FedAvg performs very well and takes the first step of FL from theory to practice. However, in non-IID settings, FedAvg not only slows down convergence but also greatly reduces the performance of the global model. Li et al. [4] theoretically analyzed the convergence of FedAvg for strongly convex and smooth problems: different data distributions widen the weight divergence, and more local epochs further exacerbate gradient offsets. Duan et al. [5] proved that a globally unbalanced dataset can also hurt FL performance and proposed using a globally shared and balanced dataset to alleviate the gradient offsets caused by different distributions. Performance improves as the shared proportion increases, but, unfortunately, sharing may cause data leakage. Zeng et al. [6] proposed a novel FL framework, FedGSP, which counteracts perturbations from heterogeneous data through sequential execution. Sequential training is more robust to heterogeneous data than parallel training, but it is less efficient. For this reason, the authors propose inter-cluster grouping (ICG), which achieves inter-group homogeneity and intra-group heterogeneity. Grouping, however, requires device characteristics, which carries security risks.
In this paper, we propose a novel FL method, illustrated in Figure 1. Federated lazy aggregation (FedLA) is designed to address data heterogeneity. Building on FedAvg, FedLA allows more devices, drawn from successive rounds, to train sequentially, and it needs no extra information about devices. As noted in [6], finding heterogeneous device groups is an NP-hard problem. Since sequential training performs well under both homogeneous and heterogeneous data, it suffices to detect when sampling is sufficient and then aggregate in time. As for the aggregation timing, this paper proposes the change rate of the models' weight divergence (WDR) to monitor sampling. Furthermore, we propose FedLA with a cross-device momentum (CDM) mechanism, FedLA with momentum (FedLAM), which relieves jitter in the vertical optimization direction, helps local models jump out of local optima, and releases the performance of the global model in FL. Finally, we introduce several benchmark algorithms for comparison. The results show that FedLAM achieves excellent performance in most scenarios. Compared with FedAvg, FedLAM improves test accuracy by 3.1% on Mnist and by as much as 7.3% in class 1. It yields an overall increase of 3.7% on Cifar10 and a 2.4% improvement on Emnist.

2. Related Work

At present, there are three main solutions to handle the FL problem in non-IID, including (1) adaptive optimization, (2) model personalization, and (3) data-based optimization.

2.1. Adaptive Optimization

FedAvg excels under IID data but is not ideal in non-IID settings. To this end, a series of works modifies FedAvg to adapt to non-IID scenarios. Li et al. [7] eased the oscillations around extreme points by dynamically reducing the local learning rate; moreover, a proximal term is added to the local loss to limit the gradient offset. In the presence of straggler devices, aggregation stability is greatly improved; in other respects, the method is similar to FedAvg. Shamir et al. [8] used multi-party distributed optimization to coordinate all parties in finding the global optimal solution; however, in FL not all devices can participate in every round. Karimireddy et al. [9] proposed a control-state scheme, Scaffold, which uses device-state and server-state control variables to coordinate the differences in the training objectives of all parties; it greatly improves the convergence speed and upper bound of the global model, but its transmission overhead is twice that of FedAvg. Durmus et al. [10] proposed a dynamic regularization method, FedDyn, which dynamically modifies local objectives to make them asymptotically consistent with the global objective and avoids transferring extra data. It should be noted that the method is very sensitive to the learning rate.
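For illustration, a minimal sketch of the proximal-term idea in [7] follows: the local objective adds (mu/2)·||w − w_global||² so that the local model stays close to the received global model. The model, loader, and coefficient names are assumptions for illustration, not the code of the cited work.

import torch

def local_update_with_prox(model, global_params, loader, lr=0.01, mu=0.01, epochs=1):
    """Hypothetical FedProx-style local solver: SGD on the task loss plus
    a proximal term (mu/2) * ||w - w_global||^2 that limits the gradient offset."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    frozen = [p.detach().clone() for p in global_params]   # received global weights
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            # proximal term keeps the local model close to the global model
            prox = sum(((p - g) ** 2).sum() for p, g in zip(model.parameters(), frozen))
            (loss + 0.5 * mu * prox).backward()
            opt.step()
    return model.state_dict()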

2.2. Model Personalization

As the name implies, this kind of scheme allows device models to be personalized, and each model may be completely different. Although the personalized models differ, they still contain some globally shared knowledge. Finn et al. [11] proposed a meta-learning framework, MAML, which pretrains a meta-model using an auxiliary dataset and then fine-tunes it with local data. The scheme proposed by Fallah et al. [12] extracts the feature-extraction layers of the pretrained model and fine-tunes the output layers to implement personalization. Arivazhagan et al. [13] proposed a personalized scheme that splits the model into two parts, a base layer and a personalized layer, and only uploads the weights of the base layer for aggregation in each round; it is personalized learning with lower requirements on datasets. However, a personalized model that is excellent locally is not always satisfactory globally. Sattler et al. [14] proposed a fresh multitask framework, clustered federated learning (CFL), which introduces a recursive bi-partition mechanism on top of FedAvg to separate heterogeneous devices and combine homogeneous devices, improving generalization and reducing gradient conflicts, but it is computationally inefficient. Ghosh et al. [15] put forward an iterative federated clustering algorithm, IFCA, similar to EM: several group models are randomly initialized and distributed to the selected devices in each round, each selected device chooses the optimal weights based on its local data, and finally the server aggregates the gradients by group. The FedSem scheme of Xie et al. [16] is based on the l2 distance between device and group model weights. These two schemes improve grouping efficiency; the former needs to transmit more redundant models, while the latter requires complex model initialization. Duan et al. [17] proposed a pre-grouping scheme, FedGroup, which uses decomposed cosine similarity instead of Euclidean distance to avoid the "curse of dimensionality" and improve the efficiency and stability of the clustering algorithm. Although CFL establishes connections among devices to a certain extent, each group is independent. In reality, the relationships among devices are complex and diverse, and hard grouping cuts off connections between devices. Therefore, Li et al. [18] proposed federated learning with soft clustering (FLSC) based on IFCA: the devices of each group can overlap, and the updated weights can be shared by multiple groups. Their experiments show that the group models of FLSC perform better than those of IFCA.

2.3. Data-Based Optimization

Data sharing [4,19,20] is a simple and effective scheme. FedShare [19] uses a globally shared and balanced dataset G to initialize the global model, distributes the global model and part of G in each round, and lets devices train jointly on the distributed and local data. Experiments show that sharing only 5% of the dataset improves the global model by 30% on Cifar10; however, such a dataset is difficult to obtain in practice. Data augmentation [21,22] can expand a dataset through random transformations, knowledge transfer, etc., which effectively alleviates the impact of data sparsity. However, these methods need to acquire statistical information, which undoubtedly breaks the privacy constraint. Knowledge distillation [23] is a promising solution for FL in non-IID settings; it transfers the knowledge of a teacher model to student models with the help of an auxiliary dataset. Lin et al. [24] used ensemble distillation to fuse multiple local models (teachers) into the next global model (student) and extract the knowledge of the local models while preserving privacy, but the efficiency of knowledge distillation is relatively low.
In summary, these works have improved heterogeneous federated learning at the data, model, and system-architecture levels. The main goals are to mitigate differences in data distribution and to handle conflicting gradients. Our work focuses on the former: continuously sampling multiple devices brings the sampling closer to the global data distribution. Data sparsity is also very common on the device side and makes local models fall into local optima; to this end, CDM is introduced. More importantly, these two optimizations introduce no extra communication overhead, FL privacy is effectively guaranteed, and the server communication load is even reduced.

3. Methods

3.1. Problem

FedAvg is the prototype of a series of FL algorithms whose theoretical basis is the federated optimization [25] problem, which is essentially a distributed optimization problem with privacy constraints. The distributed optimization goal of FL is:
\min_{w} F(w) = \sum_{k=1}^{N} a_k L_k(w)        (1)
where N is the total number of devices and a_k is the global weight of the k-th device, with a_k ≥ 0 and \sum_{k=1}^{N} a_k = 1. Generally, a_k = n_k / \sum_{k=1}^{N} n_k, where n_k is the number of samples on the k-th device. In general, L_k(w) is the widely used cross-entropy loss of the classification task:
L_k(w) = \sum_{i=1}^{C} p_k(y = i)\, \mathbb{E}_{x|y=i}\left[\log f_i(x, w)\right]        (2)
The local optimization goal is to find the weights w_k with the minimum local loss, while FL algorithms look for the global optimal weights w with the minimum total loss. To cope with complex network environments, FedAvg selects only a subset S_t of all N devices in each round; the selected devices train in parallel, and finally the server aggregates the recycled updates to renew the global model weights w_{t+1} according to the update rule:
w_{t+1} = \frac{1}{\sum_{k \in S_t} n_k} \sum_{k \in S_t} n_k\, w_{t+1}^k        (3)
When the local distributions of devices are consistent with global distribution (IID), there are few distinctions between partial and full training in each round, while the amount of computation and communication is greatly reduced. In reality, local data are sparse and distributed inconsistently, and gradient conflicts are prone to occur. The effect of gradient aggregation is poor [4], and it is easy to form a low-quality model.
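As a reference point for the weighted averaging in (3), the minimal sketch below aggregates the returned PyTorch state dicts by sample count; the function and variable names are illustrative assumptions rather than the authors' implementation.

import copy

def fedavg_aggregate(client_states, client_sizes):
    """Sample-weighted average of client state_dicts, as in Equation (3).
    client_states: list of state_dicts w_{t+1}^k; client_sizes: list of n_k."""
    total = float(sum(client_sizes))
    new_state = copy.deepcopy(client_states[0])
    for key in new_state.keys():
        # note: buffers (e.g., BatchNorm counters) are averaged too in this simple sketch
        new_state[key] = sum(
            (n / total) * state[key].float() for state, n in zip(client_states, client_sizes)
        )
    return new_state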

3.2. FedLA

When the sampling is sufficient and IID, the gradients are unbiased estimates, and parallel training accelerates convergence of the global model. Otherwise, the gradient directions diverge or even conflict, which harms aggregation [26]. It is known from the literature [6] that sequential SGD is more robust to heterogeneous data. An intuitive idea is to make devices train one by one; however, this is inefficient, especially for large-scale FL. Moreover, sequential training places higher requirements on network connectivity, and an overly long sequential chain causes knowledge forgetting. On the one hand, parallel training is more efficient under IID data; on the other hand, the robustness of sequential training is non-negligible. We therefore combine the two, letting homogeneous devices train in parallel and heterogeneous devices train sequentially. In the FL setting, the server cannot obtain data distributions on the device side, so grouping by data distribution becomes an NP-hard problem similar to the knapsack problem. Zeng et al. [6] proposed the ICG approximation to solve it, but acquiring device characteristics may cause data leakage. We argue that it is not necessary to judge whether devices are homogeneous; more attention should be paid to how to sample sufficiently. That is to say, the gradients of only one round of devices are often biased in non-IID settings, which makes aggregation unsatisfactory. Reducing the frequency of the server's aggregation may be a better choice.
The accuracy of a gradient depends on whether data sampling is sufficient. In non-IID settings, the data distribution on the device side is inconsistent with the global distribution, which is likely to cause gradient bias. Cross-device sequential training can draw more samples and obtain high-quality gradients, but an overly long sequential chain leads to knowledge forgetting and low efficiency, so it is necessary to aggregate and cut off sequential chains periodically. To this end, we introduce the change rate of the models' weight divergence (WDR) to judge the aggregation timing. The WDR in the t-th round is defined as:
WDR_t = \frac{WD_t - WD_{t-1}}{WD_t}        (4)
where WD_t is the models' weight divergence (WD) in the t-th round:
WD_t = \frac{1}{K} \sum_{0 \le i < j < K} \left\| w_t^i - w_t^j \right\|        (5)
K is the number of selected devices per round, and w_t^i is the model weight of the selected i-th device after local training in the t-th round. After the server recycles the updated weights, it judges the timing of aggregation according to the WDR calculated by (4). When the WDR is less than the threshold, the server aggregates and forms a new global model; otherwise, the server randomly distributes the collected updated weights to the next round of devices. While the model weights are not aggregated, the models' WD becomes larger and larger, but the WDR becomes smaller and smaller, because, as continuous sampling increases, the update directions of the models' weights gradually become consistent with the global optimization direction.
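A minimal sketch of how a server might compute the pairwise weight divergence (5) and the change rate (4) from the K recycled state dicts follows; the helper names and the threshold handling are assumptions, not the authors' code.

import itertools
import torch

def flatten(state_dict):
    """Concatenate all parameters of a state_dict into one vector."""
    return torch.cat([v.detach().float().view(-1) for v in state_dict.values()])

def weight_divergence(states):
    """WD_t = (1/K) * sum_{0<=i<j<K} ||w_t^i - w_t^j||, Equation (5)."""
    K = len(states)
    vecs = [flatten(s) for s in states]
    pairs = itertools.combinations(range(K), 2)
    return float(sum(torch.norm(vecs[i] - vecs[j]) for i, j in pairs) / K)

def should_aggregate(wd_curr, wd_prev, eps=0.02):
    """Aggregate when WDR_t = (WD_t - WD_{t-1}) / WD_t falls to or below eps, Equation (4)."""
    wdr = (wd_curr - wd_prev) / wd_curr
    return wdr <= eps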
Zhao et al. [19] theoretically proved that the WD between FedAvg and vanilla SGD is related to the probabilistic distance between the local and global distributions. Since the global data distribution is usually unknown, the WD in this paper focuses on the divergence among the model weights that are to be aggregated:
\left\| w_t^i - w_t^j \right\| \le (1 + \eta\beta) \left\| w_{t-1}^i - w_{t-1}^j \right\| + \eta\, g_{max}(w_{t-1}^j) \sum_{k=1}^{C} \left\| p_i(y = k) - p_j(y = k) \right\|        (6)
where 0 < \beta = \sum_{k=1}^{C} p_i(y = k)\, \lambda_{x|y=k}, g_{max}(w_{t-1}^j) = \max_{0 \le k < C} \left\| \nabla_w \mathbb{E}_{x|y=k}[\log f_k(x, w_{t-1}^j)] \right\|, and \eta is the fixed learning rate (LR) of local SGD in our research. We can conclude that the WD mainly comes from the difference in the initial weights and in the sampling distributions. The former can be reduced by synchronous aggregation, which is the focus of previous works, but this paper shifts its attention to the latter. From the aggregation point of view, multi-epoch training with a small LR on the device side differs little from one epoch of training with a larger LR, and multiple devices with few samples training sequentially is equivalent to one device training with a large number of samples. Therefore, we can generalize the conclusion of inequality (6), extending the sampling range from a batch to a device.
With more sampling, the WD growth caused by the distribution difference gradually disappears; only when the overall sampled distribution is consistent with the global distribution does the WD growth slow down. Monitoring the WDR is therefore a practical way to judge whether further sampling is necessary: if the WDR is greater than the threshold, the current round of weights is not aggregated but forwarded to the next round of selected devices; otherwise, we consider the sampling sufficient. On this basis, this paper proposes federated lazy aggregation (FedLA), shown in Algorithm 1:
Algorithm 1: Federated Lazy Aggregation (FedLA)
Input: T, K, E, B, w_0, d_0, ε
Initialize n_0^0, n_0^1, …, n_0^{K−1} ← 0;  w_0^0, w_0^1, …, w_0^{K−1} ← w_0
for t = 0, 1, 2, …, T − 1 do
  Sample devices S_t ⊆ N, |S_t| = K ≤ |N|
  Transmit w_t^0, w_t^1, …, w_t^{K−1} to the K selected devices, respectively
  for each device k ∈ S_t in parallel do
    n^k, w_{t+1}^k = ClientUpdate(w_t^k, E, B)
    Transmit n^k, w_{t+1}^k to the server
  end for
  for k = 0, 1, 2, …, K − 1 do
    n_{t+1}^k ← n_t^k + n^k
  end for
  d_{t+1} = WD(w_{t+1}^0, w_{t+1}^1, …, w_{t+1}^{K−1})
  if (d_{t+1} − d_t) / d_{t+1} ≤ ε then
    w_{t+1} ← (Σ_{k=0}^{K−1} n_{t+1}^k w_{t+1}^k) / (Σ_{k=0}^{K−1} n_{t+1}^k);  w_{t+1}^0, w_{t+1}^1, …, w_{t+1}^{K−1} ← w_{t+1}
    n_{t+1}^0, n_{t+1}^1, …, n_{t+1}^{K−1} ← 0;  d_{t+1} = 0
  else
    w_{t+1} ← w_t
  end if
end for
Output: w_T
Here E is the number of local training epochs, B is the batch size for loading data, and w_t^k is the weight of the k-th device slot, initialized to the weight received in the t-th round. T is the total number of rounds, w_0 the initial global weight, d_0 the initial WD value (generally 0), ε the WDR threshold, and n_t^k the number of cumulative samples in the t-th round. FedLA adopts the same local solver as FedAvg, which optimizes the received weights by mini-batch SGD and then sends the updated weights back to the server. Unlike FedAvg, aggregation does not occur every round but is triggered according to the WDR. The impact of heterogeneous data can be effectively mitigated through this interval-based aggregation. Apart from computing the WDR (computational complexity O(K^2 d), where d is the number of model parameters) and postponing aggregation, no other optimizations are introduced. Thus, FedLA has the same convergence rate O(1/√T) as FedAvg [4] for strongly convex and smooth problems in non-IID settings. In practice, only the classifier weights are needed for the judgment, and the weights of the feature layers are transmitted only when aggregation may occur, which ensures efficiency and reduces redundant aggregation. Moreover, FedLA is compatible with other optimization and privacy strategies.
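Since only the classifier weights are said to be needed for the judgment, a hedged variant of the check could restrict the divergence computation to output-layer parameters, reusing the weight_divergence helper from the previous sketch; the key-matching rule below is an assumption about parameter naming, not the authors' implementation.

def classifier_only(state_dict, keywords=("fc", "classifier", "output")):
    """Keep only parameters whose names suggest the classifier head (assumed naming)."""
    return {k: v for k, v in state_dict.items() if any(word in k for word in keywords)}

def wd_on_classifier(states):
    """Weight divergence computed on classifier weights only (reuses weight_divergence)."""
    return weight_divergence([classifier_only(s) for s in states])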

3.3. FedLAM

FedLA eliminates redundant aggregations and allows more devices to train sequentially, so more data are sampled and data sparsity (e.g., few samples, class imbalance, and poor quality) is alleviated. However, continuous sequential training without aggregation produces more jitter in the vertical optimization direction than vanilla FL, which makes a momentum mechanism more urgent, especially across devices. Momentum corrects the current gradients with previous gradients and raises the utility of historical gradients. First, we maintain K groups of gradient momenta on the server side, initialized to the zero gradient \hat{0}, with the update equation:
m_t^k \leftarrow \mu\, m_{t-1}^k + \Delta w_t^k        (7)
where \mu is the momentum coefficient (a larger coefficient means that historical gradients dominate and survive longer), m_t^k is the momentum of the k-th group in the t-th round, and \Delta w_t^k is the k-th updated gradient in the t-th round. On this basis, we propose FedLA with momentum (FedLAM), which adds the CDM mechanism, as shown in Algorithm 2:
Algorithm 2: Federated Lazy Aggregation with Momentum (FedLAM)
Input: T, K, E, B, w_0, d_0, ε, μ
Initialize m_0^0, m_0^1, …, m_0^{K−1} ← \hat{0};  n_0^0, n_0^1, …, n_0^{K−1} ← 0;  w_0^0, w_0^1, …, w_0^{K−1} ← w_0
for t = 0, 1, 2, …, T − 1 do
  Sample devices S_t ⊆ N, |S_t| = K ≤ |N|
  Transmit w_t^0, w_t^1, …, w_t^{K−1} to the K selected devices, respectively
  for each device k ∈ S_t in parallel do
    n^k, Δw_{t+1}^k = ClientUpdate(w_t^k, E, B)
    Transmit n^k, Δw_{t+1}^k to the server
  end for
  for k = 0, 1, 2, …, K − 1 do
    n_{t+1}^k ← n_t^k + n^k
    m_{t+1}^k ← μ m_t^k + Δw_{t+1}^k
    w_{t+1}^k ← w_t^k + m_{t+1}^k
  end for
  d_{t+1} = WD(w_{t+1}^0, w_{t+1}^1, …, w_{t+1}^{K−1})
  if (d_{t+1} − d_t) / d_{t+1} ≤ ε then
    w_{t+1} ← (Σ_{k=0}^{K−1} n_{t+1}^k w_{t+1}^k) / (Σ_{k=0}^{K−1} n_{t+1}^k);  w_{t+1}^0, w_{t+1}^1, …, w_{t+1}^{K−1} ← w_{t+1}
    m_{t+1} ← (Σ_{k=0}^{K−1} n_{t+1}^k m_{t+1}^k) / (Σ_{k=0}^{K−1} n_{t+1}^k);  m_{t+1}^0, m_{t+1}^1, …, m_{t+1}^{K−1} ← m_{t+1}    (*)
    n_{t+1}^0, n_{t+1}^1, …, n_{t+1}^{K−1} ← 0;  d_{t+1} = 0
  else
    w_{t+1} ← w_t
  end if
end for
Output: w_T
It is noteworthy that the selected devices send their gradients instead of weights, and the weights of each group slot are updated by the corrected gradients, which strengthens the horizontal optimization and offsets the jitter of the vertical optimization. In non-IID settings, local models easily fall into local optima [7], which is then passed to the global model through their gradients [27]. The momentum-aggregation step marked (*) in Algorithm 2 is optional; it is suggested to enable it early in training and close it later to expand the optimization search space, which contributes to walking out of saddle points. CDM can help FL alleviate gradient vanishing and take the global performance to a higher level. Compared with [28], our scheme does not pass momentum to the devices, which reduces the traffic between the device side and the server side.
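To make the server-side mechanics concrete, the sketch below combines the cross-device momentum update (7) with the optional momentum-aggregation step marked (*) in Algorithm 2. The names, the dict-of-tensors representation, and the sample-weighted average helper are illustrative assumptions rather than the authors' implementation.

def cdm_update(momenta, grads, weights, mu=0.5):
    """Equation (7): m^k <- mu * m^k + dw^k, then w^k <- w^k + m^k for each group slot k.
    momenta, grads, weights are lists of dicts mapping parameter names to tensors."""
    new_weights = []
    for m, dw, w in zip(momenta, grads, weights):
        for key in m:
            m[key] = mu * m[key] + dw[key]            # momentum-corrected gradient
        new_weights.append({key: w[key] + m[key] for key in w})
    return momenta, new_weights

def lazy_aggregate(weights, momenta, counts, aggregate_momentum=True):
    """Aggregation branch of Algorithm 2: sample-weighted average of the K slot weights
    and, if enabled, of the K momentum buffers (optional step (*))."""
    total = float(sum(counts))

    def wavg(dicts):
        return {k: sum((n / total) * d[k] for d, n in zip(dicts, counts)) for k in dicts[0]}

    global_w = wavg(weights)
    weights = [dict(global_w) for _ in weights]
    if aggregate_momentum:                            # suggested early in training, disabled later
        global_m = wavg(momenta)
        momenta = [dict(global_m) for _ in momenta]
    counts = [0] * len(counts)                        # reset cumulative sample counts
    return global_w, weights, momenta, counts

Disabling aggregate_momentum in later rounds keeps each slot's momentum independent, which matches the suggestion above to widen the search space.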

4. Evaluation

In this section, we evaluate the actual performance of FedLA and FedLAM. First, we implement them and several benchmark algorithms based on the PyTorch and Ray frameworks. Three public datasets are selected for testing. In addition, we set up IID and two kinds of non-IID scenarios (label distribution skew [5] and Dirichlet [29]) to simulate actual FL environments.

4.1. Experiment Setup

To compare the different benchmark algorithms simply and fairly, this section selects three image classification datasets, Mnist [30], Emnist [31], and Cifar10 [32], and simulates IID and non-IID scenarios by partitioning them. All experimental datasets are divided into 100 shards; each device holds one, and the number of selected devices is K = 10. It should be noted that local training uses neither the momentum and weight-decay options of the SGD optimizer nor gradient clipping, and the local solver uses a constant learning rate. The datasets, partitions, and models used in our experiments are shown in Table 1.

4.1.1. Datasets and Models

  • Mnist, a widely used image classification dataset, contains a training set of 60,000 handwritten digit pictures and a test set of 10,000. Its elements are 28 × 28 pixel black-and-white pictures with 10 classification labels. Since the dataset is relatively simple and clean, complex models (e.g., CNN, RNN) show little discrimination; for comparison, only a single-layer MLP is used in this section.
  • Emnist (Extended Mnist) adds 402,000 digit samples and 411,000 letter samples (26 classes) on top of Mnist. Because of the huge size of the dataset, six types of splits (e.g., by class, by merge, letters, etc.) are provided. To distinguish it from Mnist, the letters split is used, and a two-layer simple CNN is used as the test model.
  • Cifar10 contains 60,000 32 × 32 pixel color images in 10 categories. Its classes are completely mutually exclusive (e.g., there is no overlap between car and truck), while samples within the same category can differ considerably. The difficulty of the classification task increases significantly, so we choose a CNN with three convolutional layers and two fully connected layers.

4.1.2. Data Partitions

As shown in Figure 2, the clients' data distributions are simulated under three federated data partitions.
  • IID: The dataset is randomly shuffled and divided into several subsets of the same size.
  • Label distribution skew: Zhao et al. [5] proposed a more demanding non-IID setting, the outstanding feature of which is that each subset only holds a few classes of samples.
  • Dirichlet: Wang et al. [29] proposed a partition scheme based on the Dirichlet distribution, where each subset is also class-imbalanced and holds a different number of samples. This partition is closer to real FL environments. Both skewed partitions are sketched in the code after this list.
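To make the two non-IID partitions concrete, the following sketch assigns samples either by restricting each device to a few classes (label distribution skew) or by drawing per-device class proportions from a Dirichlet distribution; the function names, the round-robin class assignment, and the exact sampling details are assumptions rather than the partition code used in the paper.

import numpy as np
from collections import defaultdict

def partition_label_skew(labels, num_devices=100, classes_per_device=1, seed=0):
    """Each device holds only `classes_per_device` classes (assumed shard scheme).
    Assumes num_devices * classes_per_device is a multiple of the number of classes."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    owners = defaultdict(list)                 # which devices own which class (round-robin)
    for d in range(num_devices):
        for j in range(classes_per_device):
            owners[classes[(d * classes_per_device + j) % len(classes)]].append(d)
    device_idx = [[] for _ in range(num_devices)]
    for c in classes:
        idx = rng.permutation(np.where(labels == c)[0])
        for d, chunk in zip(owners[c], np.array_split(idx, len(owners[c]))):
            device_idx[d].extend(chunk.tolist())
    return device_idx

def partition_dirichlet(labels, num_devices=100, alpha=0.3, seed=0):
    """Per-class proportions drawn from Dir(alpha); smaller alpha gives stronger skew."""
    rng = np.random.default_rng(seed)
    device_idx = [[] for _ in range(num_devices)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(num_devices))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for d, chunk in zip(range(num_devices), np.split(idx, cuts)):
            device_idx[d].extend(chunk.tolist())
    return device_idx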
Figure 2. Schematic diagram of class distribution of data partitions.
Table 1. Summary of federated datasets and models.

Dataset | Samples | Classes | Devices | Partitions | Model (Parameters)
MNIST | 69,035 | 10 | 100 | IID, Class 1/3, Dir 0.1/0.3 | MLP (101,770)
CIFAR10 | 60,000 | 10 | 100 | IID, Class 1/3, Dir 0.1/0.3 | CNN (553,514)
EMNIST (letters) | 145,600 | 26 | 100 | IID, Class 3/9, Dir 0.1/0.3 | SimpleCNN (28,938)

4.1.3. Baselines

  • FedAvg: The vanilla FL Algorithm [3].
  • FedProx: A popular FL adaptive optimization scheme that limits excessive offset of the device-side model by adding a proximal term [7].
  • FedDyn: A dynamic regularization method for FL, and local objectives are dynamically updated to ensure that local optimums are asymptotically consistent with the global optimum [10].
  • CFL: A clustering federated learning framework based on recursive bi-partition that uses cosine similarity to separate the devices with gradient conflicts and form their respective cluster center models [14].

4.1.4. Evaluation Metric

To comprehensively evaluate the FL algorithms' performance, we take the top-1 classification accuracy on the devices' test datasets as the main indicator. Note that all original training and test samples are merged into a new dataset that is divided into 100 shards, each split into an 80% training set and a 20% test set, which ensures that the training and test data on a device follow the same distribution. Since the evaluation involves multiple parties and local data are inaccessible, we use the weighted accuracy on the test data of the selected devices as the verification accuracy, and the weighted accuracy over all devices as the test accuracy. CFL maintains several cluster center models and cannot be compared directly with the other algorithms; to facilitate comparison, in addition to the weighted accuracy of the cluster models, the pseudo-global model aggregated from all cluster models is introduced [17]. Note that this paper selects benchmark algorithms that introduce no extra synchronized parameters, so their communication traffic is at the same level.
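A minimal sketch of the weighted top-1 accuracy described above, computed over each device's local test split with PyTorch, is given below; the loader structure and names are assumptions.

import torch

@torch.no_grad()
def weighted_top1_accuracy(model, device_loaders):
    """Top-1 accuracy weighted by each device's number of test samples
    (equivalent to pooling correct predictions over all selected devices)."""
    correct, total = 0, 0
    model.eval()
    for loader in device_loaders:          # one DataLoader per device's test split
        for x, y in loader:
            pred = model(x).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.size(0)
    return correct / total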

4.2. Effects of the Proposed Algorithm

4.2.1. Performance

Several points need explanation: (1) the highest test score is taken after training for 300 rounds on Mnist and Emnist or 500 rounds on Cifar10; (2) the learning rate of the experiments is unified at 0.002; (3) the other hyperparameters are as follows: FedProx, offset penalty coefficient u = 0.01; FedDyn, regularization coefficient α = 0.01; CFL, mean-norm threshold eps1 = 0.035 and max-norm threshold eps2 = 0.5; FedLA, WDR threshold ε = 0.02; and FedLAM, historical gradient momentum coefficient μ = 0.5. Deviations from these values are noted. As shown in Table 2, we found that the FedLA and FedLAM algorithms perform well in most scenarios compared with the baseline algorithms.
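For reference, the hyperparameters listed above can be collected into a single configuration; the dictionary below merely restates the stated values, and the structure itself is an assumption.

CONFIG = {
    "rounds": {"mnist": 300, "emnist": 300, "cifar10": 500},
    "learning_rate": 0.002,      # Table 2 notes lr = 0.01 for the Cifar10 runs
    "local_epochs": 5,           # E, from the Table 2 caption
    "batch_size": 32,            # B, from the Table 2 caption
    "devices_total": 100,
    "devices_per_round": 10,     # K
    "fedprox_mu": 0.01,
    "feddyn_alpha": 0.01,
    "cfl_eps1": 0.035,
    "cfl_eps2": 0.5,
    "fedla_wdr_eps": 0.02,
    "fedlam_momentum_mu": 0.5,
}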
On Mnist, FedLAM is ahead of the other algorithms. Compared with FedAvg, it leads by 3.8% in IID and by 2.9% overall in non-IID, in particular by 7.3% in class 1. On the more difficult task, Cifar10, it gains 5.5% in IID and 3.2% overall in non-IID. In most cases, FedProx is very close to FedAvg: adding a proximal term does not overcome the gradient conflicts caused by heterogeneous data but only reduces excessive gradient offset. As for the straggler devices mentioned in Section 5.2 of [7], we do not explore this case in this article. FedDyn performs well on Emnist by aligning the local objectives with the global objective in real time through an attached dynamic regularizer. As shown in Figure 3, FedDyn converges quickly in the early stage, but the accuracy of its global model stagnates in the later stage and its loss fluctuates significantly; the main reason is that it does not adapt to the constant learning rate, which needs to be adjusted flexibly. CFL separates devices with gradient conflicts to build a fresh model and uses multiple models to adapt to devices with different distributions, improving comprehensive accuracy. For Cifar10 in class 1, the other algorithms are not satisfactory, while CFL uses recursive bi-partition to separate the devices of different classes and brings the test accuracies to 100%. We also observed that the pseudo-global model performs poorly, which verifies the great differences between classes. Although FedLA mitigates the gradient conflict caused by insufficient sampling to a degree, it is limited by the aggregation method and cannot fuse the opposing gradients well on Cifar10 (class 1); gradient projection [33] addresses the impact of opposing gradients. In non-IID scenarios, FedLA performs better than FedAvg: it reduces redundant aggregation, finds parallel groups of devices that are consistent with the global distribution by computing the WDR, and overcomes the impact of data sparsity on single devices. Furthermore, the collected gradients can be used to correct vertical oscillation and walk out of steep regions. FedLAM, supplemented by CDM, achieves the best performance in most cases, which is an exciting optimization.

4.2.2. WDR and Cross-Device Momentum

To explore the effectiveness of the WDR and cross-device momentum, we added fixed-interval aggregations and different proportions of momentum on Mnist (class 1).
As shown in Figure 4a,c, we sampled 2, 5, 10, and 20 devices for sequential training before aggregation. We found that sampling five devices is enough to obtain unbiased gradients in this experiment; oversampling does not further improve the performance of the global model, reduces learning efficiency, and can even have negative effects (knowledge forgetting, resource waste). According to inequality (6), the WD mainly comes from the difference in the initial weights and in the sampling distributions, and in non-IID settings we prefer to sacrifice part of the former for the latter. However, the required number of sampling rounds is unknown. Fortunately, it can be observed that the growth rate of the WD slows down significantly after five rounds: apart from the difference caused by the initial weights, the sampling distributions gradually become consistent and the second term of inequality (6) approaches 0. This is why this paper proposes using the WDR to detect sufficient sampling in a timely manner; it can be used to adjust the aggregation interval dynamically, which makes the performance curve of the global model smoother and accelerates convergence.
In Figure 4b,d, we explore the influence of the proportion of historical gradients in FL. Early in training, a higher momentum hinders the convergence of the global model, but it brings FL performance to a higher level in the later stage. The devices' gradients gradually vanish as FL training proceeds, mainly because the device models fall into local optima. Whether by increasing the number of sampled devices or the proportion of momentum, the gradient quality is enhanced remarkably, which undoubtedly helps to promote the global model. Inspired by this experiment, a momentum mechanism with dynamic adjustment deserves further exploration in the future. To sum up, FedLA determines the aggregation timing by monitoring the WDR and flexibly handles complex FL environments, and the CDM mechanism improves FL performance in both IID and non-IID settings.

5. Conclusions

This paper studies the problem of training with heterogeneous data in FL. Because of sparse data and different distributions across devices, sampling only one round of devices is not enough and yields a poor model. Therefore, this paper proposes a novel FL framework, FedLA, which samples more devices by putting off aggregation, making aggregation more robust. We also note that local optimization easily falls into saddle points; thus, the cross-device momentum mechanism is added to FedLA to further release the performance of the training model in FL. Compared with the benchmark algorithms (e.g., FedAvg, FedProx, FedDyn, and CFL), FedLAM has the best performance in most scenarios. In the future, we plan to conduct a detailed theoretical analysis of FedLA and study more advanced aggregation and sampling-judgment strategies. Moreover, dynamic scheduling strategies for the learning rate and momentum will be introduced to further accelerate the convergence of FedLA.

Author Contributions

Conceptualization, D.-L.K. and G.X.; methodology, D.-L.K.; software, D.-L.K.; validation, D.-L.K.; formal analysis, D.-L.K.; investigation, G.X., X.-B.C. and D.-L.K.; resources, D.-L.K.; data curation, D.-L.K.; writing, original draft preparation, D.-L.K.; writing, review and editing, X.L.; visualization, X.-B.C.; supervision, G.X.; project administration, X.L.; funding acquisition, G.X., X.-B.C. and X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by NSFC (Grant Nos. 92046001), the Fundamental Research Funds for Beijing Municipal Commission of Education, Beijing Urban Governance Research Base of North China University of Technology, the Natural Science Foundation of Inner Mongolia (2021MS06006), Baotou Kundulun District Science and technology plan project (YF2020013), and Inner Mongolia discipline inspection and supervision big data laboratory open project fund (IM-DBD2020020).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Mag. 2020, 37, 50–60. [Google Scholar] [CrossRef]
  2. Zhu, H.; Xu, J.; Liu, S.; Jin, Y. Federated learning on non-IID data: A survey. Neurocomputing 2021, 465, 371–390. [Google Scholar] [CrossRef]
  3. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017. [Google Scholar]
  4. Li, X.; Huang, K.; Yang, W.; Wang, S.; Zhang, Z. On the Convergence of FedAvg on Non-IID Data. In Proceedings of the 8th International Conference on Learning Representations, Virtual, 26–30 April 2020. [Google Scholar]
  5. Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated learning with non-iid data. arXiv 2018, arXiv:1806.00582. [Google Scholar] [CrossRef]
  6. Zeng, S.; Li, Z.; Yu, H.; He, Y.; Xu, Z.; Niyato, D.; Yu, H. Heterogeneous federated learning via grouped sequential-to-parallel training. In International Conference on Database Systems for Advanced Applications; Springer: Berlin/Heidelberg, Germany, 2022; pp. 455–471. [Google Scholar]
  7. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450. [Google Scholar]
  8. Shamir, O.; Srebro, N.; Zhang, T. Communication-efficient distributed optimization using an approximate newton-type method. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014. [Google Scholar]
  9. Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.J.; Stich, S.U.; Suresh, A.T. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 12–18 July 2020. [Google Scholar]
  10. Acar, D.A.E.; Zhao, Y.; Navarro, R.M.; Mattina, M.; Whatmough, P.N.; Saligrama, V. Federated learning based on dynamic regularization. In Proceedings of the 9th International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  11. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017. [Google Scholar]
  12. Fallah, A.; Mokhtari, A.; Ozdaglar, A. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. Adv. Neural Inf. Processing Syst. 2020, 33, 3557–3568. [Google Scholar]
  13. Arivazhagan, M.G.; Aggarwal, V.; Singh, A.K.; Choudhary, S. Federated learning with personalization layers. arXiv 2019, arXiv:1912.00818. [Google Scholar]
  14. Sattler, F.; Müller, K.-R.; Samek, W. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraint. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 3710–3722. [Google Scholar] [CrossRef] [PubMed]
  15. Ghosh, A.; Chung, J.; Yin, D.; Ramchandran, K. An efficient framework for clustered federated learning. Adv. Neural Inf. Processing Syst. 2020, 33, 19586–19597. [Google Scholar] [CrossRef]
  16. Long, G.; Xie, M.; Shen, T.; Zhou, T.; Wang, X.; Jiang, J. Multi-center federated learning: Clients clustering for better personalization. World Wide Web 2022, 6, 1–20. [Google Scholar]
  17. Duan, M.; Liu, D.; Ji, X.; Liu, R.; Liang, L.; Chen, X.; Tan, Y. FedGroup: Efficient clustered federated learning via decomposed data-driven measure. arXiv 2020, arXiv:2010.06870. [Google Scholar]
  18. Li, C.; Li, G.; Varshney, P.K. Federated Learning with Soft Clustering. IEEE Internet Things J. 2021, 9, 7773–7782. [Google Scholar]
  19. Yao, X.; Huang, T.; Zhang, R.-X.; Li, R.; Sun, L. Federated learning with unbiased gradient aggregation and controllable meta updating. arXiv 2019, arXiv:1910.08234. [Google Scholar]
  20. Tuor, T.; Wang, S.; Ko, B.J.; Liu, C.; Leung, K.K. Overcoming noisy and irrelevant data in federated learning. In Proceedings of the 25th International Conference on Pattern Recognition, Virtual, 10–15 January 2021. [Google Scholar]
  21. Tanner, M.A.; Wong, W.H. The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc. 1987, 82, 528–540. [Google Scholar] [CrossRef]
  22. Duan, M.; Liu, D.; Chen, X.; Tan, Y.; Ren, J.; Qiao, L.; Liang, L. Astraea: Self-balancing federated learning for improving classification accuracy of mobile deep learning applications. In Proceedings of the 2019 IEEE 37th International Conference on Computer Design, Abu Dhabi, United Arab Emirates, 17–20 November 2019. [Google Scholar]
  23. Buciluǎ, C.; Caruana, R.; Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006. [Google Scholar]
  24. Lin, T.; Kong, L.; Stich, S.U.; Jaggi, M. Ensemble distillation for robust model fusion in federated learning. Adv. Neural Inf. Processing Syst. 2020, 33, 2351–2363. [Google Scholar]
  25. Konečný, J.; McMahan, H.B.; Ramage, D.; Richtárik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv 2016, arXiv:1610.02527. [Google Scholar]
  26. Khaled, A.; Mishchenko, K.; Richtárik, P. Tighter theory for local SGD on identical and heterogeneous data. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, Palermo, Sicily, Italy, 26–28 August 2020. [Google Scholar]
  27. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  28. Liu, W.; Chen, L.; Chen, Y.; Zhang, W. Accelerating federated learning via momentum gradient descent. IEEE Trans. Parallel Distrib. Syst. 2020, 31, 1754–1766. [Google Scholar] [CrossRef] [Green Version]
  29. Wang, H.; Yurochkin, M.; Sun, Y.; Papailiopoulos, D.; Khazaeni, Y. Federated learning with matched averaging. arXiv 2020, arXiv:2002.06440. [Google Scholar]
  30. LeCun, Y.; Cortes, C.; Burges, C.J.C. The MNIST Database of Handwritten Digits. Available online: http://yann.lecun.com/exdb/mnist (accessed on 10 October 2021).
  31. Cohen, G.; Afshar, S.; Tapson, J.; Van Schaik, A. EMNIST: Extending MNIST to handwritten letters. In Proceedings of the 2017 International Joint Conference on Neural Networks, Anchorage, AK, USA, 14–19 May 2017. [Google Scholar]
  32. Doon, R.; Rawat, T.K.; Gautam, S. Cifar-10 classification using deep convolutional neural network. In Proceedings of the 2018 IEEE Punecon, Pune, India, 30 November–2 December 2018. [Google Scholar]
  33. Wang, Z.; Fan, X.; Qi, J.; Wen, C.; Wang, C.; Yu, R. Federated learning with fair averaging. arXiv 2021, arXiv:2104.14937. [Google Scholar]
Figure 1. The federated lazy aggregation framework.
Figure 3. Mnist-MLP: test top-1 accuracy and test loss under IID, class 1, and Dir 0.1. (a–c) Test accuracy curves of FedLA(M) and the baselines; (d–f) the corresponding loss curves under the three scenarios, used to evaluate convergence efficiency and stability.
Figure 4. WD and loss under class 1 on Mnist. (a,b) model weight divergence during training, (c,d) model convergence performance.
Table 2. Comparisons of FedAvg, FedProx, FedDyn, CFL, FedLA, and FedLAM on Mnist, Emnist, and Cifar10 (lr = 0.01) in IID and four kinds of non-IID settings. Local epochs E = 5, batch size B = 32. For CFL, the first value is the weighted accuracy of the cluster models (the superscript gives the number of clusters) and the second value is the pseudo-global model; superscripts on FedLA/FedLAM entries denote adjusted hyperparameters.

Dataset | Partition | FedAvg | FedProx | FedDyn | FedLA | FedLAM | CFL
Mnist | IID | 91.5 | 91.5 | 93.7 | 91.4 | 95.3 | 91.6 ^1 / 91.6
Mnist | Class 1 | 83.3 | 83.2 | 85.8 | 87.6 | 90.6 | 93.4 ^2 / 70.8
Mnist | Class 3 | 90.0 | 90.0 | 88.9 | 89.8 | 92.4 | 90.0 ^1 / 90.0
Mnist | Dir 0.1 | 90.0 | 90.0 | 88.7 | 88.8 | 91.5 | 91.2 ^2 / 88.8
Mnist | Dir 0.3 | 89.8 | 89.8 | 89.1 | 90.3 | 90.4 | 89.8 ^1 / 89.8
Emnist | IID | 89.4 | 89.7 | 91.4 | 89.9 | 89.4 | 89.9 ^1 / 89.9
Emnist | Class 3 | 79.1 | 79.6 | 84.9 | 79.3 | 85.3 | 79.2 ^1 / 79.2
Emnist | Class 9 | 87.6 | 87.4 | 90.0 | 85.7 | 89.1 | 87.6 ^1 / 87.6
Emnist | Dir 0.1 | 84.3 | 84.9 | 88.5 | 84.2 | 87.8 | 85.1 ^1 / 85.1
Emnist | Dir 0.3 | 89.1 | 89.3 | 90.7 | 89.1 | 89.7 | 89.1 ^1 / 89.1
Cifar10 | IID | 70.8 | 69.6 | 71.5 | 70.4 | 76.3 | 70.9 ^1 / 70.9
Cifar10 | Class 1 | 27.0 | 27.0 | 41.8 | 28.0 ^0.03 | 30.0 ^0.07, 0.2 | 100.0 ^10 / 17.1
Cifar10 | Class 3 | 65.3 | 66.2 | 64.1 | 67.3 | 71.1 | 66.0 ^1 / 66.0
Cifar10 | Dir 0.1 | 62.5 | 62.1 | 59.3 | 60.4 | 64.6 | 62.7 ^1 / 62.7
Cifar10 | Dir 0.3 | 66.0 | 66.1 | 65.6 | 68.1 | 71.5 | 66.5 ^1 / 66.5

