DACFL: Dynamic Average Consensus-Based Federated Learning in Decentralized Sensors Network

Federated Learning (FL) is a privacy-preserving way to utilize the sensitive data generated by the smart sensors of user devices, where a central parameter server (PS) coordinates multiple user devices to train a global model. However, relying on a centralized topology poses challenges when applying FL in a sensors network, including imbalanced communication congestion and a possible single point of failure, especially at the PS. To alleviate these problems, we devise Dynamic Average Consensus-based Federated Learning (DACFL) for implementing FL in a decentralized sensors network. Unlike existing studies that roughly replace the model aggregation with the neighbors' average, we first transform the FL model aggregation, which is the most intractable step in a decentralized topology, into a dynamic average consensus problem by treating the local training procedure as a discrete-time series. We then employ first-order dynamic average consensus (FODAC) to estimate the average model, which not only solves the model aggregation for DACFL but also keeps the users' models as consistent as possible. To improve the performance with non-i.i.d data, each user also takes the neighbors' average model as its next-round initialization, which prevents possible local over-fitting. Besides, we provide a basic theoretical analysis of DACFL on the premise of i.i.d data. The results validate the feasibility of DACFL in both time-invariant and time-varying topologies and show that DACFL outperforms existing studies, including CDSGD and D-PSGD, in most cases. Taking the results on Fashion-MNIST as a numerical example, with i.i.d data, DACFL achieves 19∼34% and 3∼10% increases in average accuracy compared to CDSGD and D-PSGD, respectively; with non-i.i.d data, it achieves 30∼50% and 0∼10% increases.


INTRODUCTION
The rapidly increasing number of smart mobile devices such as phones and tablets has brought convenience to people and generated plentiful user-related data such as images, sound and text. To make full use of this huge amount of data, data mining and machine learning techniques [1], [2], [3] have been developed. However, these techniques usually work on a central parameter server (PS). This makes it inevitable to collect large amounts of users' data into the PS, which may lead to excessive overhead and expose users' private information such as food preferences, consumption behavior and social ties. According to the General Data Protection Regulation (GDPR) [4], both industry and academia should pay more attention to privacy protection in machine learning. Hence, taking privacy concerns into consideration when exploiting users' data is becoming imperative in the era of artificial intelligence. To tackle this problem, Google has advocated an alternative known as federated learning (FL) [5], [6], [7]. Rather than collecting users' raw data into a PS, FL enables each user to train its own model using its own data, and only the intermediate models are periodically synchronized by the PS to obtain a global model.
However, conventional FL has several limitations due to its centralized network topology (see Fig. 1). For example, because users' locally trained models must be iteratively synchronized and the aggregated global model sent back to them, the communication load of the entire system is extremely unbalanced. Specifically, because all users need to communicate with the PS concurrently in every iteration, communication congestion is very likely to occur at the PS, and the performance degrades when the network bandwidth is low. What is worse, if a single point of failure occurs at the PS, e.g., because it is attacked, the entire framework is paralyzed.
To eliminate the above bottleneck caused by the centralized network topology, a natural question is whether it is possible to facilitate FL in a decentralized topology (see Fig. 1) without the support of a PS. Existing work on wireless sensor networks (WSN) and device-to-device (D2D) communication [8] has confirmed the possibility of communication in a decentralized network topology. Therefore, we believe that it is not only important but also feasible to implement FL in a decentralized topology.
In this paper, we consider the problem of facilitating federated learning in a decentralized network topology in the absence of a central parameter server. To this end, we propose a new decentralized federated learning framework based on dynamic average consensus, which we call DACFL. In DACFL, there is no central PS and all users are connected through an undirected graph, which is represented by a symmetric doubly stochastic matrix (i.e., a mixing matrix). Each user trains its local model using its own training data. The intermediate models trained by different users are treated as different discrete-time reference input sequences. To aggregate different users' intermediate models¹, each user employs an approximate estimation method, i.e., first-order dynamic average consensus (FODAC), to track an estimate of the average model. Besides, to alleviate the possible local over-fitting problem, especially for non-i.i.d data, each user in DACFL takes its neighborhood weighted average model parameters to reinitialize its local model after each training round.
There is already some existing work devoted to implementing federated learning in a fully decentralized network topology (as described in Section 2.2). Apart from that work, references [9] and [10] are the most similar to this paper. In [9], the authors propose a consensus-based distributed SGD (CDSGD) algorithm for collaborative deep learning over a fixed network topology that enables data parallelization as well as decentralized computation. In [10], the authors study a decentralized parallel SGD (D-PSGD) algorithm on a decentralized computational network. They further prove that D-PSGD achieves the same convergence rate as the centralized parallel SGD (C-PSGD) algorithm, but outperforms C-PSGD by avoiding the communication jam. Although some existing work treats CDSGD and D-PSGD as the same method [11], [12], in this paper we distinguish between them by whether the algorithm requires a global average (D-PSGD additionally needs a network-wide model average compared with CDSGD). In what follows, we describe the deficiencies of CDSGD and D-PSGD and differentiate our solution from these two methods in detail.
For CDSGD in [9], there are two main limitations. One is that it only considers a fixed network topology with a uniform interaction matrix (every element of the matrix has the same value). The other is that it assumes the training data to be independent and identically distributed (i.i.d) over all users. These two limitations make it poorly suited to federated learning, since changing network topologies and non-i.i.d data are common in federated learning. Besides, we empirically verify that when faced with a sparse topology, there is obvious variance across all users' final models. This is also inconsistent with the goal of FL to attain a globally consistent model for all users. As for D-PSGD in [10], which additionally performs a network-wide model average, there is no PS in a decentralized network topology, so a question naturally arises: who should perform this network-wide model average, and how can the consistency of each user's model be ensured? If we make each user perform this network-wide average (which is achievable by exchanging information through multi-hop communication), it will inevitably cause communication congestion at some users, especially those with higher degrees in the graph. Besides, the authors of [10] consider only a fixed ring network topology.
Our solution differs from CDSGD and D-PSGD in the following aspects. First, rather than roughly taking the neighborhood weighted average to replace the model aggregation as CDSGD and D-PSGD do, DACFL employs an average consensus method, i.e., FODAC, to approximate the average model for each user. Thanks to the effectiveness of FODAC, each user is able to track the average model as the training progresses. Thus, no additional network-wide model average is needed in DACFL. Second, we cover more typical network topologies in this work. Apart from time-invariant (or static) and dense topologies, we also consider time-varying (or dynamic) and sparse topologies. Third, non-i.i.d data assignment is considered in addition to the i.i.d case. To alleviate the possible local over-fitting problem, each user in DACFL takes its neighborhood weighted average model parameters to reinitialize its local model after each training round. Fig. 2 briefly compares DACFL with CDSGD and D-PSGD.
1. In centralized federated learning, there is a central PS that periodically aggregates different users' models into a global model and sends it back to all users. However, to implement federated learning in a decentralized network topology where no PS exists, the global aggregation phase becomes a critical problem.
The contributions of this paper can be summarized as follows:
• This paper devises a new decentralized federated learning implementation called DACFL, which is more adaptable to non-ideal network topologies than two existing methods, CDSGD and D-PSGD, w.r.t the average accuracy and the variance of accuracy. Unlike CDSGD and D-PSGD, which roughly take a neighborhood weighted average to approximate the global aggregated model, DACFL treats each user's local training as a discrete-time process and employs FODAC to estimate the average model, through which the users are able to obtain a near-average model in the absence of a PS during the training process.
• We theoretically analyze the convergence of our proposed DACFL approach on the premise of some relatively ideal assumptions. The result offers a convergence guarantee for our solution and reveals a positive correlation between the convergence speed and the learning rate and a negative correlation with the topology size. Specific experimental results also support our analysis.
• DACFL is also empirically validated through a wide range of experiments, including experiments on both i.i.d and non-i.i.d data and on both time-invariant and time-varying topologies with dense and sparse connectivity. The results of these experiments show that DACFL outperforms CDSGD and D-PSGD w.r.t Average of Acc and Var of Acc in most cases.
The remainder of this paper is organized as follows. First, following a brief introduction to conventional federated learning, Section 2 summarizes existing work on decentralized federated learning implementations and dynamic average consensus. Then in Section 3, we present the system model comprising the node model and the communication model, and mathematically formulate the decentralized federated learning problem as a combination of a distributed machine learning problem and a dynamic average consensus problem. In Section 4, we first provide a heuristic method to construct a symmetric doubly stochastic matrix and introduce the FODAC algorithm, which is then employed to design our DACFL approach. In Section 5, the convergence of DACFL is theoretically analyzed on the premise of some relatively ideal assumptions. Section 6 presents the experiments and evaluates the performance of DACFL compared with CDSGD and D-PSGD. Finally, Section 7 concludes this paper.

RELATED WORKS
In this section, we first provide a brief introduction to federated learning, and then summarize some related work about decentralized federated learning implementations and about dynamic average consensus.

Federated Learning
From the technical perspective, there are two main strategies for implementing federated learning: Horizontal Federated Learning (HFL) and Vertical Federated Learning (VFL) [7]. In HFL, participating clients share the same set of features but target different populations. In VFL, the client devices share the same population but target different sets of features. Note that throughout this paper, we focus only on HFL. For a detailed introduction to VFL, please refer to [13], [14].
The concept of federated learning was first introduced in [15], where a distributed training model is executed by a number of participants, usually called clients or users, that share local model updates with a central parameter server whose role is to aggregate these updates to build a global model. Generally, a federated learning scenario consists of two main phases, i.e., local update and global aggregation.
The global loss function over all clients C and the whole training dataset is given by (2). To solve this distributed optimization problem, existing work has offered several solutions. A necessarily incomplete list includes FedAvg [15], FedProx [16], FedPAQ [17], Turbo-Aggregate [18], FedMA [19], Semi-FL [20] and Hier-FL [21]. However, all these approaches rely on a centralized network topology where a central parameter server is necessary to execute the global aggregation phase in federated learning².

Decentralized Federated Learning Implementation
Apart from references [9] and [10], which have been introduced in Section 1, there is also some existing work devoted to deep learning with decentralized computation without the aid of a PS. In [24], the authors present a distributed learning algorithm to address the fully decentralized federated learning problem. In this framework, users take a Bayesian-like approach to iterate and aggregate the beliefs of their one-hop neighbors and collaboratively estimate the global optimal parameter. In [25], the authors propose a server-less, peer-to-peer approach for federated learning termed BrainTorrent, particularly targeted towards medical applications. In this approach, all clients are assumed to be pair-wise connected and update their models by checking the local model version against the latest model version over the network. In [26], the authors propose a decentralized federated learning design, Combo, based on the gossip protocol. They also present a model segmentation level synchronization mechanism in order to maximize the utilization of bandwidth capacities between users. However, their design also requires a fully connected network topology and randomly distributed data among the workers. In [27], the authors further extend Combo into a bandwidth-aware solution, BACombo, by greedily choosing the bandwidth-sufficient worker to reduce the transmission delay. In [28], the authors design an experimental study to compare federated learning with gossip learning, and find that gossip learning is comparable to federated learning in their results. In [29], the authors propose a blockchain-based decentralized federated learning framework termed BFLC. The framework uses blockchain for the global model storage and the local model update exchange without a central server. In [30], the authors introduce a fully decentralized federated learning framework, termed IPLS, that is partially based on the interplanetary file system (IPFS). By using IPLS and connecting to the corresponding private IPFS network, any party can initiate the training process of a model or join an ongoing training process that has already been started by another party. In [31], the authors propose the decentralized federated learning via mutual knowledge transfer (Def-KT) algorithm, where local clients fuse models by transferring their knowledge to each other.
Algorithm 1: CDSGD (baseline 1) [9]
Input: maximum epoch m; learning rate α; number of agents N.
Output: the trained models of the N users.
Randomly shuffle the corresponding data subset $D_j$.

Algorithm 2: D-PSGD on the i-th node (baseline 2) [10]
Input: initial point $x_{0,i} = x_0$; step length γ; weight matrix W; number of iterations K.
Compute the neighborhood weighted average by fetching optimization variables from neighbors.
Update the local optimization variable.
Output: the network-wide average of $x_{K,i}$. /* D-PSGD additionally needs a network-wide average compared to CDSGD */

Dynamic Average Consensus
The dynamic average consensus problem, in contrast to the more widely studied static consensus [32], refers to the problem in which a set of autonomous agents aims to track the average of individually measured time-varying signals through local communication with neighbors. It has been widely used in various fields such as formation control [33], sensor fusion [34], [35], [36], distributed estimation [37] and distributed tracking [38]. Some existing work has studied the dynamic average consensus problem for continuous-time reference inputs [34], [35], [39], [40]. In [34], the authors use standard frequency-domain techniques and show that their algorithm is able to track the average of ramp reference inputs with zero steady-state error. In the context of input-to-state stability, the authors of [39] show that a proportional dynamic average consensus algorithm can track, with bounded steady-state error, the average of bounded reference inputs with bounded derivatives. Besides, a proportional-integral dynamic average consensus algorithm is designed to track the average of constant reference inputs with sufficiently small steady-state error. In [35], the authors propose a dynamic consensus algorithm and apply it to design consensus filters. Their algorithm can track, with some bounded steady-state error, the average of a common reference input with a bounded derivative. In [40], the authors study a problem similar to that in [36] but further assume that agents know the nonlinear model which generates the time-varying reference function.
For the dynamic average consensus problem with discrete-time reference inputs, the authors of [41] have proposed a class of discrete-time dynamic average consensus algorithms and analyzed their convergence properties. Their algorithms are able to track a class of time-varying reference inputs including polynomials, logarithmic-type functions, periodic functions and other functions whose n-th-order differences are bounded, for n ≥ 1, with zero or sufficiently small steady-state error.
For our decentralized federated learning implementation in this paper, we employ the first-order dynamic average consensus (FODAC) algorithm [41] (see Algorithm 4) for each user to track the average model in the absence of a central PS.

SYSTEM MODEL AND PROBLEM FORMULATION
For federated learning in a centralized topology, the most critical stage is that the central PS aggregates the model updates from different users and sends the global model back to them iteratively. However, in a decentralized topology where there is no PS, one of the most difficult problems arises: how can we execute the global aggregation phase in a fully decentralized way in the absence of a central parameter server?
In this section, we first formulate a user as a node in an undirected weighted graph in Section 3.1. Then, how the users communicate is formulated as a communication model based on a symmetric doubly stochastic matrix (also referred to as a mixing matrix) in Section 3.2. Finally, we formally state the decentralized federated learning implementation as a minimization problem in Section 3.3.

Node Model
A node model refers to a user containing its own local dataset and training its local model using the local training data.
Assume that there are N nodes in a decentralized network topology, where each node is labeled by $i \in V = \{1, 2, \dots, N\}$. We denote the local dataset on the i-th node by $D_i$; the whole dataset formed by all local datasets is then $D = \bigcup_{i=1}^{N} D_i$. The machine learning model of node i at round t ($1 \le t \le T$) is represented by $\omega_i^t$. Generally, all nodes' machine learning models are required to be structurally identical (i.e., the depth of the neural network, each layer's type and width, etc.) and to be initialized with the same parameters, i.e., $\omega_1^0 = \omega_2^0 = \dots = \omega_N^0 = \omega^0$. Here $\omega^0$ denotes the model initialization, which usually follows Glorot initialization [42] or He initialization [43] in PyTorch. For the local update phase, each node trains its own $\omega_i^t$ using its local training dataset. Hence, the intermediate model parameters generated by each node i during the local update phase can be regarded as a discrete-time reference signal $\omega_i = \{\omega_i^0, \omega_i^1, \dots, \omega_i^T\}$, and we denote by $\bar{\omega}^t = \frac{1}{N}\sum_{i=1}^{N} \omega_i^t$ the average model over the N nodes at round t. However, to implement federated learning, a global aggregation phase is still needed in addition to the local update. So, we are now faced with the problem of how to make nodes periodically synchronize the average model in the absence of a central PS.

Communication Model
A communication model in this section refers to two rules that govern the information exchange between all nodes. One is a connectivity rule ensuring that the information of each node influences the information of every other node. The other is a rule on the connection weights that a node uses when combining its information with the information received from its neighbors. We represent a decentralized network topology formed by N nodes at round t as an undirected graph $G(t) := (V, E(t))$, where $E(t) \subset V \times V$ is the edge set. Nodes i and j are called neighbors of each other if and only if $(i, j) \in E(t)$, which means that node i and node j can directly communicate with each other at round t. For the connectivity rule, G(t) is required to be a connected graph [44], such that the information of node i can influence the information of any other node directly or indirectly. For the connection weights, we further define a mixing matrix $W(t) = [w_{ij}(t)] \in \mathbb{R}^{N \times N}$ to denote the connection weights between nodes in G(t), where $w_{ij}(t)$ is the weight that node i applies to the information received from node j. For the sake of convergence, W(t) is required to be a symmetric doubly stochastic matrix, i.e., it satisfies $W(t)\mathbf{1} = \mathbf{1}$ and $\mathbf{1}^{\mathsf T} W(t) = \mathbf{1}^{\mathsf T}$. Here $\mathbf{1} \in \mathbb{R}^{N}$ is the column vector whose entries are all ones, and $\mathsf{T}$ indicates the matrix transpose operator. The doubly stochastic property ensures that a node forms a weighted average whose weights sum to 1 when it combines other nodes' information.
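The two conditions on the mixing matrix can be checked mechanically. The following is a minimal sketch in plain Python, assuming a hypothetical 3-node path graph whose weights are chosen only for illustration (they do not come from the paper). It verifies symmetry and double stochasticity, and shows that one mixing step preserves the network-wide sum, and therefore the network-wide average.

```python
def is_symmetric_doubly_stochastic(W, tol=1e-9):
    """Check W = W^T, W 1 = 1 and 1^T W = 1^T for a list-of-lists matrix."""
    n = len(W)
    rows_ok = all(abs(sum(W[i]) - 1.0) <= tol for i in range(n))
    cols_ok = all(abs(sum(W[i][j] for i in range(n)) - 1.0) <= tol for j in range(n))
    sym_ok = all(abs(W[i][j] - W[j][i]) <= tol for i in range(n) for j in range(n))
    return rows_ok and cols_ok and sym_ok


def mix(W, values):
    """Node i replaces its value by the weighted average sum_j w_ij * values[j]."""
    n = len(W)
    return [sum(W[i][j] * values[j] for j in range(n)) for i in range(n)]


# Hypothetical path graph 1 - 2 - 3: nodes 1 and 3 are not neighbors,
# so the corresponding entries of W are zero.
W = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.5, 0.5],
]
values = [3.0, 6.0, 9.0]
mixed = mix(W, values)
```

Because every column of W sums to 1, the sum of `mixed` equals the sum of `values`, even though the individual entries change.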

Problem Formulation
On the basis of the node model and the communication model, the remaining problem is how to effectively execute the global aggregation phase without the help of a central PS. In other words, how can each node synchronize the average model in the absence of the PS? In what follows, we show that implementing the global aggregation phase in a decentralized topology can be converted into a dynamic average consensus problem, and hence a dynamic consensus algorithm can be employed to solve it effectively. Specifically, by treating the local update phase as a discrete-time process and taking $\omega_i = \{\omega_i^0, \omega_i^1, \dots, \omega_i^T\}$ as the discrete-time reference input on node i, synchronizing the average model amounts to tracking the average of all nodes' reference inputs. This is a dynamic average consensus problem, whose objective is formulated as problem (4). Here $x_i^t$ denotes the estimate of node i at round t, $\bar{\omega}^t$ denotes the average model at round t, and $x^t$ is a vector collecting the estimates of all N nodes at round t. In this paper, we advocate the first-order dynamic average consensus (FODAC) algorithm [41] (see Algorithm 4) to solve problem (4) during the decentralized federated learning process.
In addition to the consensus problem (4), another objective of federated learning is to minimize the global loss function given by (2). The objective of decentralized federated learning can therefore be summarized as problem (5). Moreover, if each node holds the same number of training samples, (5) can be rewritten as (6). In Section 4, we design the DACFL algorithm to solve this problem.

ALGORITHM DESIGN
To facilitate our proposed solution, we first need to construct a symmetric doubly stochastic matrix. Therefore, we first demonstrate how to construct it. Then, a brief introduction to the first-order dynamic average consensus algorithm is given. Finally, we propose the DACFL training algorithm for our decentralized FL implementation.

Construct a Symmetric Doubly Stochastic Matrix
A symmetric doubly stochastic matrix is defined as a square matrix W which meets $W = W^{\mathsf T}$, $W\mathbf{1} = \mathbf{1}$ and $\mathbf{1}^{\mathsf T} W = \mathbf{1}^{\mathsf T}$ [45]. Intuitively, a very simple matrix that meets these conditions is $W := \frac{1}{n}\mathbf{1}_{n \times n}$. To generate a more random matrix W, we use the following heuristic method to construct the mixing matrix used in this paper: (i) generate a doubly stochastic matrix A; (ii) transpose A to get $A^{\mathsf T}$; (iii) symmetrize: $W = \frac{1}{2}(A + A^{\mathsf T})$. The heuristic construction method is shown in Algorithm 3³.
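Steps (i) to (iii) can be sketched as follows. This is only an illustration under stated assumptions: the paper's Algorithm 3 is not reproduced here, and step (i) is filled in with alternating row and column normalization of a random positive matrix (Sinkhorn-Knopp style), which is one possible way to obtain a doubly stochastic A. The function names, the entry range and the iteration count are hypothetical choices.

```python
import random


def sinkhorn_knopp(M, iters=500):
    """Alternately normalize rows and columns of a positive square matrix.

    For a strictly positive matrix the iterates approach a doubly
    stochastic matrix; `iters` is an illustrative choice.
    """
    A = [row[:] for row in M]
    n = len(A)
    for _ in range(iters):
        for i in range(n):                      # make every row sum to 1
            s = sum(A[i])
            A[i] = [a / s for a in A[i]]
        for j in range(n):                      # make every column sum to 1
            s = sum(A[i][j] for i in range(n))
            for i in range(n):
                A[i][j] /= s
    return A


def symmetric_doubly_stochastic(n, seed=0):
    """Steps (i)-(iii): doubly stochastic A, transpose, W = (A + A^T) / 2."""
    rng = random.Random(seed)
    M = [[rng.uniform(0.1, 1.0) for _ in range(n)] for _ in range(n)]
    A = sinkhorn_knopp(M)                                        # step (i)
    return [[0.5 * (A[i][j] + A[j][i]) for j in range(n)]        # steps (ii), (iii)
            for i in range(n)]


W = symmetric_doubly_stochastic(5)
```

Symmetrization keeps the matrix doubly stochastic because the i-th row sum of W is the mean of the i-th row sum and the i-th column sum of A, both of which equal 1.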

First-Order Dynamic Average Consensus
As demonstrated in Section 3.3, by treating the intermediate models generated during the local update phase as discrete-time reference inputs, synchronizing the average model can be cast as the dynamic average consensus problem (4). To solve this problem, [41] has proposed a first-order dynamic average consensus algorithm (FODAC) and has proved that FODAC tracks the average state with either a zero steady-state error or an upper-bounded steady-state error (cf. [41]). Algorithm 4 briefly shows the pseudocode of FODAC.
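To make the tracking idea concrete, here is a minimal sketch of a first-order dynamic average consensus iteration on scalar inputs. The update form used below, $x_i^{t+1} = \sum_j w_{ij} x_j^t + (r_i^{t+1} - r_i^t)$ with $x_i^0 = r_i^0$, is an assumption about the standard first-order scheme and is not copied from Algorithm 4; the 3-node mixing matrix and the ramp inputs are likewise hypothetical.

```python
def fodac_step(W, x, r_prev, r_next):
    """One consensus step: mix neighbors' states, then add the input increment."""
    n = len(W)
    return [
        sum(W[i][j] * x[j] for j in range(n)) + (r_next[i] - r_prev[i])
        for i in range(n)
    ]


def run_fodac(W, inputs):
    """inputs[t][i] is the reference input of node i at time t."""
    x = list(inputs[0])            # assumed initialization: x_i(0) = r_i(0)
    history = [x]
    for t in range(1, len(inputs)):
        x = fodac_step(W, x, inputs[t - 1], inputs[t])
        history.append(x)
    return history


# Hypothetical 3-node path graph and ramp inputs r_i(t) = 10 * i + 0.1 * t.
W = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.5, 0.5],
]
inputs = [[10.0 * i + 0.1 * t for i in range(3)] for t in range(61)]
history = run_fodac(W, inputs)
final_avg = sum(inputs[-1]) / 3
final_err = max(abs(v - final_avg) for v in history[-1])
```

With a doubly stochastic W and this initialization, the mean of the states equals the mean of the inputs at every step; the mixing term then pulls the individual states toward that common mean.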

DACFL Algorithm
As formulated in (5) (or (6)), the objective of the decentralized federated learning implementation is to simultaneously minimize a global loss function f(ω) and solve a dynamic average consensus problem. For the former sub-problem, distributed stochastic gradient descent can be used, while for the latter sub-problem, we employ FODAC. By combining distributed gradient descent and the FODAC method, we devise an algorithm called Dynamic Average Consensus based Federated Learning (DACFL).
3. The specific "sparse matrix" in Section 6 is generated by the Sinkhorn-Knopp algorithm [46].
Algorithm 4: First-order dynamic average consensus [41]
Input: reference inputs of N nodes.
Three main steps constitute our DACFL algorithm: (i) each node trains its own model using its local data; (ii) each node computes a neighborhood weighted average model $\tilde{\omega}_i^t$ by exchanging its intermediate model with its neighbors; (iii) each node performs FODAC to track the average model $\bar{\omega}^t$. More specifically, step (i) corresponds to the local update phase in federated learning. In step (ii), we further use the neighborhood weighted average model $\tilde{\omega}_i^t$ to reinitialize the user's local model (line 6, Algorithm 5), which also differentiates DACFL from CDSGD and D-PSGD. As empirically demonstrated in our experimental results, such re-initialization is more robust to sparse topologies and non-i.i.d data since it avoids the local over-fitting problem to some extent. In step (iii), we employ FODAC to estimate the average model of all nodes in a fully decentralized manner, which handles the model fusion of the global aggregation phase without a central PS. The complete DACFL training algorithm is shown in Algorithm 5.
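The interplay of the three steps can be illustrated on a toy problem. The sketch below is not the paper's Algorithm 5: the ordering of the steps within a round, the scalar "models", the quadratic local losses $f_i(\omega) = \frac{1}{2}(\omega - c_i)^2$, the targets $c_i$ and the mixing matrix are all assumptions made for illustration. Each round computes the neighborhood weighted average, takes one gradient step from that reinitialized point, and feeds the change of the local model into a first-order consensus update.

```python
def dacfl_round(W, omega, x, grads, lr):
    """One illustrative round on scalar models (assumed step ordering)."""
    n = len(W)
    # Step (ii): neighborhood weighted average, used to reinitialize each model.
    omega_tilde = [sum(W[i][j] * omega[j] for j in range(n)) for i in range(n)]
    # Step (i): local update, here a single gradient step from the reinitialized model.
    omega_new = [omega_tilde[i] - lr * grads[i](omega_tilde[i]) for i in range(n)]
    # Step (iii): first-order consensus update driven by the model increment.
    x_new = [
        sum(W[i][j] * x[j] for j in range(n)) + (omega_new[i] - omega[i])
        for i in range(n)
    ]
    return omega_new, x_new


# Hypothetical setup: 3 nodes on a path graph, local loss 0.5 * (w - c_i)^2.
W = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.5, 0.5],
]
targets = [1.0, 4.0, 10.0]
grads = [lambda w, c=c: w - c for c in targets]   # gradient of each local loss

omega = [0.0, 0.0, 0.0]      # identical initialization on all nodes
x = list(omega)              # consensus states start at the initial models
for _ in range(300):
    omega, x = dacfl_round(W, omega, x, grads, lr=0.1)

global_minimizer = sum(targets) / len(targets)   # minimizer of the average loss
```

In this toy setting the local models end up slightly different from one another, while every consensus state `x[i]` approaches the minimizer of the average loss. This is the sense in which each user can hold a near-average model without a PS.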

CONVERGENCE ANALYSIS
Without loss of generality, each user is assumed to hold the same number of training samples in this paper, such that we can denote the loss function as $f(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x)$. To complete the analysis, we make the following assumptions on the loss functions.
Algorithm 5: DACFL (our solution)
Input: mixing matrix; number of communication rounds T; number of nodes N; learning rate λ.
Output: consensus states of the N nodes.
Compute the neighborhood weighted average model.
In conventional centralized learning, the gradient $\nabla f(\omega)$ is computed on the whole dataset. However, in decentralized federated learning, each user holds a local data shard for its own training and gradient computation $\nabla f_i(\omega)$. To characterize the user-wise gradients $\nabla f_i(\omega)$, we make Assumptions 2 and 3 on the premise of i.i.d data.

Assumption 2. (Bounded Gradients)
For each user i and a randomly sampled data batch $\zeta_i$, there exists G > 0 such that the gradient computed on $\zeta_i$ is bounded in norm, with the bound determined by G.

Assumption 3. (Uniform Gradient First-order Difference)
At any round t, consider the first-order difference of each user's gradient. This paper assumes an i.i.d data distribution over all users for the convergence analysis, such that this first-order difference is uniform across users.
Note that the mixing matrix W(t) of the communication network topology G(t) plays a pivotal role in our algorithm and is subject to the following assumption.

Assumption 4. (Symmetric Doubly Stochastic Matrix)
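At any round t, the mixing matrix W(t) is symmetric and doubly stochastic, such that $W(t) = W(t)^{\mathsf T}$, $W(t)\mathbf{1} = \mathbf{1}$ and $\mathbf{1}^{\mathsf T} W(t) = \mathbf{1}^{\mathsf T}$. Additionally, we make Assumption 5 to ensure that the average model parameter can be well tracked.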

Assumption 5. (Bounded First-order Differences of Model Parameter)
At any round t, there exists a constant θ > 0 that upper-bounds the first-order differences of each user's model parameters. In practice, this can be guaranteed by choosing a sufficiently small learning step size λ and a proper activation function.
Following Assumption 5, we can easily derive a bounded first-order difference similar to Eq. (1) in [41], stated as (13), where κ is an upper bound related to θ. Bound (13) ensures that FODAC tracks the average of the time-varying reference inputs, i.e., the model parameters $\omega_i^t$, with a sufficiently small steady-state error. For a detailed proof of this conclusion, please refer to [41]. Note that tracking the average of matrix-form reference inputs can be reformulated as tracking the average of scalar-form reference inputs by considering the matrix as a set of element-wise scalars.
Following the above assumptions, we present the convergence rate of our proposed algorithm.In our result, we follow the convention in literature [10] to use the average expected squared gradient norm to characterize the convergence rate.
Theorem 1. Under the aforementioned assumptions, the average expected squared gradient norm satisfies the bound (14), where $f^*$ denotes the minimum loss value of f(x).
Note that in (14), the average squared gradient norm is bounded by $C_0 + C_1$, where $C_0$ gradually tends to 0 as the number of training iterations T increases. In other words, the average squared gradient norm is bounded by a learning-rate-related term $C_1$ when $T \to +\infty$. So far, we have established the convergence when the average model parameter $\bar{\omega}^t$ is taken as the input of the loss function f. Moreover, the final output of the DACFL algorithm, $x^T = [x_1^T, x_2^T, \dots, x_N^T]_{1 \times N}$, can track $\bar{\omega}^T$ with a small steady-state error under the above assumptions. (Please refer to [41] for the convergence guarantee of FODAC.) This further offers a convergence guarantee for our DACFL solution.
The detailed proof of Theorem 1 is deferred to the appendix of this paper.

EXPERIMENTS AND PERFORMANCE EVALUATION
In this section, we first describe the experimental setup and then evaluate the performance of DACFL on different datasets with various topologies and data allocations. Note that the curves in the following experimental results are smoothed by a Savitzky-Golay filter [47].

Dataset Allocation
We design two ways for data allocation in our experiments, i.e., i.i.d and non-i.i.d allocations.(i) i.i.d: each user is assigned the same number of training samples with a uniformly random distribution over 10 classes.(ii) non-i.i.d: the training set is sorted by labels first and then divided into multiple shards with the same number of training samples.Finally, each user samples only 2 shards of data without replacement.Note that in this setup, the non-i.i.d allocation only considers the class imbalance of data [22].
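The two allocation schemes can be sketched on sample indices as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and the choice of exactly 2 × (number of users) shards, so that every sample is assigned, is an assumption.

```python
import random


def iid_split(labels, num_users, seed=0):
    """i.i.d: shuffle all sample indices and give each user an equal slice."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    per_user = len(idx) // num_users
    return [idx[u * per_user:(u + 1) * per_user] for u in range(num_users)]


def shard_split(labels, num_users, shards_per_user=2, seed=0):
    """non-i.i.d: sort by label, cut into equal shards, hand out shards without replacement."""
    rng = random.Random(seed)
    idx = sorted(range(len(labels)), key=lambda k: labels[k])
    num_shards = num_users * shards_per_user          # assumed shard count
    shard_size = len(idx) // num_shards
    shards = [idx[s * shard_size:(s + 1) * shard_size] for s in range(num_shards)]
    order = list(range(num_shards))
    rng.shuffle(order)                                 # random shard assignment
    return [
        [k for s in order[u * shards_per_user:(u + 1) * shards_per_user]
         for k in shards[s]]
        for u in range(num_users)
    ]


# Toy label vector: 1000 samples, 10 classes, 100 samples per class.
labels = [k % 10 for k in range(1000)]
iid_users = iid_split(labels, num_users=10)
noniid_users = shard_split(labels, num_users=10)
```

In the toy setting above each shard contains a single class, so every user in the non-i.i.d allocation sees at most two of the ten classes, which is the class imbalance the setup is meant to create.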

Communication Network Topology
As is demonstrated in section 3, we can represent the communication network topology with the mixing matrix.
In our experiments, we design the topology from different perspectives. (i) Time-varying and time-invariant: for the time-invariant topology, we initialize the mixing matrix before training and keep it unchanged during the training process; for the time-varying topology, we reconstruct the mixing matrix every 10 training rounds. (ii) Sparse and dense connectivity: for the sparse (ψ=0.5) topology, half of the entries of the mixing matrix are 0; for the dense (ψ=1.0) topology, all entries of the mixing matrix are non-zero⁴. All the mixing matrices considered above are symmetric and doubly stochastic.

Neural Network Structure
For the image classification tasks, we use the same convolutional neural network (CNN) for MNIST and Fashion-MNIST, which contains two 5×5 convolutional layers (each layer is followed with a […] [15]. Unless otherwise specified, the important hyperparameters in our experiments are set as in Table 1.
4. Note that we call a matrix 'sparse' here just for simplicity, which is not strictly equivalent to a "sparse matrix". The "sparse matrix" here is generated by the Sinkhorn-Knopp algorithm [46].

Baselines and Performance Metric
In this work, we compare our DACFL with a conventional centralized federated learning method, i.e., FedAvg, and two decentralized federated learning implementations called CDSGD [9] and D-PSGD [10]. For the FedAvg implementation in this paper, we consider a centralized topology with the same number of users as the decentralized topology, where all users are ensured to participate in each training round. For CDSGD and D-PSGD, the mixing matrix and other hyper-parameters are consistent with DACFL. Additionally, we assume that there still exists a "god node" for D-PSGD to perform the network-wide model average, hence generating a global model used for the performance test.

Why Choose FODAC? A Numerically Empirical Perspective
To clarify how FODAC benefits our solution, we design some specific numerical experiments in this section. More specifically, we separately apply FODAC, CDSGD and D-PSGD to track the average of two types of discrete-time inputs under three different mixing matrices. Inputs I is a class of discrete-time inputs with relatively large variance between users, and Inputs II is a class of discrete-time inputs with relatively small variance between users, where $R_i(t)$ denotes the input of the i-th user at time t.
For the mixing matrices, we consider (i) sparse: a 10×10 mixing matrix with ψ=0.5, i.e., half of the entries are 0; (ii) dense: a 10×10 mixing matrix with ψ=1.0, i.e., all entries are non-zero; (iii) uniform: a 10×10 mixing matrix with all entries being 0.1. Note that all matrices considered here are symmetric and doubly stochastic.
For CDSGD, we simply take the neighborhood weighted average as the estimated value (see line 6 in Algorithm 1); for D-PSGD, the estimated value is the result of executing an additional network-wide average on the estimation by CDSGD; for FODAC, we take the consensus state (see line 3 in Algorithm 4) as the estimated result. Then, the absolute error is computed between the estimated value $\hat{R}_i(t)$ and the average of inputs $\bar{R}(t) = \frac{1}{10}\sum_{i=1}^{10} R_i(t)$. The results are shown in Fig. 3.
As can be seen in Fig. 3(a), FODAC is superior to CDSGD under both sparse and dense mixing matrices when users' reference inputs have large variance. It can be concluded that if CDSGD is employed to approximate the average of inputs, the estimation error and the large deviation between users are not negligible. CDSGD becomes feasible only when the inputs have small variance or when the mixing matrix is uniform. Nonetheless, FODAC still slightly outperforms CDSGD in convergence speed when users' inputs have small variance under both sparse and dense mixing matrices, as shown in Fig. 3(b). Note that D-PSGD outperforms the other two methods because it additionally executes a network-wide average, which enables it to compute the average value of the inputs accurately. However, such a network-wide average is unacceptable in a fully decentralized topology, where no PS is available to perform it.
From this empirical result, we conclude that FODAC is more adaptable than CDSGD and D-PSGD for approximating the average in a fully decentralized topology. This also motivates us to apply FODAC to decentralized federated learning, since (i) the common non-i.i.d data usually leads to a large variance between users' local models; (ii) a non-ideal communication topology (a non-uniform mixing matrix) often arises in practice; (iii) there is no central PS to perform a network-wide model average.
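The comparison above can be reproduced in spirit with the small sketch below. It uses a common form of the first-order dynamic average consensus update, x(t+1) = W x(t) + R(t+1) − R(t), initialized at x(0) = R(0); the exact indexing in Algorithm 4 may differ. The ring-shaped mixing matrix and the input sequence are hypothetical stand-ins for the sparse matrix and for Inputs I.

```python
def mat_vec(w, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

def ring_mixing_matrix(n):
    """Symmetric doubly stochastic matrix: 1/2 on the diagonal, 1/4 to each ring neighbour."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        w[i][i] = 0.5
        w[i][(i - 1) % n] += 0.25
        w[i][(i + 1) % n] += 0.25
    return w

def fodac_track(w, inputs):
    """inputs[t][i] is user i's reference input at time t; returns the final estimates."""
    x = list(inputs[0])                       # initialise at the first input
    for t in range(1, len(inputs)):
        mixed = mat_vec(w, x)                 # consensus step with the neighbours
        x = [m + r_new - r_old                # add the first-order input difference
             for m, r_new, r_old in zip(mixed, inputs[t], inputs[t - 1])]
    return x

def neighbourhood_average(w, inputs):
    """CDSGD-style estimate: weighted average of the neighbours' current inputs."""
    return mat_vec(w, inputs[-1])

n, steps = 10, 301
w = ring_mixing_matrix(n)
# Hypothetical inputs with large variance between users and a slowly settling drift.
inputs = [[10.0 * i + i / (t + 1) for i in range(n)] for t in range(steps)]
true_avg = sum(inputs[-1]) / n
fodac_estimate = fodac_track(w, inputs)
fodac_err = max(abs(x - true_avg) for x in fodac_estimate)
cdsgd_err = max(abs(x - true_avg) for x in neighbourhood_average(w, inputs))
```

Since the columns of W sum to 1, the mean of the FODAC states always equals the mean of the current inputs, so only the disagreement between users has to decay. A single neighborhood average has no such mechanism and keeps a large error whenever the inputs differ strongly and the matrix is not uniform.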

Performance on i.i.d Data
Figs. 4 and 5 show the performance of different algorithms with i.i.d data allocation under time-invariant and time-varying topologies, respectively. The hyper-parameters in this experiment follow Table 1. First, the proposed DACFL outperforms D-PSGD and CDSGD in terms of Average of Acc, albeit slightly inferior to the conventional FedAvg. Specifically, from Fig. 4(a), 4(b) and 4(c), DACFL finally reaches 97%, 86%, 70% accuracy under a dense topology and 96%, 85%, 67% accuracy under a sparse topology on the MNIST, FMNIST and CIFAR-10 datasets, respectively. This is superior to the result of CDSGD, with 93%, 64%, 18% accuracy under a dense topology and 68%, 51%, 20% accuracy under a sparse topology, and the result of D-PSGD, with 97%, 83%, 55% accuracy under a dense topology and 95%, 75%, 45% accuracy under a sparse topology. Since D-PSGD additionally performs a network-wide model average over all users, it holds a better Average of Acc than CDSGD. In contrast, without such an average, there exist different levels of deviation between users' intermediate training models, which may depend heavily on the unevenness of the decentralized communication network topology. By means of FODAC, each user in DACFL is able to approximate the "near average" model well. This is exactly why DACFL outperforms CDSGD and D-PSGD in this experiment.
Second, DACFL is less sensitive to the sparsity of the communication topology. That is, DACFL shows minimal degradation in the Average of Acc when a sparse topology arises. Specifically, DACFL has accuracy reductions of 1%, 1%, 3% on the three datasets, while D-PSGD has reductions of 2%, 8%, 10% on the three datasets and CDSGD has reductions of 25%, 13% on MNIST and FMNIST.⁵ This result indicates that both CDSGD and D-PSGD impose tighter topology requirements than DACFL for convergence guarantees.
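The reductions quoted above follow directly from the final accuracies reported for Fig. 4(a) to 4(c). The short check below recomputes them as dense accuracy minus sparse accuracy, in percentage points; the dictionary layout is only an illustration, and CIFAR-10 is left out for CDSGD as in the text.

```python
# Final Average of Acc (%) quoted in the text, ordered as MNIST, FMNIST, CIFAR-10.
dense = {"DACFL": [97, 86, 70], "D-PSGD": [97, 83, 55], "CDSGD": [93, 64]}
sparse = {"DACFL": [96, 85, 67], "D-PSGD": [95, 75, 45], "CDSGD": [68, 51]}

# Accuracy drop caused by moving from the dense to the sparse topology.
drops = {name: [d - s for d, s in zip(dense[name], sparse[name])] for name in dense}
for name, values in drops.items():
    print(name, values)
```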
Third, users perform more similarly to each other in DACFL than in CDSGD. From Fig. 4(d), 4(e) and 4(f), it can be seen that the variance of accuracy of DACFL is smaller and more stable than that of CDSGD, and gradually tends to around 0 as the training progresses, especially on MNIST. This result also supports that the average model can be well tracked by DACFL through the FODAC consensus method.
In summary, the DACFL approach outperforms D-PSGD and CDSGD on i.i.d data under time-invariant topology.

Time-varying Topology
In this section, we investigate how the time-varying topology affects the performance of DACFL and compare it with CDSGD and D-PSGD. Fig. 5 presents the result on i.i.d data under time-varying topology.
From the perspective of Average of Acc, DACFL still outperforms D-PSGD and CDSGD. Taking the result on the FMNIST dataset as an example (Fig. 5(b)), DACFL finally reaches 87% accuracy, which is better than the 84% accuracy of D-PSGD and the 68% of CDSGD. Intuitively, FedAvg with a centralized topology performs better than the decentralized implementations, as it has a central parameter server to do the global aggregation phase. This would not be affected by the varying decentralized topology considered in this section.
5. Since CDSGD does not converge on CIFAR-10 after 100 rounds, it is not counted here.
Besides, a time-varying topology has greater randomness than a time-invariant topology. Due to this randomness, the accuracy degradation caused by the topology sparsity becomes smaller for all decentralized implementations considered in this section. In particular, for D-PSGD on FMNIST (Fig. 5(b)) and CIFAR-10 (Fig. 5(c)), the average accuracy under a sparse topology is even greater than that under a dense topology. This might be because the randomness introduced by the time-varying topology reduces the possibility of early over-fitting that may be caused by a sparse topology.
Finally, for the Var of Acc shown in Fig. 5(d), 5(e) and 5(f), a result similar to that of a static topology arises. As shown in the figures, DACFL holds a smaller and more stable variance of accuracy than CDSGD on both dense and sparse topologies over all the datasets. This confirms the effectiveness of our approach again.
In summary, the DACFL still has higher feasibility in the case of time-varying topology.

Performance on non-i.i.d Data
Section 6.3 has shown the result of DACFL on i.i.d data and demonstrated its practicality under both time-invariant and time-varying topologies. In this section, we test the performance of DACFL on non-i.i.d data and show the result in Fig. 6 and Fig. 7. The hyper-parameters in this experiment follow Table 1.

Time-invariant Topology
Fig. 6 presents the experimental result on non-i.i.d data under time-invariant topology.
First, for the Average of Acc, FedAvg and all three decentralized federated learning implementations suffer different levels of accuracy degradation on the three datasets. This is because the non-i.i.d property leads to divergence of users' local models and early over-fitting. Since D-PSGD additionally performs a network-wide model average, it has higher accuracy than DACFL and CDSGD on MNIST and FMNIST.⁶ However, we should note that a network-wide model average usually becomes impractical when users are widely scattered in a very large decentralized topology, since such network-wide communication would cause an unacceptable overhead. Therefore, when no network-wide model average is available, our DACFL outperforms CDSGD. Taking the result under dense topology as an example, DACFL reaches an average accuracy of 86% and 70%, while CDSGD only reaches 58% and 40% on MNIST and FMNIST, respectively.
Second, the numerical result of Var of Acc in Fig. 6(d), 6(e) and 6(f) also shows the superiority of DACFL over CDSGD. The accuracy variance is larger than that on i.i.d data due to the non-i.i.d property. However, we can still see that the accuracy variance of DACFL gradually decreases and stabilizes as the training progresses.
In summary, the DACFL is also feasible on non-i.i.d data (MNIST and FMNIST) under time-invariant topology.
6. Since none of the three DFL approaches converges on CIFAR-10, the result on CIFAR-10 is not counted here.
To sum up, the DACFL is also viable on non-i.i.d data (MNIST and FMNIST) under time-varying topology.

Convergence Speed vs Learning Rate and Topology Size
To figure out how the learning rate and network topology size affect our solution, we also log the average test accuracy and average training loss on i.i.d MNIST with different learning rates and topology sizes. Fig. 8 shows the numerical results. Note that there is no decay of the learning rate and all topologies are dense in this part of the experiments. Except for the learning rate and topology size, other hyperparameters in this experiment follow Table 1.
Within the range 0.001 ≤ λ ≤ 0.01, a larger λ converges faster, because a larger learning rate makes the loss function decrease with a larger step size. However, this situation changes when 0.05 ≤ λ ≤ 0.1, i.e., when λ increases from 0.05 to 0.1, the convergence speed and convergence result become even worse. Smaller average test accuracy and larger average training loss with greater fluctuation precisely reflect this phenomenon. This is because an excessive learning rate λ would lead to a larger upper bound on the first-order difference of the model parameter θ and thus cause a larger relative upper bound of the first-order difference κ in (13). Consequently, a larger relative bound would lead to a larger steady-state error when using FODAC to track the average of users' models [41]. Hence, an excessive learning rate would be unfriendly to our DACFL solution. From Fig. 8(c), it can be seen that a greater learning rate leads to a greater variance of accuracy. So, λ = 0.01 should be the best choice in this experiment, which gets a higher average test accuracy and lower variance while ensuring fast convergence.
Fig. 8(d) and 8(e) show that, as the topology size N grows, the convergence speed slows down. Also, the larger the topology size, the lower the final test accuracy. This is because a larger topology would lead to a larger deviation among all users, which further causes a slower rate of FODAC tracking.
In summary, a proper learning rate can accelerate the convergence of our DACFL training. Also, although DACFL is robust to different topology sizes, a smaller size is preferred to attain better performance within a limited number of training rounds.

CONCLUSIONS
Over-reliance on the central PS may paralyze federated learning when the server breaks down. To alleviate this single point of failure in conventional FL, existing research has offered different DFL implementations, including CDSGD and D-PSGD. However, there exists significant variance between users' final models in CDSGD, while D-PSGD necessitates a network-wide model average. In this paper, we devise a new DFL method, coined DACFL, to address these deficiencies of CDSGD and D-PSGD. DACFL treats the respective local training processes as discrete-time series and employs FODAC to track the average model over all users. To confirm the feasibility of DACFL, we also deliver a theoretical analysis under some assumptions, which offers a convergence guarantee for our solution. Besides, we design specific experiments on MNIST, Fashion-MNIST and CIFAR-10 under i.i.d and non-i.i.d allocations, and compare DACFL with D-PSGD and CDSGD. The results verify the effectiveness of DACFL under different network topologies and show that DACFL outperforms D-PSGD and CDSGD in most cases.
Several issues need further investigation. First, DACFL solves the problems of CDSGD and D-PSGD at the expense of more communication overhead, because each user exchanges both estimation states and local models during the training process. Therefore, a more communication-efficient method for DACFL deserves investigation. Second, since this work only considers DACFL with synchronized settings, asynchronous decentralized federated learning deserves investigation for practical application. Third, because DACFL works effectively only when the mixing matrix is symmetric and doubly stochastic, users dropping out or joining during the training process will change the nature of the mixing matrix and thus have negative effects on this method. Thus, designing a DACFL that is aware of users going offline and joining would be worthy of future research.

APPENDIX A PROOF OF THEOREM 1

A.1 Preliminaries
In this section, we give an upper bound on the expected average squared gradient norms, which serves as a metric to measure the convergence rate for the non-convex objectives.
Before the detailed proof, we introduce some notation to avoid ambiguity. We denote by the mixing matrix $W = [w_{ij}(t)] \in \mathbb{R}^{N\times N}$ a fully decentralized communication network topology with N users, where $x_i^t$ and $\omega_i^t$ represent the estimation and the intermediate model of user i at round t, respectively, with $\bar{\omega}^t = \frac{1}{N}\sum_{i=1}^{N}\omega_i^t$ defined as the average model of all users at round t. Besides, we denote by $\sum_{j=1}^{N} w_{ij}(t)\,\omega_j^t$ the neighborhood weighted average model of user i, and use $g_i^t = \nabla f_i(\omega_i^t, \zeta_i^t)$ to denote the stochastic gradient of user i at round t, where $\zeta_i^t \subseteq D_i$ is the uniformly sampled mini-batch from the i-th user's data shards at round t. With $\top$ denoting the transpose of a matrix, we also define notations for the sets of estimations and intermediate models at round t, respectively. In the following proof, we consider a time-invariant topology where $W(t) = W$, such that $w_{ij}(t) = w_{ij}$, $t = 0, 1, \ldots, T-1$.

A.2 Proof of Theorem 1
Proof: According to the updating rules in Algorithm 5 (line 4 to line 6), we have […], where $w_{ij}(t) \in W(t)$ denotes the (i, j)-th entry of the mixing matrix, thus […], where the (17) […]. So, we can bound $T_0$ following (18), […]. Given the L-smooth assumption, the following inequality holds:
$$\mathbb{E} f(\bar{\omega}^{t+1}) \le \mathbb{E} f(\bar{\omega}^{t}) + \mathbb{E}\left\langle \nabla f(\bar{\omega}^{t}),\, \bar{\omega}^{t+1} - \bar{\omega}^{t} \right\rangle + \frac{L}{2}\,\mathbb{E}\left\| \bar{\omega}^{t+1} - \bar{\omega}^{t} \right\|^{2}.$$
Substituting (18) and (21) into (20), we have […], where (e) follows by the inequality […], where (h) follows because $T_3 \ge 0$, and (i) and (j) follow from the inequality […] and (18), respectively. Rearranging (24), we have […]. For (25), we sum it over $t \in \{0, 1, 2, \ldots, T-1\}$ first and then divide both sides by T, […], where $f^*$ is the optimum of the loss function f, and this completes the proof.
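The last step of the proof, summing over t and dividing by T, relies on a telescoping sum. As a sketch that does not reproduce the exact constants of (25), the function-value differences collapse as follows, using only that $f^*$ is a lower bound of f:

```latex
\frac{1}{T}\sum_{t=0}^{T-1}\Big(\mathbb{E} f(\bar{\omega}^{t}) - \mathbb{E} f(\bar{\omega}^{t+1})\Big)
= \frac{\mathbb{E} f(\bar{\omega}^{0}) - \mathbb{E} f(\bar{\omega}^{T})}{T}
\le \frac{\mathbb{E} f(\bar{\omega}^{0}) - f^{*}}{T}.
```

The right-hand side vanishes as T grows, which is consistent with the behaviour described for $C_0$ in the discussion of (14).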

Fig. 2. Comparison between CDSGD, D-PSGD and DACFL (our solution). D-PSGD differs from CDSGD in that it additionally needs a network-wide model average before the output. Our solution differs from CDSGD and D-PSGD by employing FODAC to approximate the "averaged model" over all users. An iteration of "1→2→3" for CDSGD and D-PSGD, or an iteration of "1→2→3→4" for DACFL, is called a training round (i.e., a communication round) in this paper. For more details, please refer to Algorithms 1, 2 and 5.
In the local update phase, clients compute the gradients to minimize the underlying loss function using their local data. In the global aggregation phase, the central parameter server collects the model updates from different clients, aggregates these model updates to form a global model and then sends the aggregated result back to the clients for their next training epoch. Formally, suppose a subset of clients $C \subseteq N$ is selected by the PS at training epoch $t \le T$. Each client $c \in C$ keeps a local training dataset $D_c = \{X_c, Y_c\}$, where $X_c \in \mathbb{R}^{|D_c|\times d}$ represents the feature space of client c's training data and $Y_c \in \mathbb{R}^{|D_c|\times m}$ is the associated label space of $D_c$. Let $\ell(\omega; x_i, y_i)$ denote the loss function of data sample $x_i$, where ω denotes the parameters of the neural network; then the local loss function of client c over training dataset $D_c$ can be expressed as […], where $D = \bigcup_c D_c$ represents the whole training dataset over the client subset C and $|D| = \sum_{c=1}^{|C|} |D_c|$ denotes the total number of data samples.
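The global aggregation phase can be illustrated with the sketch below, which assumes the standard FedAvg weighting in which each client's model counts in proportion to its share $|D_c| / |D|$ of the data. Models are represented as flat lists of floats, and the function name is hypothetical.

```python
def fedavg_aggregate(client_models, client_sizes):
    """Weighted average of client parameter vectors, with weights |D_c| / |D|."""
    total = sum(client_sizes)
    dim = len(client_models[0])
    return [
        sum(size * model[k] for model, size in zip(client_models, client_sizes)) / total
        for k in range(dim)
    ]

# Two hypothetical clients holding 1 and 3 samples, respectively.
global_model = fedavg_aggregate([[1.0, 2.0], [3.0, 6.0]], [1, 3])
```

The server would then send `global_model` back to the selected clients as the starting point of their next training epoch.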

2 Distribute the training dataset to N agents;
3 for each agent do
4     for k = 0 : m do

Fig. 3. The result of approximating the average by different algorithms.

Fig. 4. Performance comparison with i.i.d data and time-invariant topology (a communication round also corresponds to a training round in our experiments).

6.5.1 Performance vs Learning Rate λ
Fig. 8(a), 8(b) and 8(c) show the average test accuracy, the average training loss and the variance of test accuracy with different learning rates, respectively. From Fig. 8(a) and 8(b), we can see that a larger λ leads to faster convergence within the range 0.001 ≤ λ ≤ 0.01.

6.5.2 Performance vs Topology Size N
Fig. 8(d), 8(e) and 8(f) present the numerical results on different topology sizes.

TABLE 1
To make the results clearer, we design two metrics, Average of Acc and Var of Acc, to indicate the performance of different methods. Specifically, we test each user's trained model (or estimated model in DACFL) and obtain the test accuracy of all users. Then the Average of Acc is computed by averaging all users' test accuracy, and the Var of Acc is the variance over all users' test accuracy. Generally, a superior decentralized federated learning method is expected to obtain a higher Average of Acc and a smaller Var of Acc. Note that for FedAvg and D-PSGD, where the final output is a single global model, the Average of Acc is the same as the test accuracy and the Var of Acc is 0.
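The two metrics can be computed as in the minimal sketch below. The use of the population variance (rather than the sample variance) is an assumption, as are the function names and the example accuracies.

```python
from statistics import fmean, pvariance

def average_of_acc(user_accuracies):
    """Mean test accuracy over all users."""
    return fmean(user_accuracies)

def var_of_acc(user_accuracies):
    """Population variance of the test accuracy over all users (assumed convention)."""
    return pvariance(user_accuracies)

# Hypothetical per-user test accuracies of a decentralized method.
per_user = [0.90, 0.80, 1.00]
avg, var = average_of_acc(per_user), var_of_acc(per_user)

# A method with a single global model (e.g. FedAvg) has one accuracy and zero variance.
global_avg, global_var = average_of_acc([0.97]), var_of_acc([0.97])
```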