Towards Federated Learning With Byzantine-Robust Client Weighting



Introduction
Federated Learning (FL) [13,17,12,4] is a distributed machine learning paradigm where training data resides at autonomous client machines and the learning process is facilitated by a central server. The server maintains a shared model and alternates between requesting clients to try to improve it and integrating their suggested improvements back into that shared model.
A few challenges arise from this model. First, the need for communication efficiency, both in terms of the size of data transferred and the number of messages required for reaching convergence. Second, clients are outside the control of the server and as such may be unreliable, or even malicious. Third, while classical learning models generally assume that data is homogeneous, here privacy and the aforementioned communication concerns force us to deal with the data as it is seen by the clients; that is, 1) non-IID (not identically and independently distributed): data may depend on the client it resides at, and 2) unbalanced: different clients may possess different amounts of data.
In previous works [9,2,15,10,18], unbalancedness is either ignored or is represented by a collection of a priori known client importance weights, usually derived from the amount of data each client has. This work investigates aspects that stem from this unbalancedness. Concretely, we focus on the case where unreliable clients declare the amount of data they have and may thus adversely influence their importance weights. We show that without some mitigation, a single malicious client can obstruct convergence in this manner even in the presence of popular FL defense mechanisms. Our experiments consider protections that replace the server step with a robust mean estimator, such as the median [8,21,6] and the trimmed mean [21].
The rest of this paper is organized as follows. In Section 2, we present required definitions and formalize the problem addressed by this work. Section 3 presents our truncation-based preprocessing method and proves that it can be applied to a randomly-selected sample of client weights. In Section 4, we report on the results of our empirical evaluation. Conclusions and directions for future work are presented in Section 5.

Problem Setup
In the following sections we denote the vector of client sample sizes as N = (n_1, n_2, ..., n_K) and assume, w.l.o.g., that it is sorted in increasing order.

Collaboration Model
We restrict ourselves to the FL paradigm, which leaves the training data distributed among client machines, and learns a shared model by iterating between client updates and server aggregation.
Additionally, a subset of the clients, marked B, can be Byzantine, meaning they can report arbitrary, possibly malicious results for their local updates.
Moreover, unlike previous works, we also consider clients' sample sizes to be unreliable because they are reported by possibly Byzantine clients. When the distinction is important, values that are sent by clients are marked with an overdot to signify that they are unreliable (e.g., ṅ_k), while values that have been preprocessed in some way are marked with a tilde (e.g., ñ_k).

Federated Learning Meta Algorithm
We build upon the baseline federated averaging algorithm (FedAvg) described by [17]. There, it is suggested that in order to save communication rounds, clients perform multiple stochastic gradient descent (SGD) steps while a central server occasionally averages the parameter vectors.
The intuition behind this approach becomes clearer when we mark the k-th client's ERM objective function by F_k(w) := (1/n_k) Σ_{z∈Z_k} ℓ(w; z) and observe that the objective function in equation (1) can be rewritten as a weighted average of clients' objectives. Similarly to previous works [18,7,6], we capture a large set of algorithms by abstracting FedAvg into a meta-algorithm for FL (Algorithm 2). We require three procedures to be specified by any concrete algorithm: 1. Preprocess receives possibly Byzantine ṅ_k's from the clients and produces secure estimates marked as ñ_k's; to the best of our knowledge, previous works ignore this procedure and assume that the n_k's are correct. 2. ClientUpdate performs local training at a client and returns its suggested model. 3. Aggregate combines the clients' suggested models into a new shared model.

Preliminaries
The following assumption is common among works on Byzantine robustness: Assumption 1 (Bounded Byzantine proportion). The proportion of clients that are Byzantine is bounded by some constant α; i.e., (1/K)|B| ≤ α.
The next assumption is a natural generalization when considering unbalancedness: Assumption 2 (Bounded Byzantine weight proportion). The proportion between the combined weight of Byzantine clients and the total weight is bounded by some constant α*; i.e., (1/n) Σ_{k∈B} n_k ≤ α*. Previous works on robust aggregation [9,2,15,10,22] either used Assumption 1, without considering the unbalancedness of the data, or implicitly used Assumption 2. However, we observe that Assumption 2 is unattainable in practice since Byzantine clients can often influence their weight. We address this gap with the following definition and an appropriate Preprocess procedure.
Definition 1 (mwp). Given a proportion p and a weights vector V = (v_1, ..., v_{|V|}) sorted in increasing order, the maximal weight proportion, mwp(V, p), is the maximum combined weight proportion for any p-proportion of the values of V: mwp(V, p) := (Σ_{i > (1−p)|V|} v_i) / (Σ_{i=1}^{|V|} v_i). Note that this is just the weight proportion of the p|V| clients with the largest sample sizes.
In the rest of this work we assume Assumption 1 and design a Preprocess procedure that ensures the following: mwp(Preprocess(N), α) ≤ α*. (3) Observe that this requirement enables the use of weighted robust mean estimators in a realistic setting by ensuring that Assumption 2 holds for the preprocessed client sample sizes. Also note that here, α is our assumption about the proportion of Byzantine clients while α* relates to an analytical property of the underlying robust algorithm. For example, we may replace the federated average with a weighted median as suggested by [8], in which case α* must be less than 1/2.

Truncating the Values of N
Our suggested preprocessing procedure uses element-wise truncation of the values of N by some value U, marked trunc(N, U) := (min(n_1, U), ..., min(n_K, U)). Given α and α*, we search for the maximal truncation value that satisfies (3): U* := max{U : mwp(trunc(N, U), α) ≤ α*}. (4) Here α and U* present a trade-off. Higher α means more Byzantine tolerance but requires a smaller truncation value U*, which may cause slower and less accurate convergence, as we demonstrate empirically in Section 4 and theoretically in Theorem 2.
We note that given α and α*, truncating N by solving (4) is optimal in the sense that any other Preprocess procedure that adheres to (3) has an equal or larger L_1 distance from the original N. This follows immediately from the observation that, when truncating the values of N, the entire distance is due to the truncated elements, and if there were another applicable vector closer to N, we could have redistributed the difference to the largest elements and increased U*, contradicting its maximality.
Finding U* Given α. If one has an estimate for α, it is easy to calculate U*: for example, by going over the values of N in decreasing order (i.e., from index K downwards) until finding a value that satisfies the inequality in (4). Marking the index of this value by u, within the range [n_u, n_{u+1}] we can express mwp(trunc(N, U), α) as a simple function of the form (a + bU)/(c + dU), for which (4) can be solved in closed form. The α-U* Trade-Off. When we do not know α, as a practical procedure, we suggest plotting U* as a function of α. To do so, we can start with α ← α*, U ← n_1, and alternate between decreasing α by 1/K (one less Byzantine client tolerated) and solving (4). This procedure can be made efficient by saving intermediate sums and using a specialized data structure for trimmed collections. See Algorithm 3 (Report (α, U*) Pairs) for pseudocode and Figure 1 for an example output.
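Because mwp(trunc(N, U), α) is nondecreasing in U, the maximal truncation U* in (4) can also be found by simple bisection; a minimal sketch (our own code, less efficient than the closed-form per-segment solution described above, and assuming that truncating everything down to n_1 satisfies the bound):

```python
import math

def mwp(weights, p):
    """Maximal weight proportion of the ceil(p*K) largest values."""
    v = sorted(weights)
    m = math.ceil(p * len(v))
    return sum(v[len(v) - m:]) / sum(v) if m else 0.0

def truncate(weights, U):
    """Element-wise truncation trunc(N, U)."""
    return [min(w, U) for w in weights]

def find_u_star(weights, alpha, alpha_star, tol=1e-6):
    """Largest U with mwp(trunc(N, U), alpha) <= alpha_star, by bisection.
    Assumes the fully truncated vector (U = n_1) satisfies the bound."""
    lo, hi = min(weights), max(weights)
    if mwp(truncate(weights, hi), alpha) <= alpha_star:
        return hi  # no truncation needed
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mwp(truncate(weights, mid), alpha) <= alpha_star:
            lo = mid  # mid is feasible, search higher
        else:
            hi = mid  # mid violates the bound, search lower
    return lo

sizes = [10, 12, 15, 20, 22, 25, 30, 35, 40, 10_000]
print(find_u_star(sizes, alpha=0.1, alpha_star=0.5))  # approx. 209.0
```

In this example the single huge client is capped at U* = 209, at which point its weight is exactly half of the truncated total, matching α* = 50%.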

Truncation given a partial view of N
When K is very large we may want to sample only k ≪ K elements IID from N. In this case, we will need to test that the inequality in (4) holds with high probability.
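The idea can be illustrated with a quick Monte Carlo sketch (an illustration only; the function name is ours, and the rigorous test would additionally incorporate the Hoeffding margins ε_i and confidence δ from the analysis below):

```python
import math
import random

def sample_mwp_estimate(weights, U, alpha, k, seed=0):
    """Estimate mwp(trunc(N, U), alpha) from k IID draws from N
    (with replacement), mimicking a server that only sees a sample."""
    rng = random.Random(seed)
    xs = sorted(min(rng.choice(weights), U) for _ in range(k))
    m = math.ceil(alpha * k)
    return sum(xs[k - m:]) / sum(xs)

sizes = [10, 12, 15, 20, 22, 25, 30, 35, 40, 10_000]
# The exact value of mwp(trunc(N, 209), 0.1) is 0.5 for this vector;
# the sample estimate should land close to it for large k.
print(sample_mwp_estimate(sizes, U=209, alpha=0.1, k=2000))
```

A server could compute such an estimate on the sampled weights and compare it, with appropriate slack, against α* before committing to a truncation value.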
We consider k discrete random variables taken IID from N after truncation by U, that is, taken from a distribution over {0, 1, ..., U}. We mark these random variables as X_1, X_2, ..., X_k, and their order statistics as X_(1), X_(2), ..., X_(k), where X_(1) ≤ X_(2) ≤ ... ≤ X_(k). Proof. First, in the scope of this proof we use a couple of additional notations:
- top(V, p): the collection of the p|V| largest values in V.
- ΣV: the sum of all elements in V.
We observe that mwp(trunc(N, U), α) ≤ α* can be rewritten in terms of Σtop(trunc(N, U), α) and Σtrunc(N, U). Then we note that membership in top(trunc(N, U), α) can be viewed as a simple Bernoulli random variable with probability α, for which we obtain a bound using Hoeffding's inequality with t ≥ 0. Therefore, with t = ε_1, we have the corresponding bound with 1 − δ/3 confidence. Using Hoeffding's inequality again, we can bound the corresponding expectation with 1 − δ/3 confidence and, together with (9), obtain (10). Then, using Hoeffding's inequality for the third time, E[trunc(N, U)] is bounded from below, within ε_3, with 1 − δ/3 confidence. The proof is concluded by applying (9)-(11) to (7) using the union bound.

Convergence Analysis
After applying our Preprocess procedure we have the truncated number of samples per client, marked {ñ_k}_{k∈[K]}. We can trivially ensure that any algorithm instance works as expected by requiring that clients ignore samples that were truncated. That is, even if an honest (non-Byzantine) client k has n_k samples, it may use only ñ_k samples during its ClientUpdate.
Although this solution always preserves the semantics of any underlying algorithm, it does hurt convergence guarantees since the total number of samples decreases [Tables 5 and 6 in [12]; [21]; [10]]. Interestingly, Theorem 3 in [16] analyzes the baseline FedAvg and shows that the convergence bound increases with max_k n_k / min_k n_k (marked there as ν/ς). This suggests that in some cases, unbalancedness itself deteriorates the convergence rate, a phenomenon that may be mitigated by truncation to some degree.
Additionally, we note that in practice, the performance of federated-averaging-based algorithms improves when honest clients use all their original n_k samples. Intuitively, this follows from the observation that Aggregate procedures are generally composite mean estimators and ClientUpdate calls are likely to produce more accurate results given more samples.
Lastly, as we have mentioned before, convergence is guaranteed, but we note that the optimization goal itself is inevitably skewed in our Byzantine scenario. The following theorem bounds this difference between the original weighted optimization goal (2) and the new goal after truncation. In order to emphasize the necessity of this bound (in terms of Assumption 2), we use overdot and tilde to signify unreliable and truncated values, respectively, as previously described in Subsection 2.2.
Theorem 2. Given the same setup as in (1) and a truncation bound U, the following holds for all w ∈ R^d, where L(Z_i) is defined as Σ_{z∈Z_i} ℓ(w; z).
Proof. Using the fact that ñ ≤ UK we obtain the bound. From the bound in Theorem 2 we can clearly see how the coefficients in the left term, (ṅ_i/ṅ − 1/K), stem from unbalancedness in the values above the truncation threshold, while the coefficient in the right term, (1/ṅ − 1/ñ), accounts for the increase in the relative weight of the values below the truncation threshold. Additionally, note that this formulation demonstrates how a single Byzantine client can increase this difference arbitrarily by increasing its ṅ_i. Lastly, observe how both terms vanish as U increases, which motivates our selection of U* as the maximal truncation threshold for any given α and α*.

Evaluation
In this section, we demonstrate that truncating N is a crucial requirement for Byzantine robustness. That is, we show that no matter what the specific attack or aggregation method is, using N "as-is" categorically voids any robustness guarantees.
The code for the experiments is based on the TensorFlow machine learning library [1]. Specifically, the code for the Shakespeare experiments is based on TensorFlow Federated, a sub-library of TensorFlow released under the Apache 2.0 license. Our code can be found in the supplementary material and is given under the MIT license. We performed the experiments using a single NVIDIA GeForce RTX 2080 Ti GPU, but the results are easily reproducible on any device.

Experimental Setup
The Machine Learning Tasks and Models. Shakespeare: next-character prediction partitioned by speaker. Presented in the original FedAvg paper [17] and also part of the LEAF benchmark [5], the Shakespeare dataset contains 422,615 sentences taken from The Complete Works of William Shakespeare [20] (freely available public domain texts). The next-character-prediction task with the per-speaker partitioning represents a realistic scenario in the FL domain. Each client trains using an LSTM recurrent model [11] with hyperparameters matching those suggested by [19] for FedAvg.
MNIST: digit recognition with synthetic client partitioning. The MNIST database [14] (available under the Creative Commons Attribution-ShareAlike 3.0 license) includes 28×28 grayscale labeled images of handwritten digits split into 60,000 training images and 10,000 testing images. We randomly partition the training set among 100 clients. The partition sizes are determined by taking 100 samples from a Lognormal distribution with µ = 1.5, σ = 3.45, and then interpolating corresponding integers that sum to 60,000. This produces a right-skewed, fat-tailed partition size distribution that emphasizes the importance of correctly weighting aggregation rules and the effects of truncation. Clients train a classifier using a 64-unit perceptron with ReLU activation and 20% dropout, followed by a softmax layer. Following [21], on every communication round, all clients perform mini-batch SGD with 10% of their examples.
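The synthetic partition can be sketched as follows (a hedged reconstruction: the function `synthetic_partition`, its seed, and its rounding scheme are our own; the paper's exact interpolation of integers summing to 60,000 may differ):

```python
import random

def synthetic_partition(num_clients=100, total=60_000,
                        mu=1.5, sigma=3.45, seed=42):
    """Draw client proportions from Lognormal(mu, sigma) and scale them
    to integer sample counts summing to `total`. The remainder left by
    flooring is assigned to the largest client (one possible scheme)."""
    rng = random.Random(seed)
    raw = [rng.lognormvariate(mu, sigma) for _ in range(num_clients)]
    s = sum(raw)
    sizes = [int(r / s * total) for r in raw]
    sizes[sizes.index(max(sizes))] += total - sum(sizes)
    return sorted(sizes)

sizes = synthetic_partition()
print(sum(sizes))  # 60000
```

With σ = 3.45 the draw is extremely heavy-tailed, so a handful of clients end up holding most of the 60,000 samples, which is exactly the regime where client weighting matters.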
Note that the Shakespeare and MNIST synthetic tasks were selected because they are relatively simple, unbalanced tasks. Simple, because we want to evaluate a preprocessing phase and avoid tuning of the underlying algorithms we compare. Unbalanced, since, as can be understood from Theorem 2, when the client sample sizes are spread mostly evenly, ignoring the client sample size altogether is a viable approach. See Figure 2 for the histograms of the partitions.
The Server. We show three Aggregate procedures: the arithmetic mean, as used by the original FedAvg, and two additional procedures that replace the arithmetic mean with robust mean estimators. The first of the latter uses the coordinate-wise median [8,21]; that is, each server model coordinate is taken as the median of the clients' corresponding coordinates. The second robust aggregation method uses the coordinate-wise trimmed mean [21], which, for a given hyperparameter β, first removes the β-proportion lowest and β-proportion highest values in each coordinate and only then calculates the arithmetic mean of the remaining values.
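For illustration, unweighted versions of the two robust aggregators might look as follows (a sketch with our own function names; the paper's experiments use the weighted variants, which additionally account for the ñ_k's):

```python
def coordinatewise_median(vectors):
    """Each coordinate of the result is the median of that
    coordinate across all client vectors."""
    def median(xs):
        s = sorted(xs)
        k = len(s)
        return s[k // 2] if k % 2 else (s[k // 2 - 1] + s[k // 2]) / 2
    return [median(col) for col in zip(*vectors)]

def coordinatewise_trimmed_mean(vectors, beta):
    """Per coordinate, drop the beta-proportion lowest and highest
    values, then average what remains."""
    def tmean(xs):
        s = sorted(xs)
        t = int(beta * len(s))
        kept = s[t:len(s) - t]
        return sum(kept) / len(kept)
    return [tmean(col) for col in zip(*vectors)]

# Three honest updates near (1, 1) and one extreme Byzantine update:
updates = [[0.9, 1.1], [1.0, 1.0], [1.1, 0.9], [-100.0, 100.0]]
print(coordinatewise_median(updates))             # [0.95, 1.05]
print(coordinatewise_trimmed_mean(updates, 0.25)) # [0.95, 1.05]
```

Both estimators discard the extreme Byzantine coordinates, whereas the plain arithmetic mean would be dragged far from (1, 1) by the single outlier.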
When preprocessing the client-declared sample sizes, we compare three options: we either ignore client sample size, truncate according to α = 10% and α* = 50%, or just pass through client sample sizes as reported.
The Clients and Attackers. We examine a model negation attack [3]. In this attack, each attacker "pushes" the model towards zero by always returning a negation of the server's model. When the data distribution is balanced, this attack is easily neutralized since Byzantine clients typically send easily detectable extreme values. However, in our unbalanced case, we demonstrate that without our preprocessing step, this attack cannot be mitigated even by robust aggregation methods.
In order to provide comparability, we additionally follow the experiment shown by [21], in which 10% of the clients use a label shifting attack on the MNIST task. In this attack, Byzantine clients train normally except that they replace every training label y with 9 − y. The values sent by these clients are then incorrect but relatively moderate, making their attack somewhat harder to detect.
We first execute our experiment without any attacks for every server aggregation and preprocessing combination. Then, for each attack type, we repeat the process two additional times: 1) with a single attacker that declares 10 million samples, and 2) with 10% attackers that declare 1 million samples each.
Fig. 3: Accuracy by round without any attackers for the Shakespeare experiments. Curves correspond to preprocessing procedures and columns correspond to different aggregation methods. It can be seen that our method (dashed orange curve) remains comparable to the properly weighted mean estimators (solid blue curve) while ignoring clients' sample sizes (dotted green curve) is sub-optimal. This effect is pronounced when the unweighted median is used, since with our unbalanced partition it is generally very far from the mean. Figure 5 shows similar results for the MNIST experiments.

Results
The Shakespeare experiments without any attackers are shown in Figure 3 and the executions with attackers are shown in Figure 4. The MNIST experiments without any attackers are shown in Figure 5 and the executions with attackers are shown in Figure 6.
The results from the first experiment, running without any attackers (Figure 3), demonstrate that ignoring client sample size results in reduced accuracy, especially when median aggregation is used, whereas truncating according to our procedure is significantly better and is on par with properly using all weights. These results highlight the importance of using sample size weights when performing server aggregations.
While Figure 3 shows that truncation-based preprocessing performs on par with taking all weights into consideration when all clients are honest, Figure 4 demonstrates that the results are very different when there is an attack. In this case, we see that when even a single attacker reports a highly exaggerated sample size and the server relies on all the values of N, the performance of all aggregation methods, including the robust median and trimmed mean, quickly degrades.
In contrast, in our experiments robustness is maintained when truncation-based preprocessing is used in conjunction with robust mean aggregations, even when Byzantine clients attain the maximal supported proportion (α = 10%).
The results of the MNIST experiments are similar to those of the Shakespeare experiments. We observe that even with a single attacker performing a trivial attack (first row), using the weights directly (solid blue curve) is devastating, while when our preprocessing method is used in conjunction with robust mean aggregations (dashed orange curve, two last columns) convergence remains stable even when there are actual α (=10%) attackers (second row). In contrast, the same cannot be said for the regular mean aggregator, as can be seen by the sub-optimal accuracy (2nd row) and occasional dips in accuracy (1st row) in the leftmost column (the dips can be explained by the fact that in each round we randomly select clients for training, and so the Byzantine clients have varying effects across different rounds). We note that in some cases our method may be slightly less efficient compared with the preprocessing method that ignores sample size altogether (dotted green curve, second row, middle column). This is to be expected because we allow Byzantine clients to potentially get close to an α*-proportion (50%, in this case) of the weight. However, our method is significantly closer to the optimal solution when there are no or only a few attackers (see Figure 3). Moreover, when used in conjunction with robust mean aggregation methods it maintains their robustness properties.

Conclusion and future work
Our method is based on truncating the weight values reported by clients in a manner that bounds from above the proportion α* of the total weight that can be attributed to Byzantine clients, given an upper bound α on the proportion of clients that may be Byzantine. Different values of the parameter α represent different points in the trade-off between model quality and Byzantine robustness, where higher values increase robustness when attacks do occur but decrease the convergence rate even in the absence of attacks.
We evaluated the performance of our truncation method empirically when applied as a preprocessing stage prior to several aggregation methods. The results of our experiments establish that: 1) in the absence of attacks, model convergence is on par with that of properly using all reported weights, and 2) when attacks do occur, combining truncation-based preprocessing with robust aggregations incurs almost no penalty in comparison with the performance of using all weights in the absence of attacks, whereas without preprocessing, even robust aggregation methods collapse to a performance that is worse than that of a random classifier.
When the number of clients is very large, performing preprocessing and aggregation on the server may become computationally infeasible. We prove that, in this case, truncation-based preprocessing can achieve the same upper bound on α* w.h.p. based on the weight values reported by a sufficiently large number of clients selected IID.
As with many Byzantine-robust algorithms, the selection of α has a significant impact on the underlying model and, specifically, on fairness towards clients that hold underrepresented data, which may inadvertently be considered
outliers. In future work, we plan to further analyze the trade-off between robustness and the usage of client sample size in rectifying data unbalancedness. We also plan to investigate alternative forms of estimating client importance that may avoid client sample size altogether.

2.1 Optimization Goal
We are given K clients where each client k has a local collection Z_k of n_k samples taken IID from some unknown distribution over sample space Z. We denote the unified sample collection as Z = ∪_{k∈[K]} Z_k and the total number of samples as n (i.e., n = |Z| = Σ_{k∈[K]} n_k). Our objective is global empirical risk minimization (ERM) for some loss function class ℓ(w; ·) : Z → R, parameterized by w ∈ R^d: min_{w∈R^d} F(w), where F(w) := (1/n) Σ_{z∈Z} ℓ(w; z).
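Written out, the weighted-average rewriting referenced as equation (2) follows directly from these definitions:

```latex
F(w) \;=\; \frac{1}{n}\sum_{z\in Z}\ell(w;z)
     \;=\; \sum_{k=1}^{K}\frac{n_k}{n}\cdot\frac{1}{n_k}\sum_{z\in Z_k}\ell(w;z)
     \;=\; \sum_{k=1}^{K}\frac{n_k}{n}\,F_k(w).
```

This is why the reported n_k's directly determine each client's influence on the global objective, and hence why a Byzantine client benefits from exaggerating its sample count.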

Fig. 1 :
Fig. 1: Example plot of data generated by executing Algorithm 3 on an unbalanced vector N and α* = 50% (this vector corresponds to the partition used in our experiments; see Section 4.1 for details).

Fig. 4 :
Fig. 4: Accuracy by round under Byzantine attacks for the Shakespeare experiments. Curves correspond to preprocessing procedures and columns correspond to different aggregation methods. In the two rows of the experiment, the Byzantine clients perform a model negation attack with one and 10% attackers, respectively. We observe that even with a single attacker performing a trivial attack (first row), using the weights directly (solid blue curve) is devastating, while when our preprocessing method is used in conjunction with robust mean aggregations (dashed orange curve, two last columns) convergence remains stable even when there are actual α (=10%) attackers (second row). In contrast, the same cannot be said for the regular mean aggregator, as can be seen by the sub-optimal accuracy (2nd row) and occasional dips in accuracy (1st row) in the leftmost column (the dips can be explained by the fact that in each round we randomly select clients for training, and so the Byzantine clients have varying effects across different rounds). We note that in some cases our method may be slightly less efficient compared with the preprocessing method that ignores sample size altogether (dotted green curve, second row, middle column). This is to be expected because we allow Byzantine clients to potentially get close to an α*-proportion (50%, in this case) of the weight. However, our method is significantly closer to the optimal solution when there are no or only a few attackers (see Figure 3). Moreover, when used in conjunction with robust mean aggregation methods it maintains their robustness properties. Figure 6 shows similar results for the MNIST experiments.

Fig. 5 :
Fig. 5: Accuracy by round without any attackers for the MNIST experiments. Curves correspond to preprocessing procedures and columns correspond to different aggregation methods. It can be seen that our method (dashed orange curve) remains comparable to the properly weighted mean estimators (solid blue curve) while ignoring clients' sample sizes (dotted green curve) is sub-optimal. This effect is pronounced when the unweighted median is used, since with our unbalanced partition it is generally very far from the mean.

Fig. 6 :
Fig. 6: Accuracy by round under Byzantine attacks for the MNIST experiments. Curves correspond to preprocessing procedures and columns correspond to different aggregation methods. In the first two rows, Byzantine clients perform a label shifting attack with one and 10% attackers, respectively. In the last two rows we repeat the experiment with a model negation attack. We observe that even with a single attacker performing a trivial attack (first and third rows), using the weights directly (solid blue curve) is devastating, while when our preprocessing method is used in conjunction with robust mean aggregations (dashed orange curve, two last columns) convergence remains stable even when there are actual α (=10%) attackers (second and fourth rows). In contrast, the same cannot be said for the regular mean aggregator, as can be seen by the sub-optimal accuracy (2nd and 3rd rows) and complete failure to converge (last row) in the leftmost column. We note that in some cases our method may be slightly less efficient compared with the preprocessing method that ignores sample size altogether (dotted green curve, second row, last column). This is to be expected because we allow Byzantine clients to potentially get close to an α*-proportion (50%, in this case) of the weight. However, our method is significantly closer to the optimal solution when there are no or only a few attackers (see Figure 5). Moreover, when used in conjunction with robust mean aggregation methods it maintains their robustness properties.