1. Introduction
Sparse learning has emerged as a central topic of study in a variety of fields that require high-dimensional data analysis. Sparsity-constrained statistical models exploit the fact that high-dimensional data arising from real-world applications frequently have low intrinsic complexity, and they have been shown to support accurate estimation and inference in a variety of data mining fields, such as bioinformatics [1], image analysis [2,3], graph sparsification [4] and engineering [5]. These models often require solving the following optimization problem with a nonconvex, nonsmooth sparsity constraint:
where $f(x)$ is a smooth and convex cost function of the parameter vector x to be optimized, ${\parallel x\parallel}_{0}$ denotes the ${\ell}_{0}$-norm (cardinality) of x, which counts the number of nonzero entries in x, and $\tau $ is the sparsity level pre-specified for x. Examples of this model include sparsity-constrained linear/logistic regression problems [6,7] and sparsity-constrained graphical models [8].
Extensive research has been conducted on Problem (1). The methods largely fall into the regimes of either matching pursuit methods [9,10,11,12] or iterative hard thresholding (IHT) methods [13,14,15]. Even though matching pursuit methods achieve remarkable success in minimizing quadratic loss functions (such as ${\ell}_{0}$-constrained linear regression problems), they require finding an optimal solution to min $f(x)$ over the identified support after hard thresholding at each iteration, which lacks analytical solutions for arbitrary losses and can be time-consuming [16]. Hence, gradient-based IHT methods have gained significant interest and become popular for nonconvex sparse learning. IHT methods currently include the gradient descent HT (GDHT) [14], stochastic gradient descent HT (SGDHT) [15], hybrid stochastic gradient HT (HSGHT) [17], and stochastic variance reduced gradient HT (SVRGHT) [18,19] methods. These methods update the iterate ${x}_{t}$ as follows: ${x}_{t+1}={\mathcal{H}}_{\tau}({x}_{t}-{\gamma}_{t}{v}_{t})$, where ${\gamma}_{t}$ is the learning rate, ${v}_{t}$ can be the full gradient, a stochastic gradient or a variance-reduced gradient at the t-th iteration, and ${\mathcal{H}}_{\tau}(x):{\mathbb{R}}^{d}\to {\mathbb{R}}^{d}$ denotes the HT operator that preserves the top $\tau $ elements of x (in magnitude) and sets the other elements to 0. However, finding a solution to Problem (1) is generally NP-hard because of the nonconvexity and nonsmoothness of the cardinality constraint [20].
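As a concrete illustration, the HT operator ${\mathcal{H}}_{\tau}$ and a single IHT update can be sketched in a few lines of NumPy (a minimal sketch; the function names are ours, not from the paper):

```python
import numpy as np

def hard_threshold(x: np.ndarray, tau: int) -> np.ndarray:
    """H_tau: keep the tau largest-magnitude entries of x, zero out the rest."""
    out = np.zeros_like(x)
    if tau > 0:
        top = np.argpartition(np.abs(x), -tau)[-tau:]  # indices of the top-tau entries
        out[top] = x[top]
    return out

def iht_step(x: np.ndarray, v: np.ndarray, gamma: float, tau: int) -> np.ndarray:
    """One IHT update: x_{t+1} = H_tau(x_t - gamma * v_t)."""
    return hard_threshold(x - gamma * v, tau)
```

Note that `np.argpartition` breaks ties arbitrarily, which is consistent with the usual definition of ${\mathcal{H}}_{\tau}$ up to tie-breaking.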
When sparse learning becomes distributed and uses data collected by distributed devices, local datasets can be too sensitive to share during the construction of a sparse inference model. For instance, meta-analyses may integrate genomic data from a large number of labs to identify (a sparse set of) genes contributing to the risk of a disease without sharing data across the labs [21,22]. Smartphone-based healthcare systems may need to learn the most important mobile health indicators from a large number of users; however, the personal health information gathered on the phone is private [23]. Furthermore, communication efficiency can be the main challenge in distributively training a sparse learning model. Due to the power and bandwidth limitations of various sensors, the signal processing community, for instance, has been seeking more communication-efficient methods [24].
Federated learning (FL) is a recently proposed communication-efficient distributed computing paradigm that enables collaboration among a collection of clients while preserving data privacy on each device by avoiding the transmission of local data to the central server [25,26,27]. Hence, sparse learning can benefit from the setting of federated learning. In this paper, we solve the following federated nonconvex sparsity-constrained empirical risk minimization problem with decentralized data:
where $f(x)$ is a smooth and convex function, ${f}_{i}(x)={E}_{z\sim {\mathcal{D}}_{i}}\left[{f}_{i}(x,z)\right]$ is the loss function of the i-th client (or device) with weight ${p}_{i}\in [0,1)$, ${\sum}_{i=1}^{N}{p}_{i}=1$, ${\mathcal{D}}_{i}$ is the distribution of the data located locally on the i-th client, and N is the total number of clients. It is thus desirable to solve Problem (2) in a communication-efficient way and to investigate theory and algorithms applicable to a broad class of sparsity-constrained learning problems in high-dimensional data analyses [6,7,8,28].
We thus propose federated HT algorithms with lower communication costs and provide the corresponding theoretical analysis under practical federated settings. The analysis of the proposed methods is difficult because the distributions of training data on the clients may be non-identical and the data weights can be unbalanced across devices.
Our main contributions are summarized as follows.
(a) We develop two schemes for the federated HT method: the Federated Hard Thresholding (FedHT) algorithm and the Federated Iterative Hard Thresholding (FedIterHT) algorithm. In FedHT, we apply the HT operator ${\mathcal{H}}_{\tau}$ at the central server right before distributing the aggregated model to clients. To further improve communication efficiency and sparsity recovery, FedIterHT applies ${\mathcal{H}}_{\tau}$ to both the local updates and the central server aggregate. Note that this is the first attempt to explore IHT algorithms under federated learning settings.
(b) We provide a set of theoretical results for the federated HT method, particularly for FedHT and FedIterHT, under the realistic condition that the distributions of training data over devices can be unbalanced and non-independent and non-identically distributed (non-IID), i.e., for $i\ne j$, ${\mathcal{D}}_{i}$ and ${\mathcal{D}}_{j}$ are different. We prove that both algorithms enjoy a linear convergence rate and have a strong guarantee for sparsity recovery. In particular, Theorems 1 (for FedHT) and 2 (for FedIterHT) show that the estimation error between the algorithm iterate ${x}_{T}$ and the optimal solution ${x}^{*}$ is upper bounded as $E{\parallel {x}_{T}-{x}^{*}\parallel}^{2}\le {\theta}^{T}{\parallel {x}_{0}-{x}^{*}\parallel}^{2}+g({x}^{*})$, where ${x}_{0}$ is the initial guess of the solution; the convergence rate factor $\theta $ depends on the algorithm parameter K (the number of SGD steps on each device before communication) and on the closeness between the pre-specified sparsity level $\tau $ and the true sparsity ${\tau}^{*}$; and $g({x}^{*})$ is a statistical bias term related not only to K but also to the gradient of f at the sparse solution ${x}^{*}$ and to a measure of the non-IID-ness of the data across the devices.
The theoretical results allow us to evaluate and compare the proposed methods. For example, greater non-IID-ness among clients increases the bias of both algorithms. More local iterations may reduce $\theta $ but increase the statistical bias. Due to the use of the HT operator on local updates, the statistical bias induced by FedIterHT in Theorem 2 matches the best known upper bound for traditional IHT methods [17], which demonstrates its powerful capability for sparsity recovery.
(c) When instantiating the general loss function with a concrete squared or logistic loss, we arrive at specific sparse learning problems, such as sparse linear regression and sparse logistic regression. We provide statistical analysis of the maximum likelihood estimators (M-estimators) of these problems when using FedIterHT to solve them. This result can be regarded as a federated HT analysis for generalized linear models.
(d) Extensive experiments in simulations and on real-life datasets demonstrate the effectiveness of the proposed algorithms over standard distributed IHT learning.
2. Preliminaries
We formalize our problem as Problem (2) and provide the notation (Table 1), assumptions and preparatory lemmas used in this paper. We denote vectors by lowercase letters, e.g., x. The model parameters form a vector $x\in {\mathbb{R}}^{d}$. The ${\ell}_{0}$-norm, ${\ell}_{2}$-norm and ${\ell}_{\infty}$-norm of a vector are denoted by ${\parallel \cdot \parallel}_{0}$, $\parallel \cdot \parallel $ and ${\parallel \cdot \parallel}_{\infty}$, respectively. Let $O(\cdot)$ represent the asymptotic upper bound, and let $\left[N\right]$ be the integer set $\{1,\dots ,N\}$. The support ${\mathcal{I}}_{t,k+1}^{(i)}=supp({x}^{*})\cup supp({x}_{t,k}^{(i)})\cup supp({x}_{t,k+1}^{(i)})$ is associated with the $(k+1)$-th iteration in the t-th round on device i. For simplicity, we use ${\mathcal{I}}^{(i)}={\mathcal{I}}_{t,k+1}^{(i)}$ and $\mathcal{I}={\bigcup}_{i=1}^{N}{\mathcal{I}}_{t,k+1}^{(i)}$ throughout the paper without ambiguity, and $\tilde{\mathcal{I}}=supp\left({\mathcal{H}}_{2N\tau}\left(\nabla f\left({\mathit{x}}^{*}\right)\right)\right)\cup supp\left({\mathit{x}}^{*}\right)$.
We use the same conditions employed in the theoretical analysis of other IHT methods, assuming that the objective function satisfies the following:
Assumption 1. We assume that the loss function ${f}_{i}(x)$ on each device i
 1. is restricted ${\rho}_{s}$-strongly convex (RSC [43]) at sparsity level s for a given $s\in {\mathbb{N}}_{+}$, i.e., there exists a constant ${\rho}_{s}>0$ such that $\forall {x}_{1},{x}_{2}\in {\mathbb{R}}^{d}$ with ${\parallel {x}_{1}-{x}_{2}\parallel}_{0}\le s$, $i\in \left[N\right]$, we have
 2. is restricted ${l}_{s}$-strongly smooth (RSS [43]) at sparsity level s for a given $s\in {\mathbb{N}}_{+}$, i.e., there exists a constant ${l}_{s}>0$ such that $\forall {x}_{1},{x}_{2}\in {\mathbb{R}}^{d}$ with ${\parallel {x}_{1}-{x}_{2}\parallel}_{0}\le s$, $i\in \left[N\right]$, we have
 3. has ${\sigma}_{i}^{2}$-bounded stochastic gradient variance, i.e.,
Remark 1. When $s=d$, the above assumption is no longer restricted to a support at a given sparsity level, and ${f}_{i}$ is in fact ${\rho}_{d}$-strongly convex and ${l}_{d}$-strongly smooth.
Following the same convention in FL [35,37], we also assume that the dissimilarity between the gradients of the local functions ${f}_{i}$ and the global function f is bounded as follows.
Assumption 2. The functions ${f}_{i}(x)$ ($i\in \left[N\right]$) are $\mathcal{B}$-locally dissimilar, i.e., there exists a constant $\mathcal{B}>1$ such that for any $\mathcal{I}$. From the above assumptions, we have the following lemmas to prepare for our theorems.
Lemma 1 ([44]). For $\tau >{\tau}^{*}$ and for any parameter $x\in {\mathbb{R}}^{d}$, we have where $\alpha =\frac{2\sqrt{{\tau}^{*}}}{\sqrt{\tau -{\tau}^{*}}}$ and ${\tau}^{*}={\parallel {x}^{*}\parallel}_{0}$.
Lemma 2. Let ${f}_{i}(x):{\mathbb{R}}^{d}\to \mathbb{R}$ be a differentiable convex function that is restricted ${l}_{s}$-strongly smooth with parameter s, i.e., there exists a generic constant ${l}_{s}>0$ such that for any ${x}_{1}$, ${x}_{2}$ with ${\parallel {x}_{1}-{x}_{2}\parallel}_{0}\le s$; then we have: The above two inequalities also hold for the global smoothness parameter ${l}_{d}$.
3. The FedHT Algorithm
In this section, we first describe our new federated ${\ell}_{0}$-norm regularized sparse learning framework via hard thresholding, FedHT, and then discuss the convergence rate of the proposed algorithm.
A high-level summary of FedHT is given in Algorithm 1. The FedHT algorithm generates a sequence of $\tau $-sparse vectors ${x}_{1},{x}_{2},\cdots $ from an initial sparse approximation ${x}_{0}$. At the $(t+1)$-th round, clients receive the global parameter update ${x}_{t}$ from the central server and then run K steps of minibatch SGD on their local private data. In each step, the i-th client updates ${x}_{t,k+1}^{(i)}=\mathrm{argmin}_{x}\,{f}_{i}({x}_{t,k}^{(i)})+\langle {g}_{t,k}^{(i)},x-{x}_{t,k}^{(i)}\rangle +\frac{1}{2{\gamma}_{t}}{\parallel x-{x}_{t,k}^{(i)}\parallel}^{2}$ for $k\in \{0,\dots ,K-1\}$, i.e., ${x}_{t,k+1}^{(i)}={x}_{t,k}^{(i)}-{\gamma}_{t}{g}_{t,k}^{(i)}$. Clients send ${x}_{t,K}^{(i)}$, $i\in \left[N\right]$, back to the central server; the server then averages them to obtain a dense global parameter vector and applies the HT operator to obtain a sparse iterate ${x}_{t+1}$. Unlike the commonly used FedAvg, FedHT is designed to solve the family of federated ${\ell}_{0}$-norm regularized sparse learning problems. It has a strong ability to recover the optimal sparse estimator in decentralized non-IID and unbalanced data settings, while at the same time reducing the communication cost by a large margin because the central server broadcasts a sparse iterate in each of the T rounds.
Algorithm 1. Federated Hard Thresholding (FedHT)
Input: the learning rate ${\gamma}_{t}$, the sparsity level $\tau $, and the number of clients N. Initialize ${x}_{0}$.
for $t=0$ to $T-1$ do
  for client $i=1$ to N in parallel do
    ${x}_{t,0}^{(i)}={x}_{t}$
    for $k=0$ to $K-1$ do
      Sample uniformly a batch ${I}_{t,k}^{(i)}$ with batch size ${b}_{t,k}^{(i)}$
      ${g}_{t,k}^{(i)}=\nabla {f}_{{I}_{t,k}^{(i)}}({x}_{t,k}^{(i)})$
      ${x}_{t,k+1}^{(i)}={x}_{t,k}^{(i)}-{\gamma}_{t}{g}_{t,k}^{(i)}$
    end for
  end for
  ${x}_{t+1}={\mathcal{H}}_{\tau}({\sum}_{i=1}^{N}{p}_{i}{x}_{t,K}^{(i)})$
end for
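To make Algorithm 1 concrete, the following is a minimal NumPy sketch of one possible FedHT implementation (the function names are ours; the per-round batch-size schedule from the theory is omitted, and each client here uses a full local gradient for simplicity):

```python
import numpy as np

def hard_threshold(x, tau):
    """H_tau: keep the tau largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(x)
    top = np.argpartition(np.abs(x), -tau)[-tau:]
    out[top] = x[top]
    return out

def fedht(client_grads, p, x0, tau, gamma, T=50, K=5):
    """FedHT sketch: K dense local gradient steps per client, then a weighted
    average and a single hard-thresholding step at the central server."""
    x = x0.copy()
    for _ in range(T):
        local = []
        for grad in client_grads:          # each client starts from the global iterate
            xi = x.copy()
            for _ in range(K):             # K local (SGD-style) steps, no thresholding
                xi = xi - gamma * grad(xi)
            local.append(xi)
        dense = sum(pi * xi for pi, xi in zip(p, local))
        x = hard_threshold(dense, tau)     # HT applied only at the central server
    return x
```

For example, with quadratic client losses ${f}_{i}(x)=\frac{1}{2}{\parallel x-{c}_{i}\parallel}^{2}$, the sketch converges to the hard-thresholded weighted average of the ${c}_{i}$.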
The following theorem characterizes our theoretical analysis of FedHT in terms of its parameter estimation accuracy for sparsity-constrained problems. Although this paper focuses on the cardinality constraint, the theoretical result is applicable to other sparsity constraints, such as a constraint based on matrix rank. We now state the main theorem, followed by the detailed proof.
Theorem 1. Let ${x}^{*}$ be the optimal solution to Problem (2), ${\tau}^{*}={\parallel {x}^{*}\parallel}_{0}$, and suppose $f(x)$ satisfies Assumptions 1 and 2. The condition number is ${\kappa}_{d}=\frac{{l}_{d}}{{\rho}_{d}}\ge 1$. Let the stepsize be ${\gamma}_{t}=\frac{1}{6{l}_{d}}$ and the batch size ${b}_{t,k}^{(i)}=\frac{{\Gamma}_{1}}{{\omega}_{1}^{t}}$, with ${\Gamma}_{1}\ge \frac{{\xi}_{1}{\sum}_{i=1}^{N}{p}_{i}{\sigma}_{i}^{2}}{{\delta}_{1}{\parallel {x}_{0}-{x}^{*}\parallel}^{2}}$, ${\delta}_{1}=\alpha {(1-\frac{1}{12{\kappa}_{d}})}^{K}$, $\alpha =\frac{2\sqrt{{\tau}^{*}}}{\sqrt{\tau -{\tau}^{*}}}$, and the sparsity level $\tau \ge (16{(12{\kappa}_{d}-1)}^{2}+1){\tau}^{*}$. Then the following inequality holds for the FedHT: where ${\theta}_{1}={\omega}_{1}=(1+2\alpha ){(1-\frac{1}{12{\kappa}_{d}})}^{K}\in (0,1)$, ${g}_{1}({x}^{*})=\frac{{\xi}_{1}{\mathcal{B}}^{2}}{1-{\psi}_{1}}{\parallel \nabla f({x}^{*})\parallel}^{2}$, ${\psi}_{1}=(1+\alpha ){(1-\frac{1}{12{\kappa}_{d}})}^{K}$, and ${\xi}_{1}=\frac{(1+\alpha )(1-{(1-\frac{1}{12{\kappa}_{d}})}^{K}){\kappa}_{d}}{{l}_{d}^{2}}$. Note that if the sparse solution ${x}^{*}$ is sufficiently close to an unconstrained minimizer of $f(x)$, then $\parallel \nabla f({x}^{*})\parallel $ is small, so the first, exponentially decaying term on the right-hand side can be the dominating term, and it approaches 0 as T goes to infinity. We further obtain the following corollary bounding the number of rounds T needed to obtain a suboptimal solution, i.e., one whose difference from ${x}^{*}$ is bounded only by the second term.
Corollary 1. If all the conditions in Theorem 1 hold, then for a given precision $\epsilon >0$, we need at most $T\le {C}_{1}\mathrm{log}(\frac{\parallel {x}_{0}-{x}^{*}\parallel}{\epsilon})$ rounds to obtain where ${C}_{1}={(-\mathrm{log}({\theta}_{1}))}^{-1}$, ${\theta}_{1}=(1+2\alpha ){(1-\frac{1}{12{\kappa}_{d}})}^{K}\in (0,1)$, and ${g}_{1}({x}^{*})=\frac{{\xi}_{1}{\mathcal{B}}^{2}}{1-{\psi}_{1}}{\parallel \nabla f({x}^{*})\parallel}^{2}$. Remark 2. Corollary 1 indicates that under proper conditions and with sufficiently many rounds, the estimation error of the FedHT is determined by the second term, the statistical bias term, which we denote by ${g}_{1}({x}^{*})$. The term ${g}_{1}({x}^{*})$ becomes small if ${x}^{*}$ is sufficiently close to an unconstrained minimizer of $f(x)$, so it represents the sparsity-induced bias relative to the solution of the unconstrained optimization problem. The upper bound guarantees that the FedHT can approach ${x}^{*}$ arbitrarily closely up to a sparsity-induced bias, and the speed of approaching the biased solution is linear (or geometric) and determined by ${\theta}_{1}$. In Theorem 1 and Corollary 1, ${\theta}_{1}$ is closely related to the number of local updates K. The condition number ${\kappa}_{d}\ge 1$, so $(1-\frac{1}{12{\kappa}_{d}})<1$. When K is larger, ${\theta}_{1}$ is smaller, and so is the number of rounds T required to reach a target $\epsilon$. In other words, the FedHT converges faster with fewer communication rounds. However, the bias term ${g}_{1}({x}^{*})$ increases as K increases. Therefore, K should be chosen to balance the convergence rate and the statistical bias.
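As a quick numerical illustration of Corollary 1 (our own arithmetic with illustrative parameter values, not figures from the paper), the round budget $T\le {C}_{1}\mathrm{log}(\parallel {x}_{0}-{x}^{*}\parallel /\epsilon)$ with ${C}_{1}={(-\mathrm{log}{\theta}_{1})}^{-1}$ can be evaluated as follows:

```python
import math

def theta_1(tau, tau_star, kappa_d, K):
    """theta_1 = (1 + 2*alpha) * (1 - 1/(12*kappa_d))**K from Theorem 1."""
    alpha = 2.0 * math.sqrt(tau_star) / math.sqrt(tau - tau_star)
    return (1.0 + 2.0 * alpha) * (1.0 - 1.0 / (12.0 * kappa_d)) ** K

def rounds_needed(theta, dist0, eps):
    """Round budget T <= C * log(dist0/eps) with C = 1/(-log(theta))."""
    c = 1.0 / (-math.log(theta))           # requires theta in (0, 1)
    return math.ceil(c * math.log(dist0 / eps))

# Example: a well-conditioned problem with a generous sparsity level,
# so that theta_1 < 1 and the linear rate applies.
t1 = theta_1(tau=2000, tau_star=1, kappa_d=1.0, K=3)
T = rounds_needed(t1, dist0=1.0, eps=1e-3)
```

Increasing K shrinks `theta_1` and hence the required number of rounds, matching the discussion in Remark 2.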
We further investigate how the objective value $f(x)$ approaches the optimal value $f({x}^{*})$.
Corollary 2. If all the conditions in Theorem 1 hold, let ${\Delta}_{1}={l}_{d}{\parallel {x}_{0}-{x}^{*}\parallel}^{2}$ and ${g}_{2}({x}^{*})=O({\parallel \nabla f({x}^{*})\parallel}^{2})$; then we have Because the local updates on each device are based on SGD with dense parameters, without the HT operator, ${l}_{d}$-smoothness and ${\rho}_{d}$-strong convexity are required, which depend on the dimension d and are stronger requirements on f. Furthermore, $\parallel \nabla f({x}^{*})\parallel \le d{\parallel \nabla f({x}^{*})\parallel}_{\infty}$, i.e., ${g}_{1}({x}^{*})$ and ${g}_{2}({x}^{*})$ are $O({d}^{2}{\parallel \nabla f({x}^{*})\parallel}_{\infty}^{2})$, which is suboptimal in terms of the dimension d compared with the results for traditional IHT methods. To address these drawbacks, we develop a new algorithm in the next section.
4. The FedIterHT Algorithm
If we apply the HT operator to each local update as well, we obtain the FedIterHT algorithm, described in Algorithm 2. The local update on each device thus performs multiple SGDHT steps, which further reduces the communication cost because the model parameters sent back from the clients to the central server are also sparse. If a client's communication bandwidth is too small to effectively transmit the full set of parameters, the FedIterHT provides a good solution; it also relaxes the strict requirements on the objective function f and reduces the statistical bias. In this section, we first present this more communication-efficient federated ${\ell}_{0}$-norm regularized sparse learning framework, FedIterHT; we then theoretically show that it enjoys a better convergence rate than the FedHT, and we further provide statistical analysis for M-estimators under the FedIterHT framework.
We again examine the convergence of the FedIterHT by developing an upper bound on the distance between the estimator ${x}_{T}$ and the optimal ${x}^{*}$, i.e., $E[{\parallel {x}_{T}-{x}^{*}\parallel}^{2}]$, in the following theorem.
Algorithm 2. Federated Iterative Hard Thresholding (FedIterHT)
Input: the learning rate ${\gamma}_{t}$, the sparsity level $\tau $, and the number of clients N. Initialize ${x}_{0}$.
for $t=0$ to $T-1$ do
  for client $i=1$ to N in parallel do
    ${x}_{t,0}^{(i)}={x}_{t}$
    for $k=0$ to $K-1$ do
      Sample uniformly a batch ${I}_{t,k}^{(i)}$ with batch size ${b}_{t,k}^{(i)}$
      ${g}_{t,k}^{(i)}=\nabla {f}_{{I}_{t,k}^{(i)}}({x}_{t,k}^{(i)})$
      ${x}_{t,k+1}^{(i)}={\mathcal{H}}_{\tau}({x}_{t,k}^{(i)}-{\gamma}_{t}{g}_{t,k}^{(i)})$
    end for
  end for
  ${x}_{t+1}={\mathcal{H}}_{\tau}({\sum}_{i=1}^{N}{p}_{i}{x}_{t,K}^{(i)})$
end for
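The only change relative to Algorithm 1 is the extra ${\mathcal{H}}_{\tau}$ in the local update, which keeps every client-to-server message $\tau $-sparse. A minimal sketch of the local loop (our own illustrative code, with a full local gradient standing in for the minibatch gradient):

```python
import numpy as np

def hard_threshold(x, tau):
    """H_tau: keep the tau largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(x)
    top = np.argpartition(np.abs(x), -tau)[-tau:]
    out[top] = x[top]
    return out

def fediterht_local(x_t, grad, gamma, tau, K):
    """K local steps of Algorithm 2: each gradient step is followed by H_tau,
    so the update returned to the server has at most tau nonzeros."""
    xi = x_t.copy()
    for _ in range(K):
        xi = hard_threshold(xi - gamma * grad(xi), tau)
    return xi
```

The server-side aggregation line is identical to that of Algorithm 1.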
Theorem 2. Let ${x}^{*}$ be the optimal solution to Problem (2), ${\tau}^{*}={\parallel {x}^{*}\parallel}_{0}$, and suppose $f(x)$ satisfies Assumptions 1 and 2. The condition number is ${\kappa}_{s}=\frac{{l}_{s}}{{\rho}_{s}}\ge 1$. Let the stepsize be ${\gamma}_{t}=\frac{1}{6{l}_{s}}$ and the batch size ${b}_{t,k}^{(i)}=\frac{{\Gamma}_{2}}{{\omega}_{2}^{t}}$, with ${\Gamma}_{2}\ge \frac{{\xi}_{2}{\sum}_{i=1}^{N}{p}_{i}{\sigma}_{i}^{2}}{{\delta}_{2}{\parallel {x}_{0}-{x}^{*}\parallel}^{2}}$, ${\delta}_{2}=(2\alpha +3{\alpha}^{2}){(1-\frac{1}{12{\kappa}_{s}})}^{K}$, $\alpha =\frac{2\sqrt{{\tau}^{*}}}{\sqrt{\tau -{\tau}^{*}}}$, and the sparsity level $\tau \ge (\frac{16}{{(\sqrt{\frac{12{\kappa}_{s}}{12{\kappa}_{s}-1}}-1)}^{2}}+1){\tau}^{*}$. Then, the following inequality holds for the FedIterHT: where ${\theta}_{2}={\omega}_{2}={(1+2\alpha )}^{2}{(1-\frac{1}{12{\kappa}_{s}})}^{K}\in (0,1)$, ${g}_{3}({x}^{*})=\frac{{\xi}_{2}{\mathcal{B}}^{2}}{1-{\psi}_{2}}{\parallel {\pi}_{\tilde{\mathcal{I}}}(\nabla f({x}^{*}))\parallel}^{2}$, ${\xi}_{2}=\frac{{(1+\alpha )}^{2}(1-{(1-\frac{1}{12{\kappa}_{s}})}^{K}){\kappa}_{s}}{{l}_{s}^{2}}$, ${\psi}_{2}={(1+\alpha )}^{2}{(1-\frac{1}{12{\kappa}_{s}})}^{K}$, ${\tilde{\mathcal{I}}}^{(i)}=supp\left({\mathcal{H}}_{2\tau}\left(\nabla {f}_{i}\left({\mathit{x}}^{*}\right)\right)\right)\cup supp\left({\mathit{x}}^{*}\right)$ and $\tilde{\mathcal{I}}=supp\left({\mathcal{H}}_{2N\tau}\left(\nabla f\left({\mathit{x}}^{*}\right)\right)\right)\cup supp\left({\mathit{x}}^{*}\right)$. Remark 3. The factor ${\theta}_{2}$ is smaller than ${\theta}_{1}$ in Theorem 1 if $2\alpha =\frac{4\sqrt{{\tau}^{*}}}{\sqrt{\tau -{\tau}^{*}}}\le {(\frac{1-1/(12{\kappa}_{d})}{1-1/(12{\kappa}_{s})})}^{K}-1$, which means that the FedIterHT converges faster than the FedHT when the pre-specified sparsity τ is much larger than the true sparsity.
Both ${\theta}_{2}$ and ${\theta}_{1}$ decrease when the number of local iterations K increases, but ${\theta}_{2}$ decreases faster than ${\theta}_{1}$ because $1-\frac{1}{12{\kappa}_{s}}$ is smaller than $1-\frac{1}{12{\kappa}_{d}}$. Thus, the FedIterHT is more likely than the FedHT to benefit from increasing K. The statistical bias term ${g}_{3}({x}^{*})$ can be much smaller than ${g}_{1}({x}^{*})$ in Theorem 1 because ${g}_{3}({x}^{*})$ depends only on the norm of $\nabla f({x}^{*})$ restricted to the support $\tilde{\mathcal{I}}$ of size $2N\tau +{\tau}^{*}$. Because the norm of the gradient is the dominating factor in ${g}_{1}$ and ${g}_{3}$, slightly increasing K does not significantly change the statistical bias terms (when $d\gg 2N\tau +{\tau}^{*}$).
Using the results in Theorem 2, we can further derive Corollary 3, which specifies the number of rounds required to achieve a given estimation precision.
Corollary 3. If all the conditions in Theorem 2 hold, then for a given $\epsilon >0$, the FedIterHT requires at most $T\le {C}_{2}\mathrm{log}(\frac{\parallel {x}_{0}-{x}^{*}\parallel}{\epsilon})$ rounds to obtain where ${C}_{2}={(-\mathrm{log}({\theta}_{2}))}^{-1}$. Because ${g}_{3}({x}^{*})=O({\parallel {\pi}_{\tilde{\mathcal{I}}}(\nabla f({x}^{*}))\parallel}^{2})$, and we also know ${\parallel {\pi}_{\tilde{\mathcal{I}}}(\nabla f({x}^{*}))\parallel}^{2}\le {(2N\tau +{\tau}^{*})}^{2}{\parallel \nabla f({x}^{*})\parallel}_{\infty}^{2}$ and $2N\tau +{\tau}^{*}\ll d$ in high-dimensional statistical problems, the result in Corollary 3 gives a tighter bound than the one obtained in Corollary 1. Similarly, we also obtain a tighter upper bound on the convergence of the objective function $f(x)$.
Corollary 4. If all the conditions in Theorem 2 hold, let ${\Delta}_{2}={l}_{s}{\parallel {x}_{0}-{x}^{*}\parallel}^{2}$ and ${g}_{4}({x}^{*})=O({\parallel {\pi}_{\tilde{\mathcal{I}}}(\nabla f({x}^{*}))\parallel}^{2})$; then we have The theorem and corollaries developed in this section depend only on ${l}_{s}$-restricted smoothness and ${\rho}_{s}$-restricted strong convexity, where $s=2\tau +{\tau}^{*}$, which are the same conditions used in the analysis of existing IHT methods. Moreover, $\parallel {\pi}_{\tilde{\mathcal{I}}}(\nabla f({x}^{*}))\parallel \le (2N\tau +{\tau}^{*}){\parallel \nabla f({x}^{*})\parallel}_{\infty}$, which means ${g}_{3}({x}^{*})$ and ${g}_{4}({x}^{*})$ are $O({(2N\tau +{\tau}^{*})}^{2}{\parallel \nabla f({x}^{*})\parallel}_{\infty}^{2})$, where $2N\tau +{\tau}^{*}$ is the size of the support $\tilde{\mathcal{I}}$. Therefore, our results match the current best-known upper bound on the statistical bias term obtained for traditional IHT methods.
5. Experiments
We empirically evaluate our methods both in simulations and on three real-world datasets: E2006-tfidf, RCV1 and MNIST (Table 2; all downloaded from the LibSVM website, https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, accessed on 1 July 2022), and compare them against a baseline method. The baseline is a standard Distributed IHT that communicates every local update to the central server, which then aggregates and broadcasts back to the clients (see Appendix A.1 for more details). Specifically, the experiments for simulation I and on the E2006-tfidf dataset are conducted for sparse linear regression. We solve the sparse logistic regression problem in simulation II and on the RCV1 dataset. The last experiment uses the MNIST data in a multiclass softmax regression problem. The exact loss functions for the various problems are given in Appendix A.2.
Following the convention in the federated learning literature, we use the number of communication rounds to measure the communication cost. For a comprehensive comparison, we also report the number of iterations. For both the synthetic and real-world datasets, the algorithm parameters are determined by the following criteria. The number of local iterations K is searched over $\{3,5,8,10\}$. We have tested the performance of our proposed algorithms under different values of K (see Figure 1). The stepsize $\gamma $ for each algorithm is set by a grid search over $\{10,1,0.6,0.3,0.1,0.06,0.03,0.01,0.001\}$. All the algorithms are initialized with ${x}^{(0)}=0$. The sparsity level $\tau $ is 500 for the MNIST dataset and 200 for the other two datasets.
5.1. Simulations
To generate synthetic data, we follow a setup similar to that in [37]. In simulation I, for each device $i\in \left[100\right]$, we generate samples $({z}_{i,j},{y}_{i,j})$ for $j\in \left[100\right]$ according to ${y}_{i,j}={z}_{i,j}^{T}{x}_{i}+{b}_{i,j}$, where ${z}_{i,j}\in {\mathbb{R}}^{1000}$ and ${x}_{i}\in {\mathbb{R}}^{1000}$. The first 100 elements of ${x}_{i}$ are drawn from $\mathcal{N}({u}_{i},1)$ and the remaining elements of ${x}_{i}$ are zeros; ${b}_{i,j}\sim \mathcal{N}({u}_{i},1)$, ${u}_{i}\sim \mathcal{N}(0.1,\alpha )$, and ${z}_{i,j}\sim \mathcal{N}({v}_{i},\mathsf{\Sigma})$, where $\mathsf{\Sigma}$ is a diagonal matrix whose i-th diagonal element equals $\frac{1}{{i}^{1.2}}$. Each element of the mean vector ${v}_{i}$ is drawn from $\mathcal{N}({B}_{i},1)$ with ${B}_{i}\sim \mathcal{N}(0,\beta )$. Therefore, $\alpha $ controls how much the local models differ from each other, and $\beta $ controls how much the local on-device data differ from one another; hence, we have simulated non-IID federated data. In simulation I, $(\alpha ,\beta )\in \{(0.1,0.1),(0.5,0.5),(1,1)\}$. The data generation procedure for simulation II is the same as that of simulation I, except that ${y}_{i,j}^{\prime}=\mathrm{exp}({z}_{i,j}^{T}{x}_{i}+{b}_{i,j})/(1+\mathrm{exp}({z}_{i,j}^{T}{x}_{i}+{b}_{i,j}))$; then, for the i-th client, we set ${y}_{i,j}=1$ for the top 100 values of ${y}_{i,j}^{\prime}$ over $j\in \left[1000\right]$, and ${y}_{i,j}=0$ otherwise. In simulation II, we also set $(\alpha ,\beta )\in \{(0.1,0.1),(0.5,0.5),(1,1)\}$.
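The simulation-I generator can be sketched as follows (our own code; we interpret the second parameter of each $\mathcal{N}(\cdot ,\cdot )$ above as a variance, which is an assumption):

```python
import numpy as np

def make_sim1_data(n_devices=100, n_samples=100, d=1000, s=100,
                   alpha=0.5, beta=0.5, seed=0):
    """Non-IID linear regression data in the style of simulation I."""
    rng = np.random.default_rng(seed)
    sigma_diag = 1.0 / np.arange(1, d + 1) ** 1.2       # Sigma is diagonal with 1/i^1.2
    data = []
    for _ in range(n_devices):
        u = rng.normal(0.1, np.sqrt(alpha))             # u_i ~ N(0.1, alpha)
        B = rng.normal(0.0, np.sqrt(beta))              # B_i ~ N(0, beta)
        x = np.zeros(d)
        x[:s] = rng.normal(u, 1.0, size=s)              # first s entries of x_i nonzero
        v = rng.normal(B, 1.0, size=d)                  # mean vector v_i
        z = v + rng.normal(size=(n_samples, d)) * np.sqrt(sigma_diag)
        y = z @ x + rng.normal(u, 1.0, size=n_samples)  # y = z^T x_i + b
        data.append((z, y))
    return data
```

Larger `alpha` spreads the per-device models apart, and larger `beta` spreads the per-device feature means apart, mirroring the roles of $\alpha $ and $\beta $ described above.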
The results in Figure 2 show that, with a higher degree of non-IID-ness, both the FedHT and the FedIterHT tend to converge more slowly. We also compare the proposed methods with the baseline Distributed IHT. In Figure 3, we observe that in simulation I, the FedIterHT needs only 20 communication rounds (∼$5\times $ fewer) to reach the same objective value that the Distributed IHT obtains with more than 100 communication rounds; in simulation II, the FedIterHT needs 50 communication rounds (∼$4\times $ fewer) to achieve the same objective value that the Distributed IHT obtains with 200 communication rounds.
5.2. Benchmark Datasets
We use the E2006-tfidf dataset [47] to predict the volatility of stock returns based on SEC-mandated financial text reports, represented by tf-idf features. It was collected from thousands of publicly traded U.S. companies; data from different companies are inherently non-identical, and privacy considerations for financial data demand federated learning. The RCV1 dataset [48] is used to predict the categories of newswire stories collected by Reuters, Ltd. RCV1 can be naturally partitioned by news category and used for federated learning experiments, since readers may only be interested in one or two categories of news. Our model training process mimics a personalized privacy-preserving news recommender system: we use the K-means method to partition each dataset into 10 clusters, and each device randomly selects two of the clusters for use in learning. We run t-SNE to visualize the hidden structures found by K-means, as shown in Figure 4 and Figure 5 for the E2006-tfidf dataset (sparse linear regression) and the RCV1 dataset (sparse logistic regression), respectively. For the MNIST images, the 10 digits automatically serve as the clusters.
For all datasets, the data in each cluster are evenly partitioned into 20 parts, and each client randomly picks two clusters and selects one part of the data from each of the clusters. Because the MNIST images are evenly collected for each digit, the partitioned decentralized MNIST data are balanced in terms of categories, whereas the other two datasets are unbalanced.
Figure 6 shows that our proposed FedHT and FedIterHT can significantly reduce the number of communication rounds required to achieve a given accuracy. In Figure 6a,c, we further notice that federated learning displays more randomness when approaching the optimal solution. This may be caused by the dissimilarity across clients. For instance, the three algorithms in Figure 6c reach the neighborhoods of different solutions at the end, where the proposed FedIterHT obtains the lowest objective value. These behaviors may be worth exploring further in the future.