Federated Semi-Supervised Learning with Uniform Random and Lattice-Based Client Sampling

Zhang, Mei; Yang, Feng

doi:10.3390/e27080804

by Mei Zhang ¹ and Feng Yang ²,*

¹ School of Mathematics, Southwest Minzu University, Chengdu 610225, China

² School of Mathematical Sciences, Sichuan Normal University, Chengdu 610066, China

* Author to whom correspondence should be addressed.

Entropy 2025, 27(8), 804; https://doi.org/10.3390/e27080804

Submission received: 2 July 2025 / Revised: 25 July 2025 / Accepted: 26 July 2025 / Published: 28 July 2025

(This article belongs to the Special Issue Number Theoretic Methods in Statistics: Theory and Applications)

Abstract

Federated semi-supervised learning (Fed-SSL) has emerged as a powerful framework that leverages both labeled and unlabeled data distributed across clients. To reduce communication overhead, real-world deployments often adopt partial client participation, where only a subset of clients is selected in each round. However, under non-i.i.d. data distributions, the choice of client sampling strategy becomes critical, as it significantly affects training stability and final model performance. To address this challenge, we propose a novel federated averaging semi-supervised learning algorithm, called FedAvg-SSL, that considers two sampling approaches, uniform random sampling (standard Monte Carlo) and a structured lattice-based sampling, inspired by quasi-Monte Carlo (QMC) techniques, which ensures more balanced client participation through structured deterministic selection. On the client side, each selected participant alternates between updating the global model and refining the pseudo-label model using local data. We provide a rigorous convergence analysis, showing that FedAvg-SSL achieves a sublinear convergence rate with linear speedup. Extensive experiments not only validate our theoretical findings but also demonstrate the advantages of lattice-based sampling in federated learning, offering insights into the interplay among algorithm performance, client participation rates, local update steps, and sampling strategies.

Keywords:

federated semi-supervised learning; convergence rate; linear speedup; quasi-Monte Carlo techniques; partial client participation

1. Introduction

1.1. Background

With the exponential growth of decentralized data generated by devices such as smartphones, federated learning (FL) [1,2] provides a framework for training high-quality shared global models while preserving data privacy. However, compared with decentralized optimization [3,4], FL gives rise to significant challenges, particularly in terms of communication efficiency and client availability. To address these challenges, various FL algorithms have been developed [2], with federated averaging (FedAvg) [1] being one of the most widely used. FedAvg works by randomly selecting a subset of clients, performing local stochastic gradient descent (SGD) [5] on the selected clients, and aggregating the resulting models at the server [6]. Numerous studies have analyzed the convergence of FedAvg [7] and proposed variants of it [8,9,10], particularly in scenarios with partial client participation.

Existing federated learning approaches [11,12] commonly assume a supervised learning setup, where local private data is fully labeled. However, data labeling is often expensive and time-consuming, while unlabeled data is readily available in abundance. This drives us to effectively leverage these massive, distributed pools of unlabeled data to enhance federated learning. A straightforward approach might be to apply semi-supervised learning (SSL) techniques [13,14,15] to handle unlabeled data, while using traditional federated learning algorithms to aggregate the learned weights.

1.2. Related Works

Semi-Supervised Learning. Numerous studies have explored centralized semi-supervised learning. One prominent approach is pseudo-labeling [16], which generates one-hot labels from highly confident predictions on unlabeled data and uses these labels as training targets with a standard cross-entropy loss. Another major category of SSL methods involves consistency regularization. These methods, such as the ladder network [17], the $Π$ model [18], the mean teacher [19], virtual adversarial training (VAT) [20,21], and MixMatch [22], assume that the class semantics remain invariant under transformations of the input instances. To leverage this assumption, they enforce the model’s predictions to be consistent across different perturbations of the input data. Notably, Du et al. [23] further advanced this direction by proposing label propagation for imbalanced multi-label classification, and Ref. [24] developed specialized augmentation strategies for semi-supervised medical image segmentation.

Federated Semi-Supervised Learning. Federated semi-supervised learning (Fed-SSL) [25] extends traditional federated learning to scenarios characterized by limited labeled data and abundant distributed unlabeled data. Notable methods in this area include FedSem [26], which trains a global model on labeled data and incorporates unlabeled data using pseudo-labeling. Jeong et al. [27] introduced an inter-client consistency loss to align predictions across clients, enhancing the utilization of unlabeled data. Recently, Wang et al. [28] proposed a personalized Fed-SSL framework that leverages adaptive variance reduction and normalized aggregation to address client heterogeneity and improve convergence. Similarly, Liang et al. [29] developed RSCFed, which constructs multiple sub-consensus models by randomly sampling clients and then aggregates these sub-consensus models into the global model for improved reliability and performance. Ref. [30] proposed a two-stage sampling method that uses the predicted distribution changes of samples after different data augmentations.

Partial Client Participation. In federated learning (FL), clients may randomly join or leave the system, leading to a stochastic and time-varying set of active participants across communication rounds. For FL with non-i.i.d. datasets and partial client participation [31], several advancements have been made. Yang et al. [7] established a linear speedup in the convergence of FedAvg under partial client participation. Wang et al. [11] provided a unified convergence analysis for FL with arbitrary client participation, and Wang et al. [32] proposed FedAU, which adaptively weights client updates using online estimates of optimal weights without requiring prior knowledge of participation statistics. Ref. [33] considered a variance reduction algorithm applied at the server that eliminates error due to partial client participation.

Despite these efforts, most convergence analyses focus solely on traditional FL. In the context of federated semi-supervised learning, especially with non-i.i.d. datasets and partial client participation, the convergence behavior remains underexplored. A fundamental question arises: Can an algorithm in Fed-SSL still achieve the same linear speedup for convergence under non-i.i.d. data distributions and varying degrees of client participation? Addressing this challenge is crucial for extending the theoretical foundations of Fed-SSL and ensuring its robustness in practical applications.

Quasi-Monte Carlo Methods in Subsampling. The challenge of partial client participation in FL shares similarities with high-dimensional numerical integration. Quasi-Monte Carlo (QMC) methods [34], rooted in number-theoretic techniques [35], address this by constructing low-discrepancy sequences that achieve superior uniformity in sample distribution, akin to the uniform design in experimental design theory. Zhang et al. [36] and Zhou et al. [37] proposed an efficient model-free subsampling method based on uniform design, which generates representative subdata from the original dataset. This inspires us to employ quasi-Monte Carlo approaches in a similar way to address partial client participation.

1.3. Contributions

In this paper, we tackle the federated semi-supervised learning problem with partial client participation, which introduces significant complexity compared to traditional federated optimization problems. Specifically, Fed-SSL involves two sets of optimization variables: the global model and the pseudo-labels, along with the added challenge of constraints. As a result, proving the convergence of algorithms for this problem is notably more difficult.

The main contributions of this paper are as follows:

We propose an efficient federated learning algorithm, FedAvg-SSL, which incorporates partial client participation and alternates between updating the global model and refining pseudo-labels on local clients. The algorithm supports both uniform random sampling and a more structured lattice-based sampling strategy at the server side. While uniform sampling ensures simplicity and unbiased selection, the lattice-based approach offers more balanced client participation, which improves model stability under non-i.i.d. data.
We establish a rigorous convergence analysis for FedAvg-SSL, demonstrating a sublinear convergence rate and linear speedup under partial client participation.
Experimental results validate the performance of the proposed algorithm, showing its consistency with theoretical findings and illustrating the relationship between algorithm performance, the number of participating clients, and the number of local steps.

1.4. Organization

The rest of the paper is organized as follows. Section 2 introduces the federated SSL optimization problem. In Section 3, we present an efficient federated SSL algorithm that incorporates partial client participation via both uniform random sampling and a more structured lattice-based sampling strategy at the server side. In addition, the proposed algorithm alternates between updating the global model and refining pseudo-labels on local clients.

Section 4 provides a convergence rate analysis of the proposed algorithm. In Section 5, we discuss the practical implications of our results. Finally, we conclude in Section 6, with all proofs deferred to the appendix.

2. Problem Formulation

Consider an FL setting with a server and K distributed clients. We assume that both labeled data $L$ and unlabeled data $U$ are distributed across these K clients. Specifically, for each client k, the local dataset $D_{k}$ consists of both labeled and unlabeled data, i.e., $D_{k} = L_{k} \cup U_{k}$, where $L_{k} = \{(x_{k, i}, y_{k, i}), i = 1, \dots, N_{k}\}$ is the local labeled dataset and $U_{k} = \{u_{k, i}, i = 1, \dots, M_{k}\}$ is the local unlabeled dataset. Here, $x_{k, i}$ is the i-th labeled sample for the k-th client with the corresponding label $y_{k, i} \in \mathbb{R}^{C}$, which is a one-hot vector representing the true class label, and C is the number of classes. Similarly, $u_{k, i}$ is the i-th unlabeled sample for the k-th client. The datasets $D_{k}$, for $k \in [K] ≜ \{1, \dots, K\}$, are assumed to be non-overlapping. Additionally, $N_{k}$ and $M_{k}$ denote the number of labeled and unlabeled samples, respectively, for client k.

Let $\hat{y} = \{\hat{y}_{1}, \dots, \hat{y}_{K}\}$ denote the collection of pseudo-labels across all clients, where $\hat{y}_{k} = \{\hat{y}_{k, 1}, \dots, \hat{y}_{k, M_{k}}\}$ represents the set of pseudo-labels for the k-th client. Each pseudo-label satisfies $\hat{y}_{k, i} \in \mathbb{R}^{C}$, and $\hat{y}_{k}$ is constrained to the feasible set $Y_{k} = \{\hat{y}_{k} \mid e^{⊤} \hat{y}_{k, i} = 1, \hat{y}_{k, i} \geq 0, i \in [M_{k}]\}$, where $e$ is the all-ones vector, so that every $\hat{y}_{k, i}$ is a proper probability distribution. Following [38], we formulate the Fed-SSL optimization problem as follows:

\min_{θ, \hat{y}} \quad F (θ, \hat{y}) ≜ \frac{1}{K} \sum_{k = 1}^{K} \underbrace{\left[ ℓ_{k} (θ, \hat{y}_{k}) + α_{1} r_{1} (\hat{y}_{k}) \right]}_{≜ F_{k} (θ, \hat{y}_{k})}

(1a)

\text{s.t.} \quad \hat{y}_{k} \in Y_{k}, \; k \in [K] .

(1b)

The loss function for each client consists of two components: a supervised loss on labeled data and a pseudo-labeled loss on unlabeled data, defined as follows:

ℓ_{k} (θ, {\hat{y}}_{k}) = L_{CE} (θ; L_{k}) + α_{0} L_{CE} (θ; U_{k}, {\hat{y}}_{k})

where the cross-entropy loss is given by the following:

L_{CE} (θ; L) = - \frac{1}{N} \sum_{i = 1}^{N} \langle y_{i}, \log (f_{θ} (x_{i})) \rangle .

To prevent overconfident pseudo-labels, we introduce a regularization term $r_{1} (\hat{y}_{k})$:

\begin{matrix} r_{1} ({\hat{y}}_{k}) = \frac{1}{M_{k}} \sum_{i = 1}^{M_{k}} KL ({\hat{y}}_{k, i}, u), \end{matrix}

(2)

where $u = [\frac{1}{C}, \dots, \frac{1}{C}] \in \mathbb{R}^{C}$ represents the uniform distribution, and $KL (\cdot, \cdot)$ denotes the Kullback–Leibler divergence.

The optimization problem in (1) presents two main challenges: (a) it involves two groups of variables ($θ$ and $\hat{y}$), and (b) the pseudo-labels must satisfy probability simplex constraints. These characteristics naturally suggest an alternating optimization approach for updating these variables.
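To make the structure of the local objective $F_{k}$ in (1a) concrete, the following is a minimal sketch that evaluates it from already-computed model outputs. It assumes that the model predictions are given as lists of class-probability vectors; the function names `cross_entropy`, `kl_to_uniform`, and `local_objective` are hypothetical and introduced only for illustration.

```python
import math


def cross_entropy(probs, targets):
    """Mean of -<target, log(prob)> over samples (lists of length-C lists)."""
    total = 0.0
    for p, y in zip(probs, targets):
        # Entries with zero target weight contribute nothing to the inner product.
        total -= sum(y_j * math.log(p_j) for p_j, y_j in zip(p, y) if y_j > 0)
    return total / len(probs)


def kl_to_uniform(pseudo_labels):
    """Regularizer r_1: mean KL(y_hat_i, u), with u uniform over C classes."""
    total = 0.0
    for y_hat in pseudo_labels:
        c = len(y_hat)
        total += sum(v * math.log(v * c) for v in y_hat if v > 0)
    return total / len(pseudo_labels)


def local_objective(probs_labeled, labels, probs_unlabeled, pseudo_labels,
                    alpha0, alpha1):
    """F_k = CE on labeled data + alpha0 * CE on pseudo-labels + alpha1 * r_1."""
    supervised = cross_entropy(probs_labeled, labels)
    pseudo = cross_entropy(probs_unlabeled, pseudo_labels)
    return supervised + alpha0 * pseudo + alpha1 * kl_to_uniform(pseudo_labels)
```

The regularizer vanishes for uniform pseudo-labels and grows as they approach one-hot vectors, which is how it discourages overconfident pseudo-labels.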

3. FedAvg-SSL Algorithm

In this section, we propose the FedAvg semi-supervised learning (FedAvg-SSL) algorithm as a solution approach to the optimization problem formulated in (1).

The FedAvg-SSL framework operates under the standard federated learning paradigm, where a central server coordinates training across K distributed clients. In each communication round t, the server selects a subset of clients $S^{t} \subseteq [K] = \{1, 2, \dots, K\}$ of fixed size M to participate in model updates. When $M = K$, all clients are involved in every round. In this section, we explore two distinct strategies for determining the active client subset $S^{t}$ at each round t.

3.1. Random Sampling Method

The random sampling method in federated learning selects the clients for each round independently and uniformly from the entire client set $[K] = \{1, 2, \dots, K\}$. For each iteration $t \in \{0, 1, \dots, T - 1\}$, we select a subset of clients $S^{t}$ of size M via uniform random sampling without replacement from $[K]$; then, the probability of any client being selected in a round is

\begin{matrix} P (i \in S^{t}) = \frac{M}{K}, \forall i = 1, \dots, K . \end{matrix}

Although each client has the same probability of being selected in each round, the random sampling method can lead to uneven client participation where some clients might be selected multiple times while others might never be selected, especially when the number of communication rounds is limited.
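The uneven participation described above can be observed with a few lines of code. The sketch below is illustrative only: it draws $S^{t}$ uniformly without replacement in each round and counts how often each client is selected. The helper names `sample_uniform` and `participation_counts` and the seed handling are assumptions, not part of the original method description.

```python
import random
from collections import Counter


def sample_uniform(num_clients, num_selected, rng):
    """Draw M distinct client IDs from {1, ..., K} uniformly without replacement."""
    return sorted(rng.sample(range(1, num_clients + 1), num_selected))


def participation_counts(num_clients, num_selected, num_rounds, seed=0):
    """Count how many times each client is selected over T independent rounds."""
    rng = random.Random(seed)
    counts = Counter({k: 0 for k in range(1, num_clients + 1)})
    for _ in range(num_rounds):
        counts.update(sample_uniform(num_clients, num_selected, rng))
    return counts
```

For example, with K = 100, M = 10, and T = 20, every client is selected twice on average, but the individual counts returned by `participation_counts(100, 10, 20)` generally vary around that value, and some may be zero.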

3.2. Lattice-Based Sampling Method

To improve participation uniformity, we propose a lattice-based sampling method that leverages a uniform design to construct a deterministic selection matrix. This method aims to ensure that all clients are sampled more evenly across communication rounds.

Let T be the total number of rounds and M be the number of clients selected per round. Typically, M is chosen as a divisor of the total number of clients K. Let $q = K / M$; then the set of all clients $[K] = \{1, 2, \dots, K\}$ can be divided into M groups with q clients in each group: $\{1, 2, \dots, q\}, \{q + 1, q + 2, \dots, 2 q\}, \dots, \{(M - 1) q + 1, (M - 1) q + 2, \dots, K\}$. Our selection strategy chooses exactly one client from each of these M predefined groups, resulting in M selected clients per round. Because each group contains q distinct clients, the setup naturally maps onto q-level uniform designs from experimental design theory, which have good space-filling properties on the experimental domain. We treat the M groups, each containing q distinct clients, as the value ranges of M factors in the experimental design framework. To ensure that all clients are sampled more evenly across T rounds, we perform sampling according to the rows of the uniform design.

Let $U = U_{T} (q^{M})$ denote a uniform design comprising T runs for M factors, each having q levels, such that

(1): $U (t, k) \in \{1, 2, \dots, q\}$ denotes the element in the t-th row and k-th column of the matrix U for all $t = 0, \dots, T - 1, k = 1, \dots, M$ ;
(2): Each level appears equally often in every column.

To obtain a uniform design $U = U_{T} (q^{M})$, the simplest option is to take one directly from the library of uniform designs given by [39], which requires no computation. If the library contains no such design, two other methods can be used to construct one. Let $k (T) = ϕ (T) / 2 + 1$, where $ϕ (\cdot)$ is the Euler totient function. If $M < k (T + 1)$, the leave-one-out good lattice point method combined with pseudo-level transformations [40] can be used to construct a nearly uniform design. Otherwise, the R package UniDOE (proposed by [41]), run on R version 3.4.1, can be used to search for a nearly uniform design. In fact, UniDOE can search for a nearly uniform design for arbitrary $T, M$. However, it is generally somewhat slower than the leave-one-out good lattice point method, so we recommend using UniDOE only when $M \geq k (T + 1)$.

Based on the uniform design U, the sampling design D is obtained by a deterministic offset

\begin{matrix} D (t, k) = U (t, k) + (k - 1) q, \forall t = 0, \dots, T - 1, \forall k = 1, \dots, M . \end{matrix}

(3)

Then, all elements of the sampling design D are the actual client IDs. At each communication round t, one row of the matrix D is selected (either sequentially or randomly without replacement), and the resulting client indices form the active set $S^{t}$.

This lattice-based scheme systematically spreads participation over time, significantly reducing the risk of client over-sampling or under-sampling (see Figure 1). As a result, it is particularly advantageous in federated semi-supervised learning scenarios, where balanced exposure to heterogeneous client data is crucial for effective model generalization.
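The offset construction in (3) can be sketched as follows. Since a true uniform design would come from the design library, the good lattice point method, or UniDOE, the code below substitutes a simple toy design that only guarantees the balance property (each level appears equally often in every column); it is a stand-in for illustration, not the construction used in the paper. The function names and the choice of generators are assumptions.

```python
from math import gcd


def toy_balanced_design(num_rounds, num_groups, q):
    """Stand-in for U_T(q^M): a T x M matrix with levels in 1..q.

    Column k uses ((t * h_k + k) mod q) + 1 with gcd(h_k, q) = 1, so each level
    appears T / q times per column when q divides T. This is NOT a uniform design
    from the library, the good lattice point method, or UniDOE.
    """
    if num_rounds % q != 0:
        raise ValueError("this toy construction needs q to divide T")
    generators = [h for h in range(1, q + 1) if gcd(h, q) == 1]
    design = []
    for t in range(num_rounds):
        row = [((t * generators[k % len(generators)] + k) % q) + 1
               for k in range(num_groups)]
        design.append(row)
    return design


def sampling_design(design, q):
    """Apply D(t, k) = U(t, k) + (k - 1) q; here k is 0-based, so the offset is k * q."""
    return [[level + k * q for k, level in enumerate(row)] for row in design]
```

With K = 100, M = 10, q = 10, and T = 20, every row of `sampling_design(toy_balanced_design(20, 10, 10), 10)` contains exactly one client from each group, and each client appears in the same number of rounds. A genuine uniform design would additionally spread the combinations of clients across groups more evenly, which this toy construction does not attempt.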

3.3. The Proposed Federated SSL Algorithm

In this subsection, we present the proposed FedAvg-SSL algorithm for solving problem (1); its detailed pseudocode is given in Algorithms 1 and 2. Since problem (1) involves two blocks of variables, $θ$ and $\hat{y}$, we propose to train a global model for (1) by combining an alternating update strategy with the local SGD algorithm.

Specifically, for each communication round, our algorithm has two parts: one is the client update and the other is the server update.

Client Side. After the server broadcasts the current global model $θ^{t}$ to all active clients in $S^{t}$, each active client k first updates its pseudo-labels by solving the following optimization problem:

\hat{y}_{k}^{t + 1} = \arg\min_{\hat{y}_{k}} \quad α_{0} L_{CE} (θ^{t}; U_{k}, \hat{y}_{k}) + α_{1} r_{1} (\hat{y}_{k})

(4a)

\text{s.t.} \quad \hat{y}_{k} \in Y_{k} .

(4b)

Following [38], this optimization problem admits a closed-form solution:

[\hat{y}_{k, i}^{t + 1}]_{j} = \frac{[f_{θ^{t}} (u_{k, i})]_{j}^{α_{0} / α_{1}}}{\sum_{j' = 1}^{C} [f_{θ^{t}} (u_{k, i})]_{j'}^{α_{0} / α_{1}}}, \quad j = 1, \dots, C,

(5)

where $[\hat{y}_{k, i}]_{j}$ denotes the j-th entry of $\hat{y}_{k, i}$. After updating the pseudo-labels, each client performs local model updates:

Samples mini-batches of labeled data $ξ_{k}$ and unlabeled data $ζ_{k}$ uniformly at random from $L_{k}$ and $U_{k}$ , respectively.
Computes the stochastic gradient of the local loss function

$\begin{matrix} g_{k} (θ_{k}, {\hat{y}}_{k}) : = \nabla_{θ} F_{k} (θ_{k}, {\hat{y}}_{k}; ξ_{k}, ζ_{k}) . \end{matrix}$

(6)
Updates its local model copy $θ_{k}$ using SGD.
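The closed-form pseudo-label update (5) that precedes these local steps amounts to raising each predicted class probability to the power $α_{0} / α_{1}$ and renormalizing. A minimal sketch, assuming the model outputs $f_{θ^{t}} (u_{k, i})$ are available as lists of class probabilities (the function name `update_pseudo_labels` is hypothetical):

```python
def update_pseudo_labels(predictions, alpha0, alpha1):
    """Apply (5): raise each class probability to alpha0 / alpha1 and renormalize."""
    exponent = alpha0 / alpha1
    pseudo_labels = []
    for probs in predictions:
        powered = [p ** exponent for p in probs]
        total = sum(powered)
        pseudo_labels.append([v / total for v in powered])
    return pseudo_labels
```

An exponent above 1 sharpens the prediction, an exponent below 1 flattens it toward the uniform distribution, and an exponent of exactly 1 returns the model prediction unchanged.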

Server Side. Each active client transmits its updated local model $θ_{k, Q}^{t}$ back to the server, which aggregates these updates to compute the new global model $θ^{t + 1}$ via (7).

Algorithm 1 FedAvg SSL [Server]

1: Input: initial model parameters $θ^{0}$ .
2: for communication round $t = 0$ to $T - 1$ do
3:   Sample clients $S^{t}$ uniformly randomly so that $| S^{t} | = M$ , or according to the sampling design D obtained by (3).
4:   for each client $k \in S^{t}$ in parallel do
5:     Send $θ^{t}$ to client k.
6:     Receive $θ_{k}^{t + 1}$ from client k via Algorithm 2.
7:   end for
8:   Update global model

$\begin{matrix} θ^{t + 1} = (1 - η_{θ}) θ^{t} + \frac{η_{θ}}{M} \sum_{k \in S^{t}} θ_{k}^{t + 1} . \end{matrix}$

(7)
9: end for

Algorithm 2 FedAvg SSL [Client]

1: Receive $θ^{t}$ from the server.
2: Update pseudo-label ${\hat{y}}_{k}^{t + 1}$ from (5).
3: Initialize $θ_{k, 0}^{t} = θ^{t}$ .
4: for $q = 0, \dots, Q - 1$ do
5:   Select data $ξ_{k, q}^{t}$ and $ζ_{k, q}^{t}$ uniformly at random from $L_{k}$ and $U_{k}$ .
6:   Update local client model

$\begin{matrix} θ_{k, q + 1}^{t} = θ_{k, q}^{t} - η_{g} g_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) . \end{matrix}$

(8)
7: end for
8: Set $θ_{k}^{t + 1} = θ_{k, Q}^{t}$ .
9: Send $θ_{k}^{t + 1}$ back to the server.
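The interaction between the local steps (8) and the server aggregation (7) can be illustrated on a toy problem. The sketch below is a simplification under stated assumptions: the model is a scalar, each client k minimizes the quadratic $0.5 (θ - c_{k})^{2}$ with exact gradients, the pseudo-label step is omitted, and clients are indexed from 0. All names (`client_update`, `server_round`, `run`) and default values are hypothetical.

```python
import random


def client_update(theta, center, local_steps, lr_local):
    """Local loop of Algorithm 2 without pseudo-labels: Q gradient steps."""
    local = theta
    for _ in range(local_steps):
        grad = local - center  # exact gradient of the toy loss 0.5 * (theta - center)^2
        local -= lr_local * grad
    return local


def server_round(theta, centers, selected, local_steps, lr_local, lr_global):
    """Aggregation (7): mix the old global model with the mean of returned models."""
    returned = [client_update(theta, centers[k], local_steps, lr_local)
                for k in selected]
    mean_local = sum(returned) / len(returned)
    return (1.0 - lr_global) * theta + lr_global * mean_local


def run(centers, num_selected, num_rounds, local_steps=5, lr_local=0.1,
        lr_global=1.0, seed=0):
    """Run T rounds with uniform random client sampling, starting from theta = 0."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(num_rounds):
        selected = rng.sample(range(len(centers)), num_selected)
        theta = server_round(theta, centers, selected, local_steps,
                             lr_local, lr_global)
    return theta
```

With full participation the iterate contracts toward the average of the client optima; with partial participation it keeps fluctuating with the sampled subset, which is the effect the sampling strategies in Section 3 aim to control.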

4. Convergence Analysis with Partial Client Participation

In this section, we establish the convergence properties of FedAvg-SSL when only a subset of clients updates their variables per outer loop.

4.1. Assumptions

We first make some standard assumptions.

Assumption 1.

The regularization term $r_{1} (\hat{y})$ is a continuously differentiable function. Furthermore, $r_{1} (\hat{y})$ is μ-strongly convex, where $μ > 0$. Specifically, for any $\hat{y}_{1}, \hat{y}_{2} \in Y_{k}$, the following inequality holds:

\begin{matrix} r_{1} ({\hat{y}}_{1}) - r_{1} ({\hat{y}}_{2}) - 〈 \nabla r_{1} ({\hat{y}}_{2}), {\hat{y}}_{1} - {\hat{y}}_{2} 〉 \geq \frac{μ}{2} {∥ {\hat{y}}_{1} - {\hat{y}}_{2} ∥}^{2} . \end{matrix}

(9)

The regularization term $r_{1} (\hat{y}_{k})$ defined in (2) is strongly convex over the probability simplex $Y_{k}$ (see [42], Definition 2 and Example 2). However, it is important to note that the objective function $F_{k} (θ, \hat{y}_{k})$ in (1a) is not jointly convex with respect to $(θ, \hat{y})$.

Assumption 2.

The local cost $F_{k} (θ, \hat{y}_{k})$ is L-smooth (possibly non-convex) with respect to $(θ, \hat{y}_{k})$ for $k \in [K]$, i.e.,

\begin{matrix} ∥ \nabla_{θ} F_{k} (θ, {\hat{y}}_{k}) - \nabla_{θ} F_{k} (θ^{'}, {\hat{y}}_{k}^{'}) ∥ \leq L \sqrt{∥ θ - θ^{'} ∥^{2} + {∥ {\hat{y}}_{k} - {\hat{y}}_{k}^{'} ∥}^{2}}, \end{matrix}

(10)

for all $θ, θ^{'}$ and $\hat{y}_{k}, \hat{y}_{k}^{'} \in Y_{k}$.

Assumption 3.

Given $\hat{y}_{k}$, for any $k \in [K]$, the stochastic gradient is unbiased and has bounded variance for any θ. Specifically, the following conditions hold:

\begin{matrix} E [g_{k} (θ_{k}, {\hat{y}}_{k})] = \nabla_{θ} F_{k} (θ_{k}, {\hat{y}}_{k}), \end{matrix}

(11)

\begin{matrix} E [∥ g_{k} (θ_{k}, {\hat{y}}_{k}) - \nabla_{θ} F_{k} (θ_{k}, {\hat{y}}_{k}) ∥^{2}] \leq σ_{θ}^{2}, \end{matrix}

(12)

where $σ_{θ}^{2}$ bounds the variance of the gradient noise, and $E$ denotes the expectation with respect to the random variables $\{ξ_{k}, ζ_{k}\}$.

Assumption 4

(Bounded Gradient Dissimilarity). There exists a constant $σ_{G} > 0$ such that for any θ and $\hat{y}$, the following inequality holds:

\begin{matrix} \frac{1}{K} \sum_{k = 1}^{K} {∥ \nabla_{θ} F_{k} (θ, {\hat{y}}_{k}) - \nabla_{θ} F (θ, \hat{y}) ∥}^{2} \leq σ_{G}^{2} . \end{matrix}

(13)

Assumptions 2 and 3 are standard in first-order stochastic algorithms, while Assumption 4 constrains the data heterogeneity by bounding the gradient dissimilarity between different clients, which is commonly used in [7,11,28].

4.2. Convergence Analysis of FedAvg-SSL

From (7) and (8), the update for the global model across consecutive outer loops is given by

\begin{matrix} θ^{t + 1} - θ^{t} & = \frac{η_{θ}}{M} \sum_{k \in S^{t}} (θ_{k}^{t + 1} - θ^{t}) = - \frac{η}{M} \sum_{k, q} I_{k \in S^{t}} g_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}), \end{matrix}

(14)

where $η = η_{θ} η_{g}$ and $I_{k \in S^{t}}$ denotes the indicator function of the event $k \in S^{t}$, i.e., that client k participates in the current round.

A critical challenge arises because the right-hand side of the above equation does not provide an unbiased estimate of the full gradient. Specifically, we have

\begin{matrix} \frac{1}{M} \sum_{k, q} E [I_{k \in S^{t}} g_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1})] \neq \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}), \end{matrix}

(15)

where $\nabla_{θ} F (θ^{t}, \hat{y}^{t + 1})$ is the gradient of the full objective function with respect to θ.

This discrepancy stems from the stochastic nature of partial client participation, as only a subset of clients $S^{t}$ is actively involved in each communication round. Consequently, the aggregated gradient $\frac{1}{M} \sum_{k, q} I_{k \in S^{t}} g_{k}$ introduces bias, which complicates the convergence analysis. Specifically, controlling this bias becomes essential to establish rigorous convergence guarantees for FedAvg-SSL under partial client participation.

Before we show the convergence, let us first define the following optimality gap:

\begin{matrix} g_{t} ≜ E [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) ∥^{2}] + \frac{1}{K} \sum_{k = 1}^{K} E [∥ {\hat{y}}_{k}^{t + 1} - {\hat{y}}_{k}^{t} ∥^{2}] . \end{matrix}

(16)

This quantity measures how far an iterate is from a stationary point. By definition, $g_{t} \geq 0$. If $g_{t} = 0$, the iterate $\{θ^{t}, \hat{y}^{t}\}$ is a stationary point of problem (1). Indeed, when $g_{t} = 0$, (16) gives

\begin{matrix} \nabla_{θ} F (θ^{t}, {\hat{y}}^{t}) = 0, {\hat{y}}_{k}^{t + 1} - {\hat{y}}_{k}^{t} = 0 . \end{matrix}

(17)

Since $\hat{y}_{k}^{t + 1}$ minimizes $F_{k} (θ^{t}, \cdot)$ over $Y_{k}$, the first-order optimality condition gives

\begin{matrix} 〈 \nabla_{{\hat{y}}_{k}} F_{k} (θ^{t}, {\hat{y}}_{k}^{t + 1}), {\hat{y}}_{k} - {\hat{y}}_{k}^{t + 1} 〉 \geq 0, for \forall {\hat{y}}_{k} \in Y_{k} . \end{matrix}

(18)

Combining (17) and (18), for all θ and all $\hat{y}_{k} \in Y_{k}$, we have

\begin{matrix} 〈 \nabla_{θ} F (θ^{t}, {\hat{y}}^{t}), θ - θ^{t} 〉 + \sum_{k = 1}^{K} 〈 \nabla_{{\hat{y}}_{k}} F_{k} (θ^{t}, {\hat{y}}_{k}^{t}), {\hat{y}}_{k} - {\hat{y}}_{k}^{t} 〉 \geq 0 . \end{matrix}

The above inequality implies that $\{θ^{t}, \hat{y}^{t}\}$ is a stationary point of problem (1).

Theorem 1.

If $η_{g}^{2} \leq \frac{1}{64 Q (Q - 1) L^{2}}$ and $η := η_{g} η_{θ} \leq \frac{M (K - 1)}{(3 + 12 Q) Q L (K - M)}$, then the sequence $\{θ^{t}, \hat{y}^{t}\}$ generated by Algorithm 1 satisfies

\begin{matrix} \frac{1}{T} \sum_{t = 0}^{T - 1} g_{t} & \leq \frac{4 E [F (θ^{0}, {\hat{y}}^{0})]}{η Q T} + η_{g}^{2} B_{1} + 4 η η_{g}^{2} B_{2} + 4 η B_{3}, \end{matrix}

(19)

where

\begin{matrix} B_{1} & = L^{2} (Q (Q - 1)) σ_{G}^{2} + 2 L^{2} (Q - 1) σ_{θ}^{2}, \end{matrix}

(20)

\begin{matrix} B_{2} & = \frac{24 Q^{3} L^{3} (K - M)}{M (K - 1)} σ_{G}^{2} + \frac{6 Q (Q - 1) L^{3} (K - M)}{M (K - 1)} σ_{θ}^{2}, \end{matrix}

(21)

\begin{matrix} B_{3} & = \frac{3 L Q (K - M)}{2 M (K - 1)} σ_{G}^{2} + \frac{L}{2 M} σ_{θ}^{2} . \end{matrix}

(22)

Proof.

See Appendix A. □

Remark 1.

Theorem 1 provides an explicit upper bound on the optimality gap, which offers useful insights into how several key factors influence the convergence behavior of the proposed FedAvg-SSL algorithm:

Number of participating clients (M): When all clients participate in each round, i.e., $M = K$ , the term $B_{2}$ in the bound becomes zero. As a result, the convergence rate improves, demonstrating the benefit of increased client participation.
Number of local steps (Q): As Q increases, the accumulated local update noise can affect convergence. Theorem 1 suggests that in such cases, the global and local learning rates η and $η_{g}$ should be chosen smaller to maintain stability and ensure convergence.

These observations provide practical guidance for selecting different parameters in FedAvg-SSL.
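To explore how M and Q enter the bound, the right-hand side of (19) can be evaluated numerically. The sketch below simply transcribes (19)–(22) and the step-size conditions of Theorem 1; the function names, argument names, and the treatment of the degenerate cases Q = 1 and M = K (where the corresponding condition is read as vacuous) are assumptions made for illustration.

```python
def theorem1_bound(K, M, Q, T, L, sigma_g2, sigma_theta2, eta_g, eta, f0):
    """Evaluate the right-hand side of (19) with B1, B2, B3 from (20)-(22).

    sigma_g2 and sigma_theta2 stand for sigma_G^2 and sigma_theta^2, and f0 for
    E[F(theta^0, y_hat^0)]. Requires K > 1.
    """
    ratio = (K - M) / (M * (K - 1))
    b1 = L**2 * Q * (Q - 1) * sigma_g2 + 2 * L**2 * (Q - 1) * sigma_theta2
    b2 = (24 * Q**3 * L**3 * ratio * sigma_g2
          + 6 * Q * (Q - 1) * L**3 * ratio * sigma_theta2)
    b3 = 1.5 * L * Q * ratio * sigma_g2 + L / (2 * M) * sigma_theta2
    rhs = (4 * f0 / (eta * Q * T) + eta_g**2 * b1
           + 4 * eta * eta_g**2 * b2 + 4 * eta * b3)
    return rhs, (b1, b2, b3)


def step_sizes_admissible(K, M, Q, L, eta_g, eta):
    """Check the two step-size conditions of Theorem 1.

    Assumption: when Q = 1 or M = K the respective upper bound is infinite,
    so that condition is treated as always satisfied.
    """
    cond_local = Q == 1 or eta_g**2 <= 1 / (64 * Q * (Q - 1) * L**2)
    cond_global = M == K or eta <= M * (K - 1) / ((3 + 12 * Q) * Q * L * (K - M))
    return cond_local and cond_global
```

Calling `theorem1_bound` with M = K returns a zero $B_{2}$, matching the first observation in Remark 1, and comparing the returned values for several M with all other arguments fixed shows how the bound tightens as participation increases.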

With appropriate step sizes, we can derive the following corollary.

Corollary 1.

Let $η_{g} = \frac{1}{\sqrt{T} Q L}$ and $η = \frac{\sqrt{M}}{\sqrt{T Q} L}$; then the convergence rate of the FedAvg-SSL algorithm under partial client participation is given by

\begin{matrix} min_{t = 0, \dots, T - 1} g_{t} & \leq O (\frac{1}{\sqrt{M T Q}} + \frac{1}{T}) . \end{matrix}

(23)

Remark 2.

Corollary 1 demonstrates that the required number of outer loop iterations T to achieve an ϵ-solution decreases linearly as M increases. This phenomenon is commonly referred to as linear speedup with respect to M. To the best of our knowledge, this is the first work to establish the linear speedup of a federated semi-supervised learning algorithm.

5. Simulation Results

We conducted extensive experiments on the MNIST dataset to evaluate the effectiveness of our proposed federated semi-supervised learning approach. The experiments were implemented using PyTorch version 1.8 on a CUDA-enabled GPU, utilizing the MNIST dataset which contains 60,000 training images and 10,000 test images of handwritten digits.

To simulate a realistic federated learning scenario, we configured our system with 100 clients ($K = 100$) and implemented a partial participation strategy where only 10% of clients (10 clients) participate in each federation round. The dataset was divided into labeled and unlabeled portions, with 30% of the data being labeled and the remaining 70% unlabeled. We investigated two distinct data distribution scenarios:

IID Setting: The training data is randomly and uniformly distributed across clients, ensuring each client receives approximately equal amounts of both labeled and unlabeled samples. This setting serves as our baseline distribution.
Non-IID Setting: Labeled data is randomly shuffled and split into twice as many shards as there are clients, with each client randomly assigned two unique shards. This ensures that each client receives a distinct, non-overlapping subset of the labeled data, with each shard containing a random mix of classes. The unlabeled data is distributed using a Dirichlet distribution ( $α = 0.01$ ) to further enhance the non-IID nature of the data across clients.
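The shard-based split of the labeled data in the non-IID setting can be sketched as follows. This is an illustrative reading of the description above, not the authors' script: the function name `shard_partition`, the seed handling, and the decision to drop leftover samples are assumptions, and the Dirichlet split of the unlabeled data is not covered.

```python
import random


def shard_partition(num_samples, num_clients, shards_per_client=2, seed=0):
    """Shuffle sample indices, cut them into shards, and give each client its own shards."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    num_shards = num_clients * shards_per_client
    shard_size = num_samples // num_shards  # leftover samples are dropped in this sketch
    shards = [indices[s * shard_size:(s + 1) * shard_size]
              for s in range(num_shards)]
    shard_ids = list(range(num_shards))
    rng.shuffle(shard_ids)  # random assignment of unique shards to clients
    assignment = {}
    for k in range(num_clients):
        own = shard_ids[k * shards_per_client:(k + 1) * shards_per_client]
        assignment[k] = [i for s in own for i in shards[s]]
    return assignment
```

For instance, if 30% of the 60,000 MNIST training images are labeled, `shard_partition(18000, 100)` assigns 180 labeled indices to each of the 100 clients, with no index shared between clients.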

For our model architecture, we employed a simple multi-layer perceptron (MLP) with one hidden layer containing 256 units and ReLU activations. The training process utilized a mini-batch size of 32 for both labeled and unlabeled data, with learning rates set to $η_{g} = η_{θ} = 0.01$. Each participating client performed 10 local epochs during its training phase.

Impact of hyperparameters $α_{0}$ and $α_{1}$ . The hyperparameter $α_{0}$ controls the model’s reliance on the unlabeled data during training. To prevent the model from prematurely overfitting to noisy pseudo-labels, we adopt a gradually increasing schedule for $α_{0}$, inspired by the work in [16]. Specifically, we define

α_{0} = \begin{cases} \exp \left( - 5 \left( 1 - \frac{t}{T_{1}} \right)^{2} \right), & \text{if } t < T_{1}, \\ 1, & \text{if } t \geq T_{1} . \end{cases}
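The ramp-up schedule above translates directly into code. In this minimal sketch, `alpha0_schedule` is a hypothetical name, t is the communication round, and `t1` stands for $T_{1}$:

```python
import math


def alpha0_schedule(t, t1):
    """Ramp-up weight: exp(-5 * (1 - t / T1)^2) for t < T1, and 1 afterwards."""
    if t >= t1:
        return 1.0
    return math.exp(-5.0 * (1.0 - t / t1) ** 2)
```

The weight starts at $\exp (- 5)$ in round 0, increases monotonically, and stays at 1 from round $T_{1}$ onward, so the pseudo-labeled loss is phased in only after the model has seen some labeled updates.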

The second hyperparameter, $α_{1}$, serves as a regularization weight in the pseudo-label refinement process. It controls the entropy of the pseudo-labels: smaller values encourage confident (i.e., lower-entropy) predictions, while larger values encourage more conservative updates by penalizing overconfidence. Thus, tuning $α_{1}$ balances the trade-off between exploration and exploitation of pseudo-labels.

To investigate the impact of these hyperparameters, we conduct a sensitivity analysis and summarize the results in Table 1. The test accuracy is evaluated under different combinations of $T_{1}$ and $α_{1}$ in the non-IID setting. As shown in Table 1, the highest accuracy of 97.5% occurs when $T_{1} = 10$ and $α_{1} = 0.5$, outperforming other configurations. Based on these results, we adopt this parameter configuration for all subsequent experiments. Figure 2 further illustrates the effects of these parameters on training dynamics.

Impact of non-IID datasets. Figure 3 compares the performance of federated semi-supervised learning under IID and non-IID data distributions. The test accuracy curves in Figure 3a demonstrate that the non-IID data distribution significantly impacts the model’s convergence and final performance. Under the IID setting with SSL, the model achieves the highest accuracy of approximately 83%, while the same approach under non-IID conditions reaches only 78%, indicating a clear performance degradation due to data heterogeneity.

A key observation from the loss curves in Figure 3b is that the non-IID data distribution notably slows down the convergence rate. The IID+SSL configuration shows the fastest convergence and reaches the lowest loss value of approximately 1.15, while its non-IID counterpart maintains a consistently higher loss of around 1.4. This pattern aligns with theoretical expectations that higher degrees of data heterogeneity lead to slower convergence and higher final loss values.

Furthermore, the impact of non-IID data is evident in both SSL and supervised-only approaches, though to varying degrees. The performance gap between IID and non-IID settings is more pronounced in the SSL approach (about 5 percentage points) than in the supervised-only setting (about 1 percentage point). This observation suggests that while SSL techniques can effectively leverage unlabeled data, their benefits are somewhat diminished under non-IID conditions, highlighting the fundamental challenges that data heterogeneity poses in federated learning scenarios.

Impact of label ratios. To investigate how the proportion of labeled data affects model performance, we conducted experiments with varying label ratios. Figure 4 presents the test accuracy and loss curves for different label ratios ranging from 0.1 to 0.9.

Figure 4a demonstrates a clear correlation between label ratio and model performance. With a high label ratio of 0.9, the model achieves the best performance, reaching approximately 85% accuracy after 100 epochs. The convergence is also notably faster, showing steep improvement in the early stages (20–40 epochs). In contrast, with only 10% labeled data (ratio 0.1), the model struggles significantly, achieving only about 62% accuracy and showing much slower convergence throughout the training process. The loss curves in Figure 4b further reinforce these findings. Models trained with higher label ratios (0.7 and 0.9) demonstrate more efficient optimization, reaching lower loss values (around 0.9–1.0) and showing steeper descent curves.

Impact of Local Training Steps. To validate Theorem 1 and investigate the effect of local training steps on Fed-SSL’s performance under non-IID settings, we conducted experiments with varying numbers of local steps (5, 10, 20, and 30). Figure 5 presents the experimental results.

The experimental results strongly align with our theoretical analysis in Theorem 1 regarding the convergence of Fed-SSL. As shown in Figure 5a, increasing the number of local steps significantly improves both convergence speed and final model performance. With 30 local steps, the model achieves the highest accuracy of approximately 85% and demonstrates the fastest convergence rate, particularly during the early stages (20–40 communication rounds). In contrast, with only five local steps, the model shows slower convergence and reaches a lower final accuracy of about 73%.

The loss curves in Figure 5b further validate our theoretical findings. Models trained with more local steps (20 and 30) exhibit more efficient optimization trajectories, achieving lower final loss values (around 0.9–1.0) and showing steeper descent curves. This observation is consistent with Theorem 1, which suggests that increasing local training steps can enhance convergence stability and final performance, provided the learning rate is properly chosen.
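To show where the number of local steps $Q$ enters, the following sketch implements one communication round in the form used in the appendix: each selected client runs $Q$ local steps $θ_{k, q} = θ_{k, q - 1} - η_{g} g_{k} (θ_{k, q - 1})$, and the server applies $θ^{t + 1} = θ^{t} + \frac{η_{θ}}{M} \sum_{k \in S^{t}} (θ_{k, Q} - θ^{t})$. Everything else is a toy assumption made for illustration: scalar parameters, noise-free gradients, hypothetical function names, and no pseudo-label update.

```python
def local_sgd(theta, grad_fn, num_steps, eta_g):
    """Run num_steps local gradient steps starting from the global parameter."""
    for _ in range(num_steps):
        theta = theta - eta_g * grad_fn(theta)
    return theta


def fedavg_round(theta, grad_fns, selected, num_steps, eta_g, eta_theta):
    """One round: average the local displacements of the selected clients.

    Equivalent to theta - (eta_theta * eta_g / M) * (sum of all local gradients).
    """
    deltas = [local_sgd(theta, grad_fns[k], num_steps, eta_g) - theta
              for k in selected]
    return theta + eta_theta * sum(deltas) / len(selected)


def run(num_rounds, num_steps, centers, eta_g=0.1, eta_theta=1.0):
    """Toy driver with full participation on quadratics F_k(x) = (x - c_k)^2 / 2."""
    grad_fns = [lambda x, c=c: x - c for c in centers]
    theta = 0.0
    for _ in range(num_rounds):
        theta = fedavg_round(theta, grad_fns, range(len(centers)),
                             num_steps, eta_g, eta_theta)
    return theta
```

On this toy problem the global minimizer is the mean of the centers, and with the same number of rounds a larger `num_steps` ends closer to it. This mirrors the trend in Figure 5 only qualitatively; it is not a reproduction of the experiment.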

Impact of Client Number. As shown in Figure 6, we compare the test accuracy and loss curves among three different participation rates: full worker participation ($K = 100$) and partial worker participation ($M = 30$ and $M = 10$) under the same hyperparameter settings. The results demonstrate that full worker participation consistently outperforms partial participation scenarios, achieving approximately 83% accuracy compared to 82.5% and 82% for 10% and 30% participation rates, respectively. This performance gap can be attributed to the additional randomness introduced by partial worker participation, resulting in slower convergence rates as evidenced by the more gradual slopes of both accuracy and loss curves.

This phenomenon is particularly pronounced in our non-IID setting, where system heterogeneity plays a crucial role. Full worker participation effectively neutralizes this heterogeneity in each communication round by incorporating updates from all clients. In contrast, partial participation struggles to maintain this balance, as the selected subset of workers may not adequately represent the overall data distribution. For instance, with only 10 or 30 active workers, the available training data might cover only a subset of the MNIST dataset’s 10 digit classes, leading to biased model updates. This effect is clearly visible in both the accuracy and loss trajectories, where partial participation schemes exhibit consistently inferior performance throughout the training process.

Impact of Sampling Strategies. As shown in Figure 7, the lattice-based sampling consistently outperforms random sampling in both test accuracy and test loss. The reason may be that the lattice-based sampling method ensures that all clients contribute equally across communication rounds, effectively mitigating the bias introduced by data heterogeneity. In contrast, random sampling may result in some clients being over- or under-represented, leading to less representative model updates (see Figure 1).
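The balance argument can be illustrated by counting how often each client is selected. The sketch below compares uniform random sampling with a simple cyclic schedule. The cyclic rule is only a hypothetical stand-in that shares the balanced-participation property shown in Figure 1; it is not the lattice construction used in this paper, and all function names are illustrative.

```python
import random
from collections import Counter


def random_sampling(num_clients, num_selected, num_rounds, seed=0):
    """Each round draws num_selected distinct clients uniformly at random."""
    rng = random.Random(seed)
    return [rng.sample(range(num_clients), num_selected) for _ in range(num_rounds)]


def cyclic_sampling(num_clients, num_selected, num_rounds, offset=0):
    """Hypothetical structured schedule: round t takes the next block of clients,
    wrapping around modulo num_clients."""
    return [[(offset + t * num_selected + j) % num_clients
             for j in range(num_selected)]
            for t in range(num_rounds)]


def selection_counts(schedule, num_clients):
    """Number of rounds in which each client was selected."""
    counts = Counter(k for selected in schedule for k in selected)
    return [counts.get(k, 0) for k in range(num_clients)]
```

With 100 clients, 10 selected per round, and 100 rounds, the cyclic schedule selects every client exactly 10 times, whereas the random schedule only matches that count on average.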

6. Conclusions

In this paper, we propose an efficient FedAvg-SSL algorithm that alternately updates model parameters and pseudo-labels in non-IID federated settings for general non-convex optimization. To address the challenge of partial client participation, we introduce and compare two sampling strategies—uniform random sampling and lattice-based sampling—at the server side. The lattice-based method is shown to enhance participation balance and reduce selection variance across communication rounds. We provide a rigorous theoretical analysis demonstrating that FedAvg-SSL achieves a sublinear convergence rate with linear speedup under partial participation. Extensive experiments on the MNIST dataset support our theoretical claims and offer practical insights into how sampling strategy, participation rate, and local update steps jointly influence learning performance. These results provide valuable guidance for the design of efficient and robust federated semi-supervised learning systems. In future work, we plan to explore fairness-aware QMC-based client selection strategies, building on insights from [43], to ensure more equitable participation and performance across clients.

Author Contributions

Conceptualization, M.Z. and F.Y.; methodology, M.Z. and F.Y.; writing—original draft preparation, M.Z.; writing—review and editing, M.Z. and F.Y.; supervision, F.Y.; funding acquisition, M.Z. and F.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 12301321, 12101435), the Fundamental Research Funds for the Central Universities at Southwest Minzu University (No. ZYN2024203), and the Research Foundation of Sichuan Normal University (No. pt12101435, ky20200919).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the first author.

Acknowledgments

The authors would like to thank Yongdao Zhou for his valuable comments.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proof of Theorem 1

For ease of presentation, we first show some important lemmas about the descent of the objective.

Appendix A.1. Some Auxiliary Lemmas

The following lemma bounds the change in function value across consecutive outer loops for FedAvg-SSL with partial client participation.

Lemma A1.

Under Assumptions 1 and 2, it holds that

\begin{matrix} E [F (θ^{t + 1}, {\hat{y}}^{t + 1})] - E [F (θ^{t}, {\hat{y}}^{t})] \leq & E [〈 \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}), θ^{t + 1} - θ^{t} 〉] + \frac{L}{2} E [∥ θ^{t + 1} - θ^{t} ∥^{2}] \\ - \frac{μ α_{1}}{2 K} \sum_{k = 1}^{K} E [∥ {\hat{y}}_{k}^{t} - {\hat{y}}_{k}^{t + 1} ∥^{2}] . \end{matrix}

(A1)

Proof.

Firstly, we can establish the following equality:

\begin{matrix} F_{k} (θ^{t + 1}, {\hat{y}}_{k}^{t + 1}) - F_{k} (θ^{t}, {\hat{y}}_{k}^{t}) = F_{k} (θ^{t + 1}, {\hat{y}}_{k}^{t + 1}) - F_{k} (θ^{t}, {\hat{y}}_{k}^{t + 1}) + F_{k} (θ^{t}, {\hat{y}}_{k}^{t + 1}) - F_{k} (θ^{t}, {\hat{y}}_{k}^{t}) . \end{matrix}

(A2)

Since each client objective $F_{k}$ has Lipschitz continuous gradients with respect to $(θ, {\hat{y}}_{k})$ by Assumption 2, we have

\begin{matrix} F_{k} (θ^{t + 1}, {\hat{y}}_{k}^{t + 1}) - F_{k} (θ^{t}, {\hat{y}}_{k}^{t + 1}) \leq 〈 \nabla_{θ} F_{k} (θ^{t}, {\hat{y}}_{k}^{t + 1}), θ^{t + 1} - θ^{t} 〉 + \frac{L}{2} {∥ θ^{t + 1} - θ^{t} ∥}^{2} . \end{matrix}

(A3)

According to (1a), we have

\begin{matrix} F_{k} (θ^{t}, {\hat{y}}_{k}^{t + 1}) - F_{k} (θ^{t}, {\hat{y}}_{k}^{t}) = α_{0} [L_{CE} (θ^{t}; {\hat{y}}_{k}^{t + 1}) - L_{CE} (θ^{t}; {\hat{y}}_{k}^{t})] + α_{1} [r_{1} ({\hat{y}}_{k}^{t + 1}) - r_{1} ({\hat{y}}_{k}^{t})] . \end{matrix}

(A4)

Since $L_{CE} (θ; u_{k}, {\hat{y}}_{k})$ is a linear function with respect to ${\hat{y}}_{k}$, and $r_{1} ({\hat{y}}_{k})$ is a strongly convex function of ${\hat{y}}_{k}$, using Assumption 1, we have

\begin{matrix} α_{0} L_{CE} (θ^{t}; u_{k}, {\hat{y}}_{k}^{t}) + α_{1} r_{1} ({\hat{y}}_{k}^{t}) - α_{0} L_{CE} (θ^{t}; u_{k}, {\hat{y}}_{k}^{t + 1}) - α_{1} r_{1} ({\hat{y}}_{k}^{t + 1}) \\ \geq 〈 α_{0} \nabla_{y_{k}} L_{CE} (θ^{t}; u_{k}, {\hat{y}}_{k}^{t + 1}) + α_{1} \nabla_{y_{k}} r_{1} ({\hat{y}}_{k}^{t + 1}), {\hat{y}}_{k}^{t} - {\hat{y}}_{k}^{t + 1} 〉 + \frac{μ α_{1}}{2} {∥ {\hat{y}}_{k}^{t} - {\hat{y}}_{k}^{t + 1} ∥}^{2} . \end{matrix}

(A5)

By (1), ${\hat{y}}_{k}^{t + 1}$ is the optimal solution, so the first-order optimality condition implies

\begin{matrix} 〈 α_{0} \nabla_{y_{k}} L_{CE} (θ^{t}; u_{k}, {\hat{y}}_{k}^{t + 1}) + α_{1} \nabla_{y_{k}} r_{1} ({\hat{y}}_{k}^{t + 1}), {\hat{y}}_{k}^{t} - {\hat{y}}_{k}^{t + 1} 〉 \geq 0 . \end{matrix}

(A6)

Substituting (A6) into (A5) and (A4) gives rise to

\begin{matrix} F_{k} (θ^{t}, {\hat{y}}_{k}^{t + 1}) - F_{k} (θ^{t}, {\hat{y}}_{k}^{t}) \leq - \frac{μ α_{1}}{2} {∥ {\hat{y}}_{k}^{t} - {\hat{y}}_{k}^{t + 1} ∥}^{2} . \end{matrix}

(A7)

Finally, we combine (A2), (A3), and (A7) to obtain the final result of this lemma by using the relationship $F (θ, \hat{y}) = \frac{1}{K} \sum_{k = 1}^{K} F_{k} (θ, {\hat{y}}_{k})$ and taking the expected value. □

Lemma A2.

Under Assumptions 1 and 2, it holds that

\begin{matrix} E [〈 \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}), θ^{t + 1} - θ^{t} 〉] \leq & - \frac{η Q}{2} E [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) ∥^{2}] + \frac{η L^{2}}{2 K} \sum_{k, q} E [∥ θ^{t} - θ_{k, q}^{t} ∥^{2}] \\ - \frac{η}{2 K^{2} Q} E [∥ \sum_{k, q} \nabla_{θ} F_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}], \end{matrix}

(A8)

where $η = η_{θ} η_{g}$.

Proof.

To bound the left-hand side of (A8), we introduce the following two $σ$-algebras:

\begin{matrix} F^{t} & := σ (\{ξ_{k, q}^{r}, ζ_{k, q}^{r}, S^{r} : k \in [K], 0 \leq q \leq Q - 1, 0 \leq r \leq t - 1\}), \\ F_{k, q}^{t} & := σ (F^{t} \cup S^{t} \cup \{ξ_{k, l}^{t}, ζ_{k, l}^{t} : l \leq q\} \cup \{ξ_{j, l}^{t}, ζ_{j, l}^{t} : j \leq k - 1, l \leq Q - 1\}) . \end{matrix}

Thus, it holds that

\begin{matrix} E [〈 \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}), θ^{t + 1} - θ^{t} 〉] & = E [〈 \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}), - \frac{η_{θ} η_{g}}{M} \sum_{k, q} I_{k \in S^{t}} g_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) 〉] \\ & = - \frac{η_{θ} η_{g}}{M} \sum_{k, q} E [E [〈 \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}), I_{k \in S^{t}} g_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) 〉 | F_{k, q - 1}^{t}]], \end{matrix}

where the last equality is due to the tower property of conditional expectation. Since $I_{k \in S^{t}}$ and $g_{k} (θ_{k, q - 1}^{t}, {\hat{y}}_{k}^{t + 1})$ are $F_{k, q - 1}^{t}$-measurable, we have that

\begin{matrix} E [〈 \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}), θ^{t + 1} - θ^{t} 〉] & = E [〈 \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}), - \frac{η_{θ} η_{g}}{M} \sum_{k, q} I_{k \in S^{t}} g_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) 〉] \\ = - \frac{η_{θ} η_{g}}{M} \sum_{k, q} E [〈 \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}), I_{k \in S^{t}} E [g_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1})] 〉] \\ = - \frac{η_{θ} η_{g}}{K} \sum_{k, q} E [〈 \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}), \nabla_{θ} F_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) 〉], \end{matrix}

(A9)

where the last equality is due to (11) and the fact that we sample the clients $S^{t}$ uniformly at random, which leads to $E [I_{k \in S^{t}}] = \frac{M}{K}$. Based on the identity $- 〈 a, b 〉 = \frac{1}{2} [{∥ a - b ∥}^{2} - {∥ a ∥}^{2} - {∥ b ∥}^{2}]$, we know

\begin{matrix} E [〈 \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}), θ^{t + 1} - θ^{t} 〉] \\ = - η Q E [〈 \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}), \sum_{k, q} \frac{1}{K Q} \nabla_{θ} F_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) 〉] \\ = - \frac{η Q}{2} [E [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) ∥^{2}] + E [∥ \sum_{k, q} \frac{1}{K Q} \nabla_{θ} F_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}]] \\ + \frac{η Q}{2} E [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) - \sum_{k, q} \frac{1}{K Q} \nabla_{θ} F_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}] . \end{matrix}

(A10)

Note that

\begin{matrix} \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) = \sum_{k, q} \frac{1}{K Q} \nabla_{θ} F_{k} (θ^{t}, {\hat{y}}_{k}^{t + 1}), \end{matrix}

we get

\begin{matrix} E [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) - \sum_{k, q} \frac{1}{K Q} \nabla_{θ} F_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}] \leq \frac{L^{2}}{K Q} \sum_{k, q} E [∥ θ^{t} - θ_{k, q}^{t} ∥^{2}], \end{matrix}

(A11)

where the inequality is due to the convexity of ${∥ \cdot ∥}^{2}$ and the L-smooth assumption (10). Substituting (A11) into (A10), we obtain the final result. □

Lemma A3.

Let $η = η_{θ} η_{g}$. Under Assumptions 2–4, it holds that

\begin{matrix} E [∥ θ^{t + 1} - θ^{t} ∥^{2}] \\ \leq \frac{η^{2} (K - M)}{M K (K - 1)} (3 K Q^{2} E [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) ∥^{2}] + 3 K Q^{2} σ_{G}^{2} + 3 Q L^{2} \sum_{k, q} E [∥ θ^{t} - θ_{k, q}^{t} ∥^{2}]) \\ + \frac{η^{2} (M - 1)}{M K (K - 1)} E [∥ \sum_{k, q} \nabla_{θ} F_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}] + \frac{η^{2} Q σ_{θ}^{2}}{M} . \end{matrix}

(A12)

Proof.

By using Assumption 3 and equation (14), we have

\begin{matrix} E [∥ θ^{t + 1} - θ^{t} ∥^{2}] & = E [∥ \frac{η}{M} \sum_{k, q} I_{k \in S^{t}} g_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}] \\ \leq \frac{η^{2}}{M^{2}} E [∥ \sum_{k, q} I_{k \in S^{t}} \nabla_{θ} F_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}] + \frac{η^{2} Q σ_{θ}^{2}}{M} . \end{matrix}

(A13)

Let $t_{k} = \sum_{q = 0}^{Q - 1} \nabla_{θ} F_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1})$. Using this definition and expanding the squared norm, we have

\begin{matrix} E [∥ \sum_{k, q} I_{k \in S^{t}} \nabla_{θ} F_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}] = E [∥ \sum_{k = 1}^{K} I_{k \in S^{t}} t_{k} ∥^{2}] = \sum_{i, j} E [I_{i \in S^{t}} I_{j \in S^{t}} 〈 t_{i}, t_{j} 〉] . \end{matrix}

(A14)

For the case where $i = j$, noting that $I_{i \in S^{t}}$ is an indicator function and $P (i \in S^{t}) = \frac{M}{K}$, we have

\begin{matrix} E [I_{i \in S^{t}} I_{j \in S^{t}} 〈 t_{i}, t_{j} 〉] = E [I_{i \in S^{t}}] ∥ t_{i} ∥^{2} = \frac{M}{K} E [∥ t_{i} ∥^{2}] . \end{matrix}

For the case where $i \neq j$, considering the probability of selecting two distinct clients from $K$ total clients, we obtain

\begin{matrix} E [I_{i \in S^{t}} I_{j \in S^{t}} 〈 t_{i}, t_{j} 〉] = \frac{M (M - 1)}{K (K - 1)} E [〈 t_{i}, t_{j} 〉] . \end{matrix}

Then, we derive that

\begin{matrix} E [∥ \sum_{k = 1}^{K} I_{k \in S^{t}} t_{k} ∥^{2}] & = \frac{M}{K} \sum_{i = 1}^{K} {∥ t_{i} ∥}^{2} + \frac{M (M - 1)}{K (K - 1)} \sum_{i \neq j} 〈 t_{i}, t_{j} 〉 \\ = \frac{M^{2}}{K} \sum_{i = 1}^{K} ∥ t_{i} ∥^{2} - \frac{M (M - 1)}{2 K (K - 1)} \sum_{i \neq j} {∥ t_{i} - t_{j} ∥}^{2}, \end{matrix}

(A15)

where the last equality is due to the fact that $〈 a, b 〉 = \frac{1}{2} [{∥ a ∥}^{2} + {∥ b ∥}^{2} - {∥ a - b ∥}^{2}]$. Similarly,

\begin{matrix} E [∥ \sum_{k = 1}^{K} t_{k} ∥^{2}] = K \sum_{k = 1}^{K} E [∥ t_{k} ∥^{2}] - \frac{1}{2} \sum_{i \neq j} E [∥ t_{i} - t_{j} ∥^{2}] . \end{matrix}

(A16)

Substituting (A16) into (A15) implies that

\begin{matrix} E [∥ \sum_{k = 1}^{K} I_{k \in S^{t}} t_{k} ∥^{2}] = \frac{K M - M^{2}}{K (K - 1)} \sum_{i = 1}^{K} ∥ t_{i} ∥^{2} + \frac{M (M - 1)}{K (K - 1)} {∥ \sum_{i = 1}^{K} t_{i} ∥}^{2} . \end{matrix}

(A17)
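Identity (A17) can be checked numerically on a small instance. The sketch below treats the vectors $t_{k}$ as fixed and averages over every size-$M$ subset of $K$ clients, which corresponds to uniform sampling without replacement (the setting in which $P (i \in S^{t}) = \frac{M}{K}$ and two distinct clients are jointly selected with probability $\frac{M (M - 1)}{K (K - 1)}$). The helper names are hypothetical.

```python
from itertools import combinations


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def vec_sum(vectors):
    """Coordinate-wise sum of a non-empty list of vectors."""
    return [sum(column) for column in zip(*vectors)]


def expected_sq_norm(vectors, M):
    """Left-hand side of (A17): exact average over all size-M subsets."""
    subsets = list(combinations(range(len(vectors)), M))
    total = 0.0
    for subset in subsets:
        v = vec_sum([vectors[k] for k in subset])
        total += dot(v, v)
    return total / len(subsets)


def closed_form(vectors, M):
    """Right-hand side of (A17)."""
    K = len(vectors)
    sum_sq = sum(dot(t, t) for t in vectors)
    full = vec_sum(vectors)
    return ((K * M - M * M) / (K * (K - 1))) * sum_sq \
        + (M * (M - 1) / (K * (K - 1))) * dot(full, full)
```

For $M = K$ the first coefficient vanishes and both sides reduce to the squared norm of the full sum; for $M = 1$ both sides equal the average squared norm of a single $t_{k}$.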

By using Assumptions 2 and 4, we have

\begin{matrix} \sum_{k = 1}^{K} E [∥ t_{k} ∥^{2}] = & \sum_{k = 1}^{K} E [∥ \sum_{q = 0}^{Q - 1} [\nabla_{θ} F_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) - \nabla_{θ} F_{k} (θ^{t}, {\hat{y}}_{k}^{t + 1}) + \nabla_{θ} F_{k} (θ^{t}, {\hat{y}}_{k}^{t + 1}) \\ - \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) + \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) {] ∥}^{2}] \\ \leq & 3 Q L^{2} \sum_{k, q} E [∥ θ^{t} - θ_{k, q}^{t} ∥^{2}] + 3 K Q^{2} σ_{G}^{2} + 3 K Q^{2} E [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) ∥^{2}] . \end{matrix}

(A18)

Substituting (A17) and (A18) into (A13), we obtain the final result. □

Lemma A4.

Under Assumptions 2–4, if $2 Q η_{g}^{2} L^{2} \leq \frac{1}{Q - 1}$, we have

\begin{matrix} \sum_{k, q} E [∥ θ^{t} - θ_{k, q}^{t} ∥^{2}] \\ \leq 4 (Q - 1) (4 K Q^{2} η_{g}^{2} E [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) ∥^{2}] + 4 K Q^{2} η_{g}^{2} σ_{G}^{2} + K Q η_{g}^{2} σ_{θ}^{2}) . \end{matrix}

(A19)

Proof.

For any worker $k \in [K]$, we have

\begin{matrix} E [∥ θ^{t} - θ_{k, q}^{t} ∥^{2}] = E [∥ θ^{t} - θ_{k, q - 1}^{t} + η_{g} g_{k} (θ_{k, q - 1}^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}] \\ \leq E [∥ θ^{t} - θ_{k, q - 1}^{t} + η_{g} \nabla_{θ} F_{k} (θ_{k, q - 1}^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}] + η_{g}^{2} σ_{θ}^{2} \\ \leq (1 + \frac{1}{Q - 1}) E [∥ θ^{t} - θ_{k, q - 1}^{t} ∥^{2}] + Q η_{g}^{2} E [∥ \nabla_{θ} F_{k} (θ_{k, q - 1}^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}] + η_{g}^{2} σ_{θ}^{2}, \end{matrix}

(A20)

where the first inequality is due to Assumption 3 and the second inequality follows from the inequality $2 〈 a, b 〉 \leq κ {∥ a ∥}^{2} + \frac{1}{κ} {∥ b ∥}^{2}$ with $κ = \frac{1}{Q - 1}$, which gives the coefficients $1 + \frac{1}{Q - 1}$ and $1 + (Q - 1) = Q$. Furthermore,

\begin{matrix} E [∥ \nabla_{θ} F_{k} (θ_{k, q - 1}^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}] \\ \leq 2 E [∥ \nabla_{θ} F_{k} (θ_{k, q - 1}^{t}, {\hat{y}}_{k}^{t + 1}) - \nabla_{θ} F_{k} (θ^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}] + 2 E [∥ \nabla_{θ} F_{k} (θ^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}] \\ \leq 2 L^{2} ∥ θ^{t} - θ_{k, q - 1}^{t} ∥^{2} + 4 E [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) ∥^{2}] + 4 E [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) - \nabla_{θ} F_{k} (θ^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}], \end{matrix}

(A21)

where the last inequality is due to Assumption 2. Substituting (A21) into (A20), with Assumption 4, we have

\begin{matrix} \sum_{k, q} E [∥ θ^{t} - θ_{k, q}^{t} ∥^{2}] \leq (1 + \frac{1}{Q - 1} + 2 Q η_{g}^{2} L^{2}) [ & \sum_{k, q} E [∥ θ^{t} - θ_{k, q - 1}^{t} ∥^{2}] + 4 K Q^{2} η_{g}^{2} [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) ∥^{2}] \\ + 4 K Q^{2} η_{g}^{2} σ_{G}^{2} + K Q η_{g}^{2} σ_{θ}^{2}] . \end{matrix}

(A22)

Using the condition $2 Q η_{g}^{2} L^{2} \leq \frac{1}{Q - 1}$, we then have

\begin{matrix} \sum_{k, q} E [∥ θ^{t} - θ_{k, q}^{t} ∥^{2}] \leq (1 + \frac{2}{Q - 1}) [ & \sum_{k, q} E [∥ θ^{t} - θ_{k, q - 1}^{t} ∥^{2}] + 4 K Q^{2} η_{g}^{2} [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) ∥^{2}] \\ + 4 K Q^{2} η_{g}^{2} σ_{G}^{2} + K Q η_{g}^{2} σ_{θ}^{2}] . \end{matrix}

(A23)

Unrolling the recursion, we get

\begin{matrix} \sum_{k, q} E [∥ θ^{t} - θ_{k, q}^{t} ∥^{2}] \leq \sum_{l = 0}^{q} {(1 + \frac{2}{Q - 1})}^{l} (4 K Q^{2} η_{g}^{2} [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) ∥^{2}] + 4 K Q^{2} η_{g}^{2} σ_{G}^{2} + K Q η_{g}^{2} σ_{θ}^{2}) . \end{matrix}

(A24)

Since $\sum_{l = 0}^{q} {(1 + \frac{2}{Q - 1})}^{l} \leq \frac{Q - 1}{2} {(1 + \frac{2}{Q - 1})}^{Q - 1} \leq 4 (Q - 1)$, where the last step uses $1 + x \leq \exp (x)$, we have the final result. □
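For completeness, the last inequality in this chain can be spelled out by applying $1 + x \leq \exp (x)$ with $x = \frac{2}{Q - 1}$, assuming $Q \geq 2$ so that the expressions are defined:

```latex
\left(1 + \frac{2}{Q - 1}\right)^{Q - 1}
  \leq \exp\!\left(\frac{2}{Q - 1} \cdot (Q - 1)\right) = e^{2} < 8
\quad\Longrightarrow\quad
\frac{Q - 1}{2} \left(1 + \frac{2}{Q - 1}\right)^{Q - 1}
  \leq \frac{Q - 1}{2}\, e^{2} < 4 (Q - 1).
```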

Appendix A.2. The Proof of Theorem 1

Using Lemmas A1–A3, we have

\begin{matrix} E [F (θ^{t + 1}, {\hat{y}}^{t + 1})] - E [F (θ^{t}, {\hat{y}}^{t})] \\ \leq - \frac{η Q}{2} E [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) ∥^{2}] + \frac{η L^{2}}{2 K} \sum_{k, q} E [∥ θ^{t} - θ_{k, q}^{t} ∥^{2}] - \frac{η}{2 K^{2} Q} E [∥ \sum_{k, q} \nabla_{θ} F_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}] \\ + \frac{L η^{2} (K - M)}{2 M K (K - 1)} (3 K Q^{2} E [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) ∥^{2}] + 3 K Q^{2} σ_{G}^{2} + 3 Q L^{2} \sum_{k, q} E [∥ θ^{t} - θ_{k, q}^{t} ∥^{2}]) \\ + \frac{η^{2} (M - 1) L}{2 M K (K - 1)} E [∥ \sum_{k, q} \nabla_{θ} F_{k} (θ_{k, q}^{t}, {\hat{y}}_{k}^{t + 1}) ∥^{2}] + \frac{L η^{2} Q σ_{θ}^{2}}{2 M} - \frac{μ α_{1}}{2 K} \sum_{k = 1}^{K} E [∥ {\hat{y}}_{k}^{t} - {\hat{y}}_{k}^{t + 1} ∥^{2}] . \end{matrix}

When $η L K \leq \frac{M (K - 1)}{K (M - 1)}$, we know $\frac{η}{2 K^{2} Q} - \frac{η^{2} L (M - 1)}{2 M K (K - 1)} \geq 0$. It leads to

\begin{matrix} E [F (θ^{t + 1}, {\hat{y}}^{t + 1})] - E [F (θ^{t}, {\hat{y}}^{t})] \leq & - (\frac{η Q}{2} - \frac{L η^{2} (K - M) 3 Q^{2}}{2 M (K - 1)}) E [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) ∥^{2}] \\ + (\frac{η L^{2}}{2 K} + \frac{3 Q L^{3} η^{2} (K - M)}{2 M K (K - 1)}) \sum_{k, q} E [∥ θ^{t} - θ_{k, q}^{t} ∥^{2}] \\ + \frac{3 L Q^{2} σ_{G}^{2} η^{2} (K - M)}{2 M (K - 1)} + \frac{L η^{2} Q σ_{θ}^{2}}{2 M} - \frac{μ α_{1}}{2 K} \sum_{k = 1}^{K} E [∥ {\hat{y}}_{k}^{t} - {\hat{y}}_{k}^{t + 1} ∥^{2}] . \end{matrix}

Substituting Lemma A4 into the above inequality, we have

\begin{matrix} E [F (θ^{t + 1}, {\hat{y}}^{t + 1})] - E [F (θ^{t}, {\hat{y}}^{t})] \\ \leq - η Q (\frac{1}{2} - 8 L^{2} (Q - 1) Q η_{g}^{2} - \frac{η L (K - M) Q}{2 M (K - 1)} (3 Q + 48 L^{2} (Q - 1) Q^{2} η_{g}^{2})) E [∥ \nabla_{θ} F (θ^{t}, {\hat{y}}^{t + 1}) ∥^{2}] \\ + (η L^{2} (Q^{2} (Q - 1)) η_{g}^{2} + \frac{24 Q^{4} L^{3} η^{2} η_{g}^{2} (K - M)}{M (K - 1)} + \frac{3 L Q^{2} η^{2} (K - M)}{2 M (K - 1)}) σ_{G}^{2} \\ + (2 η L^{2} (Q - 1) Q η_{g}^{2} + \frac{6 Q^{2} (Q - 1) L^{3} η^{2} η_{g}^{2} (K - M)}{M (K - 1)} + \frac{L η^{2} Q}{2 M}) σ_{θ}^{2} \\ - \frac{μ α_{1}}{2 K} \sum_{k = 1}^{K} E [∥ {\hat{y}}_{k}^{t} - {\hat{y}}_{k}^{t + 1} ∥^{2}] . \end{matrix}

When $η_{g}^{2} \leq \frac{1}{64 Q (Q - 1) L^{2}}$ and $η \leq \frac{M (K - 1)}{(3 + 12 Q) Q L (K - M)}$, we have

\begin{matrix} 8 L^{2} (Q - 1) Q η_{g}^{2} + \frac{η L (K - M) Q}{2 M (K - 1)} (3 Q + 48 L^{2} (Q - 1) Q^{2} η_{g}^{2}) \leq \frac{1}{4} . \end{matrix}

Let $\frac{η Q}{4} \leq \frac{μ α_{1}}{2 K}$; we know

\begin{matrix} \frac{η Q}{4} g_{t} \leq & E [F (θ^{t}, {\hat{y}}^{t})] - E [F (θ^{t + 1}, {\hat{y}}^{t + 1})] + η η_{g}^{2} (L^{2} (Q^{2} (Q - 1)) σ_{G}^{2} + 2 L^{2} (Q - 1) Q σ_{θ}^{2}) \\ + η^{2} η_{g}^{2} (\frac{24 Q^{4} L^{3} (K - M)}{M (K - 1)} σ_{G}^{2} + \frac{6 Q^{2} (Q - 1) L^{3} (K - M)}{M (K - 1)} σ_{θ}^{2}) \\ + η^{2} (\frac{3 L Q^{2} (K - M)}{2 M (K - 1)} σ_{G}^{2} + \frac{L Q}{2 M} σ_{θ}^{2}) . \end{matrix}

(A25)

Summing the above inequality over $t = 0, …, T - 1$ and dividing by $\frac{η Q T}{4}$, we obtain

\begin{matrix} \frac{1}{T} \sum_{t = 0}^{T - 1} g_{t} & \leq \frac{4 (E [F (θ^{0}, {\hat{y}}^{0})] - E [F (θ^{T}, {\hat{y}}^{T})])}{η Q T} + η_{g}^{2} B_{1} + 4 η η_{g}^{2} B_{2} + 4 η B_{3} . \end{matrix}

Since 0 is the lower bound for cross-entropy loss $F (θ, \hat{y})$, we obtain the final result.

References

1. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
2. Yang, Q.; Liu, Y.; Chen, T.; Tong, Y. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. 2019, 10, 1–19.
3. Zeng, Y.; Wang, Z.; Bai, J.; Shen, X. An accelerated stochastic ADMM for nonconvex and nonsmooth finite-sum optimization. Automatica 2024, 163, 111554.
4. Wang, Z.; Zhang, J.; Chang, T.H.; Li, J.; Luo, Z.Q. Distributed stochastic consensus optimization with momentum for nonconvex nonsmooth problems. IEEE Trans. Signal Process. 2021, 69, 4486–4501.
5. Stich, S.U. Local SGD converges fast and communicates little. arXiv 2018, arXiv:1805.09767.
6. Yin, Q.; Feng, Z.; Li, X.; Chen, S.; Wu, H.; Han, G. Tackling data-heterogeneity variations in federated learning via adaptive aggregate weights. Knowl.-Based Syst. 2024, 304, 112484.
7. Yang, H.; Fang, M.; Liu, J. Achieving linear speedup with partial worker participation in non-iid federated learning. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021.
8. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. In Proceedings of the Conference on Machine Learning and Systems, Austin, TX, USA, 2–4 March 2020.
9. Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.; Suresh, A.T. SCAFFOLD: Stochastic controlled averaging for federated learning. In Proceedings of the International Conference on Machine Learning, Virtual Event, 2020; pp. 5132–5143.
10. Wang, J.; Liu, Q.; Liang, H.; Joshi, G.; Poor, H.V. Tackling the objective inconsistency problem in heterogeneous federated optimization. arXiv 2020, arXiv:2007.07481.
11. Wang, S.; Ji, M. A unified analysis of federated learning with arbitrary client participation. Adv. Neural Inf. Process. Syst. 2022, 35, 19124–19137.
12. Ribero, M.; Vikalo, H. Reducing communication in federated learning via efficient client sampling. Pattern Recognit. 2024, 148, 110122.
13. Cohen, I.; Cozman, F.G.; Sebe, N.; Cirelo, M.C.; Huang, T.S. Semisupervised learning of classifiers: Theory, algorithms, and their application to human-computer interaction. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 1553–1566.
14. Zhu, X.J. Semi-Supervised Learning Literature Survey; Technical report; Department of Computer Sciences, University of Wisconsin-Madison: Madison, WI, USA, 2005.
15. Chapelle, O.; Scholkopf, B.; Zien, A. Semi-supervised learning. IEEE Trans. Neural Netw. 2009, 20, 542.
16. Lee, D.H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Proceedings of the Workshop on Challenges in Representation Learning, ICML, Atlanta, GA, USA, 21 June 2013.
17. Rasmus, A.; Berglund, M.; Honkala, M.; Valpola, H.; Raiko, T. Semi-supervised learning with ladder networks. Adv. Neural Inf. Process. Syst. 2015, 28, 3546–3554.
18. Laine, S.; Aila, T. Temporal ensembling for semi-supervised learning. arXiv 2016, arXiv:1610.02242.
19. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30, 1195–1204.
20. Miyato, T.; Maeda, S.i.; Koyama, M.; Ishii, S. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1979–1993.
21. Li, Y.; Wu, B.; Feng, Y.; Fan, Y.; Jiang, Y.; Li, Z.; Xia, S.T. Semi-supervised robust training with generalized perturbed neighborhood. Pattern Recognit. 2022, 124, 108472.
22. Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; Raffel, C. Mixmatch: A holistic approach to semi-supervised learning. arXiv 2019, arXiv:1905.02249.
23. Du, G.; Zhang, J.; Zhang, N.; Wu, H.; Wu, P.; Li, S. Semi-supervised imbalanced multi-label classification with label propagation. Pattern Recognit. 2024, 150, 110358.
24. Wang, J.; Ruan, D.; Li, Y.; Wang, Z.; Wu, Y.; Tan, T.; Yang, G.; Jiang, M. Data augmentation strategies for semi-supervised medical image segmentation. Pattern Recognit. 2025, 159, 111116.
25. Diao, E.; Ding, J.; Tarokh, V. Semifl: Semi-supervised federated learning for unlabeled clients with alternate training. Adv. Neural Inf. Process. Syst. 2022, 35, 17871–17884.
26. Albaseer, A.; Ciftler, B.S.; Abdallah, M.; Al-Fuqaha, A. Exploiting Unlabeled Data in Smart Cities using Federated Learning. arXiv 2020, arXiv:2001.04030.
27. Jeong, W.; Yoon, J.; Yang, E.S.; Hwang, J. Federated Semi-Supervised Learning with Inter-Client Consistency & Disjoint Learning. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021.
28. Wang, S.; Xu, Y.; Yuan, Y.; Quek, T.Q. Toward Fast Personalized Semi-Supervised Federated Learning in Edge Networks: Algorithm Design and Theoretical Guarantee. IEEE Trans. Wirel. Commun. 2023, 23, 1170–1183.
29. Liang, X.; Lin, Y.; Fu, H.; Zhu, L.; Li, X. Rscfed: Random sampling consensus federated semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 10154–10163.
30. Zhu, S.; Ma, X.; Sun, G. Two-stage sampling with predicted distribution changes in federated semi-supervised learning. Knowl.-Based Syst. 2024, 295, 111822.
31. Cong, Y.; Zeng, Y.; Qiu, J.; Fang, Z.; Zhang, L.; Cheng, D.; Liu, J.; Tian, Z. FedGA: A greedy approach to enhance federated learning with Non-IID data. Knowl.-Based Syst. 2024, 301, 112201.
32. Wang, S.; Ji, M. A Lightweight Method for Tackling Unknown Participation Statistics in Federated Averaging. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
33. Jhunjhunwala, D.; Sharma, P.; Nagarkatti, A.; Joshi, G. Fedvarp: Tackling the variance due to partial client participation in federated learning. In Proceedings of the Uncertainty in Artificial Intelligence, PMLR, Eindhoven, The Netherlands, 1–5 August 2022; pp. 906–916.
34. Niederreiter, H. Random Number Generation and Quasi-Monte Carlo Methods; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1992.
35. Fang, K.T.; Wang, Y. Number-Theoretic Methods in Statistics; Chapman and Hall: London, UK, 1994.
36. Zhang, M.; Zhou, Y.; Zhou, Z.; Zhang, A. Model-free subsampling method based on uniform designs. IEEE Trans. Knowl. Data Eng. 2024, 36, 1210–1220.
37. Zhou, Z.; Yang, Z.; Zhang, A.; Zhou, Y. Efficient model-free subsampling method for massive data. Technometrics 2024, 66, 240–252.
38. Wang, Z.; Wang, X.; Sun, R.; Chang, T.H. Federated semi-supervised learning with class distribution mismatch. arXiv 2021, arXiv:2111.00010.
39. Fang, K.T.; Li, R.; Sudjianto, A. Design and Modeling for Computer Experiments; Chapman and Hall/CRC: Boca Raton, FL, USA, 2006.
40. Fang, K.; Liu, M.Q.; Qin, H.; Zhou, Y.D. Theory and Application of Uniform Experimental Designs; Springer: Singapore, 2018.
41. Zhang, A.; Li, H.; Quan, S.; Yang, Z. UniDOE: Uniform Design of Experiments. R Package Version 1.0.2. 2018. Available online: https://cran.r-project.org/src/contrib/Archive/UniDOE/ (accessed on 24 July 2025).
42. Shalev-Shwartz, S.; Singer, Y. Logarithmic Regret Algorithms for Strongly Convex Repeated Games; The Hebrew University: Jerusalem, Israel, 2007.
43. Chen, L.; Zhao, D.; Tao, L.; Wang, K.; Qiao, S.; Zeng, X.; Tan, C.W. A credible and fair federated learning framework based on blockchain. IEEE Trans. Artif. Intell. 2024, 6, 301–316.

Figure 1. A toy example of our client selection procedure, in which 10 clients are selected from 100 clients in each of 100 communication rounds. The red dashed line indicates the mean selection count (10) in the non-sampling scheme. The left subplot shows random sampling, where some clients are selected many times (blue bars) while others are rarely selected. The right subplot shows our lattice-based sampling method, which selects each client exactly 10 times (red bars) in a systematic and balanced pattern.
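
The toy setting of Figure 1 (100 clients, 10 selected per round, 100 rounds) can be imitated in a few lines. The following is a minimal sketch, not the authors' implementation: the function names, the seed, and the generator value 37 are hypothetical, and the deterministic rule used here (the k-th selection overall is client `(generator * k) mod n_clients`) is only a simple rank-1 lattice-style stand-in that may differ from the lattice construction of Section 3.2. It yields exactly balanced counts whenever the generator is coprime with the number of clients and the total number of selections is a multiple of the number of clients.

```python
import random
from collections import Counter
from math import gcd


def random_schedule(n_clients, per_round, rounds, seed=0):
    """Uniform random sampling: each round draws per_round distinct clients."""
    rng = random.Random(seed)
    return [rng.sample(range(n_clients), per_round) for _ in range(rounds)]


def lattice_schedule(n_clients, per_round, rounds, generator=37):
    """Deterministic lattice-style schedule (illustrative assumption only).

    The k-th selection overall is client (generator * k) mod n_clients.
    Because generator is coprime with n_clients, consecutive selections
    visit every client once before any client repeats.
    """
    if per_round > n_clients:
        raise ValueError("per_round cannot exceed n_clients")
    if gcd(generator, n_clients) != 1:
        raise ValueError("generator must be coprime with n_clients")
    schedule = []
    for t in range(rounds):
        start = t * per_round
        schedule.append(
            [(generator * (start + j)) % n_clients for j in range(per_round)]
        )
    return schedule


def selection_counts(schedule, n_clients):
    """Return how many times each client index appears in the schedule."""
    counts = Counter(c for round_clients in schedule for c in round_clients)
    return [counts[c] for c in range(n_clients)]


if __name__ == "__main__":
    rnd = selection_counts(random_schedule(100, 10, 100), 100)
    lat = selection_counts(lattice_schedule(100, 10, 100), 100)
    print("random : min", min(rnd), "max", max(rnd))
    print("lattice: min", min(lat), "max", max(lat))
```

Plotting the two count vectors as bar charts gives a picture of the same kind as the two subplots: the random schedule spreads around the mean of 10, while the deterministic schedule sits exactly on it.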

Figure 2. Performance comparison of FedAvg-SSL approaches under different parameter settings. (a) Test Accuracy Comparison. (b) Test Loss Comparison.

Figure 3. Performance comparison of federated learning approaches under different data distribution settings. The results show the superiority of SSL methods over supervised-only approaches in both IID and non-IID scenarios. (a) Test Accuracy Comparison. (b) Test Loss Comparison.

Figure 4. Performance comparison with different label ratios (0.1, 0.3, 0.7, and 0.9) under non-IID settings. Higher label ratios generally lead to better performance and faster convergence. (a) Test Accuracy with Different Label Ratios. (b) Test Loss with Different Label Ratios.

Figure 5. Impact of local training steps on Fed-SSL performance under non-IID settings. Increasing local steps leads to faster convergence and better final performance, consistent with Theorem 1. (a) Test Accuracy with Different Local Steps. (b) Test Loss with Different Local Steps.

Figure 6. Performance comparison of FedSSL with different participation rates under the non-IID setting. (a) Test Accuracy with Different Numbers of Clients. (b) Test Loss with Different Numbers of Clients.

Figure 7. Performance comparison of FedSSL with different sampling strategies under the non-IID setting. (a) Test Accuracy with Different Sampling Strategies. (b) Test Loss with Different Sampling Strategies.

Table 1. Test accuracy of the proposed algorithm for different parameters $α_{1}$ and $α_{2}$ in the non-IID case.

	$T_{1} = 10$	$T_{1} = 50$
$α_{1} = 0$	96.91%	97.21%
$α_{1} = 0.5$	97.5%	96.89%
$α_{1} = 1$	96.71%	96.76%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

