Article

SACW: Semi-Asynchronous Federated Learning with Client Selection and Adaptive Weighting

1 College of Computer, Zhongyuan University of Technology, Zhengzhou 450007, China
2 Henan Key Laboratory of Cyberspace Situation Awareness, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Computers 2025, 14(11), 464; https://doi.org/10.3390/computers14110464
Submission received: 7 September 2025 / Revised: 12 October 2025 / Accepted: 23 October 2025 / Published: 27 October 2025

Abstract

Federated learning (FL), as a privacy-preserving distributed machine learning paradigm, demonstrates unique advantages in addressing data silo problems. However, the prevalent statistical heterogeneity (data distribution disparities) and system heterogeneity (device capability variations) in practical applications significantly hinder FL performance. Traditional synchronous FL suffers from severe waiting delays due to its mandatory synchronization mechanism, while asynchronous approaches incur model bias issues caused by training pace discrepancies. To tackle these challenges, this paper proposes the SACW framework, which effectively balances training efficiency and model quality through a semi-asynchronous training mechanism. The framework adopts a hybrid strategy of “asynchronous client training–synchronous server aggregation,” combined with an adaptive weighting algorithm based on model staleness and data volume. This approach significantly improves system resource utilization and mitigates system heterogeneity. Simultaneously, the server employs data distribution-aware client clustering and hierarchical selection strategies to construct a training environment characterized by “inter-cluster heterogeneity and intra-cluster homogeneity.” Representative clients from each cluster are selected to participate in model aggregation, thereby addressing data heterogeneity. We conduct comprehensive comparisons with mainstream synchronous and asynchronous FL methods and perform extensive experiments across various model architectures and datasets. The results demonstrate that SACW achieves better performance in both training efficiency and model accuracy under scenarios with system and data heterogeneity.

1. Introduction

With the rapid advancement of Artificial Intelligence (AI) and Internet of Things (IoT) technologies [1], various terminal devices (e.g., smart vehicles, drones, mobile terminals) generate massive amounts of data. Under traditional machine learning and deep learning paradigms, these raw data must be uploaded to a central server for centralized training [2]. This approach not only consumes substantial communication resources but also poses significant risks of user privacy leakage. As data volume and computational requirements grow exponentially, centralized model training methods face severe challenges in terms of time costs and resource consumption. In this context, federated learning (FL) [3] emerges as an innovative distributed machine learning framework [4]. By employing local model training on client devices and transmitting only model parameters, FL effectively addresses the privacy security and resource bottleneck issues associated with raw data transmission. Currently, FL demonstrates significant application value across various domains, including smart finance, IoT applications, and healthcare [5].
Despite its notable successes, federated learning faces critical challenges in practical applications: its fundamental assumption—that all clients share similar local data distributions and homogeneous computational capabilities [6]—often fails to hold in real-world scenarios. Specifically, both statistical heterogeneity and system heterogeneity among clients [7,8] significantly degrade FL performance. Below we analyze these two types of heterogeneity in detail: (1) System heterogeneity stems from variations in client hardware configurations, including computing unit performance, communication bandwidth, storage capacity, and power supply. Such disparities lead to substantial differences in local model training and parameter upload times, thereby prolonging the overall FL training cycle. Taking the classic synchronous FedAvg algorithm [9] as an example, the server must wait for all clients (including the slowest devices) to upload their models before performing global aggregation. This waiting mechanism causes severe computational resource idling, known as the straggler effect [10]. More critically, in FL frameworks employing fixed global deadlines [11], clients with weaker hardware often fail to complete computations on time and are excluded from training, resulting in severe participation imbalance. (2) Statistical heterogeneity primarily manifests as divergent local data distributions across clients. For instance, in healthcare, specialty hospitals (e.g., orthopedic vs. oncology centers) collect patient data with significant variations in disease types and diagnostic metrics. Such distribution disparities are defined as the Non-IID (Non-Independent and Identically Distributed) problem [12]. Under Non-IID conditions, local model updates may exhibit substantial directional conflicts, and clients with larger datasets tend to disproportionately influence the global model [13]. This training inconsistency not only slows model convergence but may also compromise the final model’s generalization capability. Consequently, research on optimizing system and statistical heterogeneity has become pivotal for improving FL training efficiency and model accuracy. Addressing these challenges is essential to advancing FL’s practical deployment in complex real-world environments.
Synchronous federated learning approaches (e.g., FedAvg [9]) exhibit significant efficiency bottlenecks when addressing system heterogeneity challenges. In this framework, clients must synchronously wait for the server to distribute the global model before initiating local training. Simultaneously, the server must wait for all selected clients to complete training and upload their models before performing global aggregation. This bidirectional waiting mechanism generates substantial idle time, severely limiting the overall training efficiency of federated learning. To overcome this limitation, researchers have proposed the asynchronous federated learning (AFL) framework [14,15]. AFL employs a “first-come-first-train” asynchronous mechanism that effectively eliminates resource waste caused by synchronous waiting and significantly improves client participation efficiency. However, AFL still faces two critical issues in practice: First, due to uneven computational resource allocation, high-performance clients may complete multiple training iterations while resource-constrained clients can only finish a single round, leading to participation imbalance. Second, when stale models uploaded by lagging clients are directly aggregated with the continuously evolving global model, gradient deviation occurs, adversely affecting model convergence speed and stability. To address these challenges, semi-asynchronous federated learning [16] has emerged as an innovative compromise solution. This mechanism introduces dynamic aggregation triggers (e.g., reaching preset model quantity thresholds or fixed time windows), maintaining the efficiency advantages of asynchronous training while ensuring fair participation among all clients.
To address statistical heterogeneity challenges, optimizing client selection mechanisms has become a critical breakthrough for improving federated learning performance [17]. In practical application scenarios, a single server needs to coordinate and manage hundreds to thousands of clients for multiple rounds of iterative training. Adopting a full-client participation communication model would not only impose tremendous communication and computational pressure on the server but also lead to the excessive consumption of network bandwidth resources, severely reducing training efficiency. While the random client selection strategy employed by the traditional FedAvg algorithm is simple to implement, it fails to account for differences in client data distributions. This approach often cannot guarantee the representativeness of selected clients and may overlook those with critical data characteristics, ultimately affecting model performance. Therefore, an appropriate client selection method serves as a viable solution to statistical heterogeneity.
Based on the above research methodology, this paper proposes SACW: Semi-Asynchronous Federated Learning with Client Selection and Adaptive Weighting. SACW synergizes the strengths of synchronous and asynchronous federated learning, markedly mitigating biases induced by statistical and system heterogeneity to bring the training process closer to an ideal federated environment. First, we introduce a DBSCAN (Density-Based Spatial Clustering of Applications with Noise)-based clustering algorithm that selects representative clients according to the local data distributions of all participants, ensuring a balanced global data distribution in each round and effectively curbing statistical heterogeneity. Second, to balance communication efficiency and model convergence, we devise a semi-asynchronous communication paradigm: “clients train asynchronously, while the server performs synchronous adaptive-weighted aggregation.” On the one hand, clients are allowed to update their models locally on their own schedules, eliminating idle time and dropouts caused by synchronous waiting; on the other hand, during synchronous aggregation, the server adaptively weights each contribution based on both the client’s local data volume and its model staleness, suppressing the client bias inherent to fully asynchronous updates and thereby significantly alleviating system heterogeneity.
Our main contributions are summarized as follows:
  • We propose the SACW framework, which adopts a semi-asynchronous strategy of “asynchronous client training–synchronous server aggregation.” This design balances training efficiency with model quality. In addition, an adaptive weighting algorithm based on model staleness and data volume improves system resource utilization and effectively mitigates system heterogeneity.
  • We present a DBSCAN-based client selection strategy that automatically identifies clients with similar data distributions through density clustering, without requiring predefined cluster numbers. This approach forms a grouping structure with “inter-cluster heterogeneity and intra-cluster homogeneity.” Representative clients are then selected from each cluster for aggregation, which alleviates data heterogeneity. To further reduce the network load introduced by semi-asynchronous communication, we introduce a lattice-based quantization scheme for model compression. This method significantly decreases the transmission overhead between clients and the server.
  • To evaluate the practicality of our method, we construct a realistic heterogeneous training environment. We simulate fast and slow clients by assigning different computing resources and generate Non-IID data distributions through non-uniform dataset partitioning. Experiments on three benchmark datasets show that SACW achieves high model accuracy while delivering faster convergence and better communication efficiency.
This paper is organized as follows: Section 2 surveys existing synchronous, asynchronous, and semi-asynchronous federated learning paradigms, highlighting their strengths and limitations in coping with both system and statistical heterogeneity. Section 3 briefly reviews the FedAvg framework and details the DBSCAN clustering algorithm. Section 4 presents our proposed algorithm through flowcharts, pseudocode, and an in-depth description of its key modules. Section 5 provides a comprehensive experimental evaluation: we describe the setup and conduct comparative, ablation, and parameter sensitivity analyses to validate the feasibility of SACW under system and statistical heterogeneity. Section 6 discusses the limitations observed during experimentation and outlines promising directions for future work. Finally, Section 7 concludes this paper.

2. Related Work

FedAvg [9], as the earliest proposed federated learning approach, represents the most classic synchronous federated learning framework, with most subsequent research building upon this foundation. In FedAvg, clients can synchronously perform local training and interact with the server for global model training. However, current studies demonstrate that FedAvg tends to diverge when faced with statistical and system heterogeneity [18]. Currently, most synchronous federated learning methods focus on addressing statistical heterogeneity while assuming the absence of system heterogeneity. For example, MOON [19] corrects local model update directions by constraining the differences between representations generated by local models. FedRS [20] mitigates model drift in Label Shift scenarios by introducing weight parameters to limit uncertain updates. FedLC [21] calibrates logits before softmax based on class occurrence probabilities and optimizes local cross-entropy loss through pairwise label margins for fine-grained class calibration. However, constrained by their synchronous nature, these methods still suffer from compromised convergence speed when system heterogeneity is present.
Federated learning based on asynchronous mechanisms serves as an effective solution to system heterogeneity. AFL enables clients to conduct local model training at their individual speeds while maintaining timely interaction with the server. After acquiring the updated global model, clients proceed with subsequent local training. This approach enhances client utilization efficiency while significantly reducing server waiting time. For instance, FedAsync [14] permits clients to upload updates at any time, where the server queues them in order of arrival and employs a staleness function combined with mixing hyperparameters to downweight obsolete models during global model updates. Concurrently, it optimizes the local training process through a proximal term. FedBuff [22] establishes a buffer on the server side, allowing clients to asynchronously upload updates to this fixed buffer, with batch aggregation performed to update the global model once the buffer reaches capacity. MAPA-S and MAPA-C [23] also equip the server with a buffer to enable asynchronous training across clients. Their distinction lies in adding noise to the buffer, thereby shielding the model parameters from potential leakage. However, these fully asynchronous methods may cause the global model to converge toward high-efficiency clients, consequently weakening the model’s generalization capability. Moreover, when local models exhibit higher staleness than the global model, the post-aggregation results may decelerate the overall convergence speed.
Federated learning frameworks based on semi-asynchronous mechanisms can combine the advantages of both synchronous and asynchronous approaches by performing aggregation when reaching specific model quantity thresholds or fixed time intervals. This approach reduces waiting time while ensuring each client can participate fairly in global training. For example, FedSA [24] dynamically adjusts the number of clients M participating in each round of aggregation to balance convergence speed and communication efficiency. SAFA [25] synchronously updates both current and outdated clients to maintain basic model consistency while allowing tolerable lagging clients to continue asynchronous local training. QuAFL [26] permits global model aggregation even when some clients have not completed local training but fails to consider the impact of model staleness and data heterogeneity on training performance.
In federated learning scenarios, focusing solely on data distribution differences would cause weak terminals to fall behind and slow down overall training progress. Conversely, considering only computational capability disparities would lead to slow convergence and accuracy degradation due to Non-IID data effects. Therefore, both system resource heterogeneity and statistical data differences must be incorporated into scheduling decisions to simultaneously achieve optimal model performance and training efficiency.

3. Preliminary

3.1. Federated Learning

To address challenges including data silos, privacy protection regulations, and distributed computing limitations, Google first proposed the concept of Federated Optimization in 2017 [27]. This groundbreaking work established the theoretical foundation for federated learning technology. As the first complete framework in this field, the Federated Averaging (FedAvg) algorithm [9] adopted the typical architecture shown in Figure 1: a two-tier topology consisting of a central server and distributed clients. Its core innovation lies in realizing the privacy-preserving paradigm of “data stays local; models move”—the server only aggregates model parameters trained locally by each client, while raw data always remains on local devices. Table 1 provides explanations of the key notations used throughout this paper. The standard workflow of this framework can be described in detail as follows:
  • Initialization: The central server broadcasts initialization parameters to all clients. Each client $k_i$ ($i \in [1, N]$) performs local initialization after receiving the broadcast information from the server.
  • Client Selection: The server randomly selects a subset of clients from all available clients, forming a candidate set $S = \{s_1, s_2, \ldots, s_i, \ldots, s_k\}$, and sends the global model $\omega^r$ ($r \in [1, R]$) to the selected clients.
  • Local Client Training: Each selected client $s_i \in S$ trains the model on its local dataset starting from $\omega^r$, obtaining updated parameters $\omega_{s_i}^{r+1}$, and then uploads the trained model parameters to the server. Upon completing training, the client stops local training and waits for subsequent selection by the server. The client state transitions as follows: Idle → Training → Uploading → Idle.
  • Server Aggregation: After collecting all updated parameters $\omega_{s_i}^{r+1}$, the server performs a weighted aggregation of local models based on each client's data volume to obtain a new global model $\omega^{r+1}$.
  • Model Evaluation: The system evaluates whether the global model’s accuracy meets the target threshold or whether the total training rounds reach R. If yes, the federated learning task terminates. Otherwise, the next training round begins immediately, repeating steps 2–5.
The global training objective of FedAvg is to obtain a global model that minimizes the loss function, expressed as follows:
$$\min_{\omega}\, l(\omega) := \sum_{i=1}^{N} \frac{|D_i|}{|D|}\, l_i(\omega_i), \qquad |D| = \sum_{i=1}^{N} |D_i|$$
where $\omega$ represents the trained global model, $D_i$ denotes the dataset of $k_i$, $|\cdot|$ indicates the size of a dataset, and $l_i(\omega_i)$ corresponds to the loss of $k_i$'s local model.
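As a concrete illustration of the data-volume-weighted aggregation implied by this objective, the following minimal Python sketch averages flattened client parameter vectors by $|D_i|/|D|$; the function name, toy models, and data sizes are hypothetical and not taken from the paper.

```python
import numpy as np

def fedavg_aggregate(local_models, data_sizes):
    """Data-volume-weighted average of client parameter vectors (FedAvg-style aggregation).

    local_models: list of 1-D numpy arrays, one flattened parameter vector per client
    data_sizes:   list of |D_i| values for the same clients
    """
    weights = np.asarray(data_sizes, dtype=float)
    weights /= weights.sum()                      # |D_i| / |D|
    stacked = np.stack(local_models, axis=0)      # shape: (num_clients, num_params)
    return (weights[:, None] * stacked).sum(axis=0)

# Toy usage: three clients with unequal data volumes.
models = [np.ones(4) * v for v in (1.0, 2.0, 3.0)]
print(fedavg_aggregate(models, data_sizes=[100, 300, 600]))  # pulled toward client 3's model
```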

3.2. DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [28] is a powerful unsupervised clustering algorithm whose core concept lies in identifying “density regions” within datasets and recognizing these regions as clusters while simultaneously detecting and handling noise points located in sparse areas. The algorithm offers several advantages: First, in real-world data, the number of clusters is often unknown. DBSCAN automatically discovers the number of clusters through density expansion, avoiding clustering bias caused by inappropriate cluster number selection. Second, DBSCAN forms clusters through density expansion, enabling the discovery of non-convex and irregularly shaped clusters. Third, DBSCAN explicitly distinguishes between core points, border points, and noise points, marking points in sparse regions as noise to prevent them from interfering with the clustering results. The algorithm procedure is as follows:
  • Parameter Initialization: Predefine the neighborhood radius ϵ (Eps) and minimum density threshold MinPts to characterize the density conditions for core points.
  • Point Access and Core Point Determination: Randomly select a point f from the dataset and mark it as visited. If the ϵ-neighborhood of f contains at least MinPts objects, f is identified as a core point, and a new cluster C is created. Otherwise, f is temporarily marked as noise. If f is a core point, all objects within its neighborhood are added to the candidate set F.
  • Cluster Expansion: While the candidate set F is non-empty, repeat the following operations: Extract any object d from F that has not been assigned to another cluster. If d has not been visited, mark it as visited and examine its ϵ -neighborhood. If d also satisfies the core point condition, append all objects in its neighborhood not yet assigned to the current cluster into F. Then, assign d to cluster C. This process continues until F is empty, at which point cluster C is fully constructed.
  • Iteration: Randomly select another unvisited point and repeat steps 2 and 3 until all objects in the dataset have been visited.
  • Output: Finally, all points successfully assigned to a cluster C form the clustering result, while points never incorporated into any cluster are treated as noise.
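For illustration only, the snippet below reproduces this behavior with scikit-learn's off-the-shelf DBSCAN implementation rather than the step-by-step procedure above; the synthetic points and the ϵ and MinPts values are chosen arbitrarily.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point; DBSCAN finds the blobs and flags the outlier.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(20, 2)),   # dense region 1
    rng.normal(loc=1.0, scale=0.05, size=(20, 2)),   # dense region 2
    [[5.0, 5.0]],                                     # sparse point -> noise
])

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(points)  # eps = ϵ, min_samples = MinPts
print(set(labels))  # e.g. {0, 1, -1}; label -1 marks noise points
```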

4. Proposed Method: SACW

4.1. Problem Description

In this paper, we consider a federated learning scenario with both system and statistical heterogeneity, where the system architecture consists of one central server and N local clients. To facilitate theoretical analysis, we make the following basic assumptions:
  • The client set remains fixed during training, with no dynamic addition of new clients;
  • Each client’s local dataset remains unchanged throughout the training process;
  • The central server is honest and coordinates all clients for global training.
Under this framework, the optimization objective of federated learning can be formally expressed as follows: through the efficient scheduling of heterogeneous clients coordinated by the server, we aim to minimize the expected loss function of the global model. Specifically, we need to solve the following distributed optimization problem:
$$\min_{\omega}\, l(\omega) := \sum_{i=1}^{N} p_i \cdot l_i\!\left(\omega_i^{\tau_i^r}\right)$$
where $p_i$ represents the adaptive weight of $k_i$ based on both data volume and model staleness, and $l_i(\omega_i^{\tau_i^r})$ denotes the loss function of $k_i$'s local model with staleness $\tau_i^r$.

4.2. Algorithm Description

The semi-asynchronous federated learning framework proposed in this study innovatively integrates the advantages of asynchronous training mechanisms while preserving the core principles of FedAvg. In large-scale scenarios such as IoT, industrial IoT, and vehicular networks, our algorithm significantly outperforms synchronous federated learning. The reason is clear: when the number of clients is huge, some devices will inevitably lose power, go offline, or suffer from low training efficiency due to insufficient voltage or computing power. A simple synchronous scheme would introduce excessive waiting time and slow down global model training. SACW adopts a semi-asynchronous mechanism to eliminate this synchronization bottleneck, and quantization during communication further reduces the communication overhead. As illustrated in Figure 2, the framework introduces two key innovative features.
First, regarding client selection strategy, we advanced beyond the traditional random selection mechanism of FedAvg by proposing a DBSCAN density-based clustering intelligent selection algorithm. This algorithm analyzes the data distribution characteristics of clients and clusters those with similar data distributions into several groups, achieving optimized data distribution characterized by “inter-cluster heterogeneity and intra-cluster homogeneity.” This clustering selection mechanism ensures that the clients selected from each cluster are sufficiently representative, thereby effectively mitigating model bias caused by statistical heterogeneity.
Second, in terms of the training mechanism, we improved upon FedAvg’s synchronous waiting mode, specifically via the following: (1) Clients are permitted to continue local training even when not selected by the server. (2) Training is terminated only when clients are selected by the server or when the preset maximum training steps are reached. (3) We innovatively designed a client training scheduling mechanism incorporating three process states. This design enables clients to efficiently switch among three states: “training–waiting–uploading.” Compared with FedAvg, this approach allows for more effective local training iterations within the same time window. As shown in Figure 2, the framework’s operational workflow can be divided into five key steps:
  • Initialization: All clients receive the initial model $\omega^0$ broadcast by the server, together with the learning rate and the local maximum training steps L. They subsequently transmit their local data distributions to the server.
  • Client Clustering and Selection: The server performs clustering based on all clients' local data distributions and periodically (with interval T) randomly selects clients from each cluster to receive the latest global model $\omega^r$, $r \in [1, R]$, where R represents the total number of global training rounds.
  • Local Client Training: Each local client operates under one of three conditions:
    (a) When selected by the server, the client immediately uploads its compressed local model (full or partial) along with the model version number, then initiates local training based on the latest global model. State transition: Local training/Idle → Uploading → Local training.
    (b) When not selected by the server but with local training steps below the maximum threshold, the client continues local model training. State transition: Local training → Idle.
    (c) When not selected and having reached the maximum training steps, the client remains idle until scheduled by the server. State transition: persists in Idle.
  • Server Aggregation: Upon receiving all selected clients' models and version numbers, the server decompresses the models, calculates their staleness, and computes adaptive weighting factors based on both model staleness and training data volume. It then performs weighted aggregation to update the global model $\omega^{r+1}$.
  • Model Evaluation: The server evaluates whether the global model’s accuracy meets the target threshold or whether the training duration has expired. If yes, the federated learning task terminates; otherwise, it repeats steps 2–5 after waiting for period T.
Algorithm 1 presents the comprehensive procedure of the SACW algorithm.
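Since Algorithm 1 is reproduced only as a figure, the sketch below gives a minimal, single-process Python approximation of one SACW server round (steps 2 and 4 above) under simplifying assumptions: clustering has already been performed, uploaded models are already decompressed, and networking, true asynchrony, and quantization are omitted. All function and variable names are hypothetical.

```python
import numpy as np

def sacw_server_round(global_model, r, clusters, client_models, client_versions,
                      client_data_sizes, lam=0.3, alpha=None, rng=np.random.default_rng(0)):
    """One synchronous aggregation round of the SACW server (simplified sketch).

    clusters:          list of lists of client ids (output of the DBSCAN-based grouping)
    client_models:     dict id -> latest flattened local model (already decompressed)
    client_versions:   dict id -> global-model version the local model was trained from
    client_data_sizes: dict id -> |D_i|
    """
    selected = [int(rng.choice(c)) for c in clusters]          # one representative per cluster
    K = len(selected)
    alpha = 1.0 / (K + 1) if alpha is None else alpha           # retain a share of the old global model

    sizes = np.array([client_data_sizes[i] for i in selected], dtype=float)
    staleness = np.array([r - client_versions[i] for i in selected], dtype=float)
    c = (sizes / sizes.sum()) * np.exp(-lam * staleness)        # adaptive weights c_i
    c /= c.sum()

    local = np.stack([client_models[i] for i in selected])
    new_global = alpha * global_model + (1 - alpha) * (c[:, None] * local).sum(axis=0)
    return new_global, selected
```

The default of $\alpha = 1/(K+1)$ simply mirrors the setting reported in Section 5.2.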

4.3. Method Description

4.3.1. Client Selection

In ideal federated learning scenarios where all clients share identical label distributions (IID data), each round of global training involves balanced client data participation, enabling optimal model convergence speed [8]. However, this idealized assumption rarely holds in practice due to inherent disparities in client resources and data distributions. To mitigate the impact of data heterogeneity on global model training, our algorithm first clusters clients based on their local data label distributions, grouping those with similar patterns into the same cluster. This creates an “intra-cluster homogeneity, inter-cluster heterogeneity” structure among clients. The central server then employs a stratified random-sampling strategy to select representative clients from each cluster for participation in training. Moreover, randomly selecting clients with similar data within each cluster effectively prevents the system from persistently favoring high-performance and high-capacity clients. This mechanism ensures the global model learns comprehensive data characteristics while effectively reducing model bias caused by data heterogeneity. The detailed procedure operates as follows:
Each client $k_i$ computes its data distribution representation $P_i(y)$ from its dataset $D_i$ as follows:
$$P_i(y) = \left[\frac{|D_{i,1}|}{|D_i|},\ \frac{|D_{i,2}|}{|D_i|},\ \ldots,\ \frac{|D_{i,m}|}{|D_i|}\right]$$
where $|D_i|$ represents $k_i$'s total data volume, and $|D_{i,m}|$ denotes the number of training samples of the $m$-th category in $k_i$.
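A minimal sketch of computing $P_i(y)$ from a client's label list (the helper name and toy labels are illustrative):

```python
import numpy as np

def label_distribution(labels, num_classes):
    """P_i(y): per-class sample fractions for one client's local dataset."""
    counts = np.bincount(np.asarray(labels), minlength=num_classes)
    return counts / counts.sum()

print(label_distribution([0, 0, 1, 2, 2, 2], num_classes=4))  # -> [0.333 0.167 0.5 0.]
```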
Algorithm 1: Pseudocode of SACW
The central server performs clustering based on the local label distributions of all clients, grouping those with similar label distributions into the same cluster. The similarity measurement process for the DBSCAN clustering algorithm follows the methodology described in Section 3.2.
$$\{C_1, C_2, \ldots, C_K\} \leftarrow \mathrm{Clustering}\!\left(\{P_i(y)\}_{i=1}^{N}\right)$$
where $C_i$, $i \in [1, K]$, denotes the partitioned clusters of clients. The central server then randomly selects one representative client from each cluster to participate in global training:
$$S = \{s_1, s_2, \ldots, s_K\} \leftarrow \mathrm{random}(C_1, C_2, \ldots, C_K)$$
where $S$ represents the set of clients participating in the next round of global training.
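The sketch below combines the two steps above, assuming scikit-learn's DBSCAN as the clustering routine; how noise (unclustered) clients are handled is not specified in the paper, so skipping them here is purely an assumption, and the default ϵ and MinPts values are simply chosen from the range found robust in Section 5.4.2.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def select_representatives(distributions, eps=0.1, min_samples=3, rng=np.random.default_rng(0)):
    """Cluster clients by their label-distribution vectors and pick one client per cluster.

    distributions: array of shape (N, num_classes), row i = P_i(y)
    Returns (selected client indices, cluster label per client).
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(distributions)
    selected = []
    for c in set(labels):
        if c == -1:                       # noise clients: handling is an assumption of this sketch
            continue
        members = np.flatnonzero(labels == c)
        selected.append(int(rng.choice(members)))
    return selected, labels
```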

4.3.2. The Semi-Asynchronous Framework with Adaptive Weighting

Many optimization algorithms based on the FedAvg federated learning framework have been proposed, yet none can completely overcome the negative impacts inherent in synchronous federated learning mechanisms. In FedAvg, clients must wait to be selected by the server before initiating local model training, while the server must wait to collect models from all selected clients before performing model aggregation and starting the next round of global training.
Our algorithm overcomes this waiting constraint: local clients can continue training without needing to be selected by the server. In fact, clients that remain unselected for extended periods can perform more comprehensive model training, resulting in more complete local models. The central server only needs to periodically collect local training progress and distribute the latest global model, replacing the original operation that required waiting for all clients to complete training before aggregation. This “asynchronous client training with periodic server synchronization” approach significantly reduces bidirectional waiting time between clients and the server while greatly improving client utilization efficiency.
In heterogeneous environments, while this high-efficiency communication paradigm improves client utilization, it also increases the number of communication rounds between clients and the server by an order of magnitude compared to FedAvg. This consequently raises two critical problems:
  • Problem 1: In each training round, the central server selects only a subset of clients to participate in global training (for instance, 200 out of 1000 clients). As communication rounds accumulate, certain clients may remain unselected for an extended period, resulting in local models that are excessively stale relative to the latest global model. Directly averaging these highly stale local models with the global model can severely impede convergence. Therefore, in SACW, a local model is assigned higher utility, and consequently a larger weight in server aggregation, when its host client possesses a larger dataset and exhibits lower staleness: the weight scales proportionally with data volume and decays exponentially with staleness. Conversely, clients with smaller datasets and higher staleness receive smaller weights. This adaptive weighting strategy accelerates convergence, as detailed in Section 4.3.3 and corroborated by the ablation results in Section 5.5.
  • Problem 2: The server must communicate cyclically with all selected clients, transmitting the global model and receiving their local updates in every round. As both the number of communication rounds and participating clients grows, this imposes substantial communication overhead on the server. To mitigate this pressure, SACW incorporates model compression techniques, detailed in Section 4.3.4.

4.3.3. Adaptive Weighting

In our framework, whenever a client is accessed by the server, it immediately uploads its local model $\omega_i^{\tau_i^r}$ (either partial or complete) and the version number $\tau_i^r$ of the global model upon which this local model was last trained. The selected client then instantaneously performs local parameter updates and initiates a new round of local model training. The local update and training process proceeds as follows:
$$\tau_i^r = \begin{cases} r, & i \in S \\ \tau_i^{r-1}, & i \notin S \end{cases}$$
$$\omega_i^{\tau_i^r} = \omega^r, \quad i \in S$$
$$\omega_i^{\tau_i^r} \leftarrow \omega_i^{\tau_i^r} - \eta_i\, g_i\!\left(\omega_i^{\tau_i^r}; D_i\right)$$
where $\tau_i^r$ and $\omega_i^{\tau_i^r}$ denote the version number of the global model received by client $k_i$ and the local model of the selected client, respectively, while $\eta_i$ and $g_i(\cdot)$ represent the learning rate and optimizer employed by $k_i$.
Upon receiving $\omega_i^{\tau_i^{r-1}}$ and $\tau_i^{r-1}$ from $s_i$, the central server computes a staleness factor $\mathrm{staleness}_i$. Subsequently, an adaptive weighting factor $c_i$ is assigned to each client based jointly on its staleness and dataset size. The formulation is as follows:
$$\mathrm{staleness}_i = r - \tau_i^{r-1}$$
$$c_i = \frac{|D_i|}{\sum_{j \in S} |D_j|} \cdot e^{-\lambda \cdot \mathrm{staleness}_i}$$
where $\lambda$ is a tunable hyperparameter that modulates the magnitude of staleness's influence on the weight assignment; its tuning is elaborated in the dedicated ablation study.
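A small worked example of the weight formula, written with the negative exponent so that the weight decays with staleness (consistent with the description above); the numbers are arbitrary:

```python
import numpy as np

def adaptive_weight(data_size, total_size, staleness, lam=0.3):
    """c_i = (|D_i| / sum_j |D_j|) * exp(-lam * staleness_i)."""
    return (data_size / total_size) * np.exp(-lam * staleness)

# Two clients with equal data but different staleness: the stale one is down-weighted.
fresh = adaptive_weight(500, 1000, staleness=0, lam=0.3)
stale = adaptive_weight(500, 1000, staleness=4, lam=0.3)
print(round(fresh, 3), round(stale, 3))  # 0.5 vs ~0.151 before renormalization
```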
The central server performs weighted aggregation on the local models of the selected clients to obtain the new global model, as follows:
$$\omega^{r+1} = \alpha \cdot \omega^r + (1-\alpha) \cdot \sum_{i \in S} \frac{c_i}{\sum_{j \in S} c_j} \cdot \omega_i^{\tau_i^{r-1}}$$
where $\alpha$ is an adjustable parameter, $\alpha \in \left[0, \frac{1}{K+1}\right]$. We treat the $r$-th global model as an additional client; $\alpha$ thus determines the extent to which the $r$-th global model is retained.

4.3.4. Lattice-Based Model Quantization

Lattice-based quantization [29] addresses distributed mean estimation (DME) and variance reduction while minimizing communication cost, making it well suited to parallel SGD. This aligns closely with federated learning; hence, we adopt this quantization scheme for bidirectional communication between clients and the server to alleviate the central server's communication burden. The procedure is as follows: each client transmits its compressed local model, which the central server decompresses upon receipt. Specifically, in the $r$-th round, to encode $\omega_i^{\tau_i^{r-1}}$ (i.e., compute $\mathrm{encode}(\omega_i^{\tau_i^{r-1}})$), the client randomly maps $\omega_i^{\tau_i^{r-1}}$ to a point $z$ selected from a set of nearby lattice points forming a convex hull around $\omega_i^{\tau_i^{r-1}}$, and then sends $z \bmod q$ under the lattice basis (where $q$ is the quantization precision parameter). To decode, i.e., compute $\mathrm{decode}(B, \mathrm{encode}(\omega_i^{\tau_i^{r-1}}))$, the server maintains the previous global model $\omega^r$ as the decoding key. After receiving each client's $\mathrm{encode}(\omega_i^{\tau_i^{r-1}})$, it recovers the original model parameters $\omega_i^{\tau_i^{r-1}}$ dimension by dimension by exploiting the proximity between $\omega^r$ and the received encoding.
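The following is a deliberately simplified, scalar integer-lattice illustration of the "transmit z mod q, decode with a nearby reference" idea; it is not the randomized lattice scheme of [29], and the step size, q, and function names are assumptions made for this sketch only.

```python
import numpy as np

def encode(w, step=0.01, q=256):
    """Quantize to the integer lattice with spacing `step`; transmit residues mod q only."""
    z = np.round(w / step).astype(np.int64)          # nearest lattice point per coordinate
    return z % q                                      # log2(q) bits per coordinate

def decode(residue, reference, step=0.01, q=256):
    """Recover lattice points using a reference vector assumed close to the original w."""
    ref = np.round(reference / step).astype(np.int64)
    # choose the lattice point with the transmitted residue that is nearest to the reference
    delta = (residue - ref) % q
    delta[delta > q // 2] -= q
    return (ref + delta) * step

rng = np.random.default_rng(0)
global_model = rng.normal(size=5)
local_model = global_model + rng.normal(scale=0.02, size=5)   # local update stays near the global model
recovered = decode(encode(local_model), reference=global_model)
print(np.max(np.abs(recovered - local_model)) <= 0.005 + 1e-12)  # within half a lattice step
```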

5. Experimental Description

5.1. Dataset and Model

To systematically evaluate the algorithm under varying task complexities, we conduct comprehensive experiments on three benchmark datasets: MNIST [30], Fashion-MNIST [31], and CIFAR-10 [32]. For MNIST, we employ a fully connected MNISTNet network with a single hidden layer and ReLU activation; the optimizer is SGD with a learning rate of 0.1 and a batch size of 128. For Fashion-MNIST, we adopt a classical CNN architecture and use the Adam optimizer with a learning rate of 0.002 and a batch size of 128. For the more challenging CIFAR-10, we use ResNet-20 and retain SGD with a learning rate of 0.1.

5.2. Experimental Setup

In real federated learning scenarios, clients exhibit both data distribution and system performance heterogeneity. To reproduce these conditions, we introduce statistical and system heterogeneity in the federated setup. The details are described below.
Statistical heterogeneity: We employ Dirichlet-based data partitioning [33] ($p_k \sim \mathrm{Dir}(\beta)$), which assigns non-identical class distributions and sample sizes to clients. The strength of statistical heterogeneity is controlled by $\beta$; as $\beta$ decreases, the heterogeneity among clients becomes more pronounced, as illustrated in Figure 3.
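A minimal sketch of Dirichlet-based partitioning in the spirit of [33]: for every class, per-client proportions are drawn from Dir(β) and that class's samples are split accordingly (the helper name and toy labels are illustrative):

```python
import numpy as np

def dirichlet_partition(labels, num_clients, beta, rng=np.random.default_rng(0)):
    """Split sample indices across clients with per-class Dirichlet(beta) proportions."""
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        proportions = rng.dirichlet([beta] * num_clients)              # p_k ~ Dir(beta)
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(part.tolist())
    return client_indices

# Smaller beta -> more skewed per-client class distributions.
fake_labels = np.repeat(np.arange(10), 100)             # 10 classes, 100 samples each
parts = dirichlet_partition(fake_labels, num_clients=5, beta=0.1)
print([len(p) for p in parts])
```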
System heterogeneity: To capture the uncertainty of local computation time, we model the duration of a single local update as a memoryless random variable X that follows an exponential distribution with rate parameter $\theta$. A larger $\theta$ indicates faster computation. Accordingly, we classify devices into two types: "fast" nodes with $\theta = 1/2$ (expected time E[X] = 2 units) and "slow" nodes with $\theta = 1/8$ (expected time E[X] = 8 units). By default, 25% of the clients are slow nodes.
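A short sketch of how such exponentially distributed update times could be simulated (the client counts and helper name are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update_time(theta, rng=rng):
    """Duration of one local update: X ~ Exp(theta), so E[X] = 1/theta."""
    return rng.exponential(scale=1.0 / theta)

num_clients = 50
num_slow = num_clients // 4                                       # 25% slow nodes by default
thetas = np.array([1 / 8] * num_slow + [1 / 2] * (num_clients - num_slow))
times = np.array([local_update_time(t) for t in thetas])
print(f"mean step time: slow {times[:num_slow].mean():.2f}, fast {times[num_slow:].mean():.2f}")
```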
In our experiments, setting $\alpha = \frac{1}{K+1}$ accelerates convergence. We attribute this to the fact that the previous global model is the closest approximation to the current one; retaining a fraction of it further alleviates the staleness effect.
Additionally, SACW advances the global model by periodically collecting a subset of client models. The period is determined by the handling time (h) and the server visiting time (v). The processing time h is governed by the quantization bit-width (b): fewer bits yield shorter processing. The access time v is set by the server according to local task complexity; simple tasks such as MNIST and Fashion-MNIST set v = 1, whereas the more complex CIFAR-10 task sets v = 6. For MNIST and Fashion-MNIST, we set the total number of clients to N = 50; for CIFAR-10, we set N = 40. The experimental setup for this study is outlined in Table 2.

5.3. Comparative Experimental Results and Analysis

Figure 4, Figure 5 and Figure 6 adopt two degrees of data heterogeneity, namely $p_k \sim \mathrm{Dir}(0.5)$ and $p_k \sim \mathrm{Dir}(0.1)$. The label distributions of the clients are illustrated in Figure 3. Under $\mathrm{Dir}(0.5)$, heterogeneity among clients is mild and better reflects real-world data distributions. Under $\mathrm{Dir}(0.1)$, heterogeneity is severe.
In Figure 4, on the MNIST dataset, the federated learning task is relatively simple, and all four algorithms perform well. Under β = 0.5, SACW exhibits rapid convergence and a smooth training trajectory, almost completely mitigating the effects of heterogeneity. The other three algorithms are visibly affected, with FedAvg and FedBUFF being the most sensitive. Under the harsher setting of β = 0.1, SACW experiences the smallest degradation; accuracy drops by only 2.01 % . The remaining algorithms fluctuate sharply, and FedAvg’s accuracy declines by 5.15 % .
In Figure 5, on the Fashion-MNIST dataset, SACW outperforms the other three algorithms in both convergence speed and final accuracy under β = 0.5 and β = 0.1. When β = 0.5, the training curve of SACW remains smooth despite the increased data and model complexity.
Figure 6 compares the global training loss curves using ResNet-20 on CIFAR-10, a more complex model and dataset. Figure 6a,b both exhibit fluctuations, yet SACW converges faster and achieves higher accuracy than the other methods. Under both β = 0.5 and β = 0.1, FedBUFF suffers from poor convergence and large oscillations. When β = 0.1, QuAFL and FedAvg also display pronounced instability.
In summary, under system and statistical heterogeneity, SACW maintains stable training and rapid convergence across three datasets and neural networks of varying complexity. With β = 0.5 and 25 % slow clients, SACW almost completely overcomes heterogeneity. Under the more challenging case of β = 0.1, SACW experiences only minor fluctuations and outperforms the other three algorithms. To evaluate the model performance more comprehensively, we introduce precision, recall, and F1-score as core metrics, with the specific results of supplementary comparative experiments available in Appendix A.

5.4. Hyperparameter Analysis

5.4.1. Analysis of Core Hyperparameters (b, v, λ )

To examine the influence of three key hyperparameters (quantization bits b, weighting factor λ, and server access interval v), we conduct ablation studies on MNIST under $p_k \sim \mathrm{Dir}(0.5)$ with 25% slow clients, a setting that closely approximates real-world heterogeneity.
The parameter b represents the quantization bit-width, which controls the trade-off between communication cost and model accuracy. In our experiments, the server schedules clients for federated learning tasks based on the server handling time h and waiting time v. A larger quantization bit-width b means more precise model information, which increases the processing time h but improves accuracy. As shown in Figure 7a, we conduct a hyperparameter analysis with b = {8, 12, 32}, where b = 32 corresponds to the unquantized baseline. According to [29], the adopted lattice-based quantization guarantees bounded distortion, ensuring that quantization does not significantly degrade the global model. When b = 12, the global accuracy curve closely matches that of the unquantized case, effectively reducing communication cost while maintaining training efficiency. Moreover, in unstable networks (high packet loss or jitter), the bit-width can be increased to compensate.
The parameter v denotes the server visiting time, which specifies the minimum local training duration allowed for clients. A larger v leads to more complete local training but may cause fast clients to wait for slow ones, making the system behave more synchronously. Conversely, a smaller v speeds up global updates but may reduce local model quality. As shown in Figure 7b, we tested v = {1, 2, 5} on the MNIST dataset and found that the global accuracy curve for v = 1 is close to those for the larger settings. Thus, for simpler datasets such as MNIST and Fashion-MNIST, v = 1 is sufficient. However, for the more complex CIFAR-10 dataset, a higher value (v = 5) is needed to balance local model completeness and overall training efficiency.
The parameter λ is a weighting coefficient that balances client data size and model staleness in global aggregation. A higher λ imposes stricter penalties on client staleness, discarding more outdated information. As shown in Figure 7c, we ran extensive tests with λ = {0.1, 0.2, …, 1.0}. Performance stays stable when λ is between 0.2 and 0.6. Setting λ = 0.3 gives a good trade-off between letting data size dominate and penalizing delay, so this is the value we use in this paper.

5.4.2. Sensitivity Analysis of DBSCAN Clustering Parameters (ϵ, MinPts)

To examine the robustness of the proposed SACW framework with respect to the clustering hyperparameters used in the DBSCAN-based client selection, we conduct a systematic parameter sensitivity analysis. Specifically, we investigate how the two key parameters, namely the neighborhood radius ϵ (Eps) and the minimum neighborhood point count MinPts, affect the clustering structure and global model performance.
Our analysis is conducted under the most heterogeneous and challenging setting: CIFAR-10 with Dirichlet β = 0.1. The results on MNIST and Fashion-MNIST show similar trends. Both parameters are tuned within ranges that are empirically common.
  • ϵ is set to {0.02, 0.05, 0.1, 0.2, 0.4, 0.8}, covering a wide spectrum from very tight to loose clustering thresholds.
  • MinPts is set to {2, 3, 5, 8, 10}, representing progressively stricter density constraints.
For each ( ϵ , MinPts) pair, DBSCAN is applied to generate client clusters, and SACW is trained to converge with identical initialization and learning rate. The final global model accuracy is recorded and visualized as a heatmap (Figure 8).
Figure 8 illustrates the influence of the DBSCAN parameters on model performance. The color intensity represents the final global accuracy. As observed, SACW achieves stable and consistent performance across a broad parameter range. The best accuracy of 0.7486 is obtained when $\epsilon \in [0.1, 0.2]$ and MinPts $\in [3, 5]$. Within this range, accuracy fluctuations remain below ±0.3%, indicating strong parameter robustness. When ϵ is too small (≤0.05), the clustering becomes overly fine-grained, leading to insufficient data diversity representation. Conversely, when $\epsilon \geq 0.4$ or MinPts $\geq 8$, the clustering becomes too coarse, merging heterogeneous clients and slightly degrading global performance.

5.5. Ablation Analysis

The SACW framework comprises two primary modules: (1) the DBSCAN-based client selection method (DBCS) and (2) the adaptive local model weighting method (ALMW). We decompose SACW into four variants:
  • Default: Neither DBCS nor ALMW; employs random client selection and data-volume-based weighting.
  • Default + DBCS: Replaces the client selection strategy with DBCS.
  • Default + ALMW: Replaces the server-side weighting strategy with ALMW.
  • SACW: The complete framework that combines both modules.
We evaluate these four variants on three datasets under two heterogeneity levels. The results are summarized in Table 3. DBCS yields the largest improvement when data heterogeneity is high. On MNIST with β = 0.5, accuracy rises from 92.55 % to 93.91 % . The gain is more pronounced on simpler datasets than on complex ones. ALMW plays a more critical role in complex datasets. On CIFAR-10 with β = 0.1, ALMW alone improves accuracy by 2.17 % . When both modules are active, SACW achieves the highest score in all 18 experimental settings and often surpasses the sum of individual gains. For example, on CIFAR-10 with β = 0.1, SACW attains 74.86 % , exceeding the second best result by 7.28 % , indicating a synergistic amplification effect.
In summary, DBCS filters high-quality clients via clustering, and ALMW suppresses low-quality updates through adaptive weighting. They demonstrate complementary strengths across datasets of varying complexity and heterogeneity. Activating either module alone outperforms Default, while their joint activation yields non-linear synergy, ensuring SACW consistently achieves the best performance and validating the necessity of both DBCS and ALMW.

6. Discussion

Although our framework demonstrates robust performance in the presence of system and statistical heterogeneity, it leaves the local training phase of each client unoptimized; theoretically, both efficiency and convergence speed can therefore be further improved. Personalized federated learning [34]—tailoring local optimization to individual clients—will be a primary direction for future work. Moreover, when the number of client distribution vectors is small, the server can easily handle the load. However, once the system scales to millions of clients and thousands of labels, storing every vector becomes prohibitive and cripples training efficiency. Fortunately, an incremental clustering algorithm [35] copes with both streaming data and limited memory: new samples are merged into existing clusters on the fly, so the server only keeps the current cluster summaries instead of the full set of vectors.
A second concern arises during client-to-server uploads: vanilla gradient transmission is vulnerable to gradient inversion attacks [36], which can reconstruct private training data and thus violate the privacy-preserving tenet of federated learning. Differential privacy (DP) [37,38] offers a remedy: the Laplace or Gaussian mechanism can inject calibrated noise into the gradients before sharing, thereby thwarting such attacks. However, most DP-flavored defenses have been designed for synchronous federated protocols. Extending these guarantees to our semi-asynchronous setting—balancing privacy, utility, and convergence—constitutes another key avenue for future research. Additionally, malicious servers do exist in practice, and sending exact client distributions to them would clearly breach privacy. Recent works [39,40] show that deep models tend to predict the majority classes seen during training. Leveraging this, we can feed each client model a large set of random inputs and indirectly estimate its data distribution from the predicted labels. Such estimates are inherently coarse and noisy, mitigating the privacy threat posed by an untrusted server.

7. Conclusions

We propose SACW, a semi-asynchronous federated learning method that combines client selection with adaptive weighting to address training inefficiency under system and statistical heterogeneity. By periodically collecting local model progress while allowing clients to train asynchronously, SACW eliminates the mutual waiting inherent in synchronous schemes, thereby maximizing client utilization and mitigating system heterogeneity. An adaptive weighting mechanism that considers both data volume and model staleness suppresses the impact of low-quality local updates and accelerates convergence. To counteract statistical heterogeneity, a clustering-based client selection strategy enforces intra-cluster homogeneity and inter-cluster diversity, guiding the global model toward consistent optimization. Communication overhead is further reduced through model quantization for both uplink and downlink transmissions. Extensive experiments under heterogeneous conditions demonstrate that SACW outperforms state-of-the-art synchronous, asynchronous, and semi-asynchronous baselines in both convergence speed and final accuracy.

Author Contributions

Conceptualization, S.M. and Z.C.; Methodology, S.M. and Z.C.; software, S.M.; validation, F.M.; formal analysis, F.M. and Z.C.; investigation, F.M.; resources, Y.L.; data curation, Y.L.; Writing—original draft preparation, S.L.; Writing—review and editing, S.L.; visualization, S.M.; supervision, F.S.; project administration, F.S.; funding acquisition, F.S. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is jointly supported by the National Natural Science Foundation of China (Grant No. 62302540, author F.F.S.; https://www.nsfc.gov.cn (accessed on 20 October 2025)), the Open Foundation of the Henan Key Laboratory of Cyberspace Situation Awareness (Grant No. HNTS2022020, author F.F.S.; http://xt.hnkjt.gov.cn/data/pingtai/ (accessed on 20 October 2025)), and the Key Research and Development Program of Henan Province (Grant No. 251111212000, author F.F.S.; http://xt.hnkjt.gov.cn/data/ (accessed on 20 October 2025)).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

We only use publicly available datasets. The MNIST dataset can be found at http://yann.lecun.com/exdb/mnist (accessed on 20 October 2025). The Fashion MNIST dataset can be found at https://github.com/zalandoresearch/fashion-mnist (accessed on 20 October 2025). The CIFAR-10 dataset can be found at http://www.cs.toronto.edu/~kriz/cifar.html (accessed on 20 October 2025).

Acknowledgments

The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

Appendix A

Appendix A.1. Evaluation Metrics

To comprehensively evaluate the global model obtained from federated training, we introduce three additional metrics: precision (Pre), recall (Rec), and F1-score (F1).
$$\mathrm{Pre} = \frac{TP}{TP + FP}$$
$$\mathrm{Rec} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Pre} \times \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}}$$
where TP (true positive) denotes the number of correctly predicted positive instances, TN (true negative) the number of correctly predicted negative instances, FP (false positive) the number of instances incorrectly classified as positive, and FN (false negative) the number of instances incorrectly classified as negative.
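A minimal sketch of these three metrics computed per class from confusion-matrix counts (the function name and example counts are illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from confusion-matrix counts."""
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f1

print(precision_recall_f1(tp=80, fp=10, fn=20))  # -> (0.888..., 0.8, 0.842...)
```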

Appendix A.2. Performance Comparison

We adopt the same experimental settings as in Section 5.3 and evaluate the performance of the global federated learning model using three metrics: precision (Pre), recall (Rec), and F1-score (F1). For each dataset, all four federated learning methods were executed for an identical duration; the final reported result is the mean of the three highest values recorded during this training window. The experimental results are presented in Table A1. The experimental results demonstrate that our algorithm consistently outperforms the baseline methods across various datasets and under different levels of heterogeneity.
Table A1. Performance comparison of federated learning methods in heterogeneous scenarios.

| Dataset | Method | Pre (Dir 0.5) | Rec (Dir 0.5) | F1 (Dir 0.5) | Pre (Dir 0.1) | Rec (Dir 0.1) | F1 (Dir 0.1) |
|---|---|---|---|---|---|---|---|
| MNIST | FedAvg | 0.8930 | 0.8912 | 0.8921 | 0.8815 | 0.8715 | 0.8765 |
| MNIST | FedBUFF | 0.9179 | 0.9092 | 0.9135 | 0.9109 | 0.9010 | 0.9059 |
| MNIST | QuAFL | 0.9238 | 0.9227 | 0.9232 | 0.8883 | 0.8805 | 0.8844 |
| MNIST | SACW | 0.9396 | 0.9400 | 0.9398 | 0.9272 | 0.9261 | 0.9266 |
| Fashion-MNIST | FedAvg | 0.7063 | 0.6984 | 0.7023 | 0.6742 | 0.6391 | 0.6562 |
| Fashion-MNIST | FedBUFF | 0.7879 | 0.7929 | 0.7904 | 0.7513 | 0.7576 | 0.7544 |
| Fashion-MNIST | QuAFL | 0.8767 | 0.8830 | 0.8848 | 0.8655 | 0.8540 | 0.8597 |
| Fashion-MNIST | SACW | 0.9008 | 0.9018 | 0.9013 | 0.8915 | 0.8878 | 0.8896 |
| CIFAR-10 | FedAvg | 0.7452 | 0.7389 | 0.7420 | 0.6378 | 0.5899 | 0.6129 |
| CIFAR-10 | FedBUFF | 0.7869 | 0.7643 | 0.7754 | 0.6836 | 0.6649 | 0.6741 |
| CIFAR-10 | QuAFL | 0.8021 | 0.7912 | 0.7966 | 0.7337 | 0.7172 | 0.7253 |
| CIFAR-10 | SACW | 0.8164 | 0.8105 | 0.8134 | 0.7839 | 0.7489 | 0.7660 |

References

  1. Wang, X.; Han, Y.; Wang, C.; Zhao, Q.; Chen, X.; Chen, M. In-edge AI: Intelligentizing mobile edge computing, caching and communication by federated learning. IEEE Netw. 2019, 33, 156–165. [Google Scholar] [CrossRef]
  2. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  3. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and open problems in federated learning. Found. Trends® Mach. Learn. 2021, 14, 1–210. [Google Scholar] [CrossRef]
  4. Zhang, T.; Gao, L.; He, C.; Zhang, M.; Krishnamachari, B.; Avestimehr, A.S. Federated learning for the internet of things: Applications, challenges, and opportunities. IEEE Internet Things Mag. 2022, 5, 24–29. [Google Scholar] [CrossRef]
  5. Li, L.; Fan, Y.; Tse, M.; Lin, K.Y. A review of applications in federated learning. Comput. Ind. Eng. 2020, 149, 106854. [Google Scholar] [CrossRef]
  6. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag. 2020, 37, 50–60. [Google Scholar] [CrossRef]
  7. Fu, L.; Zhang, H.; Gao, G.; Zhang, M.; Liu, X. Client selection in federated learning: Principles, challenges, and opportunities. IEEE Internet Things J. 2023, 10, 21811–21819. [Google Scholar] [CrossRef]
  8. Mayhoub, S.; Shami, T.M. A review of client selection methods in federated learning. Arch. Comput. Methods Eng. 2024, 31, 1129–1152. [Google Scholar] [CrossRef]
  9. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics; PMLR: New York, NY, USA, 2017; pp. 1273–1282. [Google Scholar] [CrossRef]
  10. Imteaj, A.; Thakker, U.; Wang, S.; Li, J.; Amini, M.H. A survey on federated learning for resource-constrained IoT devices. IEEE Internet Things J. 2021, 9, 1–24. [Google Scholar] [CrossRef]
  11. Sasindran, Z.; Yelchuri, H.; Prabhakar, T.V. Ed-Fed: A generic federated learning framework with resource-aware client selection for edge devices. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 18–23 June 2023; IEEE: New York City, NY, USA, 2023; pp. 1–8. [Google Scholar] [CrossRef]
  12. Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated learning with non-IID data. arXiv 2018, arXiv:1806.00582. [Google Scholar] [CrossRef]
  13. Cheng, S.L.; Yeh, C.Y.; Chen, T.-A.; Pastor, E.; Chen, M.S. FedGCR: Achieving Performance and Fairness for Federated Learning with Distinct Client Types via Group Customization and Reweighting. Proc. AAAI Conf. Artif. Intell. 2024, 38, 11498–11506. [Google Scholar] [CrossRef]
  14. Xie, C.; Koyejo, S.; Gupta, I. Asynchronous federated optimization. arXiv 2019, arXiv:1903.03934. [Google Scholar]
  15. Liu, J.; Xu, H.; Xu, Y.; Ma, Z.; Wang, Z.; Qian, C.; Huang, H. Communication-efficient asynchronous federated learning in resource-constrained edge computing. Comput. Netw. 2021, 199, 108429. [Google Scholar] [CrossRef]
  16. Xu, C.; Qu, Y.; Xiang, Y.; Gao, L. Asynchronous federated learning on heterogeneous devices: A survey. Comput. Sci. Rev. 2023, 50, 100595. [Google Scholar] [CrossRef]
  17. Deng, Y.; Lyu, F.; Ren, J.; Wu, H.; Zhou, Y.; Zhang, Y.; Shen, X. AUCTION: Automated and quality-aware client selection framework for efficient federated learning. IEEE Trans. Parallel Distrib. Syst. 2021, 33, 1996–2009. [Google Scholar] [CrossRef]
  18. Ye, M.; Fang, X.; Du, B.; Yuen, P.C.; Tao, D. Heterogeneous Federated Learning: State-of-the-art and Research Challenges. ACM Comput. Surv. 2024, 56, 79. [Google Scholar] [CrossRef]
  19. Li, Q.; He, B.; Song, D. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10713–10722. [Google Scholar] [CrossRef]
  20. Li, X.C.; Zhan, D.C. FedRS: Federated learning with restricted softmax for label distribution non-IID data. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery Data Mining, Singapore, 14–18 August 2021; pp. 995–1005. [Google Scholar] [CrossRef]
  21. Zhang, J.; Li, Z.; Li, B.; Xu, J.; Wu, S.; Ding, S.; Wu, C. Federated learning with label distribution skew via logits calibration. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; PMLR: New York, NY, USA, 2022; pp. 26311–26329. [Google Scholar] [CrossRef]
  22. Nguyen, J.; Malik, K.; Zhan, H.; Yousefpour, A.; Rabbat, M.; Malek, M.; Huba, D. Federated learning with buffered asynchronous aggregation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 28–30 March 2022; PMLR: New York, NY, USA, 2022; pp. 3581–3607. [Google Scholar]
  23. Li, Y.; Yang, S.; Ren, X.; Shi, L.; Zhao, C. Multi-Stage Asynchronous Federated Learning With Adaptive Differential Privacy. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 1243–1256. [Google Scholar] [CrossRef]
  24. Ma, Q.; Xu, Y.; Xu, H.; Jiang, Z.; Huang, L.; Huang, H. FedSA: A semi-asynchronous federated learning mechanism in heterogeneous edge computing. IEEE J. Sel. Areas Commun. 2021, 39, 3654–3672. [Google Scholar] [CrossRef]
  25. Wu, W.; He, L.; Lin, W.; Mao, R.; Maple, C.; Jarvis, S. SAFA: A semi-asynchronous protocol for fast federated learning with low overhead. IEEE Trans. Comput. 2020, 70, 655–668. [Google Scholar] [CrossRef]
  26. Zakerinia, H.; Talaei, S.; Nadiradze, G.; Alistarh, D. Communication-efficient federated learning with data and client heterogeneity. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Valencia, Spain, 2–4 May 2024; PMLR: New York, NY, USA, 2024; pp. 3448–3456. [Google Scholar] [CrossRef]
  27. Konečný, J.; McMahan, B.; Ramage, D. Federated optimization: Distributed optimization beyond the datacenter. arXiv 2015, arXiv:1511.03575. [Google Scholar] [CrossRef]
  28. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), Portland, OR, USA, 2–4 August 1996; pp. 226–231. [Google Scholar] [CrossRef]
  29. Davies, P.; Gurunathan, V.; Moshrefi, N.; Ashkboos, S.; Alistarh, D. New bounds for distributed mean estimation and variance reduction. arXiv 2020, arXiv:2002.09268. [Google Scholar]
  30. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  31. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747. [Google Scholar] [CrossRef]
  32. Krizhevsky, A. Dataset: Learning Multiple Layers of Features from Tiny Images. 2024. Available online: https://service.tib.eu/ldmservice/dataset/learning-multiple-layers-of-features-from-tiny-images (accessed on 20 October 2025). [CrossRef]
  33. Guo, S.; Yang, X.; Feng, J.; Ding, Y.; Wang, W.; Feng, Y.; Liao, Q. FedGR: Federated learning with gravitation regulation for double imbalance distribution. In International Conference on Database Systems for Advanced Applications; Springer Nature Switzerland: Cham, Switzerland, 2023; pp. 703–718. [Google Scholar]
  34. Tan, A.Z.; Yu, H.; Cui, L.; Yang, Q. Towards personalized federated learning. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 9587–9603. [Google Scholar] [CrossRef] [PubMed]
  35. He, Y.; Tan, H.; Luo, W.; Mao, H.; Ma, D.; Feng, S.; Fan, J. MR-DBSCAN: An Efficient Parallel Density-Based Clustering Algorithm Using MapReduce. In Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems, Tainan, Taiwan, 7–9 December 2011; pp. 473–480. [Google Scholar] [CrossRef]
  36. Zhu, L.; Liu, Z.; Han, S. Deep leakage from gradients. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 14774–14784. [Google Scholar]
  37. Dwork, C. Differential privacy. In Proceedings of the 33rd International Conference on Automata, Languages and Programming—Volume Part II (ICALP'06), Venice, Italy, 10–14 July 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1–12. [Google Scholar]
  38. Shan, F.; Lu, Y.; Li, S.; Mao, S.; Li, Y.; Wang, X. Efficient adaptive defense scheme for differential privacy in federated learning. J. Inf. Secur. Appl. 2025, 89, 103992. [Google Scholar] [CrossRef]
  39. Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27. [Google Scholar] [CrossRef]
  40. Diao, Y.; Li, Q.; He, B. Exploiting Label Skews in Federated Learning with Model Concatenation. Proc. AAAI Conf. Artif. Intell. 2024, 38, 11784–11792. [Google Scholar] [CrossRef]
Figure 1. The framework diagram of FedAvg, where both clients and the server operate in a synchronous manner.
Figure 2. The framework diagram of SACW, where clients train asynchronously while the server synchronously aggregates client models.
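As a hedged illustration of the caption's split between asynchronous client training and synchronous server aggregation, the sketch below assumes the server keeps whichever locally trained state_dicts have arrived by an aggregation point and averages them with normalized scalar weights; the helper `weighted_average` and the toy tensors are illustrative assumptions, not SACW's exact aggregation rule.

```python
import torch

def weighted_average(client_states, client_weights):
    """Average client state_dicts with normalized scalar weights.

    `client_states` holds whatever locally trained models have reached the
    server by this synchronous aggregation point; stragglers simply miss the
    round and are considered at a later aggregation.
    """
    total = float(sum(client_weights))
    keys = client_states[0].keys()
    return {k: sum((w / total) * s[k].float()
                   for s, w in zip(client_states, client_weights))
            for k in keys}

# Minimal usage with two toy single-tensor "models".
s1 = {"fc.weight": torch.ones(2, 2)}
s2 = {"fc.weight": torch.zeros(2, 2)}
print(weighted_average([s1, s2], [0.7, 0.3])["fc.weight"])  # all entries 0.7
```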
Figure 3. Under the CIFAR-10 dataset, this figure shows the data distributions across clients under varying degrees of statistical heterogeneity. (a) IID, p_k ∼ Dir(+∞); (b) Non-IID, p_k ∼ Dir(0.5); (c) Non-IID, p_k ∼ Dir(0.1).
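The Dir(β) settings shown in Figure 3 are commonly simulated by splitting each class across clients with Dirichlet-drawn proportions, where a smaller β produces stronger label skew. A minimal sketch under that assumption (the helper `dirichlet_partition` is illustrative, not necessarily the partitioning code used in the paper):

```python
import numpy as np

def dirichlet_partition(labels, num_clients, beta, seed=0):
    """Split sample indices across clients; per class, proportions ~ Dir(beta)."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        proportions = rng.dirichlet([beta] * num_clients)
        cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, cut_points)):
            client_indices[client_id].extend(part.tolist())
    return client_indices

labels = np.random.randint(0, 10, size=5000)   # stand-in for CIFAR-10 labels
parts = dirichlet_partition(labels, num_clients=50, beta=0.5)
print([len(p) for p in parts[:5]])             # uneven client dataset sizes
```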
Figure 4. Under the MNIST dataset, four algorithms are compared under two distinct heterogeneity levels. (a) p_k ∼ Dir(0.5); (b) p_k ∼ Dir(0.1).
Figure 5. Under the Fashion MNIST dataset, four algorithms are compared under two distinct heterogeneity levels. (a) p_k ∼ Dir(0.5); (b) p_k ∼ Dir(0.1).
Figure 6. Under the CIFAR-10 dataset, four algorithms are compared under two distinct heterogeneity levels. (a) p_k ∼ Dir(0.5); (b) p_k ∼ Dir(0.1).
Figure 7. A parameter analysis of SACW under the MNIST dataset. Subfigures (a–c) illustrate the effects of b, v, and λ, respectively. All experiments are performed under identical conditions (Non-IID, p_k ∼ Dir(0.5), N = 50).
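Among the parameters examined in Figure 7, b is the quantization bit-width (see Table 1). As a hedged sketch of what a simple b-bit uniform quantizer of a model update might look like (not necessarily the quantizer used by SACW), consider:

```python
import torch

def quantize_dequantize(t, b):
    """Uniformly quantize a tensor to 2**b levels, then dequantize."""
    levels = 2 ** b - 1
    lo, hi = t.min(), t.max()
    scale = (hi - lo) / levels if hi > lo else torch.tensor(1.0)
    q = torch.round((t - lo) / scale)   # integer codes in [0, levels]
    return q * scale + lo               # dequantized approximation

x = torch.randn(4)
print(x)
print(quantize_dequantize(x, b=4))      # larger b -> closer to the original
```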
Figure 8. Under the CIFAR-10 dataset (p_k ∼ Dir(0.1)), we analyze the impact of the parameters ϵ and MinPts on SACW.
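As a hedged illustration of the role of ϵ and MinPts, the sketch below clusters clients by their local label-distribution vectors with scikit-learn's DBSCAN, whose eps and min_samples arguments correspond to those two parameters; the feature representation and the specific values are assumptions for illustration, not necessarily those used by SACW.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Toy label distributions for 50 clients over 10 classes (each row sums to 1).
label_dists = rng.dirichlet(alpha=[0.1] * 10, size=50)

# eps plays the role of epsilon, min_samples the role of MinPts.
clusters = DBSCAN(eps=0.5, min_samples=3).fit_predict(label_dists)
print(clusters)   # cluster id per client; -1 marks noise/outlier clients
```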
Table 1. Descriptions of notations in this paper.
N: total number of clients
R: number of global model aggregation rounds
r: current round number
T: total training time of the global model
L: maximum number of local training steps
K: number of clusters for clustering all clients
k_i: the i-th client
c_i: adaptive weight of the i-th client
C_i: the i-th cluster
S: set of selected clients
s_i: selected representative client of C_i
D_i: dataset of client k_i
|·|: size of a dataset
l_i(·): loss function of k_i
p_i: weight factor of k_i
β: parameter of the Dirichlet probability distribution
τ_i: model version number of k_i
ω_i^{τ_i}: local model of k_i after local training
ω^r: global model in the r-th round
b: quantization bit-width
v: server visiting time
h: server handling time
m_i: actual local training steps of k_i
P_i(y): local label distribution of k_i
G_i(·): local gradient computation function of k_i
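As a purely hypothetical example of how the notations above could interact, the sketch below turns each client's data volume |D_i| and staleness gap r − τ_i into a normalized weight, with a decay factor lam discounting stale updates; this formula is an illustrative assumption, not the adaptive weighting rule defined by SACW.

```python
import math

def adaptive_weights(data_sizes, versions, current_round, lam=0.5):
    """Hypothetical weights: data volume discounted by exponential staleness decay."""
    raw = [n * math.exp(-lam * (current_round - tau))
           for n, tau in zip(data_sizes, versions)]
    total = sum(raw)
    return [w / total for w in raw]

# Three clients: the stalest one (version 7 at round 10) gets discounted most.
print(adaptive_weights(data_sizes=[1200, 800, 500],
                       versions=[10, 9, 7], current_round=10))
```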
Table 2. Experiment environment configuration.
CPU: Intel(R) Core(TM) i9-12900K (Intel Corporation, Santa Clara, CA, USA)
GPU: NVIDIA RTX A5000 (NVIDIA Corporation, Santa Clara, CA, USA)
RAM (random access memory): 64 GB
Programming language: Python 3.8
PyTorch version: 1.8.2
CUDA toolkit: 11.6
Table 3. Ablation experiment results (✓ indicates the module is used, × indicates it is not used).
DBCS | ALMW | MNIST β=0.1 | MNIST β=0.5 | Fashion MNIST β=0.1 | Fashion MNIST β=0.5 | CIFAR-10 β=0.1 | CIFAR-10 β=0.5
× | × | 0.8928 | 0.9255 | 0.8548 | 0.8843 | 0.6541 | 0.786
✓ | × | 0.9105 | 0.9391 | 0.8689 | 0.8923 | 0.6705 | 0.7968
× | ✓ | 0.9072 | 0.9324 | 0.875 | 0.8913 | 0.6758 | 0.8009
✓ | ✓ | 0.9276 | 0.9477 | 0.8902 | 0.9016 | 0.7486 | 0.8164