Article

FedDCS: Semi-Asynchronous Federated Learning Optimization Based on Dynamic Client Selection

1
Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
2
University of Chinese Academy of Sciences, Beijing 100049, China
*
Author to whom correspondence should be addressed.
Mathematics 2026, 14(5), 803; https://doi.org/10.3390/math14050803
Submission received: 15 January 2026 / Revised: 4 February 2026 / Accepted: 22 February 2026 / Published: 27 February 2026
(This article belongs to the Special Issue Advances in Blockchain and Intelligent Computing)

Abstract

Federated Learning (FL) represents a promising paradigm for collaborative model training across numerous devices, preserving data locality and offering potential privacy benefits for industries such as finance, healthcare, and Internet of Things (IoT). Nonetheless, real-world deployments of FL encounter challenges arising from dynamic and diverse environments, which adversely affect training speed and model convergence. To address these issues, this paper introduces FedDCS, an adaptive federated learning framework that effectively manages resources during training through two primary innovations. First, it establishes a reliable method for predicting client training durations, estimating completion times while filtering noise and detecting performance variations. Second, it implements a two-stage adaptive waiting strategy that dynamically determines the optimal timing and selection of client batches for aggregation, thereby balancing collection efficiency with model accuracy. This approach optimizes the trade-off between efficiency and accuracy in heterogeneous settings. Extensive evaluations on datasets such as Fashion-MNIST and CIFAR-10/100, incorporating simulated device and data heterogeneity, demonstrate that FedDCS consistently achieves superior time efficiency and higher global model accuracy compared to state-of-the-art synchronous, asynchronous, and semi-asynchronous baselines. Its robustness and versatility render it effective across various complex and heterogeneous environments.

1. Introduction

Amid increasingly stringent data privacy regulations and the exponential growth of data generated at the network edge, Federated Learning (FL) has emerged as a successful and promising distributed machine learning paradigm [1]. It enables multiple clients to collaboratively train a global machine learning model while keeping all raw data localized. By updating model parameters instead of raw datasets, FL preserves data privacy and effectively leverages the computational resources of each client. It offers viable, practical technical pathways for deploying artificial intelligence in privacy-sensitive domains, such as finance, healthcare, and Internet of Things (IoT).
To satisfy the dual demands of efficiency and system performance in practical deployments, FL has evolved through distinct training paradigms. The synchronous paradigm [1,2,3,4,5] ensures stable client model updates but suffers from the straggler problem and low resource utilization. Conversely, the asynchronous paradigm [6,7,8,9] maximizes hardware utilization but risks model instability due to stale updates. To balance these extremes, semi-asynchronous paradigms [10,11,12,13] have emerged as a promising middle ground, employing mechanisms like server-side buffering to aggregate a subset of timely updates, thereby mitigating the key limitations of both extremes.
However, the system performance of FL is fundamentally limited by two intertwined challenges. First, device heterogeneity [14]—stemming from diverse device capabilities and unstable networks—creates the straggler problem and volatile client availability, crippling time efficiency. Second, data heterogeneity [15] inherent in distributed sources leads to non-IID client data, which biases local client training, impedes global model convergence, and degrades final model accuracy.
In addition, common semi-asynchronous methods typically rely on static aggregation policies, such as fixed buffer sizes [16]. Such rigidity cannot adapt to the dynamic environments described above: a high threshold reverts the method to inefficient synchronous waiting, while a low threshold exposes it to the instability of asynchronous updates. Consequently, the absence of an adaptive FL strategy that dynamically responds to real-time system and data states remains a pivotal gap, limiting the robustness and performance of FL in real-world heterogeneous settings.
To this end, this paper proposes FedDCS, a semi-asynchronous FL framework designed to dynamically adapt to the evolving training environment. It develops efficient, predictive, server-side orchestration strategies for real-time aggregation decisions that balance efficiency and robustness across device and data heterogeneity. Our contributions are summarized as follows:
  • We propose an adaptive method for predicting client completion times to optimize aggregation scheduling. This approach uses a refined exponential smoothing prediction model, with key innovations including outlier filtering and change-point detection. This combined design offers robust predictions that are resistant to outlier interference while maintaining continuous adaptability to dynamic environmental changes.
  • We propose a two-stage waiting mechanism grounded in the prediction of client completion times. Building upon adaptive prediction techniques, the server orchestrates the aggregation process dynamically. It intelligently adjusts the waiting time across two distinct stages, thereby optimizing the trade-off between the number of aggregated client updates and the overall aggregation duration. This design allows FedDCS to independently balance time efficiency and model quality within heterogeneous environments.
  • We provide a convergence analysis for FedDCS in the smooth non-convex setting and evaluate FedDCS across a variety of highly diverse and challenging environments. The experimental results show that FedDCS consistently outperforms leading synchronous, asynchronous, and semi-asynchronous baselines in both convergence speed and final model accuracy. It also demonstrates exceptional robustness under extreme conditions with severe client staleness or highly skewed data distributions.

2. Related Work

Heterogeneity has been a fundamental challenge since the inception of FL. Existing research can be broadly categorized into three types, namely synchronous FL, asynchronous FL, and semi-asynchronous FL.
Numerous efforts within the synchronous paradigm have attempted to address it. For instance, TiFL [17] employs a tiering strategy that groups clients with similar performance profiles and samples clients from the same tier per round to mitigate stragglers, adjusting the sampling probability per tier based on its performance. FedAnp [18] adopts a dynamic sampling strategy, starting with the fastest clients and progressively doubling the participant pool as performance thresholds are met. To tackle statistical heterogeneity, FedProx [19] modifies the local objective function by adding a correction term that penalizes deviation from the global model, thereby reducing client drift. From a different perspective, InclusiveFL [20] assigns models of varying sizes to devices based on capability and employs knowledge distillation from stronger to weaker devices. While these synchronous methods introduce valuable innovations for handling specific aspects of heterogeneity, they often address one dimension at the expense of another or introduce new overheads, and their core design remains bound to the inefficient round-based synchronization barrier.
To fundamentally circumvent the straggler problem, asynchronous paradigms were proposed. Similar asynchronous methods have been leveraged in fields such as UAV-assisted computing [21] to handle device heterogeneity by triggering decisions per task arrival rather than on fixed timelines. To address the critical issue of staleness, a temporally weighted aggregation strategy assigns layer-wise weights when aggregating on shallow and deep layers to dampen the impact of stale gradients [22]. FedFa [23] employs a sliding window of historical models; each new update triggers an aggregation of all models within the window to mitigate bias from any single client. Beyond server-side scheduling, client-initiated protocols such as TEA-Fed [24] offer flexibility by enabling idle clients to request training tasks from the server proactively. Similarly, FAVANO [25] has all clients train concurrently, with the server randomly sampling clients whose updates are uploaded and aggregated immediately; the sampled clients then restart local training with the fresh global model. Another method [26] employs a probabilistic upload mechanism in which the server computes an upload probability for each client, and the client decides whether to transmit its update upon completion based on this probability. In summary, while these asynchronous and hybrid methods effectively eliminate idle waiting and offer greater design flexibility, they still fail to fully resolve the convergence instability issues associated with asynchronous updates.
To mitigate the aforementioned issues, semi-asynchronous approaches are proposed. For example, one method employs GANs to generate synthetic data, balances node capacity, and uses a priority function to select clients for aggregation while other clients continue local training [27]. HySync [28] maintains a dynamic aggregation window triggered by an update and aggregates all receipts within a preset timeframe. FedAT [29] groups clients by performance for intra-group synchronous training and inter-group asynchronous aggregation. CE-AFL [30] proposes an SQP-PA algorithm to determine an optimal buffer size for specific scenarios. ASAFL [31] further uses environmental information to dynamically adjust the aggregation size per round. FedSA [12] divides training into two phases: multi-epoch local training under semi-asynchronous aggregation, followed by fewer epochs with asynchronous updates. Other methods introduce novel aggregation rules. Port [13] determines client aggregation weights based on the cosine similarity between the client update and the global model update. ASFL [32] defines a staleness tolerance, requiring clients that exceed this threshold to restart training with the latest global model. CA2FL [11] enhances FedBuff [10] by maintaining a cache of the most recent model from each client and by aggregating only model deltas to mitigate staleness. From a systems perspective, Fed-SASL [33] employs split learning, deploying edge servers to assist stragglers in completing training for later DNN layers. While these methods demonstrate the flexibility of the semi-asynchronous design space, a common and critical limitation persists: they predominantly rely on predefined, static parameters (e.g., fixed group sizes, tolerance thresholds, phase durations). Consequently, they may lack the ability to adapt dynamically and holistically to the continuously evolving training environment characterized by both device and data heterogeneity. 
To address these limitations, we propose FedDCS, a novel semi-asynchronous FL framework that reduces reliance on static configurations. Unlike the aforementioned methods, FedDCS dynamically determines round-specific training parameters through a predictive, adaptive mechanism that continuously learns from client behavior, enabling robust performance in dynamically heterogeneous environments.

3. Methodology

The proposed FedDCS (Federated learning based on Dynamic Client Selection) framework leverages an adaptive two-stage waiting mechanism to predict client training time, enabling real-time decision-making and dynamically balancing the trade-off between training speed and model quality. As illustrated in the framework diagram (Figure 1), FedDCS operates through four coordinated stages: (1) It estimates each client’s completion time using a robust prediction model resistant to outliers and change points. (2) It determines the optimal buffer size by identifying a cohort of the fastest clients. (3) It selects clients for aggregation in this round by using a two-stage waiting mechanism. (4) It assigns aggregation weights by jointly considering client staleness and data volume to update the global model.
The following sections detail each component. For clarity, frequently used notations are summarized in Table 1.

3.1. Client Training Time Prediction

Accurate prediction of client completion time is the foundation of dynamic client scheduling in this study. To enhance prediction accuracy and ensure robustness against real-world noise and performance shifts, we propose an adaptive exponential smoothing model that integrates outlier filtering and change-point detection. For a given client $i$, the predicted completion time for round $t$, denoted as $p_i^t$, is computed based on the previous prediction $p_i^{t-1}$ and the newly observed completion time $o_i^t$:
$$p_i^t = \eta\, o_i^t + (1 - \eta)\, p_i^{t-1}$$
Here, $\eta$ is the smoothing coefficient ($0 \le \eta \le 1$), which governs the model’s sensitivity to recent observations. A fixed $\eta$, however, struggles to adapt to complex, dynamic environments. To address this, we design a dynamic adjustment mechanism that allows $\eta$ to switch adaptively among three predefined states based on the identified system condition:
$$\eta = \begin{cases} 0 & \text{if an outlier is detected} \\ \eta_{\text{mutation}} & \text{if a change point is detected} \\ \eta_{\text{normal}} & \text{under normal conditions} \end{cases}$$
To prevent prediction deviation caused by transient fluctuations in client performance (e.g., brief network interruptions or resource contention), this study incorporates an outlier detection mechanism based on the Interquartile Range (IQR) [34]. Upon receiving a new actual completion time $o_i^t$ for a client $i$, the server first verifies the sufficiency of that client’s historical time records. If sufficient data are available, the following procedure is executed: the first quartile $Q_1$ and the third quartile $Q_3$ of the historical completion times are calculated, yielding the IQR:
$$\mathrm{IQR} = Q_3 - Q_1$$
The normal value range is defined as $[Q_1 - 1.5\,\mathrm{IQR},\; Q_3 + 1.5\,\mathrm{IQR}]$. If $o_i^t$ falls outside this interval, it is flagged as an outlier. In such a case, the smoothing coefficient $\eta$ is temporarily set to 0, effectively filtering out this temporary disturbance by ignoring the anomalous observation in the prediction update.
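As a concrete illustration, the IQR rule above can be sketched in a few lines; the function name and the minimum-history threshold are illustrative choices, not from the paper:

```python
import statistics

def is_outlier(history, new_obs, k=1.5):
    """IQR-based outlier check over a client's historical completion times.
    Returns True when new_obs lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(history) < 4:                             # too few records: treat as normal
        return False
    q1, _, q3 = statistics.quantiles(history, n=4)   # Q1, median, Q3
    iqr = q3 - q1
    return not (q1 - k * iqr <= new_obs <= q3 + k * iqr)
```

For example, against a history of completion times clustered around 10–12 s, an observation of 30 s would be flagged and ignored in the smoothing update.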
To enable the prediction model to quickly adapt to new states when a client’s computational or network environment undergoes a persistent change (e.g., a permanent increase in device load or a network bandwidth upgrade), this study employs a Cumulative Sum (CUSUM) [35] for change point detection. Specifically, the prediction residual of client $i$ in the current round is first calculated as
$$e_i^t = o_i^t - p_i^{t-1}$$
and the mean $\mu_i^e$ and standard deviation $\sigma_i^e$ of the residuals are updated online. Subsequently, the positive and negative cumulative deviations, $S_t^+$ and $S_t^-$, are computed as follows:
$$S_t^+ = \max\!\big(0,\; S_{t-1}^+ + e_i^t - \lambda \sigma_i^e\big), \qquad S_t^- = \min\!\big(0,\; S_{t-1}^- + e_i^t + \lambda \sigma_i^e\big)$$
where $\lambda$ is a sensitivity parameter.
Formally, this can be framed as a hypothesis test in which the null hypothesis $H_0$ assumes no persistent change in the client’s performance. The test statistics are $S_t^+$ and $S_t^-$ themselves, and the decision rule is to reject $H_0$ if $S_t^+ > 3\sigma_i^e$ or $S_t^- < -3\sigma_i^e$. Under the assumption that the prediction residuals are approximately i.i.d. and normally distributed around zero under $H_0$, this threshold corresponds to a Type I error rate of about 0.27% per test, since $P(|\mathcal{N}(0,1)| > 3) \approx 0.0027$. This provides a statistically grounded balance between sensitivity and robustness against random fluctuations.
A change point is flagged if $S_t^+$ exceeds the threshold $3\sigma_i^e$ or if $S_t^-$ falls below $-3\sigma_i^e$. Upon detecting a change point, the server sets the smoothing coefficient $\eta$ to a higher value $\eta_{\text{mutation}}$ for several subsequent rounds. This allows the exponential smoothing model to rapidly adjust by assigning greater weight to recent observations. Concurrently, the majority of that client’s historical completion records are purged to facilitate quick “forgetting” of the outdated pattern and adaptation to the new environment.
If neither outlier detection nor change point detection is triggered, the prediction model updates smoothly using the standard coefficient $\eta_{\text{normal}}$.
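Putting the three states together, the predictor can be sketched as a single class; the class name, the warm-up constants, and the number of boosted rounds after a change point are illustrative assumptions, and the online residual statistics use a standard Welford update:

```python
import statistics

def _is_iqr_outlier(history, obs, k=1.5):
    """IQR rule over a client's history; too-short histories count as normal."""
    if len(history) < 4:
        return False
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return not (q1 - k * iqr <= obs <= q3 + k * iqr)

class CompletionTimePredictor:
    """Sketch of the adaptive exponential-smoothing predictor: eta = 0 on an
    IQR outlier, eta = eta_mutation for a few rounds after a CUSUM change
    point, and eta = eta_normal otherwise."""

    def __init__(self, eta_normal=0.3, eta_mutation=0.8, lam=0.5, boost_rounds=3):
        self.eta_normal, self.eta_mutation = eta_normal, eta_mutation
        self.lam = lam                       # CUSUM sensitivity parameter lambda
        self.boost_rounds = boost_rounds     # rounds with eta_mutation after a change
        self.pred = None                     # current prediction p_i^t
        self.history = []                    # past observed completion times
        self.s_pos = self.s_neg = 0.0        # CUSUM statistics S_t^+, S_t^-
        self.res_mean, self.res_m2, self.res_n = 0.0, 0.0, 0
        self.boost_left = 0

    def _update_residual_stats(self, e):
        # Welford's online update of residual mean and (unscaled) variance
        self.res_n += 1
        d = e - self.res_mean
        self.res_mean += d / self.res_n
        self.res_m2 += d * (e - self.res_mean)

    def update(self, obs):
        if self.pred is None:                # first observation bootstraps the model
            self.pred = obs
            self.history.append(obs)
            return self.pred
        e = obs - self.pred                  # prediction residual e_i^t
        self._update_residual_stats(e)
        sigma = (self.res_m2 / max(self.res_n - 1, 1)) ** 0.5
        # CUSUM accumulation with slack lambda * sigma
        self.s_pos = max(0.0, self.s_pos + e - self.lam * sigma)
        self.s_neg = min(0.0, self.s_neg + e + self.lam * sigma)
        if sigma > 0 and (self.s_pos > 3 * sigma or self.s_neg < -3 * sigma):
            self.boost_left = self.boost_rounds
            self.s_pos = self.s_neg = 0.0
            self.history = self.history[-1:]  # "forget" the outdated pattern
        if _is_iqr_outlier(self.history, obs):
            eta = 0.0                        # ignore the anomalous observation
        elif self.boost_left > 0:
            eta = self.eta_mutation
            self.boost_left -= 1
        else:
            eta = self.eta_normal
        self.pred = eta * obs + (1 - eta) * self.pred
        self.history.append(obs)
        return self.pred
```

In a quick sanity run, a client stable at 10 s keeps a prediction of 10 s; after a persistent jump to 30 s, the first deviation is filtered as an outlier, CUSUM then fires, the history is purged, and the boosted smoothing pulls the prediction close to 30 s within a few rounds.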

3.2. Early-Batch Segmentation for Dynamic Buffer Sizing

With the predicted training times for all clients obtained, the next step is to intelligently determine the dynamic buffer size for each training round. Traditional clustering algorithms (e.g., K-means [36]) can group clients but incur high computational overhead and require a preset number of clusters, making them unsuitable for the real-time demands of FL. To address this, we propose an Early-Batch Segmentation Algorithm designed to efficiently identify a cohort of clients predicted to finish training earliest and to derive the buffer size from this cohort. The algorithm operates as outlined in Algorithm 1. The algorithm takes as input the set of predicted completion times for all clients and a positive hyperparameter ρ . It outputs two key parameters for the first-stage waiting mechanism: the dynamic buffer size K (the number of clients in the fastest batch) and the maximum waiting time T for this batch.
The procedure begins by sorting the predicted times in ascending order. It then calculates the sequence of gaps δ between consecutive sorted times. A dynamic threshold τ is computed as ρ times the average of these gaps. The algorithm then iterates through the sorted list, incrementally building the first batch. A client is included in the batch if the gap between its predicted time and the previous one does not exceed τ . The iteration stops at the first gap that exceeds τ , identifying a natural performance boundary. At this point, K is set to the count of clients included so far, and T is derived from the predicted completion time of the last client in this batch.
Algorithm 1 Early-Batch Segmentation Algorithm
Input: Predicted training completion times of clients $\text{end\_training\_time} = \{t_1, t_2, \ldots, t_n\}$; segmentation threshold hyperparameter $\rho \in \mathbb{R}^+$ ($\rho > 0$)
Output: Number of clients in the fastest batch $K \in \mathbb{N}$; maximum waiting time for the batch $T$
 1: $n \leftarrow \mathrm{Num}(\text{end\_training\_time})$
 2: $T_{sorted} \leftarrow \mathrm{SORT}(\text{end\_training\_time})$ // ascending: $T_{sorted}[1] \le T_{sorted}[2] \le \cdots \le T_{sorted}[n]$
 3: $G \leftarrow \emptyset$
 4: for $i \leftarrow 2$ to $n$ do
 5:     $\delta_i \leftarrow T_{sorted}[i] - T_{sorted}[i-1]$
 6:     $G \leftarrow G \cup \{\delta_i\}$
 7: end for
 8: $\mu \leftarrow \frac{1}{|G|} \sum_{\delta \in G} \delta$
 9: $\tau \leftarrow \rho\,\mu$
10: $K \leftarrow 1$, $T \leftarrow 0$
11: for $j \leftarrow 2$ to $n$ do
12:     if $\delta_j \le \tau$ then
13:         $K \leftarrow K + 1$
14:     else
15:         $T \leftarrow T_{sorted}[j-1] - \text{current time}$
16:         Exit the loop
17:     end if
18: end for
19: return $K$, $T$
The computational complexity of the Early-Batch Segmentation Algorithm is dominated by the sorting operation, which takes O ( n log n ) time for n clients. The subsequent steps—calculating gaps δ , computing the dynamic threshold τ , iterating to get K and T —each require only a single linear pass over the data, incurring O ( n ) overhead. Thus, the algorithm has an overall time complexity of O ( n log n ) .
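A minimal Python sketch of Algorithm 1 follows; one interpretive choice (ours, not the paper's) is that when no gap exceeds $\tau$, so all clients fall into a single batch, $T$ is derived from the last client's predicted time rather than left at 0:

```python
def early_batch_segmentation(pred_times, rho, current_time=0.0):
    """Sketch of Algorithm 1: identify the cohort of fastest clients.
    Returns the dynamic buffer size K and the first-stage wait T."""
    ts = sorted(pred_times)                      # ascending predicted finish times
    n = len(ts)
    if n < 2:
        return n, (ts[0] - current_time if ts else 0.0)
    gaps = [ts[i] - ts[i - 1] for i in range(1, n)]
    tau = rho * sum(gaps) / len(gaps)            # dynamic threshold: rho * mean gap
    k, t_wait = 1, ts[-1] - current_time         # default: all clients in one batch
    for j in range(1, n):
        if ts[j] - ts[j - 1] <= tau:
            k += 1
        else:                                    # first gap > tau: batch boundary
            t_wait = ts[j - 1] - current_time
            break
    return k, t_wait
```

For predicted times `[1, 2, 3, 10, 11]` with `rho = 1`, the mean gap is 2.5, the 3→10 gap breaks the batch, and the call returns `K = 3` with `T = 3`.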

3.3. The Two-Stage Waiting Mechanism

After obtaining the dynamic buffer size K , this study employs an innovative Two-Stage Waiting Mechanism to execute the client model collection process. The mechanism is designed to precisely control the waiting time per round, aiming to optimally balance the trade-off between “avoiding prolonged idle waits for slow clients” and “maximizing the number of client updates collected.” This balance aims to maximize the training efficiency of each communication round. The complete procedure is outlined in Algorithm 2.
The algorithm comprises two consecutive stages with distinct objectives:
  • First-Stage Waiting Mechanism: The first-stage wait is the primary period for client model collection, during which the server actively receives client updates. The value of $T_1$ is the $T$ output by Algorithm 1, set to the maximum predicted completion time among the clients in the identified fastest batch. As detailed in Algorithm 2, the server initiates a countdown timer for $T_1$ and begins receiving incoming client updates. This stage is designed to be adaptive through three core rules: (1) Upon receiving an update, the remaining wait time $T_1$ is reduced by a decay factor $\varphi$, dynamically shortening the wait as client updates arrive; (2) The wait terminates immediately if the buffer reaches its target capacity $K$, preventing unnecessary delay; (3) If the timer expires before the buffer is full, the stage ends forcefully to avoid indefinite stalling caused by extremely slow or unresponsive clients. These rules collectively ensure that the system is neither idle nor stalled, proactively advancing the training process.
  • Second-Stage Waiting Mechanism: Empirical observation reveals that a cluster of clients may complete training shortly after the buffer is full. To harness these “following” clients, the second-stage wait is triggered upon completion of the first stage. $T_2$ is implemented as a short, resettable timer. Its core logic is to dynamically extend the collection window: whenever a new client update arrives within the current $T_2$ window, the timer is reset. This creates a rolling window that persists as long as clients complete in close temporal succession, effectively capturing a naturally occurring cohort of clients. The stage concludes when a full $T_2$ duration elapses without any new arrival, signaling that the trailing cluster has been fully collected. The setting of $T_2$ is critical; an optimal value must balance the benefit of collecting additional clients against the cost of added latency.
Algorithm 2 Two-Stage Waiting Algorithm for Client Selection
Input: First-stage waiting time $T_1$; second-stage waiting time $T_2$; buffer size $K$; time decay coefficient $\varphi$
Output: Collected client set $B$ and the number of clients $n$
 1: $B \leftarrow \emptyset$, $n \leftarrow 0$
 2: $T_1^{remain} \leftarrow T_1$
 3: $t_{start} \leftarrow \text{current time}$
 4: while $n < K$ and $T_1^{remain} > 0$ do
 5:     Wait for a client arrival event
 6:     if client $i$ completes training at time $t_i$ then
 7:         $t_{wait} \leftarrow t_i - t_{start}$
 8:         $T_1^{remain} \leftarrow T_1^{remain} - \varphi\, t_{wait}$
 9:         $B \leftarrow B \cup \{\text{client } i\}$
10:         $n \leftarrow n + 1$
11:         $t_{start} \leftarrow t_i$ // reset reference time for the next wait
12:     end if
13: end while
14: if $n > 0$ then
15:     $hasNewClient \leftarrow \mathrm{True}$
16:     while $hasNewClient$ do
17:         $hasNewClient \leftarrow \mathrm{False}$
18:         while within $T_2$ time do
19:             if new client $j$ completes training then
20:                 $B \leftarrow B \cup \{\text{client } j\}$
21:                 $n \leftarrow n + 1$
22:                 $hasNewClient \leftarrow \mathrm{True}$
23:                 break
24:             end if
25:         end while
26:     end while
27: end if
28: return $B$, $n$
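For concreteness, the two stages can be replayed as an event-driven simulation over known arrival times (useful for offline analysis); treating real-time expiry as exhaustion of a budget consumed at rate $\varphi$ per unit of waiting is our interpretation of the decayed countdown, not a detail stated in the paper:

```python
def two_stage_collect(arrival_times, T1, T2, K, phi):
    """Simulate Algorithm 2: arrival_times[i] is client i's completion time
    (seconds from round start). Returns (collected client indices, wall-clock
    time at which collection ends)."""
    events = sorted(range(len(arrival_times)), key=lambda i: arrival_times[i])
    collected, t1_remain, t_start = [], T1, 0.0
    idx = 0
    # Stage 1: decaying wait until the buffer holds K clients or the budget expires
    while idx < len(events) and len(collected) < K and t1_remain > 0:
        i = events[idx]
        wait = arrival_times[i] - t_start
        if phi * wait > t1_remain:        # budget exhausted before the next arrival
            t_start += t1_remain / phi
            t1_remain = 0.0
            break
        t1_remain -= phi * wait           # decay remaining budget on each arrival
        collected.append(i)
        t_start = arrival_times[i]
        idx += 1
    end_time = t_start
    # Stage 2: rolling T2 window, reset on every new arrival
    if collected:
        while idx < len(events):
            i = events[idx]
            if arrival_times[i] - end_time <= T2:
                collected.append(i)       # arrival inside the window: reset it
                end_time = arrival_times[i]
                idx += 1
            else:
                end_time += T2            # a full quiet T2 window closes the stage
                break
        else:
            end_time += T2
    return collected, end_time
```

With arrivals at 1, 2, 3, and 20 s, `T1 = 10`, `T2 = 2`, `K = 3`, and `phi = 1`, the first three clients fill the buffer by t = 3, the straggler at 20 s misses the rolling window, and the round closes at t = 5.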
To address this, we introduce a Monte Carlo simulation optimization method [37] to determine the most effective $T_2$ value for the current training environment. This process consists of the following three steps: (1) For each client $i$, based on its current predicted time $p_i^t$ and the historical mean $\mu_i^e$ and standard deviation $\sigma_i^e$ of the prediction residuals, we model the client’s next-round completion time as a random variable drawn from a normal distribution $\mathcal{N}(p_i^t + \mu_i^e,\, \sigma_i^e)$. By performing independent sampling for all clients, we generate one possible realization of completion times for the next round. Repeating this process thousands of times yields a diverse set of plausible future scenarios, capturing the inherent uncertainty and variability in client performance. (2) A set of candidate $T_2$ values is generated within a plausible range. For a given candidate $T_2$ value, we simulate the execution of the complete two-stage waiting mechanism over all generated scenarios. For each scenario, we record two key outcomes: the total number of clients collected $n$ and the total wall-clock waiting time $w$. We then compute the average number of clients collected, $\bar{n}$, and the average total waiting time, $\bar{w}$, across all scenarios. (3) We define a reward function $R$:
$$R(T_2) = \beta\, \bar{n} - (1 - \beta)\, \bar{w}$$
where $\beta \in (0,1)$ is a tunable trade-off parameter that controls the relative importance of collection efficiency versus latency. A higher $\beta$ prioritizes gathering more client updates, while a lower $\beta$ emphasizes faster round completion. The optimal $T_2^*$ for the upcoming rounds is selected as the value that maximizes the expected reward:
$$T_2^* = \arg\max_{T_2}\; \mathbb{E}[R(T_2)]$$
This data-driven approach ensures that $T_2$ is dynamically tailored to maximize the expected utility of each communication round under the current estimated client performance profile.
The time complexity of the Monte Carlo optimization process is as follows. Consider a training round with $n$ clients. For each of the $x$ simulated scenarios, generating a completion-time sample for each client takes $O(n)$ time, and sorting these $n$ samples for the two-stage waiting algorithm requires $O(n \log n)$ time. For a specific candidate $T_2$ value, running the two-stage waiting algorithm on the sorted data also takes $O(n)$ time. With $m$ candidate values (a small constant, e.g., $m \le 30$), the cost per scenario becomes $O(n \log n + nm)$. Thus, the total complexity over $x$ simulations is $O(xn(\log n + m))$.
In real-world deployments, the client cohort size per round, $n$, is typically tens to a few hundred, and the complexity estimate $O(xn(\log n + m))$ should be read at this scale. Importantly, in large-scale federated learning the main bottleneck is network communication and synchronization delay, not server-side processing. As a result, the computational effort of this online optimization is reasonable and worthwhile, given the substantial gains in training efficiency it provides.
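The three steps above can be sketched as follows; the compact `simulate_round` helper is a simplified stand-in for the full two-stage mechanism (with our budget-based reading of the first-stage decay), and all parameter values are illustrative:

```python
import random

def simulate_round(sorted_arrivals, T1, T2, K, phi=1.0):
    """Replay the two-stage collection on one sampled scenario;
    returns (clients collected, total waiting time)."""
    n, t_start, budget, count, idx = len(sorted_arrivals), 0.0, T1, 0, 0
    while idx < n and count < K and budget > 0:
        wait = sorted_arrivals[idx] - t_start
        if phi * wait > budget:           # first-stage budget exhausted
            t_start += budget / phi
            budget = 0.0
            break
        budget -= phi * wait
        t_start = sorted_arrivals[idx]
        count += 1
        idx += 1
    end = t_start
    while idx < n and sorted_arrivals[idx] - end <= T2:
        end = sorted_arrivals[idx]        # rolling T2 window resets on arrival
        count += 1
        idx += 1
    return count, end + T2                # round closes after a quiet T2 window

def choose_T2(preds, res_means, res_stds, T1, K, candidates,
              beta=0.7, num_scenarios=2000, seed=0):
    """Monte Carlo T2 selection: sample completion times from
    N(p_i + mu_i, sigma_i), score each candidate with
    R(T2) = beta * n_bar - (1 - beta) * w_bar, return the argmax."""
    rng = random.Random(seed)
    scenarios = [sorted(max(0.0, rng.gauss(p + m, s))
                        for p, m, s in zip(preds, res_means, res_stds))
                 for _ in range(num_scenarios)]
    best_T2, best_R = None, float("-inf")
    for T2 in candidates:
        n_sum = w_sum = 0.0
        for arr in scenarios:
            c, w = simulate_round(arr, T1, T2, K)
            n_sum += c
            w_sum += w
        R = beta * n_sum / num_scenarios - (1 - beta) * w_sum / num_scenarios
        if R > best_R:
            best_T2, best_R = T2, R
    return best_T2
```

For instance, with three fast clients near 5 s and one straggler near 50 s, a long candidate window that waits for the straggler is heavily penalized by the latency term, so a short $T_2$ wins.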

3.4. Asynchronous Aggregation

In the asynchronous FL paradigm, model updates from different clients are based on different versions of the global model; this version lag is known as staleness. To mitigate the negative impact of high-staleness client updates on global model convergence, FedDCS employs a dynamic weighting mechanism based on a POLY-rule function, which jointly considers a client’s staleness and its data quantity contribution.
Formally, let the current global model version on the server be $v$, and let client $i$’s local model be trained based on version $v_i$. The staleness of this client’s update is then defined as
$$\tau_i = v - v_i$$
Intuitively, a model update with higher staleness is based on an older global model and may deviate more from the current optimization direction. Therefore, it should be assigned lower weight during aggregation.
Our method employs the POLY-rule function to quantify the impact of staleness on the aggregation weight. For client $i$, the staleness-based weight decay factor $d_i$ is calculated as:
$$d_i = (\tau_i + 1)^{-\gamma}$$
where $\gamma > 0$ is the decay coefficient that controls how rapidly the weight diminishes as staleness increases; a larger $\gamma$ imposes a heavier penalty on stale updates. To simultaneously account for the data contribution from different clients, the final aggregation weight $\theta_i$ is jointly determined by the staleness decay factor $d_i$ and the client’s data proportion:
$$\theta_i = (1 - g)\, d_i\, \frac{n_i}{\sum_{j=1}^{N} n_j}$$
Here, $n_i$ denotes the local data size of client $i$, and $N$ represents the number of clients participating in aggregation during this round. The term $g$ is a global model weight coefficient governing the influence of the previous global model in the aggregation. Specifically, $\theta_g$ is set to $g$ only if all participating clients have zero staleness. Otherwise,
$$\theta_g = 1 - \sum_{i=1}^{N} \theta_i$$
to ensure a convex combination of all components. The weight formula inherently respects the underlying data distribution while intelligently regulating the influence based on update timeliness.
After obtaining the normalized weights for all selected clients, the server performs a weighted average to generate the new global model $w^{t+1}$:
$$w^{t+1} = \sum_{i=1}^{N} \theta_i\, w_i^{\mathrm{local}} + \theta_g\, w^{t}$$
where $w_i^{\mathrm{local}}$ denotes the local model parameters uploaded by client $i$.
Including the historical global model in the aggregation further enhances stability by preventing the new global model from being excessively swayed by the potentially biased updates from a newly sampled subset of clients. The decay coefficient γ is a pivotal hyperparameter in this scheme. Its value embodies a trade-off: a smaller γ is more tolerant of stale updates, facilitating the utilization of information from slower clients, whereas a larger γ prioritizes convergence stability by aggressively discounting outdated information. An appropriate γ is therefore crucial for balancing these competing objectives in heterogeneous environments.
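A minimal sketch of this aggregation rule, treating models as flat lists of floats for simplicity; the function signature and default hyperparameter values are illustrative:

```python
def aggregate(global_model, global_version, updates, gamma=0.5, g=0.1):
    """Staleness-aware weighted aggregation. `updates` is a list of
    (local_params, trained_from_version, data_size) tuples."""
    total_n = sum(n for _, _, n in updates)
    weights = []
    for _, v_i, n_i in updates:
        tau = global_version - v_i                 # staleness of this update
        d = (tau + 1) ** (-gamma)                  # POLY-rule decay factor
        weights.append((1 - g) * d * n_i / total_n)
    if all(global_version == v_i for _, v_i, _ in updates):
        theta_g = g                                # all updates are fresh
    else:
        theta_g = 1 - sum(weights)                 # keep a convex combination
    new_model = [theta_g * w_g for w_g in global_model]
    for (params, _, _), theta in zip(updates, weights):
        new_model = [acc + theta * p for acc, p in zip(new_model, params)]
    return new_model
```

Note how the convex-combination property is preserved: with all-fresh updates the client weights sum to $1-g$ and $\theta_g = g$; with any staleness, the decayed client weights sum to less than $1-g$ and $\theta_g$ absorbs the remainder.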

3.5. Convergence Analysis

Building on the seminal work in [10,38], FedDCS can be viewed as an enhanced variant of FedBuff featuring a dynamic buffer size. The modifications are confined to the server-side strategy, rather than altering the client’s local training procedure. Consequently, it extends the established proof methodology of FedBuff by formalizing the server’s variable aggregation size.
We consider the standard federated optimization problem, where the goal is to find a global model parameter $w \in \mathbb{R}^d$ that minimizes the following objective:
$$\min_{w}\; f(w) := \frac{1}{n} \sum_{i=1}^{n} p_i F_i(w)$$
Here, $n$ is the total number of clients, $F_i(w)$ denotes the expected loss over the local data distribution of client $i$, and $p_i > 0$ is the weight assigned to client $i$. The local functions $F_i(\cdot)$ are accessible only by their respective client $i$.
Our analysis is based on the following standard assumptions:
Assumption 1.
Unbiased Stochastic Gradients: Each client’s stochastic gradient estimator is unbiased:
$$\mathbb{E}_{\varsigma_i}\big[ g_i(w; \varsigma_i) \big] = \nabla F_i(w), \quad \forall i \in [n]$$
Assumption 2.
Bounded Local and Global Variance: The variance of the stochastic gradients is bounded:
$$\mathbb{E}_{\varsigma_i}\big[ \| g_i(w; \varsigma_i) - \nabla F_i(w) \|^2 \big] \le \sigma_l^2, \quad \forall i, w$$
$$\frac{1}{n} \sum_{i=1}^{n} \| \nabla F_i(w) - \nabla f(w) \|^2 \le \sigma_g^2, \quad \forall w$$
Assumption 3.
Bounded Gradient: The local gradient norm is uniformly bounded:
$$\| \nabla F_i(w) \|^2 \le G, \quad \forall i \in [n]$$
Assumption 4.
Lipschitz Gradient: Each local objective function is $L$-smooth:
$$\| \nabla F_i(w) - \nabla F_i(w') \| \le L\, \| w - w' \|, \quad \forall i \in [n],\; \forall w, w'$$
Assumption 5.
Bounded Staleness: In the asynchronous execution, the staleness $\tau_i(t)$ of an update from client $i$ used in the server’s $t$-th aggregation step is bounded:
$$1 \le \tau_i(t) \le \tau_{\max}$$
The existence of this bound $\tau_{\max}$ is guaranteed because the server performs aggregation at least once every fixed maximum waiting period and the number of concurrent clients is finite.
Having established the standard assumptions, we proceed to derive the convergence guarantees for FedDCS. Given that our client-side optimization remains unchanged from FedBuff, we can directly leverage its core convergence results. The primary distinction lies in the server-side aggregation, where the fixed buffer size $K$ in FedBuff is replaced by a dynamic buffer size $N_t$ in each round.
Lemma 1.
Under Assumptions 1–5, choosing constant server and client learning rates $\eta_g$ and $\eta_l$ satisfying $\eta_g \eta_l Q \le \frac{1}{L}$, the global model iterates of FedBuff are bounded by
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f(w_t)\|^2\big] \le \frac{2\big(f(w_0) - f^*\big)}{\eta_g \eta_l Q T} + \frac{L}{2}\eta_g \eta_l \sigma_l^2 + 3L^2 Q^2 \eta_l^2\big(\eta_g^2 \tau_{\max}^2 + 1\big)\big(\sigma_l^2 + \sigma_g^2 + G\big)$$
where $T$ is the total number of server communication rounds, $Q$ is the number of local steps taken by each client, $f(w_0) - f^*$ is the initial objective gap, and $\eta_g$ and $\eta_l$ are the server and client learning rates, respectively.
Building upon Lemma 1, we present the convergence theorem for the proposed FedDCS framework. The key adaptation is to replace the fixed buffer size $K$ with the average dynamic buffer size
$$\bar{N} = \mathbb{E}[N_t],$$
where $N_t$ is the number of clients aggregated in round $t$. Under the assumption that client completion times are independent of local data distributions, we can treat the sequence $N_t$ as a stationary stochastic process with bounded variance.
Theorem 1.
Under Assumptions 1–5, and given that the dynamic buffer size $N_t$ is a stationary stochastic process with expectation $\bar{N} = \mathbb{E}[N_t]$ and bounded variance $\mathrm{Var}[N_t] \le V$, the global model iterates of FedDCS satisfy the following inequality for any choice of learning rates $\eta_g$ and $\eta_l$ satisfying $\eta_g \eta_l Q \le \frac{1}{L}$:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f(w_t)\|^2\big] \le \frac{2\big(f(w_0) - f^*\big)}{\eta_g \eta_l Q T} + \frac{L}{2}\eta_g \eta_l \sigma_l^2 + 3L^2 Q^2 \eta_l^2\big(\eta_g^2 \tau_{\max}^2 + 1\big)\big(\sigma_l^2 + \sigma_g^2 + G\big) + \frac{C \eta_g^2 \eta_l^2 V}{\bar{N}^4}$$
The constant $C > 0$ and the structure of the final term arise from bounding the impact of the variance of the dynamic buffer size. As shown in the inequality, the final term $\frac{C \eta_g^2 \eta_l^2 V}{\bar{N}^4}$ quantifies the additional error due to the randomness of the dynamic buffer size. This term arises because the convergence proof involves bounding the second moment of the aggregated update, in which the weighting factor $\frac{1}{N_t^2}$ appears. By performing a Taylor expansion of $\mathbb{E}\big[\frac{1}{N_t^2}\big]$ around $\bar{N}$, the deviation from the fixed-weight case is proportional to $\frac{\mathrm{Var}(N_t)}{\bar{N}^4}$. When combined with the squared learning rates $\eta_g^2 \eta_l^2$ from the update magnitude, this yields the term $\frac{C \eta_g^2 \eta_l^2 V}{\bar{N}^4}$. This term remains bounded and controlled in practice, as the variance $V$ is finite and is suppressed by both $\bar{N}^4$ and the learning rates.
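The Taylor step above can be checked numerically. The following sketch compares the exact moment $\mathbb{E}[1/N_t^2]$ with the second-order expansion $1/\bar{N}^2 + 3\,\mathrm{Var}(N_t)/\bar{N}^4$; the buffer-size distribution (a floored Poisson around a mean cohort of 20) is an illustrative assumption, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stationary buffer-size sequence: Poisson around a mean
# cohort size of 20, floored at 5 (assumed values for demonstration only).
N = np.maximum(rng.poisson(lam=20, size=200_000), 5).astype(float)

N_bar = N.mean()
V = N.var()

exact = np.mean(1.0 / N**2)                      # E[1 / N_t^2]
taylor = 1.0 / N_bar**2 + 3.0 * V / N_bar**4     # 2nd-order expansion at N_bar

rel_err = abs(exact - taylor) / exact
```

For a concentrated distribution, the relative error of the expansion is well under a few percent, which is why the residual variance term enters the bound only at order $V/\bar{N}^4$.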
Corollary 1.
By selecting the learning rates and constants as
$$\eta_l = \Theta\!\left(\frac{1}{\sqrt{\bar{N} T Q}}\right), \quad \eta_g = \Theta\big(\sqrt{\bar{N}}\big), \quad \sigma^2 = \sigma_l^2 + \sigma_g^2 + G, \quad \eta_g \eta_l Q \le \frac{1}{L},$$
the bound in Formula (22) simplifies to
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f(w_t)\|^2\big] \le O\!\left(\frac{f(w_0) - f^*}{\sqrt{TQ}}\right) + O\!\left(\frac{\sigma_l^2}{\sqrt{TQ}}\right) + O\!\left(\frac{Q \sigma^2 (1 + \tau_{\max}^2)}{T \bar{N}^2}\right) + O\!\left(\frac{V}{T \bar{N}^4 Q}\right)$$
The corollary establishes the theoretical foundation for FedDCS. The derived bound confirms that our algorithm converges to a stationary point of the non-convex objective under standard assumptions. Critically, the asymptotic convergence rate, dominated by the $O\!\big(\frac{1}{\sqrt{TQ}}\big)$ term, matches that of the FedBuff baseline. The terms involving $\bar{N}$ demonstrate that a larger average cohort size improves convergence, analogous to increasing the fixed buffer size $K$ in FedBuff. The final term $O\!\big(\frac{V}{T \bar{N}^4 Q}\big)$ quantifies the cost of the dynamic buffer's variability. This term decays to zero as $T \to \infty$ and is heavily suppressed by $\bar{N}^4$, confirming that the inherent randomness in our adaptive cohort formation does not compromise final convergence. In summary, these results validate the convergence properties of the FedDCS approach.

4. Experimental Results and Discussion

4.1. Evaluation

4.1.1. Simulation of Heterogeneity

Our experimental setup simulates a federated system with 100 clients. In each communication round, the server uniformly at random selects 30 clients to participate in the training. To rigorously evaluate the robustness of the FL methods, we explicitly simulate training environments with both device and data heterogeneity, aiming to replicate the challenges posed by real-world variations in user device capabilities and network conditions.
Device Heterogeneity. To emulate the hardware and computational diversity of real-world devices, we stratify the 100 clients into four performance tiers: 50% fast, 20% medium, 20% slow, and 10% extremely slow. This stratification serves as the baseline for systemic variation in performance. To simulate unstable network conditions during training, each client has a 4% probability of incurring an additional random network delay of 5–12 s before its update is received by the server. Furthermore, to reflect dynamic resource contention in real environments (e.g., from background processes), each client in every round has a 1% probability of experiencing a performance shift. When triggered, the client’s local training time is randomly increased or decreased by a duration between 0 and 10 s. As illustrated in Figure 2, which shows the distribution of client training times from a single experimental run, these combined factors result in a pronounced long-tail distribution of training times.
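The combined timing model above can be sketched as follows. The tier proportions and perturbation probabilities follow the description in the text; the base per-tier durations and the helper names are illustrative assumptions, not values from the paper.

```python
import random

random.seed(42)

# (tier name, fraction of clients, base round time in seconds).
# Fractions follow the paper; base durations are assumed for illustration.
TIERS = [("fast", 0.5, 2.0), ("medium", 0.2, 6.0),
         ("slow", 0.2, 12.0), ("extremely_slow", 0.1, 25.0)]

def assign_tiers(n_clients=100):
    """Stratify clients: 50% fast, 20% medium, 20% slow, 10% extremely slow."""
    pool = []
    for name, frac, base in TIERS:
        pool += [(name, base)] * int(frac * n_clients)
    random.shuffle(pool)
    return pool

def round_duration(base):
    """One client's wall-clock round time with the paper's perturbations."""
    t = base
    if random.random() < 0.04:           # 4% chance: transient network delay
        t += random.uniform(5, 12)
    if random.random() < 0.01:           # 1% chance: performance shift
        t += random.uniform(-10, 10)     # +/- up to 10 s
    return max(t, 0.1)

clients = assign_tiers()
times = [round_duration(base) for _, base in clients]
```

Sampling many rounds from this model reproduces the long-tail shape of the per-client training-time distribution discussed above.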
Data Heterogeneity. To simulate data heterogeneity, we partition the dataset using a label distribution skew approach based on the Dirichlet distribution [39]. For a dataset with K classes, we sample a probability vector p_k ∼ Dir_K(α) for each client k, which defines the proportion of each class in its local dataset. The concentration parameter α > 0 controls the degree of heterogeneity: a smaller α leads to a more skewed distribution, where clients likely hold data from only a few classes, while a larger α results in a more IID-like distribution. To investigate model performance across various heterogeneity levels, we employ three representative values, namely α = 0.1 (highly heterogeneous), α = 0.5 (moderately heterogeneous), and α = 1.0 (mildly heterogeneous). Examples of the resulting client data distributions for different α values are visualized in Figure 3.
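A Dirichlet label-skew partition of this kind can be sketched as below. We use the widely adopted per-class variant (for each class, the shares across clients are drawn from Dir(α)); the paper's exact partitioning code is not shown, so treat this as a representative sketch rather than the authors' implementation.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Label-distribution-skew split: for each class, the proportions
    assigned to the clients are drawn from a Dirichlet(alpha) vector."""
    rng = np.random.default_rng(seed)
    n_classes = int(labels.max()) + 1
    client_idx = [[] for _ in range(n_clients)]
    for c in range(n_classes):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        p = rng.dirichlet(alpha * np.ones(n_clients))     # class-c shares
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int) # split points
        for k, part in enumerate(np.split(idx, cuts)):
            client_idx[k].extend(part.tolist())
    return client_idx

# Toy example: 10 classes x 100 samples, 10 clients, strong skew.
labels = np.repeat(np.arange(10), 100)
parts = dirichlet_partition(labels, n_clients=10, alpha=0.1)
```

With α = 0.1 most clients end up holding samples from only a few classes, while α = 1.0 yields a much more balanced split.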

4.1.2. Datasets

To comprehensively evaluate the performance of FedDCS in addressing heterogeneity, we select three widely adopted benchmark datasets with varying complexity and characteristics in FL research, namely Fashion-MNIST [40], CIFAR-10 [41], and CIFAR-100 [41] as shown in Table 2.

4.1.3. Baselines

To comprehensively evaluate the performance of our proposed method, we select five representative FL methods as baselines, encompassing the primary synchronous, asynchronous, and semi-asynchronous training paradigms:
  • FedAvg [1]: The foundational synchronous FL method.
  • FedAnp [18]: An adaptive synchronous method that employs dynamic client sampling to improve efficiency against system heterogeneity.
  • FedAsync [6]: A canonical asynchronous method where the server aggregates the global model immediately upon receiving any client update.
  • FedFa [23]: An enhanced asynchronous method that maintains a buffer, enabling “single-client triggering, multi-client aggregation”.
  • FedBuff [10]: A semi-asynchronous method that aggregates a subset of faster clients per round.

4.1.4. Evaluation Metrics

To comprehensively evaluate the performance of FedDCS, we use the following three metrics: test accuracy, F1-score (F1), and time efficiency. Test accuracy measures the overall classification correctness of the global model on a held-out test set. F1 is the harmonic mean of precision and recall. It provides a balanced assessment of model performance, which is particularly crucial under non-IID or class-imbalanced data distributions. Time efficiency is evaluated from two practical perspectives: (1) the trajectory of test accuracy versus wall-clock training time, illustrating the convergence speed; (2) the total time required to reach a pre-defined target accuracy, reflecting the overall computational cost.
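As a concrete reference for the F1 metric above, the following sketch computes macro-averaged F1, one reasonable choice under label skew since every class contributes equally; the paper does not state which averaging variant it uses, so the macro choice is our assumption.

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, n_classes):
    """Per-class F1 averaged uniformly over classes, so minority classes
    weigh as much as majority ones under non-IID label distributions."""
    scores = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(f1_score(tp, fp, fn))
    return sum(scores) / n_classes
```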

4.2. Experimental Setup

4.2.1. Implementation Setting

The server we used was equipped with an Intel(R) Xeon(R) Platinum 8173M CPU (16 cores), 32 GB of RAM, and an NVIDIA GeForce RTX 3090 GPU for accelerated training, running Ubuntu 22.04. All models were built, trained, and evaluated using the PyTorch 2.8.0 deep learning framework with CUDA 12.1.1 support. To mitigate the impact of randomness and ensure statistical reliability, each experiment was independently conducted 5 times with different random seeds.

4.2.2. Model and Training Configuration

We design dedicated Convolutional Neural Network (CNN) models for each image classification task. For the Fashion-MNIST dataset, we employ a lightweight CNN comprising two convolutional layers followed by two fully connected layers to process single-channel 28 × 28 grayscale images. For the more complex CIFAR-10 dataset, the model architecture is adapted by increasing the number of input channels to 3 for RGB images and adjusting the feature map dimensions accordingly. To handle the fine-grained 100-class CIFAR-100 dataset, we utilize a deeper and wider network with three convolutional layers and larger fully connected layers to enhance its feature extraction and representation capacity. All models use ReLU activation and incorporate dropout with a rate of 0.5 in the fully connected layers to prevent overfitting.
In federated training, all clients share identical local training hyperparameters. Specifically, we use the Adam optimizer with an initial learning rate of 0.001 and the cross-entropy loss function. In each federated round, every participating client performs 5 local training epochs on its private data with a batch size of 64. Based on pilot tests and empirical analysis, we set the key hyperparameters of FedDCS as follows: CUSUM sensitivity λ = 1.0, segmentation threshold coefficient ρ = 1.5, first-stage T_1 decay factor φ = 0.7, Monte Carlo reward weight β = 0.4, and staleness decay coefficient γ = 0.7.
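For context on the CUSUM sensitivity λ, a two-sided CUSUM detector of the kind this parameter governs can be sketched as follows. The statistic operates on standardized residuals of observed round times against their predicted mean; the threshold `h` and the exact form are our assumptions for illustration, not the paper's specification.

```python
def cusum_detect(samples, mean, sigma, lam=1.0, h=5.0):
    """Two-sided CUSUM on standardized residuals: deviations beyond a slack
    of `lam` standard deviations accumulate; a persistent shift is flagged
    once either cumulative sum crosses the decision threshold `h`."""
    s_pos = s_neg = 0.0
    for t, x in enumerate(samples):
        z = (x - mean) / sigma
        s_pos = max(0.0, s_pos + z - lam)   # tracks upward shifts (slowdowns)
        s_neg = max(0.0, s_neg - z - lam)   # tracks downward shifts (speedups)
        if s_pos > h or s_neg > h:
            return t        # round index at which the shift is detected
    return None             # no sustained performance change
```

A smaller λ makes the detector react to smaller sustained shifts at the cost of more false alarms, which is the trade-off the sensitivity setting controls.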

4.3. Results and Analysis

4.3.1. Convergence Performance

To investigate the performance of FedDCS across different data-heterogeneity settings, we conduct experiments on the CIFAR-10 dataset, simulating non-IID data of varying severity by varying the Dirichlet distribution parameter α . Figure 4 illustrates the convergence timeline for all compared methods, including our proposed FedDCS, across three heterogeneity levels.
As illustrated in the figure, FedDCS outperforms the other methods in robustness and time efficiency across all levels of heterogeneity. Its test accuracy curve climbs the fastest, exhibits stable convergence, and consistently achieves and maintains the highest final accuracy. FedAvg, while capable of reaching a high final accuracy, suffers from severe efficiency degradation due to the straggler effect. Its convergence rate is the slowest among all methods, as it is bottlenecked by the slowest client in each training round. The performance of FedAnp and FedBuff shows greater sensitivity to data heterogeneity. Their time efficiency fluctuates noticeably across different α settings.
To evaluate the generalizability of our method across varying task complexities, we conduct extended experiments on the Fashion-MNIST and CIFAR-100 datasets, using a Dirichlet parameter α = 0.5 to simulate a representative non-IID setting. Figure 5 depicts the convergence timeline for all compared methods and FedDCS.
Overall, FedDCS consistently achieves the best performance on all datasets, characterized by the fastest initial convergence rate and the highest stable accuracy plateau, confirming its dual advantages in efficiency and robustness.
On Fashion-MNIST, all methods eventually converge. Among these, FedDCS outperforms all baselines in both final accuracy and convergence speed. Notably, the semi-asynchronous baseline FedBuff demonstrates acceptable performance here, yet its convergence trajectory remains slower and less accurate than that of FedDCS.
On the more challenging CIFAR-100 benchmark, the methods' performances diverge more sharply. FedDCS still maintains a significant lead, achieving the fastest convergence and the highest stable accuracy, which underscores the strong adaptability of its design to high-complexity tasks. In contrast, while FedAvg can reach a respectable final accuracy, its convergence process is prohibitively slow. FedBuff suffers from reduced time efficiency when facing this complex task. The asynchronous methods (FedAsync, FedFa) fail to adapt effectively, exhibit extremely low efficiency, and struggle to converge.
Integrating the experimental results from Figure 4 and Figure 5, our proposed FedDCS strategy demonstrates robust and efficient performance when handling federated learning tasks characterized by dynamic heterogeneity. Notably, FedDCS consistently maintains high time efficiency across a wide spectrum of experimental conditions—from mild (α = 1.0) to high (α = 0.1) data heterogeneity, and from the relatively simple Fashion-MNIST to the complex CIFAR-100 classification task. These observations comprehensively validate that its core design exhibits strong, generalizable adaptability to environmental variations across dimensions and severity levels.

4.3.2. Accuracy and F1 Score

To comprehensively evaluate the final model quality and training stability of each method, Table 3 summarizes the average optimal test accuracy (Acc) and F1 score (F1), along with their standard deviations, obtained from five independent runs with different random seeds across five dataset configurations. The trends for test accuracy and F1 are consistent; therefore, we focus our analysis primarily on accuracy.
Overall, the proposed FedDCS achieves the best comprehensive performance in most settings. On the relatively simple Fashion-MNIST ( α = 0.5 ) task, FedDCS attains a slight lead with 89.0% accuracy, compared to FedAvg and FedBuff (both 88.7%). The advantage of FedDCS becomes more pronounced on the more challenging CIFAR datasets. Under mild heterogeneity on CIFAR-10 ( α = 1.0 ), FedDCS achieves 70.3% accuracy—slightly lower than FedAvg’s 70.9%—but exhibits superior stability with a smaller standard deviation (±0.3% vs. ±0.4%). Crucially, FedAvg requires significantly more training time to reach this accuracy. As data heterogeneity intensifies ( α decreasing from 1.0 to 0.1), the performance degradation of FedDCS is markedly less severe than that of other baselines. In the highly data heterogeneous CIFAR-10 ( α = 0.1 ) setting, FedDCS achieves a significantly higher accuracy of 65.0% compared to FedAvg (64.4%) and others, while also demonstrating the best robustness with a standard deviation of ±1.5%, far lower than that of FedAnp (±7.7%). For the fine-grained CIFAR-100 ( α = 0.5 ) task, FedDCS also delivers the most competitive result with an accuracy of 42.9% ± 0.6%, balancing high performance with stability.
In contrast, all baseline methods exhibit notable limitations. While FedAvg can achieve accuracy comparable to FedDCS in some settings, its synchronous nature fundamentally limits efficiency due to the straggler problem. FedBuff delivers acceptable performance on simpler tasks but shows a significant gap in both convergence speed and final accuracy on complex ones. FedAnp demonstrates high sensitivity to data distribution; its accuracy drops sharply, and its variance increases drastically as heterogeneity intensifies. The asynchronous methods (FedAsync, FedFa) consistently underperform all others in both final accuracy and stability across every configuration. The results in Table 3 collectively demonstrate that FedDCS not only achieves the highest convergence accuracy in most cases but also adapts to varying data distributions and task complexities with minimal performance fluctuation, establishing it as a robust and effective solution for complex, heterogeneous FL scenarios.

4.3.3. Time Efficiency

To precisely quantify the time efficiency of each method, we record the cumulative wall-clock time required to reach a pre-defined target test accuracy. A direct comparison is presented in Figure 6.
Across all experimental settings, FedDCS consistently reaches the training goal in the shortest time, establishing a comprehensive lead in time efficiency. On the Fashion-MNIST task ( α = 0.5 ), FedDCS requires only 320 s. In comparison, FedAvg, FedAnp, and FedBuff take 2.66×, 2.16×, and 1.15× longer, respectively. This advantage is further amplified under the highly heterogeneous CIFAR-10 ( α = 0.1 ) scenario, where FedDCS needs merely 402 s. The completion times for FedAvg, FedAnp, and FedBuff are 3.29×, 4.93×, and 2.00× longer, respectively, indicating a clear efficiency advantage. This trend holds for other CIFAR-10 scenarios and the more complex CIFAR-100 ( α = 0.5 ) task, where FedDCS’s completion time remains significantly lower than all other methods.

4.3.4. Ablation Study

To evaluate the practical overhead of our server-side orchestration approach, we measure the time spent by its main components during a full training cycle. The experiment is performed on three benchmark datasets in a non-IID setting (α = 0.5), with 3000 Monte Carlo simulations. As shown in Table 4, the client training time prediction and early-batch segmentation modules are highly efficient, together representing less than 0.02% of the total round duration. Their overhead is therefore negligible. Although the Monte Carlo simulation module is the most resource-intensive among the three, it still adds only a small overhead—ranging from 2.1% to 3.6% of the total round time across all datasets. These results empirically confirm that FedDCS’s entire adaptive scheduling overhead is minimal. This small additional cost is justified and acceptable, as it enables the significant time efficiency improvements demonstrated in our experiments.
To examine the contribution of each component in the two-stage waiting strategy of FedDCS, we conduct an ablation study by comparing the full method with two variants across three datasets (α = 0.5 ). The compared variants are defined as follows:
  • FedDCS: The full proposed strategy with the complete two-stage adaptive waiting mechanism.
  • FedDCS-T1: A variant retaining only the first-stage waiting mechanism. This effectively reduces the method to a FedBuff variant with a dynamically sized buffer.
  • FedDCS-T2: A variant retaining only the second-stage waiting mechanism, with a fixed buffer size of 20. This is equivalent to augmenting the standard FedBuff with the second-stage waiting mechanism as an optimization patch.
  • FedBuff: The standard semi-asynchronous baseline, which employs a fixed buffer size and contains neither of the proposed waiting mechanisms.
The experimental results are shown in Figure 7. The analysis reveals the distinct and complementary roles of the two proposed mechanisms:
  • The first-stage waiting mechanism ensures baseline time efficiency. On simpler tasks such as Fashion-MNIST, FedDCS-T1 achieves higher early-stage efficiency than FedBuff and FedDCS-T2. However, on the complex CIFAR-100 task, it underperforms relative to FedDCS-T2 and the full FedDCS. This indicates that dynamically adjusting the buffer size based on client progress is effective at adapting to common performance variations and avoiding the inefficiencies of a fixed buffer. Nonetheless, its standalone contribution is limited, particularly in handling dynamic environmental changes.
  • The second-stage waiting mechanism enhances efficiency on complex tasks. For the more challenging CIFAR-10 and CIFAR-100 datasets, FedDCS-T2 demonstrates significantly higher early-stage time efficiency than FedBuff and slightly outperforms FedDCS-T1. This validates that the second-stage waiting mechanism, by purposefully incorporating updates from “following” clients, enriches the information content of each communication round, thereby accelerating early-phase convergence. However, a fixed buffer size can result in excessive waiting or unnecessary delays when clients experience sudden slowdowns.
  • Synergistic effect for optimal performance. The complete FedDCS strategy achieves the best overall performance across all three datasets. It leads comprehensively in both time efficiency and final accuracy on Fashion-MNIST and CIFAR-100. On CIFAR-10, it achieves the highest convergence rate while reaching peak accuracy. These results demonstrate a clear synergistic effect: the first-stage waiting mechanism establishes an efficient, adaptive scheduling foundation, while the second-stage waiting mechanism performs refined optimization on this basis. This combination is crucial for robust performance in environments with compounded device and data heterogeneity.
To investigate the impact of the Monte Carlo reward function weight β within the second-stage waiting mechanism, we conduct a sensitivity analysis on three datasets ( α = 0.5 ), testing values of β { 0.2 , 0.4 , 0.6 , 0.8 } . This parameter controls the trade-off in the Monte Carlo simulation between the reward for aggregating more clients and the penalty for incurring additional waiting time. The results are presented in Figure 8.
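To illustrate the role of β, the following sketch estimates a reward of the kind described above by Monte Carlo simulation, trading the fraction of additional clients captured (weight β) against the extra waiting time incurred (weight 1 − β). The reward shape, jitter model, penalty scale, and all numeric values are illustrative assumptions rather than the paper's actual objective.

```python
import random

def expected_reward(pred_times, wait, beta=0.4, jitter=2.0,
                    n_sims=3000, seed=0):
    """Monte Carlo estimate of a hypothetical second-stage reward: perturb
    the predicted remaining completion times, count the fraction of extra
    clients that would arrive within `wait` seconds (weight beta), and
    subtract a saturating penalty for the extra wait (weight 1 - beta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        arrived = sum(1 for t in pred_times
                      if t + rng.uniform(-jitter, jitter) <= wait)
        gain = arrived / max(len(pred_times), 1)
        cost = wait / (wait + 10.0)          # saturating time penalty
        total += beta * gain - (1.0 - beta) * cost
    return total / n_sims

# Pick the extra wait (from a candidate grid) with the best estimated reward.
preds = [1.5, 2.0, 8.0, 9.0]                 # predicted remaining seconds
best_wait = max([0.0, 2.5, 5.0, 10.0],
                key=lambda w: expected_reward(preds, w))
```

Raising β shifts the optimum toward longer waits that capture more "following" clients, while lowering it favors aggregating the already-arrived cohort immediately.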
Across all three datasets, β = 0.4 achieves the best overall performance, striking an optimal balance between time efficiency and final accuracy. On Fashion-MNIST, β = 0.4 maintains a consistent lead throughout the entire FL training process. On CIFAR-10, β = 0.4 and β = 0.2 show comparable early-stage efficiency. However, β = 0.4 demonstrates superior convergence stability and higher final accuracy. While β = 0.6 reaches high final accuracy, its early-stage efficiency is inferior to that of β = 0.4 . On the more complex CIFAR-100, β = 0.4 slightly outperforms β = 0.2 in efficiency and stability, and delivers better overall performance than β = 0.6 and β = 0.8 . Setting β too high places excessive emphasis on waiting for more clients, thereby introducing significant and often unnecessary latency. This leads to markedly lower time efficiency—a drawback that would be further exacerbated in dynamic environments with sudden client performance shifts.
These results indicate that setting β = 0.4 assigns the most appropriate weight to client quantity in the second-stage dynamic decision-making. It enables the effective utilization of more client updates without significantly prolonging the per-round duration. The consistency of this optimal balance point across tasks of varying complexity demonstrates the robustness and generalizability of the chosen parameter value.

5. Conclusions

In this paper, we address the critical challenge of time efficiency and robustness in FL environments characterized by device and data heterogeneity. To overcome this, we proposed FedDCS, a semi-asynchronous FL framework. Its core innovation is an efficient, predictive, and adaptive server-side orchestration mechanism. FedDCS first employs an enhanced exponential smoothing model to predict client completion times. These predictions then drive a two-stage adaptive waiting mechanism: the first stage dynamically sets a buffer size and wait time to efficiently collect a core cohort of clients, while an optimized second-stage wait captures additional clients that complete in close succession. Collectively, these components enable FedDCS to adaptively respond to dynamic heterogeneities in real time.
We conducted extensive experiments across three benchmark datasets of varying complexity under meticulously simulated heterogeneous conditions. The results consistently demonstrate that FedDCS achieves a superior balance between efficiency and model quality. It significantly outperforms synchronous, asynchronous, and semi-asynchronous baselines in terms of time efficiency and exhibits the highest accuracy with lower variance across diverse and challenging scenarios.
Two practical directions for future work include defending against potential client poisoning attacks and handling frequent client dropouts in unstable networks. Enhancing the current adaptive mechanism to address these issues would make the framework more practical for volatile environments.

Author Contributions

Conceptualization, R.L. and L.Z.; methodology, R.L.; validation, R.L.; formal analysis, R.L.; investigation, R.L.; resources, L.Z.; data curation, R.L.; writing—original draft preparation, R.L.; writing—review and editing, L.Z.; supervision, L.Z.; project administration, L.Z.; funding acquisition, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the International Partnership Program of the Chinese Academy of Sciences “Global Open Science Cloud Initiative” (Grant No. 241711KYSB20200023), the National Natural Science Foundation of China “Research on the Measurement of Scientific Data Reusability through An Extended Citation Framework” (Grant No. 72104229), and the CNIC grant.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, 9–11 May 2017; JMLR: Norfolk, MA, USA, 2017; Volume 54. [Google Scholar] [CrossRef]
  2. Asad, M.; Moustafa, A.; Ito, T. FedOpt: Towards Communication Efficiency and Privacy Preservation in Federated Learning. Appl. Sci. 2020, 10, 2864. [Google Scholar] [CrossRef]
  3. Carrillo, J.A.; Trillos, N.G.; Li, S.; Zhu, Y. FedCBO: Reaching Group Consensus in Clustered Federated Learning through Consensus-Based Optimization. J. Mach. Learn. Res. 2024, 25, 1–51. [Google Scholar] [CrossRef]
  4. Yang, Y.; Hui, B.; Yuan, H.; Gong, N.; Cao, Y. PrivateFL: Accurate, Differentially Private Federated Learning via Personalized Data Transformation. In Proceedings of the 32nd USENIX Security Symposium, Anaheim, CA, USA, 9–11 August 2023; pp. 1595–1612. [Google Scholar]
  5. Chen, J.; Tang, H.; Cheng, J.; Yan, M.; Zhang, J.; Xu, M.; Nie, L. Breaking Barriers of System Heterogeneity: Straggler-Tolerant Multimodal Federated Learning via Knowledge Distillation. In Proceedings of the International Joint Conference on Artificial Intelligence, Jeju Island, Republic of Korea, 3–9 August 2024. [Google Scholar] [CrossRef]
  6. Xie, C.; Koyejo, S.; Gupta, I. Asynchronous Federated Optimization. arXiv 2019, arXiv:1903.03934. [Google Scholar] [CrossRef]
  7. Li, Y.; Yang, S.; Ren, X.; Shi, L.; Zhao, C. Multi-Stage Asynchronous Federated Learning with Adaptive Differential Privacy. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 1243–1256. [Google Scholar] [CrossRef] [PubMed]
  8. Huang, P.; Li, D.; Yan, Z. Wireless Federated Learning with Asynchronous and Quantized Updates. IEEE Commun. Lett. 2023, 27, 2393–2397. [Google Scholar] [CrossRef]
  9. Li, W.; Lv, T.; Ni, W.; Zhao, J.; Hossain, E.; Poor, H.V. Route-and-Aggregate Decentralized Federated Learning Under Communication Errors. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 16675–16691. [Google Scholar] [CrossRef]
  10. Nguyen, J.; Malik, K.; Zhan, H.; Yousefpour, A.; Rabbat, M.; Malek, M.; Huba, D. Federated Learning with Buffered Asynchronous Aggregation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual Conference, 28–30 March 2022; pp. 3581–3607. [Google Scholar] [CrossRef]
  11. Wang, Y.; Cao, Y.; Wu, J.; Chen, R.; Chen, J. Tackling the Data Heterogeneity in Asynchronous Federated Learning with Cached Update Calibration. In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  12. Chen, M.; Mao, B.; Ma, T. FedSA: A Staleness-Aware Asynchronous Federated Learning Algorithm with Non-IID Data. Future Gener. Comput. Sy. 2021, 120, 1–12. [Google Scholar] [CrossRef]
  13. Su, N.; Li, B. How Asynchronous Can Federated Learning Be? In Proceedings of the 2022 IEEE/ACM 30th International Symposium on Quality of Service, Oslo, Norway, 10–12 June 2022; pp. 1–11. [Google Scholar] [CrossRef]
  14. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Process. Mag. 2020, 37, 50–60. [Google Scholar] [CrossRef]
  15. Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated Learning with Non-IID Data. arXiv 2018, arXiv:1806.00582. [Google Scholar] [CrossRef]
  16. Liu, B.; Lv, N.; Guo, Y.; Li, Y. Recent Advances on Federated Learning: A Systematic Survey. Neurocomputing 2024, 597, 128019. [Google Scholar] [CrossRef]
  17. Chai, Z.; Ali, A.; Zawad, S.; Truex, S.; Anwar, A.; Baracaldo, N.; Cheng, Y. TIFL: A Tier-Based Federated Learning System. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, Virtual Conference, 23–26 June 2020; pp. 125–136. [Google Scholar] [CrossRef]
  18. Reisizadeh, A.; Tziotis, I.; Hassani, H.; Mokhtari, A.; Pedarsani, R. Straggler-Resilient Federated Learning: Leveraging the Interplay Between Statistical Accuracy and System Heterogeneity. IEEE J. Sel. Areas Inf. Theory 2022, 3, 197–205. [Google Scholar] [CrossRef]
  19. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450. [Google Scholar] [CrossRef]
  20. Liu, R.; Wu, F.; Wu, C.; Wang, Y.; Lyu, L.; Chen, H.; Xie, X. No One Left Behind: Inclusive Federated Learning over Heterogeneous Devices. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 3398–3406. [Google Scholar] [CrossRef]
  21. Hao, H.; Xu, C.; Zhang, W.; Chen, X.; Yang, S.; Muntean, G.M. Reliability-Aware Optimization of Task Offloading for UAV-Assisted Edge Computing. IEEE Trans. Comput. 2025, 74, 3832–3848. [Google Scholar] [CrossRef]
  22. Chen, Y.; Sun, X.; Jin, Y. Communication-Efficient Federated Deep Learning with Layerwise Asynchronous Model Update and Temporally Weighted Aggregation. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 4229–4238. [Google Scholar] [CrossRef]
  23. Xu, H.; Zhang, Z.; Di, S.; Liu, B.; Alharthi, K.A.; Cao, J. FedFa: A Fully Asynchronous Training Paradigm for Federated Learning. In Proceedings of the 33rd International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea, 3–9 August 2024; pp. 5281–5288. [Google Scholar] [CrossRef]
  24. Zhou, C.; Tian, H.; Zhang, H.; Zhang, J.; Dong, M.; Jia, J. TEA-Fed: Time-Efficient Asynchronous Federated Learning for Edge Computing. In Proceedings of the 18th ACM International Conference on Computing Frontiers, Virtual Event, Catania, Italy, 11–13 May 2021; pp. 30–37. [Google Scholar] [CrossRef]
  25. Leconte, L.; Nguyen, V.M.; Moulines, E. FAVANO: Federated Averaging with Asynchronous Nodes. In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 14–19 April 2024; pp. 5665–5669. [Google Scholar] [CrossRef]
  26. Yang, J.; Liu, Y.; Chen, F.; Chen, W.; Li, C. Asynchronous Wireless Federated Learning with Probabilistic Client Selection. IEEE Trans. Wirel. Commun. 2024, 23, 7144–7158. [Google Scholar] [CrossRef]
  27. Hao, J.; Zhao, Y.; Zhang, J. Time Efficient Federated Learning with Semi-asynchronous Communication. In Proceedings of the 2020 IEEE 26th International Conference on Parallel and Distributed Systems, Hong Kong, China, 2–4 December 2020; pp. 156–163. [Google Scholar] [CrossRef]
  28. Shi, G.; Li, L.; Wang, J.; Chen, W.; Ye, K.; Xu, C. HySync: Hybrid Federated Learning with Effective Synchronization. In Proceedings of the 2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems, Yanuca Island, Cuvu, Fiji, 14–16 December 2020; pp. 628–633. [Google Scholar] [CrossRef]
  29. Chai, Z.; Chen, Y.; Anwar, A.; Zhao, L.; Cheng, Y.; Rangwala, H. FedAT: A High-Performance and Communication-Efficient Federated Learning System with Asynchronous Tiers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, MO, USA, 14–19 November 2021; pp. 1–16. [Google Scholar] [CrossRef]
  30. Liu, J.; Xu, H.; Xu, Y.; Ma, Z.; Wang, Z.; Qian, C.; Huang, H. Communication-Efficient Asynchronous Federated Learning in Resource-Constrained Edge Computing. Comput. Netw. 2021, 199, 108429. [Google Scholar] [CrossRef]
  31. Chen, Z.; Yi, W.; Shin, H.; Nallanathan, A. Adaptive Semi-Asynchronous Federated Learning Over Wireless Networks. IEEE Trans. Commun. 2025, 73, 394–409. [Google Scholar] [CrossRef]
  32. Yu, J.; Zhou, R.; Chen, C.; Li, B.; Dong, F. ASFL: Adaptive Semi-Asynchronous Federated Learning for Balancing Model Accuracy and Total Latency in Mobile Edge Networks. In Proceedings of the 52nd International Conference on Parallel Processing, Salt Lake City, UT, USA, 7–10 August 2023; pp. 443–451. [Google Scholar] [CrossRef]
  33. Singh, N.; Adhikari, M. A Hybrid Semi-Asynchronous Federated Learning and Split Learning Strategy in Edge Networks. IEEE Trans. Netw. Sci. Eng. 2025, 12, 1429–1439. [Google Scholar] [CrossRef]
  34. Gupta, M.; Gao, J.; Aggarwal, C.C.; Han, J. Outlier Detection for Temporal Data: A Survey. IEEE Trans. Knowl. Data Eng. 2014, 26, 2250–2267. [Google Scholar] [CrossRef]
  35. Hawkins, D.M.; Olwell, D.H. Cumulative Sum Control Charts and Charting for Quality Improvement; Springer: New York, NY, USA, 1998; pp. 1–247. [Google Scholar] [CrossRef]
  36. MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 21 June–18 July 1965 and 27 December 1965–7 January 1966; Volume 1, pp. 281–297. [Google Scholar]
  37. Metropolis, N.; Ulam, S. The Monte Carlo Method. J. Am. Stat. Assoc. 1949, 44, 335–341. [Google Scholar] [CrossRef]
  38. Mania, H.; Pan, X.; Papailiopoulos, D.; Recht, B.; Ramchandran, K.; Jordan, M.I. Perturbed Iterate Analysis for Asynchronous Stochastic Optimization. SIAM J. Optim. 2017, 27, 2202–2229. [Google Scholar] [CrossRef]
  39. Hsu, T.M.H.; Qi, H.; Brown, M. Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification. arXiv 2019, arXiv:1909.06335. [Google Scholar] [CrossRef]
  40. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv 2017, arXiv:1708.07747. [Google Scholar] [CrossRef]
  41. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Master’s Thesis, University of Toronto, Toronto, ON, Canada, 2009. [Google Scholar]
Figure 1. The overall workflow of the FedDCS framework.
Figure 2. (a) Histogram of client initial training time distribution; (b) Histogram of client average training time per round distribution. Both histograms reveal a long-tail distribution of client training times, which is characteristic of device heterogeneity.
Figure 3. (a–c) show the client data distributions corresponding to different α values, with darker colors indicating a greater amount of data with the corresponding labels.
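For reference, label-skew partitions of the kind visualized in Figure 3 are typically generated with a Dirichlet prior over per-client class proportions, following Hsu et al. [39]. The sketch below is a minimal, hypothetical implementation (the function name and defaults are illustrative, not taken from the paper): smaller α concentrates each class on fewer clients, producing higher data heterogeneity.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with a Dirichlet(alpha) label prior.

    Smaller alpha yields more skewed per-client label distributions,
    i.e., higher data heterogeneity (cf. the alpha settings in Figure 3).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        # Shuffle the indices of class c, then carve them into client chunks.
        idx = rng.permutation(np.flatnonzero(labels == c))
        # Draw the share of class c assigned to each client.
        shares = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, chunk in zip(client_indices, np.split(idx, cuts)):
            client.extend(chunk.tolist())
    return client_indices
```

Every sample is assigned to exactly one client, so the partition can be fed directly to per-client data loaders in a simulation.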
Figure 4. Performance comparison of various methods on the CIFAR-10 dataset under different data heterogeneity: (a) Mild data heterogeneity (α = 1.0); (b) Moderate data heterogeneity (α = 0.5); (c) High data heterogeneity (α = 0.1).
Figure 5. Performance comparison of various methods across different datasets: (a) Fashion-MNIST dataset ( α = 0.5 ); (b) CIFAR-100 dataset ( α = 0.5 ); (c) CIFAR-10 dataset ( α = 0.5 ).
Figure 6. Time required for each method to reach a target accuracy (%): (a) FMNIST ( α = 0.5 , target 85%); (b) CIFAR-10 ( α = 1.0 , target 68%); (c) CIFAR-10 ( α = 0.5 , target 65%); (d) CIFAR-10 ( α = 0.1 , target 44%); (e) CIFAR-100 ( α = 0.5 , target 38%). (FedAsync and FedFa do not reach the target accuracy within the specified time limit and are therefore omitted.)
Figure 7. Comparison of the FedDCS Two-Stage Waiting Mechanism: (a) Fashion-MNIST dataset ( α = 0.5 ); (b) CIFAR-10 dataset ( α = 0.5 ); (c) CIFAR-100 dataset ( α = 0.5 ).
Figure 8. Comparison of the hyperparameter β within the second-stage waiting mechanism: (a) Fashion-MNIST dataset ( α = 0.5 ); (b) CIFAR-10 dataset ( α = 0.5 ); (c) CIFAR-100 dataset ( α = 0.5 ).
Table 1. Symbols and their descriptions.
Notation | Description
N | The number of clients participating in aggregation
t | Round t of training
v | The current version of the global model
v_i | The local model version of client i
θ_i | The aggregation weight of client i
θ_g | The global model’s aggregation weight
α | Data heterogeneity coefficient
p_i^t | Predicted completion time of client i in round t
o_i^t | Actual completion time of client i in round t
η | Exponential smoothing coefficient
e_i^t | The prediction residual of client i in round t
λ | Cumulative Sum sensitivity parameter
μ_i^e | The mean of e_i^t
σ_i^e | The standard deviation of e_i^t
K | Dynamic buffer size
ρ | The dynamic segment threshold coefficient
T_1 | Waiting time of the first stage
T_2 | Waiting time of the second stage
φ | The decay factor of T_1
β | The Monte Carlo reward weight of the second-stage waiting mechanism
γ | The staleness decay coefficient of the POLY-rule function
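The prediction-related symbols above (p_i^t, o_i^t, η, e_i^t, λ, μ_i^e, σ_i^e) describe an exponential-smoothing predictor of client completion times whose residuals are monitored with a Cumulative Sum (CUSUM) chart [35]. The sketch below is a hypothetical reconstruction under that assumption; the exact update rules, slack, and alarm threshold used by FedDCS may differ.

```python
class ClientTimePredictor:
    """Illustrative per-client completion-time predictor (assumed rules).

    p is the prediction p_i^t, updated by exponential smoothing with
    coefficient eta; e = o - p is the residual e_i^t. A one-sided CUSUM
    on standardized residuals flags a sustained performance shift.
    """

    def __init__(self, eta=0.3, lam=0.5, threshold=5.0):
        self.eta = eta            # exponential smoothing coefficient (η)
        self.lam = lam            # CUSUM sensitivity / slack parameter (λ)
        self.threshold = threshold  # alarm level (assumed value)
        self.p = None             # current prediction p_i^t
        self.residuals = []       # history of residuals e_i^t
        self.cusum = 0.0

    def update(self, observed):
        """Fold in the actual completion time o_i^t; return (prediction, shifted)."""
        if self.p is None:
            self.p = observed     # initialize from the first observation
            return self.p, False
        e = observed - self.p                       # residual e_i^t
        self.residuals.append(e)
        mu = sum(self.residuals) / len(self.residuals)       # μ_i^e
        var = sum((r - mu) ** 2 for r in self.residuals) / len(self.residuals)
        sigma = var ** 0.5 or 1.0                   # σ_i^e (guard against 0)
        # One-sided CUSUM: accumulate standardized excess beyond slack λ.
        self.cusum = max(0.0, self.cusum + (e - mu) / sigma - self.lam)
        shifted = self.cusum > self.threshold
        # Exponential smoothing toward the new observation.
        self.p = self.eta * observed + (1 - self.eta) * self.p
        return self.p, shifted
```

A flagged shift would prompt the server to re-estimate that client’s speed rather than trust the smoothed prediction, which is the role the residual statistics play in the notation above.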
Table 2. Benchmark datasets.
Dataset Title | Dataset Description | Data Scale
Fashion-MNIST [40] | Fashion-MNIST (FMNIST) is a widely used benchmark dataset for image classification tasks, often regarded as an alternative to the traditional MNIST handwritten digit dataset. | 70,000 28 × 28-pixel images across 10 categories.
CIFAR-10 [41] | CIFAR-10 is a benchmark dataset that encompasses a wide variety of scenes and objects, ranging from natural landscapes to man-made objects, offering high diversity. | 60,000 32 × 32-pixel color natural images across 10 categories.
CIFAR-100 [41] | CIFAR-100 is an extension of the CIFAR-10 dataset and is frequently used to evaluate algorithm performance on highly heterogeneous data that require strong generalization capabilities. It features a more granular and complex classification structure. | 60,000 32 × 32-pixel color images across 100 categories.
Table 3. Final performance comparison across multiple dataset configurations.
Dataset | Metrics * | FedAvg | FedAnp | FedAsync | FedFa | FedBuff | FedDCS
FMNIST (α = 0.5) | Acc (%) | 88.7 ± 0.2 | 88.0 ± 0.5 | 84.0 ± 0.5 | 83.1 ± 0.4 | 88.7 ± 0.2 | 89.0 ± 0.3
FMNIST (α = 0.5) | F1 (%) | 88.6 ± 0.2 | 87.9 ± 0.5 | 83.7 ± 0.5 | 82.9 ± 0.5 | 88.7 ± 0.2 | 88.9 ± 0.3
CIFAR10 (α = 1.0) | Acc (%) | 70.9 ± 0.4 | 69.3 ± 0.4 | 54.5 ± 0.7 | 52.8 ± 1.8 | 68.9 ± 0.2 | 70.3 ± 0.3
CIFAR10 (α = 1.0) | F1 (%) | 70.8 ± 0.5 | 69.1 ± 0.6 | 54.3 ± 0.9 | 52.4 ± 2.0 | 68.8 ± 0.3 | 70.2 ± 0.3
CIFAR10 (α = 0.5) | Acc (%) | 69.5 ± 0.4 | 67.7 ± 1.5 | 50.3 ± 0.6 | 49.7 ± 1.6 | 68.2 ± 0.5 | 69.7 ± 0.5
CIFAR10 (α = 0.5) | F1 (%) | 69.4 ± 0.4 | 67.3 ± 1.5 | 49.9 ± 0.7 | 48.3 ± 1.6 | 68.0 ± 0.4 | 69.7 ± 0.5
CIFAR10 (α = 0.1) | Acc (%) | 64.4 ± 2.5 | 58.5 ± 7.7 | 34.5 ± 2.1 | 34.0 ± 2.8 | 59.6 ± 3.2 | 65.0 ± 1.5
CIFAR10 (α = 0.1) | F1 (%) | 63.9 ± 2.4 | 57.6 ± 8.0 | 29.6 ± 3.5 | 28.6 ± 4.1 | 59.2 ± 3.8 | 64.9 ± 1.5
CIFAR100 (α = 0.5) | Acc (%) | 42.8 ± 0.8 | 39.1 ± 0.6 | 35.9 ± 0.4 | 36.2 ± 0.2 | 40.5 ± 0.4 | 42.9 ± 0.6
CIFAR100 (α = 0.5) | F1 (%) | 43.6 ± 0.7 | 40.1 ± 0.6 | 35.9 ± 0.5 | 36.1 ± 0.3 | 41.2 ± 0.4 | 43.7 ± 0.6
* The statistics include accuracy and F1, along with the mean ± standard deviation of the approaches. Bolded numbers indicate the best performance among all methods.
Table 4. Time cost breakdown of different modules.
Dataset (α = 0.5) | Metrics | Training Time Prediction | Early-Batch Segmentation | Monte Carlo Simulation | Total Round Time
FMNIST | Time (s) | 0.189 | 0.037 | 46.743 | 2175.929
FMNIST | Proportion | 0.009% | 0.002% | 2.148% | 100%
CIFAR10 | Time (s) | 0.264 | 0.052 | 134.338 | 3753.548
CIFAR10 | Proportion | 0.007% | 0.001% | 3.579% | 100%
CIFAR100 | Time (s) | 0.366 | 0.150 | 179.304 | 5998.491
CIFAR100 | Proportion | 0.006% | 0.003% | 2.989% | 100%

Share and Cite

MDPI and ACS Style

Liu, R.; Zhang, L. FedDCS: Semi-Asynchronous Federated Learning Optimization Based on Dynamic Client Selection. Mathematics 2026, 14, 803. https://doi.org/10.3390/math14050803
