FedDdrl: Federated Double Deep Reinforcement Learning for Heterogeneous IoT with Adaptive Early Client Termination and Local Epoch Adjustment

Federated learning (FL) is a technique that allows multiple clients to collaboratively train a global model without sharing their sensitive and bandwidth-hungry data. This paper presents a joint early client termination and local epoch adjustment scheme for FL. We consider the challenges of heterogeneous Internet of Things (IoT) environments, including non-independent and identically distributed (non-IID) data as well as diverse computing and communication capabilities. The goal is to strike the best tradeoff among three conflicting objectives, namely global model accuracy, training latency and communication cost. We first leverage the balanced-MixUp technique to mitigate the influence of non-IID data on the FL convergence rate. A weighted sum optimization problem is then formulated and solved via our proposed FL double deep reinforcement learning (FedDdrl) framework, which outputs a dual action: the first decides whether a participating FL client is dropped, whereas the second specifies how long each remaining client needs to complete its local training task. Simulation results show that FedDdrl outperforms existing FL schemes in terms of overall tradeoff. Specifically, FedDdrl achieves higher model accuracy by about 4% while incurring 30% less latency and communication costs.


Introduction
In the Internet of Things (IoT), each device can collect massive amounts of data (i.e., measurements and location information) [1]. It is estimated that IoT devices will generate over 90 zettabytes of data globally by 2025 [2]. These data can be uploaded to a centralized server, where a new model can be retrained or fine-tuned using the collected dataset. This method is called centralized learning (CL) since the data must be centralized at one location for model training. However, privacy concerns make it inconvenient for devices to share potentially sensitive data with a centralized server (or any other party). For example, medical images may contain sensitive and private information about patients [3], which prohibits the collection of such data from multiple healthcare institutions for CL. Additionally, uploading bandwidth-hungry data requires a high communication cost, which is not feasible for most IoT devices with network resource constraints [1,4]. In fact, the data collected by IoT devices could be larger than the model size [4], especially when dealing with image data.
Federated learning (FL) has emerged as one of the promising candidates to address this challenge. FL is a technique that trains an algorithm across multiple edge devices while keeping the raw data local. The main contributions of this paper are summarized as follows:

1.
We modeled the FL system as a Markov decision process (MDP). Then, we proposed a double deep reinforcement learning (DDRL) framework for adaptive early client termination and local epoch adjustment, to maximize the global model accuracy while minimizing the training latency and communication costs.

2.
We demonstrated our proposed algorithm in a non-IID setting on MNIST, CIFAR-10 and CrisisIBD datasets. We showed that our solution could outperform existing methods in terms of global model accuracy with shorter training latency and lower communication costs.

3.
We explored the influence of balanced-MixUp in the FL system. In most settings, balanced-MixUp could mitigate weight divergence and improve convergence speed.
The rest of the paper is organized as follows. Section 2 describes the related work. Section 3 discusses the system model and problem formulation. Section 4 presents the proposed solution. Section 5 shows the experimental setup, followed by the simulation results and discussion. Section 6 concludes the paper and outlines future research directions.

Related Work
This section reviews existing works on FL and DRL-based FL to provide insights into the current trend in FL. Then, we discuss the limitation of each algorithm. Lastly, we also elaborate on the weight divergence problem in FL, which is a common problem faced by all FL algorithms.

Federated Learning and Deep Reinforcement Learning
Some of the commonly used FL algorithms include FedAvg [4], FedProx [23] and FedNova [17]. FedAvg was the first practical implementation of FL. In each communication round, the server sends the global model to N randomly selected clients. Each client trains the model using its local dataset. Then, each client uploads its locally trained model to the server, where the server averages the received local models' weights into the new global model. It has since become the de facto approach for FL and is widely used in various applications [3,[5][6][7]. FedProx presents a reparameterization of FedAvg by introducing an additional L2 regularization term in the local objective function. The regularization term limits the distance between the local and global models, preventing local updates from diverging from the global optimum. A hyperparameter µ controls the weight of the regularization term. Overall, the modification can be easily applied to the existing FedAvg algorithm while improving model accuracy on non-IID datasets. However, it introduces additional computing overhead, leading to longer training latency. On the other hand, FedNova improves FedAvg in the aggregation stage. It allows each client n ∈ N to conduct a different number of local steps, so clients with higher computing resources can conduct more training while waiting for others to complete. To ensure that the global updates are not biased, each local update is normalized and scaled according to the number of local steps conducted before being averaged into the new global model. FedNova introduces negligible computation overhead compared to FedAvg while handling computing resource heterogeneity in FL systems. However, all the above algorithms are limited to handling statistical heterogeneity (non-IID data) and computing resource heterogeneity.
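To make the contrast concrete, the following is a minimal numpy sketch (our illustration, not the authors' implementation) of a FedProx-style local update and FedAvg-style server aggregation for a simple linear model; the function names and the hyperparameters mu, lr and epochs are ours.

```python
import numpy as np

def fedprox_local_update(w_global, X, y, mu=0.1, lr=0.01, epochs=5):
    """One client's local training with a FedProx-style proximal term.

    Local objective: F(w) = MSE(w; X, y) + (mu / 2) * ||w - w_global||^2.
    The proximal term limits the distance between the local and global
    models; mu = 0 recovers a plain FedAvg local update.
    """
    w = w_global.copy()
    for _ in range(epochs):
        grad_loss = 2.0 * X.T @ (X @ w - y) / len(y)  # MSE gradient
        grad_prox = mu * (w - w_global)               # proximal-term gradient
        w -= lr * (grad_loss + grad_prox)
    return w

def fedavg_aggregate(client_weights, client_sizes):
    """Server-side FedAvg: dataset-size-weighted average of local models."""
    return np.average(np.stack(client_weights), axis=0,
                      weights=np.asarray(client_sizes, dtype=float))
```

With mu > 0 the local model ends up closer to w_global after the same number of steps, which is exactly the divergence-limiting effect described above.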
Other heuristic algorithms have been proposed to optimize client selection in FL systems with heterogeneous network and/or energy resources [24,25]. However, heuristic solutions could only deliver sub-optimal performance since they often rely on qualitative analysis without exploring the optimal performance [14].
DRL has been widely applied to solve optimization problems involving complex sequential decision-making, such as playing Atari games [26], multiplayer games [27] and chess [28]. Since an FL procedure can be modelled as an MDP, it can also be optimized using DRL. FAVOR is one of the first research works to optimize FL using DRL [21]. They observed an implicit connection between the distribution of a local client dataset and the model weights trained on those data. Using the model weights collected from each participating client, a DRL agent can learn to select suitable clients for the next round of training. After proving DRL success in FL optimization, multiple studies [9,10,14,19,20,29] have exploited DRL in FL resource allocation problems. For instance, [9,10,14,19] used DRL to jointly optimize computing and network resources in an FL framework while retaining the global model's accuracy. These studies employed a DRL-based client selection policy or early client termination policy. Such a policy is responsible for selecting the best subset of clients for each round of training by optimizing the tradeoff between model accuracy and resource allocation. On the other hand, [29] optimized only the network resources by quantizing the model weights before uploading them to the FL server. However, none of the DRL-based FL frameworks described above allow dynamic local epoch adjustment. These frameworks fixed the same local epochs for all clients, disregarding their computing cost and training latency. With dynamic local epoch adjustment, clients with higher computing resources can conduct more training epochs. On the contrary, clients with limited computing power can train with fewer epochs. Table 1 summarizes the key features of the aforementioned FL and DRL-based FL algorithms. In short, we noticed an ongoing trend of utilizing DRL to optimize the computing and network resources in the FL framework. 
Most of them relied only on client selection or early client termination techniques. As a result, such a method often rejects clients with limited computing power to prevent these devices from dragging the overall FL training latency. Even when such devices are selected, those with stronger computing power will finish training earlier and remain idle while waiting for the slower ones to complete training. However, these devices may contain crucial training data that is essential for FL convergence. Ideally, these devices should participate in FL training but with a lower local training epoch and vice versa. To the best of our knowledge, no DRL-based FL algorithms adopt DRL for automated local epoch adjustment. On the other hand, existing FL algorithms such as FedNova rely on manual adjustment to set devices with stronger computing power with a higher local epoch. Hence, an exciting potential exists for incorporating DRL-based dynamic local epoch adjustments for automated calibration.

Weight Divergence in Federated Learning
Weight divergence is the difference between the FL model weights w_t^FL and the CL model weights w_t^CL. An ideal level of weight divergence in FL could exploit the rich decentralized data, resulting in better performance. For instance, FL outperforms its CL counterpart in various applications, including drug discovery [30], disaster classification [7] and autonomous driving object detection [31]. However, in extreme non-IID cases where the local client data distribution p_k is far from the global data distribution p, the highly diverged local weight updates could lead to bad aggregated solutions which are far from the global optimum. This is especially true in IoT networks, where each IoT device has a unique data distribution and can be considered non-IID [1]. According to [16], the main source of weight divergence is the earth mover's distance (EMD) between p_k and p, denoted as ∑_{i=1}^{n_c} ‖p_k(y = i) − p(y = i)‖, where n_c denotes the total number of classes. In general, weight divergence is inevitable since p_k and p are almost guaranteed to be different in a real-life setting. Figure 1 shows an example of weight divergence between FedAvg and CL models.
Table 1 (excerpt). Key features of existing and proposed DRL-based FL algorithms:
Research work [20]: optimizes computing + communication resources; random client selection; fixed local epoch.
Research work [29]: optimizes communication resources; random client selection; fixed local epoch.
Proposed FedDdrl: optimizes computing + communication resources; client selection by DRL agent; local epoch set by DRL agent.

Generally, only a fraction of the total clients is selected for training in each communication round t to reduce the total communication cost in FL. Let C denote the total number of participating clients per round. When C is low, it is difficult to ensure that the sampled data resemble the global data distribution. This also leads to a high EMD between p_k and p, which again contributes to the divergence of w_t^FL from w_t^CL. Let A_t be the accuracy of the global model at communication round t ∈ T. Figure 2 shows the accuracy curves of FedAvg models trained on a non-IID CIFAR-10 dataset using C = 5 and C = 10. First, the average accuracy of the global model after t = 15 communication rounds was 62.73% and 72.79% for C = 5 and C = 10, respectively. Additionally, the fluctuation and standard deviation of the accuracy curve were higher for C = 5 than for C = 10. This shows that FedAvg (and FL in general) performs worse when the number of participating clients per round is low.
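The EMD between a client's label distribution p_k and the global distribution p can be computed directly from label histograms. A small illustrative sketch (the function names are ours):

```python
import numpy as np

def label_distribution(labels, n_classes):
    """Empirical class distribution p(y = i) from an array of labels."""
    counts = np.bincount(np.asarray(labels), minlength=n_classes)
    return counts / counts.sum()

def emd(p_k, p):
    """EMD term from [16]: sum_i || p_k(y = i) - p(y = i) ||."""
    return float(np.abs(np.asarray(p_k) - np.asarray(p)).sum())
```

Against a uniform global distribution, an IID client yields an EMD near 0, while a client holding only one of the n_c classes yields the worst-case EMD of 2(n_c − 1)/n_c.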
Recent studies have contributed various solutions to mitigate weight divergence. For instance, the FedProx [23] mentioned earlier adds a regularization term to the local subproblem to prevent the local updates from diverging away from the global FL model. This method, in turn, hopes to ensure the aggregated global FL model weights w_t^FL are close to w_t^CL. Albeit effective, FedProx requires higher computing costs and a longer training time [32]. On the other hand, [16] proposed partial global sharing of local data to reduce the EMD between client data distributions and the global population. However, this induces high communication costs for data sharing and raises privacy concerns.
Meanwhile, methods that employ adaptive client selection or early client termination (i.e., FedMarl) aim to tackle weight divergence via careful client selection. For each communication round, FedMarl only selects a subset of the C clients for training. Ideally, only the selected clients are useful for training, while the rest are not. Effectively, this means that C is not constant for each communication round t. However, a lower C may lead to less steady convergence based on Figure 2. Thus, FedMarl is expected to handle the careful dropping of clients considering the EMD between the global and local populations while taking care of other optimization objectives, such as the training latency and communication cost of each client. Dropping the wrong clients may lead to large weight divergence, as shown in Figure 3.

System Model and Problem Formulation
In this section, we present the system model for our FL system and discuss the problem formulation. Our commonly used symbols are listed in Table 2 for ease of reference.

System Model
We considered an FL system with K client devices. At each communication round t ∈ T, N clients were randomly selected from the K devices. Each communication round consisted of four phases: (1) model broadcasting, (2) probing training, (3) early client termination and (4) completion of training.

1.
Model broadcasting: If t = 1, the FL server will initialize a global model, whereas at t ≥ 2, the FL server will collect the client models trained at round t − 1 and aggregate them into a new global model. Then, the FL server will broadcast the global model to N randomly selected clients.

2.
Probing training: Each selected client n ∈ N will perform one epoch of local training called probing training. The purpose of probing training is to acquire the metadata of each client. The metadata consist of the client's states, which will be fed to the DRL agents for adaptive early client termination and local epoch adjustment. The details of the client states will be defined later together with the specification of the DRL agents. After probing training, each client will upload its metadata to the server and proceed to the next phase.

3.
Early client termination: Based on the collected client states, the DRL agents at the FL server will drop non-essential clients to reduce the total latency H_t and total communication cost B_t for round t. The decision made by the DRL agents will be sent to each client.

4.
Completion of training: Only the remaining C clients that are not dropped by the DRL agent will resume training. Each client n will complete the remaining local training until E^t_n epochs are reached. Each locally trained model will be uploaded to the FL server for model aggregation.
Let H^p_{t,n} denote the probing training latency of client n ∈ N at round t ∈ T. Let H^c_{t,n} be the complete local training latency of client n at round t, while H^u_{t,n} denotes the time taken for client n to upload its local model to the FL server. Let φ^t_n ∈ {0, 1} denote whether client n is selected by the DRL agent to complete full local training. The total processing latency H_t and the total communication cost B_t at communication round t can be expressed by Equations (1) and (2):

H_t = max_{n ∈ N} [H^p_{t,n} + φ^t_n (H^c_{t,n} + H^u_{t,n})] (1)

B_t = ∑_{n ∈ N} φ^t_n B^t_n (2)

Sensors 2023, 23, 2494

Figure 4 depicts the system model for the proposed FL protocol. Note that the model broadcasting latency H^b_{t,n} was not included in H_t since it is not part of the FL optimization problem. Additionally, the latency for uploading client metadata to the server, H^m_{t,n}, was ignored. This is because the metadata file size was only 278 bytes, while even a lightweight MobileNetV2 file is 24.5 megabytes. Thus, H^m_{t,n} is negligible since H^m_{t,n} ≪ H^u_{t,n}.
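The per-round totals of Equations (1) and (2) can be sketched as follows (an illustrative implementation, assuming the synchronous round waits for the slowest remaining client):

```python
import numpy as np

def round_latency_and_cost(H_p, H_c, H_u, B, phi):
    """Per-round totals for N clients.

    H_p, H_c, H_u : probing, complete-training and uploading latencies (length N)
    B             : per-client communication cost (length N)
    phi           : binary vector, 1 if the client completes training
    """
    phi = np.asarray(phi, dtype=float)
    # Synchronous FL: the round waits for the slowest client; dropped
    # clients only contribute their probing latency.
    H_t = float(np.max(H_p + phi * (H_c + H_u)))
    # Only clients that finish training upload a model.
    B_t = float(np.sum(phi * B))
    return H_t, B_t
```

Dropping a straggler (setting its phi entry to 0) immediately removes its training and upload time from the round's critical path.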

Problem Formulation
Our objective was to maximize the cumulative A_t improvement while minimizing H_t and B_t. Let φ = [φ^t_n] and E = [E^t_n] be T × N matrices for the client termination and local epoch adjustment decisions made by the DRL agents, respectively. We formulated the problem as a weighted sum optimization problem, as formulated below: where w_1, w_2 and w_3 are the weights controlling the importance of each objective. To ensure A_t can improve even if the gain is small near the end of the FL process, a utility function denoted as U(·) was used to reshape the A_t of the global model. In FedMarl, U(A_t) is defined in Equation (4): One problem with the original U(A_t) is that it only tells us the transformed value of A_t. The entire w_1[U(A_t) − U(A_{t−1})] term can be reparametrized into a single w_1 U(∆A_t) expression, which directly tells us the gain/penalty for ∆A_t. First, the U(A_t) equation is simplified in the range 0 ≤ A_t ≤ 1 since A_t is bounded between 0 and 100%. In this range, U(A_t) can be approximated as a straight line, as shown in Figure 5. To approximate U(A_t) as a straight line in the given range, the gradient and y-intercept of the graph are required. The gradient is denoted as U′(A_t), which is the first derivative of the U(A_t) function.
U′(A_t) can be written as in Equation (5): The mean of the gradient, U′(A_t), within the range can be formulated as in Equation (6): The y-intercept of U(A_t), denoted as U(A_t = 0), can be written as: Hence, U(A_t) can be simplified into: Thus, U(∆A_t) can be defined as: since the two expressions have been collapsed into one equation. At this point, we can define our optimization problem as: where (10b-e) are the constraints of our optimization problem. Constraint (10b) ensures the signs of U(∆A_t), B_t and H_t are not inverted. Meanwhile, constraint (10c) controls the ratio of ∑_{t=1}^T B_t to ∑_{t=1}^T H_t. Furthermore, constraint (10d) ensures the w_1 U(∆A_t) gain will not be outweighed by the (w_2 B_t + w_3 H_t) penalties when ∆A_t is as small as 0.01. Lastly, constraint (10e) ensures ∑_{t=1}^T w_1 U(∆A_t) is at least Ω_3 greater than the penalty terms. Note that (10c-e) are additional constraints that are not imposed in the original FedMarl optimization problem.
In FedMarl, w_1, w_2 and w_3 are treated as hyperparameters. FL engineers have to manually adjust the weightage of each objective until the desired outcome is achieved. However, the weightages w_1, w_2 and w_3 do not directly translate to the weightage of each objective ∑_{t=1}^T U(∆A_t), ∑_{t=1}^T B_t and ∑_{t=1}^T H_t. For instance, the ratio of w_2 B_t to w_3 H_t does not directly equate to the ratio of w_2 ∑_{t=1}^T B_t to w_3 ∑_{t=1}^T H_t. This is because the values of H_t and B_t are instantaneous and stochastic, which means that the ratio of w_3 H_t to w_2 B_t at two different rounds t is most likely different. On the other hand, E[w_3 ∑_{t=1}^T H_t] and E[w_2 ∑_{t=1}^T B_t] are more consistent, so taking the ratio of these two expectations is more reliable.
We can find the best w_1, w_2 and w_3 by setting the desired Ω_1, Ω_2 and Ω_3. We set Ω_1 = 0.2, Ω_2 = 0.3 and Ω_3 = 1.0. We needed to run one iteration of FL using FedAvg to obtain the trace values of ∆A_t, B_t and H_t for t ∈ T, since these values are dependent on the target IoT environment setup. Based on the trace values, we could follow Algorithm 1 to acquire suitable w_1, w_2 and w_3. In our experiment setup, we found the desired hyperparameters to be (w_1 = 2.9, w_2 = 0.1 and w_3 = 0.2).

Algorithm 1 Search for the best w_1, w_2 and w_3

Input: Trace values of ∆A_t, B_t and H_t; desired Ω_1, Ω_2 and Ω_3
Output: The best w_1, w_2, w_3
1: Run one complete iteration of FedAvg with T = 15 communication rounds and record the trace values of ∆A_t, B_t and H_t for t ∈ T.
2: Initialize an empty set cache = {} to store all candidate (w_1, w_2, w_3) combinations.
3: for each candidate combination (w_1, w_2, w_3) do
4:     if constraints (10c-e) are satisfied:
5:         Compute R from the trace values and add ((w_1, w_2, w_3), R) to cache.
6: From cache, find out which combination of (w_1, w_2, w_3) results in the smallest R. This can be treated as finding the worst-case max E[R].
7: return w_1, w_2, w_3

Proposed Method
We propose FedDdrl, which exploits two DRL policy networks for FL optimization. Specifically, we adopted value decomposition networks (VDNs) as the DRL policy networks for FedDdrl. We elaborate in detail on how we formulated the problem as an MDP, including the design of the state space, action space and reward of the algorithm. In addition, we also exploited the recently proposed balanced-MixUp to mitigate the impact of weight divergence and speed up FL convergence.

Deep Reinforcement Learning for Federated Learning Optimization
The proposed optimization problem in Equation (10a-e) is a 0-1 multidimensional knapsack problem (MKP). The items to be put into the knapsack are the client devices n with complete training latency H^c_{t,n}, model uploading latency H^u_{t,n}, communication cost B^t_n and data size D_n. The total capacity of the knapsack equals the total communication cost B_t = ∑_{n=1}^N B^t_n a^t_n, where a^t_n is the binary indicator of item (client) n. When a^t_n is set to 1, item (client) n is selected; otherwise, a^t_n is set to 0. The total weight of the knapsack has a lower bound which has to fulfil the minimum accuracy requirement of constraint (10d). Our goal is to select a subset of C clients (1 < C ≤ N) for complete training in each communication round to maximize the total accuracy gain ∑_{t=1}^T ∆A_t while minimizing the total latency ∑_{t=1}^T H_t and total communication cost ∑_{t=1}^T B_t of the entire FL training. Thus, the proposed optimization problem is NP-hard.
To solve problem (10), our FedDdrl algorithm adopted a double DRL framework. Specifically, we formulated the DRL policy network for both tasks using a multi-agent reinforcement learning (MARL) approach. In particular, the VDN has proven itself in the recent literature [33] to be a promising candidate for MARL problems. A VDN network consists of N agents, in which each agent n ∈ N uses a deep neural network (DNN) parametrized with θ to implement the Q-function Q_θ(s, a) = E[R_t | s = s^t_n, a = a^t_n]. At each timestep t, each agent n observes its state s^t_n and selects the optimal action a^t_n with the maximum Q-value. Let s_t = [s^t_n] and a_t = [a^t_n] represent the states and actions collected from all agents n ∈ N at timestep t, respectively. The joint Q-function Q_tot(·) of the multi-agent VDN system is the elementwise summation of all the individual Q-functions, i.e., Q_tot(s_t, a_t) = ∑_n Q_θ(s^t_n, a^t_n). In FedDdrl, we set each agent in both VDNs as a simple two-layer multi-layer perceptron (MLP), which is cheap to implement. All MLPs in each VDN share their weights to prevent the lazy agent problem [33].
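The weight-shared agents and the additive joint Q-function can be sketched as follows (an illustrative numpy version with hypothetical layer sizes; a real implementation would train the shared weights with Q-learning):

```python
import numpy as np

class SharedMLPAgent:
    """Two-layer MLP Q-network whose weights are shared by all N VDN agents."""

    def __init__(self, state_dim, n_actions, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (state_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, n_actions))
        self.b2 = np.zeros(n_actions)

    def q_values(self, s):
        h = np.maximum(0.0, s @ self.W1 + self.b1)  # ReLU hidden layer
        return h @ self.W2 + self.b2                # Q(s, a) for each action

def q_tot(agent, states, actions):
    """VDN joint Q: Q_tot(s_t, a_t) = sum_n Q_theta(s_n^t, a_n^t)."""
    return float(sum(agent.q_values(s)[a] for s, a in zip(states, actions)))

def greedy_actions(agent, states):
    """Each agent independently picks the action with the maximum Q-value."""
    return [int(np.argmax(agent.q_values(s))) for s in states]
```

Because Q_tot is a plain sum, each agent's independent greedy choice also maximizes the joint Q-value, which is what makes VDN decentralizable at execution time.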
As illustrated in Figure 6, the first VDN network takes the client states s_t (which will be detailed later) to obtain the optimal client termination matrix φ_t. The second VDN network takes the same client states s_t to obtain the optimal local training epoch per client E_t.

Early Client Termination
Inspired by the work of FedMarl [14], the first VDN network was employed to learn the optimal policy for early client termination matrix φ t at round t. We reformulated the problem as an MDP with the following state, action and reward to train a VDN network with N = 10 agents.

1.
State s_t: The state s_t = [s^t_n] consisted of the client states of each VDN agent. The state of each agent n consisted of six components: (i) the probing loss L^t_n, (ii) the probing training latency H^p_{t,n}, (iii) the model uploading latency H^u_{t,n}, (iv) the client-to-server communication cost B^t_n, (v) the local training dataset size |D_n| and (vi) the current communication round index t. The state vector for agent n can be written as Equation (11): It is noteworthy that since each agent in the VDN only has access to its own local observation instead of the fully observed environment, the policy has to incorporate past agent observations from history [33]. Thus, the historical values of probing latencies

2.
Action φ t : Action φ t = φ t n comprised the client termination decision for each VDN agent. The action space for client termination was φ t n = {0, 1}, where 0 indicates the termination of the client and 1 indicates the client is selected for complete training.

3.
Reward r_t^1 : A vanilla reward for VDN 1, denoted as r_t^1, can be adopted from the FL optimization problem as described in Equation (12):

r_t^1 = w_1 U(ΔA_t) − (w_2 B_t + w_3 H_t), (12)

where the system is rewarded with the accuracy improvement ΔA_t and penalized for B_t and H_t. However, Equation (12) has one obvious limitation: as ΔA_t → 0, the reward degenerates to r_t^1 ≈ −(w_2 B_t + w_3 H_t). This causes the optimization problem to diverge from improving accuracy under the constraints of B_t and H_t to merely reducing B_t and H_t. To show the severity of this problem, we trained the VDN agents using r_t^1 as defined by Equation (12). Let E[w_1 U(ΔA_t)] and E[w_2 B_t + w_3 H_t] denote the expected values of the accuracy-improvement reward and the penalties, respectively. For the MNIST dataset, the expected values of both components for the last R = 5 communication rounds can be computed as in Equations (13) and (14). It is observed that E[w_1 U(ΔA_t)] ≪ E[w_2 B_t + w_3 H_t] for the last five communication rounds. This is because as training approaches the end, the accuracy improvement is often smaller than in the earlier stage. Consequently, the VDN agents start to terminate more clients from complete training, giving way to the reduction of B_t and H_t. To make sure the agents stay motivated to learn even when ΔA_t → 0, we introduce a bias term b into r_t^1. Let b = (3/10) E[w_2 B_t + w_3 H_t]. Hence, the reward function r_t^1 can be reformulated as shown in Equation (15):

r_t^1 = w_1 U(ΔA_t) + b − (w_2 B_t + w_3 H_t) if ΔA_t > 0, and r_t^1 = w_1 U(ΔA_t) − (w_2 B_t + w_3 H_t) otherwise. (15)

Note that we only added the bias term b to the reward when ΔA_t > 0 since it is intended to encourage accuracy improvement. We did not subtract b when ΔA_t ≤ 0 since the penalty terms are sufficient to penalize inferior actions.
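The biased reward of VDN 1 can be sketched as a small function. This is a hedged illustration: the function name `reward_vdn1` is ours, the utility U is left as an identity placeholder, the weights default to the paper's Setting B values (w_1 = 2.9, w_2 = 0.1, w_3 = 0.2), and the observed penalty values used to estimate b are hypothetical.

```python
def reward_vdn1(delta_acc, b_t, h_t, bias, w1=2.9, w2=0.1, w3=0.2,
                utility=lambda x: x):
    """Sketch of Equation (15): the bias b is added only when accuracy
    improves (delta_acc > 0), keeping agents motivated near convergence;
    when delta_acc <= 0, the cost penalties alone discourage the action."""
    r = w1 * utility(delta_acc) - (w2 * b_t + w3 * h_t)
    if delta_acc > 0:
        r += bias
    return r

# b = (3/10) * E[w2*Bt + w3*Ht], estimated from observed penalty values
# (the five numbers below are hypothetical placeholders).
observed_penalties = [0.30, 0.28, 0.31, 0.29, 0.32]
b = 0.3 * sum(observed_penalties) / len(observed_penalties)
```

With this shape, a round with even a tiny positive accuracy gain earns strictly more reward than a stagnant round with the same costs, which is exactly the incentive the bias term is meant to restore.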

Local Epoch Adjustments
The second DRL network was employed to learn the optimal policy for local epoch adjustments E t at round t. A VDN algorithm with N = 10 agents can be formulated by defining the state, action and reward as follows:

1.
State s t : The second VDN shared the same state in Equation (11) since both VDNs required the same local observation for decision making.

2.
Action E t : Action E t = E t n comprised the local epoch counts for each VDN agent. The action space was E t n = {3, 5, 7}. This action aims to exploit client devices with stronger computation power for more training epochs and vice versa.

3.
Reward r_t^2 : We adopted Equation (15) as the starting point for the reward function of VDN 2. However, the communication cost B_t was not part of the optimization objectives of VDN 2, since local epoch adjustment is only bound by the H_t constraint. Hence, the reward function r_t^2 for this VDN network can ignore the B_t penalty, and r_t^2 can be defined as in Equation (16):

r_t^2 = w_1 U(ΔA_t) + b − w_3 H_t if ΔA_t > 0, and r_t^2 = w_1 U(ΔA_t) − w_3 H_t otherwise, (16)

where we used the same bias term b from Equation (15) for simplicity's sake.
As the training converges, VDN 1 will deliver the optimal client selection, and VDN 2 will impart the optimal local epoch number for each client. The overall algorithm for solving the problem in Equation (10a-e) is summarized in Algorithm 2.

1: Input: Initialize VDN 1 Q_tot^1 and its target network Q̄_tot^1 for the client termination φ_t policy; initialize VDN 2 Q_tot^2 and its target network Q̄_tot^2 for the local epoch adjustment E_t policy
2: Output: Trained Q_tot^1 and Q_tot^2 networks
3: Set ε = 1.0
4: for episode n_ep = 1, 2, . . . , N_ep do
5:   Reset the FL environment
6:   Initialize a global model w_0
7:   for communication round t = 1, 2, . . . , T do
8:     Randomly select N clients from all K clients
9:     Broadcast the global model w_t to each selected client
10:    for each client n ∈ N in parallel do
11:      w_t^n ← w_t  ▷ Copy the global model as each client model
12:      Update the client model w_t^n using the local training dataset D_n
13:      Upload client state s_t^n to the FL server
14:    end for
15:    Each agent n in VDN 1 selects the optimal action φ*_{t,n} = argmax Q_n^1(s_t^n, φ_t^n) with (1 − ε) × 100% probability, else outputs a random action
16:    Each agent n in VDN 2 selects the optimal action E*_{t,n} = argmax Q_n^2(s_t^n, E_t^n) with (1 − ε) × 100% probability, else outputs a random action
17:    Send actions φ*_{t,n} and E*_{t,n} to each client n ∈ N
18:    for each client n ∈ N in parallel do
19:      if φ*_{t,n} = 1 then
20:        Continue updating w_t^n using D_n until E*_{t,n} epochs are reached
21:        Return the updated w_t^n
22:      end if
23:    end for
24:    Rewards r_t^1 and r_t^2 are given to Q_tot^1 and Q_tot^2 based on ΔA_t, B_t, H_t
27:    Store transition (s_t, φ_t, r_t^1) for Q_tot^1 into memory buffer 1
28:    Store transition (s_t, E_t, r_t^2) for Q_tot^2 into memory buffer 2
29:    Sample mini-batches of size n_b from the memory buffers to train Q_tot^1, Q_tot^2 and update Q̄_tot^1, Q̄_tot^2
30:    Decay ε gradually from 1.0 to 0.1
31:  end for
32: end for
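The dual action selection in lines 15–17 of the algorithm can be sketched as follows. This is a simplified illustration under stated assumptions: `epsilon_greedy`, `select_dual_actions` and `decay_epsilon` are our names, the decay rate 0.94 is an arbitrary choice, and the Q values are given rather than computed by the VDN networks.

```python
import numpy as np

rng = np.random.default_rng(1)

EPOCH_CHOICES = (3, 5, 7)  # action space of VDN 2

def epsilon_greedy(q_values, epsilon):
    """Lines 15-16: argmax with probability (1 - epsilon), otherwise a
    uniformly random action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def select_dual_actions(q1_per_agent, q2_per_agent, epsilon):
    """One decision step of the dual controller: VDN 1 outputs a
    keep/terminate flag per client, VDN 2 an epoch count per client."""
    keep = [epsilon_greedy(q, epsilon) for q in q1_per_agent]
    epochs = [EPOCH_CHOICES[epsilon_greedy(q, epsilon)] for q in q2_per_agent]
    return keep, epochs

def decay_epsilon(eps, rate=0.94, floor=0.1):
    """Line 30: decay epsilon gradually from 1.0 towards 0.1."""
    return max(floor, eps * rate)

# Fully greedy example (epsilon = 0): every agent picks its argmax action.
q1 = [[0.2, 0.8]] * 10       # Q values favour "keep" (action 1)
q2 = [[0.1, 0.7, 0.3]] * 10  # Q values favour 5 local epochs (index 1)
keep, epochs = select_dual_actions(q1, q2, epsilon=0.0)
```

The ε schedule means early episodes explore both termination and epoch choices almost uniformly, while late episodes mostly exploit the learned Q values.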

Balanced-MixUp to Mitigate Weight Divergence
In this study, we focused on the non-IID label shift, where the client dataset is heavily skewed to one of the label classes. The huge EMD between the client data distribution p k and the global distribution p will contribute to weight divergence, deteriorating the training efficiency of FL. Thus, the highly imbalanced client dataset has to be handled wisely. MixUp [34] is a simple yet effective data augmentation technique that could shed some light on this problem.
MixUp extends the training data distribution by linearly interpolating between existing data points, filling the underpopulated areas in the data space. It generates synthetic training data (x̂, ŷ) by simply taking the weighted combination of two random data pairs, (x_i, y_i) and (x_j, y_j), as shown in Equations (17) and (18):

x̂ = λx_i + (1 − λ)x_j, (17)

ŷ = λy_i + (1 − λ)y_j, (18)

where λ ~ Beta(α, α), with α > 0. Despite its simplicity, MixUp has been proven to improve model calibration and generalization [35]. Thus, its application has expanded from image and speech classification tasks [34] to other domains, including image segmentation [36] and natural language processing [37,38]. However, a vanilla MixUp works poorly on highly imbalanced datasets [39]: since the sampling is done randomly, MixUp would end up drawing the data pairs (x_i, y_i) and (x_j, y_j) from the same class most of the time. Some recent studies focused on solving the data imbalance problem for MixUp. In Remix [39], x_i and x_j are mixed in the same fashion as MixUp, but y_i and y_j are mixed such that the minority class is assigned a higher weight. This method pushes the decision boundaries away from the minority class, balancing the generalization error between the majority and minority classes. Balanced-MixUp is another variation of MixUp, which combines MixUp with a data-resampling technique to achieve a balanced class distribution [22]. Specifically, balanced-MixUp combines instance-based sampling and class-based sampling for the majority and minority classes, respectively. This ensures that each data pair (x_i, y_i) and (x_j, y_j) always consists of instances from both the majority and minority classes.
We adopted balanced-MixUp as the augmentation technique in our solution to address the class imbalance problem. To the best of our knowledge, we are the first to integrate balanced-MixUp into FL for weight divergence mitigation. Let (x_M, y_M) and (x_m, y_m) denote the instance pairs sampled from the majority and minority classes, respectively. Balanced-MixUp can be expressed as shown in Equations (19) and (20):

x̂ = λx_M + (1 − λ)x_m, (19)

ŷ = λy_M + (1 − λ)y_m. (20)

Balanced-MixUp guarantees that each data pair mixing consists of instances from both the majority and minority classes. Unlike the original balanced-MixUp, where λ ~ Beta(1, α), we adopted λ ~ Beta(α, α) and found it to work better in our study. The best α value may differ depending on the dataset, as detailed in the results section.
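Equations (19) and (20) can be sketched as a short function. This is an illustrative sketch, not the paper's code: the function name `balanced_mixup` is ours, and the toy 4-dimensional "images" and one-hot labels stand in for real training data. It uses λ ~ Beta(α, α) as adopted in this paper rather than Beta(1, α) from the original balanced-MixUp.

```python
import numpy as np

rng = np.random.default_rng(42)

def balanced_mixup(x_major, y_major, x_minor, y_minor, alpha):
    """Eqs. (19)-(20): mix a majority-class instance with a
    class-sampled minority instance using lambda ~ Beta(alpha, alpha),
    so every synthetic example contains both classes."""
    lam = rng.beta(alpha, alpha)
    x_hat = lam * x_major + (1.0 - lam) * x_minor
    y_hat = lam * y_major + (1.0 - lam) * y_minor
    return x_hat, y_hat

# Toy usage: 4-dim "images", one-hot labels over 3 classes.
x_M, y_M = np.ones(4), np.array([1.0, 0.0, 0.0])   # majority class 0
x_m, y_m = np.zeros(4), np.array([0.0, 1.0, 0.0])  # minority class 1
x_hat, y_hat = balanced_mixup(x_M, y_M, x_m, y_m, alpha=0.4)
```

Note that the mixed label ŷ remains a valid probability vector (it sums to one), so the standard cross-entropy loss can be applied unchanged.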

Simulation Results
This study adopted TensorFlow as the deep learning platform. We adopted three datasets for FL benchmarking: MNIST, CIFAR-10 and CrisisIBD [40]. First, MNIST is a relatively simple task under most non-IID settings; it is mainly used to prove that a novel FL algorithm is at least working. In contrast, CIFAR-10 is a challenging dataset in non-IID settings, which is strongly encouraged to be included in FL benchmarking experiments [32]. On the other hand, CrisisIBD is a benchmark for various real-world disaster-related image classification tasks; in this study, we adopted its disaster classification dataset (hereinafter referred to as CrisisIBD) as one of our benchmark datasets. We adopted balanced-MixUp to augment all three datasets, and found the best α values to be 0.05, 0.4 and 0.2 for MNIST, CIFAR-10 and CrisisIBD, respectively. We used all three datasets to train a lightweight MobileNetV2 [41], which aligned with our goal of developing the FL framework for low-powered IoT devices. All clients adopted B_t^n = 1 ∀n, t, similar to the setting in [14]. In this study, we focused on the non-IID label shift as demonstrated in [14,21]. Similarly, we divided each dataset into K clients. For each client, a fraction σ = 0.8 of the local training dataset was sampled from one random label (the majority label), while the rest of the training data were sampled uniformly from the remaining labels (the minority labels). We compared our proposed method with FedAvg, FedProx and FedMarl. The first two algorithms were shown to be robust baselines under non-IID label shift [32] and are prebuilt in many existing FL frameworks, including TensorFlow Federated [42] and Intel OpenFL [8]. On the other hand, FedMarl is one of the state-of-the-art FL algorithms [14]. Our FedDdrl aims to outperform all three algorithms. The hyperparameters are listed in Table 3.
Due to limited resources, we only had two hardware devices: (i) an Intel NUC with an Intel core i7-10710U processor with 4.70 GHz and (ii) a workstation equipped with an Intel core i7-10875H processor with 2.30 GHz and NVIDIA RTX 2070 SUPER. In total, this yielded two TensorFlow CPU operators and one TensorFlow GPU operator. However, this was far from enough to simulate a heterogeneous FL environment with K = 100 clients if each operator only represents one client. Thus, we carefully devised our experimental setup, as shown in Figure 7.
To simulate an FL environment with K = 100 clients (IoT devices) and C = 10 selected clients (before early termination), we set up the experiment as shown below:

1. We created K = 100 client configurations, each consisting of the (i) client's computing latency per data, (ii) model upload latency and (iii) local dataset identity (ID) number. To closely simulate the heterogeneity of resources in an IoT network as in [14], the computing latency per data in each client configuration can be any of {0.25, 0.50, 0.75} seconds, while the model upload latency can be any of {1.00, 1.25, 1.75, 2.00} seconds.
2. CPU 1, CPU 2 and GPU simulated three, three and four clients, respectively. The simulated clients represent the C = 10 randomly selected clients from the total K = 100 clients in each communication round t.
3. In each communication round t, 10 client configurations were randomly sampled from the configuration pool. The 10 simulated clients (in CPU 1, CPU 2 and GPU) were configured according to the selected client configurations. This process is equivalent to the FL process of randomly selecting 10 clients with unique local datasets and resources.
4. After step (3), each simulated client proceeded with its training. If the FL algorithm was FedAvg or FedProx, all 10 simulated clients underwent complete training of E_t = 5 local epochs. On the contrary, if the FL algorithm was FedMarl or FedDdrl, only the simulated clients that were not terminated by FedMarl/FedDdrl completed their local training based on E*_{t,n} = argmax Q_n^2(s_t^n, E_t^n) from VDN 2.
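Steps (1) and (3) of the setup above can be sketched as follows. This is a hedged illustration of the sampling procedure, not the authors' code: the dictionary keys and the function name `sample_round` are our own, while the latency pools and the K = 100 / C = 10 sizes come from the text.

```python
import random

random.seed(0)

COMPUTE_LATENCY = (0.25, 0.50, 0.75)        # seconds per data point
UPLOAD_LATENCY = (1.00, 1.25, 1.75, 2.00)   # seconds per model upload

# Step 1: K = 100 client configurations with heterogeneous resources.
configs = [
    {
        "compute_latency": random.choice(COMPUTE_LATENCY),
        "upload_latency": random.choice(UPLOAD_LATENCY),
        "dataset_id": k,  # each client owns a unique local dataset
    }
    for k in range(100)
]

def sample_round(configs, c=10):
    """Step 3: each round, draw C = 10 configurations without
    replacement and assign them to the simulated clients."""
    return random.sample(configs, c)

selected = sample_round(configs)
```

Sampling without replacement per round is what lets three physical machines emulate a 100-client population: each of the 10 simulated clients simply adopts a different configuration every round.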

Results and Ablation Study
We compared the performance of FedDdrl with other baselines in all three objectives of the optimization problem, which are the (i) model accuracy, (ii) training latency and (iii) communication efficiency. Table 4 shows the model accuracy trained using each FL setting after T = 15 communication rounds. We also conducted an ablation study showing how FedDdrl improves beyond FedMarl. Setting A in Table 4 was our implementation of FedMarl with the original hyperparameters (w 1 = 1.0, w 2 = 0.1, w 3 = 0.2). In Setting B, we proved that FedMarl performance could be improved using our hyperparameters (w 1 = 2.9, w 2 = 0.1, w 3 = 0.2). However, we found that its accuracy was still far behind both FedAvg and FedProx. This is because MobileNetV2 is a lightweight model which is easy to overfit [41]. In FedAvg and FedProx, the total number of clients selected for training is always C = 10 in each communication round t. During aggregation, there is a sufficient amount of client models overfitted for different classes, which, when aggregated, can generate a regularization effect, thus mitigating the weight divergence caused by overfitting. This is not the case for FedMarl, which does not have a fixed C for each round. We argue that since the original FedMarl was not tested on MobileNetV2, it was able to perform better than FedAvg and FedProx. Applying balanced-MixUp could significantly mitigate this problem, as shown in Setting C. The reasoning on how balanced-MixUp helps weight divergence mitigation is detailed in Section 5.5.

Model Accuracy
Setting D was our FedDdrl, where we added another VDN for local epoch adjustment. FedDdrl allows client devices to train for more epochs when required and vice versa. This allows FedDdrl to converge faster than FedAvg and FedProx for most cases, even when it is not utilizing all clients at each communication round. FedDdrl outperformed other FL algorithms in both the challenging CIFAR-10 and CrisisIBD datasets, and it was the second-best for MNIST. We suspect FedDdrl is slightly overengineered for an easy task like MNIST. Nevertheless, it was still very robust considering that real-world data is often not as simple as MNIST and is instead more challenging like the CIFAR-10 and CrisisIBD datasets. Figure 8 shows the normalized training latency for each FL algorithm (with balanced-MixUp) on all three datasets. Our FedDdrl outperformed all three other algorithms in all datasets. This is promising since FedDdrl allows dynamic local adjustment. The FedDdrl will sometimes increase the local epoch from five to seven. However, the extra training latency is balanced when FedDdrl decreases the local epoch from five to three, especially in the early communication round when the MobileNetV2 is still learning lower-level features. FedMarl followed closely behind FedDdrl. This is mainly because both FedDdrl and FedMarl can terminate clients with longer probing latencies. On the other hand, FedAvg had a moderate performance in terms of training latency. It was not as fast as FedMarl and FedDdrl, but it was still significantly faster than FedProx. As expected, FedProx had the longest training latency compared to other FL algorithms, which is aligned with the observation by [32]. This is mainly due to the extra computing cost required to compute the L2 distance between the client and the global model. Hence, applying FedProx in low-powered devices (i.e., IoT devices) with limited computation power is not feasible.   
Figure 9 shows the normalized communication costs of all FL algorithms on the three datasets. FedDdrl and FedMarl were significantly more efficient than FedAvg and FedProx in total communication costs. There was no clear winner between FedDdrl and FedMarl regarding communication efficiency. However, FedDdrl outperformed FedMarl on the CIFAR-10 and CrisisIBD datasets, which are significantly harder tasks than MNIST. Thus, we argue that FedDdrl is the best algorithm. On the other hand, FedAvg and FedProx have a fixed number of clients selected in each round. Since we assume B_t^n = 1 ∀n, t, both algorithms have the same total communication costs.

Strategy Learned by FedDdrl
In this section, we analyze the strategies learned by the FedDdrl algorithm to fully utilize its computing resources while reducing communication costs. Figures 10 and 11 show the early client termination strategy learned by FedDdrl. Blue dots indicate that the client was chosen for complete training, while red dots indicate early client termination.
First, FedDdrl generally picked fewer clients for complete training in the early phase of FL training. From Figures 10 and 11, it can be seen that only three clients were selected for complete training in the first two communication rounds t = {1, 2}. This is because DNNs usually learn low-complexity features before learning higher-complexity features. The former are more robust to noise [43] and can be learned with less data [14]. This allows FedDdrl to reduce communication costs by terminating most clients from training in the early phase, when MobileNetV2 is still learning low-level features. From round t = 3 to t = 6, the number of clients that underwent complete training increased from 6 to 10, indicating that MobileNetV2 was beginning to learn higher-level features that require more training data. For the remaining rounds, the number of selected clients was roughly five. Second, FedDdrl preferred clients with a lower probing loss for complete training, which is aligned with the findings in FedMarl [14]. Third, FedDdrl tended to pick clients with shorter probing latency for complete training to reduce the total latency of FL training.
As mentioned earlier, conventional FL training sets the same local epoch E_t^n for all clients, disregarding their computing resources. Hence, one of the contributions of FedDdrl is to learn the optimal strategy to adjust the local epoch count for each client dynamically. In Figure 12, we plotted the local epoch count corresponding to each selected client from the scenario in Figure 11. Bigger dots indicate that a higher local epoch count was assigned to the corresponding client. It was found that FedDdrl tended to set a lower epoch count (smaller dots) for clients with higher probing latency. This could reduce the total training latency since clients with limited computing power did not have to participate in long training epochs. On the other hand, clients with lower probing latency tended to be given a higher epoch count. This strategy can fully utilize the computing resources of clients with stronger computing power, since they can continue training while waiting for other clients to finish. However, this was not always the case, as shown in Figure 13. On certain occasions, FedDdrl set a high epoch count for clients with long probing latency if the data in these clients were crucial for FL convergence. In any case, FedDdrl was superior to FedMarl, where FedMarl could only either select or terminate a client, without the third option of selecting a client and dynamically tuning its local epoch.
Figure 10. Clients with high probing latency were assigned smaller epoch counts so that the clients could finish local training earlier.

How FedDdrl Optimizes the Three Objectives Simultaneously
The objectives of FedDdrl are to (i) maximize the global model's accuracy while minimizing the (ii) FL system's training latency and (iii) communication cost. First, VDN 1 performs early client termination to terminate clients who are not essential for training, which reduces the total communication cost. Additionally, VDN 1 prefers clients with lower probing latency (which also translates to lower training latency). Thus, VDN 1 plays a huge role in reducing both communication costs and training latency. Second, VDN 2 dynamically adjusts the local epoch count. Clients with limited computing power only have to train for fewer epochs, so they can finish training earlier, while clients with stronger computing power are assigned a higher epoch count so that their computing resources are fully utilized. Lastly, the global model's accuracy must be retained. When VDN 1 performs client termination, it is essentially reducing C. Intuitively, reducing C seems counter-productive since it reduces the total number of clients participating in training in each communication round t, and fewer clients translate to less training data. However, [44] showed that in a non-IID setup, the convergence rate of FedAvg has only a weak dependence on C. This makes sense, as some clients may have local datasets with a huge EMD from the global distribution; training the FL model using these clients may hinder the convergence rate. Additionally, [14] showed that using DRL in client selection (or early client termination) can positively affect the convergence rate. This is because selecting useful clients (with useful data) can improve the quality of the overall FL data, which is more crucial than increasing the quantity of data.
In short, FedDdrl can reduce communication cost and training latency without sacrificing model accuracy via early client termination due to the weak correlation between convergence rate and C.

Computational Complexity Analysis
FedDdrl is composed of a finite number of MLPs. For an MLP, let L, n_0 and n_i denote the number of layers, the size of the input layer (which corresponds to the client state's size) and the number of neurons in the i-th layer, respectively. During training, the computational complexity for an MLP to update its weights in each step can be expressed as O(N_b(n_0 n_1 + ∑_{i=1}^{L−1} n_i n_{i+1})) [45]. In total, it takes N_ep × T steps for the FedDdrl algorithm to finish training. Hence, the total training computational complexity of FedDdrl is O(N_ep T N_b(n_0 n_1 + ∑_{i=1}^{L−1} n_i n_{i+1})). This computationally heavy training can be performed offline on a powerful device (i.e., the FL server). In online deployment mode, the computational complexity of each step is dramatically reduced to O(n_0 n_1 + ∑_{i=1}^{L−1} n_i n_{i+1}), since the training procedure, which requires feedforward and backpropagation over N_b data points, is cut off. Thus, the computational complexity is retained at a favorable level.
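The complexity terms above can be made concrete with a small counting sketch. The function names and the example layer sizes (6, 32, 2) are our own illustrative assumptions (a two-layer MLP with a six-dimensional client state and a binary action), not the paper's exact architecture.

```python
def mlp_layer_products(layer_sizes):
    """n0*n1 + sum_i n_i*n_{i+1} for layer_sizes = (n0, n1, ..., nL):
    the dominant weight-multiplication count of one forward pass."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def train_step_cost(layer_sizes, n_b):
    """Per-step training cost O(N_b (n0 n1 + sum n_i n_{i+1}))."""
    return n_b * mlp_layer_products(layer_sizes)

def total_train_cost(layer_sizes, n_b, n_ep, t_rounds):
    """Total offline cost O(N_ep T N_b (n0 n1 + sum n_i n_{i+1}))."""
    return n_ep * t_rounds * train_step_cost(layer_sizes, n_b)

def inference_cost(layer_sizes):
    """Online deployment cost: same products, no mini-batch factor."""
    return mlp_layer_products(layer_sizes)

# Illustrative two-layer MLP: 6-dim state, 32 hidden units, 2 actions.
sizes = (6, 32, 2)
step = train_step_cost(sizes, n_b=32)   # 32 * (6*32 + 32*2) = 8192
deploy = inference_cost(sizes)          # 6*32 + 32*2 = 256
```

The gap between `step` and `deploy` (a factor of N_b = 32 here) is exactly the saving the text attributes to cutting off the training procedure at deployment time.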

Why Balanced-MixUp Helps in Federated Learning
Without loss of generality, we explored how balanced-MixUp mitigates weight divergence in FL assuming the amount of training data for each class is uniform in the global population. Under this setting, we can express the global distribution p(y = i) for all labels i = {1, 2, 3, . . . , n_c} as shown in Equation (21):

p(y = i) = 1/n_c. (21)

In this study, a fraction σ of the local training dataset is sampled from one random label, while the remaining 1 − σ fraction is sampled uniformly from the remaining labels. Following this assumption, let i = 1 be the majority class in each client and i = {2, 3, . . . , n_c} be the minority classes (any i can be the majority class since the ordering does not affect the approximation of the client distribution). Without balanced-MixUp, we can express the client dataset distribution p_k(y = i) as Equation (22):

p_k(y = i) = σ if i = 1, and p_k(y = i) = (1 − σ)/(n_c − 1) otherwise. (22)

Based on Equations (21) and (22), the EMD between the local and global distribution for FedAvg without balanced-MixUp, denoted as EMD_ori, can be written as Equation (23):

EMD_ori = |σ − 1/n_c| + |(1 − σ)/(n_c − 1) − 1/n_c|. (23)

On the other hand, the client dataset distribution with balanced-MixUp, p_k^MixUp(y = i), can be expressed as shown in Equation (24):

p_k^MixUp(y = i) = E(λ) if i = 1, and p_k^MixUp(y = i) = (1 − E(λ))/(n_c − 1) otherwise, (24)

where E(λ) is the expected value of λ ~ Beta(α, β), which can be written as Equation (25):

E(λ) = α/(α + β). (25)

Based on Equations (21) and (24), the EMD for FedAvg with balanced-MixUp, denoted as EMD_MixUp, can be written as Equation (26):

EMD_MixUp = |E(λ) − 1/n_c| + |(1 − E(λ))/(n_c − 1) − 1/n_c|. (26)

Take our experiments using CIFAR-10 as an example, where σ = 0.8, n_c = 10, λ ~ Beta(0.4, 0.4) and E(λ) = 0.5. Based on Equations (23) and (26), EMD_ori and EMD_MixUp can be computed as 0.777 and 0.444, respectively. This shows that balanced-MixUp could significantly reduce the weight divergence caused by the EMD between p_k and p.
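The CIFAR-10 example above can be verified numerically. The helper name `emd` is ours; the formula follows the text's EMD (majority-class deviation plus one representative minority-class deviation from the uniform global distribution 1/n_c), which reproduces the stated values 0.777 and 0.444.

```python
def emd(p_major, p_minor, n_classes):
    """EMD as used in the text: the majority-class deviation plus one
    minority-class deviation from the uniform global distribution."""
    uniform = 1.0 / n_classes
    return abs(p_major - uniform) + abs(p_minor - uniform)

sigma, n_c = 0.8, 10          # CIFAR-10 setting in this study
e_lambda = 0.4 / (0.4 + 0.4)  # E(lambda) for Beta(0.4, 0.4), i.e. 0.5

emd_ori = emd(sigma, (1 - sigma) / (n_c - 1), n_c)          # ≈ 0.777
emd_mixup = emd(e_lambda, (1 - e_lambda) / (n_c - 1), n_c)  # ≈ 0.444
```

With E(λ) = 0.5, balanced-MixUp pulls the effective majority-class share from σ = 0.8 down to 0.5, which is why the EMD is almost halved.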
The above proposition is aligned with our experiments. Table 5 shows the performance of FedAvg with and without balanced-MixUp in different C. Balanced-MixUp provided a drastic accuracy boost to FedAvg in all datasets. The accuracy improvement was as high as 17.0% for the CrisisIBD dataset when C = 10. This shows that balanced-MixUp can effectively mitigate the weight divergence caused by the EMD, especially when a low number of clients participate in a communication round t. Note that for the MNIST dataset (C = 10), the accuracy of FedAvg with balanced-MixUp was slightly poorer than its counterpart, lagging behind by merely 1.4%. This is reasonable, as MNIST is considered a simple task [32] in which FedAvg could perform similarly in certain non-IID settings. Another notable observation is that FedAvg with balanced-MixUp significantly outperformed its counterpart in both the challenging CIFAR-10 and CrisisIBD datasets. The observation is consistent in both C = 5 and C = 10 experiments. This is encouraging because it proves that balanced-MixUp is useful in mitigating non-IID label shifts, especially for algorithms like FedMarl and FedDdrl which do not have a fixed value of C.

Conclusions
In this paper, we proposed a DDRL-based FL framework (FedDdrl) for adaptive early client termination and local epoch adjustment. FedDdrl can terminate clients with high probing latency to reduce total training latency and communication costs, and it can automatically adjust the local epoch count to fully utilize clients' computing resources. We also showed that balanced-MixUp is a useful augmentation technique to mitigate the weight divergence arising from non-IID label shifts in FL. The simulation results on MNIST, CIFAR-10 and CrisisIBD confirmed that FedDdrl outperformed the comparison schemes in terms of model accuracy, training latency and communication costs under extreme non-IID settings. As future work, we will explore the performance of FedDdrl on other types of non-IID settings, such as feature distribution skew and quantity skew.