Personalized Federated Multi-Task Learning over Wireless Fading Channels

Abstract: Multi-task learning (MTL) is a paradigm for learning multiple tasks simultaneously through a shared network, on top of which a distinct header network is fine-tuned for each task. Personalized federated learning (PFL) can be achieved through MTL in the context of federated learning (FL), where tasks are distributed across clients, referred to as personalized federated MTL (PF-MTL). Statistical heterogeneity, caused by differences in task complexities across clients and the non-independent and identically distributed (non-i.i.d.) characteristics of local datasets, degrades system performance. To overcome this degradation, we propose FedGradNorm, a distributed dynamic weighting algorithm that balances learning speeds across tasks by normalizing the corresponding gradient norms in PF-MTL. We prove an exponential convergence rate for FedGradNorm. Further, we propose HOTA-FedGradNorm, which combines FedGradNorm with over-the-air (OTA) aggregation in a hierarchical FL (HFL) setting. HOTA-FedGradNorm is designed for efficient communication between the parameter server (PS) and the clients in the power- and bandwidth-limited regime. We conduct experiments with both FedGradNorm and HOTA-FedGradNorm using the multi-task facial landmark (MTFL) and wireless communication system (RadComDynamic) datasets. The results indicate that both frameworks achieve faster training than equal-weighting strategies. In addition, FedGradNorm and HOTA-FedGradNorm compensate for imbalanced datasets across clients and adverse channel effects.


Introduction
Multi-task learning (MTL) is a powerful technique for learning several related tasks simultaneously [1,2]. MTL improves overall system performance, training speed, and data efficiency by leveraging the synergy among multiple related tasks. In the MTL setting, each task has a single common encoder network that maps the raw data into a lower-dimensional shared representation space, in addition to a unique client-specific header network that infers task-related prediction values from the shared representation. MTL is particularly suitable for distributed learning settings where no single entity has all the data and labels for all of the different tasks.
Federated learning (FL) [3] is a distributed learning paradigm in which many clients train a shared model with the assistance of a central unit called the parameter server (PS), keeping their data private and decentralized. Statistical heterogeneity becomes a major challenge for FL as the size of the distributed setting, namely the number of clients or the number of tasks in an FL setting, increases. It is caused by different task complexities and non-i.i.d. data distribution among clients, and degrades the system performance [4,5]. Further, the synergy between tasks may not always be positive, a phenomenon referred to as negative transference, which again degrades performance [6].
Personalized federated learning (PFL) is introduced to deal with statistical heterogeneity: the centralized server and the clients learn a shared representation together, while each client additionally trains its own client-specific header network, referred to as personalization. PFL is capable of learning better user-specific models while also capturing the distilled common knowledge from other clients [7][8][9]. As a result, PFL reduces the statistical heterogeneity among clients. Several PFL approaches have been proposed, in which different local models fit user-specific data while still capturing the common knowledge distilled from the data of other devices [7][8][9][10][11][12][13][14][15]. Hanzely et al. [16] provide a unified framework that covers multiple kinds of PFL methods. In this paper, we consider MTL in an FL setting, enhanced with personalization, namely, personalized federated multi-task learning (PF-MTL).
PFL works that are closely related to our work are federated representation learning FedRep [7] and federated learning with personalization layers FedPer [8]. These works use a shared encoder with a unique task-specific header to enhance personalization in a FL setup. FedRep and FedPer aggregate the common encoder parameters by equal weighting. Our goal in this paper is to perform weighted gradient aggregation at the PS dynamically according to gradient updates coming from clients to overcome statistical heterogeneity.
Several approaches involve setting the weights of tasks manually at the beginning of training [17,18]. Task weights can also be set adaptively at each iteration. GradNorm is a dynamic weighting approach in MTL that normalizes gradient norms by scaling the task loss functions to regulate learning speeds and fairness across tasks [19]. GradNorm was proposed for a centralized learning setting. In our work, we introduce the FedGradNorm framework [20], in which dynamic weighting is incorporated into the PFL setting by combining aspects of [7,8,19]. We provide a theoretical convergence proof for FedGradNorm, whereas FedPer [8] and GradNorm [19] provide no convergence proof, and FedRep [7] provides a convergence proof only for the linear learning model setting.
We investigate how the FedGradNorm framework is affected by the characteristics of the wireless fading communication channel between the clients and the PS. Channel conditions differ across clients since they are geographically spread out. In addition to the wireless channel effects, communication takes place in a bandwidth- and power-limited regime, which raises concerns about communication costs. We utilize over-the-air (OTA) aggregation to perform efficient aggregation over a shared wireless channel, supporting the clients on the same bandwidth by exploiting the additive nature of the wireless multiple access channel (MAC) [21,22]. In practice, when the OTA mechanism is utilized, the gradients are superposed, and the PS cannot recover the individual gradients of the clients. However, the PS needs individual gradients from the clients to perform dynamic weighting. To address this issue, we modify FedGradNorm with OTA in a hierarchical structure, which we call hierarchical over-the-air FedGradNorm (HOTA-FedGradNorm) [23]. Hierarchical federated learning (HFL) establishes clusters of clients around intermediate servers (ISs); the ISs communicate with the PS instead of the clients communicating with the PS directly. Several aspects of HFL have been studied in the literature, such as latency and power analysis [24,25] and resource allocation [26,27]. These works demonstrate the advantage of the proximity of the ISs to the clients in terms of the clients' resource consumption. In addition to these advantages, we utilize the hierarchical structure since it provides an efficient way of combining FedGradNorm with OTA over the wireless fading channel.
The main contributions of our paper can be summarized as follows:
• We propose the FedGradNorm algorithm. The proposed algorithm takes advantage of the GradNorm [19] dynamic weighting strategy in a PFL setup to achieve more effective and fair learning performance when the clients have a diverse set of tasks to perform.
• We propose HOTA-FedGradNorm. The proposed algorithm takes into account the characteristics of the communication channel by defining a hierarchical structure for the PFL setting.
• We provide the convergence analysis for the adaptive weighting strategy for MTL in the PFL setting. Existing works either do not provide convergence analysis or do so only for special cases. We demonstrate that FedGradNorm has an exponential convergence rate.
• We conduct several experiments on our framework using the Multi-Task Facial Landmark (MTFL) dataset [28] and the RadComDynamic dataset from the wireless communication domain [29]. We investigate the changes in task losses during training to compare the learning speed and fairness of FedGradNorm with a similar PFL setting that uses an equal weighting technique, namely FedRep. Experimental results exhibit better and faster learning performance for FedGradNorm than for FedRep. In addition, we demonstrate that HOTA-FedGradNorm results in faster training over the wireless fading channel compared to algorithms with naive static equal weighting strategies, since the dynamic weight selection process takes the channel conditions into account.

Federated Learning (FL)
FL [3] is a distributed machine learning approach that enables training on decentralized data in devices such as smart phones, IoT devices, and so on. FL can be described as an approach that brings the training model to the data, instead of bringing data to the training model [30]. In FL, edge devices collaboratively learn a shared model under the orchestration of a PS without sharing their training data.
The generic form of FL with N clients is

min_ω (1/N) Σ_{i=1}^N p^(i) F^(i)(ω),

where p^(i) is the loss weight for client i such that Σ_{i=1}^N p^(i) = N, and F^(i) is the local loss function for client i.
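The weighted FL objective above can be sketched in a few lines; the function name and loss values below are purely illustrative and not from the paper's codebase.

```python
def weighted_fl_objective(local_losses, p):
    """Global FL objective: (1/N) * sum_i p_i * F_i, with the constraint
    sum_i p_i = N. Equal weighting (p_i = 1) recovers the FedAvg objective.

    local_losses: per-client loss values F^{(i)} (hypothetical numbers)
    p: per-client loss weights
    """
    N = len(local_losses)
    assert abs(sum(p) - N) < 1e-9  # weights must sum to the number of clients
    return sum(pi * Fi for pi, Fi in zip(p, local_losses)) / N

# Equal weighting vs. a dynamic weighting that emphasizes the first client.
losses = [0.9, 0.4, 0.7]
print(weighted_fl_objective(losses, [1.0, 1.0, 1.0]))
print(weighted_fl_objective(losses, [1.5, 0.5, 1.0]))
```

Dynamic weighting changes the objective only through the p^(i); the local losses themselves are untouched.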

Personalized Federated Multi-Task Learning (PF-MTL)
We consider a PFL setting with N clients, in which client i has its own local dataset of size n_i, and T_i is the task of client i, i ∈ [N]. The system model consists of a global representation network q_ω : R^d → R^{d'}, a function parameterized by ω ∈ W that maps data points to a lower-dimensional space of size d'. All clients share the same global representation network, which is synchronized across clients with global aggregation. Client-specific heads q_{h^(i)} : R^{d'} → Y are functions parameterized by h^(i) ∈ H for all clients i ∈ [N], and map from the low-dimensional representation space to the label space Y. The system model is shown in Figure 1. The local model for the ith client is then the composition of the global representation model q_ω and the personalized model q_{h^(i)}, i.e., q_i(·) = (q_{h^(i)} ∘ q_ω)(·), and the local loss for the ith client is denoted F^(i)(h^(i), ω).

Through alternating minimization, the clients and the centralized server aim to learn a set of global parameters ω together, while each client i learns its own set of client-specific parameters h^(i) locally. Specifically, client i performs τ_h local gradient-based updates to optimize h^(i), i ∈ [N], while the global network parameters at client i, i.e., ω^(i), are frozen. Thereafter, client i performs τ_ω local updates to optimize the global shared network parameters, while the parameters corresponding to the client-specific head are frozen. Then, the global shared network parameters {ω^(i)}_{i=1}^N are aggregated at the PS to obtain a common ω. Thus, the problem is

min_ω (1/N) Σ_{i=1}^N p^(i) min_{h^(i)} F^(i)(h^(i), ω).    (4)

FedRep [7] investigates this framework with p^(i) = 1, i ∈ [N].
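The alternating local update (τ_h head steps with the encoder frozen, then τ_ω encoder steps with the head frozen) can be sketched on a toy linear model; all shapes, learning rates, and data below are hypothetical and chosen only to make the sketch runnable.

```python
import numpy as np

def client_round(w, h, X, y, tau_h=5, tau_w=2, lr=0.01):
    """One alternating local round for a single client (toy sketch).

    w: shared encoder, shape (d, d'); h: personal head, shape (d',).
    First tau_h gradient steps on h (w frozen), then tau_w steps on w
    (h frozen), minimizing the local mean-squared error.
    """
    n = len(y)
    for _ in range(tau_h):                          # personalize the head
        err = X @ w @ h - y
        h = h - lr * (X @ w).T @ err / n
    for _ in range(tau_w):                          # then update the encoder
        err = X @ w @ h - y
        w = w - lr * np.outer(X.T @ err, h) / n
    return w, h

# Toy usage on synthetic data (all values hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
w_true, h_true = rng.normal(size=(8, 3)), rng.normal(size=3)
y = X @ w_true @ h_true
w, h = rng.normal(size=(8, 3)), rng.normal(size=3)
mse_before = np.mean((X @ w @ h - y) ** 2)
for _ in range(200):
    w, h = client_round(w, h, X, y)
mse_after = np.mean((X @ w @ h - y) ** 2)
```

In the full scheme, the per-client w would be sent to the PS for aggregation after each round; here a single client is shown in isolation.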

PF-MTL as Bilevel Optimization Problem
The optimization problem in (4) can be rewritten as (5) because F^(i)(h^(i), ω) depends only on h^(i) and ω for all i ∈ [N], and the h^(i) are independent of each other:

min_ω min_{h^(1),...,h^(N)} (1/N) Σ_{i=1}^N p^(i) F^(i)(h^(i), ω).    (5)

Equivalently, we obtain (6). Note that the p^(i) values in (6) are obtained from our proposed FedGradNorm algorithm, which will be derived later as the solution of another optimization problem. As a result, the problem can be expressed as a bilevel optimization problem, i.e., an optimization problem containing another optimization problem as a constraint:

min_{x_u} F(x_u, x_l*)
s.t. C_m(x_u, x_l*) ≤ 0, m = 1, . . . , M,
     x_l* = arg min_{x_l} { g(x_u, x_l) : c_j(x_u, x_l) ≤ 0, j = 1, . . . , J },    (7)

where F(x_u, x_l) represents the upper-level objective function and g(x_u, x_l) represents the lower-level objective function. The {c_j(x_u, x_l) ≤ 0, j = 1, . . . , J} are the constraints for the lower-level optimization problem, while {C_m(x_u, x_l) ≤ 0, m = 1, . . . , M} and the lower-level optimization problem itself are the constraints for the upper-level optimization problem.
Multiple algorithms have been developed to solve bilevel optimization problems in the literature [31]. Different reformulations of the bilevel optimization problem have been made by utilizing the optimality conditions of the lower-level optimization problem to formulate the bilevel optimization problem as a single-level constraint problem [32][33][34]. In addition, there are recently developed gradient-based bilevel optimization algorithms [35][36][37][38][39][40]. Our algorithm is based on iterative differentiation (ITD), as explained in Algorithm 1.
Updates to the upper-level optimization take place in the outer loop, while updates to the lower-level optimization are performed in the inner loop.
For our problem, x_u corresponds to the network parameters (ω, {h^(i)}_{i=1}^N) and x_l corresponds to the loss weights {p^(i)}_{i=1}^N. We use k to denote the outer-loop iteration index in Algorithm 1 for the rest of the paper. Furthermore, the subscript i in p_{k,i} denotes the inner-loop iteration index, while the superscript (i) in p^(i)_k denotes the client index. Then, the bilevel optimization problem in our case can be written as

min_{ω, {h^(i)}} (1/N) Σ_{i=1}^N p^(i) F^(i)(h^(i), ω)
s.t. {p^(i)}_{i=1}^N = arg min F_grad.    (8)

The upper-level objective function is a weighted sum of the local loss functions F^(i)(h^(i), ω), and F_grad is the auxiliary loss function defined by the FedGradNorm algorithm in the next section.
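The inner/outer loop structure of a gradient-based bilevel solver can be sketched as follows. The toy objectives F and g are hypothetical, and this sketch uses only partial gradients; full iterative differentiation (ITD) would additionally differentiate through the inner-loop trajectory.

```python
def itd_bilevel(x0, p0, upper_grad, lower_grad, K=100, T=5, beta=0.1, alpha=0.1):
    """Alternating sketch of a gradient-based bilevel solver.

    Outer loop: gradient steps on the upper-level variable x.
    Inner loop: T gradient steps approximately solving the lower-level
    problem in p before each outer update.
    """
    x, p = x0, p0
    for _ in range(K):
        for _ in range(T):                      # lower-level problem in p
            p = p - alpha * lower_grad(x, p)
        x = x - beta * upper_grad(x, p)         # upper-level step in x
    return x, p

# Hypothetical toy objectives: F(x, p) = (x - 1)^2 + (p - x)^2,
# lower-level g(x, p) = (p - x)^2, whose minimizer is p*(x) = x.
upper = lambda x, p: 2 * (x - 1) - 2 * (p - x)
lower = lambda x, p: 2 * (p - x)
x, p = itd_bilevel(0.0, 5.0, upper, lower)
```

The inner loop tracks the lower-level solution p*(x) = x while the outer loop drives x toward the upper-level optimum; in FedGradNorm the roles are played by the F_grad minimization (inner) and the weighted training loss (outer).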

Hierarchical Federated Learning (HFL) for Wireless Fading Channels
The characteristics of the communication channel should be considered in PF-MTL, since clients can be distributed over a large geographic area in an FL framework [41,42]. The PS can be far away from the clients, and the communication between the PS and the clients can be noisy and subject to channel effects. PF-MTL can therefore be constructed in a hierarchical setting by creating clusters of clients around ISs that communicate with the PS, instead of the clients communicating with the PS directly. Further, the communication can be performed over a shared wireless channel, where the transmission power and the bandwidth are constrained. Thus, we employ over-the-air (OTA) aggregation to address these issues [21,22].
The generic HFL problem shown in Figure 2, with C clusters each containing an IS and N clients, can be formulated as

min_ω (1/(CN)) Σ_{l=1}^C Σ_{i=1}^N p^(l,i) F^(l,i)(ω),

where p^(l,i) is the loss weight for client i in cluster l such that Σ_{i=1}^N p^(l,i) = N, l ∈ [C], and F^(l,i)(·) is the local loss function for client i in cluster l. We consider a PFL setting of N clients within each cluster, in which client i of cluster l has its own local dataset of size n_{l,i}. Within cluster l, T_{l,i} denotes the task of client i, i ∈ [N] and l ∈ [C]. T_{l,i} is assigned from the task set T = {T_1, T_2, . . . , T_N} such that T_{l,i} ≠ T_{l,i'} for i ≠ i' and for any l ∈ [C]. Real-life scenarios might involve the same or very similar tasks for clients within a cluster; we assume the tasks within a cluster are distinct, since no prior information about task similarity is available.
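A minimal sketch of the generic HFL objective above, with hypothetical loss values and weights:

```python
import numpy as np

def hfl_objective(F, p):
    """Generic HFL objective (sketch): C clusters of N clients each.

    F[l, i]: local loss of client i in cluster l (hypothetical values);
    p[l, i]: its loss weight, with the per-cluster constraint
    sum_i p[l, i] = N.
    """
    C, N = F.shape
    assert np.allclose(p.sum(axis=1), N)    # weights sum to N in each cluster
    return float((p * F).sum() / (C * N))

F = np.array([[1.0, 2.0], [3.0, 4.0]])      # C = 2 clusters, N = 2 clients
print(hfl_objective(F, np.ones_like(F)))    # equal weighting
```

The per-cluster constraint mirrors the single-level FL constraint Σ_i p^(i) = N, applied independently inside every cluster.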
We assume that the clients in a cluster have error-free and high-speed connections to the corresponding IS over local area networks (LANs). In addition, the PS and the ISs share a bandwidth-limited wireless fading MAC. Using the wireless fading MAC, each IS sends the corresponding local gradient aggregation within its cluster to the PS. The broadcast from the PS to the ISs is considered error-free.
As in the case of the simple PF-MTL setting illustrated in Figure 1, the system model in Figure 2 is composed of a global representation network q_ω : R^d → R^{d'}, a function parameterized by ω ∈ W that maps data points into a lower-dimensional space of size d'. The same global representation network is shared by all clients in each cluster and is synchronized through global aggregation. A client-specific head q_{h^(l,i)} : R^{d'} → Y maps the low-dimensional representation space to the label space Y. The local model for client i of cluster l is the composition of the client's global representation model q_ω and personalized model q_{h^(l,i)}, written as q_{l,i}(·) = (q_{h^(l,i)} ∘ q_ω)(·). In addition, the local loss for the ith client of cluster l is denoted F^(l,i)(h^(l,i), ω). Using alternating minimization, the PS and the clients learn the global representation ω together, while each client i in cluster l learns its client-specific head h^(l,i) locally, i ∈ [N], l ∈ [C]. Specifically, τ_h local updates are performed by client i in cluster l to optimize h^(l,i) while the global network parameters at client i of cluster l, i.e., ω^(l,i), are frozen. Then, τ_ω local updates are performed to optimize ω^(l,i) while the parameters corresponding to the client-specific head are frozen. Thereafter, the lth IS aggregates {ω^(l,i)}_{i=1}^N, which are sent via LAN, for any l ∈ [C]. The ISs send the cluster aggregations to the PS to perform the global aggregation over the wireless fading MAC. The additive nature of the wireless MAC enables the global aggregation to occur over-the-air. The optimization problem is

min_ω (1/(CN)) Σ_{l=1}^C Σ_{i=1}^N p^(l,i) min_{h^(l,i)} F^(l,i)(h^(l,i), ω).

Algorithm Description
In this section, we present the FedGradNorm algorithm after introducing the definitions and preliminaries. Then, we present the extension of FedGradNorm algorithm for hierarchical structure with OTA.

Definitions and Preliminaries
In FedGradNorm, we aim to learn the dynamic loss weights {p^(i)}_{i=1}^N given in the lower-level optimization problem of (8). The main objective of the algorithm is to dynamically adjust the gradient norms so that the different tasks across clients are trained at similar learning speeds. In the rest of the paper, clients and tasks are used interchangeably, as we assume that each client has its own distinct task. Before describing the algorithm in detail, we first introduce the notation:
• ω̃: a subset of the global shared network parameters, ω̃ ⊂ ω. FedGradNorm is applied on ω̃^(i)_k, the corresponding subset of the global shared network parameters at client i at iteration k; ω̃^(i)_k is generally chosen as the last layer of the global shared network at client i at iteration k.
• G^(i)_ω̃(k): the ℓ2 norm of the gradient of the weighted task loss at client i at iteration k with respect to the chosen parameters ω̃^(i)_k, i.e., G^(i)_ω̃(k) = ||∇_ω̃ p^(i)_k F^(i)_k||_2.
• Ḡ_ω̃(k): the average gradient norm across all clients (tasks) at iteration k.
• F̃^(i)_k = F^(i)_k / F^(i)_0: the inverse training rate of task i at iteration k.
• r^(i)_k: the relative inverse training rate of task i at iteration k, obtained by normalizing F̃^(i)_k by the average of the F̃^(j)_k across tasks.

Additional notation useful in the algorithm description:
• g^(i)_k: the average of the gradient updates at client i at iteration k, where g^(i)_{k,j} is the jth local update of the global shared representation at client i at iteration k.
• h^(i)_{k,j}: the client-specific head parameters h^(i) after the jth local update on the client-specific network of client i at iteration k, j = 1, . . . , τ_h.
• ω^(i)_{k,j}: the global shared network parameters of client i after the jth local update at iteration k, j = 1, . . . , τ_ω. Additionally, ω^(i)_k denotes ω^(i)_{k,τ_ω} for brevity.
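The per-task quantities defined above can be computed in a few lines; the gradients and loss values in the example are hypothetical.

```python
import numpy as np

def gradnorm_stats(grads, losses, losses0):
    """Per-task GradNorm quantities (following the definitions above).

    grads: weighted-loss gradients w.r.t. the chosen layer ω̃ (one per task)
    losses / losses0: current and initial task losses, used to form the
    inverse training rates F̃_i = F_i / F_i(0) and the relative rates r_i.
    """
    G = np.array([np.linalg.norm(g) for g in grads])   # G^{(i)}_ω̃(k)
    G_bar = G.mean()                                   # average gradient norm
    tilde = np.array(losses) / np.array(losses0)       # inverse training rates
    r = tilde / tilde.mean()                           # relative inverse rates
    return G, G_bar, r

# Two hypothetical tasks: task 2 has twice the gradient norm and is training
# more slowly (its loss has dropped less relative to its initial value).
G, G_bar, r = gradnorm_stats(
    [np.array([3.0, 4.0]), np.array([6.0, 8.0])], [0.5, 0.9], [1.0, 1.0])
print(G, G_bar, r)
```

A task with r^(i)_k > 1 is training more slowly than average and will be assigned a larger target gradient norm in the next section.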

FedGradNorm Description
FedGradNorm adjusts gradient magnitudes to balance training rates between different tasks across clients. FedGradNorm is distributed across the clients and the parameter server.
Ḡ_ω̃(k) is used to set a common scale for the gradient magnitudes, while the gradient norms are adjusted according to the relative inverse training rates r^(i)_k. A higher value of r^(i)_k leads to a larger gradient magnitude for task i, which encourages the task to train faster. Each client i sends its inverse training rate F̃^(i)_k at iteration k to the PS so that the PS can construct r^(i)_k. Therefore, given the common scale of gradient magnitudes and the relative inverse training rate, the desired gradient norm of task i at iteration k is calculated as

Ḡ_ω̃(k) × [r^(i)_k]^γ,

where γ represents the strength of the restoring force that pulls tasks back to a common training rate, and can also be considered a metric of task asymmetry across different tasks. In cases where tasks have different learning dynamics, a larger γ provides stronger balancing.
In order to shift the gradient norms towards the desired norms, the loss weights p^(i)_k are updated by minimizing an auxiliary loss function F_grad(k; {p^(i)_k}_{i=1}^N), which is the sum of the ℓ2 distances between the actual and desired gradient norms across all tasks at iteration k, i.e.,

F_grad(k; {p^(i)_k}_{i=1}^N) = Σ_{i=1}^N || G^(i)_ω̃(k) − Ḡ_ω̃(k) × [r^(i)_k]^γ ||_2.

The auxiliary loss function is constructed by the parameter server at each global iteration k using ∇ω̃^(i)_k, the subset of the whole gradient of the global shared network sent by client i at iteration k for the global aggregation.
Next, the parameter server differentiates F_grad(k; {p^(i)_k}_{i=1}^N) with respect to each loss weight p^(i)_k. The weights are updated as

p^(i)_{k+1} = p^(i)_k − α ∇_{p^(i)_k} F_grad,

and are then renormalized so that Σ_{i=1}^N p^(i)_{k+1} = N. Finally, the parameter server obtains the global aggregated gradient g_k = (1/N) Σ_{i=1}^N p^(i)_k g^(i)_k to update the global shared network parameters ω via ω_{k+1} = ω_k − βg_k, and broadcasts the updated parameters to the clients for the next iteration. The overall FedGradNorm algorithm is summarized in Algorithm 2, in which Update(f, h) denotes the generic update of the variable h using the gradient of the function f with respect to h.
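One PS-side weight update can be sketched as below. This is a simplified sketch under two stated assumptions: the norms G_i sent by the clients correspond to *unweighted* task gradients (so the weighted-loss gradient norm is p_i·G_i), squared distances are used for differentiability, and the weights are renormalized to sum to N afterwards.

```python
import numpy as np

def fedgradnorm_ps_step(G, F_tilde, p, gamma=1.5, alpha=0.01):
    """One parameter-server update of the loss weights (toy sketch).

    G: unweighted per-task gradient norms w.r.t. ω̃ (assumption of sketch).
    F_tilde: inverse training rates F^{(i)}_k / F^{(i)}_0.
    p: current loss weights, summing to N.
    """
    N = len(p)
    r = F_tilde / F_tilde.mean()              # relative inverse training rates
    Gw = p * G                                # weighted gradient norms
    target = Gw.mean() * r ** gamma           # desired norms (held constant)
    grad_p = 2.0 * (Gw - target) * G          # dF_grad/dp_i (squared distances)
    p = np.clip(p - alpha * grad_p, 1e-6, None)
    return p * N / p.sum()                    # renormalize so sum(p) = N

# Two tasks at equal training rates, one with a 4x larger gradient norm.
p = fedgradnorm_ps_step(np.array([1.0, 4.0]), np.array([1.0, 1.0]),
                        np.array([1.0, 1.0]))
print(p)  # the task with the larger gradient norm gets a smaller weight
```

The update pushes each weighted norm toward the common target, which is the balancing behavior the algorithm relies on.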

Hierarchical Over-the-Air (HOTA) FedGradNorm
The HOTA-FedGradNorm algorithm is a two-stage version of the FedGradNorm algorithm in HFL setting, which is shown in Figure 2. During the first stage of the algorithm, the learning speeds of the clients are balanced using a dynamic weighting approach for each cluster. FedGradNorm as a dynamic weighting strategy is used jointly with a power allocation scheme to satisfy the total average transmit power constraint and to ensure that the wireless fading MAC between the ISs and the PS is robust against negative channel conditions.

Algorithm 2 Training with FedGradNorm
Initialize ω_0, {p^(i)_0}_{i=1}^N.
for k = 0 to K do
  The parameter server sends the current global shared network parameters ω_k to the clients.
  for each client i ∈ [N] do
    Initialize the global shared network parameters for local updates by ω^(i)_{k,0} ← ω_k.
    Perform τ_h local updates on the client-specific head h^(i), then τ_ω local updates on the shared parameters.
    Send g^(i)_k and the inverse training rate F̃^(i)_k to the parameter server.
  After collecting g^(i)_k, i ∈ [N], the parameter server performs the following operations in order:
  • Updates the loss weights {p^(i)_k}_{i=1}^N by a gradient step on F_grad.
  • Aggregates the gradient for the global shared network by g_k = (1/N) Σ_{i=1}^N p^(i)_k g^(i)_k.
  • Updates the global shared network parameters with the aggregated gradient by ω_{k+1} = ω_k − βg_k.
  • Broadcasts ω_{k+1} to the clients for the next global iteration.
Each client within a cluster sends its gradient for the global model q_ω to its corresponding IS via the LAN, where the channels between each client and the IS are assumed to be error-free inside a cluster. The corresponding IS performs a modified version of FedGradNorm based on the clients' gradients, taking the power allocation scheme into account. Specifically, the IS of cluster l computes the loss weight p^(l,i)_k for each client i ∈ [N] in cluster l via the FedGradNorm algorithm to eventually obtain the local weighted aggregation x^(l)_k, where g^(l,i)_k is the local gradient update of client i in cluster l at iteration k.
The power allocation vector β^(l,i)_k constructed by the IS of cluster l for each client i in the cluster is designed as

β^(l,i)_k(j) = p^(l,i)_k / h^(l)_k(j),  if |h^(l)_k(j)|^2 > H^th_k;  β^(l,i)_k(j) = 0, otherwise,    (14)

where h^(l)_k(j) ~ N(0, σ^2_l) is the channel gain of the jth entry for cluster l at iteration k. The threshold H^th_k is set to satisfy the total average transmit power constraint on

x^(l)_k = Σ_{i=1}^N β^(l,i)_k ⊙ g^(l,i)_k,

where ⊙ represents elementwise multiplication and the expectation is taken over the randomness of the channel gains.
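A sketch of the threshold-based power allocation, with hypothetical channel gains and a scalar loss weight; entries on sub-channels below the threshold are zeroed, which is the sparsification the next paragraphs build on.

```python
import numpy as np

def power_allocation(p_li, h_l, H_th):
    """Threshold power-allocation vector for client i of cluster l (sketch).

    Entry j inverts the channel (p / h) only when the channel power
    |h_l[j]|^2 exceeds the threshold H_th; otherwise the entry is zeroed,
    sparsifying the transmitted gradient on bad sub-channels.
    """
    good = np.abs(h_l) ** 2 > H_th
    beta = np.zeros_like(h_l)
    beta[good] = p_li / h_l[good]
    return beta, good

# Hypothetical channel realization over four gradient entries.
h = np.array([0.1, 1.0, -2.0, 0.5])
beta, mask = power_allocation(2.0, h, H_th=0.3)
print(beta, mask)
```

Channel inversion on the surviving entries means the PS receives the weighted gradient entries directly (up to noise), without needing per-cluster channel equalization.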
From the power allocation scheme in (14), each cluster transmits only the scaled entries of its weighted gradient for which the channel conditions are sufficiently good. Consequently, F_grad is modified according to the power allocation scheme to obtain a power-efficient system design: the per-task gradient norms entering F_grad are computed over the sparsified gradients M^(l) ⊙ ∇ω̃ p^(l,i)_k F^(l,i)_k, where M^(l) ∈ {0, 1}^|ω| is a mask vector designed for the sparsification of cluster l as follows:

M^(l)(j) = 1, if |h^(l)_k(j)|^2 > H^th_k;  M^(l)(j) = 0, otherwise.

Here, r^(l,i)_k is the relative inverse training rate of task i in cluster l at iteration k, and γ represents the strength of the restoring force, as defined for FedGradNorm previously.
The gradient sparsification used during the calculation of F_grad acts as an implicit constraint on the F_grad minimization problem by accounting for the channel conditions. Consequently, it ensures that the learning speeds of the tasks are invariant to the dynamic channel conditions through an appropriate selection of the loss weights. In other words, the implicit constraint of the channel condition preserves the fairness of the learning speed among the clients, as shown in the experimental results.
The second stage of the algorithm involves the process of global aggregation over the wireless fading MAC. The PS obtains a noisy estimate of the aggregated gradient over the wireless fading channel while updating the model parameters. Due to the additive nature of the wireless MAC, the summation of the signals transmitted by the clusters arrives at the PS. The jth entry of the received signal at iteration k, y_k ∈ R^|ω|, is

y_k(j) = Σ_{l ∈ M_k(j)} x^(l)_k(j) + z_k(j),    (19)

where z_k(j) is the jth entry of the Gaussian noise vector z_k and is i.i.d. according to N(0, 1).
M_k(j) = {l ∈ [C] : |h^(l)_k(j)|^2 > H^th_k} represents the set of clusters contributing to the jth entry of the received signal at the kth iteration. M_k(j) is known by the PS for j ∈ [|ω|], since the PS has perfect channel state information (CSI).
By considering (14) and the definition of x^(l)_k in terms of the power allocation vector, we have

y_k(j) = Σ_{l ∈ M_k(j)} Σ_{i=1}^N p^(l,i)_k g^(l,i)_k(j) + z_k(j),

where g^(l,i)_k(j) is the jth entry of g^(l,i)_k. The noisy aggregated gradient estimate is

ĝ_k(j) = y_k(j) / (N |M_k(j)|).    (21)

Then, the estimated gradient vector is used to update the model parameters as ω_{k+1} = ω_k − βĝ_k. The overall algorithm is shown in Algorithm 3.
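The OTA superposition and the resulting gradient estimate can be simulated in a few lines. The dimensions, the masks, and the rescaling by N·|M_k(j)| reflect our reading of the scheme and are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Each of C clusters transmits a pre-weighted gradient vector; entries on
# bad sub-channels are zeroed out by the power allocation. The MAC output
# is the entrywise sum plus unit-variance noise; the PS rescales entry j by
# N * |M_k(j)| (the number of contributing clusters), known from CSI.
C, N, dim = 3, 2, 5
x = rng.normal(size=(C, dim))                 # transmitted signals x^{(l)}_k
active = rng.random(size=(C, dim)) > 0.3      # channel-threshold masks
x = x * active                                # sparsified transmissions
y = x.sum(axis=0) + rng.normal(size=dim)      # superposition + noise z_k
counts = np.maximum(active.sum(axis=0), 1)    # |M_k(j)|, guard divide-by-0
g_hat = y / (N * counts)                      # noisy aggregated estimate
```

The key point is that the PS never sees the individual x^(l)_k; only their sum survives the channel, which is why the dynamic weighting must run at the ISs.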

Algorithm 3 HOTA-FedGradNorm
1: Initialize ω_0, {p^(l,i)_0}.
2: for k = 0 to K do
3:   The PS broadcasts the current global shared network parameters ω_k to the ISs.
4:   for each cluster l ∈ [C] do
5:     The IS l broadcasts ω^(l)_k to the clients within the cluster.
6:     for each client i ∈ [N] do
7:       Initialize the global shared network parameters for local updates by ω^(l,i)_{k,0} ← ω^(l)_k.
8:       Perform τ_h local updates on h^(l,i), then, for j = 1, . . . , τ_ω, local updates on ω^(l,i)_{k,j}.
9:       Send g^(l,i)_k and the inverse training rate to IS l for dynamic weighting.
10:    The IS l performs the following: computes the loss weights {p^(l,i)_k} via FGN_Server(·); constructs the power allocation vector β^(l,i)_k for each client in cluster l as given in Equation (14); and aggregates the gradients of the clients in cluster l for the global shared network, combined with the power allocation scheme, as x^(l)_k.
11:  The gradients are aggregated over the wireless fading channel as given in Equation (19).
12:  The estimated gradient aggregation ĝ_k is obtained by the PS as given in Equation (21).
13:  The PS updates the global shared network by ω_{k+1} ← ω_k − βĝ_k.

Update(f, h) in Algorithm 3 denotes the generic update of the variable h using the gradient of the function f with respect to h. ω^(l)_k is the global shared network parameters at IS l at the beginning of global iteration k, and β is the learning rate for both the client local updates and the PS global updates. FGN_Server(·), given in Algorithm 4, performs the auxiliary loss F_grad construction and its minimization via gradient descent.

Convergence Analysis
In this section, we provide the convergence analysis for FedGradNorm along with the necessary assumptions and lemmas. Assumption 1. Strong convexity holds for the upper-level objective function F(·) and the lower-level objective function g(·) given in (7), so that the lower-level solution p*(x) = arg min_{p ∈ R^N} g(x, p) is unique.
Assumption 2. ∇F(z) and ∇g(z) are L-Lipschitz, i.e., for any z, z',

||∇F(z) − ∇F(z')|| ≤ L ||z − z'||,  ||∇g(z) − ∇g(z')|| ≤ L ||z − z'||.

Assumption 3. The derivatives ∇_x∇_p g(z) and ∇²_p g(z) are τ- and ρ-Lipschitz, i.e., for any z, z',

||∇_x∇_p g(z) − ∇_x∇_p g(z')|| ≤ τ ||z − z'||,  ||∇²_p g(z) − ∇²_p g(z')|| ≤ ρ ||z − z'||.

Assumption 4. The expected value of the squared ℓ2 norm of the stochastic gradient of F(·) with respect to p is bounded, i.e.,

E_ξ[ ||∇_p F(z; ξ)||²_2 ] ≤ M², for some constant M.

The expectation is taken over the randomness of the stochasticity of gradient descent, where ξ represents the stochastic data samples.

Assumption 5.
The stochastic gradient of F(·) with respect to x is an unbiased estimator of the gradient, i.e.,

E_ξ[ ∇_x F(z; ξ) ] = ∇_x F(z),

where ξ represents the stochastic data samples.
The following lemma (Lemma 1) characterizes the Lipschitz properties of the upper-level objective function F(·); it is adapted from [36], which also gives the constant L_F explicitly.

Lemma 2. Suppose Assumptions 1 to 5 hold.
The proof of Lemma 2 is provided in Appendix A.
Lemma 3. Suppose Assumptions 1 to 5 hold. Then, the following holds, The proof of Lemma 3 is provided in Appendix A.
Theorem 1. Suppose Assumptions 1 to 5 hold. Then, the algorithm converges exponentially over the iterations. Theorem 1 thus shows that the FedGradNorm algorithm has an exponential convergence rate. The proof of Theorem 1 is provided in Appendix A.

Dataset Specifications
The following two datasets are used for the experiments: the Multi-Task Facial Landmark (MTFL) dataset [28] and the RadComDynamic dataset [29]; the dataset attributes are given in Table 1. We perform three different tasks over the RadComDynamic dataset: (1) modulation classification, (2) signal type classification, and (3) anomaly detection.
- Task 2. Signal type classification: the signal classes are AM radio, short-range, Radar-Altimeter, Air-Ground-MTI, Airborne-detection, Airborne-range, and Ground-mapping.
- Task 3. Anomaly behavior: the signal-to-noise ratio (SNR) can be considered a proxy for geo-location information; we define anomalous behavior as having an SNR lower than −4 dB.

Each data point in this dataset is a normalized signal vector of size 256, obtained by stacking the real and imaginary parts of the signal x = x_I + j x_Q, where x_I, x_Q ∈ R^128, as follows:

x̃ = [x_I; x_Q] ∈ R^256.    (34)
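The vectorization in (34) amounts to stacking the I and Q components of the complex baseband signal; the example signal below is hypothetical.

```python
import numpy as np

# Stack the in-phase and quadrature parts of a complex baseband signal
# x = x_I + j*x_Q (each of length 128) into one real feature vector of
# length 256, as in Equation (34). The tone below is an arbitrary example.
x = np.exp(1j * 2 * np.pi * 0.05 * np.arange(128))
x_vec = np.concatenate([x.real, x.imag])
print(x_vec.shape)  # (256,)
```

Normalization of the resulting vector (as used for the dataset) would be applied after this stacking step.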

Hyperparameters and Model Specifications
A detailed description of the hyperparameters of the system model for both the FedGradNorm and HOTA-FedGradNorm algorithms is given in Table 2. Note that γ is a hyperparameter that should be determined with respect to the task asymmetry in the system. We use the Adam optimizer for both network training and F_grad optimization. β is the learning rate the optimizer uses to update the global shared network as well as the personalized network on the client side, and α is the learning rate used for F_grad optimization. The shared network model is described in Table 3. Each client also has a simple linear layer that maps the shared network's output to the corresponding prediction value for its personalized network. Cross-entropy and mean squared error (MSE) are used as the loss functions for classification and regression tasks, respectively. Table 2. System model hyperparameters.

HOTA-FedGradNorm
Number of clusters (C): 10
Number of clients in each cluster (N): 3
Learning rate (β): 0.0003
Learning rate (α): 0.008

Results and Analysis
In the experiments with the MTFL dataset, we observe that task 1 (the facial landmark regression task) has a higher gradient norm than all of the other tasks, which are classification tasks. Figure 3 illustrates how FedGradNorm gradually decreases the loss weight of the first task to balance the learning speed among the tasks. At epoch 70, when tasks 2 and 3 finally reduce their losses at a higher rate, their corresponding weights decrease to help improve the two remaining tasks. Tasks 2 and 3 could not improve without dynamic weighting, since task 1 would mask the gradient updates for the remaining tasks. As a result of reducing the weights of tasks 2 and 3, the weights of tasks 4 and 5 increase with a similar slope (the weight change of both is the same, since they are stacked on top of each other in Figure 3f) to improve the training performance where possible. Unlike the other tasks, task 4 (detecting glasses on human faces) and task 5 (pose estimation) reach the minimum very quickly in the first epochs, as they are easy tasks. Thus, as indicated by Figure 3, the performance does not improve much for tasks 4 and 5. Although the performance of tasks 1 and 5 is much the same in the long run, FedGradNorm helps them learn faster in the early stages. For Figure 3, the data allocation is balanced.

We also perform experiments with an imbalanced data distribution. Table 4 exhibits the loss comparison between FedGradNorm and FedRep when tasks 2 and 4 have access to 500 data points, whereas the other tasks have 3000 data points; FedGradNorm performs better than FedRep. Furthermore, we conduct experiments with the RadComDynamic dataset using Network 2 given in Table 3. FedGradNorm outperforms FedRep on the modulation detection and signal detection tasks, as illustrated in Figure 4. The modulation detection and signal detection tasks train more slowly than the anomaly detection task with respect to the change of the loss.
By employing FedGradNorm, we demonstrate that the learning speeds of the signal and modulation detection tasks are balanced against the anomaly detection task. Moreover, we observe that the loss weight of task 1 is increased to speed up its training at the beginning, since the losses of tasks 2 and 3 initially decrease faster than the loss of task 1. At epoch 55, the loss of task 1 decreases significantly; therefore, its loss weight is decreased to prevent task 1 from dominating the training. Next, we conduct experiments in the HOTA-FedGradNorm setting to investigate the effects of the wireless fading channel. Figure 5 depicts the task losses in the first cluster. We observe that the change in the loss for the first task (modulation classification) is smaller than that for the second and third tasks at the beginning of training; the loss weight of the first task is therefore increased, and after epoch 65 it is decreased, since the loss decreases significantly. Comparing this result with Figure 4, we observe that accounting for the wireless MAC between the ISs and the PS leads to slower training. However, as shown in Figure 5, HOTA-FedGradNorm yields a higher training speed than the naive equal weighting update strategy.
To demonstrate the effectiveness of FedGradNorm in reducing negative channel effects, the channel gain of the first cluster is changed from σ₁² = 1 to σ₁² = 0.5, while the channel gains of the remaining clusters are left unchanged. A decreased σ₁² value is equivalent to intensifying the sparsification of the corresponding gradient according to the defined threshold H_th. Figure 6 shows how even a single bad channel can negatively impact the entire learning process if FedGradNorm is not incorporated into the system model. With HOTA-FedGradNorm, the clients' weights can be adapted based on the channel conditions, thereby mitigating the channel effects. Figure 6 illustrates that both the first and second tasks improve after epoch 85. Additionally, we compare the channel effects for more diverse σ values in Figure 7. From these results, we observe that HOTA-FedGradNorm is robust and trains faster even under more challenging channel conditions.

Figure 6. Comparison between the task losses achieved via HOTA-FedGradNorm and the naive equal-weighting case on the RadComDynamic dataset for the second cluster, where σ₁² = 0.5 and σ_l² = 1 for all l ≥ 2: (a) task 1 (modulation classification), (b) task 2 (signal classification), (c) task 3 (anomaly behavior), (d) task weights.
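The intuition that a smaller channel gain σ effectively sparsifies a cluster's gradient more heavily can be illustrated with a small simulation. This is a hedged sketch, not the paper's exact scheme: the Rayleigh fading model, the threshold rule, and the names `ota_sparsify`/`h_th` are our assumptions for illustration.

```python
import numpy as np

def ota_sparsify(grad, sigma, h_th, rng):
    """Hypothetical illustration of channel-driven sparsification: gradient
    entries whose fading magnitude |h| falls below the threshold h_th are not
    transmitted (zeroed out); the transmitter inverts the channel on the
    surviving entries, so they arrive intact at the aggregator. A smaller
    sigma (weaker channel) pushes more entries below h_th, i.e., heavier
    sparsification."""
    h = rng.rayleigh(scale=sigma, size=grad.shape)  # fading magnitudes
    mask = h >= h_th                                # entries strong enough to transmit
    out = np.where(mask, grad, 0.0)
    return out, mask.mean()                         # sparsified gradient, surviving fraction
```

Under this Rayleigh model, a fraction exp(−h_th²/(2σ²)) of entries survives on average: with h_th = 0.5, roughly 61% for σ = 0.5 versus about 88% for σ = 1, which is the sense in which decreasing σ₁² from 1 to 0.5 intensifies sparsification.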

Conclusions and Discussion
We proposed FedGradNorm, a distributed version of the GradNorm dynamic weighting algorithm for the personalized FL setting. We provided a convergence analysis for FedGradNorm and showed that it has an exponential convergence rate. Moreover, we proposed HOTA-FedGradNorm, a modified version of FedGradNorm that utilizes over-the-air aggregation in a hierarchical FL setting. The characteristics of the wireless communication channel were taken into account in the design of HOTA-FedGradNorm.
In the experiments with FedGradNorm, the learning speed and task performance of FedGradNorm were compared with those of the naive equal-weighting strategy. In contrast to naively assigning equal weights to all tasks, we observed that FedGradNorm achieves faster training and more consistent performance. Additionally, FedGradNorm can compensate for an imbalanced allocation of data among the clients: since the loss weights are adjusted according to the training speeds to encourage slowly learning tasks, tasks with insufficient data still receive fair training. Furthermore, the experimental results with HOTA-FedGradNorm indicated that it provides robustness against adverse channel effects while training faster than the naive equal-weighting strategy.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Proof of Lemma 2. Since p_k depends on x_k, the chain rule applies, and the triangle inequality then yields the three-term bound in (A2). The first term of (A2) is bounded by Assumption 2, and the third term is bounded by Assumption 1. The second term of (A2) is bounded by using ‖∇_p F(x, p)‖ ≤ M, which holds by Assumptions 4 and 5. Combining these three bounds gives the upper bound in (A6). To bound (A6) further, we use the gradient descent update of p, i.e., p_k^t = p_k^{t−1} − α∇_p g(x_k, p_k^{t−1}) for t = 1, …, D, together with the chain rule, and the gradient descent update of x. Inserting (A18) into (A17), and writing ∇̂F for the inexact gradient used in the update, the following holds,

F(x_{k+1}, p_{k+1}) ≤ F(x_k, p_k) + (−β∇̂F(x_k, p_k) + β∇F(x_k, p_k) − β∇F(x_k, p_k))^T ∇F(x_k, p_k)
  − β⟨∇̂F(x_k, p_k) − ∇F(x_k, p_k), ∇F(x_k, p_k)⟩ − β‖∇F(x_k, p_k)‖²
  + β²L_F ‖∇̂F(x_k, p_k) − ∇F(x_k, p_k)‖² + β²L_F ‖∇F(x_k, p_k)‖²   (A20)
≤ F(x_k, p_k) − (β/2 − β²L_F) ‖∇F(x_k, p_k)‖² + (β/2 + β²L_F) ‖∇̂F(x_k, p_k) − ∇F(x_k, p_k)‖²

where the last inequality comes from

‖x − y‖² = ‖x‖² + ‖y‖² − 2⟨x, y⟩ ≥ −‖x‖² − 2⟨x, y − x⟩   (A22)

by substituting x = ∇F(x_k, p_k) and y = ∇̂F(x_k, p_k).
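The elementary inequality (A22) used in the last step can be verified directly; expanding the right-hand side shows that (A22) amounts to dropping the nonnegative term ‖y‖²:

```latex
-\|x\|^2 - 2\langle x, y - x\rangle
  = -\|x\|^2 - 2\langle x, y\rangle + 2\|x\|^2
  = \|x\|^2 - 2\langle x, y\rangle
  \le \|x\|^2 + \|y\|^2 - 2\langle x, y\rangle
  = \|x - y\|^2 .
```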
Proof of Theorem 1. By Lemma 3, we obtain the decomposition in (A23). To upper bound A₂, we use Lemma 2 and the fact that (a + b + c)² ≤ 3a² + 3b² + 3c² for all a, b, c ∈ ℝ, while also assuming ‖p_k⁰ − p*(x_k)‖² ≤ Δ. Choosing a = L(L+µ)(1−αµ) yields (A24), where the constant B is defined accordingly. To upper bound A₁, we use the µ-strong convexity of F(x, p) with respect to x by Assumption 1: for any fixed p ∈ P^N, strong convexity yields (A27). Substituting (A24) and (A27) into (A23) and subtracting F(x*, p*(x*)) from both sides, we have

F(x_{k+1}, p_{k+1}) − F(x*, p*(x*)) ≤ F(x_k, p_k) − F(x*, p*(x*)) − µ² (β/2 − β²L_F) ‖x_k − x*‖²   (A29)

By the L_F-smoothness of F(x, p) from Lemma 1, and the fact that ∇_x F(x, p)|_{x=x*, p=p*} = 0, we obtain (A30) for any x_k. Substituting (A30) into (A29), and additionally using the µ-strong convexity of F(x, p) with respect to x once more, we arrive at the claimed contraction, completing the proof.
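For completeness, the way a per-step contraction yields the exponential (linear) rate claimed in Theorem 1 can be sketched as follows, writing ρ ∈ (0, 1) as shorthand for the contraction factor produced by the proof (its explicit form involves µ, β, and L_F):

```latex
F(x_{k+1}, p_{k+1}) - F(x^*, p^*(x^*))
  \le \rho \left( F(x_k, p_k) - F(x^*, p^*(x^*)) \right)
\;\Longrightarrow\;
F(x_k, p_k) - F(x^*, p^*(x^*))
  \le \rho^{k} \left( F(x_0, p_0) - F(x^*, p^*(x^*)) \right),
```

and ρ^k = e^{k ln ρ} decays exponentially in k. If the one-step bound additionally carries an additive constant (such as the B term arising from (A24)), unrolling instead gives ρ^k (F(x_0, p_0) − F(x*, p*(x*))) plus a geometric-series remainder bounded by that constant divided by 1 − ρ.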