Article

Distributed Interference-Aware Power Optimization for Multi-Task Over-the-Air Federated Learning

School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China
*
Author to whom correspondence should be addressed.
Telecom 2025, 6(3), 51; https://doi.org/10.3390/telecom6030051
Submission received: 6 June 2025 / Revised: 27 June 2025 / Accepted: 4 July 2025 / Published: 14 July 2025

Abstract

Over-the-air federated learning (Air-FL) has emerged as a promising paradigm that integrates communication and learning, which offers significant potential to enhance model training efficiency and optimize communication resource utilization. This paper addresses the challenge of interference management in multi-cell Air-FL systems, focusing on parallel multi-task scenarios where each cell independently executes distinct training tasks. We begin by analyzing the impact of aggregation errors on local model performance within each cell, aiming to minimize the cumulative optimality gap across all cells. To this end, we formulate an optimization framework that jointly optimizes device transmit power and denoising factors. Leveraging the Pareto boundary theory, we design a centralized optimization scheme that characterizes the trade-offs in system performance. Building upon this, we propose a distributed power control optimization scheme based on interference temperature (IT). This approach decomposes the globally coupled problem into locally solvable subproblems, thereby enabling each cell to adjust its transmit power independently using only local channel state information (CSI). To tackle the non-convexity inherent in these subproblems, we first transform them into convex problems and then develop an analytical solution framework grounded in Lagrangian duality theory. Coupled with a dynamic IT update mechanism, our method iteratively approximates the Pareto optimal boundary. The simulation results demonstrate that the proposed scheme outperforms baseline methods in terms of training convergence speed, cross-cell performance balance, and test accuracy. Moreover, it achieves stable convergence within a limited number of iterations, which validates its practicality and effectiveness in multi-task edge intelligence systems.

1. Introduction

In the era of ubiquitous edge intelligence, federated learning (FL) has emerged as a pivotal framework for privacy-preserving distributed model training [1,2]. By enabling edge devices to collaboratively train models without sharing raw data, FL addresses concerns related to data privacy and communication overhead [3,4]. However, traditional FL frameworks often rely on digital communication methods, which can be inefficient in resource-constrained wireless environments due to the substantial communication overhead associated with transmitting high-dimensional model updates [5,6,7].
To mitigate these challenges, over-the-air computation (AirComp) has been integrated with FL, giving rise to over-the-air federated learning (Air-FL) [8,9]. AirComp leverages the superposition property of wireless multiple-access channels to aggregate data directly in the air, significantly reducing the communication latency and enhancing bandwidth efficiency [10,11]. This synergy between communication and computation positions Air-FL as a promising paradigm for efficient model aggregation in wireless networks [12,13].

1.1. Related Work

While Air-FL offers substantial benefits, its deployment in multi-cell wireless networks introduces significant challenges [14]. In particular, concurrent FL tasks among multiple cells can cause severe inter-cell interference, which undermines the quality of aggregated updates and slows down model convergence [15,16]. This issue is especially acute in parallel multi-task scenarios, where each cell tackles different learning tasks with heterogeneous data and distinct model objectives [17,18].
Existing strategies to tackle interference in multi-cell Air-FL primarily involve centralized optimization [19,20]. For example, the authors in [19,20] characterized the Pareto boundary of the error-induced gap region to quantify learning performance trade-offs among different FL tasks, which formulates optimization problems to minimize the sum of error-induced gaps across all cells. Although such centralized solutions achieve near-optimal performance, they suffer from scalability and latency issues in large or dynamic networks, due to the dependence on global channel state information (CSI) and centralized coordination.
To address these concerns, other works have introduced innovations such as hierarchical personalized FL frameworks with optimized beamforming [21], and dynamic clustering combined with power control for two-tier FL systems [22]. Additionally, methods based on compressed sensing [23] have been proposed to enable cross-cell model aggregation with privacy preservation and a reduced need for strict synchronization. The authors in [24] investigated a theoretically guaranteed two-stage search algorithm for joint edge association and aggregation optimization, extended with flexible bandwidth allocation. However, these methods still require some inter-cell coordination.
Despite these advances, centralized methods still face limitations in real-world deployments due to the signaling overhead and latency. This paper addresses that gap by proposing a distributed power control framework guided by interference temperature (IT) parameters, offering near-Pareto optimal performance without requiring centralized coordination. Compared with existing centralized interference management strategies, such as joint power control and beamforming [19,20], our approach introduces several key advancements. First, prior works typically rely on centralized optimization frameworks that require global CSI and full inter-cell coordination, which pose challenges in terms of scalability, latency, and signaling burden. In contrast, our proposed scheme adopts a distributed optimization perspective based on IT, which enables each cell to manage its transmit power using only local CSI and lightweight inter-cell communication. Second, while [19,20] focus on minimizing global error-induced gaps via Pareto boundary characterization under centralized control, our framework introduces a novel dynamic IT update mechanism that allows the system to iteratively approximate Pareto optimality in a decentralized manner. These features make our approach more suitable for large-scale, heterogeneous, and dynamic wireless learning scenarios. A comparison of the focuses, contributions, and potential limitations among highly relevant studies is provided in Table 1.

1.2. Contributions

This paper proposes an interference-aware distributed power control scheme tailored for parallel multi-task Air-FL scenarios. By leveraging the local CSI and IT constraints to enable each cell to autonomously manage its transmit power, our approach mitigates inter-cell interference and enhances overall system performance. Through rigorous analysis and extensive simulations, we demonstrate that our proposed scheme achieves near-optimal performance, which offers a practical and scalable solution for interference management in multi-cell Air-FL deployments.
The primary contributions of this work are as follows:
  • System modeling and centralized optimization: We construct a multi-cell parallel Air-FL system model, detail the AirComp architecture between devices and access points (APs), and establish the interference mechanisms under shared spectrum conditions. We analyze the impact of aggregation errors on the convergence of the optimality gap and formulate a power control optimization problem aimed at minimizing the total optimality gap across all cells. To characterize performance trade-offs, we employ the Pareto boundary theory and design a centralized power control algorithm to delineate the Pareto boundary.
  • Distributed optimization via IT: We propose a distributed optimization scheme based on IT, which decouples the globally coupled problem into locally solvable subproblems. Each cell independently adjusts its transmit power using local CSI. To address the non-convexity of the subproblems, we first transform them into convex problems and then develop an analytical solution framework grounded in Lagrangian duality theory and implement a dynamic IT update mechanism to iteratively approach the Pareto boundary.
  • Simulation validation: Through numerical simulations, we validate the efficacy of our proposed scheme. The results demonstrate that our approach surpasses baseline methods in terms of training convergence speed, cross-cell performance balance, and test accuracy. Moreover, it achieves stable convergence within a limited number of iterations, underscoring its practicality and effectiveness in complex multi-task edge intelligence scenarios.
The remainder of this paper is structured as follows. Section 2 introduces the parallel multi-task Air-FL system model. Section 3 examines the influence of aggregation errors on cell-level local optimality gaps, formulates a power control optimization framework, and presents the concept of the Pareto boundary. Section 4 proposes both a centralized power control optimization scheme and a distributed power control optimization scheme based on IT constraints. Section 5 provides numerical results to validate the performance of our proposed scheme. Section 6 concludes the paper.

2. System Model

As illustrated in Figure 1, we consider a parallel multi-task Air-FL system composed of M cells. Each cell includes a dedicated AP, indexed by $m \in \mathcal{M} \triangleq \{1, \dots, M\}$, and serves K single-antenna edge devices, indexed by $k \in \mathcal{K}_m \triangleq \{1, \dots, K\}$. Define the global device set as $\mathcal{K} \triangleq \{\mathcal{K}_1, \dots, \mathcal{K}_M\}$. The devices in each cell collaboratively train a distinct machine learning model, which highlights the heterogeneity across tasks.

2.1. Federated Learning Model

Each AP–device cluster forms a standalone FL system, with the m-th cell learning a cell-specific model parameterized by the vector $e_m = [e_{m,1}, \dots, e_{m,C}]^T$, where C is the model dimension. Each device $k \in \mathcal{K}_m$ holds a private local dataset $\mathcal{D}_k$. The sample loss function $H_k(e_m, x_i, \xi_i)$ evaluates the prediction error of model parameter $e_m$ on an input–label pair $(x_i, \xi_i) \in \mathcal{D}_k$. Accordingly, the local loss function on device k is defined as
$$H_k(e_m) = \frac{1}{|\mathcal{D}_k|} \sum_{(x_i, \xi_i) \in \mathcal{D}_k} H_k(e_m, x_i, \xi_i).$$
Assuming uniform dataset sizes within each cell, i.e., $|\mathcal{D}_j| = |\mathcal{D}_k|, \forall j, k \in \mathcal{K}_m$, the union of all local datasets in cell $m \in \mathcal{M}$ is denoted as $\mathcal{D}_m = \bigcup_{k \in \mathcal{K}_m} \mathcal{D}_k$. Then, the corresponding global loss function for cell $m \in \mathcal{M}$ is expressed as
$$H_m(e_m) = \sum_{k \in \mathcal{K}_m} \frac{|\mathcal{D}_k|}{|\mathcal{D}_m|} H_k(e_m) = \frac{1}{K} \sum_{k \in \mathcal{K}_m} H_k(e_m).$$
The training goal of cell $m \in \mathcal{M}$ is to find the optimal model parameter $e_m^\star$ that minimizes the global loss function $H_m(e_m)$, formulated as
$$e_m^\star = \arg\min_{e_m \in \mathbb{R}^C} H_m(e_m).$$
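As a quick numeric sanity check of the relation in Equation (2), the snippet below uses a toy squared-error sample loss (a hypothetical choice for illustration, not the paper's model) to confirm that with equal-size local datasets, averaging the per-device losses equals the loss over the pooled cell dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_loss(e, x, xi):
    # Hypothetical squared-error sample loss H_k(e, x_i, xi_i).
    return 0.5 * (x @ e - xi) ** 2

# K = 3 devices, each with |D_k| = 4 samples of dimension C = 2.
K, C, S = 3, 2, 4
data = [(rng.standard_normal((S, C)), rng.standard_normal(S)) for _ in range(K)]
e = rng.standard_normal(C)

# Local losses H_k(e) (Eq. (1)) and their equal-weight average (Eq. (2)).
local = [np.mean([sample_loss(e, x, xi) for x, xi in zip(X, Y)]) for X, Y in data]
H_cell = np.mean(local)

# Loss over the pooled dataset D_m: identical when all |D_k| are equal.
X_all = np.vstack([X for X, _ in data])
Y_all = np.concatenate([Y for _, Y in data])
H_pool = np.mean([sample_loss(e, x, xi) for x, xi in zip(X_all, Y_all)])
```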

2.2. Communication Model

Each cell adopts federated stochastic gradient descent (FedSGD) for model training, where devices compute and transmit gradients to the AP via AirComp-enabled uplink aggregation. The training is conducted over N global communication rounds, indexed by $n \in \mathcal{N} \triangleq \{1, \dots, N\}$. At the beginning of round $n = 1$, the aggregation center broadcasts a common initialization $e^{(1)}$ to all APs, which then distribute it to their associated devices, setting their local model parameters $e_k^{(1)} = e^{(1)}, \forall k \in \mathcal{K}_m$. We assume that the downlink is error-free.
During each round n, device $k \in \mathcal{K}_m$ computes its local gradient estimate $a_k^{(n)}$ over a randomly sampled mini-batch $\tilde{\mathcal{D}}_k^{(n)} \subseteq \mathcal{D}_k$, expressed as
$$a_k^{(n)} = \frac{1}{|\tilde{\mathcal{D}}_k^{(n)}|} \sum_{(x_i, \xi_i) \in \tilde{\mathcal{D}}_k^{(n)}} \nabla H_k(e_k^{(n)}, x_i, \xi_i).$$
To align power levels and mitigate signal distortion, each device applies gradient normalization, defined as
$$\bar{a}_k^{(n)} = \frac{1}{C} \sum_{c=1}^{C} a_{k,c}^{(n)},$$
$$(\upsilon_k^{(n)})^2 = \frac{1}{C} \sum_{c=1}^{C} \big( a_{k,c}^{(n)} - \bar{a}_k^{(n)} \big)^2,$$
$$s_{k,c}^{(n)} = \frac{a_{k,c}^{(n)} - \bar{a}_k^{(n)}}{\upsilon_k^{(n)}},$$
where $\bar{a}_k^{(n)} \in \mathbb{R}$ and $\upsilon_k^{(n)} \in \mathbb{R}_+$ denote the mean and standard deviation of the C-dimensional local gradient $a_k^{(n)}$, respectively. $s_{k,c}^{(n)}$ is the normalized signal to be transmitted, satisfying $\mathbb{E}[s_{k,c}^{(n)}] = 0$ and $\mathbb{E}[(s_{k,c}^{(n)})^2] = 1$. Gradients from different devices are assumed to be statistically independent.
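The normalization in Equations (4)–(6) can be sketched as follows with synthetic values; the point is that the transmitted symbols have zero mean and unit power, and that the normalization is invertible given the pair $(\bar{a}_k, \upsilon_k)$.

```python
import numpy as np

rng = np.random.default_rng(1)
C = 64
a_k = rng.standard_normal(C) * 3.0 + 1.5   # synthetic local gradient a_k

# Eqs. (4)-(6): per-device mean/std normalization before transmission.
a_bar = a_k.mean()
upsilon = np.sqrt(np.mean((a_k - a_bar) ** 2))
s_k = (a_k - a_bar) / upsilon

# The normalized symbols have (sample) zero mean and unit power, and the
# original gradient is recovered exactly from (a_bar, upsilon).
a_rec = upsilon * s_k + a_bar
```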
Each device transmits its signal over a quasi-static channel. Let $h_{m,k}^{(n)}$ denote the channel from device $k \in \mathcal{K}_m$ to its AP $m \in \mathcal{M}$, and $h_{m,i}^{(n)}$ denote the channel from interfering device $i \in \mathcal{K}_j, j \ne m$, to AP m. The transmission coefficient is set as $b_k^{(n)} = \sqrt{p_k^{(n)}} \frac{(h_{m,k}^{(n)})^H}{|h_{m,k}^{(n)}|}$ with power $p_k^{(n)} \le P_k^{max}$, where $P_k^{max}$ is the instantaneous power budget on each device. The signal received at AP $m \in \mathcal{M}$ for the c-th model parameter is expressed as
$$r_{m,c}^{(n)} = \sum_{k \in \mathcal{K}_m} \sqrt{p_k^{(n)}} \big| h_{m,k}^{(n)} \big| s_{k,c}^{(n)} + \sum_{j \in \mathcal{M} \setminus \{m\}} \sum_{i \in \mathcal{K}_j} \sqrt{p_i^{(n)}} \frac{h_{m,i}^{(n)} (h_{j,i}^{(n)})^H}{|h_{j,i}^{(n)}|} s_{i,c}^{(n)} + Z_{m,c}^{(n)},$$
where $Z_{m,c}^{(n)} \sim \mathcal{CN}(0, \sigma_m^2)$ denotes the additive white Gaussian noise (AWGN) with power $\sigma_m^2$ at the receiver of AP $m \in \mathcal{M}$.
The denoising factor $\theta_m^{(n)}$ is used to equalize signal power and suppress noise/interference. To reconstruct the aggregated gradient, the AP performs gradient-based linear estimation post-processing on the received signal $r_{m,c}^{(n)}$, given as
$$y_{m,c}^{(n)} = \frac{1}{K} \Bigg( \frac{r_{m,c}^{(n)}}{\sqrt{\theta_m^{(n)}}} + \sum_{k \in \mathcal{K}_m} \bar{a}_k^{(n)} \Bigg) = \frac{1}{K} \sum_{k \in \mathcal{K}_m} a_{k,c}^{(n)} + \frac{1}{K} \Bigg( \frac{r_{m,c}^{(n)}}{\sqrt{\theta_m^{(n)}}} - \sum_{k \in \mathcal{K}_m} \big( a_{k,c}^{(n)} - \bar{a}_k^{(n)} \big) \Bigg) = a_{m,c}^{(n)} + \varepsilon_{m,c}^{(n)},$$
where $a_{m,c}^{(n)} = \frac{1}{K} \sum_{k \in \mathcal{K}_m} a_{k,c}^{(n)}$ is the desired average gradient, and $\varepsilon_{m,c}^{(n)}$ represents the aggregation error due to noise, power mismatch, and inter-cell interference, given as
$$\varepsilon_{m,c}^{(n)} = \frac{1}{K} \Bigg[ \sum_{k \in \mathcal{K}_m} \Bigg( \frac{\sqrt{p_k^{(n)}} |h_{m,k}^{(n)}|}{\sqrt{\theta_m^{(n)}}} - \upsilon_k^{(n)} \Bigg) s_{k,c}^{(n)} + \sum_{j \in \mathcal{M} \setminus \{m\}} \sum_{i \in \mathcal{K}_j} \frac{\sqrt{p_i^{(n)}} \, h_{m,i}^{(n)} (h_{j,i}^{(n)})^H}{\sqrt{\theta_m^{(n)}} \, |h_{j,i}^{(n)}|} s_{i,c}^{(n)} + \frac{Z_{m,c}^{(n)}}{\sqrt{\theta_m^{(n)}}} \Bigg].$$
By collecting all C model-parameter dimensions, the average global gradient received by AP $m \in \mathcal{M}$ is given by
$$\hat{a}_m^{(n)} = \Re\{y_m^{(n)}\} = a_m^{(n)} + \Re\{\varepsilon_m^{(n)}\},$$
where $a_m^{(n)} = [a_{m,1}^{(n)}, \dots, a_{m,C}^{(n)}]^T$, $y_m^{(n)} = [y_{m,1}^{(n)}, \dots, y_{m,C}^{(n)}]^T$, and $\varepsilon_m^{(n)} = [\varepsilon_{m,1}^{(n)}, \dots, \varepsilon_{m,C}^{(n)}]^T$.
Finally, the global model of cell $m \in \mathcal{M}$ is updated as
$$e_m^{(n+1)} = e_m^{(n)} - \eta_m^{(n)} \hat{a}_m^{(n)},$$
where η m ( n ) is the learning rate. This process continues until convergence or until the maximum number of communication rounds is reached.
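To check the consistency of the received-signal, post-processing, and error expressions above, the following sketch simulates one gradient dimension in a two-cell toy instance (random channels, powers, and symbols; all values hypothetical) and verifies that the post-processed signal decomposes exactly into the desired average gradient plus the stated aggregation error.

```python
import numpy as np

rng = np.random.default_rng(2)
M, K = 2, 3                                 # cells, devices per cell
p = rng.uniform(0.1, 1.0, (M, K))           # transmit powers
theta = rng.uniform(0.5, 2.0, M)            # denoising factors
# h[m, j, i]: channel from device i of cell j to AP m.
h = rng.standard_normal((M, M, K)) + 1j * rng.standard_normal((M, M, K))
s = rng.standard_normal((M, K))             # normalized symbols s_{k,c}
a_bar = rng.standard_normal((M, K))         # per-device gradient means
ups = rng.uniform(0.5, 1.5, (M, K))         # per-device gradient std devs
a = a_bar + ups * s                         # a_{k,c} = a_bar_k + upsilon_k s_{k,c}
Z = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) * 0.1

y = np.empty(M, dtype=complex)
eps = np.empty(M, dtype=complex)
for m in range(M):
    # Received signal: aligned in-cell terms + misaligned interference + noise.
    r = sum(np.sqrt(p[m, k]) * abs(h[m, m, k]) * s[m, k] for k in range(K))
    for j in range(M):
        if j == m:
            continue
        for i in range(K):
            r += (np.sqrt(p[j, i]) * h[m, j, i] * np.conj(h[j, j, i])
                  / abs(h[j, j, i]) * s[j, i])
    r += Z[m]
    # Post-processing at the AP.
    y[m] = (r / np.sqrt(theta[m]) + a_bar[m].sum()) / K
    # Aggregation error: power mismatch + interference + noise terms.
    eps[m] = (sum((np.sqrt(p[m, k]) * abs(h[m, m, k]) / np.sqrt(theta[m])
                   - ups[m, k]) * s[m, k] for k in range(K))
              + sum(np.sqrt(p[j, i]) * h[m, j, i] * np.conj(h[j, j, i])
                    / (np.sqrt(theta[m]) * abs(h[j, j, i])) * s[j, i]
                    for j in range(M) if j != m for i in range(K))
              + Z[m] / np.sqrt(theta[m])) / K

a_avg = a.mean(axis=1)                      # desired average gradient a_{m,c}
```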

3. Convergence Analysis and Problem Formulation

To guide the design of effective power control strategies in the proposed multi-task Air-FL system, we develop a convergence analysis framework that explicitly accounts for the impact of aggregation errors. In this framework, we adopt the optimality gap as a key performance indicator, which quantifies the deviation between the current global loss function value and the corresponding optimal value for each federated task during the n-th communication round.

3.1. Assumptions

As a preliminary step for our convergence analysis, we introduce several foundational assumptions, in line with those commonly utilized in prior studies [19,25,26,27,28,29]. These include the smoothness of local loss functions, the bounded variance of stochastic gradients, and the Polyak–Łojasiewicz condition, which together ensure the stability of gradient-based updates and allow the analytical derivation of learning dynamics under communication constraints.
Assumption 1
(Lipschitz Smoothness). For any cell $m \in \mathcal{M}$, the global loss function $H_m(e)$ is assumed to be differentiable and $L_m$-smooth. That is, there exists a constant $L_m > 0$ such that for all $e, v \in \mathbb{R}^C$, the following condition holds [19,25]:
$$\|\nabla H_m(e) - \nabla H_m(v)\| \le L_m \|e - v\|,$$
where $\nabla H_m(e)$ denotes the gradient of the loss function. Equivalently, the following upper bound holds for the loss function:
$$H_m(e) \le H_m(v) + \nabla H_m(v)^T (e - v) + \frac{L_m}{2} \|e - v\|^2.$$
Assumption 2
(Bounded Variance). The local gradient estimates $a_k^{(n)}$ are assumed to be independent and unbiased estimates of the true gradient $\nabla H_k(e_k)$ and possess bounded variance, i.e., [26,27]
$$\mathbb{E}[a_k^{(n)}] = \nabla H_k(e_k^{(n)}), \quad \forall k \in \mathcal{K}, n \in \mathcal{N},$$
$$\mathbb{E}\big[\|a_k^{(n)} - \nabla H_k(e_k^{(n)})\|^2\big] \le \frac{\phi_k^2}{n_b}, \quad \forall k \in \mathcal{K}, n \in \mathcal{N},$$
where $\phi_k \ge 0$ bounds the variance of stochastic gradients at each device, and $n_b$ denotes the mini-batch size used for gradient computation. (While we assume uniform mini-batch sizes and dataset sizes across devices for analytical tractability, the stochastic gradient variance term $\phi_k$ captures statistical heterogeneity arising from non-identically distributed (non-IID) local data distributions. Extensions to support heterogeneous dataset sizes and computation budgets can be incorporated by adjusting the weighting factors in the loss function and optimization problem.)
Assumption 3
(Polyak–Łojasiewicz Condition). Let $H_m^\star$ denote the global optimal loss function value for cell m. The Polyak–Łojasiewicz condition asserts that there exists a constant $\mu_m > 0$ such that $H_m(e)$ satisfies [28,29]
$$\|\nabla H_m(e)\|^2 \ge 2 \mu_m \big( H_m(e) - H_m^\star \big).$$

3.2. Optimality Gap vs. Aggregation Error

Building upon the preceding assumptions, this subsection investigates the convergence behavior of parallel multi-task Air-FL, with a particular focus on how aggregation errors influence the optimality gap.
Let $e_m^{(n+1)}$ denote the updated global model parameter for cell $m \in \mathcal{M}$ after global communication round n. The corresponding optimality gap is defined as $\mathbb{E}[H_m(e_m^{(n+1)})] - H_m^\star$. Let $\mathbb{E}[\Re\{\varepsilon_{m,c}^{(n)}\}^2]$ represent the MSE of the global gradient estimate. With an appropriately designed diminishing learning rate, we establish the following result to characterize the convergence bound for each task.
Theorem 1.
Consider a parallel multi-task Air-FL setup where each cell $m \in \mathcal{M}$ adopts a diminishing learning rate of the form $0 < \eta_m^{(n)} = \frac{u_m}{n + v_m} \le \frac{1}{L_m} \le \frac{1}{\mu_m}, \forall n \in \mathcal{N}$, with constants $v_m > 0$ and $u_m > \frac{1}{\mu_m}$. Suppose that each device employs a fixed mini-batch size $n_b = N$. Under these settings, the expected optimality gap for each cell $m \in \mathcal{M}$ after N communication rounds is bounded by
$$\mathbb{E}\big[H_m(e_m^{(N+1)})\big] - H_m^\star \le \prod_{n \in \mathcal{N}} D_m^{(n)} \Big( \mathbb{E}\big[H_m(e_m^{(1)})\big] - H_m^\star \Big) + \sum_{n \in \mathcal{N}} J_m^{(n)} \Big( \eta_m^{(n)} B_m + L_m (\eta_m^{(n)})^2 \sum_{c=1}^{C} \mathbb{E}\big[\Re\{\varepsilon_{m,c}^{(n)}\}^2\big] \Big),$$
where $D_m^{(n)} \triangleq 1 - \mu_m \eta_m^{(n)}$ satisfies $0 < D_m^{(n)} < 1, \forall n \in \mathcal{N}$, $J_m^{(n)} \triangleq \frac{\prod_{i=n}^{N} D_m^{(i)}}{2 D_m^{(n)}}$, and the gradient variance is defined as $B_m \triangleq \frac{1}{K} \sum_{k \in \mathcal{K}_m} \frac{\phi_k^2}{N}$.
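The quantities appearing in Theorem 1 are straightforward to tabulate numerically. The sketch below, with hypothetical constants $\mu_m$, $L_m$, $u_m$, $v_m$ chosen to satisfy the theorem's conditions (the clipping of the step size to $1/L_m$ is an implementation choice for the toy example), computes the diminishing step sizes, the contraction factors $D_m^{(n)}$, and the weights $J_m^{(n)}$.

```python
import numpy as np

# Toy constants for one cell (hypothetical values, not from the paper).
mu, L, N = 0.5, 2.0, 50
u, v = 2.5, 1.0                      # u > 1/mu, v > 0
n = np.arange(1, N + 1)
eta = u / (n + v)                    # diminishing learning rate eta^(n)
eta = np.minimum(eta, 1.0 / L)       # enforce eta <= 1/L <= 1/mu
D = 1.0 - mu * eta                   # contraction factors D^(n) in (0, 1)
# J^(n) = (prod_{i=n}^{N} D^(i)) / (2 D^(n)), via a reversed cumulative product.
suffix = np.cumprod(D[::-1])[::-1]   # suffix[n-1] = prod_{i=n}^{N} D^(i)
J = suffix / (2.0 * D)
init_decay = np.prod(D)              # weight multiplying the initial gap
```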
Remark 1.
The proof of Theorem 1 follows steps similar to those in [19], to which we refer the reader for details. The convergence upper bound of the optimality gap in Theorem 1 consists of three distinct components. The first component is the initial optimality gap $\prod_{n \in \mathcal{N}} D_m^{(n)} \big( \mathbb{E}[H_m(e_m^{(1)})] - H_m^\star \big)$, determined by the discrepancy between the initial point of the global loss function and its optimal value in cell $m \in \mathcal{M}$. The second component is the gradient variance term $J_m^{(n)} \eta_m^{(n)} B_m$, capturing the influence of stochastic gradient noise arising from statistical heterogeneity within the local datasets of cell m. The third component is the aggregation error term $J_m^{(n)} L_m (\eta_m^{(n)})^2 \sum_{c=1}^{C} \mathbb{E}[\Re\{\varepsilon_{m,c}^{(n)}\}^2]$, which quantifies the cumulative effect of model aggregation errors induced by channel noise, channel fading, and inter-cell interference. In real-world deployments, the number of communication rounds is finite, and the learning rate cannot be reduced indefinitely. As a result, each cell’s optimization process converges to a neighborhood of the global minimum, characterized by a small but non-zero steady-state optimality gap.

3.3. Problem Formulation

In this subsection, building on the convergence analysis from the previous subsection, we derive the optimality gap expression and further formulate a power control optimization problem aimed at minimizing the optimality gap in each cell. From the derived upper bound, it is evident that the initial gap and gradient variance terms remain constant once the system configuration is fixed. Therefore, our attention centers on the aggregation error, which varies with the power control variables and dominates the effective optimality gap to be minimized, formulated as
$$\sum_{n \in \mathcal{N}} \sum_{c=1}^{C} J_m^{(n)} L_m (\eta_m^{(n)})^2 \mathbb{E}\big[\Re\{\varepsilon_{m,c}^{(n)}\}^2\big].$$
To enable tractable optimization in parallel multi-task scenarios, we decouple the problem into N subproblems, each corresponding to a single communication round. Furthermore, for clarity, we focus on the aggregation error in a single dimension and omit the superscript n without loss of generality. Based on Equation (9), the MSE for the c-th model parameter in cell m is given by
$$\mathbb{E}\big[\Re\{\varepsilon_{m,c}\}^2\big] = \frac{1}{K^2} \Bigg[ \sum_{k \in \mathcal{K}_m} \Bigg( \frac{\sqrt{p_k} |h_{m,k}|}{\sqrt{\theta_m}} - \upsilon_k \Bigg)^2 + \sum_{j \in \mathcal{M} \setminus \{m\}} \sum_{i \in \mathcal{K}_j} \frac{p_i \Re\{h_{m,i} (h_{j,i})^H\}^2}{\theta_m |h_{j,i}|^2} + \frac{\sigma_m^2}{2 \theta_m} \Bigg].$$
Substituting this expression into (19), we obtain the following formulation for the effective optimality gap as a function of the power control variables, written as
$$\Phi(\{p_k\}, \{\theta_m\}) \triangleq \sum_{m \in \mathcal{M}} E_m \Bigg[ \sum_{k \in \mathcal{K}_m} \Bigg( \frac{\sqrt{p_k} |h_{m,k}|}{\sqrt{\theta_m}} - \upsilon_k \Bigg)^2 + \sum_{j \in \mathcal{M} \setminus \{m\}} \sum_{i \in \mathcal{K}_j} \frac{p_i \Re\{h_{m,i} (h_{j,i})^H\}^2}{\theta_m |h_{j,i}|^2} + \frac{\sigma_m^2}{2 \theta_m} \Bigg],$$
where $E_m = \frac{J_m L_m \eta_m^2}{K^2}$. Based on this, the optimality gap minimization problem is formulated as
$$(\mathrm{P0}): \min_{\{p_k, \theta_m\}} \Phi(\{p_k\}, \{\theta_m\})$$
$$\mathrm{s.t.} \quad \theta_m > 0, \quad \forall m \in \mathcal{M},$$
$$\quad\quad 0 \le p_k \le P_k^{max}, \quad \forall k \in \mathcal{K}_m, m \in \mathcal{M}.$$
Minimizing the objective function Φ ( { p k } , { θ m } ) effectively reduces the optimality gap for each cell. However, in multi-cell wireless systems, independently optimizing each cell may inadvertently amplify inter-cell interference, thus degrading the learning performance of neighboring cells.
To address this challenge and ensure fairness across distributed tasks, it is essential to introduce a theoretical framework that captures the global trade-offs among all cells’ learning performances. To this end, the next subsection introduces the concept of the Pareto boundary to characterize the optimal trade-off set of optimality gaps among cells under shared resource constraints, thereby laying a theoretical foundation for subsequent coordinated power control strategies across cells.

3.4. Pareto Boundary Definition and Characterization

To achieve balanced learning performance across different cells, we introduce the concept of the optimality gap region, denoted by $\mathcal{G}$. This region represents the set of all achievable tuples $(\Delta_1, \Delta_2, \dots, \Delta_M)$, where each component $\Delta_m$ corresponds to the optimality gap of cell $m \in \mathcal{M}$ under the given system constraints such as per-device power budgets. Formally, the optimality gap region is defined as
$$\mathcal{G} = \Big\{ (\Delta_1, \Delta_2, \dots, \Delta_M) \;\Big|\; \Delta_m \ge \mathrm{Gap}_m, \; \forall m \in \mathcal{M} \Big\},$$
where $\mathrm{Gap}_m \triangleq J_m L_m \eta_m^2 \, \mathbb{E}[\Re\{\varepsilon_{m,c}\}^2]$ represents the minimal achievable optimality gap for cell m.
The Pareto boundary of region G is defined as the set of all Pareto optimal points, for which it is impossible to reduce any component Δ m without increasing at least one of the others. These points characterize the optimal trade-offs between the learning performance results of different cells under shared resource constraints and provide a theoretical limit to how well system-wide performance can be balanced.
To mathematically characterize the Pareto boundary, we adopt the rate profile method proposed in [30]. This method enables joint optimization across all APs by formulating a scalarized minimization problem, where the individual performance weights are specified by a profile vector $\kappa = (\kappa_1, \kappa_2, \dots, \kappa_M)$. Each weight $\kappa_m$ reflects the relative importance or performance share assigned to cell m, subject to the normalization constraint $\sum_{m \in \mathcal{M}} \kappa_m = 1$.
The optimization problem for characterizing the Pareto boundary is then given by
$$(\mathrm{P1}): \min_{\{p_k\}, \, \theta_m > 0, \, \varsigma > 0} \varsigma$$
$$\mathrm{s.t.} \quad \mathrm{Gap}_m \le \kappa_m \varsigma, \quad \forall m \in \mathcal{M},$$
$$\quad\quad 0 \le p_k \le P_k^{max}, \quad \forall k \in \mathcal{K}_m, m \in \mathcal{M},$$
where ς serves as an upper bound that balances the individual optimality gaps according to the profile vector κ .
For a given profile vector $\kappa$, the solution $\varsigma^\star$ to problem (P1) determines a Pareto optimal tuple $\kappa \varsigma^\star$. Geometrically, this point lies on the intersection between the ray defined by direction $\kappa$ and the Pareto boundary of region $\mathcal{G}$. By varying $\kappa$ over the unit simplex, the entire Pareto boundary can be traced as illustrated in Figure 2, thereby fully revealing the trade-off structure among competing tasks.
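The rate-profile construction can be illustrated on a toy two-cell region in which the gaps are simple hypothetical functions of a single resource split (a stand-in for the paper's system model): for each profile vector $\kappa$, a bisection over $\varsigma$ finds the boundary point $\kappa \varsigma^\star$.

```python
import numpy as np

# Toy two-cell gap region: a resource split x in [0, 1] yields
# Gap_1(x) = 1 / (0.1 + x), Gap_2(x) = 1 / (0.1 + 1 - x)  (hypothetical shapes).
def gaps(x):
    return np.array([1.0 / (0.1 + x), 1.0 / (1.1 - x)])

xs = np.linspace(0.0, 1.0, 2001)
all_gaps = np.stack([gaps(x) for x in xs])

def feasible(varsigma, kappa):
    # (P1) feasibility: does some allocation satisfy Gap_m <= kappa_m * varsigma?
    return bool(np.any(np.all(all_gaps <= kappa * varsigma, axis=1)))

def pareto_point(kappa, tol=1e-6):
    lo, hi = 0.0, 1e3                       # hi chosen large enough to be feasible
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if feasible(mid, kappa) else (mid, hi)
    return kappa * hi                       # Pareto tuple kappa * varsigma_star

# Trace three boundary points by sweeping the weight on cell 1.
boundary = [pareto_point(np.array([w, 1.0 - w])) for w in (0.3, 0.5, 0.7)]
```

Giving cell 1 a smaller weight forces it to a smaller gap at the expense of cell 2, which is exactly the trade-off the boundary encodes.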

4. Proposed Method

Building upon the previous section, this section develops efficient power control algorithms for parallel multi-task Air-FL from both centralized and distributed perspectives. We begin with a centralized approach that leverages full global CSI to iteratively transform and solve the non-convex optimization problem through convex reformulations. This approach enables accurate characterization of the Pareto boundary via joint optimization across all cells. However, due to the high overhead and poor scalability of centralized solutions in large-scale systems, we then propose a distributed power control scheme based on IT. The distributed approach uses only local CSI and IT constraints, and achieves near-Pareto optimal performance through inter-cell coordination without requiring a global controller.

4.1. Centralized Scheme

The main challenge in solving problem ( P 1 ) lies in the coupling between the transmit power p k of devices and the denoising factor θ m within the expression of the optimality gap Gap m , which renders the original problem non-convex. To address this, we propose an iterative centralized algorithm. Specifically, we first fix the transmit powers p k and optimize the denoising factor θ m for each cell, then substitute the resulting θ m back into the original problem to solve for the optimal transmit powers. This alternating optimization (AO) continues until convergence.

4.1.1. Denoising Factor Optimization

Under fixed device transmit powers, problem (P1) decomposes into M independent subproblems, each corresponding to one cell, expressed as
$$(\mathrm{P2}): \min_{\theta_m > 0} \sum_{k \in \mathcal{K}_m} \Bigg( \frac{\sqrt{p_k} |h_{m,k}|}{\sqrt{\theta_m}} - \upsilon_k \Bigg)^2 + \sum_{j \in \mathcal{M} \setminus \{m\}} \sum_{i \in \mathcal{K}_j} \frac{p_i \Re\{h_{m,i} (h_{j,i})^H\}^2}{\theta_m |h_{j,i}|^2} + \frac{\sigma_m^2}{2 \theta_m}.$$
By introducing the auxiliary variable $\gamma_m = 1/\sqrt{\theta_m}$, the objective becomes a quadratic function of $\gamma_m$, thereby enabling the straightforward derivation of the closed-form solution for the denoising factor, given as
$$\theta_m^\star = \Bigg( \frac{\sum_{k \in \mathcal{K}_m} p_k |h_{m,k}|^2 + \sum_{j \in \mathcal{M} \setminus \{m\}} \sum_{i \in \mathcal{K}_j} \frac{p_i \Re\{h_{m,i} (h_{j,i})^H\}^2}{|h_{j,i}|^2} + \frac{\sigma_m^2}{2}}{\sum_{k \in \mathcal{K}_m} \sqrt{p_k} |h_{m,k}| \upsilon_k} \Bigg)^2.$$
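The closed-form denoising factor can be verified against a brute-force search. In the sketch below (random per-cell constants, with the inter-cell interference lumped into a single hypothetical term `I_m`), the grid minimizer of the per-cell MSE coincides with the closed-form $\theta_m^\star$.

```python
import numpy as np

rng = np.random.default_rng(3)
K = 4
p = rng.uniform(0.1, 1.0, K)        # transmit powers
h = rng.uniform(0.5, 2.0, K)        # in-cell channel magnitudes |h_{m,k}|
ups = rng.uniform(0.5, 1.5, K)      # gradient std devs upsilon_k
I_m = 0.7                           # lumped inter-cell interference term
sigma2 = 0.2                        # noise power

def mse(theta):
    # Per-cell aggregation MSE as a function of the denoising factor.
    return (np.sum((np.sqrt(p) * h / np.sqrt(theta) - ups) ** 2)
            + I_m / theta + sigma2 / (2 * theta))

# Closed-form minimizer: quadratic in gamma = 1/sqrt(theta).
theta_star = ((np.sum(p * h ** 2) + I_m + sigma2 / 2)
              / np.sum(np.sqrt(p) * h * ups)) ** 2

# Brute-force check over a fine grid around the closed-form solution.
grid = np.linspace(0.5 * theta_star, 2.0 * theta_star, 20001)
theta_grid = grid[np.argmin([mse(t) for t in grid])]
```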

4.1.2. Device Transmit Power Optimization

Substituting the solution for $\theta_m^\star$ in Equation (30) back into problem (P1), problem (P1) is rewritten as
$$(\mathrm{P3}): \min_{\{p_k\}, \, \varsigma > 0} \varsigma$$
$$\mathrm{s.t.} \quad \Bigg( \sum_{k \in \mathcal{K}_m} \upsilon_k^2 - \frac{\kappa_m \varsigma}{E_m} \Bigg) \Bigg( \sum_{k \in \mathcal{K}_m} p_k |h_{m,k}|^2 + \sum_{j \in \mathcal{M} \setminus \{m\}} \sum_{i \in \mathcal{K}_j} \frac{p_i \Re\{h_{m,i} (h_{j,i})^H\}^2}{|h_{j,i}|^2} + \frac{\sigma_m^2}{2} \Bigg) \le \Bigg( \sum_{k \in \mathcal{K}_m} \sqrt{p_k} |h_{m,k}| \upsilon_k \Bigg)^2, \quad \forall m \in \mathcal{M},$$
$$\quad\quad 0 \le p_k \le P_k^{max}, \quad \forall k \in \mathcal{K}_m, m \in \mathcal{M}.$$
For any given $\varsigma$, problem (P3) can be reformulated as
$$(\mathrm{P4}): \mathrm{Find} \; \{p_k\}$$
$$\mathrm{s.t.} \quad \psi_m \Bigg( \sum_{k \in \mathcal{K}_m} p_k |h_{m,k}|^2 + \sum_{j \in \mathcal{M} \setminus \{m\}} \sum_{i \in \mathcal{K}_j} \frac{p_i \Re\{h_{m,i} (h_{j,i})^H\}^2}{|h_{j,i}|^2} + \frac{\sigma_m^2}{2} \Bigg) \le \Bigg( \sum_{k \in \mathcal{K}_m} \sqrt{p_k} |h_{m,k}| \upsilon_k \Bigg)^2, \quad \forall m \in \mathcal{M},$$
$$\quad\quad 0 \le p_k \le P_k^{max}, \quad \forall k \in \mathcal{K}_m, m \in \mathcal{M},$$
where $\psi_m = \sum_{k \in \mathcal{K}_m} \upsilon_k^2 - \frac{\kappa_m \varsigma}{E_m}$. Introduce the interference matrix $\Lambda_m \triangleq [\Lambda_{m,1}, \Lambda_{m,2}, \dots, \Lambda_{m,K_{tot}}]$ with $K_{tot} = KM$, defined as
$$\Lambda_{m,k} = \begin{cases} |h_{m,k}|, & k \in \mathcal{K}_m, \\ \dfrac{\big|\Re\{h_{m,k} (h_{j,k})^H\}\big|}{|h_{j,k}|}, & k \in \mathcal{K}_j, \; j \in \mathcal{M} \setminus \{m\}. \end{cases}$$
Constraint (35) can be equivalently expressed as
$$\psi_m \Bigg( \sum_{k=1}^{K_{tot}} \Lambda_{m,k}^2 \, p_k + \frac{\sigma_m^2}{2} \Bigg) \le \Bigg( \sum_{k \in \mathcal{K}_m} \sqrt{p_k} |h_{m,k}| \upsilon_k \Bigg)^2.$$
Define $q = [\sqrt{p_1}, \sqrt{p_2}, \dots, \sqrt{p_{K_{tot}}}]^T$, $\alpha_m = \mathrm{diag}\big( \sqrt{\sigma_m^2/2}, \Lambda_{m,1}, \dots, \Lambda_{m,K_{tot}} \big)$, and $\beta_m = [\beta_{m,1}, \beta_{m,2}, \dots, \beta_{m,K_{tot}}]^T$ with $\beta_{m,k}$ satisfying
$$\beta_{m,k} = \begin{cases} |h_{m,k}| \upsilon_k, & k \in \mathcal{K}_m, \\ 0, & k \in \mathcal{K}_j, \; j \in \mathcal{M} \setminus \{m\}. \end{cases}$$
Therefore, problem (P4) can be transformed into a standard second-order cone program (SOCP), given as
$$(\mathrm{P5}): \mathrm{Find} \; \{q\}$$
$$\mathrm{s.t.} \quad \sqrt{\psi_m} \, \big\| \alpha_m [1; q] \big\| \le q^T \beta_m, \quad \forall m \in \mathcal{M},$$
$$\quad\quad 0 \le q_k \le \sqrt{P_k^{max}}, \quad \forall k \in \mathcal{K}_m, m \in \mathcal{M}.$$
Problem ( P 5 ) can be efficiently solved using standard convex optimization solvers (e.g., CVX [31]). Once the optimal solution q * is obtained, the transmit powers follow as p k * = ( q k * ) 2 . To determine the optimal ς * , we employ a bisection search over feasible ς values and solve the corresponding SOCP at each step. The complete process is summarized in Algorithm 1.
Algorithm 1: Centralized scheme for solving problem (P1).
1: Input: local gradient standard deviations $\upsilon_k$, profile vector $\kappa$, convergence threshold $\iota$, and maximum power budgets $P_k^{max}$.
2: Set $\varsigma_{low} \leftarrow 0$ and $\varsigma_{up} \leftarrow \min_{m \in \mathcal{M}} \frac{E_m \sum_{k \in \mathcal{K}_m} \upsilon_k^2}{\kappa_m}$.
3: while $\varsigma_{up} - \varsigma_{low} > \iota$ do
4:   Set $\varsigma \leftarrow \frac{\varsigma_{low} + \varsigma_{up}}{2}$.
5:   Solve problem (P5) to obtain $p_k^\star$.
6:   if problem (P5) is feasible then
7:     Set $\varsigma_{up} \leftarrow \varsigma$.
8:   else
9:     Set $\varsigma_{low} \leftarrow \varsigma$.
10:  end if
11: end while
12: Obtain $\theta_m^\star$ based on (30).
13: Output: optimal solutions $\{p_k^\star, \theta_m^\star, \varsigma^\star\}$.
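Algorithm 1 reduces to a bisection over $\varsigma$ with a feasibility oracle. The sketch below is a simplified stand-in under toy parameters: instead of invoking an SOCP solver such as CVX, it checks the second-order cone constraint directly for a fixed candidate vector q, so `socp_constraint_holds`, `bisect_varsigma`, and all constants are hypothetical illustrations rather than the paper's implementation.

```python
import numpy as np

def socp_constraint_holds(q, psi, Lam, beta, sigma2, q_max):
    # One cell's (P5) constraint: sqrt(psi) * ||alpha_m [1; q]|| <= beta^T q,
    # with alpha_m = diag(sqrt(sigma2 / 2), Lam), plus the box constraint on q.
    lhs = np.sqrt(psi) * np.sqrt(sigma2 / 2 + np.sum((Lam * q) ** 2))
    return bool(lhs <= beta @ q) and np.all(q >= 0) and np.all(q <= q_max)

def bisect_varsigma(feasible, s_low, s_up, tol=1e-6):
    # Bisection loop of Algorithm 1: shrink [s_low, s_up] around the smallest
    # varsigma for which the feasibility problem is solvable.
    while s_up - s_low > tol:
        s = 0.5 * (s_low + s_up)
        if feasible(s):
            s_up = s
        else:
            s_low = s
    return s_up

# Toy single-cell instance: increasing varsigma shrinks psi_m and eventually
# makes a fixed candidate power vector feasible (a monotone oracle).
Lam = np.array([1.0, 0.3])
beta = np.array([2.0, 0.0])
sigma2, q_max = 0.2, np.array([1.5, 1.5])
q_cand = np.array([1.0, 0.5])
ups_sq, kappa, E_m = 5.0, 0.5, 1.0          # hypothetical constants defining psi

def feasible(varsigma):
    psi = ups_sq - kappa * varsigma / E_m   # psi_m shrinks as varsigma grows
    if psi <= 0:                            # non-positive psi: trivially feasible
        return True
    return socp_constraint_holds(q_cand, psi, Lam, beta, sigma2, q_max)

varsigma_star = bisect_varsigma(feasible, 0.0, 20.0)
```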
While the centralized scheme achieves globally optimal solutions under full CSI, it requires an aggregation center to collect information from all APs, including intra- and inter-cell channels. This introduces significant signaling overhead and limits the system’s scalability in real-world implementations. To address this, the next subsection proposes a distributed power control algorithm based on IT. In this framework, each AP requires only local CSI and interference thresholds, thereby enabling decentralized optimization while still approximating the global Pareto boundary through iterative coordination.

4.2. Distributed Scheme

This subsection investigates the decentralized power control strategy for parallel multi-task systems under IT constraints to characterize the optimality gap and the Pareto boundary.
We define Γ j , m as the IT value representing the maximum allowable interference power from devices in cell m to AP j. This constraint ensures that the interference from cell m does not exceed a tolerable threshold at AP j, thereby enabling independent local optimization while maintaining global coordination.
Let $\Gamma$ be a vector of size $M(M-1) \times 1$ containing all $\Gamma_{j,m}$, and $\Gamma_m$ be a vector of size $2(M-1) \times 1$ containing both $\Gamma_{j,m}$ and $\Gamma_{m,j}$. For AP $m \in \mathcal{M}$, replace the interference term $\sum_{i \in \mathcal{K}_j} \frac{p_i \Re\{h_{m,i} (h_{j,i})^H\}^2}{|h_{j,i}|^2}$ with $\Gamma_{m,j}$. Define the interference channel gain from device $k \in \mathcal{K}_m$ to neighboring AP j as $|g_{j,k}|^2 = \frac{\Re\{h_{j,k} (h_{m,k})^H\}^2}{|h_{m,k}|^2}$. Consequently, the optimality gap minimization problem can be independently addressed at AP $m \in \mathcal{M}$, formulated as
$$(\mathrm{P6}.m): \min_{\{p_k\}, \, \theta_m > 0} E_m \Bigg[ \sum_{k \in \mathcal{K}_m} \Bigg( \frac{\sqrt{p_k} |h_{m,k}|}{\sqrt{\theta_m}} - \upsilon_k \Bigg)^2 + \frac{\sum_{j \in \mathcal{M} \setminus \{m\}} \Gamma_{m,j}}{\theta_m} + \frac{\sigma_m^2}{2 \theta_m} \Bigg]$$
$$\mathrm{s.t.} \quad \sum_{k \in \mathcal{K}_m} p_k |g_{j,k}|^2 \le \Gamma_{j,m}, \quad \forall j \in \mathcal{M} \setminus \{m\},$$
$$\quad\quad 0 \le p_k \le P_k^{max}, \quad \forall k \in \mathcal{K}_m.$$
Due to the complex coupling between the transmit power variables $p_k$ and the denoising factor $\theta_m$ in the objective function of problem (P6.m), the problem exhibits a non-convex structure and is challenging to solve. To address this, we introduce the inverse denoising factor $\gamma_m = 1/\theta_m$ and auxiliary variables $Q_k = \sqrt{p_k \gamma_m}$, thereby transforming problem (P6.m) into
$$(\mathrm{P7}): \min_{\{Q_k\}, \, \gamma_m \ge 0} E_m \Bigg[ \sum_{k \in \mathcal{K}_m} \big( Q_k |h_{m,k}| - \upsilon_k \big)^2 + \gamma_m \Bigg( \sum_{j \in \mathcal{M} \setminus \{m\}} \Gamma_{m,j} + \frac{\sigma_m^2}{2} \Bigg) \Bigg]$$
$$\mathrm{s.t.} \quad \sum_{k \in \mathcal{K}_m} |Q_k g_{j,k}|^2 \le \Gamma_{j,m} \gamma_m, \quad \forall j \in \mathcal{M} \setminus \{m\},$$
$$\quad\quad Q_k^2 \le P_k^{max} \gamma_m, \quad \forall k \in \mathcal{K}_m.$$
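The change of variables behind (P7) is easy to confirm numerically: the sketch below evaluates the (P6.m) objective at a random point $(p, \theta_m)$ and the (P7) objective at the corresponding $(Q, \gamma_m)$, which agree exactly (all numeric values are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(4)
K = 3
p = rng.uniform(0.1, 1.0, K)         # transmit powers
theta = 1.7                          # denoising factor
h = rng.uniform(0.5, 2.0, K)         # in-cell channel magnitudes
ups = rng.uniform(0.5, 1.5, K)       # gradient std devs
Gamma_sum = 0.9                      # sum of inbound IT terms Gamma_{m,j}
sigma2 = 0.2
E_m = 1.3

# Objective of (P6.m) in the original variables (p, theta).
obj_p6 = E_m * (np.sum((np.sqrt(p) * h / np.sqrt(theta) - ups) ** 2)
                + Gamma_sum / theta + sigma2 / (2 * theta))

# Change of variables: gamma = 1/theta, Q_k = sqrt(p_k * gamma).
gamma = 1.0 / theta
Q = np.sqrt(p * gamma)
obj_p7 = E_m * (np.sum((Q * h - ups) ** 2) + gamma * (Gamma_sum + sigma2 / 2))
```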
Since problem ( P 7 ) is a standard convex optimization problem, it can be directly solved using common convex optimization tools (e.g., CVX [31]). To further derive a closed-form expression and enhance theoretical understanding of the problem structure, we employ the Lagrangian dual method for analysis.
Let λ j , m j M { m } 0 represent the Lagrange multipliers corresponding to the j-th IT constraint, and φ k denote the Lagrange multiplier associated with the k-th edge device’s power constraint. Thus, the partial Lagrangian function for AP m M is expressed as
\[
\mathcal{L}_m\left(Q_k,\gamma_m,\{\lambda_{j,m}\},\{\varphi_k\}\right)= E_m\left[\sum_{k\in\mathcal{K}_m}\left(Q_k|h_{m,k}|-\upsilon_k\right)^{2}+\gamma_m\left(\sum_{j\in\mathcal{M}\setminus\{m\}}\Gamma_{m,j}+\frac{\sigma_m^{2}}{2}\right)\right]+\sum_{k\in\mathcal{K}_m}\varphi_k\left(Q_k^{2}-P_k^{\max}\gamma_m\right)+\sum_{j\in\mathcal{M}\setminus\{m\}}\lambda_{j,m}\left(\sum_{k\in\mathcal{K}_m}|Q_k g_{j,k}|^{2}-\Gamma_{j,m}\gamma_m\right).
\]
This leads to the dual function
\[
\mathcal{R}_m\left(\{\lambda_{j,m}\},\{\varphi_k\}\right)=\min_{Q_k\ge 0,\ \gamma_m\ge 0}\ \mathcal{L}_m\left(Q_k,\gamma_m,\{\lambda_{j,m}\},\{\varphi_k\}\right)
\]
\[
\mathrm{s.t.}\quad Q_k^{2}\le P_k^{\max}\gamma_m,\quad \forall k\in\mathcal{K}_m.
\]
Therefore, the dual problem is formulated as
\[
(\mathrm{P8}):\quad \max_{\{\lambda_{j,m}\ge 0\},\ \{\varphi_k\ge 0\}}\ \mathcal{R}_m\left(\{\lambda_{j,m}\},\{\varphi_k\}\right).
\]
Since problem (P7) satisfies Slater's condition, strong duality holds between the primal and dual problems, which allows us to apply the Karush–Kuhn–Tucker (KKT) optimality conditions [32]. Denote the optimal solutions of the dual problem by {λ_{j,m}*}_{j∈M∖{m}} and {φ_k*}. According to the KKT optimality conditions, the variables (Q_k*, γ_m*, {λ_{j,m}*}, {φ_k*}) satisfy the following nonnegativity, complementary slackness, and stationarity conditions:
\[
\lambda_{j,m}^{*}\ge 0,\ \forall j\in\mathcal{M}\setminus\{m\},\qquad \varphi_k^{*}\ge 0,\ \forall k\in\mathcal{K}_m,
\]
\[
\lambda_{j,m}^{*}\left(\sum_{k\in\mathcal{K}_m}|Q_k^{*}g_{j,k}|^{2}-\Gamma_{j,m}\gamma_m^{*}\right)=0,\quad \forall j\in\mathcal{M}\setminus\{m\},
\]
\[
\varphi_k^{*}\left((Q_k^{*})^{2}-P_k^{\max}\gamma_m^{*}\right)=0,\quad \forall k\in\mathcal{K}_m,
\]
\[
\left.\frac{\partial\mathcal{L}_m}{\partial Q_k}\right|_{Q_k=Q_k^{*}}=0,\qquad \left.\frac{\partial\mathcal{L}_m}{\partial\gamma_m}\right|_{\gamma_m=\gamma_m^{*}}=0.
\]
To determine the optimal auxiliary variable Q k of each device, we begin by decomposing the Lagrangian dual function into K m independent subproblems, each corresponding to a device’s power optimization task. This decomposition allows for parallel optimization across devices.
The complementary slackness conditions indicate that when φ_k* > 0, the power constraint holds with equality and the device transmits at its maximum power P_k^max. Conversely, if φ_k* = 0, the transmit power satisfies \(Q_k^{2}\le P_k^{\max}\gamma_m\). Accordingly, each subproblem is equivalently formulated as
\[
(\mathrm{P9}):\quad \min_{Q_k\ge 0}\ E_m\left(Q_k|h_{m,k}|-\upsilon_k\right)^{2}+\sum_{j\in\mathcal{M}\setminus\{m\}}\lambda_{j,m}|Q_k g_{j,k}|^{2}
\]
\[
\mathrm{s.t.}\quad Q_k^{2}\le P_k^{\max}\gamma_m.
\]
Furthermore, according to the stationarity condition with respect to Q_k, we obtain
\[
\left.\frac{\partial\mathcal{L}_m}{\partial Q_k}\right|_{Q_k=Q_k^{*}}=2E_m\left(|h_{m,k}|^{2}Q_k^{*}-|h_{m,k}|\upsilon_k\right)+2Q_k^{*}\sum_{j\in\mathcal{M}\setminus\{m\}}\lambda_{j,m}^{*}|g_{j,k}|^{2}=0.
\]
By rearranging this equation and accounting for the maximum device power constraint, the optimal solution for Q_k is derived as
\[
Q_k^{*}=\min\left(\frac{E_m\upsilon_k|h_{m,k}|}{E_m|h_{m,k}|^{2}+\sum_{j\in\mathcal{M}\setminus\{m\}}\lambda_{j,m}|g_{j,k}|^{2}},\ \sqrt{P_k^{\max}\gamma_m}\right).
\]
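As a quick numerical illustration of the closed-form rule above, the following sketch evaluates Q_k* for all devices of a cell at once. All numerical values (E_m, υ_k, the channels, and the dual multiplier) are illustrative placeholders, not parameters from our simulations:

```python
import numpy as np

def optimal_Q(E_m, upsilon, h_abs, g_abs2, lam, P_max, gamma_m):
    """Closed-form Q_k* = min( E_m*v_k*|h_{m,k}| / (E_m*|h_{m,k}|^2
    + sum_j lambda_{j,m}*|g_{j,k}|^2), sqrt(P_max*gamma_m) ),
    vectorized over the devices of cell m."""
    reg = (lam[:, None] * g_abs2).sum(axis=0)   # sum_j lambda_{j,m} |g_{j,k}|^2
    Q_unc = E_m * upsilon * h_abs / (E_m * h_abs**2 + reg)
    return np.minimum(Q_unc, np.sqrt(P_max * gamma_m))

# Illustrative placeholder values (not the paper's simulation parameters)
rng = np.random.default_rng(0)
E_m, gamma_m, P_max = 2.0, 1.5, 1.0
h_abs = rng.uniform(0.5, 1.5, size=5)           # |h_{m,k}|
g_abs2 = rng.uniform(0.0, 0.2, size=(1, 5))     # |g_{j,k}|^2, one neighboring AP
upsilon = rng.uniform(0.5, 1.0, size=5)
lam = np.array([0.3])                           # lambda_{j,m}
Q = optimal_Q(E_m, upsilon, h_abs, g_abs2, lam, P_max, gamma_m)
```

For every device whose power cap is inactive, the returned value satisfies the stationarity condition \((E_m|h_{m,k}|^2+\sum_j \lambda_{j,m}|g_{j,k}|^2)Q_k = E_m\upsilon_k|h_{m,k}|\).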
Similarly, to determine the optimal inverse denoising factor γ m , we consider the following optimization problem:
\[
(\mathrm{P10}):\quad \min_{\gamma_m\ge 0}\ E_m\gamma_m\left(\sum_{j\in\mathcal{M}\setminus\{m\}}\Gamma_{m,j}+\frac{\sigma_m^{2}}{2}\right)-\sum_{j\in\mathcal{M}\setminus\{m\}}\lambda_{j,m}\Gamma_{j,m}\gamma_m-\sum_{k\in\mathcal{K}_m}\varphi_k P_k^{\max}\gamma_m.
\]
Applying the stationarity condition with respect to γ_m, we obtain
\[
\left.\frac{\partial\mathcal{L}_m}{\partial\gamma_m}\right|_{\gamma_m=\gamma_m^{*}}=E_m\left(\sum_{j\in\mathcal{M}\setminus\{m\}}\Gamma_{m,j}+\frac{\sigma_m^{2}}{2}\right)-\sum_{j\in\mathcal{M}\setminus\{m\}}\lambda_{j,m}^{*}\Gamma_{j,m}-\sum_{k\in\mathcal{K}_m}\varphi_k^{*}P_k^{\max}=0.
\]
Therefore, the optimal solution for γ m is derived as
\[
\gamma_m^{*}=\frac{E_m\left(\sum_{j\in\mathcal{M}\setminus\{m\}}\Gamma_{m,j}+\frac{\sigma_m^{2}}{2}\right)}{\sum_{j\in\mathcal{M}\setminus\{m\}}\lambda_{j,m}^{*}\Gamma_{j,m}+\sum_{k\in\mathcal{K}_m}\varphi_k^{*}P_k^{\max}}.
\]
To find the optimal dual variables {λ_{j,m}*}_{j∈M∖{m}} and {φ_k*}, we can employ the ellipsoid method [33], which uses subgradients to iteratively refine the estimates of the dual variables. The subgradient with respect to λ_{j,m} is \(\sum_{k\in\mathcal{K}_m}|Q_k^{*}g_{j,k}|^{2}-\Gamma_{j,m}\gamma_m^{*}\), and the subgradient with respect to φ_k is \((Q_k^{*})^{2}-P_k^{\max}\gamma_m^{*}\). By iteratively updating the dual variables along these subgradients, the ellipsoid method converges to the optimal dual variables. Once obtained, these variables can be substituted back into the expressions (61) and (64) to determine the optimal primal variables.
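For intuition, the subgradients above can be sketched in code together with a plain projected-subgradient ascent step; we use this simpler update purely as an illustrative stand-in for the ellipsoid method, with hypothetical values:

```python
import numpy as np

def dual_subgradients(Q, gamma_m, g_abs2, Gamma_jm, P_max):
    """Subgradients of the dual function at the current (Q*, gamma_m*):
    w.r.t. lambda_{j,m}: sum_k |Q_k g_{j,k}|^2 - Gamma_{j,m} * gamma_m
    w.r.t. phi_k:        Q_k^2 - P_max * gamma_m"""
    sg_lam = (g_abs2 * Q**2).sum(axis=1) - Gamma_jm * gamma_m
    sg_phi = Q**2 - P_max * gamma_m
    return sg_lam, sg_phi

def ascent_step(lam, phi, sg_lam, sg_phi, step):
    """One projected subgradient ascent step, keeping multipliers nonnegative."""
    return np.maximum(lam + step * sg_lam, 0.0), np.maximum(phi + step * sg_phi, 0.0)

# Hypothetical values for a cell with two devices and one neighboring AP
lam, phi = np.array([0.2]), np.array([0.1, 0.0])
Q, gamma_m, P_max = np.array([1.0, 0.5]), 2.0, 1.0
g_abs2, Gamma_jm = np.array([[0.1, 0.2]]), np.array([0.5])
sg_lam, sg_phi = dual_subgradients(Q, gamma_m, g_abs2, Gamma_jm, P_max)
lam, phi = ascent_step(lam, phi, sg_lam, sg_phi, step=0.1)
```

When a constraint is strictly satisfied, its subgradient is negative and the corresponding multiplier is driven toward zero, consistent with complementary slackness.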
Finally, the optimal transmit power p k * and the denoising factor θ m * are computed as
\[
p_k^{*}=\min\left(\left(\frac{E_m\upsilon_k|h_{m,k}|\sqrt{\theta_m^{*}}}{E_m|h_{m,k}|^{2}+\sum_{j\in\mathcal{M}\setminus\{m\}}\lambda_{j,m}^{*}|g_{j,k}|^{2}}\right)^{2},\ P_k^{\max}\right),
\]
\[
\theta_m^{*}=\frac{\sum_{j\in\mathcal{M}\setminus\{m\}}\lambda_{j,m}^{*}\Gamma_{j,m}+\sum_{k\in\mathcal{K}_m}\varphi_k^{*}P_k^{\max}}{E_m\left(\sum_{j\in\mathcal{M}\setminus\{m\}}\Gamma_{m,j}+\frac{\sigma_m^{2}}{2}\right)}.
\]
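The mapping back from the convexified variables (Q_k*, γ_m*) to the original variables follows from θ_m = 1/γ_m and p_k = Q_k²/γ_m, and can be sketched as follows (values are illustrative):

```python
import numpy as np

def recover_primal(Q_star, gamma_star, P_max):
    """Map back: theta_m = 1/gamma_m and p_k = Q_k^2 / gamma_m,
    clipped to the per-device power budget P_max."""
    theta = 1.0 / gamma_star
    p = np.minimum(Q_star**2 / gamma_star, P_max)
    return p, theta

p, theta = recover_primal(np.array([0.8, 1.2]), gamma_star=2.0, P_max=0.5)
```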
These results indicate that when φ k * > 0 , the device transmits at its maximum power P k max . Otherwise, the transmit power is adjusted based on the regularized inverse power transmission strategy, where the regularization term j M { m } λ j , m * g j , k 2 accounts for inter-cell interference. The denoising factor θ m * is influenced by both inter-cell interference and the power constraints of all devices. While these solutions satisfy the inter-cell interference constraints, they are limited to scenarios with fixed interference thresholds and may not achieve Pareto optimality. Therefore, an iterative algorithm for optimizing inter-cell interference thresholds is proposed to enhance overall system performance.
To facilitate efficient distributed power control in parallel multi-task Air-FL systems, this paper introduces the concept of IT as a pivotal coordination variable to establish a local information-driven distributed optimization framework. (The proposed analysis and algorithms assume quasi-static channel conditions, where wireless links remain constant over each communication round. This assumption enables tractable optimization and reliable gradient aggregation using AirComp. In practical scenarios with mobility or fading, channel variations may affect aggregation accuracy. Nonetheless, our distributed scheme relies only on local CSI and permits iterative IT updates, offering inherent adaptability to slow channel variations. Future work may consider extending the framework to incorporate robust optimization techniques for time-varying or uncertain CSI, thereby improving resilience under mobility and fading.) Building upon this foundation, an iterative collaborative algorithm is proposed, wherein APs dynamically adjust mutual IT values through peer-to-peer signaling interactions.
In each iteration, the system selects a pair of APs for local collaborative updates. These updates are designed to ensure that the optimality gap of the local cell does not increase, nor does the performance of other cells degrade, thereby guaranteeing non-decreasing overall system performance. To implement this mechanism, it is assumed that APs possess basic backhaul link support for essential information sharing and synchronization of interference parameters.
However, determining whether the system achieves Pareto optimality under arbitrary IT configurations remains challenging. To address this, a lemma is proposed to characterize the necessary constraints between IT and the cell optimality gap under Pareto optimality conditions. This lemma provides a theoretical foundation for subsequent algorithm design.
Lemma 1
(Necessary Condition for Pareto Optimality). Under any given IT configuration Γ, if the optimality gaps Δ̄_m(Γ_m) have reached a Pareto-optimal state, then for any pair of APs m and j, the determinant of the following 2 × 2 matrix T_{j,m} must be zero [34]:
\[
\mathbf{T}_{j,m}=\begin{bmatrix}\dfrac{\partial\bar{\Delta}_m(\boldsymbol{\Gamma}_m)}{\partial\Gamma_{j,m}} & \dfrac{\partial\bar{\Delta}_m(\boldsymbol{\Gamma}_m)}{\partial\Gamma_{m,j}}\\[2mm]\dfrac{\partial\bar{\Delta}_j(\boldsymbol{\Gamma}_j)}{\partial\Gamma_{j,m}} & \dfrac{\partial\bar{\Delta}_j(\boldsymbol{\Gamma}_j)}{\partial\Gamma_{m,j}}\end{bmatrix}.
\]
Building upon the previously established analytical framework, we now delve into the derivation of the matrix elements within the IT optimization context. By solving both the primal and dual formulations of the problem, we obtain explicit expressions for each component of the matrix T j , m , which encapsulates the sensitivity of the optimality gaps to variations in IT parameters.
Specifically, the partial derivatives of the optimality gaps with respect to each cell's own IT budget are given by
\[
\frac{\partial\bar{\Delta}_m(\boldsymbol{\Gamma}_m)}{\partial\Gamma_{j,m}}=-\lambda_{j,m}^{*}\gamma_m^{*},\qquad \frac{\partial\bar{\Delta}_j(\boldsymbol{\Gamma}_j)}{\partial\Gamma_{m,j}}=-\lambda_{m,j}^{*}\gamma_j^{*},
\]
where λ_{j,m}* and λ_{m,j}* denote the optimal dual multipliers associated with the IT constraints in the primal problems of cells m and j, respectively, and γ_m*, γ_j* are the corresponding optimal inverse denoising factors.
Additionally, the derivatives with respect to the reciprocal IT parameters are written as
\[
\frac{\partial\bar{\Delta}_m(\boldsymbol{\Gamma}_m)}{\partial\Gamma_{m,j}}=E_m\gamma_m^{*},\qquad \frac{\partial\bar{\Delta}_j(\boldsymbol{\Gamma}_j)}{\partial\Gamma_{j,m}}=E_j\gamma_j^{*}.
\]
To iteratively refine the IT vector Γ, we update only the mutual IT parameters between a selected pair of APs m and j, keeping the rest unchanged:
\[
\left[\Gamma_{j,m}^{\prime},\ \Gamma_{m,j}^{\prime}\right]^{T}=\left[\Gamma_{j,m},\ \Gamma_{m,j}\right]^{T}+\delta_{j,m}\cdot\mathbf{t}_{j,m},
\]
where Γ′_{j,m} and Γ′_{m,j} are the updated IT values; δ_{j,m} denotes a suitably small step size; and t_{j,m} is a direction vector satisfying T_{j,m} t_{j,m} < 0 elementwise, thereby guaranteeing a reduction in both optimality gaps.
Assuming \(\mathbf{T}_{j,m}=\begin{bmatrix} a & b\\ c & d\end{bmatrix}\), a feasible direction t_{j,m} is computed as
\[
\mathbf{t}_{j,m}=\operatorname{sign}(bc-ad)\cdot\left[\epsilon_{j,m}d-b,\ a-\epsilon_{j,m}c\right]^{T},
\]
where ε_{j,m} is a ratio parameter that modulates the relative reduction rate of the optimality gaps of APs m and j, and sign(·) denotes the sign function.
By adjusting ε_{j,m}, we can control the balance of improvement between the two APs: setting ε_{j,m} ≫ 1 prioritizes AP m, while ε_{j,m} ≪ 1 favors AP j. Through careful selection of δ_{j,m} and ε_{j,m}, we can traverse the Pareto boundary to identify configurations where no AP's performance can be improved without degrading another's.
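A small numerical check of the direction rule, with a hypothetical sensitivity matrix T_{j,m}, confirms that T_{j,m} t_{j,m} has strictly negative entries, i.e., both gaps decrease to first order:

```python
import numpy as np

def direction(T, eps):
    """t_{j,m} = sign(bc - ad) * [eps*d - b, a - eps*c]^T.
    Algebraically, T @ t = -|ad - bc| * [eps, 1]^T, so both entries are
    strictly negative whenever det(T) != 0 and eps > 0."""
    (a, b), (c, d) = T
    return np.sign(b * c - a * d) * np.array([eps * d - b, a - eps * c])

T = np.array([[-0.25, 0.1],     # hypothetical sensitivity matrix T_{j,m}
              [0.1, -0.25]])
t = direction(T, eps=1.0)
```

With eps = 1, both cells see an identical first-order reduction; other eps values tilt the reduction toward one AP, as described above.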
This iterative approach culminates in a distributed power control algorithm that leverages IT optimization to enhance overall system efficiency. The complete algorithm is encapsulated in Algorithm 2, which systematically updates the IT parameters to converge towards Pareto-optimal configurations. (The proposed distributed scheme currently assumes uniform data and computational resources across devices. In practice, edge devices may exhibit heterogeneity in dataset sizes, local data distributions, and processing capabilities. Our framework can accommodate such variations by incorporating per-device weighting or fairness constraints in the optimization objectives. Extending the algorithm to explicitly model non-IID data and device-level heterogeneity is a promising direction for future research).
The distributed algorithm outlined above incrementally minimizes the global optimality gap across all APs through iterative, pairwise updates of IT parameters. (Although our distributed power optimization framework shares a conceptual similarity with some interference-constrained optimization methods (e.g., [35]), there are key distinctions. In [35], the objective is to enhance location privacy and spectrum efficiency in cognitive radio networks through spectrum sharing policies, typically under single-task settings. In contrast, our work addresses a fundamentally different problem of minimizing the learning optimality gap in multi-task Air-FL systems. Furthermore, our framework integrates gradient aggregation distortion, Pareto boundary trade-offs, and an iterative IT update mechanism that collectively enable decentralized learning coordination, which is different from [35].) In each iteration, a selected pair of APs adjusts their mutual IT levels to reduce their respective optimality gaps. Crucially, these adjustments are designed to ensure that the optimality gaps of other APs in the network remain unaffected, thereby preserving overall system stability.
Algorithm 2: IT-based decentralized scheme for solving problem (P1).
1: Input: IT vector Γ, convergence threshold ι.
2: for any pair of APs m and j, j ∈ M∖{m} do
3:   if |det T_{j,m}| > ι then
4:     APs m and j synchronously update their current IT values via a reliable backhaul link.
5:     APs m and j solve problem (P7) to obtain the optimal solution {p_k*}_{k∈K_m}, {p_i*}_{i∈K_j}, θ_m*, θ_j*, λ_{j,m}*, λ_{m,j}*.
6:     APs m and j update the elements in T_{j,m} based on (68) and (69).
7:     APs m and j exchange the elements of T_{j,m} via the backhaul link, reconstruct T_{j,m} via (67), and compute t_{j,m} via (71).
8:     APs m and j update their mutual IT values Γ_{j,m} and Γ_{m,j} according to (70).
9:   end if
10: end for
11: Output: Optimal device transmit power p_k* and optimal denoising factor θ_m*.
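To illustrate the mechanics of one pairwise update in Algorithm 2, the following toy example uses two hypothetical gap functions (decreasing in a cell's own interference budget, increasing in the interference it must tolerate); these stand-in functions are ours for demonstration, not the paper's optimality gaps:

```python
import numpy as np

# Toy stand-in gap functions (for illustration only)
def gap1(G21, G12):
    return 1.0 / (1.0 + G21) + 0.1 * G12

def gap2(G21, G12):
    return 1.0 / (1.0 + G12) + 0.1 * G21

def T_matrix(G21, G12):
    # analytic partial derivatives of (gap1, gap2) w.r.t. (Gamma_{2,1}, Gamma_{1,2})
    return np.array([[-1.0 / (1.0 + G21)**2, 0.1],
                     [0.1, -1.0 / (1.0 + G12)**2]])

# One pairwise update, mirroring steps 3-8 of Algorithm 2
G21 = G12 = 1.0
T = T_matrix(G21, G12)
(a, b), (c, d) = T
eps, delta = 1.0, 0.1
t = np.sign(b * c - a * d) * np.array([eps * d - b, a - eps * c])
G21_new, G12_new = np.array([G21, G12]) + delta * t
```

Because T t has strictly negative entries and the step size is small, a single update strictly reduces both cells' gaps in this toy model.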
This iterative update mechanism guides the system’s performance toward the Pareto boundary of the optimality gap region. By carefully selecting step sizes and direction vectors for IT adjustments, the algorithm ensures that each update leads to a non-increasing optimality gap for the involved APs. Over successive iterations, this process converges to a state where no further reductions in optimality gaps are possible without adversely affecting other APs, which signifies Pareto optimality.
Compared to the initial configurations or schemes where each AP optimizes independently, this cooperative approach achieves a more efficient global performance. It encourages APs to participate in collaborative adjustments, even when such changes may not immediately benefit their individual performance metrics. This collaborative behavior is facilitated by the algorithm’s design, which ensures that any trade-offs made by individual APs contribute to the overall reduction in the system’s optimality gap. Such approaches highlight the potential of decentralized coordination mechanisms in enhancing the efficiency and fairness of distributed wireless systems.

5. Numerical Results

In this section, we present the simulation results to evaluate the performance of the proposed distributed scheme in multi-task Air-FL scenarios. Specifically, we consider two representative tasks: ridge regression on synthetic datasets and handwritten digit classification using the Modified National Institute of Standards and Technology (MNIST) dataset. These experiments aim to assess the effectiveness of the distributed scheme in managing interference and optimizing learning performance across multiple tasks. (While MNIST and synthetic regression tasks are used in our simulations for clarity and analytical convenience, the proposed distributed power control framework is compatible with more complex datasets and deep neural models. Our goal in this paper is to highlight the impact of interference-aware power optimization on FL performance under multi-task and multi-cell settings.)

5.1. Simulation Setup and Benchmark Schemes

We simulate a two-cell network, with each cell hosting a distinct FL task. The first cell is centered at coordinates ( 0 , 0 ) , and the second at ( 40 , 0 ) . Devices within each cell are randomly distributed within a 20 m radius around their respective APs. The wireless channels between devices and APs follow a distance-dependent Rayleigh fading model. (In this study, we adopt Rayleigh fading to model the wireless channels between APs and devices. This choice captures a rich-scattering, non-line-of-sight (NLoS) environment and allows for tractable performance evaluation. However, we acknowledge that practical deployments may involve line-of-sight (LoS) components, in which case Rician or Nakagami-m fading models may provide more accurate characterizations. Incorporating such models to study the impact of LoS-dominant fading on Air-FL performance is an important direction for future work.) The channel gain between device k and AP m is modeled as \(h_{m,k}=\Omega_0 d_{m,k}^{-\zeta}h_0\), where Ω₀ denotes the path loss at a reference distance of 1 m; ζ is the path-loss exponent; d_{m,k} is the distance between device k and AP m; and h₀ ∼ CN(0, 1) is a complex Gaussian random variable with zero mean and unit variance. Interference channels between devices and non-associated APs are modeled similarly. Simulation parameters are detailed in Table 2.
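The fading model can be instantiated as below. The square-root amplitude scaling is our assumed parameterization (the distances and parameters are illustrative); it makes the average channel gain E[|h|²] equal the path loss Ω₀ d^(−ζ):

```python
import numpy as np

def rayleigh_channels(n, d, omega0=1.0, zeta=3.0, rng=None):
    """Draw n Rayleigh-fading channel coefficients for a device at distance d:
    h = sqrt(omega0 * d**(-zeta)) * h0 with h0 ~ CN(0, 1). The square-root
    amplitude scaling is our assumption; it yields E[|h|^2] = omega0 * d**(-zeta)."""
    rng = rng or np.random.default_rng(0)
    h0 = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2.0)
    return np.sqrt(omega0 * d ** (-zeta)) * h0

h = rayleigh_channels(200_000, d=10.0)   # illustrative distance and parameters
```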
In terms of performance metrics, the optimality gap and prediction error are used to evaluate the ridge regression task on the synthetic dataset. To benchmark the proposed distributed scheme, we compare it against the following baseline methods:
  • Benchmark with maximum power: In this scheme, all devices transmit at their maximum power levels, i.e., p_k = P_k^max. This scheme requires no CSI collection and represents the simplest power control strategy.
  • Benchmark without AirComp: In this scheme, all devices transmit their local model updates to their respective APs, which perform aggregation without any interference. This scenario assumes an ideal communication environment, serving as an upper performance bound.
  • Benchmark without interference: In this scheme, each AP optimizes the device transmit power and denoising factor based solely on intra-cell CSI, without coordinating with other APs. The optimization problem for AP m is formulated as
\[
\min_{\{0\le p_k\le P_k^{\max}\},\ \theta_m\ge 0}\ E_m\left[\sum_{k\in\mathcal{K}_m}\left(\frac{\sqrt{p_k}\,|h_{m,k}|}{\sqrt{\theta_m}}-\upsilon_k\right)^{2}+\frac{\sigma_m^{2}}{2\theta_m}\right].
\]

5.2. Multi-Task Ridge Regression Performance

Each cell independently trains a ridge regression model on a distinct dataset, and each cell's dataset follows a unique distribution, reflecting heterogeneous data environments. The sample loss function for cell m ∈ M is defined as
\[
H_m(\mathbf{e}_m,\mathbf{x}_m,\tau_m)=\frac{1}{2}\left(\mathbf{x}_m^{T}\mathbf{e}_m-\tau_m\right)^{2}+\rho R(\mathbf{e}_m),
\]
where ρ = 5 × 10⁻⁵ is the regularization hyperparameter and R(·) denotes the regularizer. The input sample vector x_m ∈ ℝ^C for each cell is drawn from a standard normal distribution, i.e., x_m ∼ N(0, I), with model dimension C = 20. The target labels are generated as τ₁ = x₁(2) + 3x₁(5) + 0.2z and τ₂ = x₂(3) + 2x₂(8) + 4x₂(10) + 0.3z, where x_m(i) denotes the i-th entry of x_m and z represents standard Gaussian noise. Each device within a cell holds D_k = 500, ∀k ∈ K, data samples, resulting in a total of D_m = Σ_{k∈K} D_k = 5000 samples per cell.
To evaluate the convergence behavior, we compute the smoothness parameter L_m and the Polyak–Łojasiewicz parameter μ_m for each cell as in Assumptions A1 and A3, based on the eigenvalues of the regularized sample covariance matrix \(X_m^{T}X_m/D_m+10^{-4}I\) [36].
The optimal model parameters are obtained in closed form as \(\mathbf{e}_m^{*}=(X_m^{T}X_m+\rho I)^{-1}X_m^{T}\boldsymbol{\tau}_m\), with \(\boldsymbol{\tau}_m=[\tau_1,\ldots,\tau_{D_m}]^{T}\). The initial model of each cell is set to the zero vector.
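The closed-form solution can be verified numerically with data generated per the Cell 1 description; here we assume the regularizer enters the empirical loss as (ρ/2)‖e‖², the convention consistent with the (XᵀX + ρI)⁻¹ expression:

```python
import numpy as np

rng = np.random.default_rng(1)
D, C, rho = 5000, 20, 5e-5
X = rng.standard_normal((D, C))              # rows are samples x_m ~ N(0, I)
z = rng.standard_normal(D)
tau = X[:, 1] + 3 * X[:, 4] + 0.2 * z        # Cell 1 labels (1-based indices 2 and 5)

# Closed-form ridge solution e* = (X^T X + rho I)^{-1} X^T tau, consistent with
# an empirical loss 0.5*||X e - tau||^2 + (rho/2)*||e||^2 (our assumed convention).
e_star = np.linalg.solve(X.T @ X + rho * np.eye(C), X.T @ tau)
grad = X.T @ (X @ e_star - tau) + rho * e_star   # gradient vanishes at the optimum
```

The recovered coefficients are close to the generating ones (≈1 and ≈3 in the two active entries), since ρ is tiny and D ≫ C.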
Figure 3 presents the evolution of the optimality gap and prediction error across communication rounds N for various power control strategies in both cells. The proposed distributed scheme demonstrates a consistent decline in both metrics as N increases, with the prediction error stabilizing in later rounds. This trend indicates that the local models of both cells progressively converge towards their respective optimal solutions on the training set and ultimately achieve strong generalization performance on the test set. Analyzing the performance disparity between the two cells, Cell 1 consistently exhibits superior outcomes in both optimality gap and prediction error. This advantage is primarily due to the simpler label structure and lower data noise in Cell 1’s task, thereby enabling more efficient learning and faster convergence under identical training conditions. Comparative assessments reveal that the proposed distributed scheme significantly outperforms both the benchmark with maximum power and benchmark without interference in both cells. Specifically, the proposed scheme not only accelerates the reduction in training errors to bring the model closer to the optimal training solution but also enhances generalization capability as evidenced by the lower prediction errors on the test set. This performance underscores its effectiveness in balancing rapid convergence with robust generalization. Furthermore, Figure 3 also includes the benchmark without AirComp as an ideal performance upper bound. In later communication rounds, the prediction error achieved by the proposed distributed scheme closely approaches this ideal benchmark with only a marginal gap. This proximity illustrates the scheme’s capacity to approximate optimal performance despite practical challenges such as communication losses and inter-cell interference. 
In contrast, other baseline schemes continue to exhibit significant error at high communication rounds, particularly the benchmark with maximum power, which highlights their limitations in achieving efficient convergence and generalization.
Figure 4 illustrates the achievable regions of the optimality gap for the two cells at communication round N = 100. The Pareto boundary is delineated by varying the profile vector of the centralized power control scheme as κ = [κ_m, 1 − κ_m], with κ_m ∈ {0.001, 0.01, 0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9, 0.99, 0.999}. The bottom-leftmost point on the Pareto boundary of the centralized scheme represents the Pareto-optimal point, while the bottom-leftmost point of the benchmark without AirComp denotes the idealized learning performance, which serves as a reference for evaluating how closely practical schemes approach optimal performance. It is observed that the Pareto-optimal point lies closest to this idealized benchmark, indicating that the centralized scheme achieves the best practical performance. Furthermore, the proposed distributed scheme performs remarkably close to the Pareto-optimal point, with deviations within 10⁻⁴, demonstrating its efficacy in approaching centralized optimality. This proximity underscores its practical applicability, especially in scenarios constrained by communication resources. In contrast, the benchmark without interference exhibits inferior overall performance, which highlights the importance of accounting for inter-cell dynamics. The benchmark with maximum power performs the worst due to the lack of power adjustment and interference management.
Figure 5 demonstrates the convergence behavior of the proposed distributed scheme by depicting the evolution of the optimality gap for Cell 1 and Cell 2 across iterative rounds with K = 10 and K = 20 devices per cell. The results demonstrate that all schemes converge within 30 rounds. (While our scheme demonstrates convergence within approximately 30 global rounds, we do not compare directly with FedAvg or FedProx, as these methods operate under fundamentally different assumptions. As such, we evaluate against baselines that are aligned with our system model to ensure a fair and meaningful comparison.) Comparative analysis between the two device configurations reveals that, under the same number of iterations, the scenario with K = 20 devices per cell attains a significantly lower optimality gap than the K = 10 scenario. This suggests that incorporating more devices into the optimization process enhances the overall system performance, which drives the learning outcomes closer to the optimal solution. Such scalability is crucial for practical applications, where the number of participating devices can vary. While an increased number of devices introduces greater complexity in collaborative computation and power adjustment, potentially leading to a slight reduction in convergence speed, the algorithm maintains stable convergence within a limited number of iterations. This resilience confirms the effectiveness of the proposed distributed scheme in managing complex multi-cell scenarios, thereby ensuring reliable performance even as the network scales.

5.3. Performance on Multi-Task MNIST Dataset

We establish a parallel multi-task handwritten digit classification scenario utilizing the MNIST dataset. Specifically, Cell 1 and Cell 2 are assigned distinct subsets of digit classes, labeled as 0–4 and 5–9, respectively. This configuration introduces task heterogeneity, as each cell processes non-overlapping label distributions. Both cells employ an identical convolutional neural network (CNN) architecture for local model training. The architecture comprises two convolutional layers with rectified linear unit (ReLU) activation functions and kernel sizes of 5 × 5 , containing 32 and 64 channels, respectively. Each convolutional layer is followed by a 2 × 2 max-pooling operation. Subsequently, the network includes a fully connected layer with 1024 units, culminating in a softmax output layer for classification. During local training, all devices utilize a fixed mini-batch size of m b = 128 . The performance of the models is evaluated using the average test accuracy and loss function values of both cells to provide a comprehensive assessment of classification efficacy and convergence behavior within the multi-task learning system.
Figure 6 illustrates the impact of communication rounds N on the overall performance of parallel multi-task Air-FL over the MNIST dataset. From Figure 6a, all schemes demonstrate a continuous improvement in average test accuracy as N increases, with convergence observed in the later stages. The benchmark without AirComp serves as the performance upper bound, which achieves optimal accuracy. Notably, the proposed distributed scheme closely approaches this ideal benchmark throughout the training process, with an optimal test accuracy difference within 1 % . This proximity demonstrates its near-optimal classification performance, which surpasses both the benchmark without interference and the benchmark with maximum power, thereby validating its superior convergence speed and learning efficacy. Figure 6b reveals that the average loss function value decreases across all schemes, albeit with varying rates of decline and final convergence levels. The proposed distributed scheme achieves a loss function value closer to the optimal performance of the benchmark without AirComp compared to other baseline schemes. This outcome highlights its capability to effectively enhance the model’s generalization performance.

6. Conclusions

This paper addressed the challenge of inter-cell interference in parallel multi-task Air-FL systems, where each cell independently trains a distinct model while simultaneously performing over-the-air aggregation. We analyzed how aggregation errors impact the local optimality gap and proposed a joint optimization framework for device transmit power and denoising factors aimed at minimizing the cumulative optimality gap across all cells. To characterize the trade-offs in multi-cell performance, we employed Pareto boundary theory and designed a centralized optimization scheme that serves as a performance upper bound. Building upon this, we introduced a distributed power control strategy based on IT, which decouples the globally coupled problem into locally solvable subproblems. This approach allows each cell to independently adjust its transmit power using only local CSI. To enhance computational efficiency, we derived closed-form solutions for the subproblems through Lagrangian duality theory and implemented a dynamic IT update mechanism to effectively approach the Pareto boundary. Simulation results demonstrated that the proposed distributed scheme outperforms benchmark methods in terms of training convergence speed, cross-cell performance balance, and test accuracy. Future work will extend our framework to larger-scale and more heterogeneous datasets such as CIFAR-10, which will enable validation under non-convex learning objectives and more realistic visual data distributions. This will further demonstrate the scalability and robustness of our distributed scheme.

Author Contributions

Conceptualization, C.T. and J.Y.; methodology, C.T. and J.Y.; software, C.T. and D.H.; validation, C.T., J.Y. and D.H.; formal analysis, C.T. and J.Y.; investigation, C.T. and J.Y.; resources, J.Y.; data curation, C.T. and D.H.; writing—original draft preparation, C.T. and J.Y.; writing—review and editing, C.T. and J.Y.; visualization, J.Y. and D.H.; supervision, J.Y.; project administration, J.Y.; funding acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

FL: Federated learning
AirComp: Over-the-air computation
Air-FL: Over-the-air federated learning
IT: Interference temperature
CSI: Channel state information
STAR-RIS: Simultaneously transmitting and reflecting reconfigurable intelligent surface
MSE: Mean squared error
AO: Alternating optimization
AP: Access point
FedSGD: Federated stochastic gradient descent
AWGN: Additive white Gaussian noise
SOCP: Second-order cone program
KKT: Karush–Kuhn–Tucker
non-IID: Non-independent and identically distributed
MNIST: Modified National Institute of Standards and Technology
LoS: Line-of-sight
NLoS: Non-line-of-sight
CNN: Convolutional neural network
ReLU: Rectified linear unit

References

  1. Verbraeken, J.; Wolting, M.; Katzy, J.; Kloppenburg, J.; Verbelen, T.; Rellermeyer, J.S. A survey on distributed machine learning. ACM Comput. Surv. 2020, 53, 1–33. [Google Scholar] [CrossRef]
  2. Wang, S.; Tuor, T.; Salonidis, T.; Leung, K.K.; Makaya, C.; He, T.; Chan, K. When edge meets learning: Adaptive control for resource-constrained distributed machine learning. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM), Honolulu, HI, USA, 16–19 April 2018; pp. 63–71. [Google Scholar] [CrossRef]
  3. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. arXiv 2017, arXiv:1602.05629v4. [Google Scholar]
  4. Imteaj, A.; Thakker, U.; Wang, S.; Li, J.; Amini, M.H. A survey on federated learning for resource-constrained IoT devices. IEEE Internet Things J. 2022, 9, 1–24. [Google Scholar] [CrossRef]
  5. Zhou, C.; Liu, J.; Jia, J.; Zhou, J.; Zhou, Y.; Dai, H.; Dou, D. Efficient device scheduling with multi-job federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; pp. 9971–9979. [Google Scholar] [CrossRef]
  6. Ma, H.; Guo, H.; Lau, V.K.N. Communication-efficient federated multitask learning over wireless networks. IEEE Internet Things J. 2023, 10, 609–624. [Google Scholar] [CrossRef]
  7. Sami, H.U.; Güler, B. Over-the-air clustered federated learning. IEEE Trans. Wirel. Commun. 2023, 23, 7877–7893. [Google Scholar] [CrossRef]
  8. Cao, X.; Başar, T.; Diggavi, S.; Eldar, Y.C.; Letaief, K.B.; Poor, H.V.; Zhang, J. Communication-efficient distributed learning: An overview. IEEE J. Sel. Areas Commun. 2023, 41, 851–873. [Google Scholar] [CrossRef]
  9. Cao, X.; Lyu, Z.; Zhu, G.; Xu, J.; Xu, L.; Cui, S. An overview on over-the-air federated edge learning. IEEE Wirel. Commun. 2024, 31, 202–210. [Google Scholar] [CrossRef]
  10. Wang, Z.; Zhao, Y.; Zhou, Y.; Shi, Y.; Jiang, C.; Letaief, K.B. Over-the-air computation for 6G: Foundations, technologies, and applications. IEEE Internet Things J. 2024, 11, 24634–24658. [Google Scholar] [CrossRef]
  11. Şahin, A.; Yang, R. A survey on over-the-air computation. IEEE Commun. Surv. Tuts. 2023, 25, 1877–1908. [Google Scholar] [CrossRef]
  12. Zhu, J.; Shi, Y.; Zhou, Y.; Jiang, C.; Chen, W.; Letaief, K.B. Over-the-air federated learning and optimization. IEEE Internet Things J. 2024, 11, 16996–17020. [Google Scholar] [CrossRef]
  13. Azimi-Abarghouyi, S.M.; Fodor, V. Scalable hierarchical over-the-air federated learning. IEEE Trans. Wirel. Commun. 2024, 23, 8480–8496. [Google Scholar] [CrossRef]
  14. Aygün, O.; Kazemi, M.; Gündüz, D.; Duman, T.M. Over-the-air federated edge learning with hierarchical clustering. IEEE Trans. Wirel. Commun. 2024, 23, 17856–17871. [Google Scholar] [CrossRef]
  15. Asaad, S.; Wang, P.; Tabassum, H. Over-the-air FEEL with integrated sensing: Joint scheduling and beamforming design. IEEE Trans. Wirel. Commun. 2025, 24, 3273–3288. [Google Scholar] [CrossRef]
  16. Liang, Y.; Chen, Q.; Zhu, G.; Jiang, H.; Eldar, Y.C.; Cui, S. Communication-and-energy efficient over-the-air federated learning. IEEE Trans. Wirel. Commun. 2025, 24, 767–782. [Google Scholar] [CrossRef]
  17. Zhong, C.; Yang, H.; Yuan, X. Over-the-air federated multi-task learning over MIMO multiple access channels. IEEE Trans. Wirel. Commun. 2023, 22, 3853–3868. [Google Scholar] [CrossRef]
  18. Li, F.; Ye, Q.; Fapi, E.T.; Sun, W.; Jiang, Y. Multi-cell over-the-air computation systems with spectrum sharing: A perspective from α-fairness. IEEE Trans. Veh. Technol. 2023, 72, 16249–16265. [Google Scholar] [CrossRef]
  19. Wang, Z.; Zhou, Y.; Shi, Y.; Zhuang, W. Interference management for over-the-air federated learning in multi-cell wireless networks. IEEE J. Sel. Areas Commun. 2022, 40, 2361–2377. [Google Scholar] [CrossRef]
  20. Zeng, X.; Mao, Y.; Shi, Y. STAR-RIS assisted over-the-air vertical federated learning in multi-cell wireless networks. In Proceedings of the IEEE International Conference on Communications Workshops (ICC Wkshps), Rome, Italy, 28 May–1 June 2023; pp. 361–366. [Google Scholar] [CrossRef]
  21. Zhou, F.; Wang, Z.; Shan, H.; Wu, L.; Tian, X.; Shi, Y.; Zhou, Y. Over-the-air hierarchical personalized federated learning. IEEE Trans. Veh. Technol. 2025, 74, 5006–5021. [Google Scholar] [CrossRef]
  22. Guo, W.; Huang, C.; Qin, X.; Yang, L.; Zhang, W. Dynamic clustering and power control for two-tier wireless federated learning. IEEE Trans. Wirel. Commun. 2024, 23, 1356–1371. [Google Scholar] [CrossRef]
  23. Li, W.; Chen, G.; Zhang, X.; Wang, N.; Ouyang, D.; Chen, C. Efficient and secure aggregation framework for federated-learning-based spectrum sharing. IEEE Internet Things J. 2024, 11, 17223–17236. [Google Scholar] [CrossRef]
  24. Wu, T.; Qu, Y.; Liu, C.; Dai, H.; Dong, C.; Cao, J. Cost-efficient federated learning for edge intelligence in multi-cell networks. IEEE/ACM Trans. Netw. 2024, 32, 4472–4487. [Google Scholar] [CrossRef]
  25. Li, X.; Huang, K.; Yang, W.; Wang, S.; Zhang, Z. On the Convergence of FedAvg on Non-IID Data. arXiv 2019, arXiv:1907.02189v4. [Google Scholar]
  26. Bottou, L.; Curtis, F.E.; Nocedal, J. Optimization methods for large-scale machine learning. SIAM Rev. 2018, 60, 223–311. [Google Scholar] [CrossRef]
  27. Feng, C.; Yang, H.H.; Hu, D.; Zhao, Z.; Quek, T.Q.S.; Min, G. Mobility-aware cluster federated learning in hierarchical wireless networks. IEEE Trans. Wirel. Commun. 2022, 21, 8441–8458. [Google Scholar] [CrossRef]
  28. Wang, S.; Tuor, T.; Salonidis, T.; Leung, K.K.; Makaya, C.; He, T.; Chan, K. Adaptive federated learning in resource constrained edge computing systems. IEEE J. Sel. Areas Commun. 2019, 37, 1205–1221. [Google Scholar] [CrossRef]
  29. Cao, X.; Zhu, G.; Xu, J.; Wang, Z.; Cui, S. Optimized power control design for over-the-air federated edge learning. IEEE J. Sel. Areas Commun. 2022, 40, 342–358. [Google Scholar] [CrossRef]
  30. Lan, Q.; Kang, H.S.; Huang, K. Simultaneous signal-and-interference alignment for two-cell over-the-air computation. IEEE Wirel. Commun. Lett. 2020, 9, 1342–1345. [Google Scholar] [CrossRef]
  31. Grant, M.; Boyd, S. CVX: MATLAB Software for Disciplined Convex Programming. 2016. Available online: http://cvxr.com/cvx (accessed on 3 July 2025).
  32. Boyd, S.P.; Vandenberghe, L. Convex Optimization. 2004. Available online: https://web.stanford.edu/~boyd/cvxbook/ (accessed on 3 July 2025).
  33. Xu, J.; Yao, J. Exploiting physical-layer security for multiuser multicarrier computation offloading. IEEE Wirel. Commun. Lett. 2019, 8, 9–12. [Google Scholar] [CrossRef]
  34. Cao, X.; Zhu, G.; Xu, J.; Huang, K. Cooperative interference management for over-the-air computation networks. IEEE Trans. Wirel. Commun. 2021, 20, 2634–2651. [Google Scholar] [CrossRef]
  35. Jiao, L.; Ge, Y.; Zeng, K.; Hilburn, B. Location privacy and spectrum efficiency enhancement in spectrum sharing systems. IEEE Trans. Cogn. Commun. Netw. 2023, 9, 1472–1488. [Google Scholar] [CrossRef]
  36. Liu, D.; Simeone, O. Privacy for free: Wireless federated learning via uncoded transmission with adaptive power control. IEEE J. Sel. Areas Commun. 2021, 39, 170–185. [Google Scholar] [CrossRef]
Figure 1. Parallel multi-task Air-FL system.
Figure 2. Pareto boundary.
Figure 3. Learning performance on synthetic datasets versus communication rounds.
Figure 4. Achievable optimality gap region.
Figure 5. Convergence behavior of the proposed distributed scheme.
Figure 6. Learning performance on MNIST dataset versus communication rounds.
Table 1. Summary of related work.
| Reference | Focus | Contributions | Limitations |
|---|---|---|---|
| [19] | Efficient downlink and uplink model aggregation in multi-cell Air-FL. | Constructed the Pareto boundary to characterize performance trade-offs among multiple tasks. | Does not fully consider the long-term effect of cumulative aggregation errors on convergence. |
| [20] | Inter-cell interference in multi-cell networks via simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS)-assisted Air-FL. | Characterized Pareto-optimal gaps for inter-cell trade-offs and demonstrated mean squared error (MSE) reduction in the uplink/downlink via experiments. | Assumed low noise, neglected higher-order errors, and the experiments cover only two-cell networks. |
| [21] | Data heterogeneity in hierarchical FL. | Derived the convergence bound under inter-cluster interference and data heterogeneity. | AF with lower communication overhead was not considered. |
| [22] | Learning performance of two-tier Air-FL. | Derived the impact of aggregation errors on convergence performance. | The impact of inter-cluster interference was not considered. |
| [23] | Low communication efficiency and weak privacy protection in Air-FL spectrum sharing. | Proposed a compressed-sensing-based Air-FL framework for efficient, secure, noise-free/encryption-free aggregation. | Intra-group nodes require strict synchronization; pseudo-transmitters add redundancy. |
| [24] | Joint edge aggregation and association decision-making for Air-FL. | Proposed a theoretically guaranteed two-stage search algorithm, reconstructed the supermodular function, and extended a flexible bandwidth allocation scheme. | The algorithm complexity increases significantly with the network scale. |
Table 2. Simulation parameters.
| Parameter | Value |
|---|---|
| K | 10 |
| K_tot | 20 |
| Ω₀ | 60 dB |
| ζ | 3 |
| σ_m² | 10⁻⁷ W |
| P_k^max | 1 W |
| ι | 10⁻⁹ |
| j, m | 0.5 |
| η₁(n) | 2/(n + 8) |
| η₂(n) | 2/(n + 10) |
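For reference, the simulation parameters of Table 2 can be collected into a small configuration sketch. This is a hypothetical illustration, not the authors' code: the dictionary keys are invented names, and the negative exponents for σ_m² and ι are assumptions (the minus signs appear to have been lost in extraction).

```python
# Hypothetical sketch: Table 2's simulation parameters as a Python config.
# Assumptions: sigma_m^2 = 1e-7 W and iota = 1e-9 (negative exponents assumed).

SIM_PARAMS = {
    "K": 10,               # devices per cell
    "K_tot": 20,           # total number of devices
    "Omega0_dB": 60,       # path-loss parameter Omega_0 (dB), as listed
    "zeta": 3,             # path-loss exponent
    "sigma_m_sq_W": 1e-7,  # noise power sigma_m^2 (assumed 10^-7 W)
    "P_k_max_W": 1.0,      # per-device transmit power budget P_k^max
    "iota": 1e-9,          # tolerance iota (assumed 10^-9)
}

def eta1(n: int) -> float:
    """Step-size schedule eta_1(n) = 2 / (n + 8) from Table 2."""
    return 2.0 / (n + 8)

def eta2(n: int) -> float:
    """Step-size schedule eta_2(n) = 2 / (n + 10) from Table 2."""
    return 2.0 / (n + 10)
```

Both schedules are diminishing in the round index n, which is the standard condition for SGD-type convergence analyses such as [25,26].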

Share and Cite

MDPI and ACS Style

Tang, C.; He, D.; Yao, J. Distributed Interference-Aware Power Optimization for Multi-Task Over-the-Air Federated Learning. Telecom 2025, 6, 51. https://doi.org/10.3390/telecom6030051

