In this section, we focus on algorithms suitable for solving problem (3) in the context of over-the-air analog aggregation. First, in Section 3.1, we examine baseline algorithms that are compatible with distributed problems and analog gradient transmission (GT), and highlight the distinctions between model transmission (MT) and GT. Next, in Section 3.2, we delve into randomized methods that, to a first approximation, combine the advantage of cost-effective iterations from SGD with the rapid convergence of GD. Most of these methods fall into one of two classes: dual methods of the randomized coordinate descent type and primal methods of the stochastic gradient descent with variance reduction type. Our emphasis is on stochastic variance reduced gradient (SVRG), which we adapt for FL within the framework of analog aggregation in the presence of white Gaussian channel noise.
3.2. FSVRG-OACC Algorithm
An additional algorithm in the SGD category is SVRG [26]. The SVRG algorithm operates through two nested loops. The outer loop computes the full gradient of the entire function, $\nabla F(\tilde{\mathbf{w}})$, which is typically a computationally expensive operation to be avoided whenever possible. In the inner loop, the update step is iteratively computed as
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \left( \nabla f_{i_t}(\mathbf{w}_t) - \nabla f_{i_t}(\tilde{\mathbf{w}}) + \nabla F(\tilde{\mathbf{w}}) \right),$$
where $\nabla f_{i_t}(\mathbf{w}_t)$ represents the stochastic gradient computed at a randomly selected data sample, $\nabla F(\tilde{\mathbf{w}})$ denotes the full gradient computed over the entire dataset, and $\eta$ is the stepsize. This iteration is specific to a single device, and its fundamental idea is to use stochastic gradients to estimate the change in the gradient from the snapshot point $\tilde{\mathbf{w}}$ to $\mathbf{w}_t$, rather than estimating the gradient itself directly.
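As an illustration, the two nested loops can be sketched on a single machine as follows; this is a minimal sketch in our own notation (the callables `grad_i` and `full_grad` and the hyperparameter defaults are illustrative choices, not taken from [26]):

```python
import numpy as np

def svrg(grad_i, full_grad, w0, n_samples, eta=0.02, n_outer=50, n_inner=100, rng=None):
    """Minimal SVRG sketch. grad_i(w, i): per-sample gradient at index i;
    full_grad(w): full gradient over the entire dataset."""
    rng = np.random.default_rng(rng)
    w_tilde = np.asarray(w0, dtype=float)
    for _ in range(n_outer):
        mu = full_grad(w_tilde)            # expensive full gradient at the snapshot
        w = w_tilde.copy()
        for _ in range(n_inner):
            i = rng.integers(n_samples)    # uniformly sampled data index
            # variance-reduced direction: estimates the gradient change
            # from the snapshot w_tilde to the current iterate w
            w -= eta * (grad_i(w, i) - grad_i(w_tilde, i) + mu)
        w_tilde = w                        # snapshot for the next outer loop
    return w_tilde
```

On a smooth, strongly convex problem such as least squares, this recursion converges linearly while paying the full-gradient cost only once per outer loop.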
Indeed, this algorithm is naturally suited to centralized implementations, since it requires computing the gradient over the complete dataset. A notable contribution was made in [27], where the authors introduced FSVRG, which is particularly applicable in the context of distributed optimization. They demonstrated that existing SVRG algorithms are not suitable for distributed approaches and proposed the FSVRG algorithm, specifically designed for sparse distributed convex problems. The pseudocode for FSVRG is provided in Algorithm 1. This algorithm has been implemented and evaluated, and the results are presented in the experimental section. The findings indicate that it does not converge satisfactorily for FL with over-the-air analog aggregation, specifically in the absence of transmit power control.
Let us now elucidate the motivation for considering a different algorithm suited to FL in the context of over-the-air analog aggregation. A crucial aspect that demands attention is the significant variation in the number of available data points across devices, which may differ greatly from the average number of data points available to any single device. This resembles the issue addressed by FSVRG, but it should be noted that, under our assumptions, analog communication is the sole communication type between the local devices and the PS. As a result, the PS lacks information about the number of data points and the type of data distribution.
Algorithm 1: Federated SVRG
Additionally, in this scenario the local data are frequently clustered around a specific pattern, which renders them unrepresentative of the overall distribution we aim to learn. Consequently, aggregating the entire gradient direction in each iteration is a promising approach within the concept of analog aggregation.
From a practical perspective, FSVRG-OACC postulates that all devices possess a randomly allocated initialization value for the parameter vector, $\mathbf{w}_0$. This assumption is of significant importance for the practical execution of the algorithm. The proposed algorithm involves two communication rounds, which increases the communication cost but benefits the convergence of the algorithm. Algorithm 2 introduces FSVRG-OACC, a modified FSVRG variant tailored for over-the-air analog aggregation.
During the initial communication round (distributed loop 1), each device $k$ computes the complete gradient of its local function and subsequently determines the internal gradient as $g_k = \nabla F_k(\mathbf{w}) - \nabla f_{i_k}(\mathbf{w})$, where $i_k$ is sampled uniformly from the local dataset $\mathcal{D}_k$. The stochastic gradients are derived as in SGD, rendering their computation relatively inexpensive. The computed internal gradient is then transmitted over the air to the PS, while the estimated aggregated internal gradient is sent back to the devices via the analog medium, as shown in Figure 2.
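The over-the-air step itself can be sketched as a simple simulation. This is a minimal sketch under assumptions of ours: perfect synchronization, unit channel gains, and additive white Gaussian noise at the PS; the function name and the `noise_std` parameter are illustrative, not from the paper.

```python
import numpy as np

def ota_aggregate(local_grads, noise_std=0.01, rng=None):
    """Simulate analog over-the-air aggregation: all devices transmit
    simultaneously, so the PS receives the superposed (summed) signal
    plus additive white Gaussian channel noise."""
    rng = np.random.default_rng(rng)
    superposed = np.sum(local_grads, axis=0)           # waveform superposition
    noise = rng.normal(0.0, noise_std, superposed.shape)
    return superposed + noise                          # PS's noisy estimate
```

Note that the sum is formed by the channel itself; the PS never observes the individual terms, which is why it cannot learn per-device dataset sizes from this round.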
The estimated aggregated internal gradient, denoted as $\bar{g}$, compels all devices in the second round of communication (distributed loop 2) to move in the same direction. In this communication round, the updated gradient is denoted as $v_k$, built upon the internal gradient $\nabla F_k(\mathbf{w}) - \nabla f_{i_k}(\mathbf{w})$ estimated in the first round for device $k$. Subsequently, the PS aggregates this gradient over the air from all devices and transmits it back to them for the remaining iterations. Each device $k$ uploads the following gradient over the air for aggregation:
$$v_k = \nabla f_{i_k}(\mathbf{w}) + \bar{g}.$$
Algorithm 2: FSVRG with Over-the-Air Communication and Computation (FSVRG-OACC)
By aggregating the gradient $v_k$ at the PS and transmitting the result back to each device, a uniform descent direction is achieved across all devices.
This approach is motivated by the primary goal of keeping the gradient steps consistent among all clients by utilizing the over-the-air aggregated gradient in stochastic first-order methods. The algorithm’s computational complexity is therefore exceptionally low, since only simple first-order gradient calculations are required at each step; the main cost is the number of communication rounds.
In summary, the gradient update in FSVRG-OACC is divided into two parts. In the first distributed loop, the two gradients $\nabla F_k(\mathbf{w})$ and $\nabla f_{i_k}(\mathbf{w})$ are calculated, and the difference between them is aggregated across all edge devices. In the second distributed loop, every device has access to the aggregated value $\bar{g}$. In the first iteration of this loop ($t = 1$), each edge device computes a stochastic gradient and combines it with the aggregated gradient $\bar{g}$ obtained from the first distributed loop. This gives $v_k$, the overall gradient update for device $k$, which is then aggregated with all the other devices’ gradients over the air. From the second iteration ($t \geq 2$) onward, each device utilizes the whole aggregated gradient directly to update its model parameters.
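To make the two distributed loops concrete, they can be sketched end to end as follows. This is a hedged sketch under several assumptions of ours: unit channel gains, equal weighting of devices, an illustrative noise level, and a reading in which the remaining iterations reuse the aggregated direction (Algorithm 2 remains the authoritative description); `full_grad_k` and `stoch_grad_k` are hypothetical callables, not names from the paper.

```python
import numpy as np

def fsvrg_oacc(full_grad_k, stoch_grad_k, w0, K, eta=0.05, T=50,
               noise_std=0.0, rng=None):
    """Sketch of the two distributed loops of FSVRG-OACC.
    full_grad_k(k, w)  -> device k's full local gradient (hypothetical callable).
    stoch_grad_k(k, w) -> device k's stochastic gradient at one uniform local sample.
    """
    rng = np.random.default_rng(rng)
    w = np.asarray(w0, dtype=float)

    def ota(signals):
        # Analog over-the-air sum: superposition plus white Gaussian noise.
        s = np.sum(signals, axis=0)
        return s + rng.normal(0.0, noise_std, s.shape)

    # Distributed loop 1: each device forms its internal gradient
    # (full local gradient minus one local stochastic gradient);
    # the PS aggregates these over the air and broadcasts the average.
    g_bar = ota([full_grad_k(k, w) - stoch_grad_k(k, w) for k in range(K)]) / K

    # Distributed loop 2, first iteration (t = 1): each device uploads a fresh
    # local stochastic gradient combined with the aggregated internal gradient.
    v = ota([stoch_grad_k(k, w) + g_bar for k in range(K)]) / K

    # Remaining iterations (t >= 2): all devices descend along the shared
    # aggregated direction, keeping their gradient steps consistent.
    for _ in range(T):
        w = w - eta * v
    return w
```

A useful sanity check on this sketch: with zero channel noise, the stochastic terms sampled at the snapshot cancel in the aggregate, so the shared direction $v$ reduces exactly to the average of the devices' full local gradients.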