Low-Latency Collaborative Predictive Maintenance: Over-the-Air Federated Learning in Noisy Industrial Environments

The emergence of Industry 4.0 has revolutionized the industrial sector, enabling the development of compact, precise, and interconnected assets. This transformation has not only generated vast amounts of data but also facilitated the migration of learning and optimization processes to edge devices. Consequently, modern industries can effectively leverage this paradigm through distributed learning to define product quality and implement predictive maintenance (PM) strategies. While computing speeds continue to advance rapidly, the latency in communication has emerged as a bottleneck for fast edge learning, particularly in time-sensitive applications such as PM. To address this issue, we explore Federated Learning (FL), a privacy-preserving framework. FL entails updating a global AI model on a parameter server (PS) through aggregation of locally trained models from edge devices. We propose an innovative approach: analog aggregation over-the-air of updates transmitted concurrently over wireless channels. This leverages the waveform-superposition property in multi-access channels, significantly reducing communication latency compared to conventional methods. However, it is vulnerable to performance degradation due to channel properties like noise and fading. In this study, we introduce a method to mitigate the impact of channel noise in FL over-the-air communication and computation (FLOACC). We integrate a novel tracking-based stochastic approximation scheme into a standard federated stochastic variance reduced gradient (FSVRG). This effectively averages out channel noise’s influence, ensuring robust FLOACC performance without increasing transmission power gain. Numerical results confirm our approach’s superior communication efficiency and scalability in various FL scenarios, especially when dealing with noisy channels. Simulation experiments also highlight significant enhancements in prediction accuracy and loss function reduction for analog aggregation in over-the-air FL scenarios.


Introduction
The pervasive connectivity of numerous devices and sensors, fueled by recent advancements in communication networks and Internet of Things (IoT) applications, has resulted in the generation of immense volumes of data.The copious volume of data at hand serves as the training datasets for machine learning (ML) algorithms that find utility across various applications in industry, such as process optimization, defining product quality, and PM, critical components within the purview of Industry 4.0 [1].Conventionally, the training process for these ML algorithms has followed a centralized approach, wherein multiple devices transmit their raw and occasionally sensitive data to a PS possessing robust computational resources dedicated to training tasks.Nevertheless, the data required for training these ML algorithms are produced by numerous assets and devices in the industry, which often necessitate privacy-preserving measures to safeguard data integrity.Furthermore, the presence of bandwidth constraints poses a bottleneck when transmitting a massive amount of data from the devices to the PS [2].For this reason, there has been considerable focus on federated and distributed learning algorithms, primarily due to their ability to train ML models in a fully decentralized manner.
FL algorithms have been proposed as an alternative scheme for privacy-preserving distributed ML, where each device participates in the training process using exclusively locally available data facilitated by a PS [3].In the context of FL, devices communicate with the PS by exchanging model parameters and their respective local updates, while ensuring that the raw data remain localized on the devices.This approach offers not only privacy advantages but also proves to be a compelling solution for wireless edge devices, particularly when dealing with substantial dataset sizes.In the domain of PM applications, the volume of data obtained from online sensors is significant.Consequently, it becomes imperative to account for this substantial data volume when implementing federated learning techniques for PM.Additionally, time sensitivity and temporal awareness are essential attributes for the effective execution of PM activities.Therefore, time latency should be considered for PM applications.
The migration learning from centralized clouds to the edge enables edge servers to quickly obtain real-time data generated by edge devices, facilitating rapid AI model training.As a result, distributing these models from servers to devices in close proximity enables the devices to respond effectively to real-time events, making them well-suited for PM applications.Despite the rapid advancement of computing speeds, wireless transmission of large amounts of data by any device faces limitations due to limited radio resources and the adverse conditions of wireless channels.This creates a communication bottleneck that hinders fast-edge learning [4,5].Hence, communication efficiency has been at the forefront of FL, and the paramount objective is to achieve high model accuracy while minimizing the number of communication rounds of resource usage.In the realm of cutting-edge research, the fusion of digital twins and multi-access edge computing has gained significant attention [6].This innovative approach represents a crucial technology in the context of 6G networks, primarily serving as a fundamental enabler for the Industrial IoT.The primary objective of this research is to minimize the total task completion delay for IoT devices by optimizing various parameters.
Communication schemes for FL can be categorized as digital or analog.Digital communication, though burdensome for wireless networks, assigns communication resources to each client's ML model parameters.Analog communication reduces overhead by allowing shared resources for transmitting FL models.Early research aimed to reduce communication rounds or payload size.However, most FL literature assumes a perfect communication channel shifting focus to ML design.Recent research addresses this gap by emphasizing system design, especially for wireless FL [7][8][9].Although early studies have delved deeply into optimizing communication design for FL, the crucial and practical matter of a noisy channel with over-the-air communication and computation (OACC) has not been thoroughly investigated.Incorporating the impact of noisy channels in OACC complicates convergence analysis due to noise propagation during each communication round.Moreover, the collective impact of these noisy communications in OACC on the final learning performance necessitates a comprehensive design and analysis approach.
This work addresses the impact of communication-induced noise during FL training on the convergence and accuracy performance of the ML model, and then proposes robust algorithms within the OACC framework that mitigate these effects and optimize client resources concurrently.Our focus lies in analog communication for model updates [10][11][12] and the exploration of new distributed algorithms capable of withstanding high levels of channel noise and low signal-to-noise ratio (SNR) in industrial environments, in accordance with PM applications.

Related Work
In recent years, substantial endeavors have been undertaken to implement the FL framework over wireless networks, with its origins traced back to McMahan's seminal works [13,14].A comprehensive survey of FL has been conducted in [15], offering an overview of this paradigm where statistical models are trained on distributed networks at the edge.These studies explore the distinctive challenges and characteristics of FL compared to traditional approaches while identifying open problems that necessitate interdisciplinary research efforts.However, these works overlook the practical implications of channel effects and assume a seamless integration of FL algorithms into wireless networks.Recently, there has been significant attention given to exploring methods for mitigating the impact of these effects on FL algorithms [16][17][18].
This study primarily focuses on analog aggregation schemes for over-the-air transmission, which are driven by the inherent superposition property of signals in the wireless multiple-access channel.Analog over-the-air aggregation is a highly promising technique extensively used in Federated Averaging (FedAvg) due to the fact that the ps or clients require only the sum of their local gradients or model parameters.These schemes have been explored in various studies, including [2,19], and other relevant works.
In order to provide context and underscore the contributions of this article, we will now discuss previous studies that have investigated FL in the presence of imperfect or noisy communication.The main objective of [20] was to tackle the issue of noise in wireless communications for federated learning.The paper addressed this challenge by formulating the problem using an expectation-based model and a worst-case model.They introduced a sampling-based successive convex approximation algorithm for solving the problem.This approach successfully handles noise by incorporating it as a regularizer in the loss function during the training process.Simulation results showcased improved prediction accuracy and reduced loss function values, affirming the effectiveness of the proposed methods in mitigating the impact of noise.In [21], Amiri et al. examined the impact of a noisy channel on FedAvg.They determined that when dealing with noisy downlink transmission, the presence of noise cannot be neutralized through step-size design, resulting in an inability to ensure precise convergence.Consequently, addressing this fundamental problem has necessitated imposing strict demands on the estimation noise associated with the aggregated model weights.
In [22], researchers conducted an investigation into the influence of a noisy channel in uplink and downlink analog transmission on the training process.The findings of this analysis led the authors to conclude that for the FedAvg algorithm to converge, it is necessary for the noise variance to decrease at a rate of O 1/k 2 , where k represents the communication round.Consequently, in order to effectively mitigate the impact of channel noise on the training procedure, the authors recommend two approaches.Firstly, it is advised to employ an increased transmission power gain of O(k) in both uplink and downlink transmissions.Alternatively, if the power gain remains fixed, extending the transmission time to O(k) is suggested.The implementation of either of these strategies is crucial for preserving the integrity of the training process when faced with channel noise.
Another critical issue is analog transmission, which is widely used in over-the-air aggregation within the wireless channel.It is extensively utilized and plays a crucial role in enhancing spectral efficiency and reducing multi-access latency.Gau et al. [23] developed a hardware transceiver and application software to train a real-world FL task using over-the-air analog aggregation.They focused on developing an over-the-air aggregation solution for wireless FL based on orthogonal frequency-division multiplexing (OFDM).The main challenge they faced was achieving perfect waveform superposition, which was complicated by the presence of frame timing offset and carrier frequency offset.To tackle these challenges, they proposed a two-stage waveform pre-equalization technique.
The focus of this study pertains to the exploration of algorithms aimed at resolving optimization problems through over-the-air analog aggregation.While numerous papers have discussed communication-efficient solutions for distributed learning problems, it is of paramount importance to thoroughly investigate the optimization of analog FL prob-lems over noisy communication channels.This particular problem presents significant importance and complexity, necessitating meticulous attention and consideration.

Our Contributions
The major contributions of this work are summarized as follows.

•
We propose a hierarchical approach to PM, building upon our prior work [1].The key distinction lies in the utilization of OACC for the FL algorithm at the factory level.This choice is motivated by the benefits of low latency, making it suitable for PM applications while also improving spectral efficiency.At higher levels, such as fog and cloud servers, occasional requests are made to the factory level for the aggregate model, enabling averaging over multiple factory parameters, FedAvg.Our primary emphasis is on the factory level, specifically investigating FLOACC as the focal point of our research.

•
We propose FSVRG-OACC as a distributed approach to solve the optimization problem for PM at the factory level based on OACC.FSVRG-OACC leverages analog over-the-air aggregation, which enables it to effectively handle highly noisy communication channels and allows for improved convergence in minimizing the cost function associated with the ML algorithm.• FSVRG-OACC facilitates the transmission of local gradient updates by individual agents, capitalizing on the advantages of computation over the air.This algorithm effectively mitigates the impact of channel perturbations on convergence by incorporating the effects of the communication channel into the algorithm update process.The utilization of FSVRG-OACC ensures that convergence is not compromised, enabling efficient and robust optimization in the presence of varying channel conditions.

•
The simulation results demonstrate the substantial reduction in convergence sensitivity to noise achieved by our proposed algorithm.This finding holds significant implications for the implementation of ML algorithms on analog over-the-air aggregation in highly noisy industrial environments.
The remainder of this paper is structured as follows.In Section 2, we introduce the system model, starting with the description of FL over-the-air analog aggregation and extending it to FLOACC.Section 3 presents our proposed algorithm, FSVRG-OACC, along with a comparison to other stochastic gradient descent algorithms.Following that, in Section 4, we evaluate the performance of FSVRG-OACC through a PM application and present the corresponding experimental results.Finally, in Section 5, we provide concluding remarks and discuss future research directions.

Federated Edge Learning System
We consider a distributed learning system specifically designed for a PM application at the factory level.This system comprises a single parameter server and K edge nodes, as depicted in Figure 1.Collaboratively, the edge nodes at each factory train a shared learning process involving the global model w.Each machine collects a fraction of labeled training data via the interaction with a local dataset, denoted as D 1 , D 2 , . . ., D K .The local loss function of the model vector w on D k is given by where φ k is the number of data points stored on data partition D k and f (w, x j , y j ) is the loss function quantifying the prediction error of the model w for each data sample j, which consists of the training sample x j and its ground true label y j .For convenience, we write f (w, x j , y j ) as f j (w) and assume uniform sizes for local datasets φ k for all k.Table 1 presents the typical loss function used in FL for various applications.In the specific case of anomaly detection over edge nodes at the factory, the squared-SVM loss function has been employed in this study as a leading ML method that offers flexibility in modeling complex nonlinear boundaries between normal and abnormal data points.With these definitions, the global loss function on all the distributed datasets can be defined as where |.| denotes the size of the datasets and each dataset satisfies D i ∩ D j = ∅ when i = j.
The training target is to minimize the global loss function F(w) according to the distributed process to find w * .
In this section, we begin by presenting the problem formulation of FL in the context of a hierarchical PM scenario.Subsequently, we delve into the specific case of FL-OACC, where all agents at the factory levels broadcast their updated models over the air, leveraging the advantages of computation through analog communication.Table 1.Several examples of loss functions.

Model Loss Function
Linear regression 1 − y j w T x j To compute F(w), one method involves directly uploading all local data, which raises privacy concerns.To address this issue, the FL framework is employed to solve the problem outlined in Equation (3) through a distributed approach.There are two approaches based on FedAvg for solving this distributed optimization problem: 1.
Model averaging: In this approach, each agent minimizes its local loss function and transmits the model parameters to the PS for aggregation.In the second round of iteration, the agent receives the updated model from the PS.

2.
Gradient averaging: In this approach, each agent calculates the gradient of its loss function and transmits the gradient to the PS for aggregation.In the second round of iteration, the agent can update its model based on the gradient averaging received from the PS.
The agents generally employ a gradient descent (GD) algorithm or stochastic gradient descent (SGD) algorithm to minimize the loss function described in Equation (1).A singlestep SGD for updating the model parameter for device K can be defined following the model and gradient averaging, respectively.
where α is the step size and ∇ is the gradient operator.Hence, the only operation performed by the PS is to compute the average of the model parameters or the gradients received from the agents.
The learning process entails iterating between Equations ( 4) and ( 6) or Equations ( 5) and (7) until the model converges.The averaging process on the PS serves as a motivation for the low-latency FL scheme, utilizing FLOACC.Further details about FLOACC are provided below.

Fl Over-the-Air Communication and Computation
In this section, we discuss the details of Federated Learning over-the-Air Communication and Computation, abbreviated as FLOACC.This method provides an efficient multi-access scheme in a low-latency scenario, which is crucial for applications like PM that require very fast and real-time task decision making.The idea of the over-the-air computation model for FL has been examined in several previous works, such as [8,16].Their approach was inspired by the PS's lack of interest in individual model weight vectors.Instead, the server solely needs the average of the model weights, which is automatically provided by the wireless multiple-access channel in the form of their sum.
As depicted in Figure 2, the FLOACC enables simultaneous transmission of result vectors from all devices and assets at the factory level to the PS in an analog manner.Let T be the local gradient vector of the loss function from the k-th device.In FLOACC, it is assumed that the local gradient of each model is transferred over an analog medium.In this scenario, the transmitted symbols are denoted by ∇F k,i and are normalized to have zero mean and unit variance E ∇F k,i ∇F * k,i = 1.By employing OFDM, it becomes feasible to allocate each element of the gradient vector to a distinct sub-carrier OFDM channel.This approach enables a significant reduction in the learning process latency for FLOACC.
During each round n, all the local devices at a factory simultaneously transmit their local gradient based on the distributed loss function as is the small-scale fading of the channel between the k-th device and the PS, and p k [n] is the allocated transmission power for each device k.In particular, the aggregated gradient in the n-th communication round, denoted by y[n], is expressed as follows.

∇F(w[n
where Z[n] ∼ CN (0, σ 2 z I) is additive white Gaussian noise.Note that in over-the-air analog aggregation, it is possible to transfer the model parameters as well.However, in FLOACC, we only consider the local gradient during aggregation.This is because the aggregated gradient is less sensitive in the optimization algorithms compared to the model parameters.We will discuss this issue further in the next section.We assumed that, similar to the LTE system, there are common downlink reference symbols in the radio resource block for the devices at the factory level to estimate the channel fading coefficient.Then, the local devices can transmit the signal with the specific power they calculated.The received signal at the PS has been depicted with gain blocks in Figure 3, so the PS can estimate the aggregated gradient as follows: where η is a receiver scaling factor.Owing to the imposed physical constraints, the transmission of each device is subject to a long-term transmission power constraint.
Figure 3. Illustration of the system design for FLOACC using analog transmission over-the-air aggregation.

Effective Noise and Definition of SNR in FLOACC
In order to keep the problem general, we adopt the effective noise model in analog aggregation.We assume accurate channel state information at the transmitters.To meet the aggregation requirement of FL, the local devices implement the channel inversion rule, which yields the instantaneous transmit power of user k during the communication round n for gradient aggregation.
where H is Hermitian transpose.Based on this definition, the received signal can be formulated as follows.
η represents a scalar that signifies the average transmission power, based on which the received SNR of the global gradient can be expressed as follows.
It is evident that the received SNR is consistent across all users due to devices with weaker channels compensating by transmitting at higher powers.In various studies, devices with significantly weak channels are often excluded from training due to their inability to pre-equalize their channels.Many research has been undertaken to optimize η to enhance model convergence over the wireless communication network.In [24,25], the optimal selection of p k and η is determined by solving the corresponding optimization problem. min Extensive research has been conducted to optimize transmit power for the purpose of achieving model convergence and mitigating the effects of existing noise in the wireless transmission medium.However, to the best of our knowledge, no existing studies have investigated the comparison of convergence performance among different algorithms without any transmission power control.In the upcoming section, we delve into various gradient descent algorithms for ML optimization problems, considering the over-the-air scheme, and provide an analysis of which algorithm demonstrates superior convergence properties.

Proposed Adaptive FSVRG-OACC Algorithm
In this section, our specific focus lies on algorithms suitable for solving problem (3) within the context of over-the-air and analog aggregation.Firstly, in Section 3.1, we examine baseline algorithms that are compatible with distributed problems and analog gradient transmission (GT), while also highlighting the distinctions between model transmission (MT) and GT.Next, in Section 3.2, we delve into randomized methods that, in an initial approximation, integrate the advantages of cost-effective iterations from SGD with the rapid convergence of GD.Many of these methods can be categorized into one of two classes: dual methods of the randomized coordinate descent type and primal methods of the stochastic gradient descent with variance reduction type.Our emphasis lies on stochastic variance reduced gradient (SVRG), and we optimize this method for FL within the framework of analog aggregation while considering the presence of white Gaussian channel noise.

Baseline Algorithms
A fundamental approach for solving a distributed optimization problem with the structure (4) involves employing the GD algorithm, particularly when the functions f possess smoothness and convexity characteristics.In a distributed setting, there are two possible approaches: the GT method, where gradients are transmitted over the air and an aggregated gradient signal is received, which is subject to noise for model updating using GD, or the MT method, where the model is updated through GD iterations and then transmitted for model aggregation.In the presence of noise, the convergence deteriorates due to the elevated sensitivity of the loss function to the model parameters.Therefore, in over-the-air distributed algorithms, the GT approach is preferred.This preference stems from the lower sensitivity of the cost function to the aggregated gradient compared to the individual model parameters.GD demonstrates a fast convergence rate.However, each iteration has the potential to be computationally intensive on each local device.In contrast, SGD selects a random data label and performs the update, which offers a more efficient alternative.

Fsvrg-Oacc Algorithm
An additional algorithm within the SGD category is SVRG [26].The SVRG algorithm operates through two nested loops.The outer loop involves calculating the full gradient of the entire function, ∇F(w t [n]), which is typically a computationally expensive operation to be avoided whenever possible.In the inner loop, the update step is iteratively computed as follows.
where ∇F i (w[n]) represents the stochastic gradient computed based on a randomly selected data label, ∇F i (w t [n]) denotes the stochastic gradient computed over the entire dataset, and α is stepsize.This iteration is specific to a single device, and its fundamental concept lies in utilizing stochastic gradients to estimate the gradient change from point w t to w, rather than directly estimating the gradient itself.Indeed, this algorithm is naturally suited for centralized implementations since it necessitates computing the stochastic gradient over the complete dataset, thereby making it well-suited for centralized scenarios.But a notable contribution was made in [27], where they introduced FSVRG, which is particularly applicable in the context of distributed optimization.They demonstrated that existing SVRG algorithms are not suitable for distributed approaches and proposed the FSVRG algorithm, specifically designed for sparse distributed convex problems.The pseudocode for FSVRG is provided in Algorithm 1.This algorithm has been implemented and subjected to evaluation, and the results will be presented in the experimental section.The findings indicate that this algorithm does not demonstrate satisfactory convergence in the realm of FL for over-the-air analog aggregation, specifically in relation to the absence of power transfer control.
Let us now elucidate the motivation behind considering a different algorithm suitable for FL in the context of over-the-air analog aggregation.A crucial aspect that demands attention is the significant variation in the number of available data points among different devices, which may differ greatly from the average number of data points available to any single device.It looks like a similar issue to FSVR, but it should be noted that in our assumptions, analog communication is the sole communication type between local devices and the PS.As a result, the PS lacks information regarding the number of data points and the type of data distribution.
Additionally, this scenario frequently entails the local data being clustered around a specific pattern, which renders it unrepresentative of the overall distribution we aim to learn.Consequently, considering an aggregation on the entire gradient direction in each iteration could be a promising approach that could be undertaken in the concept of analog aggregation.

Algorithm 1: Federated SVRG
Parameters: φ = number of data points, φ k = number of data points store on device k, α = stepsize, data partition {D k } K k=1 .
From a practical perspective, in FSVRG-OACC, it is postulated that all devices possess a randomly allocated initialization value for the parameter vector, w k .This assumption holds significant importance in the practical execution of the algorithm.The proposed algorithm involves two communication rounds, which results in increased communication costs.However, it offers advantages in terms of the convergence algorithm.Algorithm 2 introduces the FSVRG-OACC, a modified FSVRG variant tailored for over-theair analog aggregation.
During the initial communication round (distributed loop 1), all devices compute the complete gradient of the entire function and subsequently determine the internal gradient as , where i t is sampled uniformly from the local dataset D k .These gradients are derived using SGD, rendering their computation relatively inexpensive.The computed internal gradient is then transmitted over the air to the PS, while the estimated aggregated internal gradient is sent back to the devices via the analog medium, as shown in Figure 2.
The estimated aggregated internal gradient, denoted as ĝ, compels all devices in the second round of communication (Distributed loop 2) to move in the same direction.In this communication round, the updated gradient is denoted as Ĝ, which was estimated in the first round as Ĝ = ∇F i t (w k [n]) − ĝ for device k.Subsequently, the PS aggregates this gradient over the air from all devices and transmits it back to them for the remaining iterations.Each device k uploads the following gradient over the air for aggregation.
By aggregating the gradient G k at the PS and transmitting it back to each device, a uniform descent direction can be achieved across all devices.
This approach is motivated by the primary goal of maintaining gradient step consistency among all clients by utilizing the aggregated gradient over-the-air signal in stochastic first-order methods.Therefore, the algorithm's complexity is exceptionally low due to the simplicity of first-order gradient calculations at each step.The only cost involved is the number of communication rounds.
In summary, the gradient update ) is divided into two parts for FSVRG-OACC.In the first distributed loop, the last two gradients ∇F i t k (w t [n]) and ∇F k (w t [n]) are calculated, and afterward, the distance between these two gradients is aggregated across all edge devices.In the second distributed loop, we can access the aggregated value of ) for all devices.In the first iteration of this loop (n = 1), we calculate the stochastic gradient for each edge device and subtract it from the aggregated gradient obtained from the first distributed loop )) .This gives us G k , the overall gradient update for device k, which is then aggregated with all other gradients over the air.
Starting from the second iteration (n = 2) and beyond, we can utilize the whole aggregated gradient Ĝ to directly update the model parameters of each device.

Performance Evaluation of FSVRG-OACC
In this section, we evaluate the performance of the proposed algorithm on the task of anomaly detection for a PM application in an FL manner with over-the-air analog aggregation.It aims to investigate the convergence characteristics of four optimization algorithms, GD, SGD, FSVRG, and FSVRG-OACC, in the context of over-the-air analog aggregation.The primary focus is to analyze the performance of these algorithms in terms of both model accuracy and convergence time, which are crucial performance metrics in the field.Similar to prior research in the domain of Federated Machine Learning Algorithm for Collaborative Predictive Maintenance [1], we employ the widely recognized and benchmarked CMAPSS [28] dataset in this study.
The C-MAPSS model, developed by NASA and implemented in MATLAB/Simulink, represents a nonlinear dynamic model of a commercial turbofan engine.By manipulating the input parameters of this simulation model, it becomes possible to simulate diverse degradation patterns under various engine conditions.In order to generate datasets reflecting different fault modes, four distinct time series (FD001, FD002, FD003, FD004) were produced using these simulation tools.These datasets comprise multivariate time series, each of which is further divided into training and testing subsets.For the purpose of this study, we selected and analyzed two datasets: FD003 and FD004.The FD003 dataset comprises 100 test trajectories and 100 train trajectories, focusing on a single fault mode and two types of degradation: high-pressure compressor degradation and fan degradation.On the other hand, the FD004 dataset includes 248 test trajectories and 249 train trajectories, covering six fault modes and two types of degradation.The time series data in each dataset include 21 sensor observations, three operating settings, a trajectory ID, and a cycle count.The Remaining Useful Life (RUL) of an engine is estimated based on the number of operation cycles remaining before the engine fails.

Algorithm Implementation
In this study, a federated SVM model was employed to predict the RUL using the provided time series data.Specifically, FD003 and FD004 datasets were distributed among ten devices located at the factory site, with the purpose of participating in a collaborative PM scenario.It is crucial to emphasize that the data distribution was non-independent and not identically distributed (non-IID).This implies that the instances of failures were not randomly distributed among the edge devices, and the data distribution was not evenly spread across all edge devices.Instead, specific edge devices were linked to particular types of failures, while others experienced different types.We intentionally opted for this non-IID distribution as a worst-case scenario to thoroughly assess the effectiveness of our proposed method.This collaboration aimed to achieve low latency through over-the-air analog aggregation.During all the simulations, the optimizer parameters were configured to achieve optimal performance.In particular, the learning is kept fixed at 0.01, and the 2 -regularization parameter λ is set to λ = 0.1.The momentum factor is kept equal to 0.9 and the number of local epochs is set to 1.The definition of the loss function for the federated SVM is as follows.
The input data, denoted as x j , are structured in a 2D format resembling an image, where one dimension corresponds to the sequence length, and the other dimension represents the number of sensor measurements.The output variable, y j , indicates the condition of the engine.Specifically, if the input data are associated with a particular condition whose RUL is below a specified threshold, the output value is assigned as {−1}, indicating the detection of an anomaly.Conversely, if the RUL exceeds the threshold, the output value is assigned as {1}.In all the conducted experiments, the reported convergence and accuracy averaged over five independent runs were calculated over 500 communication rounds.

General Performance
We first conduct a comparison between FSVRG-OACC and standard FSRVG, SGD, and GD methods to illustrate the notable enhancements in convergence rate and accuracy achieved in the context of over-the-air analog aggregation with a noisy environment.Throughout the experiments, we assessed the performance of each method and observed the significant gains achieved by FSVRG-OACC.We assume the environment noise variance to be σ 2 z = 1 for Z k ∼ CN 0, σ 2 z I , and the probability of noise presence as p = 1.In this study, our primary focus is not on the transmission policy.Therefore, we make the assumption that the channel inversion policy is employed for estimation.Additionally, we assume that each component of Z[n] has a zero mean and a variance of σ 2 z .The performance is demonstrated on the FD003 and FD004 datasets.In this experiment, for all of K = 10 devices at the factory site, this implies that the transmitting power is set to P = 300 mW and the receiving scale factor, √ η, is set to 0.1.The accuracy formula used is as follows: where µ P represents the number of true positives, Γ P represents false positives, µ N represents true negatives, and Γ N represents false negatives.The results have been plotted in Figure 4 for FD003 and Figure 5 for FD004.We observed the following results. 1.
In the case of FD003, it is evident that both the GD-MT and FSVRG algorithms fail to converge under the given noise environment and transmission power settings.Consequently, these algorithms are excluded from the accuracy analysis.On the other hand, the GD-GT and SGD-GT algorithms demonstrate convergence, but they exhibit significant fluctuations during the convergence phase.In contrast, our proposed FSVRG-OACC method shows excellent convergence performance under the same environmental conditions and noise levels.
As depicted in the accuracy plot for FD003, both SGD-DT and GD-GT algorithms experience a considerable drop in accuracy.However, our proposed algorithm achieves an average accuracy of 91% and demonstrates higher stability compared to the other algorithms.2.
In the case of FD004, we observed similar results, although this dataset presents greater challenges due to its inclusion of six fault modes, making the prediction algorithm significantly more complex compared to other CMAPSS datasets.Despite these difficulties, our proposed FSVRG-OACC algorithm demonstrates robust convergence, with only minimal fluctuations occurring after the convergence stage.These fluctuations can be attributed to the estimation anomalies present in the most challenging instances of the dataset.
In the accuracy plot, our proposed method achieved a commendable accuracy of 61% on the model.In contrast, the average accuracy of the other two converged methods is lower than that of the proposed method, and their accuracy values exhibited less fluctuation during the communication rounds.
In conclusion, we provide the average local training runtime and accuracy of FSVRG-OACC, SGD-GT, and GD-GT at the client level, as summarized in Table 2.It is noteworthy to mention that the computational runtime is primarily governed by the computation of gradients in all 500 iterations.Additionally, we assume that the communication round is negligible in comparison to the gradient computation.As is evident, the runtime in FSVRG-OACC exceeds that of the other algorithms.However, this method, employed in FLOACC, demonstrates substantial reduction in communication latency while achieving commendable convergence and consistent accuracy.We compared our results with a scenario where there is no communication noise.In a previous study, we looked at how well the FL algorithm works for PM applications when there is no communication interference.We summarized those results in Table 3.As evident from the results, there is a significant disparity in accuracy between GD-GT and SGD-GT when applied to FLOACC and noiseless communication settings.This demonstrates that employing a power transmission policy can enhance the accuracy of both GD and SGD in the context of FLOACC, bringing them closer to achieving results similar to those in noiseless communication scenarios.It is worth noting that, for the FSVRG-OACC method and FSVRG applied to noiseless communication, the difference in accuracy remains minimal.This observation is particularly noteworthy in the case of FD003, where the RUL prediction task is less challenging than FD004.These findings underscore the robustness and effectiveness of our proposed approach, which achieves these results without the need for any power transmission policies.In this section, we conduct an analysis to assess the robustness of the proposed algorithms for Over-The-Air analog aggregation in the presence of varying levels of noise.
The performance evaluation is conducted using a part of the FD003 dataset while systematically varying the noise level σ z .Specifically, we consider noise levels σ z from the set {0.25, 0.75, 1.5, 2.5} to assess the algorithm's performance under different noise conditions.For the total of 10 local devices, the receiver scaling factor, denoted as √ η, is varied across different values {0.4,0.13, 0.06, 0.04}.To investigate the impact of noise, we manipulate the probability of noise presence, denoted as p, throughout the experiments.The set {0.1, 0.25, 0.5, 1} is utilized to vary the probability of noise.The results for GD-GT and SGD-GT have been plotted in Figures 6 and 7  For both GD-GT and SGD-GT, as the noise level increases, there is a significant degradation in performance, accompanied by amplified fluctuations in the loss function.Additionally, we observe a parallel trend in the cost function when the probability of noise presence is elevated.Furthermore, we conducted the same analysis on the FSVRG-OACC method, and the corresponding results are presented in Figure 8. Notably, the proposed algorithm demonstrates significantly enhanced robustness against higher noise levels and increased probability of noise presence.
All of these analyses provide evidence supporting the suitability of the proposed FSVRG-OACC method for analog aggregation and FL over the air.It is important to note that all of these analyses were conducted under the same transmission power conditions.However, we anticipate that employing a power control policy in communication with this method would yield even more accurate results in convergence and model accuracy.
On the other hand, implementing our proposed approach in real-world industrial settings has practical implications and challenges.A significant part of this approach is adjusting the local gradients in large edge devices using an analog waveform and sending them through the same wireless channels.Our main challenge is achieving perfect waveform superposition, which is crucial for our algorithm's success.This task is complicated because of frame timing offset and carrier frequency offset.To overcome these issues, we require high-performance devices with substantial computational capabilities.Another important challenge emerges when we only have access to partial information, resulting in some edge devices being unable to effectively join the FLOACC aggregation process.This led to a deviation in the gradient descent regime.This deviation highlights the potential of over-the-air analog aggregation in FL for hierarchical PM scenarios.Our study focused on the factory site as the lower hierarchical component, recognizing the criticality of meeting low-latency requirements for time-sensitive applications.We have demonstrated that this challenge can be effectively addressed through the utilization of analog aggregation.
Throughout our investigation, we thoroughly examined the impact of practical channel effects, including noise and fading, on the learning algorithm's performance.To enhance the robustness of learning algorithms against channel noise effects, we proposed a novel algorithm called FSVRG-OACC.The effectiveness and superiority of the proposed algorithm were demonstrated through the application of a distributed SVM for anomaly detection using the CMPASS dataset.The simulation results validate the effectiveness of FSVRG-OACC in reducing aggregation distortion while operating at the same transmission power level.Furthermore, the learning behavior of the proposed algorithm can be further enhanced by incorporating power control and device selection policies into its design.

Figure 1 .
Figure 1.FL edge system model for a hierarchical PM scenario at the factory level.

Figure 2 .
Figure 2. FLOACC scenario at the factory level.

) Algorithm 2 :
FSVRG With Over-the-Air Communication and Computation (FSVRG-OACC) Parameters: φ = number of data points, φ k = number of data points store on device k, α = stepsize, data partition {D k } K k=1 , Randomly initialize w k on each device.for n = 0, 1, . . .do » Overall iterations for k = 1 to K do in parallel over device k do 1: Let {i t } φ k t=1 be random permutation of D k » Distributed loop 1 2: Compute

Figure 4 .Figure 5 .
Figure 4. Convergence and accuracy versus communication round for different algorithms in Overthe-Air analog aggregation (σ 2 z = 1, p = 1, and dataset: FD003).The pale graphs represent the signal, while the bold graphs depict the windowed average of the signal. .

Table 2 .
Performance analysis of different algorithms on FLOACC.