Efficient Asynchronous Federated Learning for AUV Swarm

The development of autonomous underwater vehicles (AUVs) has brought about unprecedented benefits and opportunities. To discover the hidden valuable information in the data detected by an AUV swarm, it is necessary to aggregate those data to generate a powerful machine learning model. Traditional centralized machine learning generates a large number of data exchanges and faces the problems of enormous training data, large-scale models, and heavy communication. In underwater environments, radio waves are strongly absorbed, and acoustic communication is the only feasible technology. Unlike electromagnetic wave communication on land, the bandwidth of underwater acoustic communication is extremely limited, with a transmission rate only 1/10^5 that of electromagnetic waves. Therefore, traditional centralized machine learning cannot support underwater AUV swarm training. Federated learning, in contrast, exchanges only model parameters rather than data, which greatly reduces communication costs. Therefore, this paper introduces federated learning into the collaboration of an AUV swarm. To further reduce the constraints that scarce underwater communication resources impose on federated learning and to alleviate the straggler effect, we designed an asynchronous federated learning method. Finally, we constructed an optimization problem that minimizes the weighted sum of delay and energy consumption by jointly optimizing the AUV CPU frequencies and signal transmission powers. To solve this complex, high-dimensional, non-convex optimization problem with time-series accumulation, we transformed it into a Markov decision process (MDP) and used the proximal policy optimization 2 (PPO2) algorithm to solve it. The simulation results demonstrate the effectiveness and superiority of our method.


Introduction
An autonomous underwater vehicle (AUV) is a kind of submarine robot that can carry out ocean sampling activities independently; it is widely used in underwater research and the marine industry. For example, AUVs have been used to sample coastal frontiers, monitor coastal areas, measure thermocline turbulence, obtain interdisciplinary data, conduct fishery research, and install submarine cables under frozen seas [1,2]. However, due to the limited capacity of a single AUV, processing more complex tasks with an AUV swarm has become a hot research topic, and in recent years underwater research based on AUV groups has become more popular [3,4]. The advantages of AUV swarm cooperation over a single AUV can be summarized in two points. On the one hand, an AUV swarm can obtain more data than a single AUV, for example for modeling underwater hydrological characteristics; on the other hand, an AUV swarm can perform tasks that a single AUV cannot complete, such as rounding up targets, formation cruising, and collaborative positioning. In both aspects, it is often critical to establish a powerful machine learning model to mine the value behind the observed data or to improve the effectiveness of the AUVs' own actions. Specifically, to explore the hidden valuable information in the data detected by the AUV swarm, it is necessary to aggregate those data to generate a powerful machine learning model, for example forming a hydrological characteristics model of the water area by data fitting. Moreover, to improve the movement of an AUV, it is requisite to train a strong reinforcement learning model that enables efficient autonomous action. To support these two goals, traditional centralized machine learning needs to exchange large volumes of raw data, which requires substantial communication resources.
However, unlike on land, high-bandwidth electromagnetic wave communication cannot be used underwater; data transmission often relies on underwater acoustic communication instead. The bandwidth of underwater acoustic communication is very limited, and the transmission rate is only 1/10^5 that of electromagnetic waves. Therefore, traditional centralized machine learning cannot support underwater AUV swarm training and the production of large models. Federated learning, a new machine learning framework, was first proposed by Google in 2016 [5]. Each data holder trains a model with its own data, the models interact with each other, and a global model is finally obtained through model aggregation. Unlike traditional centralized machine learning, federated learning exchanges only model parameters rather than data, which greatly reduces communication overhead and the dependence on communication resources, making it suitable for environments where underwater communication resources are scarce. Therefore, this paper introduces federated learning into the collaboration of the AUV swarm. Furthermore, although exchanging model parameters requires far less communication than exchanging the original data, the volume is still considerable, and the traditional federated learning method (i.e., FedAvg, proposed by Google [5]) is synchronous, which is prone to the straggler effect, especially under unreliable underwater information transmission with large delays. Therefore, this paper designs an asynchronous federated learning method to further reduce data transmission and alleviate the straggler effect, which plays an important role in the harsh underwater environment.
Furthermore, considering the limited energy of the AUVs and the time requirements of model training, we constructed an optimization problem that minimizes the weighted sum of delay and energy consumption by jointly optimizing the AUV CPU frequencies and signal transmission powers. To solve this complex, high-dimensional, non-convex optimization problem with time-series accumulation, we formulated it as an MDP and used the PPO2 algorithm to solve it. The main contributions are as follows:
• To the authors' knowledge, this paper is the first to introduce federated learning into an AUV swarm. Federated learning can help an AUV swarm train large-scale machine learning models in an environment where underwater communication resources are scarce.
• To further reduce the constraints that scarce underwater communication resources impose on federated learning, we designed an asynchronous federated learning method, which effectively alleviates the straggler effect and further reduces data interaction.
• We constructed an optimization problem that minimizes delay and energy consumption by jointly optimizing the AUV CPU frequencies and signal transmission powers.
• To solve the optimization problem efficiently, we converted it into an MDP and applied the PPO2 algorithm. The simulation results verify the effectiveness of the proposed algorithm.
The remainder of this article is organized as follows: related work is reviewed first; the system model, problem formulation, and algorithm design are then presented, followed by the simulation results. The last section summarizes the main points and shortcomings of this work.

Related Work
The bottleneck limiting the performance of an AUV swarm mainly lies in the difficulty of underwater communication [6]. When AUV swarms are used for federated learning tasks, the pressure on the self-organizing data interaction network increases as the scales of the task models and the swarm grow, which requires algorithms that save communication resources as much as possible. However, reducing communication excessively may compromise federated learning performance and disrupt the AUV formation. How to trade off resource consumption against model performance is therefore worth studying.
Thanks to its formation, dynamics, and mobility, an AUV swarm offers rapid deployment, controllable action, and flexible networking. In recent years, research into underwater exploration systems based on multiple AUVs or AUV swarms has become popular, especially in navigation and path planning [7], collaborative data collection [8], target hunting [9], and so on. For example, in the offshore oil and gas industry, AUVs equipped with underwater environment monitoring sensors cooperate to effectively detect the boundaries of oil- and gas-producing areas. Cui et al. [10] proposed an adaptive path planning algorithm based on the random tree star algorithm to estimate an underwater scalar field for an AUV group. Noguchi et al. [11] put forward a dynamic task allocation and path planning algorithm for the AUV group, which gives full play to the advantages of the AUV group. Huang et al. [12] equipped an AUV group with scanning imaging sonar and proposed using a binary Bayesian filter to process the input signal; a path-planning strategy based on the artificial potential field was then proposed for real-time target tracking. Cao et al. [13] used artificial potential field theory to model the underwater environment information and incorporated dispersion, similarity, and difference measures into the potential field function; on this basis, an intelligent path planning model that achieves accurate AUV swarm collaborative target search was proposed.
In a federated learning aggregation algorithm, each participant must send its complete local parameters to the server for model aggregation. These model parameters tend to be large, which brings a huge communication overhead. In [14], the authors protect the privacy of the federated learning architecture through covert communication, minimize the federated learning delay by optimizing the interference power, signal power, and training accuracy, and use an alternating descent algorithm to solve the optimization problem. In [15], the authors use federated learning to build a digital twin model on a digital twin edge network architecture with an asynchronous model update scheme; taking device selection, channel allocation, and CPU frequency as variables, they minimize the federated learning delay, using a deep neural network (DNN) model to dynamically solve the optimization problem. In [16], the authors select some clients to send their local model parameters to the server, use a reasonable local selection strategy to avoid adverse effects on the training loss and convergence time, and establish an optimization problem over the convergence time and training loss, using an artificial neural network (ANN) to solve it. In [17], the authors define a new performance evaluation criterion, learning efficiency, which is discussed in scenarios where the device is equipped with a CPU and a GPU, respectively; with channel resource allocation and bandwidth allocation as optimization variables, learning efficiency is maximized via the Lagrange multiplier method and the KKT (Karush-Kuhn-Tucker) conditions, and a two-dimensional search algorithm is used to solve the optimization problem. With adaptive communication rules, some redundant communication rounds are detected and skipped locally while retaining a convergence speed comparable to the original SGD.
In [18], the authors proposed a communication-saving method for distributed machine learning: a gradient-based selection method that reduces redundant communication rounds. In [19], the authors introduced this method [18] into federated learning, based on the traditional SGD (stochastic gradient descent) method. In [20], the authors built on the above work and used a gradient sparsification method to locally deselect redundant nodes and reduce the communication volume. The work of this paper continues on this basis. So far, no work has introduced federated learning into an AUV swarm.

System Model
This paper considers the following scenario: a fixed-formation swarm consisting of a leader AUV L and a set M of M follower AUVs sails at the same depth and constant speed, collecting seabed data and cooperating via federated learning to perform various machine learning tasks, such as target recognition and path planning. Each AUV maintains the same distance from the other AUVs and the leader, collects data with its own independent sensors, and trains a local model. The local model parameters are then sent to the leader over the uplink. The leader AUV aggregates the information uploaded by the followers into global model parameters and sends them to the followers over the downlink. The followers then proceed with a new round of local training using the received global model parameters. To keep the AUVs in formation, AUV L and each AUV m also need to exchange position and speed information: in each round, AUV m uploads its current speed and position to AUV L, and AUV L calculates and broadcasts the subsequent speed and heading from the received information.

Federated Learning Model
Assume that the input data collected by follower AUV m ∈ M are X_m, the output data are Y_m, and the local model parameter is w_m. The dataset of AUV m can then be expressed as D_m = {(x_{m,1}, y_{m,1}), (x_{m,2}, y_{m,2}), . . . , (x_{m,N_m}, y_{m,N_m})}, where N_m is the number of data samples that AUV m owns. The loss function on dataset D_m is the average of the sample loss functions:

F_m(w_m) = (1/N_m) Σ_{n=1}^{N_m} f(w_m; x_{m,n}, y_{m,n}).

Let w denote the global model parameter and N = Σ_{m∈M} N_m the total number of samples. For the AUVs m ∈ M, each containing N_m data samples, the corresponding local loss functions are weighted and averaged to define the global loss function:

F(w) = Σ_{m∈M} (N_m / N) F_m(w).

The purpose of federated learning is to find a parametric model that minimizes the global loss function, i.e., the optimal parametric model can be expressed as

w* = arg min_w F(w), where w = w_1 = · · · = w_M.
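As a concrete sketch of the aggregation step above, the following Python snippet computes a local loss and the sample-count-weighted global average. The linear model and the function names are purely illustrative, not part of the paper's specification.

```python
import numpy as np

def local_loss(w, X, y):
    """Local loss F_m(w) = (1/N_m) * sum_n f(w; x_n, y_n).

    A mean-squared-error sample loss over a linear model is used
    purely for illustration.
    """
    return float(np.mean((X @ w - y) ** 2))

def aggregate(local_weights, sample_counts):
    """FedAvg-style aggregation: w = sum_m (N_m / N) * w_m."""
    total = sum(sample_counts)
    return sum(n / total * w for w, n in zip(local_weights, sample_counts))
```

With two followers holding 1 and 3 samples, the aggregate of w_1 = [2.0] and w_2 = [4.0] is 0.25·2 + 0.75·4 = [3.5], illustrating the N_m/N weighting.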

Communication Model
In the underwater environment, electromagnetic waves and optical signals attenuate quickly and travel only short distances; therefore, underwater acoustic communication is the most commonly used. The attenuation of an acoustic signal with frequency f over distance D can be given by

10 log A(D, f) = k · 10 log D + D · 10 log a(f),

where k represents the spreading factor and a(f) is the absorption coefficient, which can be expressed by Thorp's empirical formula [21] (f in kHz, a(f) in dB/km):

10 log a(f) = 0.11 f^2/(1 + f^2) + 44 f^2/(4100 + f^2) + 2.75 × 10^{-4} f^2 + 0.003.

In the marine environment, ambient noise mainly comprises turbulence noise, shipping noise, wave noise, and thermal noise. According to [22], the power spectral densities (p.s.d.) of these four main types of noise, in dB re µPa per Hz at communication frequency f, can be given by

10 log N_t(f) = 17 − 30 log f,
10 log N_s(f) = 40 + 20(s − 0.5) + 26 log f − 60 log(f + 0.03),
10 log N_w(f) = 50 + 7.5 w^{1/2} + 20 log f − 40 log(f + 0.4),
10 log N_{th}(f) = −15 + 20 log f,

where s ∈ [0, 1] is the shipping activity factor and w represents the wind velocity (m/s). Hence, the combined noise can be represented as

N(f) = N_t(f) + N_s(f) + N_w(f) + N_{th}(f).

Therefore, the normalized SNR of a signal with unit transmitted power and bandwidth can be represented as 1/(A(D, f) N(f)). The uplink data rate between AUV m and leader AUV L is given by

R_m^U = B_m^U log_2( 1 + p_m / (A(D, f) N(f) B_m^U) ),

where B_m^U is the allocated uplink bandwidth of AUV m, and p_m ∈ (0, p_max) is the transmitting power of AUV m.
In each iteration, the leader AUV broadcasts information to the followers; hence, the downlink data rate between AUV m and leader AUV L can be represented as

R_m^D = B^D log_2( 1 + p_L / (A(D, f) N(f) B^D) ),

where B^D is the downlink bandwidth and p_L ∈ (0, p_max) denotes the transmitting power of the leader AUV.
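The channel model above can be evaluated numerically. The sketch below implements Thorp's absorption formula and the empirical ambient-noise p.s.d. from [22], then composes them into an illustrative uplink rate; the dB bookkeeping (source level in dB re µPa, bandwidth-integrated noise) is a simplifying assumption rather than the paper's exact expression.

```python
import math

def thorp_absorption_db_per_km(f_khz):
    """Thorp's empirical absorption coefficient a(f) in dB/km (f in kHz)."""
    f2 = f_khz ** 2
    return 0.11 * f2 / (1 + f2) + 44 * f2 / (4100 + f2) + 2.75e-4 * f2 + 0.003

def attenuation_db(d_km, f_khz, k=1.5):
    """Path loss 10*log A(D,f) = k*10*log10(D) + D*a(f), spreading factor k."""
    return k * 10 * math.log10(d_km) + d_km * thorp_absorption_db_per_km(f_khz)

def noise_psd_db(f_khz, s=0.5, w=0.0):
    """Combined ambient-noise p.s.d. in dB re uPa per Hz: turbulence,
    shipping (activity factor s), waves (wind speed w m/s), thermal."""
    logf = math.log10(f_khz)
    nt = 17 - 30 * logf
    ns = 40 + 20 * (s - 0.5) + 26 * logf - 60 * math.log10(f_khz + 0.03)
    nw = 50 + 7.5 * math.sqrt(w) + 20 * logf - 40 * math.log10(f_khz + 0.4)
    nth = -15 + 20 * logf
    # Sum the four components in linear scale, convert back to dB.
    return 10 * math.log10(sum(10 ** (x / 10) for x in (nt, ns, nw, nth)))

def uplink_rate_bps(p_tx_db, d_km, f_khz, bandwidth_hz):
    """Shannon-style rate R = B*log2(1 + SNR), with SNR assembled in dB as
    source level - attenuation - noise p.s.d. - 10*log10(B)."""
    snr_db = (p_tx_db - attenuation_db(d_km, f_khz)
              - noise_psd_db(f_khz) - 10 * math.log10(bandwidth_hz))
    return bandwidth_hz * math.log2(1 + 10 ** (snr_db / 10))
```

For instance, a 160 dB re µPa source at 10 kHz over 1 km with a 3 kHz band yields a positive achievable rate, and attenuation grows monotonically with distance, as expected.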

Control Model
Due to the high cost of underwater communication, most gradient interactions in distributed SGD are redundant. Therefore, this paper introduces the concept of lazy nodes, allowing each follower AUV to perform a local self-check so that some nodes skip some rounds of communication. The definition of a lazy node in this paper follows the work of Chen et al. [19].
These lazy nodes constitute a set M_N of size M_N. In this paper, the gradient descent algorithm is used to optimize the global model, where γ represents the learning rate. Since the global model tends to converge, an approximation of the gradient change is used; applying the mean inequality and letting M_N = βM, Equation (14) can be rearranged into the skipping condition of Equation (18). At each round t, AUV m checks locally whether Equation (18) is satisfied; if so, it skips this round of uploading, otherwise it participates in the upload. Considering the extreme case in which all nodes satisfy Equation (18) in a certain round t, AUV L cannot receive any model information uploaded by the follower AUVs. In this case, after a certain time interval, AUV L randomly selects some follower AUVs to participate in the upload; in this paper, we randomly select one follower AUV to participate in uploading in the extreme case.
To prevent an AUV from remaining silent for a long time because of the lazy-node mechanism, we require each AUV to communicate at least once every τ rounds. Each AUV locally records whether it uploaded in each round, where λ_m(t) ∈ {0, 1} indicates whether AUV m participates in the gradient upload of round t after the control model is applied; λ_m(t) = 1 indicates participation in this round's gradient upload, and λ_m(t) = 0 otherwise. At round t, AUV m checks locally whether the entries of Equation (20) over the last τ rounds are all 0; if so, AUV m must upload in this round.
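A minimal sketch of the lazy-node check, assuming a LAG-style rule in the spirit of Chen et al. [19]: since the exact threshold of Equation (18) is not reproduced here, the β-scaled comparison against recent global progress below is an illustrative stand-in, with the forced-upload rule every τ rounds included.

```python
import numpy as np

def should_skip_upload(grad_now, grad_last_sent, recent_global_deltas,
                       rounds_since_upload, tau=5, beta=0.5):
    """LAG-style lazy-node check (illustrative): skip this round's upload
    when the gradient change since the last upload is small compared with
    recent global model progress. Upload is forced every `tau` rounds."""
    if rounds_since_upload >= tau:
        return False  # must communicate at least once within tau rounds
    # Squared change of the local gradient since the last upload.
    change = np.sum((grad_now - grad_last_sent) ** 2)
    # Threshold scaled by beta from recent global update magnitudes.
    threshold = beta * np.mean([np.sum(d ** 2) for d in recent_global_deltas])
    return bool(change <= threshold)
```

An unchanged gradient with nonzero recent global progress is skipped, while a node that has been silent for τ rounds is forced to upload regardless.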

Local Parameter Calculating Latency
Let f m ∈ ( f min , f max ) and f L ∈ ( f min , f max ) denote the CPU frequency of the AUV m and leader AUV L, respectively, where f min and f max represent the minimum and the maximum CPU frequency, respectively.
The local computation latency on each AUV m at the t-th slot can be calculated as

T_m^{cmp}(t) = c_m N_m / f_m,

where c_m is the number of CPU cycles required to train one data sample with the backpropagation algorithm [17] at AUV m.

Uploading Latency
Assume that the local parameter has a counterpart gradient consisting of α elements, each quantized with an average of ζ bits, and that the size of the current position and speed information is ψ. Hence, the total data size of each local parameter is |w_m| = αζ [17], and the uploading latency is given by

T_m^U(t) = (|w_m| + ψ) / R_m^U.

Global Parameter Aggregating Latency
At the leader AUV L, the global parameter aggregating latency is calculated by

T^{agg}(t) = c_0 Σ_m |w_m| / f_L,

where c_0 is the computational complexity [15] of aggregating the newly updated parameters from the follower AUVs, and |w_m| is the data size of the local parameter w_m.

Global Parameter Updating Latency
The global parameter updating latency is given by

T^{upd}(t) = c_L / f_L,

where c_L is the computational complexity of performing the global parameter update [17].

Downloading Latency
The downloading latency can be calculated as

T_m^D(t) = |w| / R_m^D,

where |w| is the data size of the global parameter, which is essentially the same as that of the local parameter |w_m|.

Total Latency
Excluding extreme cases, some devices upload in every round, and the latency of one complete federated learning round is

T(t) = max_{m} ( T_m^{cmp}(t) + T_m^U(t) ) + T^{agg}(t) + T^{upd}(t) + T_m^D(t),

where the maximum is taken over the AUVs participating in round t. Consider the extreme case in which no device uploads within the round. According to the settings of the control model, when AUV L receives no gradient upload after the time interval T, a device is randomly selected for uploading. Assuming that the selected device is m_0, the federated learning delay for this round is

T(t) = T + T_{m_0}^{cmp}(t) + T_{m_0}^U(t) + T^{agg}(t) + T^{upd}(t) + T_{m_0}^D(t).

Energy Consumption of Follower AUV
The computational energy consumption of a follower AUV mainly comprises the local parameter calculation and local parameter updating, which can be given by

E_m^{cmp}(t) = κ c_m N_m f_m^2,

where κ is the effective capacitance coefficient of the CPU chip. The communication energy consumption is mainly due to local parameter uploading, and it can be calculated by

E_m^U(t) = p_m T_m^U(t).

Energy Consumption on Leader AUV
The computational energy consumption of the leader AUV is mainly caused by global parameter aggregation, which is given as

E_L^{cmp}(t) = κ c_0 Σ_m |w_m| f_L^2.

Similarly, the communication energy consumption of the leader AUV is mainly due to global parameter downloading, which can be calculated by

E_L^D(t) = p_L T_m^D(t).
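The latency and energy terms above combine into a per-round cost for each follower. The sketch below assumes the standard CMOS energy model with an effective capacitance coefficient κ; the weights, helper name, and parameter values are placeholders, not the paper's exact expression.

```python
def follower_round_cost(c_m, n_m, f_m, p_m, rate_up, model_bits,
                        kappa=1e-28, w_time=0.5, w_energy=0.5):
    """Weighted delay-plus-energy cost of one follower AUV per round,
    assuming E_cmp = kappa * c_m * N_m * f_m^2 and E_up = p_m * T_up."""
    t_cmp = c_m * n_m / f_m               # local training latency
    e_cmp = kappa * c_m * n_m * f_m ** 2  # computation energy
    t_up = model_bits / rate_up           # uploading latency
    e_up = p_m * t_up                     # communication energy
    return w_time * (t_cmp + t_up) + w_energy * (e_cmp + e_up)
```

Note the trade-off that motivates P1: raising f_m shrinks t_cmp linearly but grows e_cmp quadratically, so the weighted cost has an interior optimum in the CPU frequency.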

Problem Formulation
Our goal is to minimize the total cost by jointly optimizing the variables p_m, f_m, p_L, f_L, and β. The problem is defined as follows:

P1 : min_{p_m, f_m, p_L, f_L, β} Σ_{t=1}^{N_t} [ ω_1 T(t) + ω_2 ( Σ_m E_m(t) + E_L(t) ) ]   (33)
s.t. 0 < p_m ≤ p_max,   (34)
     f_min ≤ f_m ≤ f_max,   (35)
     0 < p_L ≤ p_max,   (36)
     f_min ≤ f_L ≤ f_max,   (37)
     0 ≤ β ≤ 1,   (38)
     Σ_t E_m(t) ≤ E_m^{max},   (39)
     Σ_t E_L(t) ≤ E_L^{max},   (40)

where N_t is the total number of iterations and ω_1, ω_2 are the weighting coefficients. Equation (33) is the optimization objective: the weighted cost of each round is minimized by optimizing the powers p_m, p_L and CPU frequencies f_m, f_L. Equations (34) and (35) are the power and CPU frequency constraints of AUV m, respectively; Equations (36) and (37) are those of AUV L; Equation (38) constrains the lazy-node scale factor; and Equations (39) and (40) are the energy budget constraints of AUV m and AUV L, respectively, with E_m^{max} and E_L^{max} denoting the corresponding energy budgets.

Algorithm Design
In this section, we apply PPO2 to solve P1, since the problem is non-convex and high-dimensional.

Modeling of Deep Reinforcement Learning Environment
An MDP consists of a state space, an action space, a state transition function, a reward function, and a discount factor. The optimization problem P1 can thus be transformed as follows. State space: the network state of the agent at time slot t is denoted s(t), and the state of the agent is given by s_t = {s(t)}. Action space: at each time slot t, the action a(t) is composed of the optimization variables, i.e., the transmission powers p_m, p_L, the CPU frequencies f_m, f_L, and the lazy-node factor β; the action of the agent is given by a_t = {a(t)}. State transition function: the transition probability of the agent at time slot t is denoted P_T(s_{t+1} | s_t, a_t).
Policy: Let π denote the policy function, which is based on the observed state to make decisions and control the action of the agent π(a | s) = P(a | s).
Reward function: the reward function in this paper is designed for optimizing p_m, f_m, p_L, f_L, and β; the reward of the agent at time slot t is designed as the negative of the weighted sum of the round's delay and energy consumption. Maximizing the cumulative reward over a sequence T amounts to maximizing the sum of the rewards obtained at each stage, denoted R_n(T). Therefore, the expected reward under policy π can be obtained as

J(π) = E_π [ Σ_τ ξ^τ r(τ) ],

where ξ denotes the discount factor.

Proximal Policy Optimization Algorithm
Proximal policy optimization (PPO) is an on-policy, model-free reinforcement learning algorithm. OpenAI uses it as its current baseline algorithm; it uses a new class of objective functions and updates parameters in small batches over multiple training steps. To better estimate the reward and prevent overfitting, we used the advantage Â_n = R̂_n − V_φ(s_n) instead of R_n(T) to evaluate actions, where V_φ(s_n) is computed by a value network. Our aim is to update the actor's policy π to maximize the expected reward, so gradient ascent is used to update the network parameters θ. PPO2 was inspired by the same question as TRPO: how to use only the current data to improve the policy as much as possible without causing a sudden collapse in policy performance. Differently from TRPO, which attempts to solve the problem with a complex second-order approach, PPO2 uses a first-order approach with a few additional tricks to keep the new policy close to the old one. The essence of the PPO algorithm is to introduce a ratio coefficient that indirectly describes the difference between the new policy and the old one, denoted by r_t(θ) = π_θ(a_t | s_t) / π_{θ_k}(a_t | s_t); the loss function is defined as

L^{CLIP}(θ) = E_t [ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ],

where ε is a (small) hyperparameter that roughly limits the range of variation of r_t(θ). In PPO2, the value network is updated by minimizing the mean squared error between V_φ(s_n) and the return R̂_n. The algorithm used in this paper is summarized in Algorithm 1.
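A minimal NumPy sketch of the PPO2 clipped surrogate described above, negated so that it can be minimized by gradient descent; batching and the value-network update are omitted.

```python
import numpy as np

def ppo2_clip_loss(ratio, advantage, eps=0.2):
    """PPO2 clipped surrogate: L = -E[min(r*A, clip(r, 1-eps, 1+eps)*A)].

    `ratio` is r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s) per sample,
    `advantage` is the estimated advantage A_hat per sample.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))
```

With ratio 1.5 and a positive advantage of 1.0, the clip at 1 + ε = 1.2 caps the objective at 1.2 (loss −1.2), which is exactly the mechanism that prevents an overly large policy step.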

Algorithm 1 Proximal policy optimization (clip).
Input: initial policy parameters θ_0, initial value function parameters φ_0
for k = 1, 2, . . . do
    Run policy π_{θ_k} for K timesteps, collecting (s_n, a_n, r_n)
    Compute the returns R̂_n and advantages Â_n
    Update the policy parameters θ by maximizing the clipped objective L^{CLIP}
    Update the value function parameters φ by regression on the returns R̂_n
end for

Simulation Results
This section uses the TensorFlow framework in Python to build a federated learning model on MNIST (Modified National Institute of Standards and Technology database), which consists of 42,000 digit images, with 10% of the data retained as a test set for the global model. A three-layer MLP (multilayer perceptron) neural network was created as the model for the classification task, with 200, 200, and 10 neurons in its layers, respectively. We used this model to perform the classification task and report the resulting accuracy and loss. The maximum number of convergence rounds is set to 1000. Assuming M = 9, the follower AUVs are 20 m apart from each other, form a fixed formation, and sail at a constant speed in a fixed direction; the leader AUV is 20 m ahead of the formation center. The rest of the simulation parameters are shown in Table 1.

The Performance of Each Index in the Gradient Compression Test
Figure 1 compares the number of communications among the different schemes versus β. We can observe that, as β increases, fewer follower AUVs skip communication within a round, and the corresponding total number of communications increases. From the analysis of Equation (19), a decrease in β increases the right-hand side of the inequality, so more nodes satisfy Equation (19), more AUVs skip communication rounds, and the number of communications decreases. The experimental results agree well with this theoretical analysis. It can be concluded from Figures 1 and 2 that federated learning with the control model still converges. However, as β decreases, the number of communications decreases; although communication resources are saved, the global model misses the gradients of some follower AUVs in certain rounds, which degrades the performance of federated learning. Specifically, the accuracy is reduced, and the experimental results are consistent with the theoretical analysis. In Figure 3, we show the total time of the different schemes versus β. We can observe that the total time decreases as β increases. This is because, as β increases, the number of communications gradually increases, and the model converges faster with more communication, resulting in a shorter total time.

The Performance Analysis of the Scheme Proposed in This Paper
To show that the scheme proposed in this paper is the most effective at reducing the training cost, we compare it with the following baseline schemes.
Scheme 1: the scheme proposed in this paper.
Scheme 2: asynchronous federated learning with dynamically optimized β.
Scheme 3: asynchronous federated learning with fixed β.
Scheme 4: asynchronous federated learning with LAG.
Scheme 5: traditional asynchronous federated learning.
The two sub-graphs in Figure 4 describe the number of communications and the corresponding accuracy of the different experimental schemes for different numbers of follower AUVs. As the number of follower AUVs grows, the accuracy increases, and a larger number of communications corresponds to a higher accuracy. When the number of communications is reduced to 710, only 21% of that of traditional asynchronous federated learning, the accuracy is reduced by only 0.03%. This shows that our proposed scheme greatly reduces the number of communications while the accuracy remains essentially unchanged, indicating that it achieves good results. In Figure 5, we show the accuracies of the different control models versus the number of follower AUVs. Compared with the work of [19], the control model proposed in this paper increases the accuracy, which shows that the improved control model is effective. In Figure 6, we show the cost of the different schemes versus the number of follower AUVs. As the number of follower AUVs increases, the cost gradually increases. As can be seen from the figure, the scheme proposed in this paper has a smaller cost and can save more resources.

The Performance Analysis of the PPO2 Algorithm
In Figure 7, we compare the performance of state-of-the-art algorithms on problem P1, including the genetic algorithm (GA), particle swarm optimization (PSO), actor-critic (AC), deep deterministic policy gradient (DDPG), and the PPO2 algorithm employed in this paper. We can observe that PPO2 achieves a smaller cost under the same number of iterations, which shows that it outperforms the other algorithms.

Conclusions
In order to reduce the constraints of underwater scarce communication resources on AUV swarm machine learning, we designed an asynchronous federated learning method. By constructing the optimization problem of minimizing the weighted sum of delay and energy consumption, the AUV CPU frequency and signal transmission power were jointly optimized. In order to solve this complex optimization problem of high-dimensional nonconvex time series accumulation, we transformed the problem into an MDP and used the PPO2 algorithm to solve this problem. Finally, we carried out some experiments to verify the effectiveness of the proposed scheme.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: