Joint Optimization of Relay Communication Rates in Clustered Drones under Interference Conditions

Abstract: To address the communication failures and inefficiency that external malicious interference causes in clustered drone relay communication, this paper proposes a joint optimization method for the relay communication rates of clustered drones under interference conditions. The method employs a two-step processing framework. First, the Discrete Soft Actor-Critic (DSAC) algorithm is used to train the relay drones for dynamic channel selection so that they avoid various types of interference, while the Bayesian optimization algorithm is applied to tune the hyperparameters of the DSAC algorithm and further enhance its performance. Subsequently, the modulation order, transmission power, and trajectory of the relay drones and the power allocation factors of the clustered drones are jointly optimized. This complex problem is transformed into convex subproblems that are solved to maximize the communication rate of the clustered drones. The simulation results demonstrate that the proposed algorithm performs well in terms of anti-interference capability, solution convergence, and stability. It effectively improves the mission efficiency of clustered drones under interference conditions and enhances their adaptability to dynamic environments.


Introduction
In recent years, with advances in multi-agent technology, efficient computing, network convergence, and new communication technologies, various types of equipment have shed the "fight alone" approach. Instead, they operate in clusters, networking and collaborating to execute various tasks, thereby enhancing the system's ability to handle complex, dynamic, and adverse conditions. However, in this process, effectively managing system resources, balancing fairness and efficiency, scheduling tasks, and combating external interference have become new challenges in meeting application requirements. These challenges are currently research hotspots in both academia and industry.
Drones 2024, 8, 381
As an important application of multi-agent systems, clustered drones have broad application prospects [1]. They form an unmanned network system that is low cost, highly dynamic, and strongly stable owing to resource integration, self-organized collaboration, and intelligent scheduling. This provides strong support for the future construction of global air, space, ground, and sea networks, the realization of communication-perception-computing integration, and the development of unmanned platforms. Reference [2] uses multiple drones to form an air-ground integrated IoT, redeploying the drones to different locations, changing the resource allocation, and offloading server computations by alternating heuristic greedy and successive convex approximation methods to minimize the total computational overhead of ground-based IoT devices. Reference [3] proposes a generalized intelligent collaborative task scheduling framework that switches between heuristic and deep reinforcement learning-based scheduling solutions to address the ever-changing task requests that arise when users are dynamically served by multiple drones. Reference [4] applies multiple drones as airborne relays to serve clustered users on the ground and proposes a channel statistical model based on the probability of occurrence. It also establishes analytic expressions for the average rate and the average outage probability of this channel model, comparing its performance with those of the Rayleigh and Rice channel models. Reference [5] proposes a traffic load balancing scheme for a multi-drone-assisted fog network to minimize wireless delays for IoT users. Reference [6] proposes a framework that uses a dynamic drone network for the heuristic optimization of coverage areas. This framework employs Bézier curves to plan the flight paths of the drones, enabling intelligent adjustments to their positions and trajectories, which significantly enhances the network access quality for ground users. Reference [7] investigates routing protocols for clustered drone networks in depth, comparing the characteristics and performance of various protocols to give researchers a basis for selecting appropriate ones.
Despite the promising results demonstrated by the solutions proposed in the studies referenced, there is a notable lack of consideration for real-world environments and complex scenarios, which introduces certain limitations. Specifically, in the presence of malicious dynamic interference and a complex electromagnetic environment, the communication processes of clustered drones are inevitably impacted. Under these conditions, the signal-to-noise ratios of the received signals are affected by more than just the drones' trajectories and transmission powers. Therefore, it is essential to analyze the physical impact of the interference on the entire communication process and to develop tailored algorithms that mitigate such external interference. Reference [8] proposes a convergent rate-partitioning-based interference management resource allocation and clustering algorithm that iteratively optimizes drone transmission power, drone pairing, wireless resource scheduling, and wireless resource pricing to achieve interference management in cellular drone networks. Reference [9] proposes a drone-path-tracking scheme that uses a radial basis neural network to approximate the adaptive approximation law of the gyroscopic effect function to balance the effects of system uncertainties and nonlinearities, improving the convergence speed and error accuracy of the controller. Reference [10] designs a joint optimization model that provides optimal truck and drone routing policies to address mission abortion when truck and clustered drone routes are subjected to random attacks and to minimize the total cost. Reference [11] designs a dynamic-based communication antijamming decision-making method to solve the problem of intelligent antijamming decision making for battlefield communication, enabling the intelligent system to avoid jamming effectively while keeping communication uninterrupted as far as possible.
Current research on antijamming primarily concentrates on methods to eliminate and mitigate the impacts of interference on communication processes. However, this focus often overlooks the crucial aspect of enhancing the system performance of clustered drones after the interference has been eliminated, which is essential for bolstering their mission execution capabilities. Consequently, the joint optimization of the "residual performance" of clustered drones following antijamming efforts holds significant research value, promising substantial improvements in their operational effectiveness. Reference [12] jointly optimizes the trajectory and scheduling plan of each drone, combining convex optimization and ant colony optimization algorithms to obtain the proposed optimal solution and maximize the total amount of relay data. Reference [13] proposes a drone relay vehicle networking architecture based on relay protocols with nonorthogonal multiple access and maximum ratio combination techniques, as well as improved particle swarm optimization algorithms, to increase data rates for cell-edge vehicles in rural road scenarios. Reference [14] investigates the problem of clustered drone deployment for better coverage and proposes a genetic algorithm that encodes the solution of the problem as a chromosome and simulates the process of biological evolution to find favorable solutions, balancing the energy consumption of all drones against the maximization of the lifetime of the full coverage network. Reference [15] proposes a 3D layout analysis framework for coordinating a fleet of drone relays and uses it to construct a complex mixed-integer nonconvex planning problem for network performance optimization in terms of user throughput fairness through a parallel alpha-fair drone deployment method. Reference [16] considers a scenario in which flying drones provide wireless services to multiple ground nodes simultaneously and proposes a joint transmission power and trajectory optimization algorithm that maximizes the minimum average throughput for a given length of time.
Based on the considerations outlined above, this paper delves deeper into the subject, and the contributions are as follows:
- Aiming to maximize the communication rate among clustered drones, this study constructs a multivariable, coupled 0-1 mixed-integer nonlinear programming model. This model incorporates various constraints, including the physical limitations of drones, channel selection, and communication performance, among others;
- The Discrete Soft Actor-Critic (DSAC) algorithm is deployed on relay drones, training them for channel selection so that they can avoid and mitigate dynamic interference;
- The Bayesian optimization algorithm is used to tune the hyperparameters of the DSAC algorithm, such as the learning rate, discount factor, and target entropy, enhancing the algorithm's capability to resist interference;
- The joint optimization problem is systematically decomposed and transformed into manageable convex subproblems, covering the modulation order of relay drones, their transmission power, their trajectories, and the power allocation factors for clustered drones, which are solved iteratively to maximize the total communication rate.
The experimental results demonstrate that the algorithm proposed in this paper effectively addresses the efficient communication challenges faced by clustered drones in environments with interference. Furthermore, the algorithm exhibits robust stability, strong convergence, and good task performance.
The structure of this article is as follows: Section 1 introduces related work and provides an overview of the current research status. Section 2 describes the system model, detailing the tasks, channels, and interference. Section 3 presents the problem description, integrating the model to express specific research questions. Section 4 proposes a DSAC algorithm using Bayesian optimization. Section 5 presents a joint optimization algorithm for communication rates. Section 6 covers experimental simulations. Section 7 concludes the paper. Table 1 presents a description of the key parameters used throughout the paper.

Task Model
When performing tasks, there is generally a considerable spatial distance between the swarm of drones and the ground station, requiring the involvement of relay drones for real-time information transmission. Suppose that relay drones assisting in communications suffer from malicious interference by an enemy, causing communication disruptions. Relay drones need to sense enemy interference, intelligently select channels, and optimize their transmission power, trajectory, and modulation methods to maximize communication rates, thereby forming real-time, efficient, and stable relay communications. Figure 1 shows a schematic diagram of relay communication by a swarm of drones under interference conditions.
Assume that the relay drone assists communication for a duration $T$, which is divided into $K$ time slots of length $\delta_T$, and that $M$ drones, indexed by $\Omega = \{1, 2, \ldots, M\}$, form a cluster, each with its own three-dimensional coordinates. The coordinates of the ground station are $(0, 0, 0)$, the jammer's coordinates are $(x_J, y_J, 0)$, and the relay drone's coordinates are $Q = (x_r^{uav}(k), y_r^{uav}(k), z_r^{uav}(k))$, $k \in K$. Changes in position within each time slot $k$ are negligible. Additionally, the relay drone updates the position coordinates of the cluster drones every $\delta$ time interval, during which changes in the cluster drones' coordinates can also be neglected. The relay drone uses NOMA multiplexing to communicate, enhancing the real-time transmission efficiency of the cluster drones, and employs amplify-and-forward in a half-duplex communication mode for data transmission.

Channel Model
According to Reference [17], the operational trajectory of the relay drone in the air exhibits strong Line-of-Sight (LOS) characteristics with both the cluster of drones and the ground station. Therefore, the channel model between the relay drone and both the cluster of drones and the ground station can be approximated by the free-space propagation model of electromagnetic waves. In this model, $G_t G_r$ represents the product of the antenna gains of the transmitter and receiver, $\lambda_G$ denotes the wavelength of the electromagnetic waves at different frequencies, and $D(k)$ signifies the distance between the relay drone and either the cluster of drones or the ground station. The same expression gives, for example, the free-space path loss between the relay drone and the ground station node.
Jamming devices are typically hidden in relatively concealed areas to prevent detection and subsequent damage or destruction. Therefore, according to the 3GPP [18], the channel model between the jammer and the relay drone is usually a probabilistic loss model that mixes LOS and Non-Line-of-Sight (NLOS) components. In this model, $L_D$ represents the spatial propagation loss of the electromagnetic waves, which depends on the distance between the relay drone and the jammer, $\sqrt{(x_r^{uav}(k) - x_J)^2 + (y_r^{uav}(k) - y_J)^2 + (z_r^{uav}(k))^2}$, and on the communication frequency of the relay drone, $f_c^{uav}$. In addition, $\delta$ refers to shadow fading, which is typically modeled as a log-normal distribution, and $\vartheta$ denotes the Rician-distributed small-scale fading. The Rician distribution is used because jammers typically have high transmission power and there is always a line-of-sight component present in the signal, making this distribution more suitable for real-world conditions.
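As a rough illustration of the free-space model described above, the sketch below evaluates the channel power gain in its standard Friis form, $G_t G_r (\lambda / (4\pi D))^2$. The function names, the unit antenna gains, and the example carrier frequency and relay position are assumptions made for illustration; they are not values taken from this paper, and the paper's own expression may be written differently (for instance, in decibels).

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def free_space_gain(distance_m, freq_hz, gain_tx=1.0, gain_rx=1.0):
    """Linear channel power gain of the standard Friis free-space model."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    wavelength = SPEED_OF_LIGHT / freq_hz
    return gain_tx * gain_rx * (wavelength / (4.0 * math.pi * distance_m)) ** 2


def relay_to_ground_distance(x, y, z):
    """Distance from a relay drone at (x, y, z) to the ground station at the origin."""
    return math.sqrt(x * x + y * y + z * z)


def to_db(linear_value):
    """Convert a linear power ratio to decibels."""
    return 10.0 * math.log10(linear_value)


if __name__ == "__main__":
    # Hypothetical relay position (metres) and carrier frequency (2.4 GHz).
    d = relay_to_ground_distance(300.0, 400.0, 100.0)
    g = free_space_gain(d, 2.4e9)
    print(f"distance = {d:.1f} m, path gain = {to_db(g):.1f} dB")
```

Doubling the distance reduces the gain by a factor of four, which is why the relay trajectory enters the rate optimization directly.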

Interference Model
During the transmission of information by the relay drone, the jammer exhibits energy detection and power suppression capabilities, thereby interfering with the relay drone's communication. In the energy-based interference detection model, $rs(k)$ denotes the currently captured signal, $ts(k)$ represents the communication signal, and $j(k)$ signifies the interference signal. After the subband signals pass through the band-pass filter, their signal power, $\rho_u(k)$, is computed, and the energy level of each frequency band is then calculated, where $v$ denotes the number of subbands. The channel state is determined by comparing $E_t$ with the respective thresholds, $\lambda_{C1}$ and $\lambda_{C2}$, within the model. The relay drone also has spectrum-sensing capabilities, enabling it to intelligently select the operating frequency, $f_n^{Ruav}$, through frequency hopping, thereby avoiding the impact of interference on relay transmission. The subfrequency ranges of the frequency points do not overlap, and the available frequency set contains $F$ frequency points, with subchannel bandwidths of $B_n^{Ruav}$. The working frequency range of the relay drone, shared with the jammers, is bounded by $f_l^{Ruav}$ and $f_h^{Ruav}$. Assume that the maximum transmission power of cluster drone $i$ is $P_i^F(k)$, $k \in K$, the transmission power of the relay drone is $P^R(k)$, and the maximum transmission power of the jammer is $P_J$, with the product of the antenna gains between them being $H_1^*$. In a relay scenario in which the cluster drones need to transmit detection information back to the ground station, the Signal-to-Interference-plus-Noise Ratio (SINR) of a relay drone working at frequency $f_n^{Ruav}$ and receiving the signal transmitted by cluster drone $i$ is denoted $SINR_{i,R}$. In this model, $n_0$ represents the noise power spectral density, and $H_i^w(k)$ denotes the channel gain between cluster drone $i$ and the relay drone. The magnitude of $SINR_{i,R}$ directly reflects the interference condition of the relay drone: the smaller its value, the greater the degree of interference, and conversely, the larger its value, the lesser the degree of interference. Assuming the normal communication threshold is $D_n$, $SINR_{i,R} \geq D_n$ indicates the normal transmission of relay information, while $SINR_{i,R} < D_n$ signifies a failure in the information relay.
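The threshold test on $SINR_{i,R}$ can be sketched as follows. The SINR form used here, received signal power divided by the sum of thermal noise ($n_0$ times the subchannel bandwidth) and received jamming power, is a common textbook form and is an assumption; the paper's exact expression is not reproduced here. All function names and numeric values are hypothetical.

```python
def sinr_linear(p_tx, gain_signal, p_jam, gain_jam, noise_psd, bandwidth_hz, jammed):
    """SINR = signal / (noise + jamming); jamming only counts if the channel is hit."""
    signal = p_tx * gain_signal
    noise = noise_psd * bandwidth_hz
    interference = p_jam * gain_jam if jammed else 0.0
    return signal / (noise + interference)


def relay_link_ok(sinr, threshold):
    """Normal relaying when SINR >= D_n, relay failure otherwise."""
    return sinr >= threshold


if __name__ == "__main__":
    # Hypothetical link: 1 W drone, 10 W jammer, equal path gains, 1 MHz subchannel.
    common = dict(p_tx=1.0, gain_signal=1e-9, p_jam=10.0, gain_jam=1e-9,
                  noise_psd=1e-20, bandwidth_hz=1e6)
    d_n = 10.0  # assumed communication threshold D_n (linear scale)
    for jammed in (False, True):
        s = sinr_linear(jammed=jammed, **common)
        print(f"jammed={jammed}: SINR={s:.3g}, relay ok={relay_link_ok(s, d_n)}")
```

Under these assumed numbers the same link passes the threshold on a clean channel and fails it on a jammed one, which is the situation that motivates channel selection before any power or trajectory optimization.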
The interference model proposed in this paper encompasses constant frequency interference, sweep frequency interference, and hybrid frequency interference. Each mode is described by an interference intensity that is a function of time and frequency. In these expressions, $A_{fixed}$ and $A_{sweep}$ denote the amplitudes of the constant frequency and sweep frequency interference signals, $f_{fixed}$ represents the set of constant frequencies, $f_{sweep}(k)$ signifies the set of frequencies at time $k$ for sweep frequency interference, and $Sa(\cdot)$ denotes the sampling function. The constant frequency interference mode is characterized as a bandwidth-limited random process with a central frequency of $f_{fixed}$, while the sweep frequency interference mode is a random process with frequencies varying over time. The hybrid frequency interference mode combines both constant frequency interference and sweep frequency interference, making it a more complex random process. As a result, external malicious interference exhibits high randomness and instability. Combined with the extremely limited prior knowledge of the interference signals, this makes efficient and accurate prediction and estimation challenging. The electromagnetic environment faced by the relay drone is therefore exceptionally complex and severe, posing significant challenges to communication performance. The dynamic interference process is illustrated in Figure 2.
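To make the three interference modes concrete, the following sketch reduces each mode to the set of channel indices it occupies in time slot k. This is a deliberate simplification: amplitudes, the $Sa(\cdot)$ spectral shape, and randomness are ignored, and the sweep pattern (one step per slot, wrapping around) is an assumed example rather than the paper's jammer model. All names are hypothetical.

```python
def fixed_jam(k, fixed_channels):
    """Constant frequency interference: the same channels are hit in every slot."""
    return set(fixed_channels)


def sweep_jam(k, num_channels, width=1, step=1, start=0):
    """Sweep frequency interference: a window of `width` channels moves each slot."""
    first = (start + step * k) % num_channels
    return {(first + i) % num_channels for i in range(width)}


def hybrid_jam(k, num_channels, fixed_channels, width=1, step=1):
    """Hybrid interference: union of the constant and sweep patterns."""
    return fixed_jam(k, fixed_channels) | sweep_jam(k, num_channels, width, step)


def free_channels(k, num_channels, jam_fn):
    """Channels not hit in slot k, given a function that returns the jammed set."""
    return sorted(set(range(num_channels)) - jam_fn(k))


if __name__ == "__main__":
    F = 5  # assumed number of frequency points
    for k in range(4):
        jammed = hybrid_jam(k, F, fixed_channels=[4])
        usable = free_channels(k, F, lambda t: hybrid_jam(t, F, [4]))
        print(f"slot {k}: jammed={sorted(jammed)}, usable={usable}")
```

Even in this toy form, the usable set changes every slot, so a fixed channel assignment cannot stay clear of the hybrid pattern for long.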
Under the above interference modes, the SINR of the relay drone working on the $f_n^{Ruav}$ channel and receiving signals from cluster drone $i$ can be reformulated. In this context, $S = 0, 1, \ldots, F_J$ represents the number of channels affected by interference, and Y( f , f Ruav

Power Transmission Model and Energy Model
Transmission power is closely related to the communication rate: it determines whether data can be transmitted reliably across varying distances and environments, and it is also significant for optimizing energy consumption. Hence, a power transmission model is introduced to quantify it precisely. In this model, $P_{RX}^{uav}$ denotes the received signal power, $P_{TX}^{uav}$ represents the transmitted signal power, $L^{uav}$ indicates the path loss, $G_r^{uav}$ signifies the receiver antenna gain, and $G_t^{uav}$ signifies the transmitter antenna gain. In a drone cluster relay communication system, the system energy consumption primarily comprises the flight energy consumption and the communication energy consumption of the drones. The communication energy consumption is determined by $P^{F/R}(k)$, the transmission power of the drone, and the flight energy consumption is determined by $v(k)$, the drone's flight speed at time $k$. The total system energy consumption, $E_s$, combines the two.
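The quantities named above fit the usual link budget, which in decibel units reads received power = transmitted power + transmitter gain + receiver gain - path loss. The sketch below uses that standard form together with a simple per-slot energy sum. The decibel form, the assumption that flight power per slot is supplied by the caller (the paper's speed-dependent flight model is not reproduced), and all names and numbers are illustrative assumptions.

```python
def received_power_dbm(p_tx_dbm, gain_tx_db, gain_rx_db, path_loss_db):
    """Standard link budget in dB units: P_RX = P_TX + G_t + G_r - L."""
    return p_tx_dbm + gain_tx_db + gain_rx_db - path_loss_db


def communication_energy_j(tx_powers_w, slot_s):
    """Energy spent transmitting: sum over slots of power times slot length."""
    return sum(p * slot_s for p in tx_powers_w)


def total_energy_j(tx_powers_w, flight_powers_w, slot_s):
    """Total energy as communication plus flight energy over the same slots."""
    if len(tx_powers_w) != len(flight_powers_w):
        raise ValueError("one flight power value is needed per time slot")
    flight = sum(p * slot_s for p in flight_powers_w)
    return communication_energy_j(tx_powers_w, slot_s) + flight


if __name__ == "__main__":
    # Hypothetical values: 30 dBm transmitter, small antenna gains, 100 dB path loss.
    print(received_power_dbm(30.0, 2.0, 3.0, 100.0), "dBm received")
    print(total_energy_j([1.0, 2.0], [100.0, 100.0], slot_s=0.5), "J in total")
```

With numbers of this kind, flight power dominates the total, which is why an energy budget constrains the trajectory as much as the transmission power.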

Problem Description
This section focuses on a scenario in which swarm drones transmit collected information back to the ground station in real time, and it models the real-time communication between the swarm drones and the ground station with the assistance of a relay drone. Assume that the working frequency of the relay drone is $f_n^{Ruav}$ and that this channel comprises $M/2$ subcarriers for the paired transmission of information from the swarm drones. During this process, the relay drone often faces a certain degree of malicious interference. Considering the high-speed movement of drones and the complex electromagnetic environment, their channel state information (CSI) may become imperfect or outdated, so the traditional assumption of perfect CSI is unrealistic. By referring to current mainstream methods for CSI estimation and prediction, such as deep learning algorithms [19], adaptive algorithms [20], and coding techniques [21], it is possible to dynamically predict and compensate for the quality of the CSI, thereby addressing the issues related to outdated and imperfect CSI. According to the principles of Non-Orthogonal Multiple Access (NOMA), the sender transmits data by allocating different levels of transmission power to different users, and the receiver decodes the signals sequentially by detecting the gain differences of the received signals, thus canceling the links with smaller channel gains [22]. In the designated task scenario, the communication link gains from swarm drone $i$ and swarm drone $j$ to the relay drone are $H_i^w(k)$ and $H_j^w(k)$, respectively. When $H_i^w(k) < H_j^w(k)$, the relay drone decodes the link from swarm drone $j$ first; otherwise, it decodes the link from swarm drone $i$ first. The power allocation factor set for the swarm drones is $\lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_M\}$, and the swarm drones select only one channel for data transmission in each time slot. On this basis, an SINR is defined in time slot $k$ for the relay drone working on the $f_n^{Ruav}$ channel and receiving the transmission signal from swarm drone $i$, and a Signal-to-Noise Ratio (SNR) is defined in time slot $k$ for the ground station receiving the transmission signal from the relay drone. The relay drone employs an amplify-and-forward method to transmit information. Following reference [23], these two quantities give the SNR in time slot $k$ of the two-hop link from cluster drone $i$ to the ground station, and from it the communication rate in time slot $k$ between cluster drone $i$ and the ground station.
Because of the dynamic characteristics of the relay drone, the channel conditions for its assisted communication change in real time. Dynamically adjusting the channel modulation method is therefore an important strategy for responding flexibly to real-time channel variations. Hence, this paper introduces Quadrature Amplitude Modulation (QAM) into relay drone-assisted communication to make the auxiliary communication as efficient as possible, and the communication rate in time slot $k$ between cluster drone $i$ and the ground station is modified accordingly. In this case, $U(k)$ represents the modulation order at time $k$, and the set of selectable modulation orders is $\{2, 4, 16, 64\}$. Summing over the entire mission period gives the total communication rate between the cluster drones and the ground station. It follows that the total communication rate achievable by the ground station depends on the power allocation factors of the cluster drones and on the modulation method, transmission power, and three-dimensional trajectory of the relay drone. However, it is also necessary to consider the physical constraints of the communication equipment and the performance constraints of the communication process.
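The decoding order and the two-hop link described above can be sketched as follows. Three assumptions are made, none of which is confirmed by the text: the end-to-end SNR uses the common variable-gain amplify-and-forward expression $\gamma_1 \gamma_2 / (\gamma_1 + \gamma_2 + 1)$, the rate is the Shannon bound $\log_2(1 + \gamma)$, and the effect of the modulation order $U(k)$ is modeled as a cap of $\log_2 U(k)$ bits per symbol. The paper's own expressions, which follow references [22] and [23], may differ. All function names are hypothetical.

```python
import math

MODULATION_ORDERS = (2, 4, 16, 64)  # selectable orders listed in the text


def sic_decode_order(gains):
    """Indices of the cluster drones, strongest link first (decoded first)."""
    return sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)


def af_end_to_end_snr(snr_first_hop, snr_second_hop):
    """Common variable-gain amplify-and-forward two-hop SNR (assumed form)."""
    return (snr_first_hop * snr_second_hop) / (snr_first_hop + snr_second_hop + 1.0)


def capped_rate_bits_per_symbol(snr, modulation_order):
    """Shannon bound capped by log2(U) bits per QAM symbol (illustrative assumption)."""
    if modulation_order not in MODULATION_ORDERS:
        raise ValueError("unsupported modulation order")
    return min(math.log2(1.0 + snr), math.log2(modulation_order))


if __name__ == "__main__":
    gains = [0.2, 0.9, 0.5]  # hypothetical link gains H_i^w(k)
    print("decode order:", sic_decode_order(gains))
    snr = af_end_to_end_snr(100.0, 50.0)
    for order in MODULATION_ORDERS:
        print(order, round(capped_rate_bits_per_symbol(snr, order), 3))
```

The end-to-end SNR is always below the weaker of the two hops, so improving only one hop (for example, by moving the relay toward the cluster) has diminishing returns.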
The transmission power of the relay drone must not exceed its maximum transmission power, $P_{max}^R$, and the energy consumption for auxiliary communication must not exceed the maximum energy storage capacity of the relay drone, $E_{max}^{Ruav}$. Analogous constraints apply to all drones in the cluster. Additionally, in each time slot, the horizontal movement of the relay drone cannot exceed the distance that would be covered at its maximum horizontal speed, $V_{max}^{hor}$. Similarly, the vertical movement of the relay drone cannot exceed the distance that would be covered at its maximum vertical speed, $V_{max}^{ver}$, and the altitude of the relay drone must stay within an appropriate range. In this case, $H_l$ and $H_h$ represent the lower and upper flight altitude limits for the relay drone, with the starting point coordinates given by $UAV_{init}^R = (x_{init}, y_{init}, z_{init})$ and the ending point coordinates by $UAV_{term}^R = (x_{term}, y_{term}, z_{term})$, and the trajectory must also satisfy the corresponding movement constraints. Since the auxiliary communication operates on a two-hop model, there is a single time slot delay in relay transmission [24]. Therefore, the transmission power of the cluster drones at the final time slot and that of the relay drone at the initial time slot are set to zero, where $K$ represents the last task slot. To ensure that the data volume returned by the cluster drones is sufficient for analysis and decision making at the ground station, the total communication rate of the cluster drones is required to exceed a specified threshold, $R_l$. Because the relay drone switches between modulation modes according to the channel conditions, the formula for calculating the bit error rate (BER) varies with the modulation order [25]. Therefore, for each communication link formed between the cluster drones and the ground station, the BER must meet the threshold requirement $BER_q$.
Thus, the optimization model for the communication rate of cluster drones under interference conditions is constructed subject to constraints (13)-(26) and (28). The model integrates several groups of constraints, as follows: channel selection constraints (37) and (38), physical constraints of relay-assisted communication (20)-(30), and communication service performance constraints (31)-(33), (35), (39), and (40). The problem is formulated as a highly coupled nonlinear mixed-integer optimization model involving both discrete and continuous variables, which presents significant computational challenges due to its complexity. To tackle this effectively, the approach is divided into two steps: first, an intelligent channel selection scheme is employed to choose interference-free channels; second, the relay's modulation method, transmission power, and three-dimensional trajectory and the cluster drones' power allocation factors are jointly optimized. This method aims to maximize the data transmission rate of the relay drone under interference conditions while simplifying the optimization process and enhancing solution efficiency.
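The movement and altitude constraints described above can be checked for a candidate relay trajectory with a few lines of code. The sketch assumes one waypoint per time slot, a slot length in seconds, and exact start and end waypoints; the function name, the tolerance, and the example numbers are hypothetical, and the paper's own constraint equations are not reproduced here.

```python
import math


def trajectory_feasible(path, slot_s, v_hor_max, v_ver_max,
                        h_low, h_high, start, end, tol=1e-9):
    """Check per-slot speed limits, the altitude band, and the end points.

    `path` is a list of (x, y, z) waypoints, one per time slot.
    """
    if not path or path[0] != start or path[-1] != end:
        return False
    if any(not (h_low <= z <= h_high) for _, _, z in path):
        return False
    for (x0, y0, z0), (x1, y1, z1) in zip(path, path[1:]):
        # Horizontal displacement is bounded by the maximum horizontal speed.
        if math.hypot(x1 - x0, y1 - y0) > v_hor_max * slot_s + tol:
            return False
        # Vertical displacement is bounded by the maximum vertical speed.
        if abs(z1 - z0) > v_ver_max * slot_s + tol:
            return False
    return True


if __name__ == "__main__":
    path = [(0.0, 0.0, 100.0), (10.0, 0.0, 100.0), (20.0, 0.0, 105.0)]
    ok = trajectory_feasible(path, slot_s=1.0, v_hor_max=10.0, v_ver_max=5.0,
                             h_low=50.0, h_high=150.0,
                             start=path[0], end=path[-1])
    print("feasible:", ok)
```

A check of this kind is useful as a guard around any iterative solver for the trajectory subproblem, since an intermediate iterate that violates a speed or altitude bound should not be accepted.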

Intelligent Channel Selection Scheme
Because of the dynamic nature of jammer interference patterns, conventional anti-interference algorithms often struggle to cope effectively. Deep reinforcement learning, as a potent decision-making tool within artificial intelligence, is capable of adapting to dynamic environments and achieving optimal outcomes. However, deep reinforcement learning algorithms depend strongly on hyperparameters, which can be a limiting factor in their application. Therefore, this section proposes an intelligent channel selection scheme based on Bayesian optimization within a Discrete Soft Actor-Critic (BO_DSAC) framework. This approach is designed to facilitate channel selection for relay drones, thereby mitigating the impact of dynamic interference on the communication of cluster drones.

Discrete Soft Actor-Critic Algorithm
The DSAC algorithm is discussed primarily in terms of the agent and the interfering environment. The agent mainly consists of the following three components: an action selection network, a value evaluation network, and an entropy structure. The Soft Actor-Critic algorithm adds an entropy component to the previous actor-critic network architecture to enhance the algorithm's ability to explore global solutions and prevent convergence to local optima. Entropy is a measure of uncertainty, with higher randomness corresponding to greater entropy values. In the DSAC algorithm, the presence of entropy shifts the focus of the evaluation network from determining the cumulative rewards of actions for each state to exploring actions that could yield the maximum cumulative rewards in a given state. This adjustment addresses the shortcomings of traditional AC architectures in deep reinforcement learning. The working framework of this architecture is illustrated in Figure 3. For ease of expression, this paper denotes the soft predictive value network $Q_{soft}$ as $Q$.
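The role of the entropy term can be illustrated with the soft state value commonly used in discrete Soft Actor-Critic, $V(s) = \sum_a \pi(a|s)\,[Q(s,a) - \alpha \log \pi(a|s)]$, where $\alpha$ is the entropy coefficient. This is the standard form from the discrete SAC literature and is given here as an assumption; it is not claimed to be identical to the expressions used in this paper. The function names and example numbers are hypothetical.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_state_value(probs, q_values, alpha):
    """V(s) = sum_a pi(a|s) * (Q(s, a) - alpha * log pi(a|s))."""
    return sum(p * (q - alpha * math.log(p))
               for p, q in zip(probs, q_values) if p > 0.0)


if __name__ == "__main__":
    q = [1.0, 3.0, 2.0, 0.5]          # hypothetical Q-values for four channels
    greedy = [0.0, 1.0, 0.0, 0.0]     # always pick the best-looking channel
    uniform = [0.25, 0.25, 0.25, 0.25]
    for name, pi in (("greedy", greedy), ("uniform", uniform)):
        print(name, round(policy_entropy(pi), 3),
              round(soft_state_value(pi, q, alpha=0.2), 3))
```

With $\alpha = 0$ the soft value reduces to the ordinary expected Q-value; a positive $\alpha$ rewards policies that keep some probability on every channel, which is what encourages continued exploration when the jammer changes its pattern.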
From the diagram, it can be observed that the Discrete Soft Actor-Critic algorithm architecture consists of one policy network $\pi(\cdot)$ and four evaluation networks $Q(\cdot)$. The evaluation networks are divided into two sets: predictive value networks and their corresponding target value networks. The configuration within each set aims to prevent overestimation and ensure the stability of the learning process, while the arrangement among the sets is designed to enhance learning efficiency and further improve stability. Unlike algorithms for continuous action spaces, in discrete action spaces only the state is input into the evaluation network, which then outputs values for all possible actions corresponding to that state, denoted as $S \to \mathbb{R}^{|A|}$. The loss function of the evaluation network is denoted $L_Q(\omega_u)$.
In this setup, $r_t(s_t, a_t)$ represents the expected reward value when action $a_t$ is executed in state $s_t$; $\alpha$ denotes the entropy coefficient, which signifies the weight of the entropy; $H(\pi(s'; \theta_u)) = -\alpha\,\pi(s'; \theta_u)^{T} \ln(\pi(s'; \theta_u))$ represents the entropy term; and $D_J$ indicates the experience replay buffer used to store interaction data between the agent and the environment. Regarding the policy network, its input remains the state, but its output is the probability distribution over discrete actions, unlike the mean and variance output in continuous action spaces, denoted as $S \to [0, 1]^{|A|}$. The loss function of the policy network is denoted $L_\pi(\theta_u)$.
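To make the role of the entropy term concrete, the following minimal sketch computes the soft target used by the evaluation networks in the commonly used discrete Soft Actor-Critic formulation. It is an illustration under assumptions, not necessarily the exact expression of $L_Q(\omega_u)$ in this paper: the function names `soft_state_value`, `soft_td_target`, and `critic_loss` are hypothetical, and taking the minimum of two target networks is the standard device against overestimation.

```python
import math


def soft_state_value(probs, q1, q2, alpha):
    """Entropy-regularised value of a state under a discrete policy.

    probs  : action probabilities pi(s') output by the policy network
    q1, q2 : per-action values from the two target value networks
    alpha  : entropy coefficient
    """
    value = 0.0
    for p, a, b in zip(probs, q1, q2):
        if p > 0.0:  # the term 0 * ln(0) is taken as 0
            value += p * (min(a, b) - alpha * math.log(p))
    return value


def soft_td_target(reward, gamma, probs, q1, q2, alpha):
    """One-step target: reward plus discounted soft value of the next state."""
    return reward + gamma * soft_state_value(probs, q1, q2, alpha)


def critic_loss(q_pred, target):
    """Squared error between a predicted Q-value and its soft target."""
    return 0.5 * (q_pred - target) ** 2
```

For a uniform policy over two actions the entropy bonus is largest; for a deterministic policy it vanishes and the soft value reduces to the smaller of the two target estimates for the chosen action.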
Regarding entropy, the entropy coefficient $\alpha$ determines the importance of the entropy term. To enable the entropy coefficient to adapt at different stages of learning, its loss function is designed as $L(\alpha)$, in which $H$ represents the target entropy. Applying gradient descent to these loss functions updates the network parameters. Under the framework of the Discrete Soft Actor-Critic algorithm, and in consideration of the actual conditions of the interference environment, the required basic elements are defined as follows:

State Space: The state space $S_i$ should include the physical quantities that the relay drone needs to compute, primarily characterized by the energy state of the channel. In its definition, $S^{i}_{mix}$ represents the energy state of channel $i$. When the channel is idle, only the environmental noise $n_0 B_n^{Ruav}$ is present. During normal communication, the energy is characterized by $P_{nc}$. Based on prior information, when the cluster drones communicate at full power, the maximum reception power of the relay drone is $P_{rmax} = \lambda_i P_F H_i^{w}$. When there is interference, the channel energy is represented by $P_{if}$. The current energy state of the channel can be used to determine the communication status of the channel.
Action Space: The action space $A_i$ represents the possible choices the relay drone can make regarding channel selection during time slot $k$. The size of the action space is $2^F$, where a = 0 indicates that the channel is not selected and a = 1 indicates that the channel is selected. During each interaction, the agent selects only one channel for communication.
Reward Function: The real-time reward function $R_i$ measures the merit of action $a$ and is designed as follows. The relay drone receives the maximum reward if it successfully avoids interference while maintaining the same working channel. A slightly lower reward is given if the drone successfully avoids interference by switching channels. If the drone fails to avoid interference and does not change channels, it receives zero reward. However, if it changes channels and still fails to avoid interference, a penalty is applied. In this setup, ∆r is set to 0.4 to measure the reward for channel switching decisions.
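The four reward cases described above can be sketched as a small function. Only ∆r = 0.4 and the ordering of the four cases come from the text; the maximum reward of 1.0, the switching reward of 1.0 − ∆r, and the penalty of −∆r are assumptions chosen for illustration, and the name `channel_reward` is hypothetical.

```python
DELTA_R = 0.4  # value of Delta r stated in the text


def channel_reward(avoided_interference, switched_channel,
                   delta_r=DELTA_R, r_max=1.0):
    """Reward for one channel-selection decision of the relay drone.

    r_max and the exact offsets are illustrative assumptions; only the
    ordering (stay > switch > fail-and-stay > fail-and-switch) is from the text.
    """
    if avoided_interference and not switched_channel:
        return r_max              # best case: interference avoided, channel kept
    if avoided_interference and switched_channel:
        return r_max - delta_r    # avoided, but a channel switch was needed
    if not switched_channel:
        return 0.0                # jammed without switching: zero reward
    return -delta_r               # switched and still jammed: penalty
```

With these values the four outcomes are ranked 1.0 > 0.6 > 0.0 > −0.4, so the agent is encouraged to stay on a clean channel and discouraged from switching needlessly.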

Bayesian Optimization Algorithm
Bayesian optimization is an efficient global optimization algorithm that uses a statistical approach to find good hyperparameters with few evaluations. Its fundamental principle is Bayes' theorem [26]:

$$p(X \mid \Theta) = \frac{p(\Theta \mid X)\, p(X)}{p(\Theta)}$$

In this context, $X$ represents the distribution model to be specified, $\Theta = \{(x_i, y_i) \mid y_i = X(x_i)\}$ is the set of observed data, $p(\Theta \mid X)$ denotes the likelihood distribution, and $p(X)$ is the prior probability. In this paper, the decision variables $\{x_i\}$ include the discount factor $\gamma_u$, learning rate $lr_u$, and target entropy $H$. The cumulative reward function is the target value $y_i$. The prior and posterior models are Gaussian. The acquisition function, based on the Expected Improvement (EI) function, is used to select the next evaluation point $x_{i+1}$ from the posterior distribution; in its standard form,

$$x_{i+1} = \arg\max_{x}\; \mathbb{E}\left[\max\left(0,\; X(x) - X^{+}\right)\right]$$

In this context, $X^{+}$ represents the current known maximum value simulated by the Gaussian process. This establishes a probabilistic optimization model for the discount factor, learning rate, and target entropy. By continuously selecting the next point through the acquisition function, the process iterates until the optimal solution is found.
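Under a Gaussian posterior with mean μ and standard deviation σ at a candidate point, the Expected Improvement has a well-known closed form, (μ − X⁺)Φ(z) + σφ(z) with z = (μ − X⁺)/σ. The sketch below evaluates it with the standard library only. The function names and the candidate tuples are hypothetical, and the posterior means and deviations are assumed to come from a Gaussian process fitted elsewhere.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximisation, given the posterior mean and std."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)  # no uncertainty left at this point
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)


def next_point(candidates, best):
    """Pick the candidate (x, mu, sigma) with the largest EI."""
    return max(candidates, key=lambda c: expected_improvement(c[1], c[2], best))[0]
```

A candidate whose mean equals the current best still has positive EI when its uncertainty is nonzero, which is what drives exploration of untested hyperparameter combinations.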
In summary, the implementation process of the BO_DSAC intelligent channel selection scheme is detailed in Algorithm 1.

Algorithm 1 BO_DSAC intelligent channel selection scheme
Inputs: discount factor $\gamma_u$, learning rate $lr_u$, and target entropy $H$, each with its respective optimization range; entropy coefficient $\alpha$; maximum number of training iterations num_episodes; Markov chain length MDPL; soft update parameter $\tau_{ls}$; capacity of the replay buffer $p_{size}$; batch size for data processing batch_size; and number of iterations for Bayesian optimization BSN. These parameters collectively set the framework for optimizing the intelligent channel selection strategy.
Output: policy function $\pi_B(\cdot)$ and value function $Q_B(\cdot)$.
1: Input the parameters listed above.
2: Initialize the replay buffer $D_J = \emptyset$ and the network parameters of the policy and evaluation networks $\theta_u$, $\omega_{u,i}$, $i = 1, 2, 3, 4$.
7: Perform the action, receive the reward, and form the data tuple $\langle s, a, r, s' \rangle$.
9: Store the data tuple in the replay buffer: $D_J = D_J \cup \langle s, a, r, s' \rangle$.
10: if $|D_J| > p_{size}$
11: Delete the earliest data tuple from the replay buffer.
15: Calculate the loss function of the soft predictive value network $L_Q(\omega_u)$.
16: Calculate the loss function of the soft policy network $L_\pi(\theta_u)$.
17: Calculate the entropy loss function $L(\alpha)$.
18: Minimize the $L_Q(\omega_u)$ loss using the data tuples to train the network parameters.
19: Maximize the $L_\pi(\theta_u)$ using the data tuples to train the network parameters.
20: Minimize the $L(\alpha)$ loss using the data tuples to adaptively adjust the entropy coefficient.
22: end if
23: end for
24: end for
25: Output the final policy function $\pi^{*}(\cdot)$ and value function $Q^{*}(\cdot)$.

The computational overhead of the Bayesian-optimized Discrete Soft Actor-Critic algorithm comprises two parts: the Discrete Soft Actor-Critic algorithm and the Bayesian optimization process. The Discrete Soft Actor-Critic algorithm includes the complexity of network forward and backward propagation $O(N_n)$, where $N_n$ denotes the number of network layers; the complexity of experience replay $O(batch\_size)$; and the complexity of policy and value updates $O(kN_n)$, with $k$ being the number of network updates. Hence, its computational overhead is $O(batch\_size \cdot N_n)$. The second part involves the Bayesian optimization process, which includes the complexity of Gaussian process training $O(u^3)$, prediction complexity $O(u)$, and acquisition function optimization complexity $O(u_r)$, with $u_r$ representing the number of evaluated candidate points. Therefore, the overall computational complexity of the proposed algorithm in this section is $O(u^3 + u \cdot u_r + u_r + batch\_size \cdot N_n)$, primarily dominated by $O(u^3)$.
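Steps 9 to 11 of Algorithm 1 (store a tuple, then discard the earliest one once the capacity is exceeded) can be sketched with a bounded double-ended queue. The class name `ReplayBuffer` and its methods are hypothetical and serve only as an illustration of the buffer behaviour.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience replay buffer."""

    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest entry automatically when full
        self.data = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        """Append one interaction tuple <s, a, r, s'>."""
        self.data.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        """Draw a mini-batch uniformly without replacement."""
        return rng.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)
```

Because `deque` enforces the capacity on every append, no explicit size check is needed in the training loop.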

Joint Optimization Scheme for Communication Rate
After intelligent channel selection, the relay drone can avoid the impact of jammers and transmit information on interference-free channels. To decouple the multivariable mixed-integer nonlinear optimization problem efficiently, this section divides the joint optimization of the communication rate into the following four subproblems: relay drone modulation method optimization, relay drone transmission power optimization, three-dimensional trajectory optimization, and cluster drone power allocation factor optimization. These subproblems are solved alternately and iteratively until convergence, thereby maximizing the communication rate of the cluster drones.

Relay Drone Modulation Method Optimization
To study the optimization of the relay drone modulation method $U(k)$, it is necessary to fix the power allocation factor $\lambda^{(c-1)}(k)$ of the cluster drones, the transmission power $P^{R,(c-1)}(k)$ of the relay drone, and the trajectory $Q^{(c-1)}(k)$ obtained during the $(c-1)$th iteration in time slot $k$. The original optimization problem $C_1$ can then be simplified to a 0-1 integer programming problem, which can be solved directly with the Mosek optimization toolkit to obtain $U^{(c)}(k)$.
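For a small set of candidate modulation orders, a 0-1 selection problem of this kind can also be solved by direct enumeration, which may help to see what the solver returns. The sketch below is an illustration under assumptions: it uses the widely quoted approximation BER ≈ 0.2 exp(−1.5·SNR/(M − 1)) for square M-QAM rather than the BER expression of this paper, the candidate orders are arbitrary, and the function names are hypothetical.

```python
import math


def approx_mqam_ber(snr_linear, order):
    """Approximate BER of square M-QAM (assumed model, not the paper's)."""
    return 0.2 * math.exp(-1.5 * snr_linear / (order - 1))


def select_modulation(snr_linear, ber_threshold, orders=(4, 16, 64, 256)):
    """Return a 0-1 vector selecting the highest order meeting the BER limit.

    An all-zero vector means no candidate satisfies the threshold.
    """
    best_index = None
    for index, order in enumerate(orders):
        if approx_mqam_ber(snr_linear, order) <= ber_threshold:
            if best_index is None or order > orders[best_index]:
                best_index = index
    selection = [0] * len(orders)
    if best_index is not None:
        selection[best_index] = 1
    return selection
```

At a linear SNR of 100 (20 dB) and a BER threshold of 1e-3, only 4-QAM and 16-QAM are feasible under this approximation, so the sketch selects 16-QAM.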

Optimization of Relay Drone Transmission Power
To study the optimization of the relay drone transmission power $P^{R}(k)$, it is necessary to fix the power allocation factor $\lambda^{(c-1)}(k)$ of the cluster drones and the 3D trajectory $Q^{(c-1)}(k)$ from the $(c-1)$th round of iterations within time slot $k$, together with the modulation method $U^{(c)}(k)$ of the relay drone that was just updated in the $c$th round. Consequently, the original optimization problem $C_1$ can be simplified to problem $C_5$. Problem $C_5$ is convex and can be solved by directly applying the CVX toolkit to obtain a global suboptimal solution $P^{R,(c)}(k)$.
Proof: Let the objective function be denoted $f(P^{R,(c)}(k))$ and take its second-order derivative. From the constraints and the physical significance of each variable, $c^{pr}_1 \ge 0$, $c^{pr}_2 \ge 0$, $c^{pr}_3 \ge 0$, and $P^{R,(c)}(k) \ge 0$, so the second-order derivative satisfies $f''(P^{R,(c)}(k)) \le 0$ and the objective function is concave. The constraints also form a convex set; hence, the optimization problem $C_5$ is convex, as required.
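As an illustration of the argument only, consider a simplified single-link rate with one nonnegative constant $c$. This form is an assumption made for the example and is not the objective of problem $C_5$, which involves the constants $c^{pr}_1$, $c^{pr}_2$, and $c^{pr}_3$. The sign of the second derivative then follows directly:

```latex
\begin{align}
f(P)   &= \log_2(1 + cP), \qquad c \ge 0,\; P \ge 0, \\
f'(P)  &= \frac{c}{(1 + cP)\ln 2}, \\
f''(P) &= -\frac{c^{2}}{(1 + cP)^{2}\ln 2} \;\le\; 0 .
\end{align}
```

Since $f''(P) \le 0$ on the whole feasible interval, $f$ is concave there, and maximizing it over a convex feasible set is a convex problem.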

Optimization of Three-Dimensional Trajectory for Relay Drones
To study the optimization of the three-dimensional trajectory $Q(k)$ of the relay drone, it is necessary to fix the power allocation factor $\lambda^{(c-1)}(k)$ of the cluster drones from the $(c-1)$th round of iterations within time slot $k$, together with the modulation method $U^{(c)}(k)$ and the transmission power $P^{R,(c)}(k)$ of the relay drone that were just updated in the $c$th round. The original optimization problem $C_1$ then reduces to problem $C_6$. In $C_6$, the objective function is nonconvex, depends nonlinearly on $Q(k)$, and has interdependent components. In addition, the constraints do not meet the requirements of a convex problem, so $C_6$ is a nonlinear nonconvex optimization problem. Because $\partial SNR_{i,S}/\partial SINR^{Ruav}_{i,R} \ge 0$ and $\partial SNR_{i,S}/\partial SNR^{Ruav}_{R,S} \ge 0$, the objective function can be transformed. The transformed objective function is still difficult to solve, so this paper applies the quadratic transformation to it, introducing $\phi_1$ and $\phi_2$ as the new objective function values. The coefficients $a_i$ and $b_i$ of the transformed objective are computed from the distances of the two hops obtained in the previous iteration, and the optimization problem becomes $C_7$, subject to constraints (24)-(30) and (33). Unfortunately, because of the terms $1/D^{*(k)}_{RF_i}$ and $1/D^{*(k)}_{RF_j}$, problem $C_7$ is still nonconvex and difficult to solve. This paper therefore applies successive convex approximation [27], replacing these terms with their first-order Taylor expansions. Problem $C_8$ is obtained by substituting Equations (60), (61), and (64) into problem $C_7$. Since $\log_2(\cdot)$ is a monotonically increasing concave function, the quadratic transformation of the fractional terms in problem $C_6$ yields a concave function and, following Reference [28], problem $C_8$ obtained after the quadratic transformation of $C_6$ is convex. It can be solved with the CVX toolkit to obtain the global suboptimal solution.
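The quadratic transformation can be illustrated on a toy one-dimensional ratio. The sketch below maximizes A(x)/B(x) with A(x) = x (concave) and B(x) = 1 + x² (convex) by alternating between the closed-form auxiliary variable y = √A/B and the maximizer of 2y√A − y²B. The toy functions and the name `quadratic_transform_maximize` are assumptions made for this example and are unrelated to the SINR expressions of problem $C_6$.

```python
import math


def quadratic_transform_maximize(x0=0.5, iterations=50):
    """Maximise x / (1 + x^2) for x > 0 with the quadratic transform.

    Each iteration fixes x to update the auxiliary variable y, then fixes y
    and maximises the concave surrogate 2*y*sqrt(x) - y^2 * (1 + x^2).
    """
    x = x0
    for _ in range(iterations):
        # Optimal auxiliary variable for the current x
        y = math.sqrt(x) / (1.0 + x * x)
        # Stationary point of the surrogate: y/sqrt(x) = 2*y^2*x
        x = (1.0 / (2.0 * y)) ** (2.0 / 3.0)
    return x, x / (1.0 + x * x)
```

The iteration converges to x = 1, where the ratio attains its maximum of 0.5; each surrogate subproblem is concave, which is the property that lets a convex solver handle the transformed trajectory problem.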

Optimization of Power Allocation Factor for Cluster Drones
To study the optimization of the power allocation factor $\lambda$ for cluster drones, it is necessary to fix the recently updated modulation method $U^{(c)}(k)$ of the relay drone, the transmission power $P^{R,(c)}(k)$, and the three-dimensional trajectory $Q^{(c)}(k)$ from the $c$th round within time slot $k$. Given $\partial SNR_{i,S}/\partial SINR^{Ruav}_{i,R} \ge 0$, and with the other variables held constant, the optimization of the transmission power for cluster drones across different time slots is independent [29]. Let $\varphi$ be the new objective function value. The objective function $\varphi^{(c)}$ is still not convex and the problem remains difficult to solve; fortunately, Equation (66) is satisfied. Thus, the original optimization problem $C_1$ can be reformulated as problem $C_9$, subject to constraints (23), (31), (33), and (39). In problem $C_9$, the objective function $\varphi^{(c)}$ is a concave function of the optimization variables $\lambda^{(c)}(k)$.
Proof: Take the second-order derivative of $\varphi^{(c)}$. From the constraints on the variables and their physical significance, the second-order derivative satisfies $d^{2}\varphi^{(c)}/d\lambda^{2} \le 0$, so the objective function is concave. The constraints also form a convex set, so the optimization problem is convex, as required. Therefore, problem $C_9$ can be solved with the CVX toolkit to obtain a global suboptimal solution for $\lambda^{(c)}(k)$.
In summary, because the solver employs mixed-integer computation and the interior-point method to solve each subproblem, the computational complexity of the joint optimization of the relay drone communication rate is $O((WIV)^3)$, where $W$ is the total number of iterations of the algorithm, $I$ is the maximum number of iterations for the subproblems, and $V$ is the maximum dimension of the decision variables. The algorithm execution process is shown in Algorithm 2.

Algorithm 2 Flow of joint communication rate optimization algorithm
Input: the coordinates of the cluster drones, the relay drone, the jammer, and the ground station, and constraints such as the BER threshold.
Output: $U^{*}$, $(P^{R})^{*}$, $Q^{*}$, $\lambda^{*}$.
1: Input the coordinates of the cluster drones, the relay drone, the jammer, and the ground station, and constraints such as the BER threshold, to calculate the initial feasible solution $P^{R,(0)}$, $Q^{(0)}$, $\lambda^{(0)}$, and $U^{(0)}$. Initialize the optimal value storage space ℵ, the current iteration count c = 1, the maximum number of iterations $c_{max}$, and the precision µ.
Optimize the modulation method of the relay drone:
5: Apply the Mosek toolkit to solve problem $C_3$ and obtain a suboptimal solution $U^{(c)*}$ for the relay drone modulation method.
6: Update the suboptimal solution for the relay drone modulation method obtained in the $(c-1)$th round of iterations with $U^{(c)*}$.
7: Optimize the transmission power of the relay drone:
8: Input the initial values $U^{(c)*}$, $Q^{(c-1)*}$, and $\lambda^{(c-1)*}$.
9: Calculate the transmission power constraint corresponding to the BER threshold from Equation (52).
10: Use the CVX toolkit to solve problem $C_5$ and obtain the suboptimal solution $P^{R,(c)*}$ for the transmission power of the relay drone.
11: Update the suboptimal solution $P^{R,(c)*}$ for the relay drone's transmission power obtained in the $c$th round of iteration.
12: Optimize the three-dimensional trajectory of the relay drone:
13: Input the initial values $U^{(c)*}$, $P^{R,(c)*}(k)$, and $\lambda^{(c-1)*}$.
14: Input the two-hop channel gain values obtained in the previous iteration and calculate the values of $a_i$ and $b_i$ according to Equations (60) and (61).
and calculate the first-order Taylor expansion expression to obtain $H_i^{w,(c)}$
24: Referring to the channel gain order, rewrite the objective function according to Equation (67) to obtain $\varphi^{(c)}$.
25: Use the CVX toolbox to solve problem $C_9$ and obtain the suboptimal solution $\lambda^{(c)*}$ for the power allocation factor of cluster drones.
26: Update the suboptimal solution $\lambda^{(c)*}$ for the power allocation factor of cluster drones obtained in the $c$th round of iteration.
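The alternating structure of Algorithm 2 (update one block of variables with the others fixed, then stop once the objective improves by less than the precision µ or the iteration cap $c_{max}$ is reached) can be sketched on a toy concave function of two variables. The objective, the closed-form block updates, and the name `alternating_maximize` are assumptions made for illustration; in the paper each block update is a Mosek or CVX solve.

```python
def objective(x, y):
    """Toy concave objective with its maximum of 1.0 at (1, 1)."""
    return -x * x - y * y + x * y + x + y


def alternating_maximize(x=0.0, y=0.0, c_max=100, mu=1e-10):
    """Block-coordinate ascent with the stopping rule of an alternating scheme."""
    previous = objective(x, y)
    current = previous
    c = 0
    for c in range(1, c_max + 1):
        x = (y + 1.0) / 2.0   # exact maximiser over x with y fixed
        y = (x + 1.0) / 2.0   # exact maximiser over y with the new x fixed
        current = objective(x, y)
        if current - previous < mu:  # improvement below the precision
            break
        previous = current
    return x, y, current, c
```

Each block update can only increase the objective, so the sequence of objective values is nondecreasing and bounded, which is why such schemes settle after a few rounds.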

Experimental Simulation
To verify the joint optimization of the communication rate for cluster drones under interference conditions proposed in this paper, two comparative experiments were set up. These experiments examine the channel selection performance of the relay drone and the optimization of the total communication rate. It is assumed that the formation spacing of cluster drones is 200 m and that their information backhaul tasks are performed within a three-dimensional space of 3500 m × 3500 m × 100 m.

Verification of the BO_DSAC Algorithm's Effectiveness
The purpose of the first set of experiments was to verify whether the relay drone's channel selection using the BO_DSAC algorithm can effectively avoid external malicious interference. The types of interference set in this study included constant frequency interference, sweep frequency interference, and hybrid frequency interference. The number of channels affected by the constant frequency interference was F/2, the number of sweep frequency interference channels was F/2, and each cycle moved down six channels in the decreasing frequency direction. The hybrid frequency interference consisted of constant frequency interference on three channels combined with sweep frequency interference that moved down three channels each cycle in the decreasing frequency direction. To verify the effectiveness and superiority of the algorithm presented in this paper, it is compared with the Convolutional Neural Network-Enhanced DQN (Conv_DQN) algorithm, the standard DQN algorithm, and random selection. To ensure fairness, the three learning algorithms were set with the same number of training sessions (num_episodes = 1000), number of interactions (MDPL = 200), network layers ($N_n$ = 256), learning rate ($lr_u = 5 \times 10^{-4}$), discount factor ($\gamma_u$ = 0.97), and training batch size (batch_size = 48).
Figure 4 illustrates the anti-interference performance of the BO_DSAC algorithm under a constant frequency interference scenario. The BO_DSAC algorithm begins to converge gradually between 70 and 100 training rounds, achieving maximum reward values upon convergence. Although the constant frequency interference model is relatively simple, the algorithm achieves the best performance among the compared methods, which reflects the advantage of the approach proposed in this paper. The introduction of entropy in the BO_DSAC algorithm leads to occasional fluctuations in reward values across some rounds; however, this also confirms that the BO_DSAC algorithm maintains a robust search for solutions throughout the training process, preventing it from settling into local optima. The graph also shows that, compared to the Conv_DQN and DQN algorithms, the proposed algorithm converges faster and exhibits more pronounced resistance to interference.
Figure 5 shows the comparison of resistance to sweep frequency interference effects.From the graph, it is clear that due to the high dynamism of sweep frequency interference, the resistance effect is not as pronounced as that against constant frequency interference.However, the BO_DSAC algorithm begins to converge after about 30 rounds, demonstrating better performance in terms of timeliness and accuracy compared to the Conv_DQN and DQN algorithms.It adapts more quickly and flexibly to the dynamic environment's impact on the information backhaul of cluster drones.
Figure 6 presents a comparison of resistance to hybrid frequency interference. The BO_DSAC algorithm begins to converge gradually at around 100 rounds and maintains relatively stable resistance to interference after convergence, outperforming the other algorithms. Hybrid frequency interference is a challenging type of interference in real-world scenarios, and its high dynamism, together with the ∆r setting, increases the frequency of channel selection, which might reduce the reward. Even so, the algorithm proposed in this article still offers a robust anti-interference effect.
Figure 7 illustrates the impact of different hyperparameter sets on the BO_DSAC algorithm, conducted against a backdrop of hybrid frequency interference using Bayesian optimization.The optimized hyperparameters include the discount factor γ u * , learning rate lr u * , and target entropy H * .The discount factor is used to determine the current value of future rewards.A higher discount factor makes the algorithm place more emphasis on future rewards, allowing the channel selection to achieve the best results over the entire task period.Conversely, a low discount factor will lead to "short-sightedness", ignoring beneficial long-term behaviors and increasing interference avoidance errors.The learning rate determines the speed of policy updates.An excessively high learning rate may cause instability during training, resulting in significant fluctuations in anti-interference performance.On the other hand, a low learning rate will slow down the learning process, making the anti-interference effects less noticeable and convergence slower.The target entropy is used to control the randomness of the policy.A high entropy value encourages the policy to explore more of the state space, maintaining channel stability while finding channels that successfully avoid interference.Conversely, a low entropy value makes the policy more deterministic, which can lead to local optima and unsatisfactory interference avoidance results.Therefore, appropriate hyperparameters can enhance the algorithm's ability to handle interference and improve the efficiency of relay communication in the drone cluster.
To enhance the anti-interference performance throughout the entire task period and achieve efficient policy exploration, this paper implements dynamic control of the entropy coefficient, maintaining the entropy coefficient without exponential changes with training iterations. The initial set of hyperparameters is denoted as (0.9, $1.0 \times 10^{-4}$, 10). According to [30], mainstream hyperparameter sets that have shown better performance are used for comparison experiments, denoted as (0.99, $5 \times 10^{-4}$, 0.98), corresponding to the BO_DSAC_III and BO_DSAC_II curves in the figure. BO_DSAC_I represents the parameter set obtained by Bayesian optimization, denoted as (0.97, $5 \times 10^{-4}$, 0.6). As shown in the figure, the parameter set obtained by Bayesian optimization improves the anti-interference performance. Compared to the parameters during optimization and the nonoptimized parameter set, the total reward values increased by 7.1% and 3.3947 times, respectively. This demonstrates that Bayesian optimization of the hyperparameters can significantly enhance the problem-solving capability of the algorithm.
ence performance.On the other hand, a low learning rate will slow down the learn process, making the anti-interference effects less noticeable and convergence slower.target entropy is used to control the randomness of the policy.A high entropy value courages the policy to explore more of the state space, maintaining channel stability w finding channels that successfully avoid interference.Conversely, a low entropy v makes the policy more deterministic, which can lead to local optima and unsatisfac interference avoidance results.Therefore, appropriate hyperparameters can enhance algorithm's ability to handle interference and improve the efficiency of relay commun tion in the drone cluster.To enhance the anti-interference performance throughout the entire task period achieve efficient policy exploration, this paper implements dynamic control of the entr coefficient, maintaining the entropy coefficient without exponential changes with train iterations.The initial set of hyperparameters is denoted as  Figures 8-10 illustrate the action selection of three anti-interference algorithms under various interference modes.As seen from the figures, the proposed algorithm exhibits the lowest action selection uniformity under the constant frequency interference mode.This is closely related to the constant frequency interference mode, as the algorithm, after convergence, consistently selects the optimal action, which corresponds to its reward performance.In contrast, under the hybrid frequency interference mode, the action selection uniformity is the highest due to the presence of entropy, which prompts the relay drone to continuously explore new solutions to maximize entropy.As seen from the figures, the proposed algorithm exhibits the lowest action selection uniformity under the constant frequency interference mode.This is closely related to the constant frequency interference mode, as the algorithm, after convergence, consistently selects the optimal action, which corresponds 
to its reward performance.In contrast, under the hybrid frequency interference mode, the action selection uniformity is the highest due to the presence of entropy, which prompts the relay drone to continuously explore new solutions to maximize entropy.With the Conv_DQN and DQN algorithms, the action selection was relatively uniform across the three interference modes, with each action in the action space having an occurrence frequency of approximately 10%.Additionally, the Conv_DQN algorithm demonstrated better action uniformity compared to the DQN algorithm.

Verification of the Joint Optimization Algorithm Effectiveness for Cluster Drone Communication Rates
The second set of experiments primarily verified the joint optimization results for the communication rates of cluster drones. To maintain generality and verify the adaptability of the algorithm proposed in this paper to different task trajectories of cluster drones, three distinct types of trajectories are set up: circular, oscillatory, and complex trajectories. The trajectories are depicted in Figure 11.

Figure 12 shows the optimization results for the total communication rate of the cluster drones for the above trajectories. The circular trajectory achieves the highest total communication rate and the oscillatory trajectory the lowest. The total communication rate for the circular trajectory is approximately 3.3 × 10^7 bps higher than that for the complex trajectory, and the rate for the complex trajectory is about 1 × 10^9 bps higher than that for the oscillatory trajectory. For all three trajectories, the total communication rate is effectively improved by the joint optimization and ultimately stabilizes.
To verify the effectiveness of the proposed joint optimization, the cluster drones are set to follow a circular trajectory, and the proposed scheme is compared with three other optimization schemes. The first scheme optimizes the relay drone's trajectory and transmission power together with the power allocation of the cluster drones. The second optimizes the relay drone's trajectory and modulation together with the power allocation of the cluster drones, while the third optimizes the relay drone's modulation and transmission power together with the power allocation of the cluster drones. According to the results depicted in Figure 13, the joint optimization scheme outperforms the others, improving the communication rate by 1.2932 × 10^11 bps, which represents 47.68% of the total optimized communication rate. Moreover, the system stabilizes after the fourth round of optimization iterations, showing good convergence.
Communication delay is primarily composed of propagation delay and transmission delay. Propagation delay is determined by the transmission distance and propagation speed, whereas transmission delay depends on the amount of data to be transmitted and the communication rate. Since all optimization schemes in this study are evaluated under the same task context, the propagation delay is identical across schemes. Therefore, this paper mainly considers the impact of transmission delay on the total communication delay. Assuming the total amount of data to be transmitted is 50 Gbit, Figure 14 compares the communication delays of the different schemes. As shown in the figure, the proposed scheme exhibits excellent delay performance during both the initial and the stable optimization phases, reducing communication delay by 47.68%, 49.04%, and 49.78% relative to the three other schemes.
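The link between the reported rate gain and the reported delay reduction can be checked with a short calculation. The sketch below is illustrative only: it assumes each scheme transmits the 50 Gbit payload at a constant effective rate, and it derives the two rates from the reported gain (1.2932 × 10^11 bps) and its reported share (47.68%) of the optimized total rate. The derived rates and delays are not values quoted in the paper.

```python
# Illustrative only: treats each scheme's total communication rate as a
# constant effective rate, which is a simplifying assumption.

DATA_BITS = 50e9           # 50 Gbit payload, as assumed in the text
RATE_GAIN_BPS = 1.2932e11  # reported gain of the joint scheme over a comparison scheme
GAIN_SHARE = 0.4768        # reported share of the optimized total rate


def transmission_delay(data_bits: float, rate_bps: float) -> float:
    """Transmission delay in seconds for a payload sent at a fixed rate."""
    if rate_bps <= 0:
        raise ValueError("rate must be positive")
    return data_bits / rate_bps


# Rates implied by the two reported figures (derived, not quoted).
rate_joint = RATE_GAIN_BPS / GAIN_SHARE
rate_baseline = rate_joint - RATE_GAIN_BPS

delay_joint = transmission_delay(DATA_BITS, rate_joint)
delay_baseline = transmission_delay(DATA_BITS, rate_baseline)
delay_reduction = 1.0 - delay_joint / delay_baseline

print(f"joint scheme delay:      {delay_joint:.4f} s")
print(f"comparison scheme delay: {delay_baseline:.4f} s")
print(f"relative reduction:      {delay_reduction:.2%}")
```

Because transmission delay is inversely proportional to rate, the relative delay reduction equals the rate gain divided by the optimized rate, which is why the same 47.68% figure appears for both quantities.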
Figure 15 illustrates the changes in the modulation order of the relay drone during the iterative process of joint optimization. As seen in the graph, with the progression of the task cycle and while meeting the bit error rate requirements, the choice of relay drone modulation order shifts from 16 to 4, then gradually changes from 64 to 16, and, finally, stabilizes with 64 occurring 88% of the time and 16 occurring 12% of the time.
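How a modulation order can be chosen under a bit error rate constraint may be sketched as follows. This is a minimal illustration, not the paper's optimization step: it assumes square M-QAM on an AWGN channel and uses the common textbook approximation BER ≈ 0.2·exp(−1.5·SNR/(M−1)). The candidate set, the function names, and the 10^-3 BER target are assumptions made for the example.

```python
import math

# Assumed candidate set; the text mentions orders 4, 16 and 64.
MOD_ORDERS = (4, 16, 64)


def approx_ber_mqam(snr_linear: float, order: int) -> float:
    """Textbook approximation of the M-QAM bit error rate on an AWGN channel."""
    return 0.2 * math.exp(-1.5 * snr_linear / (order - 1))


def select_modulation_order(snr_db: float, ber_target: float):
    """Return the largest order meeting the BER target, or None if none does."""
    snr_linear = 10.0 ** (snr_db / 10.0)
    feasible = [m for m in MOD_ORDERS
                if approx_ber_mqam(snr_linear, m) <= ber_target]
    return max(feasible) if feasible else None


# Hypothetical SNR values spanning the range where the choice changes.
for snr_db in (5, 15, 22, 30):
    print(snr_db, "dB ->", select_modulation_order(snr_db, ber_target=1e-3))
```

Under this model, a higher link SNR admits a higher order, so a relay that improves its SNR over the iterations would be expected to settle on the largest order most of the time.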
Figure 16 shows the changes in the transmission power of the relay drone over iterations. Since the trajectory of the cluster drones varies during the task cycle, the transmission power of the relay drone needs to compensate for the signal-to-noise ratio differences caused by the spatial displacement between the two. As the trajectory of the cluster drones follows a circular pattern, the transmission power level exhibits an oscillatory behavior, aligning with the changes in the scenario.
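The oscillation can be reproduced with a simple distance-based model. The sketch below is an assumption-laden illustration rather than the paper's channel model: it assumes a line-of-sight gain of the form β0/d², a fixed (hypothetical) relay position, and a cluster centre moving on a horizontal circle. All constants are placeholders chosen for the example.

```python
import math

BETA_0 = 1e-4    # assumed channel gain at 1 m reference distance (-40 dB)
NOISE_W = 1e-13  # assumed receiver noise power in watts
P_MAX_W = 1.0    # assumed maximum relay transmit power in watts


def required_power(distance_m: float, snr_target: float) -> float:
    """Power that holds snr_target under gain BETA_0 / d^2, capped at P_MAX_W."""
    gain = BETA_0 / distance_m ** 2
    return min(P_MAX_W, snr_target * NOISE_W / gain)


def circular_position(step: int, steps: int, radius_m: float, height_m: float):
    """Position of the cluster centre on a horizontal circle around the origin."""
    angle = 2.0 * math.pi * step / steps
    return (radius_m * math.cos(angle), radius_m * math.sin(angle), height_m)


relay = (300.0, 0.0, 150.0)  # hypothetical fixed relay position in metres
steps = 8
powers = []
for k in range(steps):
    cluster = circular_position(k, steps, radius_m=500.0, height_m=100.0)
    d = math.dist(relay, cluster)
    powers.append(required_power(d, snr_target=100.0))

print([f"{p * 1e3:.3f} mW" for p in powers])
```

In this toy setting the required power is smallest when the cluster passes closest to the relay and largest on the opposite side of the circle, giving one oscillation per lap.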
Figure 17 illustrates the energy consumption variations in the relay drone during the mission execution period. As shown in the figure, as the mission advances, the drone needs to adjust physical variables such as trajectory and transmission power to maintain the highest signal-to-noise ratio for information transmission. Consequently, throughout the mission period, there is a linear increase in flight energy consumption, total energy consumption per iteration, and total energy consumption when the relay drone operates at full power.
Figure 18 shows the energy consumption of the relay drone during the mission. As seen in the figure, flight energy consumption constitutes the major portion of the total energy consumption. After optimization, the total energy consumption of the relay drone at a stable transmission power is 0.61% lower than when operating at full power. Additionally, the total energy consumption required to complete the mission accounts for 70.89% of the system's stored energy.
Figure 19 shows the changes in the three-dimensional trajectory of the relay drone. Initially, the relay drone's trajectory is a straight line. As the number of iterations increases, the relay drone adjusts its path to balance the maximization of the signal-to-noise ratio across the two-hop links, curving from the starting point to the termination point. The trajectory is smooth, without any interruptions, facilitating a stable communication link between the two hops.
For the power allocation of the cluster drones, the significant spacing between the drone formations means that substantial channel variations can occur among individual drones. Additionally, each subcarrier transmits data from two different drones. The relay drone can therefore effectively distinguish the power levels of different cluster drones during successive interference cancellation (SIC). As a result, the cluster drones can operate at full power for transmitting information throughout the entire mission cycle.
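The effect of pairing two drones with very different channel gains on one subcarrier can be illustrated with the standard two-user uplink SIC rate expressions. This is a sketch under assumptions, not the paper's rate model: the noise power, channel gains, and transmit powers are hypothetical, and the function name is introduced only for this example.

```python
import math

NOISE_W = 1e-13  # assumed noise power per subcarrier in watts


def sic_rates(p_strong, g_strong, p_weak, g_weak, bandwidth_hz=1.0):
    """Uplink rates for two drones on one subcarrier with SIC at the relay.

    The stronger received signal is decoded first with the weaker one treated
    as interference; it is then subtracted before decoding the weaker signal.
    """
    rx_strong = p_strong * g_strong
    rx_weak = p_weak * g_weak
    if rx_strong < rx_weak:
        raise ValueError("first user must have the stronger received power")
    sinr_strong = rx_strong / (rx_weak + NOISE_W)
    snr_weak = rx_weak / NOISE_W
    return (bandwidth_hz * math.log2(1.0 + sinr_strong),
            bandwidth_hz * math.log2(1.0 + snr_weak))


# Hypothetical gains for a near and a far drone, both at full power (0.1 W).
r_near, r_far = sic_rates(0.1, 1e-9, 0.1, 1e-11)
print(f"near drone: {r_near:.2f} bit/s/Hz, far drone: {r_far:.2f} bit/s/Hz")
```

When the channel gains already differ widely, the received power levels are separable at full transmit power, which is consistent with the observation that the cluster drones do not need to back off their power.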
Figure 20 illustrates the impact of variables across different dimensions on the total communication rate. The graph shows that optimizing each variable individually effectively increases the total communication rate. Compared to the unoptimized initial values, the improvements are substantial, with the respective increases reaching 8.0654 × 10^10 bps, 3.7722 × 10^10 bps, 3.2535 × 10^10 bps, and 1.1919 × 10^11 bps.
Based on the results of the two sets of simulation experiments, the joint optimization method for cluster drone relay communication rates under interference conditions proposed in this paper demonstrates good performance. It effectively avoids external interference while significantly enhancing the total communication rate of the cluster drones.

λ^(c−1)(k): Power allocation factor of the drone cluster in the (c−1)-th round
P_{R,(c−1)}(k): Transmission power of the relay drone in the (c−1)-th round
Q^(c−1)(k): Trajectory coordinates of the relay drone in the (c−1)-th round
U^(c)(k): Modulation order set of the relay drone in the c-th round
O^c_{X,Y,Z}: First-order partial derivative value with respect to Q(k)
ℵ^(c): Solution set in the c-th round

Figure 1. Schematic diagram of swarm drone relay communication under interference conditions.

Figure 2. Schematic diagram of the dynamic interference process in the channel.

n) denotes the channel interference indicator function, which takes a value of 0 or 1. When
Drones 2024, 8, 381

Figure 3. Schematic diagram of the Discrete Soft Actor-Critic algorithm architecture.

26: Obtain the initial observation set Θ based on prior probabilities, and construct the prior Gaussian distribution and the posterior Gaussian distribution.
27: for n = 1 : BSN
28:     Determine x_{i+1} using the EI function.
29:     Calculate the target function value y_{i+1} corresponding to x_{i+1}.
30:     Update the Gaussian model.
31: end for
32: Output the optimal discount factor γ_u*, learning rate lr_u*, and target entropy H*.
33: Incorporate the optimal hyperparameters γ_u*, lr_u*, H*, and re-execute Steps 2-25.
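The candidate selection in Step 28 can be sketched with the closed-form expected improvement (EI) under a Gaussian posterior. This is a minimal illustration rather than the authors' implementation: it assumes the objective (total reward) is maximized, and the posterior means and standard deviations below are hypothetical numbers standing in for the output of the Gaussian model.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu: float, sigma: float, best_y: float) -> float:
    """EI for maximisation, given the posterior mean and std at one point."""
    if sigma <= 0.0:
        return max(0.0, mu - best_y)
    z = (mu - best_y) / sigma
    return (mu - best_y) * normal_cdf(z) + sigma * normal_pdf(z)


def pick_next(candidates, posterior, best_y):
    """Choose the candidate with the largest EI (the role of Step 28)."""
    return max(candidates,
               key=lambda x: expected_improvement(*posterior(x), best_y))


# Hypothetical posterior (mean, std) of the reward for three discount factors.
posterior_table = {0.90: (1.0, 0.1), 0.95: (1.2, 0.3), 0.99: (1.1, 0.8)}
best_so_far = 1.15  # hypothetical best observed reward
x_next = pick_next(posterior_table, posterior_table.get, best_so_far)
print("next discount factor to evaluate:", x_next)
```

EI rewards both a high predicted mean and a high uncertainty, so a candidate with a slightly lower mean but a much wider posterior can still be selected for evaluation.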

Because the discrete variable values involved in the above optimization problem are inconvenient to compute, the auxiliary variables b_1, b_2, b_3, b_4 are introduced, thus transforming the optimization problem C2 into the following:
the first-order derivative value of the above expression for Q(k) should be a trajectory along three dimensions (X, Y, Z). The first-order partial derivative values O^c_X, O^c_Y, O^c_Z for the Taylor expansion expression of Q(k) are as follows:

i(k) ≤ 0; then, the objective
respectively show the anti-interference effects of the three types of interference.
compared with the Convolutional Neural Network-Enhanced DQN (Conv_DQN) algorithm, the standard DQN algorithm, and random selection. To ensure fairness, the three algorithms were set with the same number of training sessions (

Figure 4. Comparison of resistance to constant frequency interference effects.

Figure 5. Comparison of resistance to sweep frequency interference effects.

Figure 6. Comparison of resistance to hybrid frequency interference effects.

Figure 7. Comparison of anti-interference effects with different hyperparameters.

mainstream hyperparameter sets that have shown better performance are used
the BO_DSAC_III and BO_DSAC_II curves in the figure. BO_DSAC_I represents the parameter set optimized by Bayesian optimization, denoted as (0.97, 5 × 10^-4, 0.6). As shown in the figure, the parameter set optimized by Bayesian optimization improves the anti-interference performance. Compared to the parameters during optimization and the nonoptimized parameter set, the total reward values increased by 7.1% and 3.3947 times, respectively. This demonstrates that Bayesian optimization of the hyperparameters can significantly enhance the problem-solving capability of the algorithm. Figures 8-10 illustrate the action selection of three anti-interference algorithms under various interference modes.

Figure 8. Action selection of the BO_DSAC algorithm under different interference modes.

Figure 9. Action selection of the Conv_DQN algorithm under different interference modes.

Figure 10. Action selection of the DQN algorithm under different interference modes.

Figure 11. Different motion trajectories of the cluster drones.

Figure 13. Comparison of total communication rate values among different optimization schemes.

Figure 14. Communication delay under different optimization schemes.

Figure 15. Changes in the modulation order of the relay drone.

Figure 16. Variation in the transmission power of the relay drone.

Figure 17. Energy consumption variation in the relay drone during the mission period.

Figure 18. Mission energy consumption value of the relay drone.

Figure 19. Changes in the three-dimensional trajectory of the relay drone.

Figure 20 illustrates the impact of variables across different dimensions on the total communication rate. The graph shows that individually optimizing each variable can effectively enhance the objective of increasing the total communication rate. Compared to the unoptimized initial values, the degrees of improvement are substantial, with the respective increases reaching 8.0654 × 10^10 bps, 3.7722 × 10^10 bps, 3.2535 × 10^10 bps, and 1.1919 × 10^11 bps.

Figure 20. Impact of variables across different dimensions on the optimization of total communication rate.

Table 1. Description of key parameters.
J: Maximum transmission power of the jammer
Y(•): Channel interference indicator function
SINR_{i,R}: SINR of the signal received by the relay drone from drone i in the cluster

Table 1. Cont.
SINR of the signal received by the ground station from the relay drone

Table 2 lists the experimental parameters set for this study.