A Novel Intelligent Anti-Jamming Algorithm Based on Deep Reinforcement Learning Assisted by Meta-Learning for Wireless Communication Systems

Abstract: In the field of intelligent anti-jamming, deep reinforcement learning algorithms are regarded as a key technical means. However, deep reinforcement learning requires a stable learning environment to be effective. Moreover, its inherent limitations mean that it learns well only on specific tasks with constant parameters; when the parameters change, it must resample and relearn before it converges again. In a changing jamming environment, its stability and convergence speed may therefore degrade, which affects its robustness and generalization capabilities. Exploiting the simple yet distinctive similarity among communication anti-jamming tasks, this paper designs a new Meta-PPO deep reinforcement learning algorithm that combines Proximal Policy Optimization (PPO) with the ideas of Model-Agnostic Meta-Learning (MAML). The proposed algorithm grafts the meta-learning principle of MAML onto PPO-based schemes, enabling a communication system to use the experience acquired from previous anti-jamming tasks to speed up its decision-making when it faces incoming jamming attacks with similar features. The proposed algorithm is verified through computer simulations, and the results show that Meta-PPO outperforms traditional DQN- and PPO-based algorithms in robustness and generalization ability, so it can be used to enhance the anti-jamming capabilities of wireless communication systems.


Introduction
Wireless communication technologies have been widely applied in many fields over the past two decades and have become one of the cornerstones of modern military information systems [1,2]. Improving anti-jamming capabilities in communications can enhance the reliability and security of military communications, especially in complex electronic warfare environments. For unmanned aerial vehicles (UAVs) and remote control systems, enhanced communication stability is crucial. Autonomous vehicles and intelligent transportation systems rely on reliable communication to prevent accidents, and improving anti-jamming performance can enhance the reliability of vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications. Enhancing anti-jamming abilities can maintain higher operational stability of networks in the face of natural disasters or human attacks. At personal and enterprise levels, improved communication anti-jamming capabilities can reduce the risk of private information leakage. In remote or complex environments, more reliable communication can provide better connectivity and service quality. In summary, enhancing communication anti-jamming capabilities benefits not only specific industries but also has a profound impact on the security and efficiency of the entire wireless communication field.
However, because of their open channels and application environments, wireless communication systems are vulnerable to all kinds of jamming, especially malicious jamming from electronic warfare attacks. Such jamming and attacks have already become one of the primary factors affecting the reliability and efficiency of wireless communication systems [3]. As a result, communication anti-jamming has become a hot topic in wireless communication research.
Traditional anti-jamming techniques used in wireless communication systems are mostly based on spread spectrum technologies, combined with various error-correcting codes and signal power/frequency adaptation techniques. These methods can effectively counter conventional jamming such as single-tone, multi-tone, and partial-band jamming [4]. However, with the development of machine learning, research on intelligent jamming has gradually deepened. In its confrontation with the communication party, an intelligent jammer accumulates "jamming knowledge" through experience or exploratory learning and makes jamming decisions that dynamically adapt to local changes in the electromagnetic environment. To enhance its own benefits, intelligent jamming forms a strategic closed loop of "perception-learning-prediction-decision-feedback." Intelligent jamming includes not only new jamming styles but also the comprehensive, flexible, and efficient application of basic jamming styles: according to the actual needs of the confrontation, and based on reasoning, prediction, and decision-making capabilities, it flexibly uses aiming or blocking jamming to block or weaken the effective communication of the communication party. Traditional anti-jamming measures therefore can no longer cope with highly dynamic intelligent jamming, and effective countermeasures are urgently needed against the various forms of intelligent jamming empowered by artificial intelligence technology.
In [5], a deep reinforcement learning-based routing algorithm was proposed to tackle the anti-jamming communication challenges in the heterogeneous Internet of Satellite (IoS), which operates in a highly dynamic communication environment with potential intelligent jamming. The proposed algorithm can obtain a usable subset of routes for the traffic in IoS with superior anti-jamming performance and can lower the routing costs. In addition, the proposed anti-jamming strategy can converge to a Stackelberg equilibrium.
In [6], a deep reinforcement learning-based maritime anti-jamming algorithm was proposed to address wireless communication interruptions caused by tracking jamming in a maritime communications environment. The algorithm reduced the probability of the communication system being found and tracked by tracking jammers, lowered the bit error rate of maritime wireless transmission, and also helped reduce the energy consumption of the ship and drone platforms.
As can be seen from the existing literature, deep reinforcement learning techniques can help wireless communication systems learn from their environment and adjust their strategies according to the jamming they encounter in various application scenarios, thus making optimal anti-jamming decisions to ensure their signal transmission performance [7][8][9][10][11][12][13][14][15].
However, existing deep reinforcement learning anti-jamming algorithms still face two challenges. First, their robustness needs further improvement. When faced with the same jamming pattern, even a minor adjustment to the jamming parameters can degrade or even invalidate their converged strategy, forcing the algorithms to restart their learning process, which can be very time-consuming. Second, the algorithms still lack generalization ability, and their application is still limited to combating specific jamming patterns. Most existing algorithms still lack the ability to exploit experience from previous anti-jamming tasks to accelerate learning on new anti-jamming tasks.
Appl. Sci. 2023, 13, 12642
Meta-learning is a branch of General AI that aims to enable machines to learn like humans, possessing cognitive and logical analytical abilities, and to update themselves so as to adapt to new learning tasks. In 2017, Chelsea Finn from the BAIR lab first proposed the Model-Agnostic Meta-Learning (MAML) algorithm in [16]. This algorithm can quickly and accurately transfer a well-trained deep learning model to a new environment.
Unlike transfer learning, MAML-based algorithms can complete their decision-making tasks using a meta-learner (meta-layer) to guide their learner (base layer) to reach the optimized decisions.The meta-layer can let the algorithms combine their previous learning experiences from all of their past tasks, providing the necessary initial model values for their learning process for new tasks.
MAML is applicable to all learners optimized by stochastic gradient descent. Hence, it is possible to integrate a MAML module into all deep learning models, which can improve their generalization ability and greatly broaden their application scenarios while maintaining algorithm performance [17].
In [18], the authors addressed the issue that deep neural networks usually perform well in data-rich situations but poorly when data are scarce or when rapid adaptation to task changes is needed. They introduced a meta-learner called SNAIL (Simple Neural Attentive Learner), which offers an effective solution for deep neural networks in data-scarce environments or when rapid task adaptation is needed, and it achieved excellent performance in several multi-task scenarios.
In [19], a facial recognition algorithm (Meta Face Recognition (MFR) Algorithm) was proposed by using the meta-learning technique.This algorithm addressed the poor generalization ability of the facial recognition systems used in real-life applications and can effectively apply models well-trained with webface data to real-life surveillance scenarios to improve their performances.
The existing works show that MAML-based algorithms require fewer training samples and generalize better. In particular, they can enable intelligent anti-jamming decision-making to fully exploit experience from previous anti-jamming tasks, so that optimal decisions are reached faster when the jamming pattern differs by only a few parameter changes [20][21][22][23][24][25].
These advantages make MAML-based algorithms particularly suitable for combating malicious intelligent jamming and attacks with highly dynamic characteristics. They can also overcome the limitations of existing deep reinforcement learning anti-jamming algorithms in terms of robustness and generalization ability.
In this paper, an intelligent anti-jamming algorithm that integrates the principles of deep reinforcement learning and meta-learning is proposed and verified by computer simulations. First, the algorithm employs a base learner to derive optimal strategies from a series of known, similar anti-jamming tasks. Then, a meta-learner abstracts the general characteristics and rules of all the previous tasks and acquires general knowledge about the jamming and attacks. This provides a more universal initial strategy from which the intelligent anti-jamming algorithm can complete its learning process for new tasks. Thus, the proposed algorithm, starting from this initial strategy, can swiftly adapt to new and more dynamic jamming scenarios.
The main contribution of this paper is as follows: We propose the Meta-PPO intelligent anti-jamming algorithm to address the problem that the communication party cannot reuse known anti-jamming experience, which leads to slow strategy convergence when it faces malicious jamming that dynamically and randomly modifies its jamming parameters. When the external jammer randomly changes its jamming parameters, the algorithm uses the MAML idea to update the initial model parameters in real time. This enables the communication party to utilize known historical anti-jamming experience and converge more quickly to the optimal anti-jamming strategy.
To validate the effectiveness of the Meta-PPO algorithm proposed in this paper, we simulated scenarios of wireless communication interaction using PyTorch (version 2.0). In these scenarios, the communication party employed both Meta-PPO and other algorithms for decision making, and their normalized throughput was compared. Simulation results show that the proposed algorithm can consistently maintain a high throughput for data transmission, especially in a jamming environment where the jamming parameters change frequently. The proposed algorithm also shows enhanced generalization ability and robustness compared with conventional reinforcement learning algorithms, resulting in superior anti-jamming capabilities. For future research, we plan to integrate actual hardware and conduct experiments with the Meta-PPO algorithm in specific wireless communication environments.
The organization of this paper is as follows: The components of the system and the formulation of the problem are presented in Section 2. The Meta-PPO-based intelligent anti-jamming algorithm for wireless communication is detailed in Section 3. The simulation outcomes and analysis of the algorithm are discussed in Section 4. Finally, the paper is concluded in Section 5.

System Model
The wireless communication transceivers operate in a frequency band of bandwidth $B$ with dynamic spectrum access and power control abilities. The available channels are denoted as $f_n \in \{f_1, f_2, \ldots, f_M\}$; the $M$ channels do not overlap with each other, and each of them has a bandwidth of $\Delta B = B/M$. The available power levels are represented as $p_n \in \{p_1, \ldots, p_D\}$. The transmission channel gain for the communication system is denoted as $g_n = g_p |h_s|$, where $g_p$ represents the path loss at a given distance and $|h_s|$ is a Rayleigh distributed random variable. The total time duration for communication is denoted as $T_{\max}$. Equation (1) gives the signal-to-jamming-plus-noise ratio (SINR) $H_T$ in time slot $T$:
$$H_T = \frac{P_{r,T}}{p_{k,T} + P_{n,T}} \tag{1}$$
The SINR in the communication system needs to meet the requirement $H_T \geq H_T^{th}$ at any moment in time slot $T$ to make sure the data frame transmitted in time slot $T$ can be successfully decoded, where $H_T^{th}$ represents the SINR threshold. The normalized throughput of the communication system's data transmission is denoted as $c_T$. The transmitter can transmit one subframe in each time slot, with each subframe containing the same amount of information. A packet, comprising $l$ subframes, includes a Cyclic Redundancy Check (CRC) field. The transmitter is located far from the jammer, is minimally affected by jamming, and can reliably receive control information transmitted by the receiver through a protocol-reinforced low-capacity control channel, enabling cooperative anti-jamming.
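The decoding condition above can be illustrated with a minimal sketch. It assumes that Equation (1) has the form $H_T = P_{r,T}/(p_{k,T} + P_{n,T})$ with all powers in linear units, and that the jamming power is taken as zero when the jammer does not hit the transmission channel. The function names and the numerical values are hypothetical and serve only as an illustration.

```python
def sinr(received_power, jamming_power, noise_power):
    """SINR H_T = P_r / (p_k + P_n), all powers in linear units (assumed form)."""
    return received_power / (jamming_power + noise_power)


def frame_decodable(received_power, jamming_power, noise_power, threshold):
    """A frame is decodable when the SINR meets the threshold H_T^th."""
    return sinr(received_power, jamming_power, noise_power) >= threshold


# Illustrative values only (not taken from the paper).
h_clear = sinr(1e-6, 0.0, 1e-8)    # jammer is on another channel
h_jammed = sinr(1e-6, 1e-5, 1e-8)  # jammer hits the transmission channel
```

With these illustrative values, the unjammed slot clears a threshold of 10 while the jammed slot does not, which is the situation the agent has to avoid by changing channel or power.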

Malicious jammer
The jammer preemptively acquires the communication frequency and time slot synchronization information of both the transmitter and receiver. The bandwidth of the pulse jamming signal can completely cover the transmission channel, and the duration of a single pulse is equal to the length of a time slot.
The jammer can indiscriminately attack all nodes in the communication system and can dynamically adjust its jamming power and its targeted channels between different communication time slots. Its available power is denoted as $p_k \in \{p_1, \ldots, p_E\}$, and the selectable channel as $f_k \in \{f_1, f_2, \ldots, f_M\}$, where $k \in \{1, 2, \ldots, K_T\}$ indexes the blocked channels and $K_T \leq M$ represents the number of channels blocked by the jammer during time slot $T$. $P_{n,T}$ represents the background noise during time slot $T$.

Problem Modeling
In this study, we model the deep reinforcement learning problem of the agent using a Markov Decision Process (MDP), which can be represented as < S, A, F, R >, where S denotes the set of environmental states, A represents the set of actions taken by the agent, F is the state transition probability function, indicating the probability distribution of the next state given a specific state and action, and R is the reward function for the agent.
The base learner uses the classic PPO (Proximal Policy Optimization) deep reinforcement learning algorithm.
The main parameters include a state space $S$, an action space $A$, a state transition probability $F$, and a reward function $R$.
The above parameters are defined as follows:
a. State space $S$: The system state at time slot $T$ is defined as $S_T = (f_{n,T}, f_{k,T}, p_{n,T}, p_{k,T})$, where $f_{n,T}$ represents the communication channel chosen by the communication system, $f_{k,T}$ represents the channel number chosen to be jammed by the jammer, $p_{n,T}$ is the transmission power chosen by the communication system, and $p_{k,T}$ is the jamming power chosen by the jammer, all in time slot $T$. Equation (2) gives the received power $P_{r,T}$ of the communication system in time slot $T$.
b. Action space $A$: The action at time slot $T$ is defined as $a_T = (f_{n,T+1}, p_{n,T+1})$, where $f_{n,T+1}$ represents the transmission channel and $p_{n,T+1}$ represents the transmission power chosen by the communication system for time slot $T+1$.
c. Reward function $R$: When the communication system successfully transmits its data once, it receives a reward; otherwise, if the data transmission fails, it receives a penalty.
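The state and action definitions above can be written down as simple data structures. The following is a minimal sketch under the simplifying assumption of a single jammed channel per slot; the class name `State`, the helper `action_space`, and the example channel count and power levels are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class State:
    """System state in one time slot (single jammed channel assumed)."""
    comm_channel: int   # f_{n,T}: channel used by the communication system
    jam_channel: int    # f_{k,T}: channel targeted by the jammer
    comm_power: float   # p_{n,T}: transmission power
    jam_power: float    # p_{k,T}: jamming power


def action_space(num_channels, power_levels):
    """All (channel, power) pairs the system may choose for the next slot."""
    return list(product(range(num_channels), power_levels))


# Illustrative sizes only: 4 channels and 3 power levels give 12 actions.
actions = action_space(num_channels=4, power_levels=(0.1, 0.5, 1.0))
```

A discrete action list of this kind maps directly onto the output layer of a policy network, with one output per (channel, power) pair.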
The communication system needs to consider the channel switching cost $C_f$ when making decisions. The reward function of the communication system represents the immediate reward that can be obtained by executing action $a_T$ under the environmental state $S_T$.
In the reward function, $c_T$ is the normalized throughput of the communication system, and the power discount factor $\eta$ is a constant in $[0,1]$. The higher the power $P_{r,T}$ is, the greater the reward discount $\eta P_{r,T}/p_{\max}$ will be.
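To make the trade-off between throughput, power, and channel switching concrete, the following sketch assumes one plausible form of the reward: the normalized throughput minus the power discount $\eta P/p_{\max}$, minus the switching cost $C_f$ whenever the channel changes. This form is an assumption made for illustration and is not a transcription of the paper's reward equation; all names and values are hypothetical.

```python
def reward(throughput, power, max_power, eta, switched, switch_cost):
    """Assumed reward: throughput - eta * power / max_power - switching cost.

    throughput  : normalized throughput c_T in [0, 1]
    power       : power term entering the discount
    eta         : power discount factor in [0, 1]
    switched    : True if the channel changed in this slot
    switch_cost : channel switching cost C_f
    """
    power_discount = eta * power / max_power
    return throughput - power_discount - (switch_cost if switched else 0.0)


# Illustrative values only.
r_stay = reward(1.0, 0.5, 1.0, eta=0.2, switched=False, switch_cost=0.1)
r_switch = reward(1.0, 0.5, 1.0, eta=0.2, switched=True, switch_cost=0.1)
r_fail = reward(0.0, 1.0, 1.0, eta=0.2, switched=True, switch_cost=0.1)
```

Under this assumed form, a failed transmission at full power with a channel switch yields a negative reward, which plays the role of the penalty described above.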

Meta-PPO Anti-Jamming Intelligent Decision Algorithm
Meta-learning enables agents to learn how to learn. Its focus is on how to introduce prior knowledge into the model and optimize external memory during training, so as to learn faster and more accurately when training on new tasks. Unlike other deep learning algorithms, MAML does not aim to find the optimal parameters for a specific task; instead, it seeks initial parameters $\eta$ by training on a series of task-related meta-tasks, so that the model can quickly reach the optimum when faced with new tasks. $\eta$ is sensitive to the learning-domain distribution of new tasks, which allows certain features inside the trained model to be transferred more easily among tasks. Optimal network parameters can then be obtained after a few update steps. The gradient descent process of MAML is shown in Figure 2, where $\eta$ represents the initial parameters obtained after MAML pre-training; $L_1$, $L_2$, and $L_3$ represent the loss functions of the new tasks; $\nabla$ represents the gradient operator; and $\eta_1$, $\eta_2$, and $\eta_3$ indicate the optimal updating directions under the new tasks.
The proposed intelligent communication anti-jamming algorithm is based on meta deep reinforcement learning combined with the meta-learner and base learner defined in the meta-learning-based MAML algorithm. It takes anti-jamming tasks with different processes but belonging to similar types as independent base learners. Then, the base learners transmit the knowledge they have learned to the meta-learner for collection and summarization, through which initial network parameters of the model can be obtained that offer fast convergence, strong robustness, and better generalization ability. Although the base learners are independent of each other, they perform intelligent anti-jamming tasks of similar types. Therefore, they are based on the same model. The proposed intelligent communication anti-jamming algorithm is presented in Figure 3.
As shown in Figure 3, the Meta-PPO algorithm consists of a base learner and a meta-learner. The base learner learns from multiple similar anti-jamming tasks, where it gains the network parameters learned from different tasks, which represent the knowledge of each task's characteristics. The meta-learner is responsible for integrating all the task characteristics and guiding the base learner, allowing the base learner to adapt to new tasks faster and thus solve new problems with better performance. The meta-learner represents the common knowledge of all tasks. Meta-testing uses the initial network parameters obtained by the meta-learner to learn the anti-jamming model and evaluates the parameters on new tasks not included in the meta-training set.
The modules of the Meta-PPO algorithm are described in detail as follows:

Base Learner
In the proposed Meta-PPO algorithm, the base learner learns from communication anti-jamming tasks that are similar but independent of each other. All tasks are performed on the same communication system, facing similar jamming environments. The only difference is that the jamming parameters used by the jammer differ from task to task. Therefore, the intelligent anti-jamming strategies used by the algorithm for each task are similar. The basic functions of the base learner performed in each task are as follows:
(1) As shown in Figure 4, for the current task, using the PPO algorithm to find the pattern of the jamming signals from the jammer and obtaining the optimal communication strategy under the current communication environment.
(2) Obtaining the experience from the meta-learner that is helpful for completing the current task, including the initial model, initial parameters, etc.
(3) Feeding the learned model and parameters back to the meta-learner after the current task learning is completed.

The proposed algorithm adopts the idea used in MAML meta-learning. During the base learner's learning process, the methods reported in [26][27][28][29][30] were adopted, and the concepts used in the PPO deep reinforcement learning algorithm, such as experience replay, the valuation neural network, and the target neural network, were also adopted.
During the learning phase, the base learner uses the $\varepsilon$-greedy strategy for learning updates. Under this strategy, the action with the highest reward is selected with a probability of $1-\varepsilon$, and an action is selected at random with a probability of $\varepsilon$, as given in Equation (4).
The goal of adopting meta-learning in the proposed algorithm is to let its learning process quickly adapt to new tasks. To achieve this, the Meta-PPO algorithm uses a few samples from multiple similar tasks as the input data for the learning process to reduce the expected loss of the algorithm over multiple tasks. This sets the direction of the parameter updates, so that a set of model parameters adapted to a new task can be acquired faster. The optimization objective of the proposed algorithm can therefore be expressed as Equation (5):
$$\min_{\theta} \; \mathbb{E}_{\tau \sim p(\tau)} \left[ L_{\tau}\left( U_{\tau}^{k}(\theta) \right) \right] \tag{5}$$
In the equation, $p(\tau)$ represents the task distribution, $\theta$ represents the parameters of the model, $U_{\tau}^{k}(\theta)$ indicates that the model parameters are updated $k$ times using a small amount of data collected from task $\tau$, and $L_{\tau}\left(U_{\tau}^{k}(\theta)\right)$ represents the loss function of the model on task $\tau$.
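The $\varepsilon$-greedy selection rule described above can be sketched in a few lines. The function name `epsilon_greedy` and the list of per-action value estimates are hypothetical; the sketch assumes the agent holds one scalar estimate per discrete action.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-valued one.

    action_values : one estimated value per discrete action (assumed input)
    epsilon       : exploration probability in [0, 1]
    rng           : the random module or a random.Random instance
    """
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))  # explore
    # Exploit: index of the largest estimated value.
    return max(range(len(action_values)), key=action_values.__getitem__)
```

With `epsilon=0.0` the rule is purely greedy, and with `epsilon=1.0` it is purely random; passing a seeded `random.Random` instance makes the exploration reproducible.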
The parameter updating process in the meta-training phase of the proposed algorithm can be divided into two parts, an inner loop and an outer loop, according to Equation (6). In the inner loop, the Meta-PPO algorithm uses a small amount of data from a randomly chosen task $\tau$ as the learning data to update the model parameters, reducing the model's loss on task $\tau$. In this loop, the model parameter updating process is the same as in the PPO algorithm proposed in [26][27][28][29][30]. The neural network of the algorithm learns from several batches of data on the randomly chosen tasks. In each round, the agent performs actions according to the policy of the Actor network, interacts with the environment to collect experience, and then saves the collected experience to an experience pool. When the number of collected experiences meets the model update threshold, the model parameters are updated.
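The experience pool with an update threshold can be sketched as follows. The class name `ExperiencePool`, its methods, and the choice to empty the pool after each update (a common practice for on-policy methods such as PPO) are assumptions for illustration, not details given in the paper.

```python
from collections import deque


class ExperiencePool:
    """Stores (state, action, reward, next_state) tuples until an update is due."""

    def __init__(self, update_threshold, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest entries dropped when full
        self.update_threshold = update_threshold

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def ready(self):
        """True once enough experience has been collected for a model update."""
        return len(self.buffer) >= self.update_threshold

    def drain(self):
        """Return all stored transitions and empty the pool (assumed behavior)."""
        batch = list(self.buffer)
        self.buffer.clear()
        return batch
```

In an interaction loop, the agent would call `store` after every time slot and trigger a parameter update on the batch returned by `drain` whenever `ready` becomes true.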
First, during the updating process, the temporal difference error $\delta_T$ is calculated according to Equation (6). In the equation, $\delta_T$ represents the single-step temporal difference error at time $T$, $r_T$ represents the immediate reward at time $T$, and $\mu$ represents the parameters of the Critic network. $V_{\mu}(S_{T+1})$ is the estimated state value of the next state $S_{T+1}$ at time $T+1$, and $V_{\mu}(S_T)$ is the estimated state value of the current state $S_T$; both are output by the Critic network.
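As an illustration of the quantities just defined, the sketch below computes single-step temporal difference errors in the standard form $\delta_T = r_T + \gamma V_{\mu}(S_{T+1}) - V_{\mu}(S_T)$. The discount factor `gamma` and its default value are assumptions, since the text above does not name a discount symbol; the function name is hypothetical.

```python
def td_errors(rewards, values, next_values, gamma=0.99):
    """Single-step TD errors: r_T + gamma * V(S_{T+1}) - V(S_T).

    rewards     : immediate rewards r_T
    values      : Critic estimates V(S_T)
    next_values : Critic estimates V(S_{T+1})
    gamma       : discount factor (assumed; not named in the text)
    """
    return [r + gamma * v_next - v
            for r, v, v_next in zip(rewards, values, next_values)]
```

A positive error indicates that the slot turned out better than the Critic expected, and a negative error indicates the opposite.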
After calculating the temporal difference error for each time step, the target for updating the Critic network is calculated according to Equation (7). In that equation, $A_T$ represents the estimated advantage function at time T.

The mean squared error is used as the loss function for updating the Critic network. The Critic loss function can be expressed as Equation (8), and the Critic network is then updated through backpropagation:

$$L(\mu) = \mathbb{E}_T \left[ \left( y_T - V_{\mu}(S_T) \right)^2 \right] \tag{8}$$

In the equation, $y_T$ represents the update target of the Critic network, and $V_{\mu}(S_T)$ represents the estimated state value of state $S_T$ at time T; the mean is taken over the sampled time steps.
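The chain from temporal difference errors to the Critic loss can be sketched as follows. Because Equation (7) is not reproduced here, the sketch assumes one common choice: the advantage is accumulated from the TD errors with generalized advantage estimation, and the target is the advantage plus the current value estimate. The function names and the parameters `gamma` and `lam` are hypothetical.

```python
def gae_advantages(deltas, gamma=0.99, lam=0.95):
    """Accumulate TD errors backwards in time into advantage estimates.

    Assumed form (generalized advantage estimation); the paper's own
    Equation (7) may differ.
    """
    advantages = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


def critic_targets(advantages, values):
    """Assumed Critic target: advantage plus the current value estimate."""
    return [a + v for a, v in zip(advantages, values)]


def critic_mse_loss(targets, values):
    """Mean squared error between Critic targets and Critic predictions."""
    return sum((y - v) ** 2 for y, v in zip(targets, values)) / len(values)
```

In a real training step the targets would be treated as constants and only the predictions would be differentiated; the sketch shows the arithmetic only.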
The loss function of the Actor network is shown in Equation (9). The Actor network parameters are updated by minimizing this loss function.
In the equation, $r_T(\theta) = \frac{\pi_{\theta}(a_T \mid S_T)}{\pi_{\theta_1}(a_T \mid S_T)}$ represents the ratio of the new policy to the old policy, θ represents the parameters of the new policy network, $\theta_1$ represents the parameters of the old policy network, and $A_T$ represents the estimated advantage function at time T.
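Since Equation (9) is not reproduced here, the sketch below uses the standard PPO clipped surrogate objective as a stand-in for the Actor loss. It is an assumption that the paper's loss has exactly this form; the function name `ppo_actor_loss` and the parameter `clip` (set to 0.2, the clipping ratio reported in the simulations) are introduced for illustration.

```python
import math


def ppo_actor_loss(new_log_probs, old_log_probs, advantages, clip=0.2):
    """Standard PPO clipped surrogate loss (to be minimized).

    new_log_probs / old_log_probs: log pi(a|S) under the new and old policy.
    advantages: estimated advantages A_T.
    clip: clipping range; the ratio is limited to [1 - clip, 1 + clip].
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(lp_new - lp_old)  # r_T(theta)
        clipped = max(1.0 - clip, min(1.0 + clip, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    # Negate so that minimizing the loss maximizes the surrogate objective.
    return -total / len(advantages)
```

When the new policy doubles the probability of an action with positive advantage, the clipped term caps the ratio at 1.2, so the update gains nothing from moving further in that direction.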
In the inner loop, after updating the model parameters k times, the process enters the outer loop. In the outer loop, the Meta-PPO algorithm calculates the update gradient according to Equations (10) and (11) and updates the model parameters again to minimize the expected loss function over the task distribution, aiming to find a set of initial parameters that can quickly adapt to new tasks.
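$$\mu = \mu + \varepsilon \left( \tilde{\mu} - \mu \right) \tag{10}$$

In the equation, µ represents the parameters of the Critic network before learning, $\tilde{\mu}$ represents the parameters after learning on task τ, and ε is the update step size.

$$\theta = \theta + \varepsilon \left( \tilde{\theta} - \theta \right) \tag{11}$$

In the equation, θ represents the parameters of the Actor network before learning, $\tilde{\theta}$ represents the parameters after learning on task τ, and ε is the update step size.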
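The outer-loop rule moves the initial parameters a fraction of the way toward the parameters obtained after learning on a task, which is the same form as the Reptile meta-learning update. A minimal sketch, with parameters held in plain Python lists and the hypothetical function name `meta_update`:

```python
def meta_update(params, adapted_params, step_size):
    """Move the initial parameters toward the task-adapted parameters.

    params:         parameters before learning on the task
    adapted_params: parameters after learning on the task
    step_size:      outer-loop update step size (epsilon in the text)
    """
    return [
        p + step_size * (p_new - p)
        for p, p_new in zip(params, adapted_params)
    ]
```

For example, `meta_update([0.0, 1.0], [1.0, 3.0], 0.5)` returns `[0.5, 2.0]`: each parameter moves halfway toward its adapted value. The same function would be applied to both the Critic and the Actor parameters.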
For the assumed communication environment in our research, the decision model of the basic learner based on the proposed Meta-PPO algorithm is shown in Figure 5. When a communication process starts, the Actor network of the PPO algorithm receives the communication environment values perceived by the upper layer as its input, and then it outputs action commands to control the decisions made by the communication system. During the interaction with the external environment, the generated data $(S_t, A_t, R_t, S_{t+1})$ are stored in the experience pool for subsequent model learning.

Meta-Learner
As shown in Figure 6, in the proposed Meta-PPO algorithm, the meta-learner is responsible for collecting and summarizing the learned experiences from all tasks. After each learning process of the basic learner, the meta-learner integrates those experiences and updates its parameters accordingly. In the end, the meta-learner, having integrated the learning experiences from all tasks, delivers the initial model network values to the basic learner, allowing the basic learner to achieve good accuracy after a few iterations when dealing with new tasks.
As shown in Figure 3, the memory module of the proposed algorithm includes two parts: an initial parameter value and the loss function's gradient value. The initial parameter value represents the initial parameter of a certain task. During the basic learner's learning process, the meta-learner extracts the initial parameter value of the most relevant task from the memory module and feeds it to the basic learner as the initial parameter value.
After the basic learner completes its learning process, it feeds the loss function's gradient value back to the meta-learner. The meta-learner then distributes the initial parameter values to all the basic learners.
After the basic learners complete their learning processes, they feed the loss function's gradient values back to the meta-learner. Finally, the meta-learner collects the loss function's gradient values from all the tasks and uses them to update the initial parameter values stored in the memory module. In this way, the latest initial parameter values from the most relevant task experience can be provided promptly for each task.
The intelligent anti-jamming model of the Meta-PPO algorithm is shown in Figure 7. The specific learning process of the meta-learner is as follows. Assume there is a model $f_{\theta}$ parameterized by θ and a distribution of tasks $p(T)$. First, initialize the parameters θ with random values. Next, draw a batch of tasks $T_i \sim p(T)$ from the distribution of tasks.
Then, for each task, draw k trajectories and construct the training and test sets, with $D_i^{train} \sim T_i$. Execute gradient descent through Equation (12) and find the optimal parameters $\theta_i$ that minimize the loss on the training set $D_i^{train}$:

$$\theta_i = \theta - \alpha \nabla_{\theta} L_{T_i}\left( f_{\theta} \right) \tag{12}$$

Before drawing the next batch of tasks, perform a meta-update. Minimize the loss by calculating the loss gradient relative to the optimal parameters $\theta_i$ through Equation (13) to update the randomly initialized parameter θ:

$$\theta = \theta - \beta \nabla_{\theta} \sum_{T_i \sim p(T)} L_{T_i}\left( f_{\theta_i} \right) \tag{13}$$

Here α and β are the inner loop and outer loop gradient update hyperparameters listed in Algorithm 1.
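The inner update and the meta-update can be illustrated on a toy problem in which each task has a one-dimensional quadratic loss, so that all gradients are available in closed form. This is a sketch under stated assumptions, not the paper's anti-jamming setup: the loss $(\theta - c)^2$ with a per-task target `c`, the use of the mean rather than the sum over tasks, and every function name are hypothetical.

```python
def inner_step(theta, target, alpha):
    """One inner-loop gradient step on the task loss (theta - target)**2."""
    grad = 2.0 * (theta - target)
    return theta - alpha * grad


def meta_gradient(theta, targets, alpha):
    """Gradient of the post-adaptation loss with respect to the initial theta.

    For this quadratic loss the adapted parameter is
    theta_i = theta - 2 * alpha * (theta - c), so d(theta_i)/d(theta) = 1 - 2 * alpha,
    and the chain rule gives 2 * (theta_i - c) * (1 - 2 * alpha) per task.
    """
    total = 0.0
    for c in targets:
        adapted = inner_step(theta, c, alpha)
        total += 2.0 * (adapted - c) * (1.0 - 2.0 * alpha)
    return total / len(targets)  # mean over tasks (assumed; a sum also works)


def meta_train(targets, alpha=0.1, beta=0.05, steps=500, theta=0.0):
    """Repeat the meta-update on the initial parameter theta."""
    for _ in range(steps):
        theta -= beta * meta_gradient(theta, targets, alpha)
    return theta
```

On this toy problem the learned initialization settles at the mean of the task targets, which is the starting point from which a single inner step reduces the loss across all tasks most evenly.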

Meta-PPO Intelligent Anti-Jamming Algorithm
The Meta-PPO intelligent anti-jamming algorithm designed in this paper is shown in Algorithm 1. After iterating n times, the meta-learner sends the final initial model parameter θ to the basic learner, which then learns and is tested on new tasks that are not in the training set $D^{train}$.

The closing steps of Algorithm 1 are:
7 (continued): Use this gradient to update the Actor network's parameters θ, as specified in Equation (12).
8: t = t + 1;
9: end for
10: When the training round is over, the meta-learner sends the initial model parameters θ and µ obtained in the current round, together with the random initial model parameters for the next round, to the base learner, and the new tasks outside the training set are tested.

Simulations and Analyses
In our simulation analyses, we mainly focus on how the communication system, when faced with a random sweeping jamming attack whose jamming parameters change, can draw on previously learned experiences using the proposed Meta-PPO algorithm and quickly select the best channel.
As shown in Figure 8, it is assumed that the jammer launches random sweep jamming attacks. Within each jamming cycle, the jammer randomly chooses its transmission power, then sweeps all channels within frequency band B, and randomly selects up to three channels to launch jamming attacks. The proposed algorithm is then verified through computer simulations based on a neural network built with PyTorch. Table 2 shows the parameters used in our simulations. In the simulations, the proposed Meta-PPO algorithm implemented its learning processes for 1000 episodes, with 10 iterations of parameter updates in each episode.
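One jamming cycle of the random sweep jammer described above can be sketched as follows. The function name `jamming_cycle`, the list of candidate power levels, and the assumption that at least one channel is jammed in every cycle are illustrative choices that are not specified in the text.

```python
import random


def jamming_cycle(num_channels, power_levels, rng, max_jammed=3):
    """Simulate one cycle of a random sweep jammer.

    num_channels: number of channels in the frequency band
    power_levels: hypothetical list of candidate transmission powers
    rng:          a random.Random instance, for reproducibility
    max_jammed:   upper bound on jammed channels (three in the text)
    """
    # The jammer first picks its transmission power at random.
    power = rng.choice(power_levels)
    # It then jams between one and max_jammed distinct channels
    # (the lower bound of one is an assumption).
    count = rng.randint(1, min(max_jammed, num_channels))
    jammed = sorted(rng.sample(range(num_channels), count))
    return power, jammed
```

Calling `jamming_cycle(10, [1.0, 2.0, 5.0], random.Random(0))` returns one power level and a sorted list of at most three distinct channel indices; repeating the call yields the changing jamming pattern that the agent has to avoid.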

[Table 2: simulation parameters and their numerical values]

Figure 9 shows the curves of normalized network throughput changing with the episodes when the proposed Meta-PPO algorithm was used, compared with the PPO [31], DQN [32], Double DQN [33], and Dueling DQN [34] algorithms. Normalized throughput is a widely accepted criterion for measuring the performance of an agent's strategy [35]. The better the strategy the agent takes, the higher the normalized network throughput.
As can be seen from Figure 9, for the proposed Meta-PPO algorithm, the normalized network throughput rises rapidly in the initial stage, before the 50th episode, and reaches its maximum value shortly afterwards. This indicates that in the early stages of the learning process, the agent can quickly reach an effective strategy to deal with the jamming attacks. However, between the 100th and 150th episodes, the normalized network throughput drops slightly, suggesting that the Meta-PPO algorithm undergoes a fine-tuning or policy oscillation stage. After the 200th episode, the normalized network throughput curve becomes flat, showing that there is little change in the subsequent learning process. This indicates that the proposed Meta-PPO algorithm has found an efficient strategy and that further optimization after this point brings few gains in the normalized network throughput.
In contrast, the PPO, DQN, Double DQN, and Dueling DQN algorithms take more than 300 episodes to reach the best strategy. This shows that the proposed Meta-PPO algorithm can achieve similar or even better anti-jamming strategies for wireless communication systems in a shorter learning time than traditional deep reinforcement learning-based algorithms. The Meta-PPO algorithm saves 60% of the time in strategy convergence compared to the PPO algorithm, and nearly 80% of the time compared to the DQN algorithm.
This simulation result shows the better performance of our proposed Meta-PPO algorithm in terms of rapid adaptation ability for anti-jamming tasks.
The clipping ratio r(θ) of a new and an old policy is an important hyperparameter for Meta-PPO algorithms that is used to limit the size of policy updates, ensuring that the difference between these two policies is not too large.If r(θ) is too small, then beneficial policy updates might be hindered, while if r(θ) is too large, then overly aggressive and potentially harmful updates might happen.Therefore, this hyperparameter must be chosen properly to optimize the algorithm performance.
Figure 10 shows the normalized network throughput under different clipping ratios while the other parameters remain unchanged. As can be seen, when the clipping ratio r(θ) of the new and old policies is modified, the speed at which the Meta-PPO algorithm learns the optimal policy remains unchanged: the optimal policy appears after about 100 episodes. When the clipping ratio is chosen to be r(θ) = 0.2, the algorithm approaches its optimal upper limit. At this point, the Meta-PPO algorithm allows the policy to make necessary updates without over-updating. Therefore, the best clipping ratio choice is r(θ) = 0.2.
The MAML gradient learning update rate, denoted as α, is another crucial hyperparameter in the Meta-PPO algorithm, determining the update rate of parameters in the inner loop. The purpose of the inner loop in the MAML algorithm is to learn a specific task rapidly, while the outer loop aims for generalization across multiple tasks.
As shown in Figure 11, when the other parameters remain unchanged and only the MAML gradient learning update rate α is modified, the speed at which the Meta-PPO algorithm learns the optimal policy remains unchanged, converging at around 100 episodes. The highest normalized throughput is achieved when α = 0.1. At this point, in the inner loop, the model parameter updates are moderate, allowing the model to adapt to the specific tasks and environments of the basic learner without over-fitting. This prevents a reduction in the generalization capability in the outer loop.
Figure 12a illustrates the curve of the action loss for the Meta-PPO algorithm varying with the number of episodes. Typically, the loss function represents the negative of the expected return of a strategy, which means that it should take negative values. The optimization process is expected to maximize this return, which translates to minimizing this loss function. As seen from Figure 12a, during the strategy optimization process of the proposed algorithm, the strategy loss steadily rises from a notably large negative value and then starts to converge to a stable value after the 100th episode. This suggests that the strategy has been refined and has reached an optimal point.
Figure 12b presents the curve of the Meta-PPO algorithm's strategy entropy varying with the number of episodes. Entropy is widely regarded as a measure of the randomness or uncertainty of a strategy. Increasing the entropy during a strategy's optimization process encourages exploration of a wider action space, which benefits the optimization because the strategy is then optimized over a wider range of actions rather than only over a few concentrated ones. As seen from Figure 12b, during the algorithm's policy optimization, the entropy rises from 0 to approximately 2.71 after the 100th episode and subsequently oscillates between 2.7 and 2.71. Notably, even when the policy loss stabilizes around the 100th episode, the policy entropy does not drop to zero or near-zero values. This indicates that the Meta-PPO learning trajectory starts from a deterministic strategy, diversifies over time, and then stabilizes while retaining a degree of exploration. This suggests that the identified optimal solution is not trapped in a local optimum.
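For reference, the entropy of a discrete policy can be computed directly from its action probabilities, as in the sketch below. The function name `policy_entropy` and the example action counts are illustrative assumptions; the size of the action space used in the simulations is not stated in this section.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution.

    Zero-probability actions contribute nothing to the sum.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A deterministic policy has zero entropy.
deterministic = policy_entropy([1.0, 0.0, 0.0])

# A uniform policy over n actions has the maximum entropy ln(n);
# n = 15 is purely illustrative and gives ln(15), roughly 2.708.
uniform_15 = policy_entropy([1.0 / 15] * 15)
```

The two extremes bracket the behaviour described for Figure 12b: a value of 0 corresponds to always choosing the same action, and larger values correspond to probability being spread over more actions.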
Figure 12c shows the curve of the proposed Meta-PPO algorithm's value loss varying with the number of episodes. In the reinforcement learning process, the value function can be used to predict the expected return of a given state. In our proposed Meta-PPO algorithm, a Critic network is employed to estimate the value function, and the value loss represents the disparity between the Critic network's predictions and the actual returns. A greater value loss indicates a greater deviation between the Critic network's predictions and the rewards actually obtained. If this loss keeps increasing, it may indicate that the Critic network is not learning effectively or that the learning rate is not suitable. As can be seen from Figure 12c, the Meta-PPO algorithm starts with a high value loss because the neural network used for value function estimation is randomly initialized. As the episode number approaches 100, there is a noticeable drop in the value loss, reaching $2 \times 10^{11}$. This sharp decline is then followed by small variations. These changes suggest that the value function estimator is gradually becoming more accurate in predicting actual returns.
In summary, the Meta-PPO algorithm demonstrates a rapid and efficient learning trend in the initial learning phase, quickly mastering the optimal strategy, and then undergoes fine-tuning and stabilization in subsequent learning. Meanwhile, although DQN can eventually find an effective strategy, it requires a longer time to achieve similar performance. Under the current system model, the Meta-PPO algorithm achieves the maximum normalized throughput when using the clipping ratio r(θ) = 0.2 and the MAML gradient learning update rate α = 0.1 as parameters. This curve change reflects the effectiveness and robustness of the two algorithms and further emphasizes the superior performance of Meta-PPO.

Conclusions
This algorithm uses the PPO policy optimization method to allow the basic learner to learn a series of anti-jamming tasks with known jamming parameters. Then, the meta-learner summarizes and generalizes the general rules of this series of similar anti-jamming tasks, so that when the communication system encounters new jamming similar to historical anti-jamming tasks, it can make decisions more quickly using the experience provided by the meta-learner.
Simulation results show that, compared with other deep reinforcement learning algorithms that use random initial networks, Meta-PPO can still maintain good learning performance under constantly changing jamming conditions by adopting the initial strategy provided by the meta-learner. It also demonstrates better robustness and generalization characteristics, further enhancing its anti-jamming characteristics.
In the experimental simulation of this paper, the sensitivity of communication parameters still needs further study. At the same time, the idea of combining meta-learning with the DQN algorithm itself has a lot of room for optimization, such as optimizing the computational complexity of second-order gradient descent algorithms and measuring the specific similarity of meta-training tasks. We plan to conduct research on these issues in the next step and perform actual simulations on hardware to continuously enhance the generalization and robustness of intelligent anti-jamming decision making.

Figure 1
Figure 1 depicts the model of a communication system utilized in this paper. This model includes a set of wireless communication transceivers and a jamming device. The jamming signals from this device are capable of effectively enveloping the receiver. This paper makes the following assumptions: a Communication receiver

Figure 1. The wireless communication system model used in our research.

Figure 4. Decision model of the basic learner based on the PPO algorithm.

Figure 5. Decision model of the basic learner based on the Meta-PPO algorithm.

Figure 6. Overall decision-making process based on the Meta-PPO algorithm.

Algorithm 1: Meta-PPO
1: Initialize: number of initial tasks $T_i$; number of tracks to be extracted in each task k; number of training rounds n; inner loop gradient update hyperparameter α; outer loop gradient update hyperparameter β; randomly initialized Actor network parameters θ and Critic network parameters µ.
2: for t = 1, 2, ..., T do
3: Extract k tracks to form a training set $D_i^{train}$ and use the PPO neural network algorithm for training
4: Calculate the temporal difference error $\delta_t$ according to Equation (7)
5: Update the Critic network parameters by minimizing the Critic network loss function according to Equation (9) to obtain the updated parameters $\tilde{\mu}$
6: Update the Actor network parameters by minimizing the Actor network loss function according to Equation (10) to obtain the updated parameters $\tilde{\theta}$
7: Calculate the gradient as the difference between the model parameters before and after training

Figure 9. Comparison of anti-jamming performance of different algorithms.

Figure 10. Comparison of normalization throughput at different clipping ratios.

Figure 11. Comparison of different MAML gradient learning update rates.