RIS-Assisted Robust Beamforming for UAV Anti-Jamming and Eavesdropping Communications: A Deep Reinforcement Learning Approach

Abstract: The reconfigurable intelligent surface (RIS) has been widely recognized as a rising paradigm for physical layer security due to its potential to substantially adjust the electromagnetic propagation environment. In this regard, this paper adopted an RIS deployed on an unmanned aerial vehicle (UAV) to enhance information transmission while defending against both jamming and eavesdropping attacks. Furthermore, an innovative deep reinforcement learning (DRL) approach is proposed to optimize the power allocation of the base station (BS) and the discrete phase shifts of the RIS. Specifically, considering the imperfect channel state information (CSI) of the illegitimate nodes, we first reformulated the non-convex and non-conventional original problem into a Markov decision process (MDP) framework. Subsequently, a noisy dueling double-deep Q-network with prioritized experience replay (Noisy-D3QN-PER) algorithm was developed to maximize the achievable sum rate while ensuring that the security requirements are fulfilled. Finally, the numerical simulations showed that the proposed algorithm outperformed the baselines in both the system rate and the transmission protection level.


Introduction
Recently, the advancement of next-generation wireless communications has led to exponential growth in data transmission and connected nodes [1]. However, owing to the open nature of wireless channels, wireless communications are increasingly susceptible to active jamming and passive eavesdropping [2,3]. With this as the focus, the academic community has studied various techniques to combat jamming and eavesdropping attacks, e.g., power control [4], frequency hopping [5], artificial-noise-aided beamforming [6], and cooperative relaying [7]. However, power control cannot handle high-power jamming attacks, and frequency hopping consumes additional spectrum resources. On the other hand, releasing artificial noise consumes extra power, and employing relays may incur additional hardware cost [4][5][6][7].
The above-mentioned shortcomings have motivated a new paradigm called the reconfigurable intelligent surface (RIS) [8]. This technology has recently been regarded as a promising solution for enhancing the power/spectral efficiency of wireless communication systems [8][9][10][11]. Specifically, the RIS consists of a large number of passive elements whose reflection coefficients can be dynamically adjusted according to the needs of different communication scenarios, either to increase the received signal power or to significantly reduce the impact of interference in the network [9][10][11]. Therefore, the RIS has garnered extensive research attention in the domain of secure communications [12][13][14][15][16][17][18][19][20][21][22][23][24][25]. However, in the face of increasingly complex electromagnetic environments, there is an urgent need for highly efficient and reliable beamforming algorithms for RIS-aided secure communications.

Related Works
In recent years, several fundamental technical challenges of RIS-assisted secure communication systems have been addressed [12][13][14]. In [12], a joint beamforming scheme was proposed to protect transmissions from eavesdropping attacks, where several optimization algorithms were applied, including alternating optimization (AO) and semidefinite relaxation (SDR). To maximize the secrecy rate of the RIS-assisted Gaussian multiple-input multiple-output (MIMO) channel, the authors in [14] used the AO algorithm to jointly optimize the transmit covariance at the transmitter and the phase shift coefficients at the RIS and further proposed a minorization-maximization (MM) algorithm to obtain locally optimal phase shifts. However, these works assumed that the base station (BS) can acquire the ideal channel state information (CSI) of all nodes, which is impractical due to the uncooperative relationship between the BS and the illegitimate nodes. To tackle this matter, robust algorithms have been developed to jointly optimize active beamforming and passive reflecting beamforming to secure the wireless transmission system against jammer attacks, where the CSI of illegitimate nodes at the BS is not completely known [15][16][17]. In addition, the authors in [18] iteratively solved an energy-efficient secure transmission problem with a probabilistic outage constraint by low-complexity first-order algorithms in the presence of imperfect information about the eavesdropper's channel state.
Considering that the actual communication environment may become increasingly complex, such as in densely populated areas with clusters of buildings, the links between the RIS and various nodes may encounter obstacles. Unmanned aerial vehicles (UAVs) have been widely used in complex communication networks due to their low cost and flexible maneuverability [19][20][21][22][23]. In addition, when the RIS is mounted on a UAV, the attenuation of the ground-to-air channel is much lower than that of the ground channel, which can significantly reduce the energy loss of passive reflection. The authors in [21] utilized UAVs carrying reflective surfaces to facilitate power delivery to intelligent devices while simultaneously transmitting information. Liu et al. used an AO framework to study a multi-controllable system for RIS-aided UAV communication [22]. In [23], the authors studied the secrecy problem of RIS-based integrated satellite UAV relay networks with multiple eavesdroppers.
The obvious challenge is that traditional convex optimization algorithms may be less efficient for large-scale communication systems. Besides, the coefficient adjustment of a practical RIS is discrete, which makes the traditional algorithms no longer applicable. Benefiting from the rapid development of artificial intelligence (AI), reinforcement learning (RL), which can effectively deal with large-scale discrete RIS coefficients, has attracted much interest in beamforming design for RIS-assisted wireless communication systems [24][25][26][27][28][29][30]. The authors in [24] proposed a passive phase shift design based on deep reinforcement learning (DRL) to maximize the downlink received signal-to-noise ratio. In [25], DRL and extremum seeking control were combined for model-free control of the RIS. In response to increased network demand and interference challenges from nearby UAV cells, a direct collaborative-communication-enabled multi-agent decentralized double-deep Q-network (CMAD-DDQN) approach facilitates direct collaboration among UAVs, optimizing their 3D flight trajectories to maximize energy efficiency while outperforming existing methods by up to 85% [26]. However, these works did not explore the issue of the security of AI in RIS-enhanced communication systems. In [27,28], the authors proposed secure DRL-based beamforming methods for protecting RIS-assisted wireless communications from active jamming or passive eavesdropping. Furthermore, in order to maximize the energy efficiency of multi-UAV-assisted wireless coverage, the authors in [29] proposed a cooperative multi-agent decentralized double-deep Q-network (MAD-DDQN) approach, but the algorithm could not be directly applied to optimize the reflecting beamforming for the RIS. To the best of our knowledge, no existing work has considered the design of DRL-based RIS-assisted secure transmission strategies in the presence of both jammers and eavesdroppers under imperfect CSI conditions.

Contributions
In this paper, we aimed to delve into the anti-jamming and anti-eavesdropping problems in an RIS-assisted UAV transmission system and introduce an innovative robust DRL-based approach to design discrete RIS coefficients in the presence of imperfect CSI from illegitimate nodes. In summary, our principal contributions are itemized as follows:
• Considering the illegitimate nodes' imperfect CSI, the joint optimization problem of power allocation at the BS and reflecting beamforming at the RIS is formulated to maximize the achievable system rate while ensuring fulfillment of the security requirements.
• To cope with the non-convex and non-conventional optimization problem, we first used a robust method to process the imperfect CSI, and subsequently, the optimization problem was reformulated into a Markov decision process (MDP) framework. Then, a noisy dueling double-deep Q-network with prioritized experience replay (Noisy-D3QN-PER) algorithm with safety performance awareness is proposed, where the D3QN is an improvement of the DQN, the NoisyNet encourages exploration to avoid falling into local optima, and the PER accelerates the convergence.
• The numerical results indicated that the Noisy-D3QN-PER algorithm outperformed conventional approaches in improving the safety performance protection level and the achievable sum rate. For example, the proposed algorithm improved the system rate and transmission protection level by 27.43% and 11.11%, respectively, compared to the conventional DQN benchmark.

System Model
Figure 1 depicts the secure transmission scenario under consideration, wherein a BS aided by a fixed aerial RIS-UAV seeks to establish dependable links with $K$ single-antenna users in the presence of a smart jammer and a single-antenna eavesdropper. Here, we assumed that the BS and the jammer are equipped with $N$ and $N_J$ antennas, respectively, and the RIS deployed on the UAV has $L$ reflecting units. For ease of exposition, we further denote the channels between the BS and the RIS-UAV, the smart jammer and the RIS-UAV, the BS and the $k$-th user, the RIS-UAV and the $k$-th user, the smart jammer and the $k$-th user, the BS and the eavesdropper, and the RIS-UAV and the eavesdropper by $\mathbf{G}_{BR} \in \mathbb{C}^{L \times N}$, $\mathbf{G}_{JR} \in \mathbb{C}^{L \times N_J}$, $\mathbf{h}_{BU,k}^{H} \in \mathbb{C}^{1 \times N}$, $\mathbf{h}_{RU,k}^{H} \in \mathbb{C}^{1 \times L}$, $\mathbf{h}_{JU,k}^{H} \in \mathbb{C}^{1 \times N_J}$, $\mathbf{h}_{BE}^{H} \in \mathbb{C}^{1 \times N}$, and $\mathbf{h}_{RE}^{H} \in \mathbb{C}^{1 \times L}$, respectively.
Due to the cooperation between the legitimate nodes, we assumed that the CSI of the involved legitimate channels $\mathbf{G}_{BR}$, $\mathbf{h}_{BU,k}$, and $\mathbf{h}_{RU,k}$ is accurately available at the BS. However, since illegitimate nodes are not expected to collaborate with the BS in channel estimation, we made the practical assumption that the CSI of the illegitimate channels, namely $\mathbf{G}_{JR}$, $\mathbf{h}_{JU,k}$, $\mathbf{h}_{BE}$, and $\mathbf{h}_{RE}$, cannot be perfectly obtained. To elaborate on this, considering a more practical and more general situation, rather than using a statistical or bounded uncertainty model [15], we characterized the illegitimate CSI by a given angle-based range, with $\Delta_J = \{\Delta_{J,h}, \Delta_{J,G}\}$. Within each range, $\theta_L$ and $\theta_U$ represent the minimum and maximum vertical angles of the AoD (AoA), $\phi_L$ and $\phi_U$ represent the minimum and maximum horizontal angles of the AoD (AoA), and $g_L$ and $g_U$ represent the lower and upper limits of the channel gain amplitude, respectively.
Let $s_k$ be the information symbol transmitted to the $k$-th user, satisfying $\mathbb{E}[s_k] = 0$ and $\mathbb{E}[|s_k|^2] = 1$. Before transmission, $s_k$ is multiplied by the beamforming vector $\mathbf{w}_k \in \mathbb{C}^{N \times 1}$. Consequently, the total transmitted signal at the BS can be written as $\mathbf{x} = \sum_{k=1}^{K} \sqrt{P_k}\, \mathbf{w}_k s_k$, where $P_k$ denotes the transmit power allocated to the $k$-th user. Meanwhile, the smart jammer endeavors to disrupt the legitimate communication by transmitting the jamming signal $\mathbf{w}_J s_J \in \mathbb{C}^{N_J \times 1}$. As such, the RIS receives the superimposed signals and imposes the phase shift matrix $\mathbf{\Phi} = \mathrm{diag}(\beta_1 e^{j\varphi_1}, \ldots, \beta_l e^{j\varphi_l}, \ldots, \beta_L e^{j\varphi_L})$ on them, where $\varphi_l \in [0, 2\pi]$ and $\beta_l \in [0, 1]$ represent the phase shift and the amplitude of the $l$-th RIS reflective element, respectively. The received signals at the $k$-th user and at the eavesdropper then consist of the signals arriving over the direct and RIS-reflected links plus noise, where $n_{U,k} \sim \mathcal{CN}(0, \sigma_{U,k}^2)$ represents the additive white Gaussian noise (AWGN) at the $k$-th user and $n_E \sim \mathcal{CN}(0, \sigma_E^2)$ is the AWGN at the eavesdropper. The achievable rate of the $k$-th user and the wiretap rate of the eavesdropper follow from the corresponding signal-to-interference-plus-noise ratios, and the secrecy rate of the $k$-th user is the difference between the two, truncated at zero by the operator $[z]^+ = \max(z, 0)$.
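The role of the RIS in this model can be illustrated with a minimal sketch. It assumes the usual cascaded-channel form (reflected link plus direct link) and a Shannon-rate expression; the function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import cmath
import math

def ris_diagonal(phases, amplitudes=None):
    """Diagonal entries beta_l * exp(j * phi_l) of the RIS matrix Phi."""
    if amplitudes is None:
        amplitudes = [1.0] * len(phases)  # assumption: lossless reflection
    return [b * cmath.exp(1j * p) for b, p in zip(amplitudes, phases)]

def effective_channel(h_r, phi_diag, G, h_d):
    """Row vector h_r^H Phi G + h_d^H.

    h_r (length L) and h_d (length N) are the already conjugate-transposed
    rows; G is an L x N nested list.
    """
    return [
        sum(h_r[l] * phi_diag[l] * G[l][n] for l in range(len(phi_diag))) + h_d[n]
        for n in range(len(h_d))
    ]

def rate(signal_power, interference_power, noise_power):
    """Shannon rate in bits/s/Hz for a given SINR."""
    return math.log2(1.0 + signal_power / (interference_power + noise_power))

def secrecy_rate(user_rate, wiretap_rate):
    """[R_user - R_eve]^+ with [z]^+ = max(z, 0)."""
    return max(user_rate - wiretap_rate, 0.0)

# Toy case with L = 2 reflecting elements and N = 1 antenna.
aligned = effective_channel([1, 1], ris_diagonal([0.0, 0.0]), [[1], [1]], [1])
opposed = effective_channel([1, 1], ris_diagonal([math.pi, math.pi]), [[1], [1]], [1])
print(abs(aligned[0]) ** 2, abs(opposed[0]) ** 2)
```

In the toy case, co-phasing the reflected paths with the direct path raises the channel gain, while opposing phases suppress it, which is the mechanism the RIS uses both to strengthen the desired signal and to weaken jamming.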

Problem Formulation
Our objective is to maximize the achievable sum rate by jointly optimizing the transmit power allocation $\{P_k\}_{k \in \mathcal{K}}$ and the reflecting beamforming matrix $\mathbf{\Phi}$ under the imperfect CSI of the illegitimate nodes, while meeting the worst-case secrecy/achievable rate constraints. As such, the optimization problem is formulated as Problem (9), in which $R_{\mathrm{sec},k}^{\min}$ and $R_{k}^{\min}$ represent the minimum secrecy rate and the target rate of the $k$-th user, respectively. The power allocation is restricted by constraint C3 due to the limited energy supply at the BS, and $P_{\max}$ is the BS's maximum transmit power. Note that, due to the non-convexity of both the objective function and the constraints, (9) is a non-convex and non-trivial problem. Many existing traditional optimization methods, such as the SDR algorithm and the AO algorithm, obtain the solution in each time slot independently, ignoring the correlation between consecutive instants. Moreover, phase adjustment is usually discrete on practical RIS elements, which makes traditional methods no longer applicable. In addition, in the scenario we are considering, the jammer is intelligent and can change its unknown jamming strategy in real time. In order to optimize in real time and from the perspective of long-term interests, instead of directly solving this problem mathematically, we propose a robust DRL-based approach that constantly interacts with the environment containing the eavesdropper and the smart jammer to learn the optimal solution.

Robust Channel Processing
As stated in Section 2, the imperfect CSI introduces infinitely many non-convex terms into both the objective function and the constraints. With this as the focus, according to the works [28][29][30], the equivalent worst-case CSI of the illegitimate channels can be obtained by a discretization method, where $M \in \{BE, RE, \{JU, k\}\}$ and $M_N \in \{N, L, N_J\}$. The discrete CSI is obtained by uniformly discretizing all the angles in the sets $\Delta_J$ and $\Delta_U$, where $Q_1$ and $Q_2$ are the numbers of samples of $\theta$ and $\phi$, respectively. The details are omitted here for brevity and can be found in [31,32].
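The discretization idea can be sketched as a grid search over the angle ranges. This is only an illustration under stated assumptions: `metric` stands for any user-supplied function of the angles (for instance an eavesdropper channel gain), the worst case is taken as its maximum, and the toy metric below is hypothetical rather than the paper's channel model.

```python
import math

def uniform_grid(lower, upper, num):
    """Return num uniformly spaced samples covering [lower, upper]."""
    if num == 1:
        return [0.5 * (lower + upper)]
    step = (upper - lower) / (num - 1)
    return [lower + i * step for i in range(num)]

def worst_case_over_angles(metric, theta_range, phi_range, q1, q2, worst=max):
    """Evaluate metric(theta, phi) on a Q1 x Q2 grid and return the worst point.

    worst=max treats larger metric values as worse (e.g. a stronger wiretap
    channel); pass worst=min for metrics where smaller is worse.
    """
    candidates = [
        (theta, phi)
        for theta in uniform_grid(theta_range[0], theta_range[1], q1)
        for phi in uniform_grid(phi_range[0], phi_range[1], q2)
    ]
    angles = worst(candidates, key=lambda tp: metric(tp[0], tp[1]))
    return angles, metric(angles[0], angles[1])

def toy_gain(theta, phi):
    """Hypothetical stand-in for an angle-dependent channel gain."""
    return math.cos(theta) * math.cos(phi)

angles, value = worst_case_over_angles(
    toy_gain, (0.0, math.pi / 2), (0.0, math.pi / 2), q1=5, q2=5
)
print(angles, value)
```

Larger values of `q1` and `q2` tighten the approximation of the continuous uncertainty set at the cost of more evaluations per step.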

Overview of DRL
DRL combines the feature-learning capability of deep learning (DL) with the decision-making capability of RL. It comprises two fundamental constituents: the agent and the environment. The agent continuously improves its strategy by receiving feedback through interactions with the environment to achieve the maximum return. This learning process is described as an MDP [33]. The MDP framework can be defined by a tuple $\{\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}\}$. Herein, $\mathcal{S}$ represents the state space, i.e., the set of observations characterizing the environment. $\mathcal{A}$ denotes the set of possible actions. $\mathcal{P}$ is the state transition probability, i.e., the distribution of the next state $s_{t+1}$ given the action $a_t$ taken in the current state $s_t$. Lastly, $\mathcal{R}$ is the immediate reward, which provides the quality evaluation $r_t(s_t, a_t)$ of the state-action pair $(s_t, a_t)$. At each time step $t$, the agent obtains the state $s_t \in \mathcal{S}$ from the environment and executes an action $a_t \in \mathcal{A}$ according to the policy function $\pi(a_t | s_t) = \Pr(A_t = a_t | S_t = s_t)$. Subsequently, the environment transits to a new state $s_{t+1}$ with probability $P(s_{t+1} | s_t, a_t) = \Pr(S_{t+1} = s_{t+1} | S_t = s_t, A_t = a_t)$; in the meantime, the agent receives the immediate reward $r_t \in \mathcal{R}$. The agent aims at learning a strategy that maximizes the long-term reward, i.e., the cumulative discounted future reward $U_t = \sum_{\tau=0}^{\infty} \gamma^{\tau} R_{t+\tau+1}$, where $\gamma \in [0, 1]$ is the discount factor. Therefore, the tuples $(s_1, a_1, r_1, s_2, \ldots, s_{t-1}, a_{t-1}, r_{t-1}, s_t)$ constitute the trajectory in an episode used for the iterative updating of the agent.
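As a small worked example of the cumulative discounted reward, the following sketch evaluates the sum for a finite list of rewards (a truncated episode, which is an assumption since the definition above is an infinite sum). The function name is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**tau * rewards[tau], computed backwards for stability."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```

A discount factor close to 1 makes the agent weigh long-term rewards heavily, while a value close to 0 makes it nearly myopic.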
To apply the proposed algorithm to our problem, we first reformulated Problem (9) into an MDP framework. The corresponding elements of the MDP problem are specified as follows:
State S: The state $s_t$ fed back from the RIS-UAV-assisted communication system is defined in terms of quantities including $h_k$ and $h_e$, which denote the composite channel coefficients of the $k$-th user and the eavesdropper, respectively.
Action A: Based on the current state $s_t$, the agent needs to make a coordinated decision on the phase shifts at the RIS and the power allocation at the BS. Hence, the action $a_t$ at each time step $t$ is composed of the increments $\Delta\varphi_l$ and $\Delta P_k$, where $\Delta\varphi_l \in \{-\frac{\pi}{4}, 0, \frac{\pi}{4}\}$ is the variable for the phase shift of the $l$-th reflection element and $\Delta P_k \in \{-p, 0, p\}$ is the variable for assigning the $k$-th user's transmit power.
Reward R: Our goal was not only to maximize the achievable rate, but also to satisfy the system safety performance requirements. Therefore, we designed the composite reward function in (16). In (16), the base reward is the sum of the rates of all users, and when the constraints in (17) or (18) are not satisfied, we add a penalty term to steer the agent's behavioral strategy closer to our needs. The coefficients $\rho_1$ and $\rho_2$ are positive constants.
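The action and reward elements above can be sketched as follows. Several details are assumptions made only for illustration: phases are wrapped into $[0, 2\pi)$, powers are clipped at zero and rescaled when their sum exceeds $P_{\max}$, the penalty is a fixed cost per violated constraint, and the pairing of $\rho_1$ with the secrecy constraint and $\rho_2$ with the rate constraint is a guess. The function names are hypothetical.

```python
import math

PHASE_STEPS = (-math.pi / 4, 0.0, math.pi / 4)  # allowed phase increments

def apply_action(phases, powers, d_phases, d_powers, p_max):
    """Apply per-element phase increments and per-user power increments."""
    new_phases = [(phi + d) % (2 * math.pi) for phi, d in zip(phases, d_phases)]
    new_powers = [max(p + d, 0.0) for p, d in zip(powers, d_powers)]
    total = sum(new_powers)
    if total > p_max:  # assumption: rescale to respect the power budget
        new_powers = [p * p_max / total for p in new_powers]
    return new_phases, new_powers

def composite_reward(rates, secrecy_rates, r_min, r_sec_min, rho1, rho2):
    """Sum rate minus a fixed penalty for every violated constraint."""
    base = sum(rates)
    secrecy_violations = sum(1 for r in secrecy_rates if r < r_sec_min)
    rate_violations = sum(1 for r in rates if r < r_min)
    return base - rho1 * secrecy_violations - rho2 * rate_violations

phases, powers = apply_action(
    [0.0], [1.0, 1.0], [PHASE_STEPS[0]], [1.0, 0.0], p_max=2.0
)
reward = composite_reward([1.5, 0.8], [0.6, 0.2], 1.0, 0.5, 2.0, 2.0)
print(phases, powers, reward)
```

In the example, the second user violates both the secrecy and the rate threshold, so the sum rate of 2.3 is reduced by both penalties.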
In DRL, a well-known function measuring the expected return for the agent executing action $a_t$ in state $s_t$ under the policy $\pi$ is the action value function $Q^{\pi}(s_t, a_t; w)$, where $w$ represents the parameters of the deep neural networks (DNNs). In the learning process, the agent intends to find the optimal policy $\pi^*$, which yields the optimal Q function $Q^*$. To this end, the optimal Q function can be iteratively approximated by updating the parameter $w$ using the temporal difference (TD) algorithm, where $\alpha \in (0, 1)$ is the learning rate for the update on $w$ and $\nabla_w L(w)$ is the gradient of the loss function $L(w)$ with respect to $w$. In the loss function, $r_t + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a; w)$ refers to the TD target value.
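For reference, the standard textbook forms of these quantities, written with the symbols defined above, are given below. The exact scaling of the loss (for example a factor of 1/2 or an average over a mini-batch) is an assumption.

```latex
\begin{align}
Q^{\pi}(s_t, a_t) &= \mathbb{E}_{\pi}\left[ U_t \mid S_t = s_t, A_t = a_t \right], \\
Q^{*}(s_t, a_t) &= \max_{\pi} Q^{\pi}(s_t, a_t), \\
w &\leftarrow w - \alpha \nabla_{w} L(w), \\
L(w) &= \left( r_t + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a; w) - Q(s_t, a_t; w) \right)^{2}.
\end{align}
```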

Joint Power Allocation and Reflecting Beamforming Using Noisy-D3QN-PER
Prevailing reinforcement learning techniques, such as Q-learning, the policy gradient, and the deep Q-network (DQN), have demonstrated notable accomplishments in diverse control tasks. However, regarding the safety beamforming policy requirements discussed in Section 2, the policy gradient algorithm is inadequate for addressing Problem (9), as it targets continuous action space optimization and may converge to suboptimal solutions [34]. Furthermore, although the DQN performs well in environments characterized by high-dimensional continuous state spaces and discrete action spaces, it remains plagued by several inherent limitations, which adversely affect algorithmic efficacy [35]. Therefore, the Noisy-D3QN-PER algorithm shown in Figure 2 was developed to deal with the challenges in this paper; it overcomes the constraints associated with the aforementioned methods and significantly enhances the attainable performance.
It is noteworthy that a significant disadvantage inherent to the DQN algorithm is overestimation of the Q function value. The overestimation issue is primarily attributable to two factors. First, the maximization operation causes the target value to overestimate the true value. Second, bootstrapping propagates the bias. In order to address this issue, the double-DQN was adopted in the algorithm [36]. We applied another neural network, i.e., the target network $Q^{\pi}(s_t, a_t; w^-)$, whose architecture is identical to that of the primary network, but whose parameter $w^-$ is different from $w$. Specifically, the primary network is used to choose the action that maximizes the output of the Q function, $a^* = \arg\max_{a \in \mathcal{A}} Q(s_{t+1}, a; w)$, and then the target network calculates the TD target value $r_t + \gamma Q(s_{t+1}, a^*; w^-)$ with the selected action. Thus, the primary network parameter is updated with a loss function built on this TD target. Subsequently, the target network parameter $w^-$ is updated from $w$ at regular intervals.
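The decoupling of action selection and action evaluation in the double-DQN can be illustrated with plain lists of Q values. This is a minimal sketch: the `done` flag for terminal states and the function names are assumptions added for completeness.

```python
def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def dqn_target(reward, gamma, q_next_target, done=False):
    """Classical target: select and evaluate with the same network."""
    return reward if done else reward + gamma * max(q_next_target)

def double_dqn_target(reward, gamma, q_next_primary, q_next_target, done=False):
    """Select a* with the primary network, evaluate it with the target network."""
    if done:
        return reward
    a_star = argmax(q_next_primary)
    return reward + gamma * q_next_target[a_star]

q_primary = [1.0, 3.0, 2.0]  # Q(s_{t+1}, a; w)
q_target = [5.0, 0.5, 4.0]   # Q(s_{t+1}, a; w^-)
print(dqn_target(1.0, 0.9, q_target))
print(double_dqn_target(1.0, 0.9, q_primary, q_target))
```

In this toy case, the classical target picks the largest target-network value, whereas the double-DQN target evaluates the action preferred by the primary network, which yields a smaller and less optimistic estimate.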
In order to further enhance the algorithm's performance, we incorporated the dueling layer [37], resulting in the dueling double-DQN (D3QN). The core concept underlying the dueling layer is the decomposition of the optimal action value $Q^*$ into the optimal state value $V^*$ and the optimal advantage $D^*$. As such, the optimal advantage function is $D^*(s, a) = Q^*(s, a) - V^*(s)$. The benefit of modeling the state value function and the advantage function separately is that, in some situations, the agent only needs to pay attention to the value of the state and not to the differences caused by different actions. More specifically, in the optimization problem we are considering, the state values differ greatly, while the actions in the same state differ little. The agent can therefore focus on the differences in the advantage values of different actions, which makes the algorithm converge more stably. As shown in Figure 3, the dueling layer comprises two distinct neural networks. The neural network $D(s, a; w_D)$ is an approximation of the optimal advantage function $D^*(s, a)$, and the other neural network, $V(s; w_V)$, is an approximation of the optimal state value function $V^*(s)$. The optimal action value function can then be approximated by the neural network $Q(s, a; w) = V(s; w_V) + D(s, a; w_D) - \mathrm{mean}_{a' \in \mathcal{A}} D(s, a'; w_D)$.
In addition, there is a dilemma between exploration and exploitation in RL that greatly affects the performance of the algorithm. By gathering sufficient information, the agent can reach better long-term strategies at the expense of some short-term benefits. To attain a good tradeoff between exploration and exploitation, several basic strategies have been proposed, such as Boltzmann exploration and the ε-greedy policy. However, these methods only utilize action dithering, which results in a low exploration rate, especially in complex and unstable environments. Therefore, we adopted the NoisyNet technique to improve the exploration efficiency, i.e., adding parameterized noise to the DNN layers [38]. Specifically, as shown in Figure 4, the weight parameter $w$ of the DNN is replaced with $w = \mu + \sigma \circ \xi$, where $\mu$ and $\sigma$ are learnable parameters that denote the mean and standard deviation, respectively, and $\xi \sim \mathcal{N}(0, 1)$ is the noise. Here, $\circ$ denotes element-wise multiplication. Hence, the Q function is written as $Q(s, a, \xi; \mu, \sigma)$, and the loss function is rewritten accordingly with $a^* = \arg\max_{a \in \mathcal{A}} Q(s_{t+1}, a, \xi; \mu, \sigma)$, where the noise value $\xi'$ is different from $\xi$. In the training process, noise is added to the parameters to force the algorithm to minimize the error under noisy parameters, which means that it is forced to tolerate parameter disturbances. It does not matter if the parameters are not strictly equal to the mean; as long as the parameters are in the neighborhood of the mean, the prediction made by the agent remains reasonable. Therefore, the NoisyNet not only enhances exploration, but also improves the robustness of the algorithm.
Experience replay is often utilized in the classical DQN to store and uniformly sample experience transitions, which helps in reusing experiences and breaking the correlation between consecutive transitions. However, due to the uncertainty of the jammer's strategy, different transitions differ in importance, and uniform sampling may be ineffective. Hence, we adopted prioritized experience replay (PER) to make the algorithm learn more efficiently and converge faster [39]. PER samples transitions non-uniformly, where the priority of a transition is proportional to its TD error. Therefore, the sampling probability of transition $j$ is determined by its priority, where $\alpha_j$ adjusts the importance of the priority. In addition, the loss function needs to be multiplied by importance sampling weights to counteract the bias caused by the non-uniform sampling probabilities. Thus, the parameters of the proposed algorithm are updated using a mini-batch of transitions, where $\alpha_\sigma$ and $\alpha_\mu$ are the learning rates, $m$ is the mini-batch size, $N$ represents the number of samples in the buffer, and a hyperparameter in $(0, 1)$ determines the extent to which PER affects the convergence result.
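The PER sampling step can be sketched as follows, assuming the common proportional variant in which the priority is the absolute TD error raised to an exponent and the importance sampling weights are normalized by their maximum. The parameter names `alpha`, `beta`, and `eps` are assumptions and need not match the symbols used in the paper.

```python
import random

def per_probabilities(td_errors, alpha, eps=1e-6):
    """Sampling probability of each transition, proportional to |TD error|**alpha."""
    priorities = [(abs(delta) + eps) ** alpha for delta in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]

def importance_weights(probabilities, beta):
    """Importance sampling weights (N * P(j))**(-beta), scaled so the largest is 1."""
    n = len(probabilities)
    raw = [(n * p) ** (-beta) for p in probabilities]
    largest = max(raw)
    return [w / largest for w in raw]

def sample_indices(probabilities, batch_size, rng):
    """Draw a mini-batch of transition indices according to the priorities."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=batch_size)

probs = per_probabilities([3.0, 1.0], alpha=1.0, eps=0.0)
weights = importance_weights(probs, beta=1.0)
batch = sample_indices(probs, batch_size=4, rng=random.Random(0))
print(probs, weights, batch)
```

Transitions with a large TD error are drawn more often but receive a smaller weight in the loss, which is how the importance sampling correction offsets the sampling bias.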
The detailed training process of the Noisy-D3QN-PER algorithm is shown in Algorithm 1. At the beginning of the training, we sample new channel realizations and randomly choose the phase shifts and power allocation to compute the first state $s_0$. Since the NoisyNet is inherently random, exploration is encouraged. Based on the current state $s_t$, the ε-greedy policy is implemented to select action $a_t$ and, subsequently, receive the feedback reward $r_t$ and the next state $s_{t+1}$. The transition $(s_t, a_t, r_t, s_{t+1})$ is saved in the experience replay buffer $\mathcal{D}$. After enough experience transitions have been stored, the training of the primary networks starts, and mini-batch transitions are selected according to the PER principle and fed into the neural networks to obtain the loss function according to Equation (29). Then, the parameters of the primary networks are updated by the Adam optimizer according to Equations (31) and (32), and the target network copies the parameters of the primary networks every $T_{NET}$ time steps. In addition, each time experience transitions are sampled, the priorities of the selected transitions are updated with the new TD errors.
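Two of the network components used by the primary networks in this training process, the dueling aggregation and the noisy weights, can be sketched with plain lists. This is a simplified illustration: it uses independent Gaussian noise per weight, which is one possible choice and not necessarily the noise structure used in the paper, and the function names are hypothetical.

```python
import random

def dueling_q_values(state_value, advantages):
    """Combine V(s) and D(s, a) as Q = V + D - mean(D)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]

def noisy_weights(mu, sigma, rng):
    """Sample w = mu + sigma * xi element-wise with xi ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

q_values = dueling_q_values(2.0, [1.0, 0.0, -1.0])
rng = random.Random(0)
deterministic = noisy_weights([0.3, -0.2], [0.0, 0.0], rng)
perturbed = noisy_weights([0.3, -0.2], [0.1, 0.1], rng)
print(q_values, deterministic, perturbed)
```

Subtracting the mean advantage keeps the split between the state value and the advantages identifiable, and setting `sigma` to zero recovers an ordinary deterministic layer.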

4: for each step t = 1, 2, ..., T do
5:     Select action a_t using the ε-greedy policy, i.e., select the action that yields the largest action value with a probability of 1 − ε, or randomly select from all the possible actions with a probability of ε.
6:     Receive an instantaneous reward r_t, and obtain the next state s_{t+1}.
7:     Store the experience transition (s_t, a_t, r_t, s_{t+1}).
       Sample mini-batch transitions based on PER using (30), and then update the priorities of the selected transitions based on their TD errors.
10:    Calculate the loss function for the mini-batch according to (29).
11:    Perform gradient descent, and update the parameters of the primary networks using (31) and (32).
12:    if t mod T_NET = 0 then
13:        The target network copies the parameters of the primary networks.
    end for
17: end for
Ensure: joint power allocation and RIS phase shift design strategy.
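The action selection in step 5 and the periodic target update in steps 12 and 13 can be sketched as follows. The function names are hypothetical, and the Q values are passed in as a plain list rather than computed by a network.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def should_sync_target(step, t_net):
    """True when the target network should copy the primary parameters."""
    return step % t_net == 0

rng = random.Random(0)
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)
random_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=1.0, rng=rng)
sync_steps = [t for t in range(1, 11) if should_sync_target(t, t_net=5)]
print(greedy_action, random_action, sync_steps)
```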

Simulation Results
This section presents an evaluation of the Noisy-D3QN-PER algorithm. We varied the maximum transmission power $P_{\max}$ between 10 dBm and 30 dBm. The numbers of antennas at the BS and the jammer were $N = N_J = 64$, and the number of users was $K = 2$. The fixed deployment height of the RIS-UAV was 100 m. The minimum secrecy rate and target data rate were $R_{\mathrm{sec},k}^{\min} = 0.5$ bits/s/Hz and $R_{k}^{\min} = 1$ bits/s/Hz, respectively. The background noise power at each user and at the eavesdropper was set to $\sigma_{U,k}^2 = \sigma_E^2 = -90$ dBm. All involved neural networks were fully connected. The learning rates $\alpha_\sigma$ and $\alpha_\mu$ were both set to 0.001. The exploration rate ε was initialized to 1 and then linearly annealed to 0.1. The parameters $\rho_1$ and $\rho_2$ in (16) were set to $\rho_1 = \rho_2 = 2$. The replay buffer size was $D = 100{,}000$, and the mini-batch size was $m = 32$. In addition, the jammer chose its power between 10 dBm and 30 dBm based on its own jamming strategy, which the BS could not access. Besides, we chose three conventional approaches as benchmarks, namely the classical DQN, the DDQN, and the optimal transmit power allocation without the RIS. All of the displayed results are averages over 100 independent runs.
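Two of these settings translate directly into small helper functions: the conversion of dBm values to watts and the linear annealing of the exploration rate. The annealing horizon `total_steps` is a hypothetical parameter, since the length of the schedule is not stated above.

```python
def dbm_to_watt(power_dbm):
    """Convert a power level from dBm to watts."""
    return 10 ** ((power_dbm - 30) / 10)

def linear_epsilon(step, total_steps, eps_start=1.0, eps_end=0.1):
    """Linearly anneal epsilon from eps_start to eps_end over total_steps."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)

print(dbm_to_watt(30), dbm_to_watt(10), dbm_to_watt(-90))
print(linear_epsilon(0, 1000), linear_epsilon(500, 1000), linear_epsilon(2000, 1000))
```

With these conversions, the power budget of 10 dBm to 30 dBm corresponds to 10 mW to 1 W, and the noise power of −90 dBm corresponds to 1 pW.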
Figure 5 shows the average reward of the Noisy-D3QN-PER algorithm and the benchmark algorithms. It can be observed that, in the initial phase of training, the algorithms obtained approximately the same reward. However, after 100 episodes of training, the Noisy-D3QN-PER algorithm achieved significantly higher rewards and faster convergence than the benchmark algorithms. This was because the prioritized experience replay and the dueling layers included in the proposed algorithm were better able to adapt to the dynamic and complex interference environment. Specifically, the dueling layer helps to analyze the state bias due to the unknown jammer power and unknown location information, and the NoisyNet encourages the exploration of more reflecting beamforming strategies for higher long-term benefits. Moreover, it can be observed that both the DDQN and the proposed algorithm outperformed the classical DQN, which suggests that the use of the DDQN can effectively mitigate the overestimation problem.
Figure 6 shows the achievable sum rate with varying maximum transmit power $P_{\max}$. Here, we set $L = 64$. As expected, the proposed algorithm outperformed the other approaches. This was because the dueling layers, which model the advantage function and the state value function separately, can better focus on states that are less correlated with the current strategy-action relationship and better predict the jammer's strategy when the transmit power changes. Besides, the NoisyNet can prevent the proposed algorithm from becoming stuck at undesired suboptimal solutions. It can also be observed that the three RIS-UAV-assisted approaches obtained a much higher achievable rate than the approach without the RIS, which indicates that deploying the RIS-UAV can efficiently enhance the secure performance. To elaborate on this, the system can enhance the desired signals at the users and suppress the jamming signal by adjusting the reflecting beamforming at the RIS.
To further highlight the security performance enhancement of the proposed algorithm, the security requirement satisfaction probability (the probability that the rate constraints are satisfied [27,28]) of the different approaches is shown in Figure 7. It is evident from the figure that the security performance of the optimal power allocation (PA) without the RIS cannot be guaranteed when $P_{\max}$ is low, and its security protection did not improve until $P_{\max}$ was raised to a certain value. However, the other approaches with the RIS-UAV obtained satisfactory performance at different $P_{\max}$, which further confirmed the superiority of deploying the RIS-UAV in wireless communication systems. Furthermore, it is noteworthy that the proposed algorithm achieved the best result compared to the other conventional approaches. This can be explained by the fact that the comparison approaches usually fell into suboptimal solutions, which only increased the achievable sum rate, but ignored the security performance requirement. However, due to the adopted NoisyNet and the security-aware reward function, the proposed Noisy-D3QN-PER algorithm can explore strategies and strike a desirable balance between the security performance and the achievable rate.

Conclusions
This paper delved into the joint optimization of power allocation and reflecting beamforming for secure communication via RIS-UAV assistance with imperfect CSI. Specifically, the original optimization problem was formulated into an MDP framework and solved by a Noisy-D3QN-PER algorithm, in which the agent estimates the unknown jamming strategy by constantly interacting with the environment, quickly adapts to the dynamic environment, and finally obtains a policy that maximizes the achievable rate and meets the system security performance requirements. This provides technical support for realizing intelligent RIS-assisted robust beamforming systems. The numerical results confirmed the superiority of the proposed Noisy-D3QN-PER algorithm over existing conventional approaches in improving the achievable sum rate and the system security performance. Although the method proposed in this paper can effectively resist jamming attacks under CSI uncertainty, it is still necessary to know the variation range of the interference. Future work will focus on two aspects: one is to study anti-jamming methods that require no interference information; the other is to explore AI interpretability to improve the trustworthiness and effectiveness of the AI method.

Figure 2. The process of the Noisy-D3QN-PER algorithm.

Figure 3. Dueling layer. The term $\mathrm{mean}_{a \in \mathcal{A}} D(s, a; w_D)$ ensures the stability of the parameters in the training process, and $w \triangleq (w_V; w_D)$, since, at each iteration, the function $V(s; w_V)$ is updated, which also affects the action values of the other actions.

Figure 5. Average reward of the Noisy-D3QN-PER algorithm and other comparison approaches.

Figure 6. Achievable sum rate with varying maximum transmit power $P_{\max}$.