Joint Beamforming and Phase Shifts Design for RIS-Aided Multi-User Full-Duplex Systems in Smart Cities

Full-duplex (FD) and reconfigurable intelligent surface (RIS) are potential technologies for achieving wireless communication effectively. Therefore, in theory, the RIS-aided FD system is supposed to enhance spectral efficiency significantly for the ubiquitous Internet of Things devices in smart cities. However, this technology additionally induces the loop-interference (LI) of RIS on the residual self-interference (SI) of the FD base station, especially in complicated urban outdoor environments, which will somewhat counterbalance the performance benefit. Inspired by this, we first establish an objective and constraints considering the residual SI and LI in two typical urban outdoor scenarios. Then, we decompose the original problem into two subproblems according to the variable types and jointly design the beamforming matrices and phase shifts vector methods. Specifically, we propose a successive convex approximation algorithm and a soft actor–critic deep reinforcement learning-related scheme to solve the subproblems alternately. To prove the effectiveness of our proposal, we introduce benchmarks of RIS phase shifts design for comparison. The simulation results show that the performance of the low-complexity proposed algorithm is only slightly lower than the exhaustive search method and outperforms the fixed-point iteration scheme. Moreover, the proposal in scenario two is more outstanding, demonstrating the application predominance in urban outdoor environments.


Introduction
With the recent progress in requirement definitions and developing potential technologies of the sixth generation (6G) wireless communication, the capacity of the 6G network will increase by nearly 1000 times compared with the fifth generation (5G) to support the ubiquitous Internet of Things (IoT) devices [1][2][3].Full-duplex (FD) technology (i.e., in-band co-time co-frequency) can double the spectral efficiency (SE) at most compared with the traditional time division duplex (TDD) or frequency division duplex (FDD).Thus, it is regarded as one of the Beyond 5G/6G candidate technologies [1,[4][5][6].On the other hand, the reconfigurable intelligent surface (RIS) has reflection elements with programmable super-atomic structure and ultra-low power consumption, which can manipulate the signals' reflection and scattering scenes to improve coverage and quality of service (QoS).Moreover, it can reduce energy consumption compared with conventional relays [7].In view of the revolutionary technology that endows network entities with reconfigurable properties, RIS has become a promising technology for 6G networks [8,9] and could be extensively applied in the coverage tribulations circumstances [10,11].Overall, the united technology of FD and RIS can theoretically obtain two-fold performance gains, including time-frequency domain multiplexing and signal enhancement.So, RIS-aided FD technology Sensors 2024, 24, 121 2 of 27 is expected to be applied in a wide range of IoT devices in smart cities that require superior performances in terms of high data transmission rate and extensive coverage [3,12].
It is known that FD technology is limited by strong self-interference (SI).Although the existing SI cancellation technology (i.e., active cancellation and passive cancellation) [5] can eliminate the SI to approximate the noise floor [13], residual SI still exists due to hardware impairment [14].With the RIS introduced, the residual SI can be mitigated to a degree.However, the FD system additionally affiliates with the loop-interference (LI) caused by the transmit signal rebounded via RIS unexpectedly.Particularly when IoT devices are located in complicated urban outdoor environments, including severe attenuation, reflections, and blockages [15], these two types of interference tend to occur frequently, which will not give full play to the potential performance of the RIS-aided FD systems.
Scholars often model an SE or energy efficiency (EE) optimization to overcome interference-related issues.To tackle the residual SI in the FD system, the scholars of [16] constructed a maximum SE problem concerning sub-carrier and power allocation.They transformed the non-convex problem into a convex problem through first-order Taylor approximation.The authors of [17,18] mainly studied the allocation of FD antennas and the design of beamforming strategy to construct sum rate maximization problems.Specifically, in [17], the maximum ratio transmission and genetic algorithms were used to solve the beamforming and antenna selection concerned subproblems, respectively.The authors in [18] devised a block coordinate descent method to solve the multi-variable optimization problem alternately.Refs.[16][17][18] proved that the optimized FD system with residual SI was still superior to the half-duplex (HD) system, such as TDD, in improving the performance of SE.Meanwhile, RIS is a disruptive technology that is greatly concerned by many scholars.Therefore, the scholars of [19] introduced RIS to formulate a target optimization problem with respect to channel cascade.They obtained the global EE optimum by alternating optimization of fixed variables.Other scholars applied RIS to FD/HD relay scenarios [20].They established a problem of minimizing the required power of the base station (BS) and relay to enhance EE, where the QoS constraints were also covered.To solve the two scenarios involved problems, they proposed the semi-definite programming and the maximum weakest hop-signal-noise-ratio methods.Simulation results showed that RIS joined with the FD relay was superior to other cases.Based on the previous inferences, the authors of [21][22][23][24][25][26] considered the union of RIS and FD technology and demonstrated the performance gains.To name a few, the authors of [21] proposed an SI mitigation method in RIS-assisted FD system, thus mitigating the strong SI to feed in the analog-to-digital converters of finite bit resolutions.The authors of [22] formulated the mathematical expressions of outage probability and ergodic capacity.They concluded that RIS could indirectly diminish the adverse consequence of the residual SI and ameliorate the performance in the FD system.Other scholars designed the minimized transmit power objective with active and passive beamforming to strengthen EE by suppressing interference [23].The work of [24] discussed the influence of different numbers of reflection elements and receiving antennas on FD system performance.
Deep reinforcement learning (DRL) can allow a wireless communication system agent to seek the optimal policy by observing the reward without a priori knowledge.Accordingly, the mathematically intractable problems could be settled by the agent interacting with the environment [27][28][29].The authors established the objective by considering SE and energy harvest in the FD system [30].Then, they devised a hybrid deep deterministic policy gradient and deep double Q-learning network approach to train the networks regarding different variables, respectively.Based on DRL, the authors of [31] proposed a maximizing entropy scheme to solve active and passive beamforming.In [32], neural epsilon-greedy, deep Q-learning network, upper confidence bound, and other DRL approaches were considered.They took the sum rate of the RIS system as the objective and proved that the trained artificial neural network (NN) could improve performance compared with some traditional non-convex algorithms on certain occasions.The authors of [25] employed a Sensors 2024, 24, 121 3 of 27 DRL method to work out the single and distributed RIS-related issues.They demonstrated that DRL could reduce the optimum performance loss without pre-relaxation.
Although several previous studies have focused on the performance enhancement of FD RIS-aided systems, they did not discuss the influence of LI with different user wireless environments and RIS locations.More importantly, to the best of our knowledge, there are no two-step algorithms combined with a closed-form solution and a learning method in multi-user RIS-aided FD systems in the previous works.Furthermore, the discrete phase shifts model of RIS is consistent with the RIS physical realization, which is more practical than the continuous phase shifts model.Motivated by this, we have studied the RIS-aided FD system with discrete phase shifts, considering the effects of residual SI, LI, and RIS location in different urban outdoor scenarios.Then, we devise a two-step solution to the formulated non-convex problem.The main contributions are as follows: • We introduce a discrete phase shifts RIS model in the blockage of the line-of-sight (LoS)  outdoor environment of smart cities.The objective and constraints concerning the residual SI and LI are formulated according to two typical scenarios.Under the two scenarios, we can focus on the influence of the primary interference.Next, a two-step algorithm based on the variable types is proposed, and we can further emphasize the proposal advantage through the scenario handoff.

•
Specifically, the original optimization problem is decomposed into two subproblems in light of the type of variables.We attempt to optimize the transmitting and receive beamforming matrices through fixed phase shifts in subproblem one.Due to the non-convexity of this problem, we design a novel successive convex approximation (SCA) method to obtain the approximate convex lower bound of the objective function and constraints.Therefore, the original non-convex problem is transformed into a convex one that can be directly solved.• Then, we tackle subproblem two to optimize the phase shifts vector via given beamforming matrices.In view of the non-convex optimization for the discrete variable to be solved, we develop a discrete soft actor-critic (SAC) algorithm based on DRL.We seek to maximize the reward to obtain the optimal sum SE by defining the corresponding action, state, and reward.Remarkably, our devised DRL-based method only involves discrete phase shifts, dramatically reducing the dimensions of the action space.Additionally, the state of the environment is a vector consisting of signal to interference-plus-noise ratio (SINR) of each IoT device, which can be efficiently applied to the multi-user case.

•
Finally, after iteratively optimizing the beamforming matrices and phase shifts vector, we evaluate the performance of the proposed algorithm from various perspectives and draw relevant conclusions.To be specific, the extensive simulation results show that the low-complexity proposal performance is second only to the exhaustive search method and outweighs the fixed-point iteration baseline.Particularly, the proposed algorithm performs outstandingly in scenario two, demonstrating the superiority of our proposal to mitigate interference in the complicated urban outdoor environment.
The rest of this paper is organized as follows.The system model and problem formulation are described in Section 2. Section 3 presents our proposal that concerns a joint beamforming and phase shifts design.The computational complexity is also provided.Section 4 discusses numerical results to evaluate the performance of the proposed algorithm.A conclusion and future issues are given in Section 5.
Notations: For a general matrix A, A H , and ∥A∥ denote the Hermitian and Frobenius norm of A. The subscript/superscript t, r, d, and u indicate transmitting, receiving, downlink (DL), and uplink (UL) related.E[•] represents the expectation.We signify the conjugate and real part of a complex number by (•) * and Re(•), respectively.For a general vector x, diag{x} denotes a diagonal matrix with diagonal elements x.The modulus of x is denoted by |x|.C a×b represents the space of a × b complex number matrix. 1 and 0 denote a vector or matrix where all elements are one and zero.An identity matrix is represented by the letter I. CN denotes a complex Gaussian distribution.Other calligraphy upper-case letters, such as J and K, stand for the sets.The letter of superscript apostrophes, such as k ′ , indicates the element of the supplementary set of {k ∈ K, k ̸ = k ′ }.The partial derivative of f (x) to x is signified by ∇ x f (x).
The acronyms used in this paper are given in Table 1.

System Model and Problem Formulation
In this section, we first describe a multi-user RIS-aided FD system, as presented in Figure 1, exploiting RIS to promote FD performance in the blockage of LoS urban outdoor scenario.Then, by describing the transmission model of two typical urban outdoor scenarios, we focus on the problem of maximizing the sum SE in the simultaneous DL and UL data transmission.The problem formulation is characterized as a joint design of transmitting and receive beamforming matrices and phase shifts vector.Sensors 2024, 24, 121 5 of 27

System Overview
Figure 1 describes a multi-user RIS-aided FD system, which composes one BS equipped with N t transmitting and N r receiving antennas, (J + K) single-antenna IoT devices, and one RIS.BS works in FD mode, while the IoT devices, in terms of service type, are divided into J UL IoT devices and K DL IoT devices, operating in HD mode.The sets of UL and DL IoT devices are represented by J = {1, 2, . . . ,J} and K = {1, 2, . . . ,K}, where we let IoT j (∀j ∈ J ) and IoT k (∀k ∈ K) denote the j-th UL IoT device and k-th DL IoT device for simplification, respectively.Assuming that the BS and all IoT devices are entirely blocked by obstacles such as large buildings, the direct link between them can be ignored due to unfavorable transmission conditions [33].We suppose that a RIS with L reflection elements is deployed beyond the obstacles.With the help of the RIS, we can maintain the direct links of BS-RIS and RIS-IoT devices, thus controlling the respective transmission wave characteristics, such as scattering, reflection, and refraction.In this way, RIS can reconstruct and enhance the desired signal for BS and devices [34].Meanwhile, the controller connects with the RIS to programmatically operate each reflection element and communicates with BS through a dedicated channel.
As shown in Figure 1, the BS-RIS DL channel matrix, RIS-BS UL channel matrix, RIS-IoT k DL channel vector, and IoT j -RIS UL channel vector are denoted as , respectively.The BS residual SI matrix is represented as H SI ∈ C N t ×N r .We assume that the above channel state information (CSI) of all channels involved is known, which allows us to investigate the upper bounds for the performance [35,36].The l-th reflection element's coefficient is regarded as β l e jϕ l , where l belongs to the set L = {1, 2, . . . ,L} of reflection elements.β l and ϕ l are the related amplitude and phase.Considering that the reflection element is a passive device without an external power amplifier, we usually let β l = 1 [37].Thus, the diagonal matrix of coefficients is denoted as Φ = diag e jϕ 1 , e jϕ 2 , . . ., e jϕ L ∈ C L×L .For the convenience of matrix operation in this paper, we fetch the diagonal entries of Φ and reshape a new phase shifts vector ϕ = [e jϕ 1 , e jϕ 2 , . . ., e jϕ L ] ∈ C 1×L .Note that, the phase shifts are discrete values limited by the diode mechanism in engineering practice, such as {0, 2π/2 b , 2 • 2π/2 b , . . ., (2 b − 1) • 2π/2 b , where the phase shift resolution b determines the phase shift accuracy.

Transmission Model
In this subsection, we focus on the DL/UL data transmission process.

DL Transmission Model
Since we introduce LI of RIS in this paper (see Figure 1 on the IoT devices side), the received signal at IoT k is expressed as where In (1), the first term represents the desired received signal at IoT k .The second and third terms mean the regular DL and UL co-channel multi-user interference without the unexpected rebound signal.Unlike the common interference, the fourth term denotes LI caused by UL transmitted signals rebounding on the IoT devices side.
We observe that (1) contains several types of interference.The mixture of different kinds of interference is not conducive to studying the respective influencing factors of performance.Thus, the DL transmission model can be approximated in terms of two typical scenarios, according to the actual geographical location of IoT devices in a complicated urban outdoor environment, as follows.

•
Scenario one: The IoT devices are relatively open to each other in a local region like a town square, where no barriers are between them.

•
Scenario two: The IoT devices are located in a residential area and separated by smallsized obstacles, such as low residences and trees (Though the real situation in smart cities may be a hybrid of scenarios one and two, the research through two typical cases under extreme conditions can well extend to the general situation).
To be specific, in scenario one, even the RIS located at the IoT devices side, the direct IoT j -IoT k path loss is smaller than the IoT j -RIS-IoT k path loss of loopback due to the double fading effect.The influence of the third term of (1) is greater than the fourth term, which becomes the dominating factor of performance deterioration.Considering the subordinate influence of LI among interference and the insignificant performance improvement by deliberately optimizing RIS for restraining LI in scenario one, we can weaken the LI effect and approximate the reconfiguration of LI by RIS as a constant term for ease of simplicity.Thus the LI is approximated as an equivalent AWGN, whose intensity depends on the average distance between RIS and IoT devices.On the contrary, for scenario two, the channel gains between IoT devices are relatively small because of the scattered small-sized obstructions, so the direct link interference of UL to DL IoT devices (simplified as UL to DL interference (UDI) below) can be seen as equivalent AWGN in comparison to LI.On this occasion, the effect of LI should be elaborately studied.Overall, scenarios one and two mainly focus on UDI and LI, respectively, which lets us do careful research about the influence of different major interference and not overlook either interference.
In the light of scenarios one and two, (1) can be rewritten as where y d k,i , w k,i , and ϕ i represent the received signal, the transmitting beamforming vector, and the phase shifts vector in scenario i(i ∈ {1, 2}).nd k,i denotes the aggregated AWGN of IoT k , detailed as where n LI,k and n UDI,k indicate the equivalent AWGN of LI and UDI in scenarios one and two, respectively.

UL Transmission Model
As the SI at BS cannot be eliminated absolutely, the received signal of IoT j at BS is expressed as where H j,RB = H u diag(h u j ) ∈ C N r ×L denotes the IoT j -RIS-BS cascade channel matrix.n u j and H SI represent the AWGN of IoT j and the residual SI matrix at BS, respectively, which follow n u j ∈ C N r ×1 ∼ CN (0, σ 2 u,k 1) and H SI ∼ CN (σ SI (a/(a + 1)) 1/2 1, σ 2 SI I/(a + 1)). a is the rician factor and σ 2 SI is the SI power elimination level [38].Similar to (1), the first three terms of (5) indicate the desired signal, regular multi-user interference, and LI on the BS side, while the fourth term represents the residual SI at BS.It is worth noting that the second term-related interference produced at the BS side has the same form in each scenario.This varies from the DL received signal.
Due to the connection of the RIS controller and BS via the dedicated channel, BS can obtain the RIS reflection coefficients to effectively remove the LI on the BS side [26].Thus, the third term of (5) can be ignored.For this reason, we also do not display LI on the BS side in Figure 1.So (5) can be simplified as Since the subscript i, standing for scenario one or two, only determines the last two terms of (3a) or (3b), respectively, we omit the subscript i in the other terms for formula conciseness in the following parts.

Problem Formulation
Next, for the proposed RIS-aided FD system, we formulate the problem of SE maximation considering the joint optimization of transmitting and receive beamforming matrices and phase shifts vector under the two typical scenarios.
The DL SINR of IoT k is denoted as where Ψ d 1,k represents the DL-to-DL IoT device interference power.Ψ d 2,k,i denotes the UDI power.Meanwhile, Ψ d 3,k,i indicates the LI and AWGN power aggregated together.Similarly, the UL SINR of IoT j is represented as where Sensors 2024, 24, 121 8 of 27 Ψ u 1,j and Ψ SI,j denote the UL to UL IoT device interference and residual SI power.u j ∈ C N r ×1 is the receive beamforming vector for IoT j .
According to Shannon formula, the SE of IoT k and IoT j are recorded as where we signify transmitting beamforming matrix by W = [w 1 , w 2 , . . ., Then, the sum SE is expressed as where U = [u 1 , u 2 , . . ., u J ] ∈ C N r ×J represents the receive beamforming matrix.Mathematically, the SE maximization problem is formulated as where constraint (13b) determines the urban outdoor scenario in which the IoT devices are located.Constraint (13c) gives the BS power budget.(13d) represents the unit-modulus constraint for each reflection element of RIS.Constraints (13e) and (13f) ensure the QoS of IoT k and IoT j .It can be seen that the objective function (13a) involves a logarithmic function that includes fractions, and so do the constraints (13e) and (13f).Therefore, P1 with non-convex objective and constraints is intractable to optimize by conventional methods.

Solution to SE Maximization Problem in RIS-Aided FD System
Considering that the variables involved in P1 can be classified into continuity (such as W, U with continuous weights) and discreteness (such as ϕ with discrete phase shifts), it motivates us to adopt different schemes to solve problems with respect to different variable characteristics.Consequently, we decompose P1 into two subproblems to simplify and find the two-step solution through iteration until convergence.The rest of this section describes the optimization of subproblems one and two in developing beamforming and phase shifts, respectively.

Beamforming Design
Given the phase shifts vector ϕ * , the Ψ d 3,k,i in objective function is a constant, and unitmodulus constraint (13d) does not need to be considered.Thus, the design of beamforming matrices can be transformed into the following expression.
(13b), (13c), where Obviously, the DL SE depends on the transmitting beamforming matrix.However, the UL SE is up to the receive and the residual SI-related transmitting beamforming, which shows the high coupling relationship between W and U. Therefore, P2 makes its solution challenging due to the non-convex objective function (14a) and constraints (14b), (14c).
In view of the two coupled variables, the core idea is to combine alternating optimization with SCA to transform the non-convexity into convexity.Motivated by [39], we can approximate the lower bound by inequality transformation to acquire the convexity as follows.
First, we introduce an auxiliary variable set {λ k } to help find the lower bounds of the SINR of the DL IoT devices, which satisfies where λ k indicates the lower bound of the intended signal of IoT k .
Given the feasible point w k at the n-th iteration, we can acquire the boundary of λ k by (16b) and (17a).
From ( 18), we can easily obtain that the slack variable λ k is convex with respect to w k .
Similarly, with the feasible point λ where Sensors 2024, 24, 121 10 of 27 and γ d k is the relaxed SINR of IoT k .It is obviously known that the right-hand side of ( 19) is convex about λ k due to its quadratic form.Substituting ( 18) into (19), we can acquire the convexity of ( 19) in relation to w k .
Next, we will relax the expressions about the SINR of UL IoT devices.Similar to (17a), we bring in auxiliary variable sets {α}, β j , ω j,j ′ , υ j to approximate, which follow the supplementary constraints.
where α in (21a) acts on the slack of the BS power budget.Together, β j and α in (21b) restrict the modulus of the receive beamforming vector for IoT j and react on the residual SI. υ j in (21d) and ω j,j ′ in (21c) determine the boundaries of the desired received signal and the interference caused by IoT j' , respectively.Although (21a) is a convex second-order cone (SOC) constraint, the constraints (21b)-(21d) are non-convex.Like (18), given the feasible point Obviously, we achieve the linear constraints.Substituting (21a), (22a)-(22c) into (9), we obtain the lower bound of γ u j . where Thus, we can acquire the slacked R u j with the additional feasible point υ Sensors 2024, 24, 121 11 of 27 where and γ u j is the approximate SINR of IoT j .The right-hand side of ( 25) is convex with respect to υ j and Ψ u j , respectively.Since υ j and Ψ u j are convex about u j (see (22c) and ( 26)), we can obtain the convexity of (25) relating to u j .
Above all, we achieve the convex lower bound of (14a) via the SCA method, which is reformulated as by ( 19) and (25).
The problem P2 ′ is a convex SCA that can be optimally solved via a convex optimization tool such as CVX.
The proposed SCA algorithm for solving subproblem one is summarized in Algorithm 1.
Algorithm 1: Proposed SCA Algorithm for Problem P2 Until: The value of sum SE converges.7 Output : w * , U * .

RIS Phase Shifts Design
We now focus on the optimization of ϕ when the beamforming matrices are fixed.The problem P1 is simplified as The variable ϕ in the non-convex objective function (31a) with fractional structure is associated with DL and UL IoT devices.Moreover, P3 has an additional unit-modulus constraint.Especially in scenario two, A k,i is also relevant to ϕ (see (15)).All the mentioned properties determine that P3 is more challenging compared with P2.
Since the DRL method is implemented without slackness, it can reduce the optimal performance loss and computational complexity.Based on a model-free way, DRL can be directly used to solve tough mathematical problems.Therefore, scholars resort to the DRL method as a powerful tool to tackle wireless network issues [40,41].In addition, it works with non-labeled data sets that merit high storage efficiency [42].
SAC is a model-free DRL algorithm based on maximum entropy, which avoids local optimization via entropy regularization.Meanwhile, it applies two independent Q networks and an off-policy scheme to promote stability and learning efficiency [43], which is also suitable for discrete action space [44].

Markov Decision Process
Considering that the Markov decision process (MDP) is the theoretical cornerstone of reinforcement learning (RL) [45], we assume that the RIS controller exercises an agent role for pursuing the maximum reward through sequential decision making, thus maximizing the sum SE.Hence, we map the RIS-aided FD system with the key elements of MDP as follows.
• Action: A phase shifts vector is treated as a one-dimensional discrete action, such as an action at time step t defined as We normalize the actions via the policy network to meet the constraint (13d).It is worthwhile mentioning that the action is only relevant to the phase shifts with L elements, which improves the learning efficiency in the multi-user scenario through a reduced action space.
• State: The state vector includes the SINR of each IoT device and is represented as at time step t. s t can be acquired via the interaction between the agent RIS controller and the environment based on the given phase shifts and state of the previous time step.
Given that the characteristic dimension of the state can reduce the RL performance [46], the state vector should be whitened each time after the SINR is observed.Since the state is associated with each IoT device's wireless communication condition, it can efficiently cover the multi-user scenario.In general, the phase shifts vector to be performed at the current step only depends on the real-time conditions of each IoT device without regard to the previous action, which simplifies the interaction.
• Reward: Considering that the state is only dependent on each individual instead of the entirety, we take the reward as a guide of the global policy to the RIS controller.
With the aim of the maximum sum SE, we choose a modified objective as a reward.
where R d k,t and R u j,t indicate the return of SE at time step t for IoT k and IoT j , respectively.Notably, we bring in the penalty to inspire the agent to find a policy that satisfies the QoS constraints (31b), (31c).
In summary, when the RIS controller chooses a series of phase shifts with a certain probability, the SINR of each IoT device will be updated.Thus, the next state is transitioned with the corresponding reward acquired.The process of the RIS controller interacting with the defined wireless environment model (i.e., the urban outdoor scenario one or two) can be deemed as an MDP, visualized in Figure 2. It is noteworthy that Scenario One and Scenario Two in Figure 2 only differ in wireless environments where the IoT devices are located.This distinction will not influence the workflow of the model-free DRL method due to the interaction mechanism between the agent and the environment.The RIS controller decision making would introduce delay in practice, which takes seconds due to the computational power of the RIS controller and the magnitude of changes in the environment.Nevertheless, if the RIS controller has been saturated from learning, it will immediately make an optimal decision at the general channel condition changes.More importantly, the RIS controller can also rapidly cope with large environmental transformations over time.This is because the RIS controller would learn more about potential fluctuations of the environment in the long term, which endows the RIS controller with the capability to tackle extreme cases easily.Overall, the DRL-based method can better reflect advantage in the long run.

Mechanism of Soft Actor-Critic Learning
The SAC architecture contains actor and critic networks for action selection and evaluation.Specifically, a random policy network constitutes the actor network.In addition, the critic network consists of online and target subnetworks, which include two online and two target Q networks, respectively.The online and target Q networks have the same structure but differ in the update method and frequency.The critic network copes with the provided action from the actor network and selects a minimum Q value from the two calculated soft Q values, thus avoiding overfitting [47].Our goal is to train the above SAC network to acquire the ability to output the best phase shifts vector over extended interactions with the environment.The training of the SAC network, including implementation and learning processes, is described in detail.The RIS controller decision making would introduce delay in practice, which takes seconds due to the computational power of the RIS controller and the magnitude of changes in the environment.Nevertheless, if the RIS controller has been saturated from learning, it will immediately make an optimal decision at the general channel condition changes.More importantly, the RIS controller can also rapidly cope with large environmental transformations over time.This is because the RIS controller would learn more about potential fluctuations of the environment in the long term, which endows the RIS controller with the capability to tackle extreme cases easily.Overall, the DRL-based method can better reflect advantage in the long run.

Mechanism of Soft Actor-Critic Learning
The SAC architecture contains actor and critic networks for action selection and evaluation.Specifically, a random policy network constitutes the actor network.In addition, the critic network consists of online and target subnetworks, which include two online and two target Q networks, respectively.The online and target Q networks have the same structure but differ in the update method and frequency.The critic network copes with the provided action from the actor network and selects a minimum Q value from the two calculated soft Q values, thus avoiding overfitting [47].Our goal is to train the above SAC network to acquire the ability to output the best phase shifts vector over extended interactions with the environment.The training of the SAC network, including implementation and learning processes, is described in detail.

•
Implementation: s t+1 and r t are derived correspondingly by inputting a t at the current state s t .Then, the acquired transition tuple (s t , a t , r t , s t+1 ) is stored in a replay buffer that gradually enriches with multiple interactions.Notably, for the sake of the agent to explore comprehensively, the replay buffer can be stuffed off-policy.

•
Learning: In each time step, a mini-batch containing several transition tuples is randomly sampled from the replay buffer.Then, the learning process in a mini-batch is as follows.
In particular, unlike the conventional DRL-based method, the soft-state value function in the SAC algorithm introducing a relative entropy is defined as where θ i (i ∈ {1, 2}) is the network parameter relating to the i-th online Q network.
|A| represents the policy with probability [0, 1], calculated through the policy network, in the action space |A|.φ and α are the parameters with respect to the actor network and temperature.α also determines the importance of entropy relative to the Q value.In addition, the Q value is an action-state value function written as where γ is the discount factor used to calculate the cumulative returns.p(s t , a) indicates the probability of executing a specific action.Accordingly, the state is transitioning from s t to s t+1 .Above all, the SAC algorithm intends to explore as many varied actions as possible to maximize the target entropy based on a given s t .This process is also accompanied sequentially by loss calculation and parameter updating for the three modules below.First, the online Q network is trained by minimizing the Bellman residual as follows.
where λ Q and J Q (θ i ) represent the Q network's learning rate and loss function.θ − i is the network parameter concerning the i-th target Q network.D denotes the replay buffer.
Similar to the critic network, the actor network is trained subsequently.
where λ π denotes the actor network's learning rate.J π (φ) signifies the loss function of the policy network.Meanwhile, α is also trained by updating the loss function to avoid being set as a hyper-parameter and to reduce the estimation error, written as where λ represents the temperature's learning rate.J(α) indicates the loss of temperature.
H is a constant vector equivalent to the hyper-parameter of target entropy.Finally, the two target Q network parameters are updated as where ρ is the soft update factor.The stage of (37a)-( 40) implies the agent learning in one time step.After the ET time steps of E episodes, the agent's performance will saturate, and the optimized phase shifts vector ϕ * has been acquired.
Figure 3 shows the parameter updating process for actor and critic networks, and the proposed SAC algorithm is summarized in Algorithm 2.   9 Randomly sample min-batch transition tuples with batch size N from D. 10 Update the parameters θ 1 and θ 2 of the online Q networks by (37a), (37b). 11Update the parameter ϕ of the actor Q networks by (38a), (38b). 12Update the temperature parameter α by (39a), (39b). 13Update the parameters θ − 1 and θ − 2 of the target Q networks by (40).14 Set = +1 t t .

Proposed Deep Neural Network Design
Each NN contains an input, an output, and two hidden layers.The hidden layer of both the actor and critic networks includes L i (i ∈ {1, 2}) neurons.The input and output layers of the actor network have (J + K) and L neurons-the same number as state and action sizes, respectively.Since the input layer of the critic network additionally concatenates the action that the actor network selects, it includes (J + K + L) neurons.After getting the action and state, the critic network evaluates the action and gives a corresponding Q value as an assessment result, thus occupying one neuron at its output layer.The ReLU activation functions are adopted after the hidden layers of actor and critic networks.Moreover, the output layer of the critic network applies the Linear activation function.In contrast, the output layer of the actor network introduces a Softmax activation function to ensure that the discrete actions are distributed with effective probabilities.All networks use an Adam optimizer for parameter updating.

Algorithm Development and Computational Complexity
The proposed two-step algorithm is presented in Algorithm 3 by merging the two solutions for P2 ′ and P3.In particular, P2 ′ is a convex problem that ensures the convergence in each iterative sub-solution.Meanwhile, the convergence of P3 is also guaranteed by tuning the hyper-parameters.Since each sub-solution of P2 ′ and P3 outputs the optimal value, the SE value after each iteration m satisfies SE (m+1) ≥ SE (m) .Moreover, the objective SE is upper-bounded depending on the transmit power constraint and the interference level, so the convergence of Algorithm 3 can be acquired.Update Solve P3 with fixed w (m) , U (m) by Algorithm 2. 6 Update Until: The value of sum SE converges.9 Output : w * , U * , ϕ * .
Since P2 ′ only contains linear and SOC constraints, the computational complexity of Algorithm 1 is relatively low.The polynomial time complexity considers O((J + K + 2) 2.5 (KN t + JN r + J 2 + J + K) 2 + (J + K + 2) 3.5 ) [48].According to the dimensions of the afore- mentioned NN, the complexity of Algorithm 2 is O(2((J Compared to the fully exhaustive search method with exponential complexity applied in a combinatorial problem P3, the complexity of Algorithm 2 is significantly reduced.

Performance Evaluation
This section provides comprehensive numerical results to validate the effectiveness of our proposal.The two-step solution is solved through Matlab and Python tools.To be specific, subproblem one with parameters W and U is worked out through Matlab (version: Matlab R2020b), and subproblem two with parameter ϕ is tackled with Python (version: Python 3.8).

Simulation Setup and Parameters Setting
According to Figure 4, as the simplified system model, we consider a three-dimensional coordinate system where the BS antennas and RIS are located at (0 m, 20 m, 0 m) and (x m, 40 m, 0 m), respectively.In addition, the IoT devices are uniformly distributed in a horizontal square region with a side length of 40 m, which is centered at (100 m, 1 m, 0 m), where 1 m signifies the working height of each IoT device.Without loss of generality, for J = K = 3, we set the coordinates of the DL IoT devices as (90 m, 1 m, 10 m), (100 m, 1 m, −10 m), and (110 m, 1 m, 10 m), while UL IoT devices as (90 m, 1 m, −10 m), (100 m, 1 m, 10m), and (110 m, 1 m, −10 m).The QoS of each IoT device is considered 1 bps/Hz to guarantee all devices, especially the devices with relatively poor channel conditions, in normal communications.The channel path loss of large-scale fading is defined as where d represents the distance between two nodes.The value of −22 indicates that the path loss exponent is set at 2.2.Meanwhile, the value of −35.6 denotes the path loss at the reference distance of 1 m, which depends on the average channel attenuation and antenna characteristics [49].
Sensors 2024, 24, x FOR PEER REVIEW 18 of 28 reference distance of 1 m, which depends on the average channel attenuation and antenna characteristics [49].The small-scale fading is modeled by where LoS H and LoS h represent the deterministic LoS components of BS-RIS and RIS-IoT device channels for UL/DL, respectively.

H
and NLoS h mean the stochastic NLoS components in a similar manner [50].The rician factor ε is set to 10.We assume the AWGN σ = RIS and the IoT to manifest its subordinate influence compared with the UDI in this occasion).Considering the different channel characteristics and the up-to-date capability of SI elimination, we set the rician factor a and power elimination level σ 2 SI of the residual SI at BS to 1 and −100 dB, respectively [17].Other required simulation parameters will be listed in the title of the corresponding figures.Each hidden layer of our proposed SAC algorithm contains 256 neurons.The main DRL-related hyper-parameters refer to Table 2.The small-scale fading is modeled by where H LoS and h LoS represent the deterministic LoS components of BS-RIS and RIS-IoT device channels for UL/DL, respectively.H NLoS and h NLoS mean the stochastic NLoS components in a similar manner [50].The rician factor ε is set to 10.We assume the AWGN σ 2 d =σ 2 u =−107 dBm and the equivalent AWGN σ 2 UDI,k = −107 dBm, σ 2 LI,k = −96 dBm (Since we assume LI in scenario one is an approximated AWGN, we set the value of σ 2 LI,k −96 dBm at x = 50 m according to the average distance between the RIS and the IoT to manifest its subordinate influence compared with the UDI in this occasion).Considering the different channel characteristics and the up-to-date capability of SI elimination, we set the rician factor a and power elimination level σ 2 SI of the residual SI at BS to 1 and −100 dB, respectively [17].Other required simulation parameters will be listed in the title of the corresponding figures.Each hidden layer of our proposed SAC algorithm contains 256 neurons.The main DRL-related hyper-parameters refer to Table 2. Furthermore, we introduce a fully exhaustive search (i.e., an upper bound for discrete phase shifts) method as a benchmark to solve P3.However, due to the non-deterministic polynomial time, the exhaustive search approach lacks practicality in real-world applications.We additionally take a relatively low complexity local fixed-point iteration method [51], where the complexity is O((L + 1) 2 (K 2 + J 2 )) and O((L + 1) 2 (K 2 + J 2 +KJ)) for scenarios one and two, respectively.Meanwhile, to highlight the gain from the phase shifts optimization, we additionally bring the random phase shifts method and take the case without RIS as a reference.Finally, the Riemannian manifold under continuous phase shifts is also introduced as an ideal case to reveal the effectiveness of the discrete phase shifts methods [37].The simulation results are averaged based on 300 channel realizations.

Optimizer Performance
AdaGrad, RMSprop, and Adam all have their own merits.However, empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods [52].In this regard, we take Adam optimizer in the SAC network and give a performance comparison between AdaGrad, RMSprop, and Adam based on our framework.From Figure 5, the learning process of the Adam optimizer saturates faster than RMSprop, while the AdaGrad optimizer is absolutely unable to work in our framework.Furthermore, we introduce a fully exhaustive search (i.e., an upper bound for discrete phase shifts) method as a benchmark to solve 3    .However, due to the non-deterministic polynomial time, the exhaustive search approach lacks practicality in real-world applications.We additionally take a relatively low complexity local fixed-point iteration method [51], where the complexity is )) KJ for sce- narios one and two, respectively.Meanwhile, to highlight the gain from the phase shifts optimization, we additionally bring the random phase shifts method and take the case without RIS as a reference.Finally, the Riemannian manifold under continuous phase shifts is also introduced as an ideal case to reveal the effectiveness of the discrete phase shifts methods [37].The simulation results are averaged based on 300 channel realizations.

Optimizer Performance
AdaGrad, RMSprop, and Adam all have their own merits.However, empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods [52].In this regard, we take Adam optimizer in the SAC network and give a performance comparison between AdaGrad, RMSprop, and Adam based on our framework.From Figure 5, the learning process of the Adam optimizer saturates faster than RMSprop, while the AdaGrad optimizer is absolutely unable to work in our framework.

Convergence of Algorithm 3
Figure 6 shows the convergence behavior of the proposed two-step algorithm with respect to different parameter settings in both scenarios.It can be observed that increasing the number of BS antennas and IoT devices with the fixed RIS configuration will slow down the convergence.For instance, the parameter settings of

Convergence of Algorithm 3
Figure 6 shows the convergence behavior of the proposed two-step algorithm with respect to different parameter settings in both scenarios.It can be observed that increasing the number of BS antennas and IoT devices with the fixed RIS configuration will slow down the convergence.For instance, the parameter settings of N t = N r = J = K = 3 and N t = N r =J = K = 4 undergo 10-15 iterations to convergence, while nearly 20 iterations are required for setting N t = N r = J = K = 8.Moreover, there are no evident distinctions for convergence between the two scenarios.It confirms that the proposed DRL method is less affected by the scenarios.

Impact of the RIS Location
Figure 7a illustrates the performance impact of RIS location in scenario one.It is evident that the RIS deployed either close to the BS or the IoT devices will improve the sum SE.The proposed algorithm can obtain 135% and 62% performance gains with RIS deployment on the BS ( = 0 x m) and IoT devices side ( = 100 x m) compared with that at = 50 x m, respectively.The reason is that RIS can reconstruct the signal utmost at these positions, thus significantly enhancing the desired signal and reducing interference via adjusting the different transmission directions when the LoS link is severely blocked.Further, with x < ≤ (0 50 m) x increasing, the sum SE decreases.It is caused by the weakening reflected signal, which degrades the capability of suppressing the residual SI at BS.When the deployment of RIS is far from the BS and begins approaching the IoT devices, the sum SE increases.It implies that the RIS next to the IoT devices can further alleviate the UDI.In addition, it is found that the continuous phase shifts method outperforms other algorithms owing to the infinite RIS resolution.Although the proposal's performance is slightly lower than the exhaustive search method, it is better than the sub-optimal local fixed-point iteration method.This is because the fixed-point iteration method easily falls into local optimization when RIS includes many reflection elements.Next, because of the aimless signal reconstruction of the random phase shifts method, it achieves trivial profit from the RIS and is slightly better than the case without RIS.It is also for this reason that the random phase shifts method, or the case without RIS, is naturally less ( ≤ 2.84 bps/Hz) or none affected by the location of the RIS.

Impact of the RIS Location
Figure 7a illustrates the performance impact of RIS location in scenario one.It is evident that the RIS deployed either close to the BS or the IoT devices will improve the sum SE.The proposed algorithm can obtain 135% and 62% performance gains with RIS deployment on the BS (x = 0 m) and IoT devices side (x = 100 m) compared with that at x = 50 m, respectively.The reason is that RIS can reconstruct the signal utmost at these positions, thus significantly enhancing the desired signal and reducing interference via adjusting the different transmission directions when the LoS link is severely blocked.Further, with x(0 < x ≤ 50 m) increasing, the sum SE decreases.It is caused by the weakening reflected signal, which degrades the capability of suppressing the residual SI at BS.When the deployment of RIS is far from the BS and begins approaching the IoT devices, the sum SE increases.It implies that the RIS next to the IoT devices can further alleviate the UDI.In addition, it is found that the continuous phase shifts method outperforms other algorithms owing to the infinite RIS resolution.Although the proposal's performance is slightly lower than the exhaustive search method, it is better than the sub-optimal local fixed-point iteration method.This is because the fixed-point iteration method easily falls into local optimization when RIS includes many reflection elements.Next, because of the aimless signal reconstruction of the random phase shifts method, it achieves trivial profit from the RIS and is slightly better than the case without RIS.It is also for this reason that the random phase shifts method, or the case without RIS, is naturally less (≤2.84 bps/Hz) or none affected by the location of the RIS.
Figure 7b shows the impact of the RIS location in scenario two.Since the LI on the IoT devices side is emphasized in this scenario, it will further highlight the deployment benefit when RIS is located on the BS side.That is because we assume BS can absolutely eliminate the LI on the BS side, while the IoT devices incur the major performance impact of LI at the consideration of the trivial UDI.The trend of the curves is also consistent with Figure 7a when x is less than 70 m, which is the same reason as Figure 7a has explained.For x > 70 m, the sum SE decreases with x increasing.It is caused by the enhanced LI when RIS is excessively approaching the IoT devices side that the loss of LI will partially offset the profit of signal reconstruction.
into local optimization when RIS includes many reflection elements.Next, because of the aimless signal reconstruction of the random phase shifts method, it achieves trivial profit from the RIS and is slightly better than the case without RIS.It is also for this reason that the random phase shifts method, or the case without RIS, is naturally less ( ≤ 2.84 bps/Hz) or none affected by the location of the RIS.Furthermore, the performance gap between the proposal and the fixed-point iteration method is distinct in scenario two, where the gap at x = 50 m is 3.22 bps/Hz, compared to 1.54 bps/Hz at the same location in scenario one.It is relevant to the different objective functions in relation to phase shifts in each scenario.Specifically, the related objective in scenario two is more complicated than that in scenario one, leading to a larger error of the fixed-point iteration method when calculating the unit operations of each related phase shift to the next iteration.However, the proposed algorithm is based on the model-free, which does not particularly care about the concrete objective structure.Similarly to the fixed-point iteration method, the continuous phase shifts scheme based on the Riemannian manifold tends to deviate from manifold space when updating the tangent space in scenario two.Thus, the gap between the continuous phase shift and the proposal is smaller in scenario two than in one, such as the 3.90 bps/Hz performance gap at x = 50 m in Figure 7b, while 4.33 bps/Hz in Figure 7a.Even if under the heavy LI at x = 100 m of scenario two, the proposed algorithm outperforms the continuous phase shifts and fixed-point iteration methods by 1.03 bps/Hz and 0.68 bps/Hz relative gains with respect to scenario one, respectively.It suggests that the proposed algorithm is more advantageous in LI mitigation than the other algorithms under the scenario handoff.
Notably, scenario two with LI stressed is more likely to reproduce due to the high density of population and buildings in the smart cities.Thus, our proposal is more adaptable to practical application in the urban outdoor environment.
Joining Figure 7a,b, we can conclude that the SE performance concerning RIS mainly depends on RIS location and the wireless environment of IoT devices.In any case, the performance gain from signal reconstruction of the optimized RIS is better than that of non-RIS and random RIS methods.Since the varied performance among scenarios mainly depends on the two relevant factors we have discussed above, we only display the numerical results with respect to other influencing elements (as we shall see below) at x = 50 m in scenario two.We can concisely highlight our proposal through the scenario handoff from scenarios one to two and the trade-off of interference mitigation between the residual SI and LI at x = 50 m.

Impact of the Number of RIS Reflection Elements
Figure 8 shows the influence of the number of RIS reflection elements.We can observe that the sum SE increases, and the gap between the continuous phase shifts method and others widens with the increase in L. This expected result is mainly produced by the more reflection elements, the more cumulative gain from the reconstructed signal.For L = 64, the fixed-point iteration and proposed algorithms are obviously weaker than the continuous phase shifts method.The discrepancy in performance loss of the proposed algorithm ascends from 19% at L = 16 to 29% at L = 64 in reference to the continuous phase shifts method.Same as this, the fixed-point iteration method is up from 42% to 51%.The reason is that our proposal reduces computational complexity at the cost of a certain performance when the action dimension is large.Moreover, the fixed-point iteration easily falls into the local optimization as the number of L is large, which is consistent with the conclusion in Figure 7a.
Sensors 2024, 24, x FOR PEER REVIEW 22 o local optimization as the number of L is large, which is consistent with the conclusion Figure 7a.

Impact of the Number of Bits
In Figure 9, we evaluate the performance trend under different b.It can be seen t the performance improves with b ≤ ( 5) b increases except for the resolution uncorrela methods, such as the continuous phase shifts, random phase shifts, and case without methods.Especially for the random phase shifts method, the gain of RIS only depends the number of reflection elements instead of the resolution.It accounts for the fact t different random phases in one element cannot reconstruct the signal well due to the c ual transmission direction.Additionally, the performance of the exhaustive sea method is approximately equal to the continuous phase shifts method when b is set bits and only loses 0.12 bps/Hz.It explains that the discrete phase shifts method can a approach the continuous one at some point.Therefore, the rationality of our propo adopting discrete phase shifts is proved.
Nevertheless, the performance of the proposed algorithm decreases at 6 bits.Co pared with the fixed-point iteration method, the proposal has an extra 7% gain defici = 6 b relative to = 5 b .This is due to the fact that a larger resolution will incur an ex nential increment in the action space, which will reduce the learning efficiency.It sugg that our proposal should find a trade-off between performance and resolution.

Impact of the Number of Bits
In Figure 9, we evaluate the performance trend under different b.It can be seen that the performance improves with b (b ≤ 5) increases except for the resolution uncorrelated methods, such as the continuous phase shifts, random phase shifts, and case without RIS methods.Especially for the random phase shifts method, the gain of RIS only depends on the number of reflection elements instead of the resolution.It accounts for the fact that different random phases in one element cannot reconstruct the signal well due to the casual transmission direction.Additionally, the performance of the exhaustive search method is approximately equal to the continuous phase shifts method when b is set to 6 bits and only loses 0.12 bps/Hz.It explains that the discrete phase shifts method can also approach the continuous one at some point.Therefore, the rationality of our proposal adopting discrete phase shifts is proved.
Nevertheless, the performance of the proposed algorithm decreases at 6 bits.Compared with the fixed-point iteration method, the proposal has an extra 7% gain deficit at b = 6 relative to b = 5.This is due to the fact that a larger resolution will incur an exponential increment in the action space, which will reduce the learning efficiency.It suggests that our proposal should find a trade-off between performance and resolution.SI ≤ −90 dB) levels.To be specific, the maximum loss of our proposal caused by the increasing residual SI is 0.91 bps/Hz and 3.01 bps/Hz during low and normal levels, respectively, whereas that of the case without RIS is 1.79 bps/Hz and 4.77 bps/Hz.For the ideal case, the corresponding loss of the continuous phase method is 0.22 bps/Hz and 2.71 bps/Hz.It demonstrates that it is practical to adopt the proposed algorithm to reduce the FD performance loss by the residual SI.The sum SE drops distinctly when the residual SI level exceeds −90 dB.At σ 2 SI = −90 dB, our proposal outperforms the fixed-point iteration and random phase shifts methods by 3.15 bps/Hz and 8.71 bps/Hz, respectively.Neglecting the benefit factor of the opti-mized RIS by comparing the random phase shifts method, we can infer that the proposed algorithm can also brilliantly restrain the LI to improve performance.

Impact of the Transmit Power
Figure 11a,b show the performance impact of the transmit power.In Figure 11a, with the BS power budget increasing, the sum SE improves and reaches the bottleneck at 30 dBm.This is mainly due to the effect of residual SI and DL to DL interference.For instance, the excess transmit power at BS will severely enhance its SI, thus degrading the performance.Accordingly, the actual power allocated by BS will be lower than the threshold to seek the optimum sum SE when the transmit power budget is oversaturated.At the saturation point, the proposed algorithm outweighs the fixed-point iteration method by 3.22 bps/Hz, while it only has 1.98 bps/Hz SE less than the exhaustive search method.
Sensors 2024, 24, x FOR PEER REVIEW 24 of 28 by 3.15 bps/Hz and 8.71 bps/Hz, respectively.Neglecting the benefit factor of the optimized RIS by comparing the random phase shifts method, we can infer that the proposed algorithm can also brilliantly restrain the LI to improve performance.

Impact of the Transmit Power
Figure 11a,b show the performance impact of the transmit power.In Figure 11a, with the BS power budget increasing, the sum SE improves and reaches the bottleneck at 30 dBm.This is mainly due to the effect of residual SI and DL to DL interference.For instance, the excess transmit power at BS will severely enhance its SI, thus degrading the performance.Accordingly, the actual power allocated by BS will be lower than the threshold to seek the optimum sum SE when the transmit power budget is oversaturated.At the saturation point, the proposed algorithm outweighs the fixed-point iteration method by 3.22 bps/Hz, while it only has 1.98 bps/Hz SE less than the exhaustive search method.On the other hand, we further compare the influence of the IoT device transmit power.The simulation result in Figure 11b presents that performance starts to decline when the transmit power of IoT devices is over 20 dBm, except for the case without RIS.For example, the proposal and fixed-point iteration methods degrade 1.16 bps/Hz and 1.47 bps/Hz from = 20 p dBm to 25 dBm, respectively.We can infer that the composite strong LI and UL to UL interference due to the superfluous transmit power mainly creates the trend of these curves.
Because of no LI in the case without RIS, the peak rests on = 25 p dBm.However, if p keeps rising, the performance decreases due to the difficulty of BS in decoding the signal mixed with an intensive UL to UL interference.It explains that the IoT device transmit power in real-world applications should have an upper bound.In fact, we can draw another interesting conclusion that the optimum transmit power of IoT devices with RIS is lower than that without RIS, which can illustrate that RIS-aided systems not only improve SE but also save energy.On the other hand, we also conclude that if the phase shifts are not well tuned in the intense LI situation, the random phase shifts method can even be lower by 0.84 bps/Hz than the case without RIS.This underscores the importance of RIS optimization in the urban outdoor environment.Meanwhile, for = 30 p dBm, our proposal outweighs the unoptimized RIS approach by 6.23 bps/Hz.It demonstrates that the proposed algorithm still performs well even with an excessively high transmit power.
To further illustrate the merit of the proposal in terms of complexity and power consumption, we list the performance of discrete phase shifts related methods with transmit powers = 30 P and = 20 p dBm in Table 3.On the other hand, we further compare the influence of the IoT device transmit power.The simulation result in Figure 11b presents that performance starts to decline when the transmit power of IoT devices is over 20 dBm, except for the case without RIS.For example, the proposal and fixed-point iteration methods degrade 1.16 bps/Hz and 1.47 bps/Hz from p = 20 dBm to 25 dBm, respectively.We can infer that the composite strong LI and UL to UL interference due to the superfluous transmit power mainly creates the trend of these curves.
Because of no LI in the case without RIS, the peak rests on p = 25 dBm.However, if p keeps rising, the performance decreases due to the difficulty of BS in decoding the signal mixed with an intensive UL to UL interference.It explains that the IoT device transmit power in real-world applications should have an upper bound.In fact, we can draw another interesting conclusion that the optimum transmit power of IoT devices with RIS is lower than that without RIS, which can illustrate that RIS-aided systems not only improve SE but also save energy.On the other hand, we also conclude that if the phase shifts are not well tuned in the intense LI situation, the random phase shifts method can even be lower by 0.84 bps/Hz than the case without RIS.This underscores the importance of RIS optimization in the urban outdoor environment.Meanwhile, for p = 30 dBm, our proposal outweighs the unoptimized RIS approach by 6.23 bps/Hz.It demonstrates that the proposed algorithm still performs well even with an excessively high transmit power.
To further illustrate the merit of the proposal in terms of complexity and power consumption, we list the performance of discrete phase shifts related methods with transmit powers P = 30 and p = 20 dBm in Table 3.It can be seen that the EE of the proposal is only 7.4% less than the exhaustive, but the complexity is greatly reduced.

Impact of the Rician Factor
Finally, we evaluate the SE performance under different rician factors in Figure 12.With the increase in the rician factor value, the sum SE decreases, especially for the optimized RIS cases.Moreover, it also can be seen that with ε growth, the gain loss with the proposed algorithm is more evident than the fixed-point iteration method.For instance, the gap between the proposal and the case without RIS is 18.58 bps/Hz at ε = 0 and reduces to 8.47 bps/Hz at ε = 10, whereas the fixed-point iteration method decreases from 13.49 to 5.25 bps/Hz with the same settings.The reason is that RIS improves performance by reconstructing signals to increase multi-path diversity.A larger ε will not be conducive to promoting multi-path and cause severe co-channel interference due to the enhanced main path.Therefore, the gain brought by multi-path diversity decreases coupled with the increase in ε, and the greater profit deriving from the diversity gain will cause more performance decline under the same level of the enlarged ε.We conclude that the RIS should be deployed in a relatively rich scattering environment to seek optimum performance.Thus, our proposal fairly suits the urban outdoor environment with rich scattering.It can be seen that the EE of the proposal is only 7.4% less than the exhaustive, but the complexity is greatly reduced.

Impact of the Rician Factor
Finally, we evaluate the SE performance under different rician factors in Figure 12.With the increase in the rician factor value, the sum SE decreases, especially for the optimized RIS cases.Moreover, it also can be seen that with ε growth, the gain loss with the proposed algorithm is more evident than the fixed-point iteration method.For instance, the gap between the proposal and the case without RIS is 18.58 bps/Hz at ε = 0 and reduces to 8.47 bps/Hz at ε = 10 , whereas the fixed-point iteration method decreases from 13.49 to 5.25 bps/Hz with the same settings.The reason is that RIS improves performance by reconstructing signals to increase multi-path diversity.A larger ε will not be conducive to promoting multi-path and cause severe co-channel interference due to the enhanced main path.Therefore, the gain brought by multi-path diversity decreases coupled with the increase in ε , and the greater profit deriving from the diversity gain will cause more performance decline under the same level of the enlarged ε .We conclude that the RIS should be deployed in a relatively rich scattering environment to seek optimum performance.Thus, our proposal fairly suits the urban outdoor environment with rich scattering.

Conclusions
In this paper, considering residual SI, LI, and RIS location, we have proposed a novel DRL-based two-step algorithm practical for two typical urban outdoor scenarios in RISaided FD systems.Specifically, we devise a scheme to maximize the sum SE of IoT devices by joint design of the receive, transmitting beamforming matrices and phase shifts vector.Firstly, we decompose the original optimization problem into two subproblems according to the type of optimized variables.We obtain the closed solution of subproblem one by approximating the convex lower bound of the objective function and constraints.Then, we devise a low computational complexity SAC algorithm to solve subproblem two.Simulation results demonstrate that our low-complexity proposal is second only to the upper bound of the discrete phase shifts method and outperforms the fixed-point iteration baseline.Moreover, it is especially advantageous in scenario two with a complicated objective function, proving the superiority of our proposal in the urban outdoor environment.
Since obtaining the perfect CSI of the cascaded channel is idealistic, imperfect CSI should be considered in practice.Moreover, we only indicate that the RIS-aided system is also power-saving from the perspective of SE.We could formulate an EE objective function to investigate further.The above-mentioned is our future work.

Figure 1 .
Figure 1.A multi-user RIS-aided full-duplex system in urban outdoor environment.

Figure 1
Figure 1 describes a multi-user RIS-aided FD system, which composes one BS equipped with t N transmitting and r N receiving antennas, (J + K) single-antenna IoT devices, and one RIS.BS works in FD mode, while the IoT devices, in terms of service

Figure 1 .
Figure 1.A multi-user RIS-aided full-duplex system in urban outdoor environment.

Figure 2 .
Figure 2.An MDP of RIS controller and environment.

Figure 2 .
Figure 2.An MDP of RIS controller and environment.

Algorithm 2 :
Proposed SAC Algorithm for Problem P3

Figure 3 .
Figure 3. Framework of the SAC.
we assume LI in scenario one is an approximated AWGN, we set the value of σ 2 LI ,k −96 dBm at = 50 x m according to the average distance between the

Figure 5 .
Figure 6 shows the convergence behavior of the proposed two-step algorithm with respect to different parameter settings in both scenarios.It can be observed that increasing the number of BS antennas and IoT devices with the fixed RIS configuration will slow down the convergence.For instance, the parameter settings of = = = = t r 3 N N J K and

Figure 7 .
Figure 7. Impact of the horizontal distance between BS and RIS.System parameters: N t = N r = J = K = 3, b = 4, L = 16, P = 30 dBm,p = 20 dBm.(a) Sum SE versus x in scenario one; (b) Sum SE versus x in scenario two.

Figure 8 .
Figure 8. Impact of the number of RIS reflection elements.System parameters: = = = t r N N J K

Figure 9 . 7 .
Figure 9. Impact of the number of RIS resolution bits.System parameters: = = = = t r 3 N N J K ,

Figure 10 .
Figure 10.Impact of the level of residual self-interference.System parameters: = = = = t r 3 N N J K ,

Figure 9 .
Figure 9. Impact of the number of RIS resolution bits.System parameters: N t = N r = J = K = 3, L = 16, P = 30 dBm, p = 20 dBm, x = 50 m.4.7.Impact of the Residual Self-Interference Figure 10 presents the sum SE trend under different residual SI levels.Obviously, the performance is more afflicted with the increased residual SI, and our proposal can better eliminate the residual SI at the condition of low (−120 dB ≤ σ 2 SI ≤ −105 dB) and normal (−105 dB ≤ σ 2SI ≤ −90 dB) levels.To be specific, the maximum loss of our proposal caused by the increasing residual SI is 0.91 bps/Hz and 3.01 bps/Hz during low and normal levels, respectively, whereas that of the case without RIS is 1.79 bps/Hz and 4.77 bps/Hz.For the ideal case, the corresponding loss of the continuous phase method is 0.22 bps/Hz and 2.71 bps/Hz.It demonstrates that it is practical to adopt the proposed algorithm to reduce the FD performance loss by the residual SI.

Table 1 .
List of abbreviations.

)
H BR,k and H LI,j,k represent the BS-RIS-IoT k and IoT j -RIS-IoT k cascade channels.w k ∈ C N t ×1 is the transmitting beamforming vector for IoT k , and the modulus of the vector determines the allocated power level to IoT k .x d k and x u j indicate the receiving and transmitting symbols of IoT k and IoT j , respectively, which satisfy E x d = 1.p is the transmit power of UL IoT devices.We suppose each device adopts the full power p for transmission.g k,j is the channel gain between IoT k and IoT j .n d * (34) a t the phase shifts vector to the environment with the obtained CSI and the given w * , U * to calculate the next state s t+1 and reward r t by (32)-(34).8Storethetransitiontuple(s t , a t , r t , s t+1 ) in the replay buffer D. 9Randomly sample min-batch transition tuples with batch size N from D. 10Update the parameters θ 1 and θ 2 of the online Q networks by (37a), (37b).11Update the parameter φ of the actor Q networks by (38a), (38b).