Joint Task Offloading and Resource Allocation for Intelligent Reflecting Surface-Aided Integrated Sensing and Communication Systems Using Deep Reinforcement Learning Algorithm

This paper investigates an intelligent reflecting surface (IRS)-aided integrated sensing and communication (ISAC) framework to cope with the problems of spectrum scarcity and poor wireless environments. The main goal of the proposed framework is to optimize the overall performance of the system, covering sensing, communication, and computational offloading. We aim to achieve a trade-off between system performance and overhead by optimizing the spectrum and computing resource allocation. On the one hand, the joint design of the transmit beamforming and phase shift matrices enhances the radar sensing quality and increases the communication data rate. On the other hand, task offloading and computation resource allocation optimize the energy consumption and delay. Owing to the coupled, high-dimensional optimization variables, the optimization problem is non-convex and NP-hard. Meanwhile, given the dynamic wireless channel conditions, we formulate the optimization design as a Markov decision process. To tackle this complex optimization problem, we propose two innovative deep reinforcement learning (DRL)-based schemes. Specifically, a deep deterministic policy gradient (DDPG) method is proposed to address the continuous, high-dimensional action space, and prioritized experience replay is adopted to speed up convergence. Then, a twin delayed DDPG algorithm is designed based on this DRL framework. Numerical results confirm the effectiveness of the proposed schemes compared with the benchmark methods.


Introduction
The integrated sensing and communication (ISAC) framework has been proposed as one of the critical technologies in sixth-generation (6G) networks, enabling many emerging applications such as virtual reality, smart cities, autonomous driving, etc. [1]. The application scenarios mentioned above require a high data transmission rate while ensuring the target sensing performance. In recent works [2-5], a tight combination of the sensing and communication functions in ISAC systems has been achieved through a series of designs, including integrated architectures, waveform design, etc. By sharing spectrum and wireless infrastructure, ISAC technology improves resource efficiency and utilization and reduces signal interference and hardware overhead [6].
However, despite the enormous benefits of ISAC technology, its applications face considerable challenges in practice due to obstacles such as dense buildings or landscape trees in urban environments [7]. Unlike communication systems, in which both line-of-sight (LoS) and non-LoS (NLoS) links can be leveraged for data transmission, the radar sensing function relies on the LoS link between the transmitter and the target area, while the NLoS link is treated as interference [8]. Therefore, exploring the target sensing problem for an ISAC system without an LoS link is necessary [9].
The intelligent reflecting surface (IRS) is a promising technology in next-generation wireless systems due to its excellent ability to reconstruct wireless environments [10]. By manipulating the phase shifts and amplitudes of its reflecting elements, the IRS can create virtual LoS links in NLoS areas. Motivated by the advantages of the IRS in reconstructing the wireless propagation environment, it is natural to exploit the IRS to assist ISAC systems, improving the communication data rate and enhancing the sensing accuracy and resolution [11]. In an IRS-assisted ISAC system, multiple beams can be synthesized for the users, and the desired signal can be enhanced by the joint design of the phase shift and transmit beamforming [9,12]. Moreover, the IRS reduces hardware and energy overhead by using low-cost passive components without needing radio frequency (RF) units [13]. Hence, these high spectrum efficiency and low-cost advantages motivate our study of IRS-assisted ISAC systems.
Although the IRS-assisted ISAC system shows significant potential, its implementation still faces challenges, such as the joint design of the phase shift and beamforming matrices. The ISAC system's data calculation and signal processing are generally complex and resource-intensive. Given the constrained computation and energy resources of the user terminal, the heavy sensory-data processing load of the user equipment (UE) can be relieved by mobile edge computing (MEC) technology. MEC works by offloading computational tasks from the UE to the edge network, achieving better time efficiency and performance [14]. This work investigates the joint resource allocation and task offloading optimization problem in the multi-user IRS-assisted ISAC scenario. In particular, power and spectrum resources are allocated through beamforming and phase shift design, while computing resources are allocated through task offloading.

Related Works
Adopting the IRS to improve communication quality has provided certain benefits; inspired by this, researchers have conducted extensive studies to explore the potential of employing the IRS in ISAC systems [8,13,15-21]. In [8], a virtual LoS channel was created with the IRS's assistance to enhance the communication and sensing performance in an ISAC system, and semi-definite relaxation (SDR) was adopted for the beampattern gain maximization problem. The authors in [13] exploited the IRS to strengthen the radar detection function in a dual-function radar and communication system, in which a joint optimization of the precoding and IRS phase shift matrices was proposed and solved with a majorization-minimization (MM) method. A hybrid IRS model was investigated in [15], which comprised active and passive elements for enhancing ISAC systems and realized worst-case target illumination power maximization through an alternating optimization (AO) algorithm. In [16], the authors proposed an IRS-aided radar system architecture and studied the benefits of IRSs and their deployment location issues. Through a joint beamforming design, the authors in [17] optimized the total transmit power while meeting the signal-to-interference-plus-noise ratio (SINR) requirements for communication and the radar signal cross-correlation pattern constraint for sensing in IRS-assisted ISAC systems. The authors in [18] proposed penalty dual decomposition (PDD) and block coordinate descent (BCD) methods for the joint optimization problem in an IRS-aided communication-radar coexistence system. In [19], the authors studied the joint waveform and discrete phase design in an IRS-aided ISAC system to mitigate multi-user interference. In [20], alternating direction method of multipliers (ADMM) and MM approaches were proposed to optimize the sensing performance while satisfying the communication requirements. The authors in [22] developed an ISAC-assisted MEC system and employed the IRS to reduce the mutual interference between MEC offloading transmission and radar sensing, using a BCD algorithm. Inspired by the above-mentioned work, we investigate the joint computation offloading and resource allocation problems in the IRS-aided ISAC system.
Recently, the excellent performance of artificial intelligence (AI) algorithms in dealing with nonlinear and computationally complex problems has triggered a revolution in industry and academia [23-26]. Considering the numerous elements in the IRS-assisted ISAC system, the resulting high-dimensional optimization problems are difficult to solve using traditional mathematical methods; such problems are, however, well suited to AI techniques. Meanwhile, deep reinforcement learning (DRL) combines the advantage of deep learning in neural network training with the extraordinary ability of reinforcement learning on large-scale non-convex problems [25]. Therefore, DRL finds a broad array of applications within the domain of wireless communications, including computation offloading [27], power allocation [28], task scheduling [29], etc. The authors in [30] designed a DRL approach to address a joint transmit precoding and phase shift matrix design with the goal of maximizing the data rate. An adaptive DRL framework, the twin delayed deep deterministic policy gradient, was developed in [31] to deal with the joint beamformer design problem in IRS-aided wireless systems. The authors in [6] designed a distributed reinforcement learning scheme for the joint optimization problem in a terahertz-band IRS-aided ISAC system. Therefore, given the time-varying channel conditions and dynamic resources, we reformulate the proposed optimization problem in our work as a Markov decision process (MDP). Then, an innovative DRL-based framework is developed to solve the joint resource optimization and computation offloading problem. Table 1 lists the most closely related existing efforts and compares them with our work.

Contributions
We investigate the joint optimization problem in the multi-user IRS-assisted ISAC system. Specifically, we study the design of the transmit beamforming and IRS phase shift matrices for communication and radar sensing, together with computation offloading for local data processing. Our aim is to optimize the system's data transmission and energy efficiency while meeting the radar sensing requirement and power constraints. Considering the dynamic environment and the high-dimensional solution space of the optimization problem, we develop DRL schemes to solve it. We can summarize the contributions as follows:
• We propose the IRS-assisted ISAC framework, exploiting the IRS to assist and enhance the sensing and communication functions in NLoS coverage areas. We construct a comprehensive optimization goal covering sensing, communication, and computation offloading. The main goal is to maximize the data sum-rate while minimizing the energy consumption, under the radar performance, transmit power budget, and offloading time-delay constraints, through the joint design of the transmit beamforming and IRS phase shift.
• Owing to the coupling between the optimization variables, the joint optimization problem is NP-hard and non-convex, making it challenging for traditional mathematical methods. Therefore, the optimization problem is formulated as an MDP, and two innovative DRL schemes are designed to solve it. Given the continuous, high-dimensional action space, we develop a deep deterministic policy gradient (DDPG) scheme that incorporates prioritized experience replay to enhance training efficiency. Furthermore, a twin delayed DDPG (TD3) scheme is designed on top of the DDPG framework.
• Simulation results confirm the effectiveness and convergence of the proposed schemes. In contrast with the benchmarks, our proposed DRL schemes achieve a better balance between communication and sensing performance. Moreover, the system's energy consumption and latency are optimized by proper computation offloading. Finally, the benefits and feasibility of the IRS-assisted ISAC framework are verified.
Notation: Bold uppercase and lowercase letters represent matrices and vectors, respectively. (·)^T and (·)^H denote the transpose and Hermitian transpose operators. Tr(·) denotes the trace operation. diag(·) forms a diagonal matrix from its arguments. ∥·∥_F and |·| are the Frobenius norm and absolute value operators.

System Model
A multi-user multiple-input single-output (MISO) IRS-aided ISAC system is presented in Figure 1, with K single-antenna users and a base station (BS) equipped with M antennas. Specifically, the BS deploys a uniform linear array (ULA) of antennas, and the IRS employs a uniform planar array (UPA). Our work considers the case wherein the direct BS-user links are obstructed by dense obstacles. Therefore, an IRS with N × N reflecting elements is employed to aid the users' wireless data transmission and to provide the target sensing service in NLoS areas. We denote the sets of users, BS antennas, and IRS elements as K = {1, 2, . . . , K}, M = {1, 2, . . . , M}, and N = {1, 2, . . . , N}, respectively. The transmitted information-bearing symbol vector is denoted as c = [c_1, c_2, . . . , c_K]^T ∈ C^{K×1} with E[cc^H] = I_K. The signal transmitted by the BS is given by x = Wc, where W = [w_1, w_2, . . . , w_K] ∈ C^{M×K} represents the transmit beamforming matrix, with w_k ∈ C^{M×1} denoting the transmit beamforming vector for user k.
The covariance matrix of the transmit signal is computed by R_x = E[xx^H] = WW^H. Therefore, the transmit power constraint can be written as Tr(WW^H) ≤ P_max, where P_max is the transmit power budget.
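As a quick numerical illustration, the transmit covariance R_x = WW^H and the power budget Tr(WW^H) ≤ P_max can be checked with a minimal NumPy sketch; the values of M, K, and P_max below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 4, 3          # BS antennas, users (illustrative values)
P_max = 1.0          # transmit power budget (normalized)

# Random complex beamforming matrix W in C^{M x K}
W = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)

# Scale W so the transmit power Tr(W W^H) meets the budget exactly
Rx = W @ W.conj().T                      # covariance of x = Wc (unit-power symbols)
power = np.real(np.trace(Rx))
W = W * np.sqrt(P_max / power)

assert np.isclose(np.real(np.trace(W @ W.conj().T)), P_max)
```

Normalizing W this way is a common trick for generating feasible starting points that sit exactly on the power-budget boundary.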

Figure 1. The IRS-aided ISAC system model, showing the ISAC signal and radar echo signal links.

Communication Model
Let H_1 ∈ C^{N×M} denote the channel matrix from the BS to the IRS, and let h_{k,2} ∈ C^{N×1} represent the channel vector from the IRS to user k with ∀k ∈ K. The transmitted signal received by user k is given by y_k = h_{k,2}^H Φ H_1 x + n_k, where Φ = diag(β_1 e^{jθ_1}, . . . , β_N e^{jθ_N}) is the IRS reflection matrix, in which β_n ∈ [0, 1] and θ_n ∈ [0, 2π) indicate the amplitude and phase of element n with ∀n ∈ N, respectively. Due to the high overhead of simultaneously implementing independent control of the phase shift and amplitude [13], we assume the ideal reflection amplitude of the passive IRS, i.e., β_n = 1 for all n. Here, n_k ∼ CN(0, σ²) is the additive white Gaussian noise (AWGN).
We adopt the Rician fading channel model in this work, and the channel H_1 can be formulated as H_1 = sqrt(γ_1/(1+γ_1)) H_LoS + sqrt(1/(1+γ_1)) H_NLoS, where γ_1 denotes the Rician factor, and H_LoS ∈ C^{N×M} and H_NLoS ∈ C^{N×M} are the LoS and NLoS components, respectively. The LoS channel matrix can be expanded as H_LoS = α e^{jφ} a_r(θ_r) b_t^H(θ_t), where α and φ are the large-scale channel gain and a random phase uniformly distributed in the range from 0 to 2π, respectively. Meanwhile, a_r(θ_r) ∈ C^{N×1} represents the receive steering vector at the IRS with the angle of arrival θ_r, and b_t(θ_t) ∈ C^{M×1} indicates the transmit steering vector of the BS with the angle of departure θ_t. The steering vector of the BS can be formulated as b(θ) = [1, e^{j2π d_0 sin θ/λ}, . . . , e^{j2π(M−1) d_0 sin θ/λ}]^T, where d_0 and λ denote the antenna spacing and the signal wavelength. Similarly, the steering vector a(υ, θ) of the IRS (a UPA) is defined analogously over its two element indices, with elevation angle υ and azimuth angle θ. We leverage the SINR as the performance indicator of communication. Let ρ_k denote the SINR of user k, which is given by ρ_k = |h_{k,2}^H Φ H_1 w_k|² / (Σ_{i∈K, i≠k} |h_{k,2}^H Φ H_1 w_i|² + σ²).
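The channel model above can be sketched in NumPy as follows. This is a simplified illustration: the IRS UPA steering vector is approximated by a ULA response, the large-scale gain α is normalized to one, and all parameter values (N, M, γ, angles) are illustrative rather than taken from the paper.

```python
import numpy as np

def ula_steering(M, theta, d0=0.5, lam=1.0):
    """ULA steering vector b(theta) with element spacing d0 and wavelength lam."""
    m = np.arange(M)
    return np.exp(1j * 2 * np.pi * d0 / lam * m * np.sin(theta))

def rician_channel(N, M, gamma, theta_r, theta_t, rng):
    """H1 = sqrt(g/(1+g)) H_LoS + sqrt(1/(1+g)) H_NLoS (unit large-scale gain)."""
    a_r = ula_steering(N, theta_r)   # receive steering at IRS (ULA proxy for UPA)
    b_t = ula_steering(M, theta_t)   # transmit steering at BS
    phi = rng.uniform(0, 2 * np.pi)  # random LoS phase, uniform in [0, 2*pi)
    H_los = np.exp(1j * phi) * np.outer(a_r, b_t.conj())
    H_nlos = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
    return np.sqrt(gamma / (1 + gamma)) * H_los + np.sqrt(1 / (1 + gamma)) * H_nlos

rng = np.random.default_rng(1)
H1 = rician_channel(N=16, M=4, gamma=2.0, theta_r=0.3, theta_t=-0.2, rng=rng)
assert H1.shape == (16, 4)
```

Note that each steering-vector entry has unit modulus, so the LoS component is a rank-one matrix, consistent with the outer-product form a_r(θ_r) b_t^H(θ_t).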

Radar Sensing Model
At time slot t, the received radar echo signal at the BS can be expressed as y_r(t) = Σ_{k∈K} H_1^H Φ^H A_k Φ H_1 x(t − τ_k) + n_r(t), where A_k ∈ C^{N×N} represents the target response matrix of the IRS for the k-th target, τ_k denotes the propagation delay between the transmitter and the target, and n_r(t) is AWGN with n_r(t) ∼ CN(0, σ_r² I_M). The received sensing echo signal from the k-th target is therefore y_{r,k}(t) = H_1^H Φ^H A_k Φ H_1 x(t − τ_k) ∈ C^{M×1}. We use the SINR as the sensing performance indicator [33]. Therefore, the SINR of the radar for target k can be given by the ratio of the echo power of target k to the power of the other targets' echoes plus noise, i.e., ρ_{r,k} = ∥H_1^H Φ^H A_k Φ H_1 x∥² / (Σ_{i∈K, i≠k} ∥H_1^H Φ^H A_i Φ H_1 x∥² + σ_r²).

Computation Offloading Model
The UE generates a series of data processing tasks that must be executed in a timely manner to meet the low-latency requirement. Due to the constrained energy and computation resources of the UE, tasks can be offloaded to the BS. The computation task generated by UE k (k ∈ K) at time slot t is denoted by a tuple D_k(t) = {d_k(t), c_k(t), ξ_k(t)}, where d_k(t) denotes the input data size (bits), c_k(t) represents the required computation cost (e.g., the number of CPU cycles for processing one bit of data), and ξ_k(t) indicates the maximum tolerable latency of UE k. We assume that tasks are bitwise separable and can be partially executed locally, while the raw data of the remaining part are sent directly to the BS for processing. Let w_k ∈ [0, 1] denote the offloading ratio; in the two extremes, w_k = 1 when the task is entirely offloaded to the BS, and w_k = 0 when the task is processed entirely locally at UE k. The processing delay of the BS server executing its share of task D_k(t) can be calculated by T_{o,k}(t) = w_k d_k(t) c_k(t) / f_{o,k}(t), where f_{o,k}(t) represents the CPU frequency allocated by the BS server to UE k. The processing delay of UE k executing its share of the task locally can be written as T_{l,k}(t) = (1 − w_k) d_k(t) c_k(t) / f_{l,k}, where f_{l,k} is the CPU frequency of UE k (cycles/s). Since local computing and offloading proceed in parallel, the overall latency for processing task D_k(t) is T_k(t) = max{T_{u,k}(t) + T_{o,k}(t), T_{l,k}(t)}.
Here, T_{u,k}(t) = w_k d_k(t) / r_k is the uplink transmission delay with uplink data rate r_k = B_k log_2(1 + p_k |h_k|² / σ²), where B_k and p_k are the uplink transmit bandwidth and power of UE k, respectively, and |h_k|² is the effective uplink channel gain. Due to the small size of the processing result, the latency T_{d,k} of receiving the result can be ignored [34,35]. Meanwhile, the energy consumed in executing the offloaded share of the task at the BS server can be denoted by E_{o,k}(t) = κ_o w_k d_k(t) c_k(t) f_{o,k}²(t), where κ_o denotes the effective capacitance coefficient related to the chip architecture [36,37]. The energy consumption for UE k executing its share of the task locally can be formulated as E_{l,k}(t) = κ_l (1 − w_k) d_k(t) c_k(t) f_{l,k}², where κ_l is, similarly, an effective capacitance coefficient. Therefore, the overall energy consumption can be given by E_k(t) = E_{u,k}(t) + E_{o,k}(t) + E_{l,k}(t), where E_{u,k} represents the offloading (transmission) energy consumption with E_{u,k} = p_k w_k d_k / r_k. The energy consumption for receiving the result can also be ignored.
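The latency and energy bookkeeping above can be summarized in a short Python sketch. This encodes one consistent reading of the model (offloaded fraction w, parallel local/edge execution, uplink rate r = B·log2(1 + SNR)); the helper name, default capacitance coefficients, and all numeric values are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def offload_metrics(w, d, c, f_o, f_l, B, p, snr, kappa_o=1e-26, kappa_l=3e-26):
    """Latency and energy of partially offloading a task D = (d bits, c cycles/bit).

    w: offloading ratio in [0, 1]; f_o, f_l: server/UE CPU frequency (cycles/s);
    B: uplink bandwidth (Hz); p: uplink transmit power (W); snr: uplink SNR (linear).
    """
    r = B * np.log2(1 + snr)                    # uplink rate (bits/s)
    T_u = w * d / r                             # uplink transmission delay
    T_o = w * d * c / f_o                       # edge processing delay
    T_l = (1 - w) * d * c / f_l                 # local processing delay
    T = max(T_u + T_o, T_l)                     # parallel local/edge execution
    E_u = p * w * d / r                         # transmission energy
    E_o = kappa_o * w * d * c * f_o ** 2        # edge computation energy
    E_l = kappa_l * (1 - w) * d * c * f_l ** 2  # local computation energy
    return T, E_u + E_o + E_l

# Fully local (w = 0) vs fully offloaded (w = 1), with illustrative task values
T0, E0 = offload_metrics(0.0, d=1e6, c=1e3, f_o=1e10, f_l=1e9, B=2e6, p=0.1, snr=100)
T1, E1 = offload_metrics(1.0, d=1e6, c=1e3, f_o=1e10, f_l=1e9, B=2e6, p=0.1, snr=100)
assert T0 == 1.0   # local: 1e6 bits * 1e3 cycles/bit / 1e9 cycles/s = 1 s
assert T1 < T0     # offloading is faster here because the server CPU is 10x faster
```

With these illustrative numbers, full offloading cuts the latency roughly by the ratio of the server to the UE CPU frequency, plus the uplink transmission time.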

Problem Formulation
This section studies the performance optimization and trade-offs of sensing, data transmission, and computation offloading.The overall system performance is optimized through joint beamforming, phase shifting design, and resource allocation.

Transmission Performance Optimization
The optimization goal of the IRS-assisted ISAC system is to maximize the data rate while satisfying the sensing performance requirement. The objective of the data transmission optimization can be formulated as follows:

max_{W, Φ} Σ_{k∈K} log_2(1 + ρ_k)  (20)
subject to
Tr(WW^H) ≤ P_max,  (21)
|Φ_{n,n}| = 1, ∀n ∈ N,  (22)
ρ_{r,k} ≥ ρ_thr, ∀k ∈ K,  (23)

where ρ_thr is a threshold value for the radar SINR. Constraint (21) depicts the transmit power limit for deploying the ISAC, constraint (22) is the unit-modulus constraint of the passive IRS elements, and constraint (23) ensures the sensing performance while the communication performance is optimized.

System Energy Consumption Optimization
Due to the strained resources of the UE, it is necessary to optimize the UE energy consumption. The optimization objective of the computation offloading is to minimize the system energy consumption while satisfying the latency constraints, which is written as

min_{w_k, f_{o,k}} Σ_{k∈K} E_k(t)  (24)
subject to
Σ_{k∈K} f_{o,k}(t) ≤ F_o^tol,  (25)
f_{l,k} ≤ f_{l,k}^tol, ∀k ∈ K,  (26)
T_k(t) ≤ ξ_k(t), ∀k ∈ K,  (27)
w_k ∈ [0, 1], ∀k ∈ K,  (28)

where F_o^tol and f_{l,k}^tol indicate the total computing resource of the BS server and the local computing resource of UE k, respectively. Constraints (25)-(27) represent the computing resource limitation of the BS, the maximum local computation resource, and the latency constraint for processing task D_k(t). Constraint (28) represents the offloading decision.

System-Comprehensive Performance Optimization
In this work, we aim to optimize the system's transmission performance and energy consumption through joint beamforming, phase shift design, and power allocation. Considering the coupling between the two objectives (20) and (24), we can reformulate the optimization problem as

max_{W, Φ, w_k, f_{o,k}} ω_1 Σ_{k∈K} log_2(1 + ρ_k) − ω_2 Σ_{k∈K} E_k(t)  (29)
subject to (21)-(23), (25)-(28).
The downlink sum data rate depends on the number of users, the transmit power, and the required sensing quality, and it can be maximized through a reasonable beamforming and phase shift design. Meanwhile, the total energy consumption of the system can be optimized by appropriate computation offloading decisions. The optimization problem (29) is NP-hard and non-convex; thus, solving it with traditional mathematical methods would incur substantial computational complexity. Moreover, considering the time-varying wireless channel environment, a model-free DRL approach is adopted to obtain the solution.

DRL-Based Joint Task Offloading and Resource Allocation Scheme
In this section, we formulate the optimization goal as an MDP. Then, we propose two improved DRL-based schemes to solve the joint precoding and computation offloading problem in the IRS-aided ISAC system.

MDP Formulation
We use a four-element tuple (S, A, P, R) to denote the MDP, where S and A denote the sets of system states and actions, respectively, P is the state transition probability, and R represents the reward function. The process by which the RL agent interacts with the environment can be outlined as follows. The agent adopts action a_t under environment state s_t and receives the instant reward r_t as the response to action a_t. Then, the environment state s_t turns into the new state s_{t+1} according to the transition function P(s_t, a_t, s_{t+1}). Reinforcement learning aims to obtain the optimal policy π*(a | s) for a given MDP, which is the mapping from states to actions that obtains the maximum long-term cumulative reward R_t = Σ_{i=0}^∞ γ^i R(s_{t+i+1}, a_{t+i+1}), where γ ∈ [0, 1) is the discount factor. We define the state, action, and reward in our model as follows.
State: The environment state at the t-th time step consists of the channel matrices, the BS transmit power, the size of the computation task, and the action adopted by the agent at the (t − 1)-th time step. Thus, the state s_t ∈ S is given by s_t = {H_1(t), h_{k,2}(t), P(t), d_k(t), a_{t−1}}. (30)
Action: The action of the agent comprises the transmit beamforming matrix at the BS, the phase shift of the IRS, and the computation offloading decision. We can formulate the action a_t ∈ A as a_t = {W̄(t), Φ̄(t), w_k(t)}, (31) where W̄(t) = [Re{W(t)}, Im{W(t)}] and Φ̄(t) = [Re{Φ(t)}, Im{Φ(t)}] stack the real and imaginary parts of the transmit beamforming and phase shift matrices, and w_k ∈ [0, 1] for ∀k ∈ K is the computation offloading action.
Reward: Through the feedback of the reward, the agent evaluates the action and makes improvements. This work aims to optimize the communication data rate while minimizing the system energy consumption. Thus, the reward r_t at the t-th time step is defined by r_t = ω_1 Σ_{k∈K} log_2(1 + ρ_k) − ω_2 Σ_{k∈K} E_k(t), (32) where ω_1 and ω_2 are the weighting factors with ω_1 + ω_2 = 1. The weighting factors can be used to control the optimization preference.
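The weighted reward at each step can be computed with a few lines of Python. The weighted-difference form below mirrors the objective (29); the rate and energy values are hypothetical, and the default weights are ours.

```python
def reward(rates, energies, w1=0.7, w2=0.3):
    """Weighted reward r_t = w1 * sum-rate - w2 * total energy, with w1 + w2 = 1."""
    assert abs(w1 + w2 - 1.0) < 1e-9
    return w1 * sum(rates) - w2 * sum(energies)

# Two users: per-user rates (bits/s/Hz) and per-user energies (J), hypothetical
r = reward([5.0, 3.0], [1.0, 2.0])
assert abs(r - (0.7 * 8.0 - 0.3 * 3.0)) < 1e-9   # 5.6 - 0.9 = 4.7
```

Shifting ω_1 toward 1 makes the agent favor throughput over energy savings, which is the control knob the paper describes.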

An Improved DDPG-Based Joint Optimization Algorithm
Considering that the transmit power, phase shift, and offloading scale factor are continuous variables, we resort to a policy-based scheme. The DDPG algorithm has been proven to be an effective solution for continuous control problems [23]; thus, a DDPG-based scheme is developed in this work. Figure 2 depicts the developed framework. The proposed DRL model adopts an evaluation network and a target network with identical structures but different parameters. Both the evaluation and target networks contain a set of actor-critic neural networks.

Figure 2. Proposed PER-DDPG-based joint task offloading and resource allocation framework (agent-environment interaction with primary and target actor-critic networks and prioritized sampling).
At each time slot, the evaluation network observes the environment state s_t and then outputs the action a_t. The Q value is adopted to describe the long-term reward of executing a_t, which can be calculated by the Bellman equation [38]
Q^µ(s_t, a_t) = E[r_t(s_t, a_t) + γ Q^µ(s_{t+1}, µ(s_{t+1}))], (33)
where µ : S → A is the deterministic policy function, and the actor function µ(s|ω^µ) maps a state to an action according to the current policy. The DRL agent interacts with the environment to find the optimal action corresponding to the maximum Q value. The experience replay mechanism is leveraged to break the correlation between experience tuples [39]. Applying J tuples sampled from the experience buffer, the critic network is trained by minimizing the loss function
L(ω^Q) = (1/J) Σ_{j=1}^J (y_j − Q(s_j, a_j|ω^Q))², (34)
where y_j = r_j + γ Q′(s_{j+1}, µ′(s_{j+1}|ω^{µ′})|ω^{Q′}) denotes the target value, and ω^{Q′} represents the parameters of the target function approximator. The actor network is updated following the policy gradient rule, and its gradient can be expressed as
∇_{ω^µ} J ≈ (1/J) Σ_{j=1}^J ∇_a Q(s, a|ω^Q)|_{s=s_j, a=µ(s_j)} ∇_{ω^µ} µ(s|ω^µ)|_{s=s_j}. (35)
To address the instability of the learning process, soft target updates are leveraged for the target actor-critic networks, which can be formulated as
ω^{Q′} ← τ ω^Q + (1 − τ) ω^{Q′},  ω^{µ′} ← τ ω^µ + (1 − τ) ω^{µ′}, (36)
with the soft update factor τ ≪ 1.
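Two of the building blocks above, the Bellman target and the soft target update with τ ≪ 1, can be sketched in NumPy as follows. This is a minimal illustration with toy parameter vectors, not the full DDPG training loop; function names and values are ours.

```python
import numpy as np

def soft_update(target_params, eval_params, tau=0.005):
    """omega' <- tau * omega + (1 - tau) * omega', applied element-wise (tau << 1)."""
    return [(1 - tau) * tp + tau * p for tp, p in zip(target_params, eval_params)]

def td_target(r, q_next, gamma=0.99, done=False):
    """Bellman target y = r + gamma * Q'(s', mu'(s')), zeroed at terminal states."""
    return r + (0.0 if done else gamma * q_next)

# Toy parameter vectors: the target creeps slowly toward the evaluation network
theta = [np.ones(3)]
theta_t = [np.zeros(3)]
theta_t = soft_update(theta_t, theta, tau=0.1)
assert np.allclose(theta_t[0], 0.1)                       # moved 10% of the way

assert abs(td_target(1.0, 2.0, gamma=0.9) - 2.8) < 1e-12  # y = 1 + 0.9 * 2
```

The small τ keeps the target networks nearly frozen between updates, which is what stabilizes the bootstrapped critic training.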
The experience replay mechanism also mitigates the divergence that the training process is prone to. Since the conventional experience replay mechanism replays the transition tuples uniformly, the differing importance of the experiences is ignored. Prioritized experience replay (PER) assigns priorities based on the importance of the experience samples and is adopted to speed up the training convergence [39]. The internal logic of the PER mechanism is to replay extremely good or bad experiences more frequently. The temporal-difference error (TD-error) is usually leveraged as the measurement of an experience's value [40]; the absolute TD-error is proportional to the correction applied to the expected action value. The TD-error of transition tuple i can be formulated as δ_i = r_i + γ Q′(s_{i+1}, µ′(s_{i+1}|ω^{µ′})|ω^{Q′}) − Q(s_i, a_i|ω^Q). The sampling probability of transition i is given by P(i) = p_i^ϱ / Σ_j p_j^ϱ, where p_i = 1/rank(i), rank(i) represents the ranking of transition i when the buffer is sorted according to the absolute TD-error, and ϱ controls the degree of prioritization adopted. However, PER changes the state visitation frequency and introduces a bias, which may cause oscillation and divergence. Thus, importance-sampling weights are employed to correct the bias, with W_i = (1/(B · P(i)))^β, where B denotes the replay buffer size and β is the factor that controls the degree of correction. The proposed DRL-based joint task offloading and resource allocation algorithm is summarized in Algorithm 1.
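The rank-based PER quantities can be sketched as follows: a NumPy illustration of P(i) = p_i^ϱ / Σ_j p_j^ϱ with p_i = 1/rank(i) and the importance-sampling weights. Function names, the max-normalization of the weights, and parameter defaults are our assumptions.

```python
import numpy as np

def rank_probabilities(td_errors, rho=0.7):
    """Rank-based priorities: p_i = 1/rank(i), P(i) = p_i^rho / sum_j p_j^rho."""
    order = np.argsort(-np.abs(td_errors))        # rank 1 = largest |TD-error|
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(td_errors) + 1)
    p = (1.0 / ranks) ** rho
    return p / p.sum()

def is_weights(P, beta=0.4):
    """Importance-sampling weights W_i = (1/(B*P(i)))^beta, max-normalized."""
    B = len(P)
    w = (1.0 / (B * P)) ** beta
    return w / w.max()                            # normalize for update stability

P = rank_probabilities(np.array([0.1, -2.0, 0.5]))
assert np.isclose(P.sum(), 1.0)
assert P[1] == P.max()    # the transition with the largest |TD-error| is sampled most
```

Note that the rarely sampled (low-P) transitions get the largest importance-sampling weights, which is exactly what offsets the bias PER introduces.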

Twin Delayed DDPG (TD3)-Based Joint Optimization Algorithm
The TD3 algorithm is considered an improvement of DDPG, solving a series of issues caused by the overestimation of the Q value in DDPG [41]. We depict the TD3-based joint optimization framework in Figure 3. Although the overestimation error may be small in each update, it can accumulate over many updates and create a significant bias. Furthermore, the inaccurate Q value leads to the deterioration of the policy network. This process forms a feedback loop in which suboptimal behavior is continuously reinforced. The TD3 algorithm addresses the above-mentioned challenge through the following techniques. Firstly, clipped double Q-learning: TD3 leverages twin critic networks to estimate two Q functions and chooses the smaller one as the target Q value when computing the Bellman-equation loss. The target update in the double critic network framework is formulated as
y = r + γ min_{n=1,2} Q′_n(s_{t+1}, ã_{t+1}|ω^{Q′_n}), (42)
where ω^{Q′_n} (n = 1, 2) denote the weight parameters of the two target critic networks, respectively. The critic networks are updated by using the loss functions
L(ω^{Q_n}) = (1/J) Σ_{j=1}^J (y_j − Q_n(s_j, a_j|ω^{Q_n}))², n = 1, 2, (43), (44)
where ω^{Q_1} and ω^{Q_2} indicate the weight parameters of the two evaluation critic networks, respectively; the smaller value is adopted in the Bellman error target. Secondly, delayed policy updates: the actor and its target network are updated less frequently than the critic networks, to avoid the divergent behavior caused by policy updates under inaccurate value estimates. Thirdly, target policy smoothing: a regularization strategy is leveraged in TD3 to mitigate overfitting to narrow peaks in the Q value. In practice, a random noise is added in the target action selection to enforce generalization over similar actions, as given by
ã_{t+1} ← µ′(s_{t+1}|ω^{µ′}) + ϵ, (45)
where the added noise ϵ ∼ clip(N(0, σ_a), −c, c) is clipped by the constant c to ensure the proximity between the target action and the original one. The TD3-based joint optimization algorithm is similar to the procedure of Algorithm 1, with improvements in
the following aspects:
• In the input step, two pairs of critic networks Q_1(s, a|ω^{Q_1}) and Q_2(s, a|ω^{Q_2}) are given. In Step 1, the parameters of the two evaluation critics and the two target critics are initialized. Target policy smoothing is realized by (45). Then, the agent updates the target value using (42). In Step 16, the losses are computed by (43) and (44).
• Before turning to Step 17, the agent adopts a delayed update strategy to keep the policy networks updated less frequently than the value networks.
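Two of the three TD3 ingredients, the clipped double-Q target and target policy smoothing, can be condensed into a small NumPy sketch (delayed policy updates are just an update-frequency counter, so they are omitted). All names and constants below are illustrative.

```python
import numpy as np

def td3_target(r, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: y = r + gamma * min(Q1', Q2')."""
    return r + gamma * min(q1_next, q2_next)

def smoothed_action(mu_next, sigma_a=0.2, c=0.5, rng=None):
    """Target policy smoothing: a~ = mu'(s') + clip(N(0, sigma_a), -c, c)."""
    rng = rng or np.random.default_rng()
    eps = np.clip(rng.normal(0.0, sigma_a, size=np.shape(mu_next)), -c, c)
    return mu_next + eps

# Taking the minimum of the twin critics damps overestimation
y = td3_target(1.0, q1_next=3.0, q2_next=2.5, gamma=0.9)
assert y == 1.0 + 0.9 * 2.5       # the smaller critic value is used

# Smoothing noise never pushes the target action further than c from mu'(s')
a = smoothed_action(np.zeros(4), rng=np.random.default_rng(0))
assert np.all(np.abs(a) <= 0.5)
```

Using the pessimistic minimum of two critics is what breaks the overestimation feedback loop described above, at the cost of a mild underestimation bias.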

Numerical Results
In this section, simulation results are presented to assess the proposed DRL-based task offloading and resource allocation schemes in the IRS-assisted ISAC system. The simulation is based on Python 3.8 and PyTorch 1.8.0. We assume that the BS and the IRS are located at [−10, 0, 0] m and [90, 0, 2] m, respectively. The UEs are randomly distributed within a radius of 1 m below the IRS [21]. The channel matrix H_1 and the vectors h_{k,2} with k ∈ K follow the Rician distribution with the Rician factor γ_1 = 3 dB [42]. According to [43], the carrier frequency is set to 30 GHz, and the shadowing standard deviation is 7.8 dB. The path-loss exponents of the BS-IRS and IRS-UE links are set to 2.8 and 2.5 [44]. The noise power spectral density is −174 dBm/Hz; accordingly, we set the noise power σ² = −85 dBm and the bandwidth B_k = 2 MHz [43]. The input data size of task d_k, the required computation cost c_k, and the UE CPU frequency f_{l,k} are randomly generated in the intervals [1, 2] Mbits, [1, 3] Kcycles/bit, and [1, 2] Gcycles/s, respectively [45]. The total CPU frequency of the BS server F_o^tol and the effective capacitance coefficients κ_o and κ_l are set to 10 Gcycles/s, 10^−26, and 3 × 10^−26, respectively [45]. The default simulation parameters are listed in Table 2.

Convergence Performance
Considering the relationship between the DDPG-based algorithm's performance and the system parameters, we first conducted several experiments to find appropriate values for the learning rate and discount factor. As shown in Figures 4 and 5, the convergence performance of the proposed DDPG-based algorithm is displayed. Figure 4 depicts the average rewards under different learning rates, where the average reward is computed over the T_max time steps of an episode, with T_max denoting the maximum number of time steps. The figure shows that the maximum average reward is achieved when the learning rate is 0.01. Figure 5 shows the convergence performance under different discount factors; the algorithm performs better than the alternatives when the discount factor is 0.7. Therefore, we set the learning rate and the discount factor to 0.01 and 0.7 in the following experiments for the DDPG-based framework. Moreover, the average rewards increase with the number of training time steps and finally converge after about 10^4 rounds.

Performance Comparison
We compare the performance of the PER-DDPG-based scheme, the TD3-based scheme, and the random IRS phase scheme under different transmit power budgets and different numbers of IRS elements. We set the number of IRS elements N × N to 100, 256, and 400, and the number of users K to 10, 16, and 20, respectively. Figure 6 illustrates that the achievable weighted communication data rate grows with the maximum transmit power budget and the number of IRS elements. It can be seen from the figure that the TD3-based algorithm achieves the best communication performance, the PER-DDPG-based algorithm is slightly inferior to the TD3 scheme, and the random phase shift-based one performs worst. Figure 7 plots the sensing SINR versus the transmit power budget, where the number of users K and the number of IRS elements N × N are set to 10 and 100, respectively. The radar sensing SINR consistently increases with the transmit power budget, but the growth gradually slows down. Our proposed DRL-based algorithms outperform the baseline.

Figure 8 depicts the correlation between the number of users and the system energy consumption, where the number of IRS elements N × N is set to 100. As depicted in the plot, the system energy consumption increases with the growing number of users: as the number of users rises, the amount of task data offloaded to the base station rises, resulting in growing energy consumption. The two proposed schemes dramatically reduce the overall execution energy consumption compared with the local execution method, and the TD3-based scheme is slightly better than the PER-DDPG-based scheme.
Figure 9 describes the relationship between the number of users and the offloading delay, with the number of IRS elements N × N set to 100. The figure demonstrates that an increase in the number of users leads to a rise in the data processing time, owing to resource competition among the users. Compared with the local execution method, the proposed DRL methods greatly reduce the overall average execution delay of the tasks, and the TD3 algorithm achieves the lowest total delay.

Conclusions
In this paper, we studied the IRS-assisted ISAC framework, wherein the IRS is exploited to establish virtual links in NLoS areas to enhance the radar sensing performance and the communication data rate. We aimed to improve the system's transmission and energy efficiency through joint task offloading and resource allocation under the constraints of the transmit power budget, the sensing quality, and the tolerable offloading latency. Specifically, the transmit beamforming, the IRS phase shift, and the task offloading are jointly designed, and weighting coefficients are leveraged to control the balance between performance and overhead. The PER-DDPG-based and TD3-based algorithms are developed for this complex optimization problem. Numerical results demonstrate that the proposed algorithms outperform the baseline scheme. In addition, the simulations show that the system performance is related to the transmit power, the number of IRS elements, and the number of users; in practical applications, the system performance can thus be optimized by setting these parameters reasonably. In future work, we will combine distributed DRL algorithms and the federated learning framework to improve the efficiency and scalability of the joint optimization scheme in large-scale networks. Meanwhile, when extended to multi-IRS scenarios, our proposed method would suffer from the action space explosion caused by the exponential increase in intermediate channel coefficients; therefore, meta-reinforcement learning can be adopted to decompose the cascaded channel and reduce the solution complexity and computational overhead. Moreover, future experiments will focus on implementing and testing the proposed strategies in real environments, striving to translate the theoretical potential into practical gains.

Figure 3. Proposed task offloading and resource allocation framework based on TD3.

Figure 6. The weighted achievable data rate versus the transmit power budget.

Figure 7. Sensing SINR versus the transmit power budget.

Figure 8. The total energy consumption versus the number of users.

Figure 9. The total average execution latency versus the number of users.

Table 1. Comparison with the state of the art.