Task Offloading Decision-Making Algorithm for Vehicular Edge Computing: A Deep-Reinforcement-Learning-Based Approach

Efficient task offloading decision-making is a crucial technology in vehicular edge computing, which aims to fulfill the computational performance demands of complex vehicular tasks with respect to delay and energy consumption while minimizing network resource competition and consumption. Conventional distributed task offloading decisions rely solely on the local state of the vehicle and fail to exploit the server's resources to their fullest potential. In addition, the mobility of vehicles is often neglected in these decisions. In this paper, a cloud-edge-vehicle three-tier vehicular edge computing (VEC) system is proposed, where vehicles partially offload their computing tasks to edge or cloud servers while executing the remainder on the vehicle terminals. Under the restrictions of vehicle mobility and discrete variables, task scheduling and the task offloading proportion are jointly optimized with the objective of minimizing the total system cost. Considering the non-convexity, high-dimensional complex state, and continuous action space of the optimization problem, we propose a task offloading decision-making algorithm based on the deep deterministic policy gradient (TODM_DDPG). The TODM_DDPG algorithm adopts the actor–critic framework, in which the actor network outputs floating-point numbers representing a deterministic policy, while the critic network evaluates the action output by the actor network and adjusts its evaluation according to the rewards received from the environment so as to maximize the long-term reward. To explore the algorithm's performance, we conduct parameter-setting experiments to tune the core hyper-parameters and select the optimal parameter combination. In addition, to verify the algorithm's performance, we also carry out a series of comparative experiments against baseline algorithms.
The results demonstrate that, in terms of reducing system cost, the proposed algorithm outperforms baseline algorithms such as the deep Q network (DQN) and actor–critic (AC), improving performance by about 13% on average.

tasks [10], posing challenges such as limited network bandwidth, high latency, and security concerns [11]. To address these challenges, mobile edge computing moves computing and storage functions to the network edge as a complement to cloud computing and has become a promising solution [12][13][14]. This approach, suitable for vehicular networking environments with stringent latency and reliability needs, alleviates the limitations of the IoV. Vehicular edge computing combines mobile edge computing and the Internet of Vehicles, enabling distributed edge servers to facilitate computation offloading and interaction between vehicle users and roadside units via wireless access networks [15,16]. While mobile edge computing effectively addresses the limited resources and real-time constraints of the Internet of Vehicles, it still faces challenges such as restricted computing, storage, and bandwidth resources compared to cloud servers [17,18]. Overloading edge servers leads to increased service delay and reduced user experience [19]. Moreover, in vehicular edge computing, where vehicles are the main research object, their high-speed mobility results in dynamic changes in communication links within the computation offloading environment [20]. Therefore, some service requests from vehicle users still require execution in the cloud or locally to optimize the utilization of system resources. Consequently, it remains a challenge to maintain the efficient operation of a vehicular edge computing system by designing intelligent resource management and scheduling schemes that can dynamically adapt to the complex vehicular edge computing environment while considering the dynamic nature of the system status and resource constraints.
In the face of these challenges, traditional optimization techniques such as convex or non-convex optimization are insufficient to address decision-making problems in the vehicular edge computing environment. We introduce deep reinforcement learning as our solution. Deep reinforcement learning is a field within artificial intelligence that enables intelligent agents to learn optimal actions through iterative interactions with the environment, allowing them to adapt effectively to dynamic environments [21,22]. Inspired by the aforementioned issues, we design a VEC system model with a "cloud-edge-vehicle" three-tier architecture, which integrates system resources, vehicle mobility, and task offloading to develop a comprehensive offloading strategy that considers both user experience and system energy consumption. This strategy effectively reduces system costs by deciding the task offload destinations of vehicles and reasonably allocating the proportion of task offloading. We formulate the task offloading decision process in vehicular edge computing environments as a Markov decision process. To optimize system costs, we propose a novel deep reinforcement learning algorithm that considers dynamic communication conditions. This algorithm effectively allocates the task offloading proportion, resulting in improved efficiency and reduced system costs.
The main contributions of this article are as follows:
• This paper proposes a "cloud-edge-vehicle" collaborative vehicular edge computing system model aimed at minimizing system costs by considering both system energy consumption and user experience. Taking into account the dynamic state of the system, we jointly optimize vehicle scheduling, task offloading, and resource allocation. The optimization process is formulated as a Markov decision process, and the corresponding parameters are set.
• Taking into account the complexity of the vehicular edge computing environment and the continuous action space required by computation offloading decisions, this paper proposes a task offloading decision-making algorithm based on the deep deterministic policy gradient (TODM_DDPG). The algorithm employs deep neural networks to approximate the policy and Q-functions, thereby preventing dimensional explosion, and obtains the optimal policy by jointly addressing task offloading and resource allocation.
• This paper also provides a detailed training and validation process for the TODM_DDPG algorithm and introduces a state normalization mechanism to pre-process the system state.
The remainder of this paper is organized as follows: Section 2 reviews related work. The system model and problem description are introduced in Section 3. The background on deep reinforcement learning and the proposed TODM_DDPG algorithm are introduced in Section 4. Section 5 investigates the experimental results. Finally, Section 6 concludes the paper.

Related Work
As autonomous driving technology continues to evolve and intelligent transportation systems (ITS) mature, the scale of vehicular data for new applications such as vision-based object detection, path planning, and in-vehicle entertainment is growing explosively [23]. To address the diverse service needs of vehicular users, it is urgent to design effective computation offloading schemes for vehicular tasks. Due to its ability to meet the time-sensitive requirements of ITS application services, the combination of mobile edge computing with the Internet of Vehicles has drawn considerable attention in the field of intelligent transportation [15]. Computation offloading, as a core technology of mobile edge computing, encompasses several key elements, including the offloading target and destination, offloading mode, offloading function, and evaluation criteria for decision-making. These aspects collectively contribute to the overall process of computation offloading, enabling efficient resource utilization and improved system performance.
From the perspective of evaluation criteria for offloading decisions, most offloading decisions aim to optimize delay, energy consumption, system resources, or a combination of delay and energy consumption, and to maximize the quality of user experience or minimize a customized system cost. Luo et al. [24], aiming to minimize offloading delay, proposed a multi-objective particle swarm optimization method using game-theoretic analysis, which comprehensively considered communication, offloading decisions, and the allocation of computing resources. Simulation results show that this strategy is effective and feasible for finding the Pareto optimal solution. Based on the combination of a genetic algorithm and heuristic rules, Sun et al. [25] introduced a hybrid algorithm. Their approach aims to minimize delay and resource consumption by focusing on determining the execution location and the order of tasks. By leveraging the genetic algorithm and heuristic rules, their algorithm can effectively optimize task execution, resulting in reduced delays and efficient utilization of system resources. By introducing methods such as congestion degree and gravity reference points, the authors of [26] suggested an improved multi-objective whale optimization algorithm. A distributed computation offloading problem in the scenario of mobile device users was presented by Chen et al. [27]. They formulated the problem as a multi-user computation offloading game and presented a distributed technique that achieves the Nash equilibrium. Huang et al. [28] presented a cloud-edge server collaborative computation offloading method that leverages service orchestration. Their approach considers delay and energy consumption and incorporates differentiated offloading decisions based on varying needs and delay sensitivities. By optimizing the distribution of processing tasks between cloud and edge servers, this technique successfully satisfies the demands of minimal latency and high dependability.
In an effort to balance energy consumption and computation offloading delay, Cui et al. [29] suggested an enhancement to the NSGA-II method to find better solutions to this constrained multi-objective optimization problem.
However, in the vehicular edge computing environment, the research object of computation offloading is the vehicle, and the mobility of the vehicle leads to dynamic changes in the network topology. Traditional convex optimization and heuristic algorithms are not suitable for dynamic vehicular networks with time-varying vehicles, resources, and channel states. The majority of the aforementioned works, however, only seek an optimal or near-optimal offloading strategy for the current system snapshot, without considering the long-term impact of the current strategy on the environment. The decision-making process of computation offloading can be abstracted as a Markov decision process in the mobile edge computing context with dynamically changing network topology and computing resources [30]. Through repeated interactions with the environment, reinforcement learning, a subfield of artificial intelligence, enables intelligent agents to learn the optimal actions that maximize cumulative rewards. Reinforcement learning effectively addresses the Markov decision problem, permitting agents to learn through continuous trial and error [31]. A synchronous joint Q-learning (Sync-FQL) approach was introduced by Xiong et al. [32] that considers the probability of offloading failures in the vehicular edge computing environment while minimizing computation and communication costs. The algorithm optimizes the model to make the best use of available resources by learning the Q-values of different parts. However, the performance of Q-learning algorithms in high-dimensional observation spaces is constrained by the size of the Q-table [33]. To overcome the shortcomings of conventional reinforcement learning algorithms in handling high-dimensional state and action spaces, combining deep learning with RL offers a potent strategy [21].
By leveraging the representation learning capabilities of deep neural networks, this combined approach enables effective learning and decision-making in complex environments. To improve computation offloading efficiency in vehicular networks, Zhao et al. [34] introduced an AHP-DQN-based computation offloading algorithm. This algorithm utilizes the Analytic Hierarchy Process (AHP) to appropriately allocate vehicle tasks and determines offloading decisions based on real-time channel gains. It addresses challenges such as limited terminal storage capacity and diverse network services in the process of computation offloading. In order to minimize user costs, Wang et al. [35] adopted the Double Deep Q-learning (DDQN) algorithm, treating the computational capacity of the mobile edge computing servers as the system state and improving resource utilization by learning offloading policies. For resource allocation and computation offloading, Khayyat et al. [36] introduced a DQN-based distributed learning algorithm. They approached the problem as a binary optimization challenge with the goal of minimizing the system's overall time and energy costs. In order to meet the low-latency requirements of vehicle-to-vehicle communication links and improve the throughput of vehicle-to-infrastructure links, Fu et al. [37] proposed a Double Deep Q-learning (DDQN) algorithm, which effectively achieves the intelligent allocation of resources. While DQN-based algorithms effectively address dynamic computation offloading challenges, the discretization methods they require for continuous action spaces, such as resource allocation, can result in dimensionality explosion [38]. Regarding computation offloading decision-making in continuous action spaces, research has been conducted recently. A mobile fog technique based on the DDPG algorithm, put forth by Chen et al. [39], effectively solves the state space explosion problem. Lu et al. [40] proposed an improved algorithm based on DDPG to solve the joint optimization problem of service delay, energy consumption, and task success rate in the mobile edge computing environment, effectively improving the quality of user experience.

System Model and Problem Description
As illustrated in Figure 1, the proposed system adopts a three-tier, region-based architecture for the vehicular edge computing network, consisting of the vehicle layer, the edge layer, and the cloud layer. The vehicle layer comprises user vehicles on the road that are equipped with limited computing resources. Each vehicle can communicate with base stations and roadside units through 5G/LTE technology [41] and a dedicated wireless interface (IEEE 802.11p). The edge layer comprises roadside nodes deployed in different regions of the map. These roadside nodes include roadside units with a limited signal coverage range and mobile edge computing servers, connected to the roadside units, that provide computational and storage resources; the signal ranges of different roadside nodes do not overlap. The cloud layer represents the cloud service layer, which includes high-performance computing resources connected to base stations through wired links to provide the necessary resource support. The entire system operates in discrete time with equal time slot intervals [42]. In each time slot, the mobile edge computing servers can establish communication with vehicles within their coverage areas. Following [14], we assume that the base station's signal coverage area is sufficiently large and that the cloud server can offer services to all vehicles. For vehicles outside the coverage area of the mobile edge computing servers, tasks need to be executed on the cloud. The set of RSUs is K = {1, 2, . . . , k}, and the set of vehicles is represented by N = {1, 2, . . . , n}. Vehicles generate computational tasks at random in each time slot. Vehicle i's task is denoted by T_i = (D_i, C_i, Γ_i), where D_i represents the task's size, C_i denotes its computational complexity, and Γ_i denotes the task's maximum tolerated delay. Table 1 gives the main symbols and their definitions.
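To make the task model concrete, it can be sketched as a small data structure (an illustrative sketch only; the class and field names are ours, not from the paper):

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    """Task T_i = (D_i, C_i, Gamma_i) generated by vehicle i in a time slot."""
    data_size: float   # D_i: task size in bits
    complexity: float  # C_i: CPU cycles required per bit of data
    max_delay: float   # Gamma_i: maximum tolerated delay in seconds

# Each vehicle generates one computational task at random per time slot.
tasks = [Task(random.uniform(1e5, 1e6), random.uniform(500, 1500), 0.5)
         for _ in range(5)]
```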
I(t) — the number of tasks offloaded to the mobile edge computing server
p_up — vehicle i's transmission power
g — channel gain
σ² — the communication channel's noise power
θ_i — the offloading proportion of task T_i
q_j — the location of mobile edge computing server j
a_{i,j} — the flag of task T_i offloaded to mobile edge computing server j
b_i — the flag of task T_i offloaded to the cloud

Communication Model
In the vehicular edge computing system, computational services are time-divisionally allocated to vehicles [42]. Each vehicle is given a specific time slot for data transmission, as the communication period is divided into T time slots. Using the Shannon formula, the data transmission rate from vehicle i to the connected mobile edge computing server j within a given time slot t is V_{i,j}(t) = (B_i^mec / I(t)) log2(1 + p_up g_j / σ²). In this equation, B_i^mec represents the total bandwidth available for the uplink channel of mobile edge computing server j, I(t) represents the number of tasks offloaded to the mobile edge computing server during time slot t, g_j stands for the channel gain between the vehicle and mobile edge computing server j, p_up refers to vehicle i's transmission power, and σ² is the communication channel's noise power. Similarly, vehicles can also communicate with the cloud server via 5G/LTE technology. The transmission rate from vehicle i to the cloud server is V_{i,c}(t) = B_cloud log2(1 + p_up g_c / σ²), where g_c is the channel gain between the vehicle and the cloud server, and B_cloud denotes the bandwidth of the cloud server.
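As a sketch, the uplink rate follows directly from the Shannon formula; here we assume, as the text describes, that the MEC uplink bandwidth is shared equally among the I(t) tasks offloaded in the slot (function and parameter names are ours):

```python
import math

def uplink_rate(bandwidth_hz, num_offloading_tasks, p_up, gain, noise_power):
    """Shannon rate for vehicle i -> MEC server j, with the bandwidth split
    equally among the I(t) tasks offloaded in this time slot."""
    per_task_bw = bandwidth_hz / num_offloading_tasks
    return per_task_bw * math.log2(1 + p_up * gain / noise_power)

# Cloud link: same formula with the full cloud bandwidth and channel gain g_c.
rate_mec = uplink_rate(10e6, 2, p_up=0.2, gain=15.0, noise_power=1.0)
```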

Computation Model
We use a partial offloading strategy in our vehicular edge computing system, where for each time slot t, the computational task generated by the vehicle is divisible. θ i (t) is the offloading proportion of task T i , and (1 − θ i (t)) is the proportion of the task to be executed at the local terminal.
The connection between vehicles and mobile edge computing servers is limited by the signal coverage range. Mobile edge computing server j in our vehicular edge computing system has a fixed location represented by the coordinate q_j = (q_j^x, q_j^y)^T ∈ R^{2×1}, with a signal coverage radius r_j. In time slot t, p_i(t) = (p_i^x(t), p_i^y(t))^T ∈ R^{2×1} represents vehicle i's location. The position of vehicle i after a time interval ∆t is p_i(t + ∆t) = p_i(t) + v_i(t)∆t · d_i(t), where v_i(t) is the vehicle's speed and d_i(t) is the direction of movement, supposing that vehicles keep the same direction and speed during time slot t. For a ∆t time interval in time slot t, only when both ‖p_i(t) − q_j‖_2 and ‖p_i(t + ∆t) − q_j‖_2 are less than r_j is vehicle i regarded as being inside mobile edge computing server j's signal coverage. For vehicles located outside the mobile edge computing server's signal coverage, their tasks need to be sent to the cloud for execution.
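The two-endpoint coverage test described above can be written as a short helper (a sketch; `math.dist` computes the Euclidean norm ‖·‖₂):

```python
import math

def in_coverage(p_now, p_next, q_server, radius):
    """Vehicle counts as covered only if it lies inside the MEC server's
    signal radius both at the start and at the end of the interval dt."""
    return (math.dist(p_now, q_server) < radius
            and math.dist(p_next, q_server) < radius)
```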

Local Computational Model
In time slot t, the local execution time of task T_i for vehicle i is T_i^local(t) = (1 − θ_i(t)) D_i(t) C_i(t) / f_local, where C_i represents the computational complexity and f_local is the vehicle terminal's execution capability. The local energy consumption is given by E_i^local(t) = P_local T_i^local(t), where P_local represents the local computation power of the vehicle.
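Assuming the usual cycles-per-bit convention, i.e. that the local share needs (1 − θ_i)·D_i·C_i CPU cycles, the local delay and energy can be sketched as follows (names are ours; the paper's exact formulas are in its elided equations):

```python
def local_delay_energy(theta, D, C, f_local, P_local):
    """Delay and energy for the (1 - theta) share of the task executed
    on the vehicle terminal."""
    t_exec = (1 - theta) * D * C / f_local  # required cycles / CPU frequency
    return t_exec, P_local * t_exec
```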

Mobile Edge Computing Computational Model
We assume that the feedback delay can be disregarded because the resulting data size is significantly smaller than the task data itself [13]. Therefore, in time slot t, the total delay of task T_i consists of two parts: the transmission delay and the execution delay on the mobile edge computing server. The transmission delay of task T_i is T_i^{mec,tr}(t) = θ_i(t) D_i(t) / V_{i,j}(t), where θ_i(t) denotes the offloading proportion, D_i(t) represents the task data size, and V_{i,j}(t) represents the transmission rate from vehicle i to the connected mobile edge computing server j. We assume the server has I cores with the same processing power, denoted f_mec for each core. When more tasks are offloaded to the mobile edge computing server than there are cores available at time slot t, the remaining tasks must be executed on the cloud. The execution delay for tasks offloaded to the mobile edge computing server is T_i^{mec,exc}(t) = θ_i(t) D_i(t) C_i(t) / f_mec. When a vehicle offloads a task to the MEC, the total delay is the sum of the transmission and execution delays, and the energy consumption is E_i^{mec}(t) = P_{tr}^{mec} T_i^{mec,tr}(t) + P_{exc}^{mec} T_i^{mec,exc}(t), where P_{tr}^{mec} and P_{exc}^{mec} are the MEC's transmission and execution powers, respectively.
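Under the same cycles-per-bit assumption, the MEC share of the task can be costed as follows (a hedged sketch; names are ours):

```python
def mec_delay_energy(theta, D, C, rate, f_mec, P_tr, P_exc):
    """Transmission + execution delay and energy for the theta share
    of the task offloaded to one MEC server core."""
    t_tr = theta * D / rate        # upload time of the offloaded data
    t_exc = theta * D * C / f_mec  # execution time on one MEC core
    energy = P_tr * t_tr + P_exc * t_exc
    return t_tr + t_exc, energy
```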

Cloud Computational Model
The computational capability of the cloud server is denoted f_cloud. For task T_i executed on the cloud, the execution time is denoted T_i^{cloud,exc}(t) and the transmission time is denoted T_i^{cloud,tr}(t); they are given by T_i^{cloud,tr}(t) = θ_i(t) D_i(t) / V_{i,c}(t) and T_i^{cloud,exc}(t) = θ_i(t) D_i(t) C_i(t) / f_cloud. When a vehicle offloads a task to the cloud, the total delay and energy consumption are defined analogously to the MEC case. Overall, for the offloaded part of the task, a_{i,j}(t) indicates whether task T_i is offloaded to mobile edge computing server j, taking a value of 1 if it is and 0 otherwise, while b_i(t) indicates whether task T_i is offloaded to the cloud. The total delay and energy of task T_i therefore combine the local part with the offloaded part selected by a_{i,j}(t) and b_i(t).
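One way to combine the pieces, assuming the local and offloaded shares execute in parallel so that the task finishes when the slower share does (a common modeling choice, stated here as our assumption rather than the paper's elided equation):

```python
def task_total_delay(t_local, t_mec, t_cloud, a_ij, b_i):
    """Total delay of task T_i: the offloaded share goes to exactly one of
    MEC server j (a_ij = 1) or the cloud (b_i = 1); the local share runs
    in parallel on the vehicle terminal."""
    t_offload = a_ij * t_mec + b_i * t_cloud
    return max(t_local, t_offload)
```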

Problem Description
The system's overall cost is defined in this paper as the weighted sum of the user service cost and the system energy consumption. For time slot t, the user service cost is defined as the total waiting time for all tasks to complete, and the system energy consumption is defined as the total energy consumed by local, edge, and cloud execution. The objective of this paper is to improve system efficiency by minimizing the system cost. According to the above model, we formulate the optimization problem, where C1 is the delay constraint, meaning that the task waiting time cannot exceed the maximum tolerated delay; C2 and C8 are the offloading constraints, which indicate the offloading destination of task T_i; C3 is the mobile edge computing server core constraint; C4 is the computational constraint, which mandates that all computation tasks be finished within the allotted time; C5 and C7 are weight coefficient constraints; and C6 is the offloading proportion constraint, which indicates the value range of the offloading proportion.
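The objective above can be sketched directly; the two weights correspond to the coefficient constraints C5 and C7 (variable names are ours):

```python
def system_cost(task_delays, energies, w_delay, w_energy):
    """Weighted sum of user service cost (total waiting time of all tasks)
    and system energy consumption for one time slot."""
    return w_delay * sum(task_delays) + w_energy * sum(energies)
```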

Deep-Deterministic-Policy-Gradient-Based Task Offloading Decision-Making Algorithm
In this section, we first present the vehicular edge computing system's reinforcement learning framework and elaborate the main elements of the Markov decision process. We also explain how to train an effective task offloading decision-making algorithm within the vehicular edge computing system. In detail, we introduce the normalization pre-processing of the states and illustrate the training and validation process of the algorithm.

Vehicular-Edge-Computing-Based Reinforcement Learning Framework
The reinforcement learning framework model of our vehicular edge computing system is shown in Figure 2. The model describes how intelligent agent vehicles interact with the environment. The agent observes the state s_t and determines the appropriate action a_t according to the trained policy π_t. After action a_t is chosen, the environment's state transitions from s_t to s_{t+1}. Then, the intelligent agent vehicles receive an instantaneous reward r_t associated with the transition. This process can be abstracted and modeled as a Markov decision process, where actions and states follow the Markov property. The basic model of the Markov decision process is a five-tuple < S, A, P, R, γ >. In this formulation, S represents the system's state space, while A represents the action space. P(s_{t+1} | s_t, a_t) is the state transition function, which determines the probability of transitioning from state s_t to s_{t+1} when the agent performs action a_t. R(s_t, a_t) determines the instantaneous reward obtained when action a_t is performed in state s_t. γ, which ranges between zero and one, indicates the extent to which the present reward R(s_t, a_t) affects the future. The goal of the Markov decision process framework is to determine an optimal policy π(s_t) for each state s_t that maximizes the expected cumulative long-term reward, i.e., the discounted sum of future rewards.
To assess the value of the current state in terms of long-term reward, value functions are utilized. There are two main types of value functions: the action value function Q π (s t , a t ) and the state value function V π (s t ).
V^π(s_t) represents the expected cumulative reward obtained by following policy π starting from state s_t. Q^π(s_t, a_t) denotes the expected discounted future return obtained by following policy π after taking action a_t in state s_t. V^π(s_t) and Q^π(s_t, a_t) can also be represented by the Bellman equation, which indicates the relationship between the value of the current state and that of subsequent states.
Q^π(s_t, a_t) = E_{s_{t+1}∼E}[ r(s_t, a_t) + γ E_{a_{t+1}∼π}[ Q^π(s_{t+1}, a_{t+1}) ] ]. The value functions V^π(s_t) and Q^π(s_t, a_t) estimate the expected return over potential futures. As mentioned previously, the goal of the Markov decision process framework is to discover the optimal policy, and the effectiveness of a policy is assessed using its associated value function. The optimal value functions reflect the best course of action: V*(s_t) = max_π V^π(s_t) and Q*(s_t, a_t) = max_π Q^π(s_t, a_t). The optimal policy is then π*(s_t) = argmax_{a_t} Q*(s_t, a_t). For the optimal policy, the Bellman equation becomes the Bellman optimality equation.

Markov Decision Process Elements
Based on the above model, we consider all vehicles of the vehicular edge computing system as a centrally controlled agent that can make effective offloading decisions with global information and the system environment state. The definitions of the Markov decision process elements are as follows: (1) State space: In each time slot, the system state is s_t = {D_remain(t), request(t), p(t), q_1(t), . . . , q_n(t), D_1(t), . . . , D_n(t), C_1(t), . . . , C_n(t)}.
Here, D_remain(t) represents the remaining task data size that the system needs to complete, and request(t) denotes the services that vehicles have requested. p(t) is the mobile edge computing server's coordinate location, q_i(t) is vehicle i's coordinate location, D_i(t) indicates the task data size, and C_i(t) represents the computation complexity.
(2) Action Space: The action space represents the set of actions available to the centrally controlled agent within the given time slot. The actions performed by the agent include selecting vehicles to request service, making offloading decisions, and determining the offloading proportion for vehicle tasks. Here, O_i(t) = 1/0 indicates whether to offload to a mobile edge computing server or to the cloud server, and the task offloading proportion is represented by θ_i(t) ∈ [0, 1]. (3) Reward: The optimization goal of the vehicular edge computing system is to minimize the system cost, while the reinforcement learning agent is trained to maximize the long-term reward. Consequently, according to Formula (19), the agent's reward function is defined based on the system cost.
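Since the actor emits continuous values, a small post-processing step is needed to recover the discrete flag O_i and the proportion θ_i, and a natural reward consistent with cost minimization is the negative system cost. A hypothetical decoding (names and thresholds are ours, not from the paper):

```python
def decode_action(raw, cost):
    """raw: list of (o_raw, theta_raw) pairs from the actor, one per vehicle.
    O_i is thresholded to {0, 1}, theta_i is clipped to [0, 1], and the
    reward r_t = -cost pushes the agent toward lower system cost."""
    o = [1 if x >= 0.5 else 0 for x, _ in raw]       # MEC (1) vs cloud (0)
    theta = [min(max(t, 0.0), 1.0) for _, t in raw]  # clip proportion
    return o, theta, -cost
```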

TODM_DDPG Architecture
In the VEC system, we utilize the actor–critic architecture-based deep reinforcement learning algorithm DDPG for our task offloading decision-making strategy. As shown in Figure 3, the algorithm iteratively trains the policy network (actor) and the Q network (critic) by interacting with the environment. By taking into account the system's state and action space, this strategy enables the agent to learn optimal offloading decisions. In the TODM_DDPG algorithm, the actor network consists of a target policy network in addition to an online policy network. Based on state s_t, the online policy network produces the action a_t that maximizes the relevant action value Q(s_t, a_t). The actor network aims to learn actions with higher Q values, indicating better performance.
The critic network comprises an online Q network and a target Q network. The online Q network estimates the action value of the state–action pair (s_t, a_t) as Q(s_t, a_t). The goal is to train the critic network to produce more accurate Q values, indicating better estimation.
To ensure training stability, the TODM_DDPG algorithm employs a delayed update technique called soft update. The weights of the target policy network and target Q network are progressively updated by blending them with the weights of the respective online networks.
The TODM_DDPG algorithm's training process can be outlined as follows: (1) Based on the behavior policy β, the actor network first chooses action a_t and sends it to the vehicular edge computing environment for execution.
Here, the behavior policy β guides the action taken in the environment during the training phase. By introducing noise into the action decision mechanism, it balances exploration and exploitation in order to discover potentially superior policies. The online policy network generates the policy µ(s_t | θ^µ), where θ^µ is its parameter, and N_t is Gaussian noise with mean n_0 and variance σ_0². (2) The agent in the vehicular edge computing environment executes action a_t and returns the instantaneous reward r_t and the new environment state s_{t+1}. (3) To train the online networks, the state transition (s_t, a_t, r_t, s_{t+1}) is recorded as training data. (4) N transitions (s_i, a_i, r_i, s_{i+1}) are randomly sampled from the replay memory buffer R as a mini-batch for the online networks. (5) Based on the actions µ′(s_{i+1}) generated by the target policy network and the sampled transitions, the target Q network first calculates the label value y_i.
The gradient ∇_{θ^Q} L is then calculated by the online Q network using back-propagation with the loss function L(θ^Q). (6) The online Q network's parameters θ^Q are updated. (7) First, for the action a = µ(s_i), the online Q network computes the gradient ∇_a Q(s, a | θ^Q)|_{s=s_i, a=µ(s_i)}; then, the online policy network calculates the policy gradient. (8) The online policy network's parameters θ^µ are updated. (9) Using the sliding-average approach, the target policy and Q network parameters θ^{µ′} and θ^{Q′} are softly updated, with τ ∈ [0, 1].
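Steps (5) and (9) can be sketched numerically: y_i is the temporal-difference target built from the target networks, and the soft update blends the online weights into the target weights (a sketch with scalar parameters for clarity; function names are ours):

```python
def td_targets(rewards, next_q_values, gamma):
    """Step (5): label y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    used to train the online Q network."""
    return [r + gamma * q for r, q in zip(rewards, next_q_values)]

def soft_update(target_params, online_params, tau):
    """Step (9): theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * o + (1 - tau) * t
            for t, o in zip(target_params, online_params)]
```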

State Normalization
According to the TODM_DDPG algorithm's training process described above, two different DNN networks are utilized to fit the value function and the policy function of the critic and actor networks, respectively, as illustrated in Figure 3. Taking the actor network as an example, it takes s_t as input and outputs the action a_t = µ(s_t). However, the distribution of the activation input values of a DNN changes during the training process, gradually approaching the upper and lower boundaries of the range of the nonlinear activation function [43]. This leads to vanishing gradients in the lower layers during back-propagation, resulting in slow convergence. To better train the neural network, in addition to using the ReLU activation function, Algorithm 1, based on batch normalization [44,45], is proposed to normalize input states s_t with different ranges.

Algorithm 1: State Normalization
Input: State parameters: D_remain(t), p(t), q_i(t), D_i(t), C_i(t); scale factors: λ_1, λ_2; min-max values: Min_D, Max_D, Min_C, Max_C. This paper normalizes the state parameters with different ranges separately. λ_1 and λ_2 are scaling factors used to normalize the remaining total data size D_remain(t) and the coordinate information, respectively. The task data size and complexity are normalized using the min-max normalization method, where Min_D, Max_D, Min_C, and Max_C denote the lower and upper bounds of the task data size and complexity for each time slot.
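Algorithm 1 can be sketched as follows, with scale factors for the unbounded quantities and min-max normalization for the bounded ones (a sketch; the signature is ours):

```python
def normalize_state(d_remain, coords, D, C,
                    lam1, lam2, d_min, d_max, c_min, c_max):
    """Algorithm 1 sketch: scale factors lam1/lam2 for the remaining data
    size and coordinates, min-max normalization for task size/complexity."""
    return (d_remain * lam1,
            [c * lam2 for c in coords],
            (D - d_min) / (d_max - d_min),
            (C - c_min) / (c_max - c_min))
```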

TODM_DDPG Training and Validating Algorithm
We propose the training algorithm, which is depicted in Algorithm 2, based on the aforementioned TODM_DDPG algorithm and state normalization method. By iteratively adjusting the online policy network parameters θ^µ and online Q network parameters θ^Q through interactions among the environment, actor network, and critic network during the training phase, the algorithm seeks to maximize the long-term reward and returns the optimal online policy network parameter θ^µ after training. Algorithm 3 describes the validation algorithm for the computation offloading strategy, which utilizes the trained optimal parameters θ^µ. It performs the offloading decision-making process and obtains the system cost based on the output policy of the training algorithm.

Algorithm 2: TODM_DDPG training algorithm
Input: Episodes L, time slots T; actor learning rate lr a , critic learning rate lr c ; experience replay buffer R, mini-batch size N; soft update factor τ, discount factor γ; Gaussian behavior noise N t
Output: Online policy network parameters θ µ
1 Initialize: randomly initialize θ µ , θ Q ; θ µ′ ← θ µ , θ Q′ ← θ Q ; empty buffer R;
2 for each episode l ← 1 to L do
3     Reset the parameters of the VEC environment;
4     for each time slot t ← 1 to T do
5         Normalize s t to ŝ t by Algorithm 1;
6         Obtain action a t based on behavior noise N t (33) and current state ŝ t ;
7         Perform a t and calculate reward r t by (32); obtain next state s t+1 and normalize it to ŝ t+1 ;
8         if buffer R is not full then
9             Put transition (ŝ t , a t , r t , ŝ t+1 ) into R;
10        else
11            Select a transition in R at random and replace it with (ŝ t , a t , r t , ŝ t+1 );
12        Randomly sample N transitions from buffer R to form a mini-batch;
13        Train the online Q network by (34) and (35);
14        Calculate ∇ a Q(s, a | θ Q )| s=s t , a=µ(s t ) in the online Q network;
15        Train the online policy network by (36);
16        Soft update the target policy network and target Q network by (37);

Algorithm 3: TODM_DDPG validation algorithm
Input: Trained online policy network parameters θ µ ; current states D remain (t), p(t), q i (t), D i (t), C i (t)
Output: Minimum system cost
1 while running do
2     Normalize current state s t to ŝ t by Algorithm 1;
3     Obtain action a t according to a t = µ(ŝ t | θ µ );
4     Perform a t and obtain reward r t by (32);
5     Add r t to the system cost;
6 end
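Two building blocks of Algorithm 2 — the experience replay buffer handling of steps 8–12 and the soft target update of step 16 — can be sketched as follows. The class and function names are our own; only the replacement and sampling behavior follows the pseudocode:

```python
import random
import numpy as np

class ReplayBuffer:
    """Experience replay buffer R of Algorithm 2.

    While R is not full, transitions are appended (step 9); once full,
    a randomly chosen stored transition is replaced (step 11).
    """
    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.buf = []
        self.rng = random.Random(seed)

    def put(self, s, a, r, s_next):
        transition = (s, a, r, s_next)
        if len(self.buf) < self.capacity:
            self.buf.append(transition)
        else:
            self.buf[self.rng.randrange(self.capacity)] = transition

    def sample(self, n):
        """Draw a mini-batch of n transitions uniformly at random (step 12)."""
        batch = self.rng.sample(self.buf, n)
        s, a, r, s2 = zip(*batch)
        return np.array(s), np.array(a), np.array(r), np.array(s2)

def soft_update(target_weights, online_weights, tau):
    """Step 16: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_weights, target_weights)]
```

With a small soft update factor τ (e.g., 0.001), the target networks track the online networks slowly, which stabilizes the moving target used in the critic's temporal-difference update.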

Parameter Setting
We evaluate the algorithm in a Python development environment based on the TensorFlow platform in order to assess the efficiency and operational performance of the algorithm model in this research. In the vehicular edge computing test system, the simulation scenario is a three-tier cloud-edge-vehicle structure consisting of one cloud server, one edge server, and several vehicle users. A two-dimensional square area covered by the base station signals is considered; within this area, the mobile edge computing servers are distributed at the road side, and their signal coverage is a circular area. Within the base station's coverage area, vehicles move randomly along the road. We use TensorFlow to initialize an instance of the deep reinforcement learning agent according to the settings of the Markov decision elements in Section 4.2. Table 2 displays the environmental parameters as well as the neural network's parameters. Since the selection of hyperparameters impacts the effectiveness, convergence, and stability of the algorithm, hyperparameter setting deserves careful consideration. To obtain suitable values for the hyperparameters used in the TODM_DDPG algorithm, we conducted a series of experiments. Figure 4 demonstrates the impact of different learning rates on the convergence and stability of the algorithm. Since the actor network is updated by the critic network, we assumed that the learning rate of the critic network is greater than that of the actor network. Two combinations of learning rates were compared: lr a = 4 × 10 −7 with lr c = 2 × 10 −5 , and lr a = 4 × 10 −6 with lr c = 2 × 10 −4 . The algorithm converges in both cases, but the solution obtained with lr a = 4 × 10 −7 and lr c = 2 × 10 −5 is superior: the larger pair converges to a local optimum instead of the global optimum.
This is because larger learning rates for both the critic and actor networks produce larger update steps, causing the algorithm to settle into a local optimum. Conversely, using smaller learning rates such as lr a = 4 × 10 −8 and lr c = 2 × 10 −9 maintains stability but leads to poor optimization performance: the lower learning rate slows down the network's updates, requiring more iterations to converge, so the algorithm fails to reach the optimal solution. Based on these experiments, the selected learning rates were lr a = 4 × 10 −7 and lr c = 2 × 10 −5 . Figure 5 compares the algorithm's performance for experience replay buffers of different sizes, denoted by R. When R = 5000, the algorithm reaches a local optimum around episode 70; the small experience replay buffer cannot provide sufficient data feature information, so the optimal strategy cannot be learned. When R = 50,000, the algorithm converges to the global optimum at around episode 520. When R = 500,000, the algorithm fails to converge because the excessively large experience replay buffer results in a longer training time and insufficient data updates. Therefore, the size of the experience replay buffer chosen in this paper is 50,000. Figure 6 presents the comparison of the algorithm's performance for various discount factors γ. The algorithm achieves the fastest convergence and the best performance when γ = 0.01. This is attributed to the dynamic nature of the vehicular edge computing system, where environmental conditions change over time: accumulating returns over the entire duration does not accurately represent long-term behavior, as data vary significantly across time periods and comprehensive data features cannot be captured. Therefore, this paper adopts γ = 0.01 as the discount factor.
Figure 7 illustrates the comparison of the algorithm's performance for different exploration parameters σ 2 0 , which represent the exploration noise in the TODM_DDPG algorithm. Since the policy network outputs deterministic actions, exploration in TODM_DDPG relies on adding noise to the action space; here, the noise is characterized by the variance parameter σ 2 0 . When σ 2 0 is set to 0.01, the algorithm eventually converges to the best result. Notably, as σ 2 0 increases, the algorithm explores a larger action space, leading to faster convergence but a potentially larger noise distribution space. As shown in the figure, when σ 2 0 is set to 0.2, the algorithm fluctuates around a system cost of approximately 820. When σ 2 0 is set to 0.001, the exploration parameter is too small, producing a narrow range of generated actions, which can lead to convergence to local optima or an inability to converge. Therefore, the exploration parameter chosen in this paper is σ 2 0 = 0.01. Figure 8 presents the comparison of the algorithm's performance with respect to the influence of state normalization and behavior noise during training. As mentioned in Section 4.3, the DDPG training process introduces noise into the action decision-making mechanism so as to balance exploration and exploitation and discover potentially superior policies. In the figure, the algorithm's convergence speed slows down when behavior noise is not used during policy training. Furthermore, state normalization has a more significant effect than behavior noise. As mentioned before, TODM_DDPG utilizes two different deep neural networks to fit the value and policy functions of the critic and actor networks, respectively. Taking the actor network as an example, we can observe that it takes the state as input and outputs policy actions.
However, the distribution of the activation input values of the deep neural network changes during the training process, gradually approaching the upper and lower bounds of the nonlinear function's value range. This causes gradients in the lower layers to vanish during back-propagation, leading to slow convergence. Therefore, to address this characteristic of deep neural networks, we propose Algorithm 1, which normalizes the state values using scaling factors and the min-max method. When the algorithm is trained without state normalization, large state parameters cause slow convergence of the deep neural network, rendering the training process ineffective and making the algorithm behave approximately greedily.
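The Gaussian behavior noise used for exploration amounts to perturbing the deterministic actor output with zero-mean noise of variance σ 2 0 and clipping the result to the feasible range. A minimal sketch, assuming the action (the offloading proportion) lies in [0, 1]; the function name is illustrative:

```python
import numpy as np

def noisy_action(actor_output, sigma2, low=0.0, high=1.0, rng=None):
    """Add zero-mean Gaussian exploration noise with variance sigma2
    to the deterministic policy output a_t = mu(s_t), then clip the
    result to the feasible action range."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, np.sqrt(sigma2), size=np.shape(actor_output))
    return np.clip(np.asarray(actor_output) + noise, low, high)
```

Setting σ 2 0 = 0.2 spreads actions widely (more exploration, more fluctuation), while σ 2 0 = 0.001 barely perturbs the policy output, matching the behavior observed in Figure 7.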

Performance Comparison
The performance comparison of different algorithms is shown in Figure 9. We compare five algorithms in this experiment: the AC algorithm, the TODM_DDPG algorithm, the DQN algorithm, the local offloading algorithm, and the cloud offloading algorithm. For the three reinforcement learning algorithms, the number of training iterations is set to 1000. It is evident from the figure that the three reinforcement learning algorithms achieve significantly lower system costs than the local offloading and cloud offloading algorithms. However, the AC algorithm fails to converge during the training process due to the interaction between the actor and the critic: the actor's action relies on the critic's value, and updating both networks simultaneously when the critic's value is difficult to converge may lead to instability. On the other hand, both the TODM_DDPG and DQN algorithms, with their dual network structures, effectively overcome this issue by breaking the correlation among training data and achieving convergence. However, the DQN algorithm fails to converge to the minimum system cost, mainly because it discretizes the continuous action space and therefore cannot find the optimal offloading strategy accurately. In comparison, the TODM_DDPG algorithm performs best: its ability to explore continuous action spaces enables it to find optimal solutions more effectively. Therefore, the TODM_DDPG algorithm demonstrates superior performance in terms of minimizing the system cost. Figure 10 illustrates the performance comparison of the algorithms under different task size ranges. The system costs of all algorithms increase as the task size increases, showing a positive correlation.
Within a given task data size range, the TODM_DDPG algorithm's system cost is significantly lower than those of the local and cloud offloading algorithms, as these offloading algorithms fail to fully utilize system resources. Furthermore, from the figure, we can observe that the TODM_DDPG algorithm's advantage over the other methods in terms of system cost becomes more obvious as the task data size increases. Overall, the TODM_DDPG algorithm consistently converges to the minimum system cost. Figure 11a shows the convergence performance comparison between the TODM_DDPG and DQN approaches under different vehicle computing capacities. The AC algorithm is not compared because of its lack of convergence. It can be observed that when the vehicle's computing capacity is relatively low, i.e., F ve = 9 × 10 9 , both optimization approaches result in higher system costs than when the vehicle's computing capacity is F ve = 12 × 10 9 or F ve = 15 × 10 9 . On the other hand, as shown in Figure 11b, when the vehicle's computing capacity is larger, the offloading proportion in the system is smaller; therefore, as the vehicle's computing capacity increases, the vehicles tend to execute tasks locally. Smaller computing capacities result in slower local data processing at a given time, leading to a larger delay gap between local execution and offloading, i.e., higher offloading proportions.
Additionally, since DQN algorithms are suited to discrete action spaces, the offloading proportion levels of DQN are set to {0, 0.05, 0.1, . . . , 1.0} in this paper. Based on Figure 11a, we can see that the TODM_DDPG scheme achieves a lower system cost than the DQN scheme under different vehicle computing capacities. Also, in Figure 11b, we can see that the DQN algorithm can only output a limited discrete action set for the offloading proportion, while TODM_DDPG is capable of continuous action output. This is because TODM_DDPG outputs the action value by adding a policy network on top of the DQN structure instead of directly selecting the action with the maximum Q value as DQN does, thereby extending control to a continuous action space. Hence, for the same vehicle computing capacity, the TODM_DDPG algorithm exhibits lower system costs than the DQN algorithm, as shown in Figure 11a. Figure 12a represents the performance comparison of different algorithms under different energy consumption weights. The figure shows that the TODM_DDPG algorithm always achieves the lowest system cost, regardless of the weight coefficients. Additionally, we can see that as the energy weight coefficient increases, the system cost of the cloud offloading algorithm decreases significantly, and the system costs of the DQN and TODM_DDPG algorithms gradually approach that of the cloud offloading algorithm. The reason is that when the energy weight is larger, the advantage of offloading tasks to the cloud in terms of system cost becomes more significant compared with local execution; therefore, the DQN and TODM_DDPG algorithms tend to offload tasks to the cloud or the edge for execution. As for the AC algorithm, due to its lack of convergence, the optimization effect is not significant as the energy consumption weight increases. Figure 12b demonstrates the performance comparison of different algorithms under different numbers of vehicles.
In this case, we assume that the MEC processing capacity comprises I = 7 MEC cores, and that under different numbers of vehicles, the total task size to be completed over the whole time period is the same. From the figure, when the number of vehicles is less than 25, the average system cost of all schemes is almost constant; when the number of vehicles exceeds 25, the system costs of the three reinforcement learning schemes begin to increase. This is because, as the number of vehicles increases, the amount of offloaded tasks also increases, and since the MEC cores are limited, surplus tasks are offloaded to the cloud server for processing, which increases the system cost. In addition, it can be concluded from the figure that the system cost of the proposed TODM_DDPG algorithm is lower than those of the other four algorithms, because the TODM_DDPG algorithm can find the optimal value in the continuous action space and obtain the optimal offloading proportion.
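The contrast in Figure 11b between DQN's fixed discrete offloading levels and TODM_DDPG's continuous output can be illustrated as follows; the function names are illustrative, not part of either algorithm's specification:

```python
import numpy as np

# DQN must pick from the fixed discrete levels {0, 0.05, 0.1, ..., 1.0}.
DQN_LEVELS = np.round(np.arange(0.0, 1.0 + 1e-9, 0.05), 2)

def dqn_offload_proportion(q_values):
    """DQN selects the level with the largest Q value (21 discrete choices)."""
    return float(DQN_LEVELS[int(np.argmax(q_values))])

def ddpg_offload_proportion(actor_output):
    """The DDPG policy network emits the proportion directly as a
    continuous value; it is only clipped to the feasible range."""
    return float(np.clip(actor_output, 0.0, 1.0))
```

A true optimum such as 0.537 is reachable by the continuous policy but rounded away by the 0.05-step discrete grid, which is why DQN cannot converge to the minimum system cost.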

Conclusions
In this paper, the computation offloading problem in the vehicular edge computing environment is studied, and a vehicular edge computing system based on a three-tier cloud-edge-vehicle architecture is proposed. Based on the deep deterministic policy gradient, this paper proposes a task offloading decision-making algorithm, namely TODM_DDPG. We provide a comprehensive description of the algorithm's training and testing processes, effectively addressing the challenge of a high-dimensional continuous action space. Furthermore, we introduce a state normalization method to enhance the algorithm's convergence and stability. Subsequently, we conduct experiments to examine the effects of several fundamental hyperparameters on the algorithm's performance, and the results are compared with those obtained from baseline algorithms. The test results prove the effectiveness of our proposed approach in reducing system costs. The algorithm evaluation considers the joint optimization of delay and energy and demonstrates the effectiveness of the strategy under different task sizes and vehicle computing capacities. The TODM_DDPG algorithm outperforms the other four schemes, obtaining a lower system cost and a more accurate task offloading proportion, which shows its effectiveness in optimizing task offloading for vehicular edge computing. Moreover, for a scalable, long-term vehicular edge computing system, the proposed scheme shows high reliability under different energy consumption and delay weights and different numbers of vehicles.
In future research, we plan to explore additional aspects, such as task types with dependencies, offloading request forwarding among mobile edge computing servers, and channel resource allocation, aiming to simulate real-world scenarios more accurately. For dependency-aware task offloading, the problem is decomposed into two sub-problems: offloading decision and resource allocation. First, the offloading positions of subtasks are determined according to the dependencies between subtasks; second, the resource allocation optimization process is formulated as a Markov decision process. On this basis, our goal is to design a hybrid computation offloading strategy that jointly handles resource allocation and task allocation through a collaborative mechanism, so that subtasks can be executed in parallel and the optimal fine-grained offloading strategy can be obtained.