A Power Allocation Scheme for MIMO-NOMA and D2D Vehicular Edge Computing Based on Decentralized DRL

In vehicular edge computing (VEC), some tasks can be processed either locally or on the mobile edge computing (MEC) server at a base station (BS) or a nearby vehicle. Whether a task is offloaded depends on the status of vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) communication. In this paper, device-to-device (D2D)-based V2V communication and multiple-input multiple-output and nonorthogonal multiple access (MIMO-NOMA)-based V2I communication are considered. In actual communication scenarios, the channel conditions for MIMO-NOMA-based V2I communication are uncertain, and the task arrival is random, leading to a highly complex environment for VEC systems. To solve this problem, we propose a power allocation scheme based on decentralized deep reinforcement learning (DRL). Since the action space is continuous, we employ the deep deterministic policy gradient (DDPG) algorithm to obtain the optimal policy. Extensive experiments demonstrate that our proposed approach with DRL and DDPG outperforms existing greedy strategies in terms of power consumption and reward.


Introduction
With the development of the Internet of vehicles (IoV), autonomous vehicles are becoming increasingly popular. At the same time, a series of smart vehicular user (SVU) devices and applications installed on autonomous vehicles have emerged. Communication among these SVU devices and applications is extremely popular [1][2][3][4][5]. Therefore, to reduce the burden of SVUs, vehicular edge computing (VEC) has been introduced to process tasks offloaded by SVUs, while ensuring low processing latency for these tasks [6][7][8]. When SVUs have tasks to process, they can choose to process these tasks locally or offload them to a mobile edge computing (MEC) server on a VEC vehicle or at a base station (BS) [9,10]. Such task offloading incurs multiple kinds of power consumption. To be clear, in this paper, the power consumed for offloading to the BS is defined as vehicle-to-infrastructure (V2I) processing power, and the power consumed for offloading to a VEC vehicle is defined as vehicle-to-vehicle (V2V) processing power. In addition, SVUs also process tasks on their local CPUs; the task processing power consumed by the local CPU of an SVU is defined as local processing power in this paper.
In this paper, V2I communication is assumed to be based on multiple-input multiple-output and nonorthogonal multiple access (MIMO-NOMA) technology, due to its high spectrum utilization and channel capacity. However, in the MIMO-NOMA system, the mutual interference among SVUs and the mobility of SVUs lead to uncertain channel conditions [11].
V2V communication is assumed to be based on device-to-device (D2D) technology, which is a core technology for smart cities, high-quality video streaming, and disaster-relief networks. It offers various advantages in terms of fairness, energy efficiency, and spectral efficiency [12][13][14][15]. In the D2D communication scenario considered in this paper, each SVU communicates only with a specific VEC vehicle, and the communication distance remains constant. Therefore, V2V communication operates under interference-free channel conditions [16].
In a real scenario, task arrival will be random. In this case, the latency and power consumption of the SVUs for processing the tasks will also be uncertain [11]. For example, when the V2I channel conditions are relatively worse, SVUs should choose to process their tasks either locally or through V2V communication to reduce latency and power consumption. Considering the randomness of task arrival and the uncertainty of the V2I channel conditions, it is necessary to design an optimal power-allocation scheme with the aim of minimizing latency and power consumption.
This paper employs the deep reinforcement learning (DRL) framework to design such a scheme. Since the deep deterministic policy gradient (DDPG) algorithm is suitable for solving problems with a continuous action space, we exploit this characteristic of the algorithm to solve the problem. In most previous work, the BS was responsible for collecting global information, including the SVU state, and then determining the action of each SVU; however, such a centralized approach incurs higher power consumption and larger latency [17][18][19][20][21][22][23][24][25][26][27][28][29][30]. Only a few works have adopted decentralized DRL frameworks, in which each SVU observes its own surrounding environment to determine its action; this can effectively reduce the overall latency and power consumption [31,32]. However, to the best of our knowledge, the coexistence of MIMO-NOMA-based V2I communication and D2D-based V2V communication has not been considered in VEC systems based on decentralized DRL.
In this paper, we propose a power-allocation model in VEC based on decentralized DRL to improve power consumption and latency, considering the coexistence of MIMO-NOMA-based V2I communication and D2D-based V2V communication, as well as the randomness of task arrival, the channel interference in MIMO-NOMA, and the mobility of SVUs. (The source code has been released at https://github.com/qiongwu86/MIMO-D2D, accessed on 19 February 2023.) The main contributions of this article are summarized as follows.
(1) We propose a power allocation model in VEC based on the decentralized DRL, defining the action function, state function, and reward function. The DDPG algorithm is employed to deal with the continuous action space problem and to guide the model to learn the optimal policy. (2) Performance testing of the trained model in a large number of experiments shows that the proposed approach outperforms other existing ones.
The rest of this paper is organized as follows. Section 2 reviews some related work. Section 3 describes the system model. Section 4 presents the design of the DRL framework and the related functions. Section 5 describes the DDPG algorithm used for model training. We present some simulation results to demonstrate that our proposed approach outperforms other existing ones in Section 6. Finally, Section 7 concludes the paper.

Related Work
In this section, we review some work on D2D and MIMO-NOMA technology in MEC and VEC based on DRL. Several works have studied MEC systems with the goal of optimizing the system cost, power consumption, and latency. In [11], Zhu et al. constructed a VEC system based on MIMO-NOMA technology, in which vehicles can choose to process tasks either locally or offload them to the BS via the MIMO-NOMA channels. In this system, the DDPG algorithm was employed to optimize latency and power consumption. In [34], Liu et al. designed a millimeter-wave D2D MEC model as the basis of an optimal algorithm for task assignment. In [35], Li et al. proposed a MEC system supporting D2D and formulated a two-stage optimization algorithm with the goal of improving resource utilization and network capacity. However, these works did not consider the coexistence of MIMO-NOMA and D2D technology in MEC and VEC.

V2V and V2I Communication in VEC
Some research works have also employed V2V and V2I communication in VEC systems. In [11], Zhu et al. considered V2I communication based on MIMO-NOMA in a VEC system. In [1], Raza et al. proposed a 5G-based VEC system in which the V2I communication was based on the millimeter-wave mode. In [36], Zhang et al. constructed a software-defined networking (SDN)-assisted VEC system with the goal of optimizing the system overhead while considering 802.11p-based V2I communication and V2V communication. In [37], Bai et al. designed a VEC system containing vehicular cloudlets and considered D2D-based V2V communication. However, these works did not consider the coexistence of MIMO-NOMA-based V2I communication and D2D-based V2V communication in VEC systems.

DRL-Based Resource Allocation in VEC
There have been many studies on DRL-based resource allocation in VEC. In [38], Ning et al. proposed a VEC system consisting of multiple roadside units (RSUs), SVUs, and a single BS. With the aim of optimizing the system cost, the DDPG algorithm was employed to obtain the optimal resource-allocation scheme. In [39], Ren et al. proposed a VEC system consisting of multiple RSUs, BSs, and SVUs. A centralized resource-allocation mechanism based on DRL was designed to obtain an optimal network resource-allocation scheme. In [23], Liu et al. designed a semi-Markov process-based scheme for maximizing VEC network utility and employed the deep Q-learning network (DQN) algorithm to achieve optimal resource allocation. However, none of the above works considered decentralized DRL-based VEC systems.
Some works have also considered decentralized DRL-based VEC systems. In [31], Ye et al. constructed a VEC system with both V2I communication and V2V communication, in which SVUs used the DQN algorithm to select the transmission band for their tasks, thus optimizing the system capacity and latency. In [11], Zhu et al. designed a VEC system consisting of a BS and SVUs, considering the mobility of SVUs and MIMO-NOMA technology. The SVUs employed the DDPG algorithm to allocate processing power, thus optimizing the latency and power consumption. However, [11] did not consider the presence of V2V communication, while [31] did not consider the mobility of SVUs, MIMO-NOMA, and D2D technology.
As seen from the above review, no previous work has considered the coexistence of MIMO-NOMA-based V2I communication and D2D-based V2V communication in the power-allocation problem for decentralized DRL-based VEC systems. This motivates the present work, which we describe in detail in the following sections.

System Model
The system model is shown in Figure 1. A MEC server is placed on each VEC vehicle and at the BS, which is equipped with multiple antennae. Based on the axial distance to the BS, the coverage area is divided into J lanes, and SVUs may move at different speeds in different lanes. The time duration for which SVUs in lane j remain within the communication range of the BS is divided into N_{j,m} slots, and the length of each slot is τ_0. In each slot, tasks arrive randomly in the buffers of the SVUs. At the same time, each SVU allocates local processing power, V2I processing power, and V2V processing power to process tasks either locally or offload them to a MEC server. In addition, the V2I channel conditions continuously change due to the mobility of SVUs and the channel interference in MIMO-NOMA. In contrast, V2V communication is based on D2D technology, and it is assumed that each SVU communicates only with a specific VEC vehicle in the same lane and that each VEC vehicle processes tasks for only one specific SVU; thus, the V2V channel conditions can be treated as time invariant. In V2I communication, each SVU first transmits tasks to the BS; the BS processes the tasks, employs the zero-forcing (ZF) technique to detect the signal and noise associated with each SVU from all SVU signals, and then obtains the signal-to-interference-plus-noise ratio (SINR) of each SVU for V2I communication. In the next slot, the BS transmits the SINR of each SVU to that SVU. In contrast to the traditional centralized DRL scheme, in this work each SVU determines its own power allocation according to its own observations of the environment, i.e., in a decentralized mode. In the following, we introduce the system's mobility model, task-computation model, and communication model and describe the relevant environmental information, such as the buffer capacity of SVU k, the SINRs of V2I communication and V2V communication, and the position of SVU k.
The notations used in this article can be seen in Table 1.

Mobility Model
Let P_{k,j}(n) denote the position of SVU k in lane j in slot n. We establish a three-dimensional right-angle coordinate system as shown in Figure 1, where the BS is the origin, the x-axis direction is the travel direction of SVU k, the y-axis represents the vertical distance to the BS, and the z-axis represents the height difference to the BS. Let d_{k,j}(n) and w_{k,j} be the distances between SVU k in slot n and the BS along the x-axis and y-axis, respectively. Therefore, P_{k,j}(n) can be expressed as (d_{k,j}(n), w_{k,j}, 0), where w_{k,j} is equal to the vertical distance of lane j from the BS, which can be calculated as

w_{k,j} = w_1 + (j − 1) w_d, (1)

where w_d is the lane width and w_1 is the distance between the BS and lane 1.
Similar to [40], for simplification, we employ a discrete approximation model and assume that the position of SVU k is constant within each individual slot, because the time duration τ_0 of each slot is small. The position of SVU k along the x-axis is then updated slot by slot as

d_{k,j}(n) = d_{k,j}(n − 1) + v_j τ_0, (2)

where v_j is the speed of the vehicles in lane j, d_{k,j}(n) ∈ [−D/2, D/2], and d_{k,j}(1) = −D/2. SVU k can determine whether it is within the coverage area of the BS based on its own current position d_{k,j}(n), which reflects the mobility of SVU k.
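The discrete mobility model above can be sketched as a short Python fragment; the slot length, coverage diameter, and per-lane speed below are illustrative assumptions, not values from the paper.

```python
# Sketch of the discrete mobility model: the SVU advances along the x-axis
# by a constant per-lane speed each slot and leaves coverage at +D/2.
TAU_0 = 0.1      # slot length tau_0 in seconds (assumed)
D = 500.0        # BS coverage diameter along the x-axis in meters (assumed)
V_J = 20.0       # speed v_j of vehicles in lane j in m/s (assumed)

def update_position(d_prev: float) -> float:
    """Advance the x-axis distance of the SVU by one slot."""
    return d_prev + V_J * TAU_0

def in_coverage(d: float) -> bool:
    """The SVU is covered while d lies within [-D/2, D/2]."""
    return -D / 2 <= d <= D / 2

d = -D / 2                      # d_{k,j}(1) = -D/2
for _ in range(100):
    d = update_position(d)
print(d, in_coverage(d))        # -50.0 True
```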

V2I Communication
The channel matrix between the BS and the SVUs in slot n can be expressed as G(n) = [g_{1,I}(n), ..., g_{k,I}(n), ..., g_{M,I}(n)] ∈ C^{N_a×M}, where N_a is the number of BS antennae and g_{k,I}(n) ∈ C^{N_a×1} is the channel vector between SVU k and the BS. In the MIMO-NOMA channels, the signals received by the BS in slot n from all SVUs can be expressed as

y(n) = Σ_{k=1}^{M} g_{k,I}(n) √(p_{k,I}(n)) s_k(n) + N_s(n), (3)

where N_s(n) is Gaussian white noise and s_k(n) is a complex data symbol with unit variance. Here, p_{k,I}(n) ∈ [0, P_{max,I}] is the V2I processing power of SVU k in slot n, with P_{max,I} being the maximum V2I processing power of SVU k. In addition, g_{k,I}(n) reflects the path loss of SVU k for V2I communication, which can be expressed as [41]

g_{k,I}(n) = √(g^p_k(n)) g^s_k(n), (4)

where g^s_k(n) is the small-scale fading channel gain and g^p_k(n) is the large-scale fading coefficient. g^p_k(n) reflects the mobility of SVU k and is calculated as

g^p_k(n) = g_r ‖P_B − P_{k,j}(n)‖^{−η}, (5)

where g_r is the channel gain at 1 m and η is the path loss exponent. Meanwhile, P_{k,j}(n) = (d_{k,j}(n), w_{k,j}, 0) is the position of SVU k in slot n, and P_B = (0, 0, H_a), where H_a is the height of the BS antenna. Note that P_{k,j}(n) can be calculated from Equations (1) and (2). The small-scale fading channel gain g^s_k(0) is randomly initialized, and the relationship between g^s_k(n) and g^s_k(n − 1) can be expressed as [42]

g^s_k(n) = ρ_k g^s_k(n − 1) + √(1 − ρ_k²) e(n), (6)

where e(n) is an error vector and ρ_k, obtained as described in [43], is the normalized channel correlation coefficient. ρ_k is correlated with θ, the angle between the movement direction of SVU k, i.e., the x-axis, and the direction of communication, i.e., P_B − P_{k,j}(n). θ is calculated as

θ = arccos( ((P_B − P_{k,j}(n)) · x_0) / ‖P_B − P_{k,j}(n)‖ ), (7)

where x_0 = (1, 0, 0). By using Equations (5)-(7), SVU k can obtain the channel vector g_{k,I}(n). Subsequently, the BS employs the ZF technique [42] to obtain the SINR of SVU k for V2I communication, denoted by γ_{k,I}(n), which is calculated as

γ_{k,I}(n) = p_{k,I}(n) / (σ²_R ‖g^G_k(n)‖²), (8)

where p_{k,I}(n) is the V2I processing power of SVU k in slot n, σ²_R is the noise power, and g^G_k(n) is the kth row of the pseudoinverse of G(n), which is given by

G^†(n) = (G^H(n) G(n))^{−1} G^H(n). (9)

Based on Equations (3)-(9), the BS can obtain γ_{k,I}(n) and transmit it to SVU k in the next slot. Thus, SVU k is able to observe γ_{k,I}(n − 1) in the local environment in slot n, which reflects the uncertain channel conditions of SVU k for V2I communication caused by the mobility of SVU k.
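The ZF-based SINR computation described above can be sketched with NumPy; the antenna count, number of SVUs, noise power, and the random channel draw are all illustrative assumptions.

```python
import numpy as np

# Sketch of the ZF-based V2I SINR: gamma_k = p_k / (sigma^2 * ||g_G_k||^2),
# where g_G_k is the k-th row of the pseudoinverse of the channel matrix G.
rng = np.random.default_rng(0)
N_A, M = 8, 4                  # BS antennae and number of SVUs (assumed)
SIGMA2 = 1e-3                  # receiver noise power sigma_R^2 (assumed)

# Random complex channel matrix standing in for G(n) (assumed distribution).
G = (rng.standard_normal((N_A, M)) + 1j * rng.standard_normal((N_A, M))) / np.sqrt(2)
G_pinv = np.linalg.pinv(G)     # (G^H G)^{-1} G^H, shape M x N_A

def zf_sinr(k: int, p_k: float) -> float:
    """SINR of SVU k under zero-forcing detection with V2I power p_k."""
    g_G_k = G_pinv[k]                       # k-th row of the pseudoinverse
    return p_k / (SIGMA2 * np.linalg.norm(g_G_k) ** 2)

print(zf_sinr(0, 0.5) > 0)     # True: SINR is positive for positive power
```

Note that `G_pinv @ G` is (numerically) the identity, which is exactly the ZF property that removes inter-SVU interference.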

V2V Communication
Similar to [16], V2V communication is based on D2D technology. Since the channel conditions between SVU k and its corresponding VEC vehicle are time invariant, the channel gain between them can be expressed as

h_k = β_0 f_k R_k^{−α_h}, (10)

where f_k is an exponentially distributed random variable with unit mean, and α_h and β_0 are the path loss exponent and the channel power gain at the reference distance, respectively, for V2V communication. R_k is the distance between the communicating vehicles. Since each SVU communicates only with a specific VEC vehicle in the same lane, R_k is a constant. Therefore, the SINR of SVU k for V2V communication, denoted by γ_{k,V}(n), is calculated as

γ_{k,V}(n) = p_{k,V}(n) h_k / σ², (11)

where p_{k,V}(n) ∈ [0, P_{max,V}] is the V2V processing power of SVU k in slot n and σ² is the noise power.
By using Equations (10) and (11), SVU k can obtain the SINR for V2V communication in slot n. Note that γ_{k,V}(n) is related only to p_{k,V}(n), which reflects the fact that the channel conditions of SVU k for V2V communication are time invariant.
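The time-invariant V2V link above can be sketched numerically; all constants (path loss exponent, reference gain, distance, noise power) are illustrative assumptions.

```python
import numpy as np

# Sketch of the time-invariant V2V channel: h_k = beta_0 * f_k * R_k^(-alpha_h)
# with exponential unit-mean fading, and the interference-free SINR
# gamma_kV = p_kV * h_k / sigma^2.
rng = np.random.default_rng(1)
BETA_0 = 1e-2    # channel power gain at the reference distance (assumed)
ALPHA_H = 2.0    # V2V path loss exponent (assumed)
R_K = 25.0       # fixed distance between the communicating vehicles (assumed)
SIGMA2 = 1e-9    # noise power (assumed)

f_k = rng.exponential(1.0)               # exponential fading, unit mean
h_k = BETA_0 * f_k * R_K ** (-ALPHA_H)   # drawn once: the link is time invariant

def v2v_sinr(p_kv: float) -> float:
    """Interference-free V2V SINR; depends only on the allocated power."""
    return p_kv * h_k / SIGMA2

print(v2v_sinr(0.2) > 0)   # True
```

Because h_k is drawn once and then held fixed, the SINR in every slot is a deterministic, linear function of the allocated V2V power, mirroring the time-invariance argument in the text.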

Task-Computation Model
The buffer capacity B_k(n) of SVU k in slot n is related to B_k(n − 1) as

B_k(n) = [B_k(n − 1) − (d_{k,L}(n − 1) + d_{k,I}(n − 1) + d_{k,V}(n − 1))]^+ + a_k(n − 1), (12)

where [·]^+ = max(0, ·) and a_k(n − 1) is the number of tasks arriving in slot n − 1. d_{k,L}(n − 1), d_{k,I}(n − 1), and d_{k,V}(n − 1) are the numbers of tasks processed in slot n − 1 through local processing, V2I processing, and V2V processing, respectively. The calculation of d_{k,L}(n − 1), d_{k,I}(n − 1), and d_{k,V}(n − 1) is described below.

Local Processing
Let L be the computational intensity of the tasks, i.e., the number of cycles required for the CPU to process one bit, and let f_k(n − 1) be the CPU processing frequency of SVU k in slot n − 1. Then, d_{k,L}(n − 1) is calculated as

d_{k,L}(n − 1) = τ_0 f_k(n − 1) / L, (13)

where f_k(n − 1) is obtained from the local processing power as

f_k(n − 1) = (p_{k,L}(n − 1) / κ)^{1/3}, (14)

where p_{k,L}(n − 1) is the local processing power of SVU k in slot n − 1 and κ is a constant that reflects the effective converted capacitance.
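The local-computation relation can be sketched in a few lines, assuming the common cubic power-frequency model p = κf³; the numeric values of κ, L, and τ_0 below are illustrative assumptions.

```python
# Sketch of local processing: CPU frequency f_k = (p_kL / kappa)^(1/3),
# bits processed per slot d_kL = tau_0 * f_k / L.
KAPPA = 1e-27    # effective converted capacitance kappa (assumed)
L_CYC = 500.0    # CPU cycles needed per bit (assumed)
TAU_0 = 0.1      # slot length in seconds (assumed)

def local_bits(p_kl: float) -> float:
    """Bits processed locally in one slot given local power p_kl (watts)."""
    f_k = (p_kl / KAPPA) ** (1.0 / 3.0)   # frequency from p = kappa * f^3
    return TAU_0 * f_k / L_CYC

# More local power processes more bits, with cubic-root diminishing returns.
print(local_bits(0.8) > local_bits(0.4))   # True
```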

V2I and V2V Processing
Since the computational resources of a MEC server are assumed to be sufficient, the latency of a MEC server in processing tasks is negligible. Moreover, the size of the computation result is very small, so the feedback latency is also negligible. Therefore, according to Shannon's theorem, d_{k,I}(n − 1) and d_{k,V}(n − 1) are calculated as

d_{k,I}(n − 1) = τ_0 W_d log₂(1 + γ_{k,I}(n − 1)), (15)

d_{k,V}(n − 1) = τ_0 W_d log₂(1 + γ_{k,V}(n − 1)), (16)

where W_d is the bandwidth and γ_{k,I}(n − 1) and γ_{k,V}(n − 1) are the SINRs of SVU k in slot n − 1 for V2I communication and V2V communication, respectively. The buffer capacity B_k(n) of SVU k can thus be calculated from B_k(n − 1) based on Equations (12)-(16). Since B_k(n) depends on a_k(n − 1), d_{k,L}(n − 1), d_{k,I}(n − 1), and d_{k,V}(n − 1), it reflects the randomness of task arrival and the uncertainty of the channel conditions for V2I communication.
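The buffer recurrence combined with the Shannon-rate offloading described above can be sketched as follows; the slot length and bandwidth are illustrative assumptions.

```python
import math

# Sketch of the buffer update: B(n) = [B(n-1) - (d_L + d_I + d_V)]^+ + a(n-1),
# where the offloaded bits follow tau_0 * W_d * log2(1 + SINR).
TAU_0 = 0.1      # slot length in seconds (assumed)
W_D = 1e6        # bandwidth W_d in Hz (assumed)

def offload_bits(sinr: float) -> float:
    """Bits offloaded in one slot over a link with the given SINR."""
    return TAU_0 * W_D * math.log2(1.0 + sinr)

def next_buffer(b_prev: float, arrived: float,
                d_local: float, sinr_i: float, sinr_v: float) -> float:
    """One step of the buffer recurrence; inputs are bits and raw SINRs."""
    processed = d_local + offload_bits(sinr_i) + offload_bits(sinr_v)
    return max(0.0, b_prev - processed) + arrived

# Here the processed bits exceed the backlog, so only the new arrivals remain.
print(next_buffer(1e5, 2e4, 1e4, 3.0, 1.0))   # 20000.0
```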

Problem Formulation
In this section, we describe the DRL-based framework, which consists of state, action, and reward functions. The state is defined based on the environment of each SVU in slot n; the action corresponds to the power allocation of each SVU, which is based on a policy µ and the reward is the benefit earned as a result of the action, which is related to the power consumption and latency.

State
In this paper, each SVU observes its surrounding environment to determine its power allocation. The V2I channel conditions for each SVU are uncertain because of the channel's interference in MIMO-NOMA and the mobility of each SVU. Moreover, the task arrival is random. Based on joint consideration of these two issues, the state is formulated to reflect the uncertainty of V2I channel conditions and the randomness of task arrival.
In the system model, the distance d_{k,j}(n) of SVU k from the BS along the x-axis reflects its mobility. In addition, according to Equations (8) and (9), the SINR γ_{k,I}(n) of SVU k for V2I communication depends on g^G_k(n), which in turn depends on g_{k,I}(n); thus, γ_{k,I}(n) depends on g_{k,I}(n) and reflects the uncertainty of the V2I channel conditions. Moreover, according to Equations (12)-(16), the buffer capacity B_k(n) of SVU k in slot n is a function of a_k(n − 1) and γ_{k,I}(n − 1), where a_k(n − 1) reflects the randomness of the task-arrival rate and γ_{k,I}(n − 1) reflects the uncertainty of the V2I channel conditions. Therefore, B_k(n) reflects both the randomness of task arrival and the uncertainty of the V2I channel conditions. Since SVU k can observe d_{k,j}(n), γ_{k,I}(n − 1), and B_k(n) in the local environment, the state of SVU k in slot n can be expressed as

s_{k,n} = [B_k(n), γ_{k,I}(n − 1), d_{k,j}(n)], (17)

where γ_{k,I}(n − 1) depends on g_{k,I}(n − 1) and the buffer capacity B_k(n) is related to γ_{k,I}(n − 1) and a_k(n − 1). Since g_{k,I}(n) and a_k(n − 1) take continuous values, the state space of SVU k is continuous.

Action
SVU k allocates its local processing power p_{k,L}(n), V2I processing power p_{k,I}(n), and V2V processing power p_{k,V}(n) in accordance with the current state observed in slot n. Thus, the action of SVU k in slot n can be expressed as

a_{k,n} = [p_{k,L}(n), p_{k,I}(n), p_{k,V}(n)]. (18)

Note that, similar to [44], we consider the action space of SVU k to be continuous.
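The continuous action above must stay within the feasible power ranges; a minimal sketch of how a raw actor output (plus exploration noise) could be projected onto those ranges is shown below, with the maximum-power values as assumptions.

```python
# Sketch of the action of Eq. (18): a power triple [p_L, p_I, p_V]
# clipped elementwise to its feasible box [0, P_max].
P_MAX_L, P_MAX_I, P_MAX_V = 1.0, 1.0, 1.0   # maximum powers in watts (assumed)

def clip_action(p_l: float, p_i: float, p_v: float) -> list[float]:
    """Project a raw actor output onto the feasible power-allocation set."""
    return [min(max(p_l, 0.0), P_MAX_L),
            min(max(p_i, 0.0), P_MAX_I),
            min(max(p_v, 0.0), P_MAX_V)]

print(clip_action(1.3, -0.2, 0.5))   # [1.0, 0.0, 0.5]
```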

Reward
The reward is an evaluation of the previous action. In this paper, we aim to minimize power consumption and latency. As mentioned in Section 3, the latency of task offloading is constant; by Little's theorem [45], the average latency is proportional to the average queue length, so the buffer capacity B_k(n) serves as a proxy for latency. The reward of SVU k is thus defined as

r_{k,n} = −[ω_1 (p_{k,L}(n) + p_{k,I}(n) + p_{k,V}(n)) + ω_2 B_k(n)], (19)

where ω_1 and ω_2 are nonnegative weight factors with ω_1 + ω_2 = 1. Accordingly, the cumulative discounted reward of SVU k can be calculated as

J(μ_{θ_k}) = E[ Σ_{n=1}^{N_{j,m}} γ^{n−1} r_{k,n} ], (20)

where γ is a constant that reflects the degree of discount applied to the long-term reward.
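The per-slot reward and the discounted return above reduce to a few lines of Python; the weights and discount factor below are illustrative assumptions.

```python
# Sketch of the reward: r = -(w1 * total power + w2 * buffer backlog),
# and the discounted sum of per-slot rewards over one episode.
W1, W2 = 0.5, 0.5    # nonnegative weights with w1 + w2 = 1 (assumed)
GAMMA = 0.99         # discount factor gamma (assumed)

def reward(p_local: float, p_i: float, p_v: float, buffer: float) -> float:
    """Negative weighted sum of power consumption and buffer backlog."""
    return -(W1 * (p_local + p_i + p_v) + W2 * buffer)

def discounted_return(rewards: list[float]) -> float:
    """Cumulative discounted reward over one episode."""
    return sum(GAMMA ** n * r for n, r in enumerate(rewards))

rs = [reward(0.5, 0.3, 0.2, 10.0), reward(0.4, 0.2, 0.2, 8.0)]
print(discounted_return(rs) < 0)   # True: both power and backlog are penalized
```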

Solution
In this section, we first introduce the training process, which is based on the DDPG algorithm. Then, we describe how the performance of the trained model is tested in the testing stage.

Training Stage
Since the DDPG algorithm is capable of solving problems with a continuous action space, we employ it to obtain the optimal policy. The DDPG algorithm combines the deterministic policy gradient (DPG) approach with the actor-critic framework and can be viewed as a modification of the DQN algorithm for continuous actions. It is composed of four neural networks: an actor network, a target actor network, a critic network, and a target critic network. The actor network and target actor network are employed to update the policy μ_{θ_k}, thus obtaining the optimal policy; the critic network and target critic network are employed to evaluate the policy.
The flow of the training stage is summarized in Algorithm 1. Note that θ k and θ k denote the parameters of the actor network and target actor network, respectively. ζ k and ζ k denote the parameters of the critic network and target critic network, respectively. ∆ n is the noise parameter in slot n.

Algorithm 1: Model training stage based on the DDPG algorithm
1 Initialize the replay experience buffer Γ;
2 Randomly initialize the network parameters θ_k and ζ_k, and set θ'_k ← θ_k and ζ'_k ← ζ_k;
3 for episode = 1, ..., K_max do
4   Input the initial model parameters;
5   Input the initial state s_{k,1};
6   for slot n = 1, ..., N_{j,m} do
7     Allocate local processing power, V2I processing power, and V2V processing power based on the current policy and exploration noise, a_k = μ_{θ_k}(s_k|θ_k) + Δ_n;
8     Perform the action a_k and obtain the reward r_k and the next state s'_k from the surrounding environment;
9     Store the transition (s_k, a_k, r_k, s'_k) in Γ;
10    if the number of tuples in Γ reaches T then
11      Randomly sample a minibatch of T transition tuples from Γ;
12      Update the parameter ζ_k based on Equations (21)-(23);
13      Update the parameter θ_k based on Equation (24);
14      Update the parameters ζ'_k and θ'_k based on Equations (25) and (26).
In the training stage, we randomly initialize θ_k and ζ_k, while the target parameters θ'_k and ζ'_k are initialized to the same values as θ_k and ζ_k, respectively. Then, we define an experience buffer Γ with sufficient space to store the transitions for each slot (lines 1-2).
Without loss of generality, let us consider that model training starts for SVU k. In the first episode, the position of SVU k is first reset to within the coverage range of the BS: d_{k,j}(1) is initialized as −D/2, and B_k(1) is initialized as half of the buffer capacity. Then, g^s_k(0) is randomly initialized and g^G_k(0) is calculated by using Equation (9). Meanwhile, the SINR γ_{k,I}(0) of SVU k for V2I communication is calculated by using Equation (8). Thus, the state of SVU k in slot 1 is obtained, i.e., s_{k,1} = [B_k(1), γ_{k,I}(0), d_{k,j}(1)] (lines 3-5). Subsequently, given the initial input s_{k,1} to the actor network in slot 1, the corresponding policy μ_{θ_k}(s_{k,1}|θ_k) is obtained. The noise Δ_1 is randomly initialized, and SVU k performs an action a_{k,1} based on the current policy and the noise, i.e., μ_{θ_k}(s_{k,1}|θ_k) + Δ_1. With the performed action, the local processing power p_{k,L}(1), the V2I processing power p_{k,I}(1), and the V2V processing power p_{k,V}(1) are determined. Then, SVU k obtains a reward r_{k,1} in accordance with Equation (19). The SINR γ_{k,I}(1) for V2I communication is obtained in accordance with Equations (3)-(9), and B_k(2) is obtained from Equations (12)-(16), where d_{k,L}(1), d_{k,I}(1), and d_{k,V}(1) are obtained from Equations (13), (15), and (16), respectively. Additionally, d_{k,j}(2) is obtained from Equation (2). Thus, the state of SVU k in slot 2 is obtained, i.e., s_{k,2} = [B_k(2), γ_{k,I}(1), d_{k,j}(2)]. Then, the tuple (s_{k,1}, a_{k,1}, r_{k,1}, s_{k,2}) is stored in the experience buffer Γ. If the number of tuples in Γ does not exceed T, SVU k proceeds to the next slot and repeats the above process (lines 6-10).
Once the number of tuples reaches T, the parameters θ_k, ζ_k, θ'_k, and ζ'_k are updated toward maximizing J(μ_{θ_k}). The parameter θ_k is updated in accordance with the gradient of the objective, denoted by ∇_{θ_k} J(μ_{θ_k}). Let Q_{ζ_k}(s_{k,n}, a_{k,n}) be the action value function of SVU k, which is the output of the critic network. According to [46], the task of solving for ∇_{θ_k} J(μ_{θ_k}) can be replaced by solving for the gradient of Q_{ζ_k}(s_{k,n}, a_{k,n}). Now, we further describe how the parameters are updated, as follows. Figure 2 illustrates the parameter update process of the DDPG algorithm. First, SVU k randomly selects T tuples from the experience buffer to form a minibatch. For convenience, we use r_k, s_k, a_k, s'_k, and a'_k to denote r_{k,n}, s_{k,n}, a_{k,n}, s_{k,n+1}, and a_{k,n+1}, respectively, and let (s^i_k, a^i_k, r^i_k, s'^i_k) denote the ith tuple in the minibatch. The target actor network takes s'^i_k as input and outputs a'^i_k = μ_{θ'_k}(s'^i_k), and the target critic network then outputs Q_{ζ'_k}(s'^i_k, a'^i_k), yielding the target value

y^i_k = r^i_k + γ Q_{ζ'_k}(s'^i_k, a'^i_k). (21)

Meanwhile, s^i_k and a^i_k are used as inputs to the critic network, which outputs the action value function Q_{ζ_k}(s^i_k, a^i_k). Thus, the loss function for tuple i can be expressed as

L_i = (y^i_k − Q_{ζ_k}(s^i_k, a^i_k))², (22)

and, accordingly, the loss function for all tuples can be expressed as

L = (1/T) Σ_{i=1}^{T} L_i. (23)

Then, SVU k updates ζ_k by minimizing this loss based on Equations (21)-(23) [47] (line 12). Afterward, ∇_{θ_k} J(μ_{θ_k}) can be obtained from the critic network output as

∇_{θ_k} J(μ_{θ_k}) ≈ (1/T) Σ_{i=1}^{T} ∇_a Q_{ζ_k}(s^i_k, a)|_{a=μ_{θ_k}(s^i_k)} ∇_{θ_k} μ_{θ_k}(s^i_k), (24)

where the chain rule is utilized, since a = μ_{θ_k}(s^i_k) depends on θ_k. SVU k then updates the parameter of the actor network in accordance with Equation (24) [47] (line 13).
In slot N_{j,m}, SVU k softly updates ζ'_k and θ'_k (line 14), i.e.,

ζ'_k ← τ_1 ζ_k + (1 − τ_1) ζ'_k, (25)

θ'_k ← τ_1 θ_k + (1 − τ_1) θ'_k, (26)

where τ_1 is a constant. Finally, SVU k proceeds to the next slot and uses s'_k as the input to the actor network. The current episode continues until slot N_{j,m} is reached. When the number of episodes reaches K_max, training of the system model is complete.
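The critic target and the soft updates described above can be sketched with NumPy; the minibatch size, parameter shapes, and τ_1 are illustrative assumptions, and simple arrays stand in for the actual neural-network parameters.

```python
import numpy as np

# Sketch of the DDPG critic target y = r + gamma * Q'(s', mu'(s')) for a
# minibatch, and the soft (Polyak) updates of the target parameters.
rng = np.random.default_rng(2)
GAMMA, TAU_1 = 0.99, 0.005      # discount and soft-update rate (assumed)

def critic_target(r: np.ndarray, q_next: np.ndarray) -> np.ndarray:
    """y^i = r^i + gamma * Q'(s'^i, mu'(s'^i)) over a minibatch."""
    return r + GAMMA * q_next

def soft_update(target: np.ndarray, online: np.ndarray) -> np.ndarray:
    """theta' <- tau_1 * theta + (1 - tau_1) * theta'."""
    return TAU_1 * online + (1.0 - TAU_1) * target

r = rng.standard_normal(32)          # minibatch of rewards (assumed size T=32)
q_next = rng.standard_normal(32)     # target-critic values at next states
y = critic_target(r, q_next)         # regression targets for the critic

zeta_target = np.zeros(10)           # stand-in target-critic parameters
zeta_online = np.ones(10)            # stand-in online-critic parameters
zeta_target = soft_update(zeta_target, zeta_online)
print(zeta_target[0])                # 0.005: slow drift toward the online weights
```

The small τ_1 keeps the target networks nearly frozen between updates, which stabilizes the bootstrapped targets in Equation (21).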

Testing Stage
In the testing stage, we test the performance of the trained system model. Algorithm 2 shows the flow of the testing process.

Algorithm 2: Model testing stage
1 Input the trained optimal parameters θ*_k;
2 for episode = 1, ..., K_max do
3   for slot n = 1, ..., N_{j,m} do
4     Allocate local processing power, V2I processing power, and V2V processing power in accordance with the optimal policy, a_k = μ_{θ_k}(s_k|θ*_k);
5     Perform the action a_k and obtain the reward r_k and the next state s'_k from the surrounding environment.

Simulation Results and Analysis
In this section, we demonstrate through simulation experiments that our proposed DDPG algorithm can obtain the optimal policy. The simulation experiments are divided into a training stage and a testing stage. The simulation tool is Python 3.7.
The key parameters of the experiments are listed in Table 2. The learning rates for the actor network and critic network are 10^-3 and 10^-4, respectively [47]. The size of the experience replay buffer is |Γ|. Task arrival follows a Poisson process with mean arrival rate λ_k. SVU k drives in lane 2, and its V2I communication is interfered with by three other vehicles when d_{k,j}(n) = 0. Figure 3 shows the learning curve in the training stage, where the reward is the average value in each episode. One can see that the average reward rises quickly from episode 0 to episode 12. The curve then declines from episode 12 to episode 400, reflecting that SVU k is adjusting its policy toward the optimal reward. From episode 400 to episode 1000, the reward becomes stable with little jitter. The jitter arises from the exploration noise, which prevents SVU k from falling into a local optimum.

Testing Stage
We now present the performance tests performed on the trained model in the testing stage. Figures 4 and 5 show the test performance under the DDPG algorithm and three greedy (GD) policies in terms of power consumption, buffer capacity, and reward, where the performance is recorded as the average value over 10 episodes. The three GD policies are as follows.

• GD-Local policy: SVU k first maximally allocates the local processing power in each slot. The remaining tasks are equally allocated to V2I processing and V2V processing.
• GD-V2I policy: SVU k first maximally allocates the V2I processing power in each slot. The remaining tasks are equally allocated to local processing and V2V processing.
• GD-V2V policy: SVU k first maximally allocates the V2V processing power in each slot. The remaining tasks are equally allocated to local processing and V2I processing.

Figure 4a shows the local-and-V2V processing power and the V2I processing power at different distances under the DDPG algorithm. When d_{k,j}(n) < 0, the V2I processing power increases, and the local-and-V2V processing power decreases significantly. This is because the path loss decreases, and thus the channel conditions for V2I communication improve, as SVU k approaches the BS. Therefore, as SVU k approaches the BS, it allocates more power to V2I processing; when it moves away from the BS, it allocates more power to local-and-V2V processing. When d_{k,j}(n) > 0, the local-and-V2V processing power increases rapidly, while the V2I processing power decreases. This is because, at d_{k,j}(n) = 0, the other three vehicles impose interference on SVU k, causing the V2I channel conditions to deteriorate; as the V2I channel conditions worsen, more power must be allocated to local-and-V2V processing. From Figure 4b, it can be seen that the power consumption under the GD-V2I policy gradually decreases when d_{k,j}(n) < 0 and rapidly increases when d_{k,j}(n) > 0, which clearly matches the changing V2I channel conditions. Figure 5a,b compares the buffer capacity and reward under the four policies at different distances. From Figure 5a, it can be seen that the buffer capacity increases at d_{k,j}(n) = 0 under the GD-V2I policy, because the V2I communication is interfered with by the other three vehicles at that point.
In contrast, the buffer capacity under the DDPG algorithm remains stable, which means that SVU k is able to process tasks in a timely manner even though the channel conditions are degraded. As shown in Figure 5b, the reward under the DDPG algorithm is better than the rewards under the other three policies most of the time. Figure 6a compares the average buffer capacity under the four policies. There is no significant difference in buffer capacity among the four policies; nevertheless, the GD-V2I policy has the largest buffer capacity because of the variation of the V2I channel conditions. Meanwhile, as seen from Figure 6b, the average power consumption under the DDPG algorithm is significantly lower than under the other three policies: compared to the GD-Local policy, it is reduced by 24.4%, and the reduction is 51.3% compared to the GD-V2I policy and 23.1% compared to the GD-V2V policy. Figure 6c shows the cumulative discounted reward under the four policies. The cumulative discounted reward under the DDPG algorithm is better than those under the other three policies because of the adaptability of the DDPG algorithm, which allows the model to rapidly adjust the power allocation. Figure 7a-c shows the cumulative discounted reward, power consumption, and buffer capacity under the four policies at different task-arrival rates. As the task-arrival rate increases, the cumulative discounted reward decreases, the power consumption increases, and the buffer capacity becomes larger for all four policies. It can also be seen that although the DDPG algorithm is superior to the other three policies in terms of power consumption and cumulative discounted reward, its buffer capacity is slightly higher than those under the GD-Local and GD-V2V policies. This is because the DDPG algorithm aims to obtain the maximum cumulative discounted reward.
Because the reward emphasizes power consumption, the buffer capacity receives less weight, and its performance is slightly worse as a result. Table 3 compares the performance in terms of average power consumption, average buffer capacity, and cumulative discounted reward under the four policies, where A, B, C, and D stand for different performance levels in descending order.

Conclusions
In this paper, we have proposed a decentralized DRL-based VEC power-allocation model that considers not only the coexistence of D2D-based V2V communication and MIMO-NOMA-based V2I communication but also the mobility of SVUs, the randomness of task arrival, and the channel interference in MIMO-NOMA. Extensive simulations demonstrate that the average power consumption and reward under the DDPG algorithm are superior to those of the other policies. Meanwhile, since the proposed DDPG algorithm focuses on power consumption, it may incur a compromise in buffer capacity. For future work, we will consider the coexistence of many-to-many D2D and MIMO-NOMA technology in VEC systems.