Resource Allocation in V2X Communications Based on Multi-Agent Reinforcement Learning with Attention Mechanism

Abstract: In this paper, we study the joint optimization problem of spectrum and power allocation for multiple vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) users in cellular vehicle-to-everything (C-V2X) communication, aiming to maximize the sum rate of V2I links while satisfying the low-latency requirements of V2V links. However, channel state information (CSI) is hard to obtain accurately due to the mobility of vehicles. In addition, the effective sensing of state information among vehicles becomes difficult in an environment with complex and diverse information, which hinders vehicles from collaborating on resource allocation. Thus, we propose a framework of multi-agent deep reinforcement learning based on an attention mechanism (AMARL) to improve V2X communication performance. Specifically, to cope with vehicle mobility, we model the problem as a multi-agent reinforcement learning process, where each V2V link is regarded as an agent and all agents jointly interact with the environment. Each agent allocates spectrum and power through its deep Q network (DQN). To enhance effective communication and the sense of collaboration among vehicles, we introduce an attention mechanism that focuses on more relevant information, which in turn reduces the signaling overhead and optimizes communication performance more explicitly. Experimental results show that the proposed AMARL-based approach can satisfy the requirements of a high rate for V2I links and low latency for V2V links. It also adapts well to environmental change.


Introduction
Vehicle-to-everything (V2X) communication is one of the key technologies for future autonomous driving and intelligent transport systems, aiming to enhance user experience, improve road safety, and adapt to complex and diverse transmission environments [1,2]. Among V2X services, vehicle-to-infrastructure (V2I) communication mainly satisfies the requirements of vehicle users for high throughput, such as video traffic offloading [3]. Vehicle-to-vehicle (V2V) communication, which focuses on the requirements of low latency and high reliability between vehicles, has become a key technology for cooperative driving and improved road safety [4,5].
V2X communication supports various use cases by exchanging information between infrastructure, vehicles, and pedestrians through various wireless technologies. Several candidate wireless technologies have been proposed, including dedicated short-range communication (DSRC), cellular vehicular communication, and 5G vehicular communication. DSRC technology is based on the IEEE 802.11p standard [6], which supports short exchanges between DSRC devices. To implement DSRC technology, the US Federal Communications Commission (FCC) has allocated 75 MHz of spectrum in the 5.9 GHz band.

Reinforcement learning (RL) has been applied to solve the joint optimization problem of C-V2X transmission mode selection and resource allocation. In [22], the age of information (AoI) was considered to study the delay problem of V2V links. To cope with the variation of vehicle mobility and information arrival time, the original MDP was decomposed into a series of MDPs for V2V pairs, and an LSTM-based DRL algorithm was proposed to address the local observability and high dimensionality of V2V pairs. The authors of [23] introduced a centralized resource allocation architecture, in which the base station uses a double deep Q network (DDQN) to allocate resources intelligently based on partial CSI to reduce the signaling overhead. Unlike [20-23], Refs. [24-29] modeled the V2X resource allocation problem as a multi-agent reinforcement learning (MARL) problem, where each V2V link is considered as an agent. In [24], a fingerprint-based deep Q-network was proposed to handle the non-stationarity problem in multi-agent reinforcement learning [25], and a centralized training and distributed execution framework was constructed for resource allocation. In [26], both V2I and V2V link latencies were considered in order to reduce the overall V2X latency, and proximal policy optimization (PPO)-based multi-agent reinforcement learning was proposed to optimize the objectives. To adapt to the changing environment more effectively, [27] proposed meta-reinforcement learning for V2X resource allocation: spectrum resources and power are first allocated using DQN and deep deterministic policy gradient (DDPG), respectively, and meta-learning is then introduced to enhance the adaptability of the allocation algorithm to the dynamic environment. In [28], the congestion problem of wireless resources was considered; a multi-agent reinforcement learning scheme (DIRAL) based on a unique state representation was proposed, and the non-stationarity problem was addressed by designing a view-based location representation. In addition, considering the topological relationship of vehicle users, [29] proposed a graph neural network (GNN)-based reinforcement learning method that learns low-dimensional features of V2V link states with a GNN and uses RL for spectrum allocation. Although RL methods have achieved satisfactory results on the V2X resource allocation problem, two problems remain: firstly, it is difficult for agents to sense each other effectively; secondly, when interacting with the environment, an agent indiscriminately receives state information from all other agents, which leads to high computational and signaling overheads.

Contribution and Organization
In this paper, we consider resource management with partial CSI to match realistic conditions. In addition, a multi-agent reinforcement learning algorithm is utilized to adapt to the dynamic vehicular environment. We regard each V2V link as an agent that makes decisions based on local observations. Furthermore, agents can have competitive or cooperative relationships in a multi-agent environment. Under competitive relationships, the V2V links tend to be egoistic, which ultimately degrades the communication performance of the whole system. Hence, under a cooperative relationship setting, we build a reinforcement learning architecture and design the reward function as a common reward for all agents. Considering the information exchange between agents, and inspired by [30,31], we introduce an attention mechanism [32] for information exchange between V2V links. Through the attention mechanism, each agent can focus on more relevant information and optimize itself more explicitly. The main contributions of this paper are summarized as follows:

• Due to the mobility of vehicular users, it is not easy to obtain CSI accurately. We propose a MARL framework to adapt to the changing environment and use only partial CSI for wireless resource allocation to ensure a high rate for V2I links and low latency for V2V links.

• To help each agent acquire the state information of other agents in the environment more effectively and establish collaborative relationships, we propose a multi-agent deep reinforcement learning algorithm with an attention mechanism (AMARL) to enhance the sense of collaboration among agents. It also enables agents to obtain more useful information, reduce the signaling overhead, and allocate resources more precisely.

• Experimental results demonstrate that, compared to other baseline schemes, the proposed AMARL-based algorithm satisfies the low-latency requirement of V2V links and significantly increases the total rate of V2I links. It also adapts better to environmental changes.
The remainder of this paper is organized as follows. Section 2 presents the system model and problem formulation. Section 3 presents the details of the proposed attention mechanism-based MARL algorithm for solving V2X resource allocation. The simulation results and analysis are presented in Section 4. Section 5 presents the conclusions.

System Model
As shown in Figure 1, we consider cellular V2X communications in an urban road traffic scenario, including a base station and multiple vehicle users. In particular, we focus on mode 4 in the cellular V2X architecture [33], in which each vehicle can choose its communication resources without relying on the base station for resource allocation. According to the different service requirements of V2X communications, the vehicle users are divided into V2I and V2V links. Specifically, V2I links support higher-throughput tasks, while V2V links can provide secure and reliable information to vehicle users through information sharing. In this paper, we consider the uplink for V2I communication and assume that all vehicle users have a single antenna for their transceivers. Meanwhile, to improve spectrum utilization and to guarantee the high-rate requirements of the V2I links, we assume that each V2I link is pre-allocated an orthogonal sub-band with a fixed transmit power and shares this sub-band with multiple V2V links. In addition, each V2V pair can only select one sub-band for communication. We denote the V2I links and V2V links as the sets $\mathcal{M} = \{1, \cdots, M\}$ and $\mathcal{N} = \{1, \cdots, N\}$, respectively, where $M$ and $N$ denote the numbers of V2I and V2V links, respectively. In addition, we assume that the number of sub-bands equals the number of V2I links.
In this paper, the channel power gain includes a large-scale fading component and a small-scale fading component. The channel gain can be expressed as $g = \alpha \beta$, where $\alpha$ and $\beta$ denote the large-scale fading (including the path loss and shadowing of each communication link) and the small-scale fading, respectively. We define the channel power gains of the $m$-th V2I link and of the $n$-th V2V link on the $m$-th sub-band as $\hat{g}_{m,B}$ and $g_n[m]$, respectively. The interfering channel gains received at the receiver of the $n$-th V2V link from the transmitter of the $m$-th V2I link and of the $n'$-th V2V link over the $m$-th sub-band are given by $\hat{g}_{m,n}$ and $g_{n',n}$, respectively. The interfering channel gain for the $m$-th V2I link from the $n$-th V2V link over the $m$-th sub-band is $g_{n,B}[m]$. For simplicity, the notations adopted in this paper are listed in Table 1.

Table 1. Summary of the main notation.

$g_n[m]$: channel gain of the $n$-th V2V link on the $m$-th sub-band
$\hat{g}_{m,B}$: channel gain of the $m$-th V2I link to the base station
$\hat{g}_{m,n}$: interfering channel gain from the $m$-th V2I link to the $n$-th V2V link
$g_{n',n}$: interfering channel gain from the $n'$-th V2V link to the $n$-th V2V link
$g_{n,B}[m]$: interfering channel gain from the $n$-th V2V link to the $m$-th V2I link
$x_n[m]$: indicator of the $n$-th V2V link reusing the spectrum of the $m$-th V2I link
$\gamma_n^{V2V}[m]$: SINR of the $n$-th V2V link
$\sigma^2$, $W$: noise power and bandwidth
$p_m^{V2I}$, $p_n^{V2V}[m]$: transmit powers of the $m$-th V2I link and the $n$-th V2V link
$\Delta_T$: coherence time of the channel
$\omega_{i,j}$: attention weight of V2V pair $i$ to V2V pair $j$

The received signal-to-interference-plus-noise ratio (SINR) of the $m$-th V2I link and of the $n$-th V2V link over the $m$-th sub-band can be expressed as:

$$\gamma_m^{V2I} = \frac{p_m^{V2I}\, \hat{g}_{m,B}}{\sigma^2 + \sum_{n \in \mathcal{N}} x_n[m]\, p_n^{V2V}[m]\, g_{n,B}[m]}$$

and:

$$\gamma_n^{V2V}[m] = \frac{p_n^{V2V}[m]\, g_n[m]}{\sigma^2 + I_n[m]},$$

where $p_m^{V2I}$ and $p_n^{V2V}[m]$ denote the transmit powers of the $m$-th V2I link and of the $n$-th V2V link on the $m$-th sub-band, $\sigma^2$ denotes the noise power, and:

$$I_n[m] = p_m^{V2I}\, \hat{g}_{m,n} + \sum_{n' \in \mathcal{N},\, n' \neq n} x_{n'}[m]\, p_{n'}^{V2V}[m]\, g_{n',n}$$

denotes the total interference power of the $n$-th V2V link on the $m$-th sub-band. The binary variable $x_n[m] \in \{0,1\}$ equals 1 if the $n$-th V2V link reuses the spectrum of the $m$-th V2I link and 0 otherwise. We assume that a V2V link only accesses one sub-band, i.e., $\sum_{m \in \mathcal{M}} x_n[m] \leq 1$. Then, the capacities of the $m$-th V2I link and of the $n$-th V2V link are:

$$R_m^{V2I} = W \log_2\left(1 + \gamma_m^{V2I}\right)$$

and:

$$R_n^{V2V}[m] = W \log_2\left(1 + \gamma_n^{V2V}[m]\right),$$

where $W$ is the bandwidth of the sub-band. This paper aims to maximize the V2I link capacity to provide high-quality entertainment services while satisfying the low-latency and high-reliability requirements of V2V links so as to provide realistic and reliable information to vehicle users in road traffic. To satisfy the first requirement, the sum rate of the V2I links needs to be maximized. To satisfy the second requirement, we require V2V users to successfully transmit packets of size $B$ within a finite time $T_{\max}$, captured by the following probabilistic model:

$$\Pr\left\{\sum_{t=1}^{T_{\max}} \sum_{m \in \mathcal{M}} x_n[m]\, \Delta_T\, R_n^{V2V}[m, t] \geq B\right\},$$

where $\Delta_T$ is the coherence time of the channel, and the index $t$ is added in $R_n^{V2V}[m,t]$ to indicate the capacity of the $n$-th V2V link in different coherence time slots. Thus, the V2X resource allocation problem can be formulated as the following optimization problem:

$$\max_{\{x_n[m],\, p_n^{V2V}[m]\}} \; \sum_{m \in \mathcal{M}} R_m^{V2I} \qquad (7)$$

$$\text{s.t.} \quad \sum_{m \in \mathcal{M}} x_n[m] \leq 1, \; x_n[m] \in \{0, 1\}, \quad \forall n \in \mathcal{N}, \qquad (8)$$

$$p_n^{V2V}[m] \in \mathcal{P}, \quad \forall n \in \mathcal{N}, \qquad (9)$$

where $\mathcal{P}$ denotes the discrete power set of the V2V links. Constraint (8) states that each V2V pair can occupy only one sub-band, and constraint (9) states that the power condition is satisfied. Problem (7) is a combinatorial optimization problem, and a limitation of traditional optimization methods is their high requirement for model accuracy. However, due to vehicle mobility, the environment is constantly changing, leading to uncertainty in the model parameters; the complete CSI is difficult to obtain, making the problem hard to solve with traditional methods. Therefore, we propose to address this problem through a deep reinforcement learning approach. In Section 4, we validate the effectiveness of the proposed method.
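For concreteness, the following Python sketch (our illustration, not part of the original formulation) evaluates the SINR and capacity expressions above for a toy configuration; all gains, powers, and the bandwidth are placeholder values rather than the paper's simulation parameters.

```python
import numpy as np

def v2i_sinr(p_v2i, g_mB, x, p_v2v, g_nB, sigma2):
    """SINR of the m-th V2I link: interference comes from the V2V
    links that reuse sub-band m (binary indicator vector x)."""
    interference = np.sum(x * p_v2v * g_nB)
    return p_v2i * g_mB / (sigma2 + interference)

def v2v_sinr(n, p_v2v, g_n, p_v2i, g_mn, x, g_cross, sigma2):
    """SINR of the n-th V2V link on sub-band m: interference from the
    V2I link on the same sub-band plus the other co-channel V2V links."""
    others = [k for k in range(len(p_v2v)) if k != n]
    i_v2v = sum(x[k] * p_v2v[k] * g_cross[k, n] for k in others)
    return p_v2v[n] * g_n / (sigma2 + p_v2i * g_mn + i_v2v)

def capacity(w, sinr):
    """Shannon capacity R = W * log2(1 + SINR)."""
    return w * np.log2(1.0 + sinr)

# Toy example: one sub-band reused by N = 2 V2V links (placeholder values).
x = np.array([1, 1])            # both V2V links reuse this sub-band
p_v2v = np.array([0.2, 0.2])    # V2V transmit powers (linear scale)
g_v2i = v2i_sinr(0.5, 1.0, x, p_v2v, np.array([0.01, 0.02]), 1e-3)
g_v2v = v2v_sinr(0, p_v2v, 0.8, 0.5, 0.05, x,
                 np.array([[0.0, 0.03], [0.02, 0.0]]), 1e-3)
print(capacity(1.0, g_v2i), capacity(1.0, g_v2v))  # rates with W = 1
```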

Resource Allocation Based on Multi-Agent Reinforcement Learning with an Attention Mechanism
In this section, we briefly introduce the basic concepts of the attention mechanism and multi-agent reinforcement learning and then describe how the algorithmic framework can be used to solve the V2X resource allocation problem. Before presenting the algorithm in detail, we first introduce the three elements of reinforcement learning: the observation space, the action space, and the reward function.

Observation Space
Due to vehicle mobility, it is difficult to obtain complete CSI. Therefore, we consider partial CSI as part of the observation space, which, on the one hand, is more in line with the real scenario and, on the other hand, is also beneficial in reducing the signaling overhead of CSI feedback. In mode 4, the vehicle performs wireless resource allocation by sensing channel measurements, during which it inevitably receives interference information. Considering the need for low latency in V2V links, the state observation of a V2V agent should also contain the remaining payload and the remaining time. Thus, the state of a V2V agent at time $t$ includes the received interference information, the remaining payload, and the remaining time.
We denote the observation space as:

$$S_t = \left\{o_t^1, \cdots, o_t^N\right\},$$

which is the set of all agents' observations at time $t$, where $o_t^n$ is the observation of the $n$-th agent at time slot $t$. The remaining payload and remaining time are denoted by $L_t^n$ and $U_t^n$, respectively. Therefore, $o_t^n$ can be expressed as:

$$o_t^n = \left\{I_t^n, L_t^n, U_t^n\right\},$$

where $I_t^n$ denotes the interference information received by the $n$-th agent.
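As a rough illustration of how such an observation could be assembled in code, consider the sketch below; the field ordering and array layout are our own assumptions, since the paper only specifies the three components.

```python
import numpy as np

def build_observation(interference_per_subband, remaining_payload,
                      remaining_time):
    """Concatenate the partial-CSI measurements into a flat vector:
    I_t^n (one entry per sub-band), then L_t^n, then U_t^n."""
    return np.concatenate([
        np.asarray(interference_per_subband, dtype=np.float32),  # I_t^n
        [np.float32(remaining_payload)],                          # L_t^n
        [np.float32(remaining_time)],                             # U_t^n
    ])

# Example: M = 4 sub-bands, 1060 bytes left, 80 ms of budget remaining.
obs = build_observation([0.2, 0.05, 0.4, 0.1], 1060.0, 0.08)
```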

Action Space
Based on the observed state, each V2V agent makes a decision on sub-band selection and power allocation. We define the action space of all V2V agents as $A = \{a_n\}_{n=1}^{N}$, where $a_n = \{x_n, p_n\}$ is the action space of the $n$-th V2V agent, and $x_n$ and $p_n$ denote the sets of possible sub-band selections and power allocations for the $n$-th V2V agent. Thus, the set of possible sub-band assignment decisions of the $n$-th V2V agent at time slot $t$ can be defined as:

$$x_t^n \in \{1, 2, \cdots, M\}.$$

In problem (7), we adopt a discrete power allocation scheme [34]. The set of possible power selections of the $n$-th V2V agent at time slot $t$ can be expressed as:

$$p_t^n \in \left\{P_1, P_2, \cdots, P_{N_P}\right\},$$

where $N_P$ is the number of power levels.
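The joint sub-band/power action space is small and discrete, so it can simply be enumerated and indexed by the DQN output, as in the following sketch; $M$ and the power levels shown are placeholders, not the paper's settings.

```python
from itertools import product

M = 4                                 # number of sub-bands / V2I links
POWER_LEVELS_DBM = [23, 15, 5, -100]  # assumed discrete power set P

# a_n = (x_n, p_n): every combination of sub-band and power level.
ACTIONS = list(product(range(M), POWER_LEVELS_DBM))

def decode_action(a_index):
    """Map a DQN output index back to (sub-band, power in dBm)."""
    return ACTIONS[a_index]

# The Q network then has len(ACTIONS) = M * len(POWER_LEVELS_DBM) outputs.
```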

Reward Function
The design of the reward function is closely related to problem (7). Our objective is to maximize the total throughput of the V2I links while satisfying the latency and reliability requirements of the V2V links. To satisfy the low-latency requirement of the V2V links, we set the following reward for the $n$-th V2V agent:

$$r_t^n = \begin{cases} \sum_{m \in \mathcal{M}} x_n[m]\, R_n^{V2V}[m,t], & L_t^n > 0, \\ c, & L_t^n = 0. \end{cases}$$

This means that we want the V2V link to complete the data transfer as quickly as possible. When there is a remaining load, the transmission is carried out at the effective rate of the V2V link until the load is fully transmitted. Here, $c$ is a hyperparameter greater than the maximum possible V2V link rate, so the faster the remaining load is sent, the greater the reward. In addition, we want the transmission time to be as short as possible, which increases the probability of successful packet transmission within the given time constraint. Therefore, the final reward function is set as follows:

$$r_t = c_1 \sum_{m \in \mathcal{M}} R_m^{V2I}[t] + c_2 \sum_{n \in \mathcal{N}} r_t^n - c_3 \sum_{n \in \mathcal{N}} \left(T_{\max} - U_t^n\right),$$

where $\{c_i\}_{i=1,2,3}$ are weighting factors reflecting the required degree of each QoS.
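The following sketch illustrates one way the reward described above could be computed; the constant $c$, the weights, and the exact form of the time-penalty term are assumptions for illustration only.

```python
C = 10.0                    # hyperparameter > max possible V2V rate
C1, C2, C3 = 0.1, 0.9, 1.0  # illustrative QoS weighting factors

def v2v_reward(rate_n, remaining_payload):
    """Achieved rate while payload remains, constant bonus once done."""
    return rate_n if remaining_payload > 0 else C

def global_reward(v2i_rates, v2v_rates, payloads, elapsed_times):
    """Common reward shared by all agents (cooperative setting)."""
    v2i_term = sum(v2i_rates)
    v2v_term = sum(v2v_reward(r, l) for r, l in zip(v2v_rates, payloads))
    time_penalty = sum(elapsed_times)   # assumed T_max - U_t^n per agent
    return C1 * v2i_term + C2 * v2v_term - C3 * time_penalty
```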

Overview of the Attention Mechanism
In the V2X resource allocation problem, the interactions between V2V pairs affect their respective communication performance. If each V2V pair receives the state information of all other V2V pairs, two problems arise. Firstly, mixing valuable and useless information makes performance optimization problematic; secondly, processing global information at each V2V pair requires a large amount of computational resources and a high signaling overhead, which is unacceptable. Therefore, to solve these two problems, we introduce an attention mechanism into the reinforcement learning framework, which evaluates the importance of state information through attention weights and enables V2V pairs to better obtain helpful information.
We define the state information of the $i$-th V2V pair as $s_i$ ($i \in \mathcal{N}$), with corresponding query $Q_i$, key $K_i$, and value $V_i$, obtained from the basic parameter matrices of the attention mechanism, namely the query matrix $W_Q$, the key matrix $W_K$, and the value matrix $W_V$, i.e., $Q_i = W_Q s_i$, $K_i = W_K s_i$, and $V_i = W_V s_i$. Thus, the attention weight of V2V pair $i$ to V2V pair $j$ is:

$$\omega_{i,j} = \frac{\exp\left(Q_i^{\top} K_j / \sqrt{d_k}\right)}{\sum_{j' \neq i} \exp\left(Q_i^{\top} K_{j'} / \sqrt{d_k}\right)},$$

where $d_k$ denotes the key dimension of each component.

The state information after the attention mechanism is then obtained by computing a weighted sum of the values of the other V2V pairs, which is represented as:

$$s_i^{A} = \sum_{j \neq i} \omega_{i,j} V_j.$$
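A minimal NumPy sketch of this scaled dot-product attention over agent states is given below; the dimensions and random parameter matrices are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_s, d_k = 4, 16, 8             # agents, state dim, key/query dim

W_Q = rng.normal(size=(d_s, d_k))  # query matrix
W_K = rng.normal(size=(d_s, d_k))  # key matrix
W_V = rng.normal(size=(d_s, d_k))  # value matrix
S = rng.normal(size=(N, d_s))      # s_i for every V2V pair

def attended_state(i, S):
    """Compute s_i^A = sum_{j != i} w_{i,j} V_j for agent i."""
    Q, K, V = S @ W_Q, S @ W_K, S @ W_V
    others = [j for j in range(len(S)) if j != i]
    scores = np.array([Q[i] @ K[j] / np.sqrt(d_k) for j in others])
    w = np.exp(scores - scores.max())
    w /= w.sum()                   # softmax over the other agents
    return sum(wj * V[j] for wj, j in zip(w, others))

s_A = attended_state(0, S)         # agent 0's aggregated information
```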

Multi-Agent Reinforcement Learning
In multi-agent reinforcement learning, multiple agents share the same environment. Each agent independently interacts with the environment and uses the reward feedback to continuously improve its policy for higher rewards. Furthermore, an agent's policy depends not only on its own state and actions but also on the states and actions of other agents, as shown in Figure 2.

AMARL Algorithm
In this section, we develop an attention-based DRL algorithmic framework to solve problem (7). As shown in Figure 3, we consider each V2V link as an agent and model the resource management problem as an MDP, where all vehicles are in the same wireless environment. Each agent independently interacts with the environment to obtain its local observations and acquires information from other agents through the attention mechanism, allocating spectrum and power based on these observations.

To maximize the rate of the V2I links while satisfying the low latency of the V2V links, we construct an algorithmic framework with a deep Q network (DQN) as the backbone and use a distributed architecture to solve problem (7), where each agent has its own Q network and optimizes its policy accordingly. We consider the allocation of wireless resources over time and introduce an attention mechanism to sense changes in vehicle state information caused by environmental changes.
With the introduction of the attention mechanism, each V2V link pays more attention to helpful information and integrates this information into its action value estimation function, i.e., the Q function, which can be expressed as:

$$Q_n\left(o^n, a^n; \theta\right) = f_n\left(\mathrm{add}\left(s_n^{A}, s_n\right), a^n; \theta\right).$$

The calculation process is shown in Figure 4, where $\mathrm{add}(s_n^A, s_n) = s_n^A + s_n$, $f_n$ is a three-layer multi-layer perceptron (MLP), $s_n^A$ is the output state of the agent after the attention network, and $\theta$ is a parameter of the network.
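The following PyTorch sketch illustrates this per-agent Q network: the attended state $s_n^A$ is fused with the agent's own state by element-wise addition and passed through an MLP. The hidden sizes follow the 250/180/100 neurons reported in the simulation setup; the input and output dimensions are placeholder assumptions.

```python
import torch
import torch.nn as nn

class AgentQNetwork(nn.Module):
    """Per-agent Q network: add(s_n^A, s_n) followed by a 3-layer MLP
    f_n whose head outputs one Q value per discrete action."""

    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim, 250), nn.ReLU(),
            nn.Linear(250, 180), nn.ReLU(),
            nn.Linear(180, 100), nn.ReLU(),
            nn.Linear(100, num_actions),   # Q(s, a) for every action a
        )

    def forward(self, s_n: torch.Tensor, s_n_att: torch.Tensor):
        fused = s_n + s_n_att              # add(s_n^A, s_n)
        return self.mlp(fused)

# Example: M = 4 sub-bands x 4 power levels -> 16 actions (assumed dims).
q_net = AgentQNetwork(state_dim=16, num_actions=16)
q_values = q_net(torch.randn(1, 16), torch.randn(1, 16))
```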

To obtain the optimal policy $\pi$, the optimal action value function is defined as:

$$Q^*(s_t, a_t) = \max_{\pi} \mathbb{E}\left[\left.\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\right|\, s_t, a_t, \pi\right].$$

From the Bellman optimality equation [35], it can be written as:

$$Q^*(s_t, a_t) = \mathbb{E}\left[r_t + \gamma \max_{a \in A} Q^*(s_{t+1}, a)\right],$$

where $\gamma$ is the discount factor. From the Monte Carlo approximation, this can be transformed into:

$$Q^*(s_t, a_t) \approx r_t + \gamma \max_{a \in A} Q^*(s_{t+1}, a).$$

Approximating $Q^*$ with the Q network yields:

$$Q(s_t, a_t; \theta) \approx r_t + \gamma \max_{a \in A} Q(s_{t+1}, a; \theta),$$

where the left-hand side is the prediction of the Q network at time $t$ and the TD target [36] on the right-hand side is the prediction of the Q network at time $t+1$, denoted as $y_t = r_t + \gamma \max_{a \in A} Q(s_{t+1}, a; \theta)$. Thus, the loss function can be defined as:

$$L(\theta) = \mathbb{E}\left[\left(Q(s_t, a_t; \theta) - y_t\right)^2\right].$$

The training of the DQN can be divided into two parts: collecting the training data and updating the parameters $\theta$.
(1) Collecting training data: the $n$-th V2V link interacts with the environment using some strategy $\pi$, which we call the behavioral policy. The $\epsilon$-greedy policy is generally used [37]:

$$a_t = \begin{cases} \arg\max_{a \in A} Q(s_t, a; \theta), & \text{with probability } 1 - \epsilon, \\ \text{a random action in } A, & \text{with probability } \epsilon. \end{cases}$$

(2) Updating the parameters: a mini-batch of experiences $D_n$ is uniformly sampled from the experience replay array $D$ to update the parameters $\theta$ using stochastic gradient descent. The TD algorithm is used to train the DQN; however, the maximization in the TD algorithm leads to an overestimation problem, where the TD target overestimates the true value. To alleviate this problem, a target network [38] is used to calculate the TD target, i.e.,

$$y_t = r_t + \gamma \max_{a \in A} Q(s_{t+1}, a; \theta_{tar}).$$

Therefore, the loss function is:

$$L(\theta) = \mathbb{E}_{D_n}\left[\delta_t^2\right],$$

where $\delta_t = Q(s_t, a_t; \theta) - y_t$ is the TD error. We then perform gradient descent to update the network parameters:

$$\theta \leftarrow \theta - \alpha \nabla_{\theta} L(\theta),$$

where $\theta_{tar}$ is the target network parameter, which is periodically updated from the Q-network parameter $\theta$ to improve the stability of the network. The training process is summarized in Algorithm 1.

Simulation Results and Analysis

In building the DQN for each agent, we constructed three fully connected layers containing 250, 180, and 100 neurons, respectively. The activation function of the hidden layers is the ReLU, $f(x) = \max(0, x)$; the RMSProp optimizer is used to update the network parameters, with learning rate $\alpha = 0.001$. In the training phase, similar to [24], we fix the payload of the V2V pairs to $2 \times 1060$ bytes, train the Q network of each agent for a total of 3000 episodes, and linearly anneal the exploration rate from 1 to 0.2. In the testing phase, we vary the payload and speed of the V2V pairs to verify the adaptability of the proposed scheme to the environment.
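Pulling the pieces together, the following PyTorch sketch mirrors the training procedure described above (an $\epsilon$-greedy behavior policy, uniform replay sampling, a target-network TD target, and RMSProp with learning rate 0.001); the discount factor, synchronization period, and dimensions are assumptions, and a plain MLP stands in for the attention-based Q network.

```python
import copy
import random
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 16, 16    # placeholder dimensions
GAMMA, SYNC_EVERY = 0.99, 100      # assumed discount and sync period

q_net = nn.Sequential(             # stand-in for the attention + MLP net
    nn.Linear(STATE_DIM, 100), nn.ReLU(),
    nn.Linear(100, NUM_ACTIONS),
)
target_net = copy.deepcopy(q_net)  # parameters theta_tar
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-3)

def select_action(state, epsilon):
    """Epsilon-greedy behavioral policy."""
    if random.random() < epsilon:
        return random.randrange(NUM_ACTIONS)
    with torch.no_grad():
        return int(q_net(state).argmax())

def train_step(batch, step):
    """One stochastic-gradient update on a sampled mini-batch D_n."""
    states, actions, rewards, next_states = batch
    with torch.no_grad():          # TD target: y = r + gamma * max_a Q_tar
        y = rewards + GAMMA * target_net(next_states).max(dim=1).values
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = torch.nn.functional.mse_loss(q, y)   # mean squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % SYNC_EVERY == 0:     # periodically refresh theta_tar
        target_net.load_state_dict(q_net.state_dict())
```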
To verify the validity of the proposed method, we compared it with the following four methods:

(1) Meta-reinforcement learning [27]: DQN is used to solve the spectrum allocation problem, deep deterministic policy gradient (DDPG) is used to solve the continuous power allocation problem, and meta-learning is introduced to make the agent adapt to changes in the environment.

(2) Proposed RL (no attention): this scheme does not incorporate an attention mechanism; each agent obtains the state information of all other agents indiscriminately and then allocates wireless resources.

(3) Brute-Force: this scheme is implemented in a centralized manner and requires accurate CSI. It focuses only on improving the performance of the V2V links, ignoring the needs of the V2I links, and performs an exhaustive search over the action space of all V2V pairs to maximize the V2V sum rate.

(4) Random: spectrum and power are allocated randomly.

Impact of Payload Size on Network Performance
Figure 5 shows the change in the sum rate of the V2I links and in the probability of successful transmission of the V2V links as the payload changes. In particular, based on the maximum V2V transmission power of 23 dBm set in [24], we use this power as a lower limit for the transmission power in this paper. As can be seen from Figure 5, the sum rate of the V2I links and the probability of successful transmission of the V2V links decrease for all schemes (except Brute-Force) as the V2V payload increases. This is because, when the payload increases, the V2V links require more transmission time and higher transmission power, which causes more interference to the V2I and V2V links, resulting in decreased communication performance. Compared to the meta-reinforcement learning scheme, Figure 5a shows that the proposed scheme maintains a higher sum rate of the V2I links as the payload increases and is close to the Brute-Force scheme. Even when the transmission power of the V2V links is set to the minimum of 23 dBm, the proposed scheme still achieves a much better V2I sum rate than the meta-reinforcement learning scheme. In Figure 5b, the successful transmission probabilities of the V2V links under the different methods are compared. The performance of the proposed method, which utilizes only partial CSI, is close to that of the meta-reinforcement learning method using full CSI. This indicates that the proposed algorithm can achieve the expected V2V link delay requirements with a low signaling overhead. Figure 5 also shows the robustness of the proposed algorithm to variations in the payload of the V2V links.
Furthermore, we observe the proposed algorithm's performance before and after the introduction of the attention mechanism. The communication performance is substantially improved after the attention mechanism is introduced. Without the attention mechanism, V2V links indiscriminately obtain the state information of other V2V links, which raises the interference level and increases the signaling overhead. With the attention mechanism, a collaborative relationship is built between V2V links, allowing better use of information from other V2V links for more effective interference coordination and thus improving the communication performance.

Impact of V2V Links Transmission Power on Network Performance
In this subsection, we investigate the impact of the V2V links' power variations on the network performance in order to find a low-power design that satisfies the performance requirements. As shown in Figure 6, we set the maximum transmission power of the V2V links to {23, 25, 27, 29, 31, 33, 35} dBm. As the payload increases, the performance at all set powers decreases. Figure 6a shows that, for the same load, the sum rate of the V2I links increases as the transmission power of the V2V links increases, and the performance at all powers is relatively similar. Similarly, Figure 6b shows that the probability of successful transmission of the V2V links also increases with increasing power; this is because, as the V2V link power increases, the V2V rate becomes larger and the transmission time decreases. In addition, we found that, when the maximum power of the V2V links is set to 35 dBm, the probability of successful transmission of the V2V links decreases by 5.25% as the payload increases, although the network performance is still improved. Moreover, when the maximum power is 33 dBm, the decline in the successful transmission probability is 4.25%, approaching the performance at a maximum power of 35 dBm. Compared with the other power settings, a maximum power of 33 dBm thus retains an advantage. This provides a reference for practical engineering design. If only a high throughput of the V2I links is required, setting the maximum power of the V2V links to 23 dBm is sufficient and reduces power consumption.

Impact of Vehicle Velocity on Network Performance
To further investigate the adaptability of the proposed algorithm to environmental changes, we study the effect of different vehicle speeds on the network performance. In the training phase, the speed was fixed in the range [10, 15] m/s to verify the robustness of the proposed algorithm. As shown in Figure 7, the performance of the proposed algorithm decreases with increasing speed for the same payload. This is because the environment changes more significantly as the vehicle speed increases, increasing both the difficulty of obtaining state information and the uncertainty of that information. However, the proposed scheme can still maintain a high throughput for the V2I links and a high probability of successful transmission for the V2V links, which indicates that the proposed algorithm can adapt to changes in the environment.
Furthermore, we investigated the effectiveness of the proposed AMARL algorithm. As shown in Figure 8, we fixed a payload of $2 \times 1060$ bytes and compared the network performance of the AMARL algorithm with that of the MARL algorithm (no attention). Figure 8a shows that the sum rate of the V2I links under the AMARL algorithm is higher than that of the MARL algorithm in a low-speed environment. As the speed increases, the proposed algorithm remains slightly better than the MARL algorithm. For practical design reasons, the proposed algorithm would be chosen over the MARL algorithm in low-speed environments where a higher throughput of the V2I links is required. In high-speed environments, the MARL algorithm may be preferable: its network performance can satisfy the throughput requirements of some V2I links with a lower computational overhead than the proposed algorithm. Overall, the network performance of the proposed algorithm is better than that of the MARL algorithm.
Figure 8b shows the effect of vehicle speed variation on the successful transmission probability of the V2V links. It can be seen from the figure that the proposed algorithm outperforms the MARL algorithm. Even at the highest vehicle speed, the minimum successful transmission probability of the proposed algorithm is close to the highest successful transmission probability of the MARL algorithm. This is due to the introduction of the attention mechanism. Specifically, the attention mechanism promotes the cooperative relationship between V2V links and reduces unnecessary communication interference by obtaining valid information, thus improving the throughput of the V2V links and reducing the data transmission time. The network performance metrics in Figure 8 verify that the proposed approach can adapt to environmental changes and demonstrate the attention mechanism's effectiveness.


Conclusions
In this paper, we proposed an attention-based multi-agent reinforcement learning algorithm for the spectrum and power allocation of V2X, aiming to satisfy the requirements of high throughput for V2I links and low latency for V2V links. Meanwhile, we used partial CSI for wireless resource allocation to reduce the signaling overhead.

Figure 4. Calculating the Q value for agent n.

Figure 5. The performance for different payload sizes: (a) Sum rate of V2I links. (b) Successful transmission probability of V2V links.

Figure 6. The performance comparisons for different V2V links' transmission power: (a) Sum rate of V2I links. (b) Successful transmission probability of V2V links.

Figure 7. The performance comparison for different velocities: (a) Sum rate of V2I links. (b) Successful transmission probability of V2V links.

Figure 8. The performance comparison between AMARL and MARL: (a) Sum rate of V2I links. (b) Successful transmission probability of V2V links.

Table 3. The channel models for the V2V link and V2I link.