Deep Reinforcement Learning-Based Spectrum Allocation Algorithm in Internet of Vehicles Discriminating Services

With the rapid development of intelligence and networking in the global automotive industry, the Internet of Vehicles (IoV), as a key communication technology, faces a growing shortage of spectrum resources. In this paper, we consider a spectrum utilization problem in which a number of cellular users (CUs) and prioritized device-to-device (D2D) users co-exist in a vehicle-mounted communication network, each equipped with a single antenna. To provide a service-aware spectrum access mechanism with guaranteed delay in a complex dynamic environment, we optimize a metric that trades off maximizing the total capacity of vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) links against minimizing the interference received by high-priority links. A low-complexity, priority-based spectrum allocation scheme based on deep reinforcement learning is developed to solve the proposed formulation. We train our algorithm using a deep Q-network (DQN) over a set of public bandwidths. Simulation results show that the proposed scheme allocates spectrum resources quickly and effectively in a highly dynamic vehicular network environment. In terms of channel transmission rate, the V2V link rate of the proposed scheme is 2.54 times that of the traditional random spectrum allocation scheme, and the V2I link rate is 13.5% higher. The average total interference received by priority links is 14.2 dB lower than that received by common links, which realizes service priority differentiation, and the scheme is robust to communication noise.


Introduction
In recent years, with the development of wireless communication and intelligent vehicle technology, vehicular networks have attracted extensive attention from industry and academia [1][2][3][4]. Vehicle to everything (V2X) includes vehicle to vehicle (V2V), vehicle to infrastructure (V2I), vehicle to pedestrian (V2P) and vehicle to network (V2N) communication. The third generation partnership project (3GPP) already supports V2X services in long term evolution (LTE) and fifth generation (5G) mobile networks. Through the wireless network formed by vehicles and various nodes, real-time information such as driving assistance and accident avoidance can be transmitted, and data services such as on-board entertainment, real-time navigation and Internet access can be offered, giving people a safer, more efficient, more environmentally friendly and more comfortable driving environment. Spectrum is a limited natural resource. With the wide application of radio technologies and services, the demand for spectrum in various industries and fields of the national economy keeps increasing, and new-generation information technologies such as the mobile Internet and the Internet of Things add further demand. Spectrum is therefore increasingly scarce, and it is wasted if it is underutilized or used improperly. Rational spectrum allocation is the key to high-quality vehicular network communication. With the rapid and widespread development of advanced broadband wireless technologies and the increasing demand for high-speed, high-quality services, traditional static spectrum allocation policies are becoming obsolete.
Dynamic spectrum sharing, one of the key technologies for addressing insufficient spectrum utilization, has received considerable attention in recent years [5]. The main goal of dynamic spectrum allocation is to design a flexible allocation strategy between existing users and new users that assigns idle spectrum to new users without compromising existing users' use of the spectrum, thereby improving spectrum utilization efficiency [6]. As the application scope of the Internet of vehicles continues to expand and the communication performance requirements of the network rise, an effective spectrum resource allocation scheme is needed to guarantee Internet of vehicles communication services with high reliability and low delay.
Vehicle nodes in vehicle-mounted networks have high mobility and complex time-varying characteristics, which makes it very challenging to provide vehicles with high-quality services such as very large capacity, ultra-high reliability and low delay. To solve these problems, an efficient spectrum resource allocation method is needed for vehicles in V2X scenarios. The authors in [7] propose a spectrum resource allocation scheme that adapts to slowly varying large-scale channel fading and maximizes the V2I capacity by using the slow-fading statistics of channel state information (CSI). In [8], different links have different quality of service (QoS) requirements. Under V2V link reliability and delay constraints, the total ergodic capacity of the V2I links is maximized to reduce network signaling overhead. In particular, this scheme allows spectrum resources to be shared not only between V2I and V2V links, but also between different V2V links. In [9], the authors propose a method that converts the actual delay and reliability requirements of V2V communication into optimization constraints, but these constraints can only be computed from slowly varying CSI. In [10], for areas where spectrum resources are in short supply, horizontal sharing among vehicles is combined with the available spectrum resources to improve the success rate of alarm information transmission, reduce its transmission delay and reduce noise, thereby lowering the incidence of traffic accidents and improving traffic safety. In [11], a service-priority-oriented spectrum allocation scheme is proposed for the LTE-based Internet of vehicles. The spectrum allocation problem is modeled as a mixed integer programming problem and solved by an immune cloning-based algorithm to maximize the system utility.
The authors in [12] propose a spectrum allocation model based on pricing and auction. Although the fairness of secondary users is effectively guaranteed, idle spectrum information must be obtained in advance, which requires strong spectrum sensing capability. The authors in [13] adopt a spectrum sharing model driven by a spectrum database, which dynamically adjusts the protection boundary of primary users according to a geolocation database to improve spectrum utilization efficiency, but this places high demands on real-time updating of spectrum information. In [14], the dynamic spectrum allocation problem is mapped to a graph coloring model by using topological graph relations, and the interference graph is used to reduce the interference caused by spectrum sharing. However, whenever the topology changes, the mapping must be recomputed, so the method is mostly suited to static network environments. In future vehicular networks with highly dynamic, unknown environments, such traditional spectrum resource allocation schemes are often difficult to apply.
Machine learning, one of the most powerful artificial intelligence tools, has been widely used in wireless communication networks in recent years, for example in multiple-input multiple-output (MIMO) systems, D2D communication and heterogeneous networks composed of femtocells and small cells [15]. In particular, reinforcement learning (RL), a branch of machine learning, can adapt its behavior and does not require real-time spectrum sensing. Agents only need to observe changes in the environment state and improve their performance according to the reward feedback received after taking actions [16], which greatly reduces the complexity of spectrum sharing. RL has achieved great success in many applications such as AlphaGo [17]. Inspired by this performance, researchers have begun to use RL methods to solve the spectrum resource allocation problem in unknown, highly dynamic vehicle-mounted network environments. A spectrum allocation scheme based on distributed learning is proposed in [18], in which D2D users explore the environment and autonomously select spectrum resources to maximize throughput and spectrum efficiency while minimizing the interference caused to cellular users. In [19], a distributed spectrum resource allocation scheme based on multi-agent RL is proposed. Each V2V link is regarded as an agent, and each agent autonomously learns how to select spectrum and power so as to improve the total capacity of the V2I links and the payload transmission rate of the V2V links.
In [20], each vehicle is regarded as an agent, and multiple agents make decisions autonomously based on local observations in V2V broadcast communication to find available spectrum. In [21], a deep RL method is developed to enable the base station (BS) to centrally manage network, cache and computing resources. In [22], vehicle observation data are compressed and fed back to the BS, where the reinforcement learning process is carried out to improve the spectrum sharing decision performance in the network. In [23], a graph neural network (GNN) is used to model the V2X network and extract the features of each V2V pair. Based on the extracted features and local observations, each V2V pair uses a Q-network to make distributed decisions. The authors in [24] propose a V2V wireless resource allocation system based on proximal policy optimization. This framework can output continuous and multi-dimensional actions, which reduces the implementation complexity in large-scale communication scenarios. In [25], the DQN is improved: to address the non-stationarity caused by multi-agent parallel learning, lag Q-learning and parallel experience replay trajectories are introduced to stabilize the training process, and an approximate regret reward (ARR) is added to stabilize the reward estimation. To improve the adaptability of traditional deep reinforcement learning (DRL) algorithms in dynamic environments, [26] combines meta-learning with DRL and proposes a meta-based DRL algorithm. Compared with a DQN-based algorithm, the algorithm in [26] is reported to perform better on both V2I and V2V links, and its training strategy is reported to generalize well and adapt quickly to new environments with limited experience.
The authors in [27] propose a centralized dynamic channel allocation method based on deep reinforcement learning for the satellite Internet of Things. This method exploits the strong representation ability of deep neural networks and learns allocation strategies continuously to make intelligent allocation decisions that minimize the average transmission delay of all sensors. However, none of the above methods designs differentiated services according to vehicle type or link service characteristics. In real life, special vehicles such as police cars, ambulances and fire trucks are common on the road, and they need a better information transmission environment in the Internet of vehicles. In the existing literature, all V2V links compete for spectrum resources on an equal footing. In contrast, for scenarios in which urgent services must be handled first, this paper proposes a sharing strategy for V2V and V2I links. Mode 4 of the cellular V2X architecture is used for resource allocation, and vehicles share resource pools for V2V and V2I communication. The strategy introduces a link priority mechanism, and the reward design of reinforcement learning gives higher-priority links a better information interaction environment.
The contributions of this paper are summarized as follows: (1) We formulate the dynamic spectrum allocation problem in a vehicle network in which CUs and D2D users co-exist. (2) We develop a centralized, low-complexity algorithm based on deep reinforcement learning to achieve priority-based spectrum allocation. (3) We build a weighted-sum reward function to realize a dynamically adaptive rate-interference trade-off between V2I and V2V links.
The simulation results show that the proposed control method can effectively improve the service quality of high-priority links while ensuring the overall performance of the system, and has good robustness to communication noise.
The rest of this paper is organized as follows. We depict the system model and formulate the optimization problem in Section 2. The proposed RL-based algorithm is presented in Section 3. Section 4 demonstrates simulation results and Section 5 concludes this paper.

System Model and Problem Formulations
In a V2X network, each V2V link can independently select a channel to maximize its own transmission rate. However, the global performance is then poor because of interference between different V2V links. On the other hand, in V2X scenarios the BS has enough computing and storage resources to allocate resources efficiently. With the help of reinforcement learning, this paper uses the BS as an agent that interacts with the unknown vehicle-mounted network environment.
Suppose that CUs and D2D users co-exist in a vehicle-mounted communication network in which each device is equipped with a single antenna. The sets of CUs and D2D users are expressed as $\mathcal{I} = \{1, 2, \dots, i, \dots, I\}$ and $\mathcal{J} = \{1, 2, \dots, j, \dots, J\}$, respectively. Each CU establishes a V2I link with the BS to support high-quality services, and each D2D user pair transmits information by establishing a V2V link. To ensure high-quality V2I communication, each V2I link is assumed to be pre-assigned a different orthogonal spectral subcarrier, which eliminates the interference between V2I links in the network. Without sacrificing performance, V2V links and V2I links share the same spectrum resources. To improve the communication quality of the V2V links, each V2V link needs to select the spectrum subcarrier it occupies and its transmit power.
The V2I link established between the i-th CU and the BS on licensed channel i is characterized by its channel power gain, and the coupling between this link and the j-th V2V link occupying the i-th subcarrier is characterized by an interference channel gain. $P^c_i$ and $P^d_j$ denote the transmit power of the vehicle transmitter of the i-th V2I link and of the j-th V2V link, respectively, and $\sigma^2$ is the noise power. The binary indicator $\rho_j[i] \in \{0, 1\}$ represents the spectrum allocation scheme: $\rho_j[i] = 1$ if the j-th V2V link chooses the i-th channel, and $\rho_j[i] = 0$ otherwise. The received signal-to-interference-plus-noise ratio (SINR) of V2I link i on the i-th subcarrier is then the ratio of its received signal power to the sum of the noise power and the interference power from the V2V links that select that subcarrier. Assume that each V2V link occupies only one channel. According to the Shannon formula, the transmission rate of the i-th V2I link on the i-th channel equals $W \log_2(1 + \mathrm{SINR})$, where W is the bandwidth of each channel.
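The V2I SINR and rate described in the preceding paragraph can be sketched in symbols as follows. This is only a sketch of the standard form: the gain symbols $g_i[i]$ (direct V2I gain to the BS) and $g_{j,B}[i]$ (gain from the j-th V2V transmitter to the BS), and the labels $\mathrm{SINR}^c_i$ and $C^c_i$, are placeholder notation assumed here and are not taken from the original.

```latex
% Assumed placeholder symbols: g_i[i], g_{j,B}[i], SINR^c_i, C^c_i
\mathrm{SINR}^c_i[i] =
  \frac{P^c_i \, g_i[i]}
       {\sigma^2 + \sum_{j \in \mathcal{J}} \rho_j[i] \, P^d_j \, g_{j,B}[i]},
\qquad
C^c_i[i] = W \log_2\!\bigl(1 + \mathrm{SINR}^c_i[i]\bigr)
```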
Similarly, $H_j[i]$ represents the channel power gain of the j-th V2V link occupying the i-th subcarrier. $H_{k,j}[i]$ denotes the interference channel gain from the vehicle transmitter of the k-th V2V link to the vehicle receiver of the j-th V2V link on the i-th channel. $G_{i,j}[i]$ indicates the interference channel gain from the vehicle transmitter of the i-th V2I link to the receiver of the j-th V2V link on the i-th subcarrier. The SINR of V2V link j occupying the i-th subchannel can then be represented as
$$\mathrm{SINR}^d_j[i] = \frac{P^d_j H_j[i]}{\sigma^2 + I_j[i]},$$
where $I_j[i]$ denotes the interference power received by the j-th V2V link from other V2V links and from V2I links:
$$I_j[i] = P^c_i G_{i,j}[i] + \sum_{k \in \mathcal{J},\, k \neq j} \rho_k[i] P^d_k H_{k,j}[i],$$
in which the first term represents the V2I link interference power. Therefore, the transmission rate of the j-th V2V link on the i-th channel can be expressed as $C^d_j[i] = W \log_2(1 + \mathrm{SINR}^d_j[i])$. To account for both overall link performance and the interference requirements of high-priority links, we maximize an objective function formed by a weighted sum of two terms minus a weighted third term. The first term is the sum rate of the V2I links, the second term is the sum rate of the V2V links and the third term is the total interference of the priority links. Meanwhile, to reflect the advantage of high-priority links without affecting the performance of other links, we introduce some constraints.
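As a numerical illustration of the V2V SINR and rate defined above, the following minimal Python sketch evaluates them for a toy configuration. The function names, the nested-list data layout and all numbers are assumptions made for illustration; they are not part of the original simulator.

```python
import math

def v2v_sinr(j, i, rho, p_d, p_c, H, H_cross, G, noise_power):
    """SINR of V2V link j on channel i (all quantities in linear scale).

    rho[k][i]        1 if V2V link k occupies channel i, else 0
    p_d[k], p_c[i]   transmit powers of V2V link k and V2I link i
    H[j][i]          direct gain of V2V link j on channel i
    H_cross[k][j][i] gain from V2V transmitter k to V2V receiver j on channel i
    G[i][j]          gain from V2I transmitter i to V2V receiver j on channel i
    """
    v2i_interference = p_c[i] * G[i][j]
    v2v_interference = sum(
        rho[k][i] * p_d[k] * H_cross[k][j][i]
        for k in range(len(p_d))
        if k != j
    )
    return p_d[j] * H[j][i] / (noise_power + v2i_interference + v2v_interference)

def shannon_rate(bandwidth, sinr):
    """Shannon rate W * log2(1 + SINR)."""
    return bandwidth * math.log2(1.0 + sinr)

# Toy example: two V2V links sharing one channel (illustrative numbers only).
rho = [[1], [1]]
p_d, p_c = [1.0, 1.0], [1.0]
H = [[4.0], [4.0]]
H_cross = [[[0.0], [1.0]], [[1.0], [0.0]]]
G = [[1.0, 1.0]]
sinr_0 = v2v_sinr(0, 0, rho, p_d, p_c, H, H_cross, G, noise_power=1.0)
rate_0 = shannon_rate(1.0, sinr_0)
```

In this toy case, removing the second V2V link from the channel raises the SINR of the first link, which is the coupling the allocation variable $\rho$ is meant to control.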
Therefore, the overall optimization problem maximizes this weighted objective subject to constraints on the priority-link interference and on the link rates, where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weight constants that define the priority among the three targets, $I_f$ refers to the total interference received by a priority link and $I_{max}$ represents the maximum total interference that we set. $C^c_{RA}$ and $C^d_{RA}$ represent the rates of the V2I links and V2V links, respectively, when the random spectrum allocation scheme is used.
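One way to write this problem down is sketched below. The objective follows the description above (two weighted rate terms minus a weighted interference term); the exact form of the constraints, in particular comparing the sum rates with $C^c_{RA}$ and $C^d_{RA}$, is an assumption inferred from the definitions and may differ from the original formulation.

```latex
% Sketch under stated assumptions; constraint forms are inferred, not quoted.
\begin{aligned}
\max_{\{\rho_j[i]\}} \quad
  & \lambda_1 \sum_{i \in \mathcal{I}} C^c_i
    + \lambda_2 \sum_{j \in \mathcal{J}} C^d_j
    - \lambda_3 I_f \\
\text{subject to} \quad
  & \rho_j[i] \in \{0, 1\}, \quad \forall i \in \mathcal{I},\ \forall j \in \mathcal{J}, \\
  & \sum_{i \in \mathcal{I}} \rho_j[i] = 1, \quad \forall j \in \mathcal{J}, \\
  & I_f \le I_{max}, \\
  & \sum_{i \in \mathcal{I}} C^c_i \ge C^c_{RA}, \qquad
    \sum_{j \in \mathcal{J}} C^d_j \ge C^d_{RA}.
\end{aligned}
```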

RL-Based Resource Allocation Algorithm
As shown in Figure 1, based on the network model in [22], the first V2V link is set as the priority link, and the other V2V links are set as common links. A deep neural network (DNN) at each V2V link compresses the local information observed by that link, which includes its own channel power gain, the interference from other links and the transmit power of its own vehicle transmitter. The compressed information is then fed back to the BS. Based on the feedback from all V2V links, the BS uses RL to decide the spectrum allocation for all V2V links. Finally, the BS sends the decision to each V2V link. Channel information that can be estimated at the BS is broadcast to all V2V links within its coverage area, which incurs low signaling overhead [10]. In [22], it is assumed that the power gain of the current channel can be obtained through delay-free feedback at the vehicle transmitter of the j-th V2V link. The observed data of the j-th V2V link, $O_j$, collects these locally available quantities. $O_j$ is compressed by the DNN at each V2V link, and the compressed data $f_j$ output by the DNN is fed back to the DQN at the BS. Here, $f_j = \{f_{j,k}\}$ is also known as the feedback vector of the j-th V2V link, where $f_{j,k}$, $\forall k \in \{1, 2, \dots, N_j\}$, is the k-th feedback element of the j-th V2V link and $N_j$ is the number of feedback elements learned by the j-th V2V link. $f_j$ also serves as input to the DQN for the reinforcement learning process.
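The compression step can be pictured as a small feed-forward network that maps a local observation $O_j$ to a short feedback vector $f_j$. The following sketch uses an untrained network with random weights purely to show the data flow and shapes. The hidden sizes (16, 32, 16), ReLU hidden activations and linear output follow the simulation settings reported later in the paper, while the observation length of 9, the weight initialization and all function names are assumptions.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: activation(W x + b)."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_compressor(n_obs, n_feedback, hidden=(16, 32, 16), seed=0):
    """Input layer, three hidden layers and a linear output layer."""
    rng = random.Random(seed)
    sizes = [n_obs, *hidden, n_feedback]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

def compress(observation, layers):
    """Map a local observation O_j to a feedback vector f_j."""
    x = list(observation)
    for index, (weights, biases) in enumerate(layers):
        is_output = index == len(layers) - 1
        x = dense(x, weights, biases, None if is_output else relu)
    return x

layers = build_compressor(n_obs=9, n_feedback=3)
feedback = compress([0.1] * 9, layers)  # f_j with N_j = 3 elements
```

Only the short vector `feedback` would be sent to the BS, which is what keeps the signaling overhead low compared with reporting the raw observation.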

Link Configuration Decision
Reinforcement learning can maximize long-term returns in sequential decision-making problems and enables agents to interact with complex environments and seek optimal spectrum allocation strategies through continuous trial and error. In [28], a deep Q-network is proposed. With end-to-end reinforcement learning, a good policy can be learned directly from high-dimensional input, enabling agents to solve a series of challenging tasks. This paper treats the BS as the agent in RL. The agent seeks the spectrum selection strategy with the maximum cumulative reward by trial and error through continuous interaction with a complex unknown environment. The RL problem can be modeled as a Markov decision process (MDP).
As shown in Figure 2, the process can be divided into three steps: (1) At time step t, the agent selects the action $A_t$ to perform according to the current state $S_t$. (2) The environment transitions to the state $S_{t+1}$ based on the current state $S_t$ and the action $A_t$. (3) The agent receives the reward $R_{t+1}$ corresponding to the action $A_t$ taken in the state $S_t$.

State Space
The BS takes all compressed data fed back by the DNNs as the current state, which can be expressed as $S_t = \{f_1, f_2, \dots, f_J\}$.

Action Space
The spectrum allocation of all V2V links is decided at the BS using the DQN, so the action is defined as $A_t = \{\rho_1, \rho_2, \dots, \rho_J\}$, where $\rho_j = \{\rho_j[i]\}$, $\forall i \in \mathcal{I}$, represents the spectrum allocation of the j-th V2V link.
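Because each of the J V2V links occupies exactly one of the I channels, the joint action can be enumerated as one of $I^J$ combinations. The sketch below shows one possible encoding, assuming the DQN output index is read as a base-I number; this indexing convention and the function names are illustrative assumptions, not a description of the original implementation.

```python
def decode_action(index, num_links, num_channels):
    """Map a joint action index in [0, num_channels**num_links) to the
    channel chosen by each V2V link (base-num_channels digits)."""
    if not 0 <= index < num_channels ** num_links:
        raise ValueError("action index out of range")
    choices = []
    for _ in range(num_links):
        index, channel = divmod(index, num_channels)
        choices.append(channel)
    return choices

def to_indicator(choices, num_channels):
    """Convert per-link channel choices to the binary matrix rho[j][i]."""
    return [[1 if chosen == i else 0 for i in range(num_channels)]
            for chosen in choices]

choices = decode_action(7, num_links=2, num_channels=4)
rho = to_indicator(choices, num_channels=4)
```

Each row of `rho` contains a single 1, which enforces the assumption that a V2V link occupies only one channel.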

Reward Design
The ultimate goal of this paper is to maximize the long-term sum rate of the V2V links, ensure the QoS of the V2I links in V2X scenarios and ensure the transmission performance of the priority links. V2V links are generally used to transfer key information during vehicle operation, such as vehicle speed, location, driving direction and braking. V2I communication is mainly used for real-time information services, vehicle monitoring and management, and charging without parking. Therefore, the primary goal is to ensure V2V link transmission, while ensuring that V2V transmission does not affect the V2I links too much and that priority links in the vehicle network receive less interference. To achieve this goal, this paper assumes that the first V2V link is the priority link, and the reward of RL is designed as $R_{t+1} = \lambda_c \sum_{i \in \mathcal{I}} C^c_i + \lambda_d \sum_{j \in \mathcal{J}} C^d_j - \lambda_f I_f$, where $\lambda_c$, $\lambda_d$ and $\lambda_f$ are the non-negative weights of the total rate of the V2I links, the total rate of the V2V links and the total interference of the priority links, respectively. The optimization goal of the reinforcement learning algorithm is to find an optimal strategy $\pi(a, s)$ that maximizes the cumulative reward returned over a training episode. The strategy $\pi(a, s)$ is the probability distribution of action a given state s. In reinforcement learning, there is always an optimal strategy $\pi^*(a, s)$ that maximizes the expected reward. The cumulative discounted reward from time step t can be expressed as $\sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, where $\gamma$ is the discount factor applied to future rewards.
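The weighted reward and the discounted return described above can be sketched in a few lines of Python. The function names and the toy numbers are illustrative assumptions; rates and interference are taken as already-computed scalars in linear units.

```python
def reward(v2i_rates, v2v_rates, priority_interference, lam_c, lam_d, lam_f):
    """Weighted sum of V2I and V2V rates minus weighted priority-link interference."""
    return (lam_c * sum(v2i_rates)
            + lam_d * sum(v2v_rates)
            - lam_f * priority_interference)

def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last reward backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Illustrative numbers only.
r = reward([10.0, 20.0], [1.0, 2.0], 0.5, lam_c=0.1, lam_d=1.0, lam_f=2.0)
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Raising `lam_f` makes any interference on the priority link more costly, which is how the reward steers the agent toward allocations that protect that link.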
In this paper, Q-learning is used to solve the RL problem because it is model-free: no prior knowledge of the transition probability $P(s', r \mid s, a)$ is required. For a given strategy $\pi$, the action-value function $Q^{\pi}(s, a)$ is the expected cumulative discounted reward obtained when the agent takes action a in state s and follows $\pi$ thereafter. The optimal action-value function $Q^*(s, a)$ under the optimal strategy $\pi^*$ satisfies the Bellman equation, which can be approximated by the iterative update
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right],$$
where $0 < \alpha < 1$ is the learning rate, which determines how much of the error is learned in each update. In [28], the Q-table is replaced by a network model that outputs the Q value. Previously, a Q value had to be looked up in a Q-table; now it is obtained by feeding the state and action to the network.
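For reference, the tabular form of this update can be sketched as follows. This is the standard Q-learning step, shown here with a dictionary as the Q-table; the state labels and numbers are placeholders, and the paper itself replaces the table with a neural network.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """One tabular Q-learning step; returns the temporal-difference error."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error

q_table = defaultdict(float)          # unseen (state, action) pairs start at 0
q_table[("s1", 0)] = 2.0
td = q_update(q_table, "s0", 1, reward=1.0, next_state="s1",
              actions=[0, 1], alpha=0.5, gamma=0.5)
```

With a learning rate of 0.5, the stored value moves halfway toward the bootstrapped target on each update.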
DQN selects actions with the $\epsilon$-greedy strategy and stores the transition samples $(S_t, A_t, R_{t+1}, S_{t+1})$ obtained from the agent's interaction with the environment at each time step in the experience replay memory. In each training step, a mini-batch of samples D is drawn from the memory, and the network parameters $\theta$ are updated with a variant of stochastic gradient descent to minimize the squared error
$$L(\theta) = \sum_{D} \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a; \theta^-) - Q(S_t, A_t; \theta) \right]^2,$$
where $\theta^-$ represents the parameters of the target Q-network, which are periodically synchronized with the Q-network parameters $\theta$. To further improve the stability of DQN, the target network is held fixed while the Q-network is updated several times. During training, a fixed number of samples are randomly drawn from the experience replay memory, so that the network model is optimized continuously.
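The three ingredients named above (an experience replay memory, $\epsilon$-greedy action selection and a target computed from a separate target network) can be sketched without any deep learning library. The class and function names are assumptions for illustration, and the Q-values are passed in as plain lists rather than produced by a network.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state) tuples."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest samples are dropped first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def td_target(reward, next_q_values_target, gamma):
    """Target R + gamma * max_a Q(S', a; theta^-) from the target network."""
    return reward + gamma * max(next_q_values_target)

def squared_error(target, q_value):
    return (target - q_value) ** 2

memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.push(step, 0, 1.0, step + 1)
batch = memory.sample(2, rng=random.Random(0))
```

Sampling uniformly from the memory breaks the correlation between consecutive transitions, and computing the target from separately held parameters keeps the regression target from shifting at every update.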

Algorithm Flow
In this paper, the observed value of the j-th V2V link at time step $t \in \{1, 2, \dots, T\}$ is defined as $O^t_j$, and the observed values of all V2V links at time step t are expressed as $O^t = \{O^t_j\}$, $\forall j \in \mathcal{J}$. According to [16], the approximate target value of the return can be expressed as
$$y_t = R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a; \theta^-).$$
The update of the DQN located at the BS can then be expressed as
$$\theta \leftarrow \theta + \beta \left[ y_t - Q(S_t, A_t; \theta) \right] \nabla_{\theta} Q(S_t, A_t; \theta),$$
where $\beta$ is the step size of the gradient iteration. At each time step t, each V2V link first observes its local data $O^t_j$ and feeds it to its DNN to obtain the feedback $f^t_j$, which is then transmitted to the BS. The BS uses the feedback $f^t_j$ of all links as the input to the DQN to generate the decision $a_t$, which is broadcast to all V2V links. Finally, each V2V link selects its spectrum according to the decision.

Simulations
This paper designs a simulator according to the evaluation methodology defined for the urban case in Annex A of 3GPP TR 36.885 [29], which describes in detail the vehicle fading model, density, speed, direction of movement, vehicle lanes, V2V data traffic, etc. The simulated topology is an intersection area of the Internet of vehicles, 375 m wide and 649 m long, with two-way and one-way lanes. A BS is located at the center of the scene, and the starting positions and driving directions of the vehicles are randomly initialized within the region. Other simulation parameters of the system are shown in Table 1; see also Table 2 in [22]. The hardware environment of the simulator is an Intel Core i9-10900F processor, 32 GB of memory and an Nvidia GeForce RTX 3090 graphics card. TensorFlow 1.12.0 and Keras 2.2.4 are used to build and train the neural networks. The algorithm adopts five-layer fully connected neural networks. Following [22], the DNN and the DQN each have three hidden layers. The numbers of neurons in the three hidden layers of the DNN are set to 16, 32 and 16, and those of the DQN are set to 1200, 800 and 600, respectively.
Here, the ReLU activation function f(x) = max(0, x) is used in the hidden layers of both the DNN and the DQN, and a linear activation is used for their output layers. The RMSProp optimizer [30] is used to update the network parameters with a learning rate of 0.001. The loss function is the Huber loss [31]. The exploration rate decays linearly from 1 to 0.01 during training. The number of steps T in each training episode is set to 1000, and the target Q-network is updated every 500 steps. The discount rate $\gamma$ for training is set to 0.05.
The size of the experience replay memory is set to $1 \times 10^6$, and the mini-batch size is set to 512.
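Two of the training settings above, the linearly decaying exploration rate and the Huber loss, are easy to state precisely in code. In the sketch below, the decay length `decay_steps` and the Huber threshold `delta = 1.0` are assumptions, since the text specifies only the start and end exploration rates and the name of the loss.

```python
def linear_epsilon(step, decay_steps, eps_start=1.0, eps_end=0.01):
    """Exploration rate decaying linearly from eps_start to eps_end."""
    if step >= decay_steps:
        return eps_end
    fraction = step / decay_steps
    return eps_start + fraction * (eps_end - eps_start)

def huber(error, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    magnitude = abs(error)
    if magnitude <= delta:
        return 0.5 * error * error
    return delta * (magnitude - 0.5 * delta)

eps_mid = linear_epsilon(50, decay_steps=100)
```

Compared with a pure squared error, the linear tail of the Huber loss limits the influence of occasional large temporal-difference errors on the gradient.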
To facilitate performance comparison, the weights of the total V2I link rate and the total V2V link rate are set to $\lambda_c = 0.1$ and $\lambda_d = 1$ in the experiment, in accordance with [22]. The proposed algorithm introduces the priority-link interference weight $\lambda_f$ to distinguish link priorities. If $\lambda_f$ is too small, the advantage of the high-priority link is not reflected. Conversely, if $\lambda_f$ is too large, the high-priority link over-occupies the resources of common links, which affects fairness and significantly degrades overall performance. The experimental test in Figure 3 shows that $\lambda_f = 8 \times 10^5$ distinguishes the priority well while taking the overall performance of the system into account. In Figure 4, the reward per episode increases as the number of training episodes increases and eventually converges. Rewards occasionally drop during training because the $\epsilon$-greedy strategy is adopted, and exploring the unknown environment may lead to poor results. As the exploration rate decreases in the later training period, the per-episode rewards no longer drop sharply and tend to converge, indicating that the training process is stable.

Discussion
The following four comparison schemes are considered in the simulation: C-Decision, SOLEN, GNN-RL and random spectrum allocation. First, the influence of the number of feedback elements on the V2V links is considered. As shown in Figure 5, when the number of feedback elements is 3, the average total interference received by the high-priority link in the proposed scheme is the smallest, clearly reflecting the advantage of priority, and the average total interference received by the common links is also slightly lower than that under C-Decision. Priority differentiation is thus achieved without degrading the average (that is, overall) performance, so fairness is preserved. As the number of feedback elements increases further, the average total interference of each V2V link remains essentially unchanged and is larger than that with 3 feedback elements. The number of feedback elements is therefore set to 3 in the simulation. Table 2 lists the performance comparison of the schemes when the number of feedback elements is 3 and there is neither input noise nor feedback noise. Input noise refers to the Gaussian white noise received by each V2V link during local observation, and feedback noise refers to the noise generated when the DNN output is fed back to the DQN input. Because the four comparison schemes treat all links equally and do not distinguish priority links, we treat their average rate as the priority link rate. For the proposed scheme, the average total interference received by the high-priority link is significantly smaller than that received by the common links; that is, the priority link has a better information transmission environment, reflecting the advantage of priority. The average total interference received by the common links is slightly lower than that of the C-Decision, SOLEN and GNN-RL schemes and better than that of the random scheme.
In addition, the average V2V link rate of the proposed scheme is about 2.12% higher than that of C-Decision, higher than that of the SOLEN scheme and clearly better than those of the GNN-RL and random schemes. Similarly, the average V2I link rate of the proposed scheme is about 7.51% higher than that of C-Decision, slightly higher than that of the SOLEN scheme and higher than those of the GNN-RL and random schemes. In summary, the proposed scheme clearly shows the advantage of priority and outperforms the C-Decision, GNN-RL and random spectrum allocation schemes. Although the proposed scheme is only slightly better than the SOLEN scheme, SOLEN must obtain the CSI of all links at the base station, whereas the proposed scheme has no such limitation. This indicates that the proposed scheme is more suitable for real-life Internet of vehicles environments with urgent service needs. Finally, we study the influence of input noise and feedback noise on the priority-enabled scheme.
As can be seen from Figure 6, as the input noise increases, the average interference received by the high-priority V2V link in the proposed scheme rises gradually and stabilizes when the input noise exceeds 10 dB, while the average total interference received by the non-priority links remains essentially unchanged. This indicates that the priority-enabled scheme clearly shows the advantage of priority when the input noise is low, and that this advantage gradually decreases as the input noise increases. As shown in Figure 7, the average rate of the V2I link in the proposed scheme decreases as the input noise increases and stabilizes when the input noise exceeds 3 dB. When the input noise is large, the rate of the V2I link is still maintained at 56.78%. Similarly, the average rate of the V2V links decreases as the input noise increases and remains stable when the input noise exceeds 10 dB. When the input noise is large, the average rate of the V2V links remains at 82.04%. V2I links maintain better performance than V2V links. Figure 8 shows that the average interference received by the high-priority link increases with the feedback noise, and that there is no significant difference between the high-priority link and the common links when the feedback noise exceeds 3 dB. This indicates that the information needed for link differentiation cannot be resolved when the feedback noise is high, but that the priority is still reflected when the feedback noise is low. When the feedback noise exceeds 10 dB, the average interference received by both types of link is essentially stable.
As shown in Figure 9, as the feedback noise increases, the average rate of the V2I link in the proposed scheme decreases gradually and is essentially unchanged when the feedback noise exceeds 10 dB. When the feedback noise is large, the rate is maintained at 85.55%. The average rate of the V2V link also decreases as the feedback noise increases and remains stable when the feedback noise exceeds 10 dB. When the feedback noise is large, the average rate of the V2V link remains at 57.59%. The V2I link is thus less affected by feedback noise.
Compared with the random spectrum allocation scheme, the C-Decision and SOLEN algorithms improve transmission efficiency and interference resistance. However, C-Decision does not differentiate services between V2V links, and the SOLEN scheme relies on the CSI of all V2V and V2I links. The proposed scheme improves the transmission performance of high-priority links while ensuring the performance of common links, and is superior to the C-Decision, SOLEN and traditional random spectrum allocation schemes in overall performance. It is also robust to input noise and feedback noise, and is therefore more suitable for practical vehicle-mounted network scenarios. At present, the proposed algorithm makes centralized spectrum allocation decisions at the base station. To reduce computational complexity, a next step is to consider distributed spectrum resource allocation, in which each V2V link makes local spectrum allocation decisions. We will also consider improving the algorithm so that it better suits joint V2I and V2V link resource allocation problems in the future Internet of vehicles.

Conclusions
To realize dynamic spectrum resource allocation based on service types, this paper proposes a link priority-based spectrum resource allocation scheme based on reinforcement learning, which maximizes the total capacity of the V2V and V2I links and minimizes the interference received by priority links. Simulation results show that the method effectively solves the complex spectrum resource allocation problem in vehicle-mounted networks. The V2V link rate is 2.54 times that of the traditional random spectrum allocation scheme, and the V2I link rate is 13.5% higher than that of the traditional random spectrum allocation scheme. The interference received by the priority link is 14.2 dB lower than that received by the common links, so the scheme guarantees the transmission quality of service of the high-priority link without harming the performance of the common links. In addition, the scheme is robust to communication noise and generalizes well to practical scenarios. In the future, the proposed scheme will be further improved, and more in-depth research will be carried out on diversified service types and distributed decision control, so as to explore resource scheduling and access control technologies that are better suited to future highly dynamic and complex vehicle-mounted networks.