Article

Reinforcement Learning-Based Resource Allocation for Multiple Vehicles with Communication-Assisted Sensing Mechanism

School of Information and Electronics, Beijing Institute of Technology (BIT), Beijing 100081, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(13), 2442; https://doi.org/10.3390/electronics13132442
Submission received: 6 May 2024 / Revised: 17 June 2024 / Accepted: 19 June 2024 / Published: 21 June 2024

Abstract

Autonomous vehicles (AVs) can be equipped with integrated sensing and communications (ISAC) devices to perform sensing and communication functions simultaneously. Time-division ISAC (TD-ISAC) is attractive because it is easy to implement, efficient to deploy and straightforward to integrate into existing systems; it greatly enhances spectrum efficiency and equipment utilization while reducing system energy consumption. In this paper, we propose a communication-assisted sensing mechanism based on TD-ISAC to support multi-vehicle collaborative sensing. However, applying TD-ISAC to AVs raises two challenges. First, AVs must allocate resources for sensing and communication in a dynamically changing environment. Second, limited spectrum resources lead to mutual interference among the signals of multiple vehicles. To address these issues, we construct a multi-vehicle signal interference model, formulate an optimization problem based on the partially observable Markov decision process (POMDP) framework and design a decentralized dynamic allocation scheme for multi-vehicle time–frequency resources based on a deep reinforcement learning (DRL) algorithm. Simulation results show that the proposed scheme achieves lower miss detection probability and average system interference power than the DRQN algorithm without the communication-assisted sensing mechanism and a random allocation scheme without reinforcement learning. We conclude that the proposed scheme can effectively allocate the resources of the TD-ISAC system and reduce interference between multiple vehicles.

1. Introduction

1.1. Background

With the development of communication technology, sensing technology, artificial intelligence and the automotive industry, the level of automotive intelligence is rapidly increasing. Autonomous vehicles (AVs), which will improve road safety and reshape the way we travel, are steering the future of transportation. Sensing and communication are two essential functions of AVs [1]. AVs have many sensors, such as LiDAR, which acquires information about the surroundings and avoids collisions by detecting the presence of obstacles [2,3]. In the meantime, AVs are equipped with wireless communication transceivers to achieve large-scale and high-capacity data exchange in vehicle-to-vehicle (V2V) or vehicle-to-infrastructure (V2I) communication. AVs’ sensing and communication functions can enhance the safety and smoothness of the transportation system.
Radar detection and wireless communication both convey or acquire information via radio waves, and in some scenarios sensing and communication are highly coupled. First, the operating frequency bands of both sensing and communication are moving toward the millimeter-wave range [4]. Compared with cameras and LiDAR, millimeter-wave radar offers an extended detection range, high precision, robustness to weather and the ability to operate around the clock. Millimeter-wave communication technology, with its narrow beamwidth and large bandwidth, can significantly improve the communication capability of vehicles. In addition, the similarities between radar and communication in system design, signal processing and data processing create conditions for sharing hardware equipment, such as transceiver systems [5].
Integrated sensing and communications (ISAC) technology aims to unify these two functions and pursue direct trade-offs and mutual performance gains [4]. Enabled by ISAC technology, AVs can simultaneously realize sensing and communication by carrying ISAC radio-frequency (RF) transceivers, which alleviates concerns for electromagnetic compatibility as well as reduces system size, weight, and energy consumption [6,7]. Currently, ISAC has five main realization methods: time division, frequency division, space division, code division and waveform sharing [1]. The time division ISAC (TD-ISAC) method is based on the fact that an RF device occupies different time resources to transmit sensing and communication signals, respectively [8]. The critical advantage of TD-ISAC is its easy implementation and seamless integration into any system [9]. Furthermore, TD-ISAC can utilize specific waveforms for sensing and communication according to their respective needs and improve system spectrum efficiency.
However, AVs equipped with TD-ISAC systems face some technical challenges. First, only one function of sensing and communication can operate at a given time, leading to a trade-off in the performance of both functions, particularly in dynamic environments [10]. Second, as the number of TD-ISAC vehicles increases, it is necessary for vehicles to learn how to coordinate the use of ISAC to minimize mutual interference and improve overall performance [11].

1.2. Related Work

Some scholars have conducted research on time resource allocation for sensing and communication functions. The traditional time division method uses an arbitrary or fixed schedule to switch between sensing and communication. The authors in [12] use the preamble of IEEE 802.11ad standard frames for radar detection and data blocks for communication data transmission. Based on this point, Ref. [13] performs adaptive preamble design, allowing for a trade-off between radar estimation accuracy and communication rate. In addition, Refs. [14,15] study the fundamental trade-offs of radar–communication coexistence with flexible configuration. Nevertheless, the road traffic environment is dynamic and uncertain, and these solutions are fixed during the runtime and unsuitable for implementation in real-time systems. Deep reinforcement learning (DRL) has been recently introduced to find the optimal decision quickly in real time. Some studies use DRL to optimize the allocation of communication and radar time resources for individual vehicles in dynamic environments. The authors of [16,17,18] use the DRL algorithm to decide when to use the communication mode or the radar mode to maximize the data throughput while minimizing the miss detection probability of unexpected events given the uncertainty of the surrounding environment. The AV can quickly obtain the optimal policy without requiring any prior information about the environment. Ref. [19] jointly coordinates radar and communication and minimizes the age of information (AoI) of its communication function.
As the number of TD-ISAC vehicles increases, vehicles need to learn how to coordinate their use of ISAC to minimize mutual interference. In traditional radar and communication systems, interference avoidance can be used to reduce mutual interference. Interference avoidance is achieved by coordinating signal transmission in the time, frequency and space domains to reduce interference [20]. One representative method for interference avoidance in the frequency domain is spectrum allocation. The authors of [21,22] propose a centralized framework to enable multiple access to a shared spectrum among radars relying on vehicular ad hoc networking (VANET) and cellular-vehicle-to-everything (C-V2X) communication, respectively. However, centralized allocation typically involves frequent transmission of large amounts of data between vehicles and the central node, leading to increased communication costs and decision latency. Decentralized allocation, in which each AV autonomously selects its frequency sub-band, can be adopted to reduce communication costs and ensure the timeliness of spectrum resource scheduling. A direct method is random selection, which is easy to implement but cannot suppress interference [23]. In [24], the author proposes a decentralized spectrum allocation approach based on DRL, effectively avoiding mutual interference among automotive radars. In the communication system, the issue of multi-user interference has also been studied using DRL by learning sub-band access strategies to avoid users colliding with other users in the same sub-band [25,26]. However, these studies do not use the ISAC mechanism or consider interference between communication and radar signals. Ref. [27] considers multiple vehicles based on [19] and proposes a medium access control (MAC) protocol that manages contention-free access to the communication–radar channel. However, in each time step, only one vehicle can communicate or sense. Therefore, we need a more efficient decentralized spectrum allocation method suitable for the ISAC mechanism.

1.3. Contributions

This paper focuses on a time division integrated sensing and communication scenario in intelligent transportation systems. In this scenario, AVs on the road are equipped with both communication and sensing functions, sharing the same set of transceiver equipment and switching in a time division manner, occupying the same frequency resources. The environment in which the AVs operate is highly dynamic, with environmental conditions and communication demands constantly changing, and there is interference between signals from different AVs. In practical applications, the trained model’s ability to generalize across different road environments must be rigorously tested to ensure that AVs can make safe and reliable decisions under various circumstances. We propose an intelligent time–frequency resource allocation algorithm to mitigate multi-vehicle mutual interference in dynamic road environments. The proposed algorithm allows AVs to make decentralized decisions based on local observations to avoid significant time overheads caused by centralized scheduling. The main contributions of this paper can be summarized as follows:
  • We propose a communication-assisted sensing mechanism based on TD-ISAC. By transmitting sensing information through communication, we can effectively reduce the number of active radars in the system. Simultaneously, by employing time division between sensing and communication, we can enhance spectrum utilization and lower the probability of multiple vehicles’ transmitted signals colliding within the same sub-band, thereby further improving system performance.
  • We construct a multi-vehicle sensing and communication interference model. Building this model contributes to a comprehensive understanding of the characteristics and sources of interference, enabling us to take appropriate measures to manage system interference.
  • We formulate a multi-vehicle optimization problem based on the partially observable Markov decision process (POMDP) framework. With the POMDP framework, vehicles can choose sensing or communication operations based on dynamic environments adaptively and select different sub-bands to reduce multi-vehicle interference.
  • To solve the optimization problem, we design a DRL algorithm using a target network and a prioritized experience replay (PER) scheme to enable multiple vehicles to better obtain the optimal strategy for time–frequency resource allocation under uncertain environmental factors.
The rest of this paper is organized as follows. We describe the problem formulation in Section 2. Section 3 builds the POMDP framework. Furthermore, in Section 4, we present a DRL algorithm. Section 5 evaluates the performance of the proposed approach. Finally, conclusions are drawn in Section 6.

2. Problem Formulation

2.1. Environment Model

This paper considers a two-lane road traffic scenario in both directions with a dynamically changing environment, as shown in Figure 1. Environmental conditions are influenced by road condition, weather condition and moving object state, and the values of these factors can be obtained through a vehicle sensor system, such as a road friction sensor, weather station instrument and camera [16]. Environmental conditions are not the same for different AVs at the same time. In the following text, we provide the modeling of the relationship between environmental conditions and risk levels.
$e \in E$ represents a particular environmental factor, where $E = \{\rho, w, m\}$, and $\rho$, $w$ and $m$ denote the road condition, the weather condition and the moving object state, respectively. $f \in F$ represents the degree of influence of an environmental factor on the vehicle's driving, where $F = \{0, 1, \dots, F\}$. For simplicity, we assume that $F = \{0, 1\}$: when $f = 0$, the environmental factor is good; when $f = 1$, the environmental factor is poor. $\tau_f^e$ represents the probability of the occurrence of the condition $e = f$, with

$$\sum_{f \in F} \tau_f^e = 1. \qquad (1)$$

$p_f^e$ denotes the probability that an unexpected event occurs under condition $e = f$. By the law of total probability, the average probability of an unexpected event occurring in the current environment is given by [16]

$$p_u = \sum_{e \in E} \sum_{f \in F} \tau_f^e \, p_f^e. \qquad (2)$$
The relationship between environmental conditions and risk levels can be provided to AVs by accessing digital map databases [27].
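To make the risk-level model concrete, the short Python sketch below evaluates Equations (1) and (2) with the factor probabilities later used in the simulation (Table 1). This is a minimal illustration of ours; the variable names and the assumption that all three factors share the same probabilities are not part of the paper.

```python
# Minimal sketch (our own illustration): evaluate Eq. (2) for the Table 1 values.
# tau[e][f]: probability that environmental factor e is in state f (each row sums to 1, Eq. (1)).
# p[e][f]:   probability of an unexpected event given that factor e is in state f.
factors = ["road", "weather", "moving_object"]        # E = {rho, w, m}
tau = {e: [0.7, 0.3] for e in factors}                # [tau_0^e, tau_1^e]
p = {e: [0.01, 0.1] for e in factors}                 # [p_0^e, p_1^e]

assert all(abs(sum(tau[e]) - 1.0) < 1e-9 for e in factors)          # Eq. (1)
p_u = sum(tau[e][f] * p[e][f] for e in factors for f in (0, 1))     # Eq. (2)
print(f"p_u = {p_u:.3f}")   # 0.111 for these values
```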

2.2. Signal Model

This paper assumes that all AVs are equipped with TD-ISAC systems, i.e., RF devices occupy different time resources for sensing and communication, respectively. We propose a communication-assisted sensing mechanism. Every two adjacent vehicles in the same lane are paired through V2V communication technology, and this vehicle pair is referred to as a communication-assisted vehicle pair (CAVP). The front vehicle (FV) and the back vehicle (BV) can maintain a minimal distance (several meters or even tens of centimeters) between them by sharing speed, location, status, and other information [28].
While driving, a CAVP can choose to sense or communicate based on the dynamic environment. The frame structure of sensing and communication for the FV and BV is illustrated in Figure 2. During the sensing frame, the FV performs forward sensing using long-range radar (LRR) while the BV awaits communication data. During the communication frame, the FV performs backward communication, and the BV receives communication data from the FV. Additionally, a specific subframe in the communication frame is allocated to the BV for backward sensing using short-range radar (SRR). The direction adjustment between sensing and communication functions is achieved through beam angle adjustment [8].
Let the set of all AVs on the road be $\mathcal{N} = \{1, 2, \dots, N\}$ and the set of CAVPs be $\mathcal{N}_p = \{1, 2, \dots, N_p\}$, where $N = 2N_p$. The total frequency band is divided equally into multiple orthogonal sub-bands, and the set of sub-bands is $\mathcal{M} = \{1, 2, \dots, M\}$ [21]. A CAVP can freely select a sub-band for its sensing or communication frames. The number of AVs is growing rapidly while the number of sub-bands is limited, so it is impossible to allocate a dedicated orthogonal sub-band to each CAVP, and multiple CAVPs will inevitably collide in the same sub-band, creating interference [21,24]. Thus, we assume that $N_p > M$ and that multiple CAVPs can use the same sub-band at the same time.
Because the paired vehicles are close together and backward sensing is comparatively less important, we only consider unexpected events occurring ahead of the CAVP, and not events between the FV and BV or behind the BV. When forward sensing succeeds, the FV learns whether an unexpected event has occurred in the environment (represented by 1 and 0, respectively), and the sensory result is stored in a data queue. However, if the FV cannot communicate the result to the BV in time, the sensory result becomes useless, which seriously affects the safety of the BV. We use the age of information (AoI) [19] as a metric to characterize the timeliness of information. The AoI of a sensory result, denoted by $\Delta$, is the time that has elapsed since the result was acquired and placed in the vehicle's queue. If $\Delta$ reaches the threshold $\Delta_0$, we consider the sensory result to have failed and remove it from the queue. We assume that when the backward communication succeeds, the FV transmits all non-failed sensory results; if the communication fails, all sensory results in the FV's queue are lost.
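The queue behavior described above can be illustrated with a short sketch. This is our own minimal implementation under simple assumptions (the AoI increases by one per frame and a result expires once it reaches $\Delta_0 = 4$); the class and method names are hypothetical.

```python
from collections import deque

class SensoryQueue:
    """Illustrative FV data queue: each entry is [sensory_result, AoI]."""

    def __init__(self, delta_0=4):          # expiry threshold Delta_0 (Table 1)
        self.delta_0 = delta_0
        self.queue = deque()

    def add_result(self, unexpected_event):
        """Store a fresh sensory result (1 = event observed, 0 = no event) with AoI 0."""
        self.queue.append([int(unexpected_event), 0])

    def age_one_frame(self):
        """Advance the AoI by one frame and drop expired results; returns n_over."""
        for entry in self.queue:
            entry[1] += 1
        n_over = sum(1 for _, age in self.queue if age >= self.delta_0)
        self.queue = deque(entry for entry in self.queue if entry[1] < self.delta_0)
        return n_over

    def flush_on_communication(self, success):
        """Communication frame: deliver every non-expired result if successful;
        otherwise all queued results are lost, as assumed in Section 2.2."""
        delivered = [entry[0] for entry in self.queue] if success else []
        self.queue.clear()
        return delivered
```

In a simulation loop, a sensing frame would call add_result(...) followed by age_one_frame(), while a communication frame would call flush_on_communication(inr < eta_0).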

2.3. Interference Model

Different CAVPs may use the same sub-band for the sensing or communication frame, generating multi-vehicle mutual interference. Because this paper primarily focuses on reducing unexpected events occurring in front of the CAVP, we model the interference suffered by forward sensing and backward communication.
As shown in Figure 1 and Figure 2, when a CAVP is in the sensing frame, the FV performs forward sensing. At this time, the radar receiver of the FV will be subjected to the same-frequency interference from the LRR signals in the opposite lane and the communication/SRR signals in the same lane. When the CAVP is in the communication frame, the FV performs backward communication. At this time, the communication receiver of the BV will be subjected to the same-frequency interference from the LRR signals in the opposite lane and the communication signals in the same lane.
The received interference power depends on the relative positions of the CAVPs. The interference power of the LRR signal from CAVP $n$ in the opposite lane can be expressed as

$$P_{R,I}^{n} = \frac{P_R G A_e g}{4\pi \left( L^2 + d^2 \right)} \, p_a^2(\theta_d), \qquad (3)$$

where $P_R$ is the LRR transmit power, $G$ is the antenna gain, $A_e$ is the effective antenna area, $g$ is the propagation decay factor, $p_a(\cdot)$ is the normalized antenna beam pattern [24], $L$ is the vertical distance between the two lanes, $d$ is the horizontal distance between the two CAVPs and $\theta_d = \arctan(L/d)$ is the radiation direction between the two CAVPs. For simplicity, we assume that the communication and SRR signals have the same power. The interference power of the communication/SRR signal from CAVP $n$ in the same lane can be expressed as

$$P_{C,I}^{n} = \frac{P_C G A_e g}{4\pi d^2}, \qquad (4)$$

where $P_C$ is the communication/SRR transmit power.
In conclusion, the total interference power experienced by a CAVP when it performs forward sensing or backward communication is

$$P_I = \sum_{n=1}^{N_R} P_{R,I}^{n} + \sum_{n=1}^{N_C} P_{C,I}^{n}, \qquad (5)$$

where $N_R$ is the number of CAVPs in the opposite lane that choose the same sub-band for the sensing frame and $N_C$ is the number of CAVPs in the same lane that choose the same sub-band for the communication frame.
We use the interference-to-noise ratio (INR) to measure the level of interference at the CAVP [24]. The INR is defined as

$$\eta = \frac{P_I}{\sigma^2}, \qquad (6)$$

where $\sigma^2$ is the receiver noise power. Similar to [24], if $\eta$ is lower than the threshold $\eta_0$, we consider the forward sensing or backward communication successful.
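The following Python sketch implements Equations (3)–(6) for a single CAVP. The transmit powers, gains and threshold follow Table 1; the cosine-squared beam pattern and the noise power are placeholder assumptions (the paper takes $p_a(\cdot)$ from [24] and does not state $\sigma^2$).

```python
import numpy as np

P_R, P_C = 10**(25 / 10) * 1e-3, 10**(15 / 10) * 1e-3     # 25 dBm and 15 dBm, in watts
G, A_e, g, L, ETA_0 = 10**(48 / 10), 5e-6, 0.1, 3.0, 10    # Table 1 values (A_e = 5 mm^2)

def beam_pattern(theta):
    """Placeholder for the normalized antenna pattern p_a(.); the paper uses the one in [24]."""
    return np.cos(theta)**2

def lrr_interference(d):
    """Eq. (3): interference from an opposite-lane LRR at horizontal offset d."""
    theta_d = np.arctan2(L, d)
    return P_R * G * A_e * g / (4 * np.pi * (L**2 + d**2)) * beam_pattern(theta_d)**2

def comm_interference(d):
    """Eq. (4): interference from a same-lane communication/SRR signal at distance d."""
    return P_C * G * A_e * g / (4 * np.pi * d**2)

def sensing_or_comm_successful(opposite_dists, same_dists, noise_power=1e-7):
    """Eqs. (5)-(6): sum the interference from all same-sub-band CAVPs and compare the
    INR with eta_0. The noise power is an arbitrary placeholder value."""
    P_I = sum(lrr_interference(d) for d in opposite_dists) + \
          sum(comm_interference(d) for d in same_dists)
    return (P_I / noise_power) < ETA_0

print(sensing_or_comm_successful(opposite_dists=[40.0], same_dists=[55.0]))
```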

3. POMDP Framework

Due to the high mobility of autonomous vehicles, there is a strict requirement for the latency of information transmission and processing to ensure the safety of autonomous driving. Each CAVP must be able to make decentralized decisions based on its local observations to avoid large time overheads caused by centralized scheduling. However, decentralized decision making leads to CAVPs having access to only the partial environmental information observed by themselves, without the ability to obtain the global information. Based on this, we model multi-agent decision problems as individual POMDP problems for each single agent [29]. A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. In this approach, each CAVP considers the influence of other CAVPs as part of the environment and makes decisions based on its local observations.
During a CAVP’s driving, it continuously interacts with the dynamic environment. At each discrete time step $t$, each CAVP $i$ observes the local environmental state $o_t^i$. The global environmental state $s_t$ of the system is a simple superposition of the local observations of the $N_p$ CAVPs, i.e., $s_t = \{o_t^1, o_t^2, \dots, o_t^{N_p}\}$. Based on the observation $o_t^i$, each CAVP takes an action $a_t^i$ according to a policy $\pi^i$. Furthermore, the CAVP adjusts the policy $\pi^i$ based on the environmental feedback $r_t^i$.

3.1. Observation Space

At each discrete time step $t$, a CAVP observes its surrounding environmental conditions as well as the state of its data queue:

$$o_t^i = \left\{ d_{S,t}^i, \, d_{D,t}^i, \, e_t^i, \, R_t^i(\Delta_0) \right\}, \qquad (7)$$

where $d_{S,t}^i = \left| p_{S,t}^i - p_t^i \right|$, $d_{D,t}^i = \sqrt{\left( p_{D,t}^i - p_t^i \right)^2 + L^2}$, $p_t^i$ is the position (one-dimensional coordinate) of CAVP $i$, and $p_{S,t}^i$ and $p_{D,t}^i$ are the positions of the CAVPs nearest to CAVP $i$ in the same and different lanes, respectively. CAVP $i$ obtains these values through the global navigation satellite system and its camera. $e_t^i = \{\rho, w, m\}$ collects the environmental factors identified in Section 2 as being associated with the risk level. $R_t^i(\Delta_0)$ denotes the number of sensory results with $\Delta_t^i = \Delta_0$ in the queue, i.e., the data that fail at time step $t$.

3.2. Action Space

According to the system model, a CAVP can perform flexible time–frequency resource configuration according to the dynamic, time-varying characteristics of the vehicle locations and the road environment. Therefore, the action space of CAVP $i$ at time step $t$ is defined as

$$a_t^i \in \left\{ a_C^{<1>}, a_C^{<2>}, \dots, a_C^{<M>}, a_R^{<1>}, a_R^{<2>}, \dots, a_R^{<M>} \right\}, \qquad (8)$$

where $a_C^{<m>}$ and $a_R^{<m>}$ denote that the CAVP uses sub-band $m$ for the communication frame and the sensing frame, respectively.
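Since the Q-network of Section 4 outputs one value per action, it is convenient to index the $2M$ actions with a flat integer. The mapping below is our own illustrative convention; the paper does not prescribe an ordering.

```python
def decode_action(a, M):
    """Map a flat index a in {0, ..., 2M-1} to (frame type, sub-band m).
    Indices 0..M-1 are the communication actions a_C^<m>, M..2M-1 the sensing actions a_R^<m>."""
    frame = "communication" if a < M else "sensing"
    return frame, (a % M) + 1          # sub-bands are numbered 1..M

print(decode_action(0, M=4))   # ('communication', 1)
print(decode_action(7, M=4))   # ('sensing', 4)
```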

3.3. Reward Function

As described in Section 2, when a CAVP is in the communication frame, the FV performs backward communication. The total number of unexpected events at that moment is the sum of the number of unexpected events that occurred in the environment and the number of sensory results lost due to communication failures. When a CAVP is in the sensing frame, the FV performs forward sensing. The total number of unexpected events at that moment is the sum of the number of invalid sensory results in the queue and the number of unexpected events missed due to sensing failures. The reward the CAVP receives at each time step is used to balance the sensing and communication functions, encouraging the agent to minimize the number of unexpected events that are not successfully sensed or transmitted. Rewards should be negatively correlated with the total number of unexpected events. So we design the reward function of CAVP i at time step t as
$$r_t^i = \begin{cases} -\omega \left( u_{C,t}^i + I_t^{i,C} \right), & a_t^i \in a_C, \\ -\omega \left( n_{over,t}^i + I_t^{i,R} \right), & a_t^i \in a_R, \end{cases} \qquad (9)$$

$$I_t^{i,C} = \begin{cases} 0, & \eta_t^i < \eta_0, \\ n_{all,t}^i, & \eta_t^i \ge \eta_0, \end{cases} \qquad (10)$$

$$I_t^{i,R} = \begin{cases} 0, & \eta_t^i < \eta_0, \\ u_{R,t}^i, & \eta_t^i \ge \eta_0, \end{cases} \qquad (11)$$

where $\omega$ is the reward weight, $u_{C,t}^i$ and $u_{R,t}^i$ denote the number of unexpected events in the environment when CAVP $i$ is in the communication frame and in the sensing frame, respectively, $n_{all,t}^i$ denotes the number of all sensory results in the queue of CAVP $i$ and $n_{over,t}^i$ denotes the number of sensory results with $\Delta_t^i = \Delta_0$ in CAVP $i$'s queue.
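A minimal sketch of the reward computation of Equations (9)–(11) is given below. The explicit negative sign reflects the statement that the reward is negatively correlated with the number of unexpected events that are neither sensed nor delivered; the argument names are our own.

```python
def cavp_reward(is_communication_frame, inr, u_env, n_all, n_over, eta_0=10, omega=200):
    """Illustrative reward of Eqs. (9)-(11) for one CAVP at one time step."""
    if is_communication_frame:
        # Events occurring now (u_C) plus every queued result lost if the
        # backward communication fails (INR >= eta_0), Eq. (10).
        missed = u_env + (n_all if inr >= eta_0 else 0)
    else:
        # Expired results in the queue (n_over) plus events missed if the
        # forward sensing fails (INR >= eta_0), Eq. (11).
        missed = n_over + (u_env if inr >= eta_0 else 0)
    return -omega * missed
```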

3.4. Optimal Planning

In reinforcement learning, the merit of a policy is measured by the cumulative reward obtained from following that policy over time, and the aim of learning is to find the policy that maximizes this cumulative reward [30]. In this paper, we use the $\gamma$-discounted cumulative reward, and the optimization problem is formulated as

$$\max_{\pi^i} \; R(\pi^i) = \mathbb{E}\left[ \sum_{t=0}^{N_t} \gamma^t r_{t+1}^i \,\middle|\, \pi^i \right], \qquad (12)$$

where $N_t$ is the number of training steps per episode and $\gamma \in (0, 1)$ is the discount factor.

4. DDQN-PER Algorithm

Although we have defined the observation space and the reward function, a CAVP does not know the underlying environment dynamics in advance; it must learn its strategy through continuous interaction with the environment. Therefore, a model-free reinforcement learning algorithm is adopted. In addition, each CAVP makes its decisions in a decentralized manner, yet its returns are influenced by the strategies of the other CAVPs, which makes the environment non-stationary and convergence difficult. To deal with the dynamics and uncertainty of this multi-agent environment, we develop an improved algorithm based on the double deep Q-network (DDQN) [31], named DDQN-PER (double deep Q-network with prioritized experience replay). The proposed algorithm uses a deep neural network as a function approximator and combines a target network with the PER scheme [32] so that multiple CAVPs can better learn the optimal time–frequency resource allocation strategy under uncertain environmental factors. The algorithm flow is shown in Figure 3, and the pseudocode is given in Algorithm 1.
Algorithm 1 DDQN-PER
Inputs:
    D^i: empty replay memory with capacity |D^i|
    θ^i: initial Q-network parameters
    θ̂^i: initial target Q-network parameters
for episode = 1 : E do
    Initialize observations for all vehicles
    for t = 1 : T do
        for i = 1 : N do
            Feed observation o_t^i to the Q-network to obtain Q^i
            Choose an action a_t^i as in (13)
            Obtain a reward r_t^i
            Obtain the new observation o_{t+1}^i
            Store the experience (o_t^i, a_t^i, r_t^i, o_{t+1}^i) in D^i
            Sample a mini-batch of experiences (o_j^i, a_j^i, r_j^i, o_{j+1}^i) from D^i with prioritization as in (15)
            Compute the importance sampling weights w_j as in (16)
            Calculate the target Q-values y_j^i as in (17)
            Update the priorities of the experiences as in (14)
            Update the Q-network parameters θ_{t+1}^i as in (20), θ_t^i ← θ_{t+1}^i
            if t mod N_u = 0 then
                θ̂_t^i ← θ_t^i
            end if
            o_t^i ← o_{t+1}^i
        end for
    end for
end for
Each CAVP needs to learn its own policy; thus, each CAVP has a separate Q-network. The inputs of the Q-network are the relative distances to the neighboring CAVPs, the road condition, the weather condition, the moving object state and the data queue state. The outputs are the Q-values of the sub-bands for the sensing or communication frame. At each training time step, CAVP $i$ chooses an action $a_t^i$ following the $\varepsilon$-greedy policy when the observation is $o_t^i$. The $\varepsilon$-greedy policy is defined as

$$a_t^i = \begin{cases} \text{a random action}, & \text{with probability } \varepsilon, \\ \arg\max_a Q^i(o_t^i, a; \theta_t^i), & \text{with probability } 1 - \varepsilon, \end{cases} \qquad (13)$$

where $Q^i$ denotes the Q-values of CAVP $i$'s Q-network and $\theta_t^i$ are the network parameters. CAVP $i$ executes the action $a_t^i$ and obtains a reward $r_t^i$ and the next observation $o_{t+1}^i$. The algorithm uses an experience replay scheme that stores the experience $\xi_t^i = (o_t^i, a_t^i, r_t^i, o_{t+1}^i)$ obtained from CAVP $i$'s interaction with the environment in the memory $D^i$, whose storage capacity is $|D^i|$. At each time step, CAVP $i$ samples a mini-batch of experiences $\xi_j^i = (o_j^i, a_j^i, r_j^i, o_{j+1}^i)$ from $D^i$ for training.
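The action selection of Equation (13) and the basic replay memory can be sketched as follows; the prioritized sampling variant is sketched after Equation (16). Function and class names are hypothetical.

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Eq. (13): explore uniformly with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

class ReplayMemory:
    """Per-CAVP replay memory D^i with capacity |D^i| (plain storage, FIFO eviction)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
    def store(self, experience):        # experience = (o_t, a_t, r_t, o_next)
        if len(self.data) >= self.capacity:
            self.data.pop(0)
        self.data.append(experience)
```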
The experience replay scheme improves data utilization and breaks the correlation between samples through linear storage and random sampling of experiences. However, replaying all experiences with equal probability ignores their importance, so valuable experiences are not used efficiently. Therefore, this paper introduces the PER scheme. PER uses the temporal difference (TD) error, i.e., the difference between the current Q-value and the target Q-value, to measure the importance of each experience. The sampling priority of the $j$th experience is expressed as

$$p_j = \left| \delta_j \right| + \mu, \qquad (14)$$

where $\delta_j$ is the TD error of the $j$th experience and $\mu$ is a small positive constant that prevents the priority from being 0. The probability that the $j$th experience is sampled is defined as

$$P(j) = \frac{p_j^{\alpha}}{\sum_{k \in D^i} p_k^{\alpha}}, \qquad (15)$$
where $\alpha \in [0, 1]$ regulates how strongly the priorities affect sampling; when $\alpha = 0$, prioritized sampling reduces to uniform sampling. Sampling in proportion to priority naively requires scanning or sorting the stored priorities, which is memory-intensive and has high time complexity for a large replay memory. PER therefore uses a sum-tree data structure, which makes sampling and priority updates efficient [32].
However, sampling according to importance alters the sample distribution. The resulting bias must be compensated for with importance sampling weights; otherwise, the model would overfit to the high-priority experiences. The importance sampling weight is expressed as

$$w_j = \left( \frac{1}{|D^i|} \cdot \frac{1}{P(j)} \right)^{\beta}, \qquad (16)$$

where $\beta \in [0, 1]$ regulates the degree of bias compensation. When $\beta = 1$, the bias introduced by non-uniform sampling is entirely eliminated. $\beta$ is annealed linearly from $\beta_0$ to 1.
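The prioritization and bias correction of Equations (14)–(16) can be sketched as follows. For readability this version scans a flat priority array rather than a sum tree, and it normalizes the weights by their maximum, a common practical choice not stated in the paper.

```python
import numpy as np

def prioritized_sample(td_errors, batch_size, alpha=0.6, beta=0.4, mu=0.01):
    """Proportional prioritized sampling with importance-sampling weights (Eqs. (14)-(16))."""
    priorities = np.abs(td_errors) + mu                       # Eq. (14)
    probs = priorities**alpha / np.sum(priorities**alpha)     # Eq. (15)
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    weights = (1.0 / (len(td_errors) * probs[idx]))**beta     # Eq. (16)
    return idx, weights / weights.max()
```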
Next, we use the sampled experience $\xi_j^i = (o_j^i, a_j^i, r_j^i, o_{j+1}^i)$ to calculate the target Q-values. To improve the stability of the algorithm, a second network with the same structure and initial parameters as the Q-network, called the target Q-network, is introduced and used exclusively to compute the target Q-values. The parameters of the Q-network are updated at every training step and copied to the target Q-network every $N_u$ steps. The target Q-value at time step $t$ is calculated as

$$y_j^i = r_j^i + \gamma \, \hat{Q}^i\!\left( o_{j+1}^i, \arg\max_a Q^i(o_{j+1}^i, a; \theta_t^i); \hat{\theta}_t^i \right), \qquad (17)$$

where $\hat{Q}^i$ denotes the Q-values of CAVP $i$'s target Q-network and $\hat{\theta}_t^i$ are its parameters.
The DDQN algorithm updates the Q-network parameters by minimizing the mean squared error between the current Q-value and the target Q-value. The loss function can be expressed as

$$L_t^i(\theta_t^i) = \mathbb{E}\left[ \left( y_j^i - Q^i(o_j^i, a_j^i; \theta_t^i) \right)^2 \right]. \qquad (18)$$

The gradient of the loss function is

$$\nabla_{\theta_t^i} L_t^i(\theta_t^i) = -\mathbb{E}\left[ \left( y_j^i - Q^i(o_j^i, a_j^i; \theta_t^i) \right) \nabla_{\theta_t^i} Q^i(o_j^i, a_j^i; \theta_t^i) \right]. \qquad (19)$$

The Q-network parameters are updated according to the stochastic gradient descent (SGD) algorithm:

$$\theta_{t+1}^i = \theta_t^i + w_j \, \alpha_l \left( y_j^i - Q^i(o_j^i, a_j^i; \theta_t^i) \right) \nabla_{\theta_t^i} Q^i(o_j^i, a_j^i; \theta_t^i), \qquad (20)$$

where $\alpha_l$ is the learning rate.
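Putting Equations (17)–(20) together, one training step can be sketched in PyTorch as below: the online network selects the next action, the target network evaluates it, and the squared TD errors are weighted by $w_j$ before the gradient step. Tensor names and shapes are our own assumptions, not taken from the paper.

```python
import torch

def ddqn_per_step(q_net, target_net, optimizer, batch, is_weights, gamma=0.9):
    """One illustrative DDQN-PER update for CAVP i; returns the TD errors used in Eq. (14)."""
    obs, actions, rewards, next_obs = batch                      # [B, 6], [B] (long), [B], [B, 6]
    with torch.no_grad():
        next_a = q_net(next_obs).argmax(dim=1, keepdim=True)     # argmax_a Q(o', a; theta)
        next_q = target_net(next_obs).gather(1, next_a).squeeze(1)  # Q_hat(o', argmax; theta_hat)
        y = rewards + gamma * next_q                             # Eq. (17)
    q_pred = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    td_errors = y - q_pred
    w = torch.as_tensor(is_weights, dtype=q_pred.dtype)
    loss = (w * td_errors.pow(2)).mean()                         # weighted form of Eq. (18)
    optimizer.zero_grad()
    loss.backward()                                              # gradient of Eq. (19)
    optimizer.step()                                             # update of Eq. (20)
    return td_errors.detach()
```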

5. Simulation Analysis

In this section, we evaluate the performance of the proposed approach through simulation experiments. First, we introduce the simulation setup, performance metrics and contrasting approaches. Then, we analyze the simulation results. Simulation results show that the time–frequency resource allocation problem for radar and communication in the ISAC scenario can be effectively solved within the deep Q-learning framework.

5.1. Simulation Setup

We consider a road traffic scenario as described in Section 2. Vehicles travel at a uniform speed, and the distance d n between adjacent CAVPs in the same lane follows a uniform distribution U. The frame length is T. We consider the spatial location and environment around the vehicle to be semi-static, i.e., not changing significantly within frame T but changing dynamically over multiple frames [33]. The detailed scenario parameters are set as shown in Table 1.
The numbers of neurons in the input layer, the two hidden layers and the output layer of the network are 6, 24, 24 and 2M, respectively (one output per action in (8)). The hidden layers use the rectified linear unit (ReLU) as the activation function, and the output layer has no activation function. In training, one time step equals the frame length T, and each episode contains multiple time steps. The hyperparameters used to train the Q-networks are listed in Table 2.
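A PyTorch sketch of the Q-network described above (6 inputs, two hidden layers of 24 ReLU units and one linear output per action) follows; the choice M = 4, i.e., 8 outputs, is only an assumed example.

```python
import torch.nn as nn

class CAVPQNetwork(nn.Module):
    """6-24-24-(2M) multilayer perceptron used as the per-CAVP Q-function approximator."""
    def __init__(self, obs_dim=6, n_actions=8):   # n_actions = 2M; M = 4 assumed here
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 24), nn.ReLU(),
            nn.Linear(24, 24), nn.ReLU(),
            nn.Linear(24, n_actions),              # no activation on the output layer
        )

    def forward(self, obs):
        return self.net(obs)
```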

5.2. Performance Metrics

We propose two key metrics to evaluate the performance of the proposed approach (a computation sketch follows the list):
  • Miss detection probability: defined as the ratio of the number of unexpected events that are not successfully sensed or transmitted to the total number of unexpected events in each episode:

$$p_{miss} = \frac{\sum_{i=1}^{N_p} \sum_{t=0}^{N_t} \left( u_{C,t}^i + I_t^{i,C} + n_{over,t}^i + I_t^{i,R} \right)}{\sum_{i=1}^{N_p} \sum_{t=0}^{N_t} \left( u_{C,t}^i + u_{R,t}^i \right)}. \qquad (21)$$

  • Average system interference power: defined as the ratio of the sum of the interference power experienced by all CAVPs per episode to the number of training steps per episode:

$$P_{ave} = \frac{1}{N_t} \sum_{i=1}^{N_p} \sum_{t=0}^{N_t} P_{I,t}^i. \qquad (22)$$
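As an illustration, the two metrics can be accumulated from a per-step log as in the sketch below; the record field names are hypothetical.

```python
def episode_metrics(log, n_steps):
    """Eqs. (21)-(22): log is a list with one record per (CAVP, time step)."""
    missed = sum(r["u_C"] + r["I_C"] + r["n_over"] + r["I_R"] for r in log)
    total_events = sum(r["u_C"] + r["u_R"] for r in log)
    p_miss = missed / total_events if total_events else 0.0
    p_ave = sum(r["P_I"] for r in log) / n_steps
    return p_miss, p_ave
```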

5.3. Contrasting Approaches

There are four contrasting approaches:
  • DRQN: In [24], each vehicle performs sensing independently and does not communicate with other AVs. AVs only need to allocate sensing frequency resources. The authors in [24] introduce a long short-term memory (LSTM) network into the DQN algorithm, enabling the vehicle to learn how to select sensing sub-bands by incorporating both its current and past observations. In this case, miss detection occurs only when the vehicle does not sense successfully.
  • Random (CAVPs): We use a random policy to allocate time–frequency resources for the sensing and communication frames of the CAVPs, i.e., each CAVP randomly selects an action from the action space in Equation (8) with equal probability.
  • DDQN (CAVPs): We use the DDQN algorithm proposed in [31] to allocate time–frequency resources for the sensing and communication frames of CAVPs.
  • DQN (single CAVP): In [16], the DQN algorithm was introduced to allocate time resources for both sensing and communication functions of a single vehicle in a dynamic environment. Nevertheless, Ref. [16] did not account for interference between vehicles, implying that both sensing and communication could be successful regardless of the chosen strategy. We apply this algorithm to CAVPs, considering it as an upper-bound performance to evaluate the effectiveness of other algorithms.

5.4. Results Analysis

We first verify the algorithm’s convergence by using the average system reward as a measure of convergence. The average system reward is calculated as follows:
$$R_{ave} = \frac{1}{N_t} \sum_{i=1}^{N_p} \sum_{t=0}^{N_t} r_t^i. \qquad (23)$$
We selected five random seed values, namely, 10, 20, 30, 40 and 50, and calculated the average R a v e of the algorithms under these five random seeds. As shown in Figure 4, the R a v e of DDQN-PER gradually increases with the number of episodes and eventually stabilizes around 800 episodes. Compared with DDQN and random, DDQN-PER has better convergence performance.
Then, we evaluate the performance of the proposed approach and the contrasting approaches through the following four aspects:
  • The number of AVs changes, other conditions are fixed: In Figure 5, as N increases, p m i s s and P a v e increase for all algorithms except DQN. The reason is that as N increases, the mutual interference between the AVs increases, while DQN is used in a single-CAVP scenario, and the effect of interference is not considered. In addition, compared with DRQN, random has a lower P a v e . This is because the proposed communication-assisted sensing mechanism allows some vehicles to acquire sensory data through communication, eliminating the requirement for all vehicles to perform sensing tasks simultaneously, effectively reducing inter-vehicle interference. Nevertheless, random has a higher p m i s s than DRQN. This is because the p m i s s of random is not only affected by system interference but also influenced by the dynamic environment. Random cannot choose sensing or communication modes based on the environment, leading to a higher rate of p m i s s . As DDQN-PER utilizes the communication-assisted sensing mechanism and adaptive time–frequency resource allocation algorithm, it can effectively control the mutual interference between vehicles and reduce the miss detection probability. Thus, compared with other algorithms, p m i s s and P a v e of DDQN-PER are closer to DQN, and the performance does not deteriorate significantly as N increases. In conclusion, DDQN-PER proves to be more advantageous than other algorithms, particularly in scenarios with a higher volume of vehicles.
  • The number of sub-bands changes, other conditions are fixed: In Figure 6, as M increases, p m i s s and P a v e decrease for all algorithms except DQN. The reason is that as M increases, the probability of two signals switching to the same sub-band becomes small. Compared with other algorithms, p m i s s and P a v e of DDQN-PER are closer to DQN. It can be deduced that DDQN-PER holds a comparative advantage over other algorithms, especially when the number of sub-bands is limited.
  • The interval changes, other conditions are fixed: In Figure 7, when d n U (50 m, 60 m), p m i s s of DDQN-PER is higher than DRQN. However, when d n is small, DDQN-PER has better performance. In addition, compared with other algorithms, the P a v e of DDQN-PER is closer to 0. This is because when the interval is small, DRQN is greatly affected by interference and cannot effectively allocate frequency resources, while DDQN-PER can effectively control mutual interference and reduce the miss detection probability by using the communication-assisted sensing mechanism and adaptive time–frequency resource allocation algorithm. It can be concluded that DDQN-PER is more advantageous than other algorithms when the interval is smaller.
  • The p 1 m (i.e., the probability of the occurrence of an unexpected event under condition m = 1 ) changes, other conditions are fixed: In Figure 8, as p 1 m changes, p m i s s of DDQN-PER and DQN changes, while other algorithms are almost unchanged. This is because DDQN-PER and DQN can adaptively allocate time–frequency resources according to the environment. As DDQN-PER is affected by multi-vehicle interference, its p m i s s does not continue to decrease as p 1 m increases like DQN. Furthermore, compared with other algorithms, the P a v e of DDQN-PER is closer to 0.

6. Conclusions

In this paper, we propose a communication-assisted sensing mechanism based on TD-ISAC, which uses communication to transmit sensing information effectively. Moreover, we construct an interference model for multi-vehicle sensing and communication signals and formulate a multi-vehicle optimization problem based on the POMDP framework. We develop the DDQN-PER algorithm to enable vehicles to adjust their policies adaptively for decentralized time–frequency resource allocation in the dynamic environment. Simulation results show that the proposed scheme can significantly reduce miss detection probability and average system interference power. Furthermore, the method proposed in this paper also applies to the problem of multi-vehicle mutual interference where a TD-ISAC vehicle assists a radar-free vehicle.

Author Contributions

Methodology, Y.F.; validation and formal analysis, Z.F., J.H. and X.W.; writing—original draft preparation, Y.F.; writing—review and editing, Z.F., J.H. and X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China under Grant No. 2021YFB2900200 and the National Natural Science Foundation of China under Grant U20B2039.

Data Availability Statement

The data and code are available from the corresponding authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ma, D.; Shlezinger, N.; Huang, T.; Liu, Y.; Eldar, Y.C. Joint Radar-Communication Strategies for Autonomous Vehicles: Combining Two Key Automotive Technologies. IEEE Signal Process. Mag. 2020, 37, 85–97. [Google Scholar] [CrossRef]
  2. Sciuto, G.L.; Kowol, P.; Nowak, P.; Banás, W.; Coco, S.; Capizzi, G. Neural network developed for obstacle avoidance of the four wheeled electric vehicle. In Proceedings of the 30th IEEE International Conference on Electronics, Circuits and Systems (ICECS), Istanbul, Turkey, 4–7 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–4. [Google Scholar]
  3. Kowol, P.; Nowak, P.; Banaś, W.; Bagier, P.; Lo Sciuto, G. Haptic feedback remote control system for electric mechanical assembly vehicle developed to avoid obstacles. J. Intell. Robot. Syst. 2023, 107, 41. [Google Scholar] [CrossRef]
  4. Liu, F.; Cui, Y.; Masouros, C.; Xu, J.; Han, T.X.; Eldar, Y.C.; Buzzi, S. Integrated sensing and communications: Toward dual-functional wireless networks for 6G and beyond. IEEE J. Sel. Areas Commun. 2022, 40, 1728–1767. [Google Scholar] [CrossRef]
  5. Feng, Z.; Fang, Z.; Wei, Z.; Chen, X.; Quan, Z.; Ji, D. Joint radar and communication: A survey. China Commun. 2020, 17, 1–27. [Google Scholar] [CrossRef]
  6. Hassanien, A.; Amin, M.G.; Zhang, Y.D.; Ahmad, F. Signaling strategies for dual-function radar communications: An overview. IEEE Aerosp. Electron. Syst. Mag. 2016, 31, 36–45. [Google Scholar] [CrossRef]
  7. Liu, Y.; Liao, G.; Xu, J.; Yang, Z.; Zhang, Y. Adaptive OFDM integrated radar and communications waveform design based on information theory. IEEE Commun. Lett. 2017, 21, 2174–2177. [Google Scholar] [CrossRef]
  8. Zhang, Q.; Sun, H.; Gao, X.; Wang, X.; Feng, Z. Time-Division ISAC Enabled Connected Automated Vehicles Cooperation Algorithm Design and Performance Evaluation. IEEE J. Sel. Areas Commun. 2022, 40, 2206–2218. [Google Scholar] [CrossRef]
  9. Luong, N.C.; Lu, X.; Hoang, D.T.; Niyato, D.; Kim, D.I. Radio Resource Management in Joint Radar and Communication: A Comprehensive Survey. IEEE Commun. Surv. Tutor. 2021, 23, 780–814. [Google Scholar] [CrossRef]
  10. Chiriyath, A.R.; Paul, B.; Bliss, D.W. Radar-communications convergence: Coexistence, cooperation, and co-design. IEEE Trans. Cogn. Commun. Netw. 2017, 3, 1–12. [Google Scholar] [CrossRef]
  11. Lee, J.; Cheng, Y.; Niyato, D.; Guan, Y.L.; González G., D. Intelligent Resource Allocation in Joint Radar-Communication with Graph Neural Networks. IEEE Trans. Veh. Technol. 2022, 71, 11120–11135. [Google Scholar] [CrossRef]
  12. Kumari, P.; Gonzalez-Prelcic, N.; Heath, R.W. Investigating the IEEE 802.11ad Standard for Millimeter Wave Automotive Radar. In Proceedings of the 82nd IEEE Vehicular Technology Conference (VTC2015-Fall), Boston, MA, USA, 6–9 September 2015; pp. 1–5. [Google Scholar]
  13. Kumari, P.; Nguyen, D.H.N.; Heath, R.W. Performance trade-off in an adaptive IEEE 802.11AD waveform design for a joint automotive radar and communication system. In Proceedings of the 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4281–4285. [Google Scholar]
  14. Cao, N.; Chen, Y.; Gu, X.; Feng, W. Joint Bi-Static Radar and Communications Designs for Intelligent Transportation. IEEE Trans. Veh. Technol. 2020, 69, 13060–13071. [Google Scholar] [CrossRef]
  15. Ren, P.; Munari, A.; Petrova, M. Performance Analysis of a Time-sharing Joint Radar-Communications Network. In Proceedings of the 2020 International Conference on Computing, Networking and Communications (ICNC), Big Island, HI, USA, 17–20 February 2020; pp. 908–913. [Google Scholar]
  16. Hieu, N.Q.; Hoang, D.T.; Luong, N.C.; Niyato, D. iRDRC: An Intelligent Real-Time Dual-Functional Radar-Communication System for Automotive Vehicles. IEEE Wirel. Commun. Lett. 2020, 9, 2140–2143. [Google Scholar] [CrossRef]
  17. Hieu, N.Q.; Hoang, D.T.; Niyato, D.; Wang, P.; Kim, D.I.; Yuen, C. Transferable Deep Reinforcement Learning Framework for Autonomous Vehicles With Joint Radar-Data Communications. IEEE Trans. Commun. 2022, 70, 5164–5180. [Google Scholar] [CrossRef]
  18. Fan, Y.; Huang, J.; Wang, X.; Fei, Z. Resource allocation for v2x assisted automotive radar system based on reinforcement learning. In Proceedings of the 2022 14th International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, 1–3 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 672–676. [Google Scholar]
  19. Lee, J.; Niyato, D.; Guan, Y.L.; Kim, D.I. Learning to Schedule Joint Radar-Communication Requests for Optimal Information Freshness. In Proceedings of the 2021 IEEE Intelligent Vehicles Symposium (IV), Nagoya, Japan, 11–17 July 2021; pp. 8–15. [Google Scholar]
  20. Alland, S.; Stark, W.; Ali, M.; Hegde, M. Interference in Automotive Radar Systems: Characteristics, Mitigation Techniques, and Current and Future Research. IEEE Signal Process. Mag. 2019, 36, 45–59. [Google Scholar] [CrossRef]
  21. Zhang, M.; He, S.; Yang, C.; Chen, J.; Zhang, J. VANET-Assisted Interference Mitigation for Millimeter-Wave Automotive Radar Sensors. IEEE Netw. 2020, 34, 238–245. [Google Scholar] [CrossRef]
  22. Huang, J.; Fei, Z.; Wang, T.; Wang, X.; Liu, F.; Zhou, H.; Zhang, J.A.; Wei, G. V2X-communication assisted interference minimization for automotive radars. China Commun. 2019, 16, 100–111. [Google Scholar] [CrossRef]
  23. Khoury, J.; Ramanathan, R.; McCloskey, D.; Smith, R.; Campbell, T. RadarMAC: Mitigating Radar Interference in Self-Driving Cars. In Proceedings of the 13th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), London, UK, 27–30 June 2016; pp. 1–9. [Google Scholar]
  24. Liu, P.; Liu, Y.; Huang, T.; Lu, Y.; Wang, X. Decentralized Automotive Radar Spectrum Allocation to Avoid Mutual Interference Using Reinforcement Learning. IEEE Trans. Aerosp. Electron. Syst. 2021, 57, 190–205. [Google Scholar] [CrossRef]
  25. Chang, H.H.; Song, H.; Yi, Y.; Zhang, J.; He, H.; Liu, L. Distributive Dynamic Spectrum Access Through Deep Reinforcement Learning: A Reservoir Computing-Based Approach. IEEE Internet Things J. 2019, 6, 1938–1948. [Google Scholar] [CrossRef]
  26. Naparstek, O.; Cohen, K. Deep Multi-User Reinforcement Learning for Distributed Dynamic Spectrum Access. IEEE Trans. Wirel. Commun. 2019, 18, 310–323. [Google Scholar] [CrossRef]
  27. Lee, J.; Niyato, D.; Guan, Y.L.; Kim, D.I. Learning to Schedule Joint Radar-Communication with Deep Multi-Agent Reinforcement Learning. IEEE Trans. Veh. Technol. 2022, 71, 406–422. [Google Scholar] [CrossRef]
  28. Boban, M.; Kousaridas, A.; Manolakis, K.; Eichinger, J.; Xu, W. Use cases, requirements, and design considerations for 5G V2X. arXiv 2017, arXiv:1712.01754. [Google Scholar]
  29. Tampuu, A.; Matiisen, T.; Kodelja, D.; Kuzovkin, I.; Korjus, K.; Aru, J.; Aru, J.; Vicente, R. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE 2017, 12, e0172395. [Google Scholar] [CrossRef] [PubMed]
  30. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  31. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
  32. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
  33. Xing, Y.; Sun, Y.; Qiao, L.; Wang, Z.; Si, P.; Zhang, Y. Deep reinforcement learning for cooperative edge caching in vehicular networks. In Proceedings of the 13th International Conference on Communication Software and Networks (ICCSN), Chongqing, China, 4–7 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 144–149. [Google Scholar]
Figure 1. Multiple AVs driving in a dynamic environment.
Figure 2. The frame structure of sensing and communication.
Figure 3. The DDQN-PER algorithm flow of CAVP i.
Figure 4. Average system reward versus the number of episodes.
Figure 5. The number of AVs changes, other conditions are fixed. (a) Miss detection probability. (b) Average system interference power.
Figure 6. The number of sub-bands changes, other conditions are fixed. (a) Miss detection probability. (b) Average system interference power.
Figure 7. The interval changes, other conditions are fixed. (a) Miss detection probability. (b) Average system interference power.
Figure 8. The $p_1^m$ changes, other conditions are fixed. (a) Miss detection probability. (b) Average system interference power.
Table 1. Scenario parameter settings.
  • $[\tau_0^e, \tau_1^e]$, $e \in E$: [0.7, 0.3]
  • $[p_0^e, p_1^e]$, $e \in E$: [0.01, 0.1]
  • $v$: 10 m/s
  • $\Delta_0$: 4
  • $T$: 0.02 s
  • $P_R$: 25 dBm
  • $P_C$: 15 dBm
  • $G$: 48 dB
  • $A_e$: 5 mm²
  • $g$: 0.1
  • $L$: 3 m
  • $\eta_0$: 10
Table 2. Hyperparameter settings.
  • $\omega$: 200
  • $\varepsilon$: 0.9 → 0.01
  • $N_b$: 64
  • $\mu$: 0.01
  • $\alpha$: 0.6
  • $\beta$: 0.4 → 1.0
  • $N_u$: 20
  • $\gamma$: 0.9
  • $\alpha_l$: 0.0001
