Article

IMSBA: A Novel Integrated Sensing and Communication Beam Allocation Based on Multi-Agent Reinforcement Learning for mmWave Internet of Vehicles

by Jinxiang Lai 1, Deqing Wang 2,* and Yifeng Zhao 2
1 College of General Education, Fujian Polytechnic of Water Conservancy and Electric Power, Yongan 366000, China
2 Department of Information and Communication Engineering, Xiamen University, Xiamen 361005, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 6069; https://doi.org/10.3390/app15116069
Submission received: 27 April 2025 / Revised: 19 May 2025 / Accepted: 22 May 2025 / Published: 28 May 2025

Abstract

In a multi-beam communication scenario where Infrastructure-to-Vehicle (I2V) and Vehicle-to-Vehicle (V2V) communications coexist, limited spectrum resources force V2V users to reuse the orthogonal frequency bands allocated to I2V, inevitably introducing cross-layer interference between I2V and V2V. Furthermore, the adoption of a multi-beam communication architecture exacerbates beam interference, significantly degrading the overall network's communication and sensing performance. To address these challenges, this paper proposes an integrated sensing and communication (ISAC) beam allocation algorithm, termed IMSBA, which jointly optimizes beam direction, transmission power, and spectrum resource allocation to effectively mitigate the interference between I2V and V2V while maximizing overall network performance. Specifically, IMSBA employs a joint optimization framework combining Multi-Agent Proximal Policy Optimization (MAPPO) with a Stackelberg game. Within this framework, MAPPO leverages vehicle perception data to dynamically optimize V2V beam steering and frequency selection, while the Stackelberg game reduces computational complexity through hierarchical decision-making and optimizes the joint power allocation among V2V users. Additionally, the proposed scheme incorporates a V2V cooperative sensing domain-sharing mechanism to enhance system robustness under adverse conditions. The experimental results demonstrated that, compared with existing baseline schemes, IMSBA achieved a 92.5% improvement in V2V energy efficiency while significantly enhancing both communication and sensing performance. This study provides an efficient and practical solution for spectrum-constrained scenarios in the millimeter-wave Internet of Vehicles (IoV), offering substantial theoretical insight and practical value for the efficient operation of intelligent transportation systems (ITSs).

1. Introduction

In the 5G and 6G eras, the Internet of Vehicles (IoV) promotes interactive connectivity among vehicles and between vehicles and infrastructure, serving as a key technology for intelligent transportation systems [1]. Vehicle-to-Everything (V2X) technology enables road information sharing and plays a crucial role in the IoV [2]. To enhance communication performance in the IoV, V2X needs to support high-rate, low-latency data transmission in high-mobility scenarios [3]. With the expansion of future 6G networks, the performance requirements for data transmission in the IoV will further increase [4]. Additionally, sensing capability is essential in the IoV, enabling functions such as obstacle detection and target positioning, thereby improving driving safety and communication link stability [5]. The development of Millimeter Wave (mmWave) and Massive Multiple-Input Multiple-Output (mMIMO) technologies offers new solutions for simultaneously enhancing communication and sensing performance in the IoV. The large bandwidth of mmWave bands allows for higher data transmission rates, but compared with sub-6 GHz bands, mmWave faces more severe environmental challenges, including higher path loss, more complex large- and small-scale fading, and interference [6,7]. To overcome these challenges, mMIMO technology uses beamforming to concentrate energy into directional narrow beams, effectively compensating for the path loss of mmWave signals [7]. Additionally, the large bandwidth of mmWave bands helps improve sensing resolution, meeting the sensing capability requirements of the IoV [8,9].
Although beamforming can compensate for mmWave path loss, communication and sensing performance decline sharply if the narrow beam is not aligned with the target. To align the beam with the target, traditional beam training uses a set of pilots to scan the angular range in which the target is located. The target receiver calculates the Signal-to-Noise Ratio (SNR) for each pilot and feeds back the index of the pilot with the highest SNR to the transmitter via the uplink, enabling the transmitter to construct the beamforming vector [10]. To reduce the overhead of beam training, Ref. [11] proposes an algorithm that combines beam training and angular velocity estimation to obtain the beam coherence time; the frequency of beam training is then adaptively adjusted based on the beam coherence time to reduce training overhead. Beam training requires numerous pilots, leading to significant time overhead and the occupation of communication resources. Additionally, the beam alignment results may become outdated due to latency, making it difficult to meet the data transmission requirements of the highly mobile IoV. To reduce beam scanning latency, channel correlation can be exploited to decrease the number of pilots, a technique known as beam tracking. Ref. [12] utilized a small number of pilots to obtain channel state information and then combined the state and measurement equations of Kalman filtering to estimate and predict the beam direction: the estimate is used as the beam direction for the current slot, and the prediction is used as the pilot direction for the next slot. Other variants of Kalman filter-based beam tracking methods have been derived from this approach [13]. However, these methods require specific state evolution models and have a high computational complexity.
Beam allocation schemes based on deep learning (DL) have gained wide attention due to their data-driven and model-free advantages, significantly improving algorithm generality [14]. Long Short-Term Memory (LSTM) networks excel in handling and predicting time series, making them well-suited for channel prediction [15]. The main idea in [15] is to use LSTM to process historical channel state sequences within an unscented Kalman filter framework, predict future channel states, and then estimate beam directions using the Kalman filter update formula. In [16], multiple LSTM layers are employed to extract temporal features from historical angle sequences. After receiving transmitted signals, the user equipment adjusts the angles and feeds them back to the transmitter to predict the next angle. While DL-based beam tracking has shown promising results, constructing datasets is challenging, significantly increasing deployment costs.
In reinforcement learning (RL), agents improve their decisions through trial and error while interacting with the environment, making RL well suited to unknown environments and hence an attractive technology [17,18]. Refs. [19,20,21] used RL to predict pilot directions to enhance pilot-based channel estimation accuracy. However, using pilots is unsuitable for high-mobility scenarios. In contrast, Refs. [22,23,24,25] suggest using RL to predict the beam direction for the next slot directly, reducing pilot-related overhead. Specifically, Refs. [22,23] employed Q-learning to predict beam directions based on historical channel states, but this method struggles in complex environments due to the difficulty of maintaining high-dimensional Q-tables. Refs. [24,25] utilized deep reinforcement learning (DRL) for prediction, leveraging feature extraction from high-dimensional data to compress the state and action spaces, making it suitable for environments with large state and action spaces. Most of the aforementioned methods are based on pilot and uplink feedback mechanisms, which involve substantial time overhead. In this context, researchers have introduced ISAC technology, utilizing mmWave signals with ISAC functions to achieve lower-latency beam allocation, which has been widely studied in recent years [26,27,28]. Moreover, ISAC technology offers significant advantages in improving spectrum efficiency and reducing equipment costs [29].
In [2], the authors propose using the Extended Kalman Filter (EKF) for ISAC beam tracking, accurately tracking and predicting the target's motion parameters from sensing data without additional pilot overhead. Building on this, Ref. [30] added feedback of the target vehicle's angle of arrival (AOA) and speed, achieving better beam tracking accuracy on curved road trajectories. These studies correct the beam direction in every time slot to maintain precise beam pointing, but this may introduce unnecessary time overhead. In [3], the authors propose using ISAC signals to sense and predict the target's motion parameters, combined with a factor graph message passing algorithm to improve tracking accuracy, and then designing the transmit beamformer based on the predicted angles to establish a reliable communication link. Although this scheme achieved good beam tracking accuracy, the authors did not consider dynamic transmit power allocation. Given that the EKF uses a first-order Taylor series approximation, its accuracy is limited. To address beam tracking under UAV jitter, Ref. [31] used the Unscented Kalman Filter (UKF) with second-order accuracy, combining ISAC sensing information to achieve more accurate beam direction predictions. However, most of these schemes employ idealized road models, which degrades their performance on real-world road trajectories. To enhance algorithm generalization, researchers have combined ISAC with DL. In [32], the authors input ISAC signal echoes into a deep neural network (DNN) for training, with the network outputting the predicted beam direction for the next time slot. In [14], convolutional neural networks (CNNs) and LSTM networks extracted spatiotemporal features from ISAC echoes, directly outputting the predicted downlink beamforming matrix to maximize the achievable sum rate under sensing constraints. While these schemes achieve good results, they also face high dataset construction costs and do not consider resource allocation issues.
In this paper, we propose a novel algorithm named IMSBA that combines ISAC technology with MAPPO and uses Stackelberg game theory for joint power allocation between I2V and V2V. Specifically, the algorithm first optimizes the beam directions and frequency band allocation of the V2V vehicles using MAPPO. Subsequently, in each time slot, it employs a Stackelberg game strategy to jointly optimize the transmission power of I2V and V2V. Additionally, vehicles achieve sensing domain sharing through V2V communication, utilizing sensing information from other vehicles to switch target vehicles and reduce the path loss caused by communication distance or obstruction. For clarity, the contributions of this work are summarized as follows:
  • To address the challenge of environmental instability, this study employed the multi-agent reinforcement learning algorithm MAPPO to optimize V2V vehicle beam direction and frequency band allocation, reducing interference and enhancing overall sensing–communication performance.
  • We separated the joint power allocation of I2V and V2V from the reinforcement learning policy and used Stackelberg game theory and iterative optimization to obtain an approximate optimal solution. This method reduces the action space size and improves the learning efficiency of the agents.
  • V2V communication enables the sharing of sensing domains between vehicles, utilizing the sensing information from other vehicles to switch to closer target vehicles, effectively reducing the path loss caused by communication distance and obstructions.
This paper is structured as follows: Section 2 introduces the system model, Section 3 presents the beam allocation algorithm, Section 4 presents the simulation results, and Section 5 concludes the paper.
Notations: A matrix is denoted by a bold uppercase letter (e.g., $\mathbf{A}$), a vector is denoted by a bold lowercase letter (e.g., $\mathbf{a}$), and a scalar is denoted by an ordinary letter (e.g., $a$). Subscripts indicate the vehicle index and the time slot (e.g., $\theta_{k,n}$ indicates the beam direction of vehicle $k$ at the $n$th time slot).

2. System Model

In our system, we consider $K + M$ vehicles equipped with mmWave mMIMO arrays on a road, as shown in Figure 1. Among them, $K$ vehicles use ISAC signals for V2V communication, simultaneously sharing road safety information and sensing the motion parameters of target vehicles; $M$ vehicles conduct I2V communication to receive data from the RSU. Considering that the vehicle trajectory may be aligned with the roadway, for simplicity, it is assumed that the RSU knows the motion parameters of all I2V vehicles. To improve spectrum utilization, it is assumed that the I2V links use orthogonal frequency bands, while the V2V links reuse these orthogonal frequency bands. Here, $k$ denotes a V2V vehicle, $\mathcal{K}_k$ denotes the set of target vehicles of source vehicle $k$, $K'$ denotes the size of $\mathcal{K}_k$, and $m$ denotes an I2V vehicle. For convenience of exposition, the vehicles are modeled as point targets following standard assumptions in the literature [33]. To avoid interference between the echo signals and the transmitted signals, each vehicle is equipped with two separate mMIMO arrays with $N_t$ and $N_r$ antennas. For simplicity, it is assumed that each mMIMO array is a uniform linear array (ULA) [2]. The RSU is equipped with a mMIMO ULA with $N_t^R$ transmit antennas. In the following subsections, we first present the general framework and then provide a detailed description of the signal model.
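Since the ULA steering vectors $\mathbf{a}$ and $\mathbf{b}$ recur throughout the signal model below, a minimal numpy sketch is given here; the half-wavelength element spacing, the unit-norm scaling, and the function name are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def ula_steering(theta, n_antennas, spacing=0.5):
    """Steering vector of an N-element uniform linear array (ULA).

    theta: angle in radians; spacing: element spacing in wavelengths
    (half-wavelength assumed here). Returns a unit-norm complex vector.
    """
    n = np.arange(n_antennas)
    return np.exp(1j * 2 * np.pi * spacing * n * np.sin(theta)) / np.sqrt(n_antennas)

# Transmit and receive steering vectors a(theta), b(theta) for a 30-degree target.
a = ula_steering(np.deg2rad(30.0), 64)
b = ula_steering(np.deg2rad(30.0), 64)
```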

2.1. The General Framework

The general framework of this paper is shown in Figure 2, and the specific workflow is described as follows:

2.1.1. Beam Direction and Frequency Band Allocation

To address the challenge of environmental instability, this study employed the MAPPO algorithm to optimize the beam direction and frequency band allocation for V2V vehicles. Specifically, in the $n$th time slot, vehicle $k$ defines the local state of the current time slot based on the sensing information $\mathbf{SI}_{k,n-1}$ of all target vehicles from the previous time slot, the interference power $\mathbf{I}_c$ received during beam communication, and the interference power $\mathbf{I}_r$ received during echo sensing. Based on the local state, MAPPO is used to overcome environmental instability and optimize the relative change in beam direction $\mathbf{I}_{k,n}$ and the frequency band allocation $c_{k,n}$ for vehicle $k$, thereby further reducing interference and enhancing overall sensing–communication performance.

2.1.2. Joint Power Allocation

This study separated the joint power allocation of I2V and V2V from the reinforcement learning strategy and optimized it using a Stackelberg game. Specifically, after step (a) is completed, the Stackelberg game is used to split the joint power allocation problem into two subproblems of lower complexity. Iterative optimization is then used to obtain the approximately optimal power allocations $p_{k',n}$ and $p_{m,n}$ for V2V and I2V. This method reduces the action space size and improves the learning efficiency of the agents.

2.1.3. Target Vehicle Switching

Within the $n$th time slot, vehicle $k$ uses V2V communication to share sensing domains with other vehicles. At the end of the time slot, vehicle $k$ switches to a new target vehicle $k'$ using the sensing information from other vehicles, effectively reducing the path loss caused by communication distance and obstructions. At the start of the next time slot, the procedure returns to step (a) and continues.

2.2. Radar Signal Model

In the $n$th time slot, the $K'$ downlink ISAC streams transmitted by vehicle $k$ can be expressed as $\mathbf{s}_{k,n} = [s_{1,n}, \ldots, s_{k',n}, \ldots, s_{K',n}]^T \in \mathbb{C}^{K' \times 1}$, where $\mathbb{E}[|s_{k',n}|^2] = 1$ [34]. After beamforming, the transmitted signal can be expressed as
$$\tilde{\mathbf{s}}_{k,n} = \mathbf{F}_{k,n} \mathbf{s}_{k,n} \in \mathbb{C}^{N_t \times 1}, \tag{1}$$
where $\mathbf{F}_{k,n} \in \mathbb{C}^{N_t \times K'}$ represents the transmit beamforming matrix, and $N_t$ denotes the number of transmit antennas. The echo signal reflected by the target vehicle is input into a radar matched filter to eliminate the effects of Doppler shift and delay. The processed echo signal of target vehicle $k'$ can be expressed as [35]
$$r_{k',n} = \zeta \sqrt{p_{k',n} G \phi_{k',n}}\, \mathbf{g}^H(\bar{\theta}_{k',n}) \mathbf{b}(\theta_{k',n}) \mathbf{a}^H(\theta_{k',n}) \mathbf{f}(\bar{\theta}_{k',n}) s_{k',n} + \mathbf{g}^H(\bar{\theta}_{k',n}) \Bigg[ \sum_{m=1}^{M} \psi_{m,k} \zeta \sqrt{p_{m,n}}\, \mathbf{H}_{R,k} \mathbf{f}_m q_{m,n} + \sum_{i=1, i \neq k}^{K} \sum_{j=1}^{K'} \psi_{i,k} \zeta \sqrt{p_{j,n}}\, \mathbf{H}_{i,k} \mathbf{f}_j s_{j,n} + \sum_{v=1, v \neq k'}^{K'} \zeta \sqrt{p_{v,n}}\, \mathbf{H}_v \mathbf{f}_v s_{v,n} \Bigg] + z_r, \tag{2}$$
where $\zeta = \sqrt{N_t N_r}$ represents the antenna array gain; $p_{k',n}$ denotes the transmit power of the beam directed at target vehicle $k'$; $G$ represents the matched filter gain; $\mathbf{g}^H(\bar{\theta}_{k',n})$ represents the receive beamforming vector of the echo signal; $\mathbf{a}$ and $\mathbf{b}$ are the transmit and receive steering vectors, respectively; $\theta_{k',n}$ represents the true angle of target vehicle $k'$ relative to vehicle $k$; $\mathbf{f}(\bar{\theta}_{k',n}) = \mathbf{a}(\bar{\theta}_{k',n})$ denotes the $k'$th column of $\mathbf{F}_{k,n}$; $\bar{\theta}_{k',n}$ represents the beam direction selected through multi-agent reinforcement learning; $\psi_{m,k} = 1$ indicates that vehicle $m$ and vehicle $k$ use the same frequency band, while $\psi_{m,k} = 0$ indicates different frequency bands, with a similar meaning for $\psi_{i,k}$; $p_{m,n}$ denotes the transmit power of the beam directed at vehicle $m$ by the RSU; $\mathbf{H}_{R,k}$ represents the channel matrix from the RSU to vehicle $k$; $\mathbf{H}_{i,k}$ represents the channel matrix from vehicle $i$ to vehicle $k$; $\mathbf{H}_v$ represents the reflection channel matrix of target vehicle $v$; $z_r$ represents complex additive white Gaussian noise with zero mean and variance $\sigma_r^2$; $q_{m,n}$ represents the downlink symbol transmitted by the RSU, with $\mathbb{E}[|q_{m,n}|^2] = 1$; and the reflection coefficient is $\phi_{k',n} = \varepsilon_{k',n} (2 d_{k',n})^{-1}$.
The variance of the sensing noise is directly proportional to the reciprocal of the sensing signal-to-interference-plus-noise ratio (SINR), which can be expressed as
$$\omega_{k',n}^r = \frac{\zeta^2 p_{k',n} G \phi_{k',n} \left| \mathbf{g}^H(\bar{\theta}_{k',n}) \mathbf{b}(\theta_{k',n}) \mathbf{a}^H(\theta_{k',n}) \mathbf{f}(\bar{\theta}_{k',n}) \right|^2}{\sigma_r^2 + I_{\mathrm{I2V}} + I_{\mathrm{V2V}} + I_{\mathrm{Echo}}}, \tag{3}$$
where $I_{\mathrm{I2V}}$, $I_{\mathrm{V2V}}$, and $I_{\mathrm{Echo}}$ represent the interference power of the I2V, V2V, and echo signals, respectively. The sensing SINR is affected by the transmit power, beam direction, and frequency band allocation.

2.3. Communication Signal Model

In V2V communication, the source vehicle embeds its position, velocity, and beam direction information into the communication signal. The target vehicle demodulates the communication signal to obtain these parameters, compensates for the Doppler shift and delay, and constructs a beamformer to maximize the communication SINR. In time slot $n$, the received signal at target vehicle $k'$ can be expressed as
$$c_{k',n} = \zeta \sqrt{p_{k',n} \kappa_{k',n}}\, \mathbf{w}^H(\bar{\theta}_{k',n}) \mathbf{u}(\theta_{k',n}) \mathbf{a}^H(\theta_{k',n}) \mathbf{f}(\bar{\theta}_{k',n}) s_{k',n} + \mathbf{w}^H(\bar{\theta}_{k',n}) \Bigg[ \sum_{m=1}^{M} \psi_{m,k} \zeta \sqrt{p_{m,n}}\, \mathbf{H}_{R,k} \mathbf{f}_m q_{m,n} + \sum_{i=1}^{K} \sum_{j=1}^{K'} \delta_{i,j} \psi_{i,k} \zeta \sqrt{p_{j,n}}\, \mathbf{H}_{i,k} \mathbf{f}_j s_{j,n} + \psi_{k,k} \sum_{v=1}^{K'} \zeta \sqrt{p_{v,n}}\, \mathbf{H}_v \mathbf{f}_v s_{v,n} \Bigg] + z_c, \tag{4}$$
where $\mathbf{w}^H(\bar{\theta}_{k',n})$ represents the communication receive beamforming vector; $\mathbf{u}$ denotes the communication receive steering vector; and $\kappa_{k',n}$ represents the path loss of the signal from source vehicle $k$ to target vehicle $k'$. When $i = k$ and $j = k'$, the beam does not cause self-interference, and hence $\delta_{i,j} = 0$; otherwise, $\delta_{i,j} = 1$. $z_c$ represents complex additive white Gaussian noise with zero mean and variance $\sigma_c^2$. The communication SINR at target vehicle $k'$ can be expressed as
$$\omega_{k',n}^c = \frac{\zeta^2 p_{k',n} \kappa_{k',n} \left| \mathbf{w}^H(\bar{\theta}_{k',n}) \mathbf{u}(\theta_{k',n}) \mathbf{a}^H(\theta_{k',n}) \mathbf{f}(\bar{\theta}_{k',n}) \right|^2}{\sigma_c^2 + I_{\mathrm{I2V}} + I_{\mathrm{V2V}} + I_{\mathrm{Echo}}}. \tag{5}$$
Similar to the sensing SINR, the communication SINR is also directly related to the transmit power, beam direction, and frequency band allocation. From the above formula, the achievable communication rate between source vehicle $k$ and target vehicle $k'$ can be expressed as
$$R_{k',n}^c = \log_2\left(1 + \omega_{k',n}^c\right). \tag{6}$$
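As a minimal numerical sketch of this mapping (illustrative values only; in the full model the interference terms follow from the beam, band, and power choices above):

```python
import numpy as np

def comm_rate(p_tx, kappa, zeta, bf_gain, sigma2, i_i2v, i_v2v, i_echo):
    """Communication SINR omega^c and achievable rate R^c = log2(1 + omega^c).

    bf_gain stands for the squared beamforming term |w^H u a^H f|^2.
    """
    omega_c = (zeta**2 * p_tx * kappa * bf_gain) / (sigma2 + i_i2v + i_v2v + i_echo)
    return np.log2(1.0 + omega_c)

# Illustrative numbers only: a 1 W beam with a 64x64 antenna array (zeta = 64).
rate = comm_rate(p_tx=1.0, kappa=1e-6, zeta=64.0, bf_gain=0.9,
                 sigma2=1e-9, i_i2v=1e-8, i_v2v=5e-9, i_echo=1e-9)
```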
In I2V communication, since the RSU uses orthogonal frequency bands to communicate with each vehicle, the received signal at target vehicle $m$ can be expressed as
$$c_{m,n} = \zeta \sqrt{p_{m,n}}\, \kappa_{m,n} \mathbf{w}^H(\bar{\theta}_{m,n}) \mathbf{u}(\theta_{m,n}) \mathbf{a}^H(\theta_{m,n}) \mathbf{f}(\bar{\theta}_{m,n}) q_{m,n} + \mathbf{w}^H(\bar{\theta}_{m,n}) \sum_{i=1}^{K} \sum_{j=1}^{K'} \psi_{i,m} \zeta \sqrt{p_{j,n}}\, \mathbf{H}_{i,m} \mathbf{f}_j s_{j,n} + z_c, \tag{7}$$
where $\kappa_{m,n}$ represents the path loss of the signal from the RSU to vehicle $m$. Since the RSU knows the motion parameters of all I2V vehicles, precise beam alignment can be achieved, i.e., $\bar{\theta}_{m,n} = \theta_{m,n}$. The received signal at target vehicle $m$ can then be rewritten as
$$c_{m,n} = \zeta \sqrt{p_{m,n}}\, \kappa_{m,n} q_{m,n} + \mathbf{w}^H(\theta_{m,n}) \sum_{i=1}^{K} \sum_{j=1}^{K'} \psi_{i,m} \zeta \sqrt{p_{j,n}}\, \mathbf{H}_{i,m} \mathbf{f}_j s_{j,n} + z_c. \tag{8}$$
According to the above formula, the total achievable rate for I2V can be expressed as
$$R_n^R = \sum_{m=1}^{M} \log_2\left(1 + \frac{\zeta^2 p_{m,n} \kappa_{m,n}^2}{\sigma_c^2 + I_{\mathrm{V2V}}}\right). \tag{9}$$
It can be seen that $R_n^R$ is related not only to the interference power of V2V but also directly to the transmit power of I2V. Therefore, joint power allocation for I2V and V2V is crucial for enhancing the overall sensing–communication performance of the network. In the following section, we introduce the I2V and V2V beam allocation problem aimed at maximizing the overall sensing–communication performance and present the corresponding optimization algorithm.

3. Beam Allocation Algorithm Based on MAPPO and Stackelberg Game

3.1. Problem Formulation

The beam directions $\bar{\boldsymbol{\theta}}_{k,n} = [\bar{\theta}_{1,n}, \ldots, \bar{\theta}_{K',n}]$ of vehicle $k$ are calculated from the relative changes $\mathbf{I}_{k,n} = [I_{1,n}, \ldots, I_{K',n}]$ as follows:
$$\bar{\boldsymbol{\theta}}_{k,n} = \left[ \hat{\theta}_{1,n-1} + I_{1,n} \Delta\theta, \ldots, \hat{\theta}_{k',n-1} + I_{k',n} \Delta\theta, \ldots, \hat{\theta}_{K',n-1} + I_{K',n} \Delta\theta \right], \tag{10}$$
where $\hat{\theta}_{k',n-1}$ represents the beam direction corrected by sensing in the previous time slot, and $\Delta\theta$ denotes the angle interval between two adjacent indices in the codebook. $\mathbf{p}_{k,n} = [p_{1,n}, \ldots, p_{K',n}]$ indicates the transmit power of each beam of vehicle $k$; $c_{k,n} \in \{0, 1, \ldots, M-1\} = \mathcal{C}$ represents the frequency band index used by vehicle $k$, where each V2V vehicle can select only one frequency band, while a frequency band can be used by multiple vehicles.
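A minimal sketch of this codebook update (names are illustrative; $\Delta\theta$ is the codebook interval and the relative changes are restricted to $\{-1, 0, 1\}$ per constraint (11g) below):

```python
import numpy as np

def update_beam_directions(theta_hat_prev, rel_change, delta_theta):
    """theta_bar_{k',n} = theta_hat_{k',n-1} + I_{k',n} * delta_theta per beam."""
    return np.asarray(theta_hat_prev) + np.asarray(rel_change) * delta_theta

# Two beams: step one codebook index down and one up, respectively.
theta_bar = update_beam_directions(np.deg2rad([28.0, 45.0]), [-1, 1], np.deg2rad(1.0))
```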
According to the above signal model and the definitions of the related variables, the original beam allocation problem can be expressed as follows:
$$\underset{p_{k',n},\, I_{k',n},\, c_{k,n},\, p_{m,n},\, \forall k, m}{\text{maximize}} \quad \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( E_{k,n} + R_{k,n}^{r} + R_n^R \right) \tag{11}$$
$$\text{s.t.} \quad p_{\min} \le p_{k',n} \le p_{\max},\ \forall k' \in \mathcal{K}_k, \tag{11a}$$
$$\sum_{k'=1}^{K'} p_{k',n} \le P_{\max}, \tag{11b}$$
$$p_{m,n} \le p_{\max}^{R},\ \forall m \in \mathcal{M}, \tag{11c}$$
$$E_{k,n} \ge \Gamma_E,\ \forall k \in \mathcal{K}, \tag{11d}$$
$$R_n^R \ge \Gamma_R^c, \tag{11e}$$
$$R_{k,n}^{r} \ge \Gamma_V^r,\ \forall k \in \mathcal{K}, \tag{11f}$$
$$I_{k',n} \in \{-1, 0, 1\},\ \forall k' \in \mathcal{K}_k, \tag{11g}$$
$$c_{k,n} \in \{0, 1, \ldots, M-1\},\ \forall k \in \mathcal{K}, \tag{11h}$$
$$\sum_{c \in \mathcal{C}} \psi_{k,c} = 1,\ \forall k \in \mathcal{K}, \tag{11i}$$
$$\sum_{k \in \mathcal{K}} \psi_{k,c} \le K,\ \forall c \in \mathcal{C}, \tag{11j}$$
$$\psi_{k,c} \in \{0, 1\}, \tag{11k}$$
where $E_{k,n} = \sum_{k'=1}^{K'} E_{k',n} = \sum_{k'=1}^{K'} R_{k',n}^c / p_{k',n}$ represents the total energy efficiency of vehicle $k$; for ease of solution, we replace the sensing SINR with $R_{k,n}^r = \sum_{k'=1}^{K'} R_{k',n}^r = \sum_{k'=1}^{K'} \log_2(1 + \omega_{k',n}^r)$, which represents the total sensing rate of vehicle $k$. Constraints (11a)–(11c) concern the transmit power, including the transmit power of each beam and the total transmit power. Constraint (11d) concerns the energy efficiency of vehicle $k$. Constraint (11e) concerns the total achievable rate of I2V. Constraint (11f) concerns the sensing rate of vehicle $k$. Constraint (11g) specifies the range of the relative change in the beam index. Constraint (11h) defines the range of the frequency band index. Constraints (11i)–(11k) specify that a vehicle can select only one frequency band, but a frequency band can be selected by multiple vehicles.
The original beam allocation problem is a mixed-integer nonlinear programming (MINLP) problem, and since it requires the sensing information of the target vehicles for assistance, it is also a sequential decision problem. Additionally, beamforming concentrates energy in the specified direction, effectively reducing interference in other directions. Therefore, maximizing the overall sensing–communication performance can be approximated by maximizing the individual sensing–communication performance of each vehicle. To simplify the solution process, and considering that the optimization objective already maximizes the energy efficiency, the total I2V achievable rate, and the sensing rate of each vehicle, we relax constraints (11d)–(11f) and temporarily disregard their effects.

3.2. Beam Allocation Algorithm Based on MAPPO

V2V links reuse the orthogonal frequency bands of I2V, leading to inter-vehicle interference between V2V and I2V. Additionally, V2V supports multi-beam multi-target communication, resulting in inter-beam interference. To mitigate the impact of interference, the beam direction and frequency band allocation for V2V are crucial. These factors not only affect the sensing–communication performance of the current V2V link but also significantly impact the sensing–communication performance of other V2V and I2V links. If V2V vehicles are treated as independent single agents for decision-making, the reward for the current agent under the same local state–action pair may vary due to the actions of other agents. This environmental instability prevents the agent from training properly.
To address the issue of environmental instability, the multi-agent reinforcement learning framework of centralized training and distributed execution has emerged. In this framework, each agent only needs to input its locally observed state into the decision network to make local decisions, saving communication costs between agents. During training, the global state is input into a centralized network for evaluation, guiding the update of the decision network. Considering that the objective function of problem (11) is to maximize overall sensing–communication performance, we adopt the MAPPO algorithm to promote cooperation among agents, achieving better beam direction and frequency band allocation. In the MAPPO algorithm, the actor network serves as the decision network, while the critic network serves as the centralized network. The critic network evaluates the value of the global state to guide the update of the decision network.
For ease of expression, in this section, V2V vehicles are also referred to as agents, and their local state can be expressed as
$$o_{k,n} = \left\{ \mathbf{I}_{k \to m},\, \mathbf{I}_c,\, \mathbf{I}_r,\, c_{k,n-1},\, \mathbf{P}_{k,n-1},\, v_{k,n-1},\, SI_{1,n-1}, \ldots, SI_{K',n-1} \right\}, \tag{12}$$
where the local state at time slot $n$ is composed of values from time slot $n-1$. $\mathbf{P}_{k,n-1}$ and $v_{k,n-1}$ represent the position and velocity of vehicle $k$, respectively; $\mathbf{SI}_{k,n-1} = \{SI_{1,n-1}, \ldots, SI_{K',n-1}\}$ represents the sensing information of each beam's target vehicle, including position, velocity, and direction angle; $c_{k,n-1}$ denotes the frequency band index selected by vehicle $k$; $\mathbf{I}_{k \to m}$ represents the interference power imposed by vehicle $k$ on each I2V vehicle $m$; $\mathbf{I}_c$ represents the interference power received by vehicle $k$ on each beam's downlink communication; and $\mathbf{I}_r$ represents the interference power received by vehicle $k$ on each beam's echo signal. Since the sensing information includes noise, this decision process is modeled as a partially observable Markov decision process (POMDP). To improve decision-making, we modify the actor and critic networks to the DRQN network structure, utilizing temporal features from the historical state sequence to compensate for the current state. The global state $s_n$, which comprises the local states of all agents, can be expressed as
$$s_n = \left\{ o_{1,n}, \ldots, o_{K,n} \right\}. \tag{13}$$
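As a minimal PyTorch sketch of the DRQN-style actor described above, a GRU summarizes the historical local-state sequence before the policy head; the layer sizes, the joint discrete action head, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Actor whose GRU summarizes the historical local-state sequence o_{k,1..n}."""

    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, h0=None):
        out, h = self.gru(obs_seq, h0)            # out: (batch, seq_len, hidden)
        logits = self.head(out[:, -1])            # decide from the latest step
        return torch.distributions.Categorical(logits=logits), h

# Joint discrete action over (beam step, band index), e.g., 3 steps x 2 bands = 6.
actor = RecurrentActor(obs_dim=32, n_actions=6)
dist, h = actor(torch.randn(4, 10, 32))          # batch of 4 length-10 histories
action = dist.sample()
```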
The action of agent $k$ is denoted as $a_{k,n} = \{\mathbf{I}_{k,n}, c_{k,n}\}$, where $\mathbf{I}_{k,n}$ represents the relative change in each beam direction and $c_{k,n}$ represents the frequency band index selected in the current time slot. Regarding the reward, to promote cooperation, all agents receive the same reward, representing the overall reward of the environment. Therefore, the reward function is the objective function of problem (11):
$$r_n = \sum_{k=1}^{K} \left( w_e E_{k,n} + w_r R_{k,n}^r \right) + w_R R_n^R, \tag{14}$$
where $w_e$, $w_r$, and $w_R$ represent the weights of each part of the reward. By adjusting these weights, resources can be biased towards a specific objective.
To facilitate the training of the GRU in the network, at the end of each episode, the trajectory of the episode is saved to the experience pool. Each time step in the trajectory is saved as a tuple $(O, S, A, O', S', R)$, where $O = \{o_{1,n}, \ldots, o_{K,n}\}$ is the set of local states of all agents at the current time slot $n$; $S$ is the global state at the current time slot $n$; $A = \{a_{1,n}, \ldots, a_{K,n}\}$ is the set of actions of all agents at the current time slot $n$; $O' = \{o_{1,n+1}, \ldots, o_{K,n+1}\}$ is the set of local states of all agents at the next time slot $n+1$; $S'$ is the global state at the next time slot $n+1$; and $R$ is the set of rewards of all agents at the current time slot $n$, where each element is equal to $r_n$. Since MAPPO is an on-policy algorithm, training is performed once the number of episodes in the experience pool reaches the batch size $s_{amp}$, and the experience pool is cleared after training. The loss function of the actor network is calculated as follows:
$$L(w_{act}) = L^{clip}(w_{act}) + S(\pi_{w_{act}}), \tag{15}$$
where $w_{act}$ represents the parameters of the actor network, and $L^{clip}(w_{act})$ denotes the proximal policy optimization loss, which stabilizes training by clipping the update magnitude through the clip strategy [36]. The proximal policy optimization loss is defined as follows:
$$L^{clip}(w_{act}) = \mathbb{E}_n\left[ \min\left( r_n^{to}(w_{act}) \hat{A}_n,\ \mathrm{clip}\left( r_n^{to}(w_{act}),\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_n \right) \right], \tag{16}$$
where $r_n^{to}(w_{act}) = \frac{\pi_{w_{act}}(a_n \mid o_n)}{\pi_{w_{act}^{old}}(a_n \mid o_n)}$ represents the ratio of the probability of taking action $a_n$ in local state $o_n$ under the current policy $\pi_{w_{act}}$ to that under the old policy $\pi_{w_{act}^{old}}$, used to measure the update magnitude of the policy; $\hat{A}_n = V_{w_{val}}(s_n) - V_{avg}(s_n)$ represents the advantage function, used to measure the value of the current global state $V_{w_{val}}(s_n)$ relative to the average level $V_{avg}(s_n)$; and $w_{val}$ denotes the parameters of the critic network. The value of the global state can be understood as the expected future return starting from the current global state. The purpose of clipping is to limit the update magnitude of the policy and avoid the instability caused by large updates. $S(\pi_{w_{act}})$ represents the entropy of the policy, where a higher entropy indicates a more uniform probability distribution over actions and stronger exploration ability.
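A minimal PyTorch sketch of the clipped surrogate in (16) (tensor names are illustrative; the log-probabilities under the old policy are assumed to be stored with the trajectory):

```python
import torch

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)] with r = pi_new / pi_old."""
    ratio = torch.exp(logp_new - logp_old)                 # r^to_n(w_act)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return torch.min(ratio * advantage, clipped * advantage).mean()
```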
The loss function of the critic network is calculated as follows:
$$L(w_{val}) = \mathbb{E}\left[ \left( V_{w_{val}}(s_n) - V^{tar} \right)^2 \right], \tag{17}$$
where $V_{w_{val}}(s_n)$ represents the value function of the global state $s_n$, and $V^{tar} = \sum_{i=n}^{N-1} \gamma^{i-n} r_i + \gamma^{N-n} V_{w_{val}}(s_N)$ represents the target value function, i.e., the expected future return, also known as the expected discounted reward, with $\gamma$ as the discount factor. The second term is included only when $N$ is the episode's final step.
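The target value can be computed backwards over an episode; a minimal sketch (assuming per-slot rewards and the critic's bootstrap value at the final state; names are illustrative):

```python
def discounted_targets(rewards, v_final, gamma=0.99):
    """V^tar_n = sum_{i=n}^{N-1} gamma^(i-n) * r_i + gamma^(N-n) * V(s_N)."""
    targets, v = [], v_final            # bootstrap from the episode's last state
    for r in reversed(rewards):
        v = r + gamma * v
        targets.append(v)
    return list(reversed(targets))
```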
As shown in Figure 2, the joint power allocation and target vehicle switching algorithms need to be integrated into the MAPPO framework. The specific implementation details are provided in Algorithms 1 and 2. In summary, the process flow of the beam allocation algorithm based on MAPPO is summarized in Algorithm 3.
Algorithm 1 Joint Power Allocation Algorithm Based on Stackelberg Game
Initialize: $p_{k',n}^{*}(0)$, $\lambda_{k'}(0)$, $\forall k' \in \mathcal{K}_k$;
1: while $i < n_{out}$ do
2:   Fix $p_{k',n}^{*}(i-1)$ and calculate $p_{m,n}^{*}(i)$ via (19);
3:   Initialize $I_{k',n}(0)$;
4:   while $j < n_{in}$ do
5:     Fix $p_{m,n}^{*}(i)$, solve problem (22), and obtain $p_{k',n}^{*}(j)$;
6:     Update $\lambda_{k'}(j)$;
7:     if $|\lambda_{k'}(j) - \lambda_{k'}(j-1)| \le \Delta_1$, $\forall k' \in \mathcal{K}_k$ then
8:       $p_{k',n}^{*}(i) \leftarrow p_{k',n}^{*}(j)$;
9:     end if
10:    Update $I_{k',n}^{c}(j)$ and $I_{k',n}^{r}(j)$ according to $p_{m,n}^{*}(i)$ and $p_{k',n}^{*}(j)$;
11:    $j = j + 1$;
12:  end while
13:  if $|p_{k',n}^{*}(i) - p_{k',n}^{*}(i-1)| \le \Delta_2$, $\forall k' \in \mathcal{K}_k$ then
14:    Output: $p_{m,n}^{*}(i)$, $p_{k',n}^{*}(i)$;
15:  end if
16:  $i = i + 1$;
17: end while
Algorithm 2 Target Vehicle Switching Based on Sensing Domain Sharing
Input: SNLs received from other vehicles;
1: for $k' \in$ SNL do
2:   if length($G_{k'}$) > 1 then
3:     max_sinr = $-\infty$;
4:     idx = $-1$;
5:     for $i = 1$ to length($G_{k'}$) do
6:       if $\omega_{i,n}^r$ > max_sinr then
7:         max_sinr = $\omega_{i,n}^r$;
8:         idx = $i$;
9:       end if
10:    end for
11:    Retain the sensing information with index idx within $G_{k'}$;
12:  end if
13:  Compute the distance $d_{k',n}$ between the current vehicle $k$ and vehicle $k'$ as $\| \mathbf{P}_{k,n} - \hat{\mathbf{P}}_{k',n} \|_2$;
14: end for
15: min_d = $\infty$;
16: $k'$ = $-1$;
17: for $i \in$ SNL do
18:   if $d_{i,n}$ < min_d then
19:     min_d = $d_{i,n}$;
20:     $k'$ = $i$;
21:   end if
22: end for
23: Output: vehicle $k'$ as the new target vehicle.
Algorithm 3 ISAC Beam Allocation Scheme Based on MAPPO and Stackelberg Game (IMSBA)
Initialize: actor network parameters $w_{act}$; critic network parameters $w_{val}$; experience pool $\mathcal{F}$;
1: for episode = 1:MaxEpisode do
2:   Reset the vehicles' positions and the experience pool;
3:   Obtain the initial local state $o_{k,0}$;
4:   while $n \le$ MaxStep $- 1$ do
5:     Input $o_{k,n}$ into the actor network and select action $a_{k,n}$ based on the output policy $\pi_{w_{act}}$;
6:     Execute Algorithm 1 (joint power allocation);
7:     Execute Algorithm 2 (target vehicle switching);
8:     Observe the next time slot state $o_{k,n+1}$ and calculate the reward $r_n$ via (14);
9:     $o_{k,n} \leftarrow o_{k,n+1}$;
10:  end while
11:  if evaluate == False then
12:    Save the current episode trajectory to the experience pool $\mathcal{F}$;
13:    if length($\mathcal{F}$) == $s_{amp}$ then
14:      Calculate the advantage function $\hat{A}_n$ and the target value function $V^{tar}$;
15:      Calculate $L(w_{act})$ via (15);
16:      Calculate $L(w_{val})$ via (17);
17:      Backpropagate and use the Adam optimizer to update $w_{act}$ and $w_{val}$;
18:      Clear the experience pool $\mathcal{F}$;
19:    end if
20:  end if
21: end for

3.3. Joint Power Allocation Based on Stackelberg Game

In the Internet of Vehicles, resource optimization problems are typically complex and non-convex; therefore, mathematical methods and tools are crucial for facilitating their solution. Difference of Convex (DC) approximation is a method that can be used to obtain approximately optimal solutions for complex non-convex problems [37]. Additionally, methods based on Stackelberg games are considered useful tools for solving complex problems [38]. In a Stackelberg game, users with different objectives are divided into leaders and followers who cooperate and compete with each other. Based on this strategy, a high-complexity problem can be decomposed into two subproblems of lower complexity. The joint power allocation of I2V and V2V is highly coupled, so we propose a method based on the Stackelberg game that transforms the joint power allocation problem into two suboptimal power allocation problems, thereby simplifying the solution of the original problem.
In this section, we consider the I2V link as the leader in the game and the V2V link as the follower. For simplicity, this section assumes perfect channel state information is available. The leader aims to maximize the total achievable rate of I2V, while the follower aims to maximize the total energy efficiency and sensing rate of V2V, establishing a competitive relationship between them.
The power allocation problem for the leader can be written as
$$\underset{p_{m,n},\, \forall m \in \mathcal{M}}{\text{maximize}} \quad R_n^R - \sum_{m} \sum_{k} p_{m,n} h_{R,k} \psi_{m,k} \tag{18}$$
$$\text{s.t.} \quad p_{m,n} \le p_{\max}^{R},\ \forall m \in \mathcal{M},$$
where $h_{R,k}$ represents the channel gain from the RSU to vehicle $k$. Although a larger $p_{m,n}$ increases $R_n^R$, it also causes more severe interference to V2V. Therefore, the first part of the objective function represents the total achievable rate of I2V, i.e., the utility function, while the second part represents the interference level of I2V to V2V, i.e., the cost function. It can be observed that the objective function is concave with respect to $p_{m,n}$; therefore, its partial derivative can be set to zero to obtain $p_{m,n}^*$:
$$p_{m,n}^* = \max\left( 0,\ \min\left( \frac{A - B}{AB},\ p_{\max}^{R} \right) \right), \tag{19}$$
where $A = h_{R,m} / \big( \sigma_c^2 + \sum_{k} \sum_{k'} h_{k,m} p_{k',n} \psi_{k,m} \big)$ and $B = \sum_{k} h_{R,k} \psi_{m,k}$; $h_{R,m}$ and $h_{k,m}$ represent the channel gains from the RSU and from vehicle $k$ to vehicle $m$, respectively. Since the joint power optimization is performed in two stages, the follower's power, i.e., the V2V transmit power, can be considered constant when solving the leader's power allocation problem. Similarly, when solving the follower's power allocation problem, the leader's power, i.e., the I2V transmit power, can be considered constant.
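A minimal sketch of this closed-form leader update, matching the reconstructed form in (19) (the scalars $A$ and $B$ are computed as defined above; the numbers are illustrative):

```python
def leader_power(a, b, p_max):
    """Stationary point (A - B) / (A * B) of the concave leader utility,
    projected onto the feasible interval [0, p_max]."""
    return max(0.0, min((a - b) / (a * b), p_max))

p_star = leader_power(a=5.0, b=0.8, p_max=10.0)   # illustrative values only
```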
The power allocation problem for the follower can be written as
$$\underset{p_{k',n},\, \forall k' \in \mathcal{K}_k}{\text{maximize}} \quad \sum_{k} \left( E_{k,n} + R_{k,n}^r \right) - \sum_{m} \sum_{k} \sum_{k'} p_{k',n} h_{k,m} \psi_{k,m} \tag{20}$$
$$\text{s.t.} \quad p_{\min} \le p_{k',n} \le p_{\max},\ \forall k' \in \mathcal{K}_k,$$
$$\sum_{k'} p_{k',n} \le P_{\max},\ \forall k \in \mathcal{K}.$$
The first part of the objective function represents the total energy efficiency and sensing rate of V2V, i.e., the utility function, while the second part represents the interference level of V2V to I2V, i.e., the cost function. Due to the total power constraint, we cannot simply use partial derivatives to obtain the optimal power; instead, we can use the CVX toolbox to solve this problem. However, the objective function is not concave with respect to $p_{k',n}$. Additionally, the energy efficiency term is a fractional optimization problem, so we need to transform it into a form that is more convenient to solve.
First, the Dinkelbach algorithm is employed to transform the fractional form of the energy efficiency into a linear form: the numerator minus $\lambda$ times the denominator. An iterative method is then used to solve the problem. The algorithm begins with an initial value $\lambda_{k'}(0)$; in each iteration, the optimal solution $p_{k',n}^*(j)$ is obtained and used to update $\lambda_{k'}(j) = R_{k',n}^c / p_{k',n}^*(j)$. This iterative process continues until the change between two consecutive iterations satisfies $|\lambda_{k'}(j) - \lambda_{k'}(j-1)| \le \Delta_1$, $\forall k' \in \mathcal{K}_k$, marking the end of the algorithm.
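A minimal sketch of the Dinkelbach loop, with the inner concave maximization abstracted behind a callback (in the paper this inner step is the convex problem (22) solved with the CVX toolbox; all names here are illustrative):

```python
import numpy as np

def dinkelbach(solve_inner, rate_of, p_init, tol=1e-4, max_iter=50):
    """Maximize the fraction R(p)/p by iterating max_p [R(p) - lam * p].

    solve_inner(lam): returns the power maximizing R(p) - lam * p subject to
    the power constraints; rate_of(p): achievable rate R(p) at power p.
    """
    p = p_init
    lam = rate_of(p) / p
    for _ in range(max_iter):
        p = solve_inner(lam)
        lam_new = rate_of(p) / p                  # lam(j) = R^c / p*(j)
        if abs(lam_new - lam) <= tol:             # |lam(j) - lam(j-1)| <= Delta_1
            return p, lam_new
        lam = lam_new
    return p, lam

# Toy usage: R(p) = log2(1 + 10 p) on p in [0.1, 2]; inner step by grid search.
grid = np.linspace(0.1, 2.0, 200)
p_opt, ee = dinkelbach(lambda lam: grid[np.argmax(np.log2(1 + 10 * grid) - lam * grid)],
                       lambda p: np.log2(1 + 10 * p), p_init=1.0)
```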
Therefore, based on the Dinkelbach algorithm, problem (20) is transformed into
$$\underset{p_{k',n}^{(j)},\, \forall k' \in \mathcal{K}_k}{\text{maximize}} \quad \sum_{k'} \left[ R_{k',n}^c - \lambda_{k'}^{(j-1)} p_{k',n}^{(j)} + R_{k',n}^r \right] - \sum_{m} \sum_{k'} p_{k',n}^{(j)} h_{k,m} \psi_{k,m} \tag{21}$$
$$\text{s.t.} \quad p_{\min} \le p_{k',n}^{(j)} \le p_{\max},\ \forall k' \in \mathcal{K}_k,$$
$$\sum_{k'} p_{k',n}^{(j)} \le P_{\max},\ \forall k \in \mathcal{K}.$$
However, the above problem is still not concave with respect to $p_{k',n}$, necessitating further transformation. The achievable rate $R_{k',n}^c$ and the sensing rate $R_{k',n}^r$ can each be decomposed into a log(numerator) minus log(denominator) form, where the first part is concave in $p_{k',n}$ but the second part is not. To address this non-concavity, this section employs the DC approximation technique to transform the non-concave part into a linear function. Specifically, the non-concave part is linearized using a first-order Taylor expansion, and the resulting linear function is both convex and concave. Therefore, problem (21) can be further transformed into
$$\underset{p_{k',n}^{(j)},\, \forall k' \in \mathcal{K}_k}{\text{maximize}} \quad \sum_{k'} \left[ \log\left( I_{k',n}^{c,(j)} + p_{k',n}^{(j)} h_{k',n}^c \right) - Q\left( I_{k',n}^{c,(j)} \right) - \lambda_{k'}^{(j-1)} p_{k',n}^{(j)} \right] + \sum_{k'} \left[ \log\left( I_{k',n}^{r,(j)} + p_{k',n}^{(j)} h_{k',n}^r \right) - Q\left( I_{k',n}^{r,(j)} \right) \right] - \sum_{m} \sum_{k'} p_{k',n}^{(j)} h_{k,m} \psi_{k,m} \tag{22}$$
$$\text{s.t.} \quad p_{\min} \le p_{k',n}^{(j)} \le p_{\max},\ \forall k' \in \mathcal{K}_k,$$
$$\sum_{k'} p_{k',n}^{(j)} \le P_{\max},\ \forall k \in \mathcal{K},$$
where $Q\big(I_{k',n}^{(j)}\big) \approx Q\big(I_{k',n}^{(j-1)}\big) + \nabla Q^T\big(I_{k',n}^{(j-1)}\big)\big(I_{k',n}^{(j)} - I_{k',n}^{(j-1)}\big)$ represents the first-order Taylor expansion approximation of the non-concave part, with $I_{k',n}^{(j-1)}$ being the result of the previous iteration, which can be treated as a constant. $I_{k',n}^{c,(j)}$ and $I_{k',n}^{r,(j)}$ are linear functions of $p_{k',n}^{(j)}$, representing the interference power during downlink communication and echo sensing, respectively. The algorithm terminates when the outer iteration satisfies $|p_{k',n}^*(i) - p_{k',n}^*(i-1)| \le \Delta_2$, $\forall k' \in \mathcal{K}_k$, outputting the joint power allocation for I2V and V2V. Note that the index $i$ denotes the outer iteration, in which the leader and follower problems are solved, while the index $j$ denotes the inner iteration, in which the Dinkelbach algorithm is executed; these iterations form a nested loop. Additionally, the optimal solution of this joint power allocation algorithm is deterministic, ensuring the stability of the environment when it is integrated into the multi-agent reinforcement learning framework.
The CVX toolbox typically uses the interior-point method to solve nonlinear programming problems, so the computational complexity can be expressed as $\mathcal{O}\big( n_{out} n_{in} l^{3.5} \log(1/\varepsilon) \big)$, where $n_{out}$ is the number of outer iterations, $n_{in}$ is the number of inner iterations, $l$ is the number of optimization variables, and $\varepsilon$ is the target accuracy. In summary, the joint power allocation algorithm based on the Stackelberg game proceeds as described in Algorithm 1.

3.4. Target Vehicle Switching Based on Sensing Domain Sharing

Sensing information can serve as state to guide agent decisions. Additionally, sensing information can be shared among vehicles through V2V communication. Sensing domain sharing extends the sensing range of individual vehicles, aiding optimized agent decisions and joint power allocation. We introduce a Sensing Neighbor List (SNL) in V2V communication [39] to facilitate sensing domain sharing among different vehicles during communication. The SNL stores sensing information, with specific details shown in Figure 3, where $\hat{\mathbf{P}}_{k',n}$ represents the sensed position of the target vehicle, $\hat{v}_{k',n}$ denotes the sensed velocity of the target vehicle, and $\omega_{k',n}^r$ denotes the sensing SINR of the target vehicle's echo signal, which determines the magnitude of the sensing noise. During V2V communication, each vehicle broadcasts its own SNL to the target vehicle, which then combines its own sensing information with the SNL and broadcasts it to other vehicles, thereby enabling sensing domain sharing among vehicles. When the target vehicle's communication distance is too far or its line-of-sight link is obstructed by other vehicles, the SNL can be used to switch to the nearest alternative target vehicle to minimize the impact of path loss on sensing performance.
During sensing domain sharing, the sensing of target vehicles may overlap. In such cases, the SNL may contain multiple entries with the same vehicle index $k'$. To resolve this, the entries are sorted by $\omega_{k',n}^r$, and only the entry with the highest sensing quality is retained. Additionally, due to spectrum conflicts, shorter communication distances yield lower path losses but may introduce more severe interference. This issue becomes more pronounced as the beamwidth increases, since larger beamwidths correspond to fewer antennas, weakening the asymptotic orthogonality properties.
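For illustration, the duplicate resolution and nearest-target selection of Algorithm 2 can be written compactly as below; the tuple layout of the SNL entries is an assumption, since the paper specifies only the stored fields.

```python
import numpy as np

def switch_target(snl, own_position):
    """Keep the highest-SINR entry per target index, then pick the nearest one.

    snl: iterable of (vehicle_id, position, velocity, sensing_sinr) entries
    aggregated from the SNLs received over V2V links.
    """
    best = {}
    for vid, pos, vel, sinr in snl:
        if vid not in best or sinr > best[vid][2]:
            best[vid] = (np.asarray(pos), vel, sinr)   # resolve overlaps by SINR
    # The closest remaining vehicle becomes the new communication/sensing target.
    return min(best, key=lambda vid: np.linalg.norm(best[vid][0] - own_position))

new_target = switch_target(
    [(1, [10.0, 0.0], 15.0, 8.2), (1, [10.5, 0.2], 15.0, 11.0),
     (2, [4.0, 1.0], 14.0, 9.5)],
    own_position=np.array([0.0, 0.0]))                 # -> vehicle 2
```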
In summary, the algorithm for target vehicle switching based on sensing domain sharing is outlined in Algorithm 2. Its computational complexity can be expressed as $\mathcal{O}(K^2 + K)$.

4. Simulation Results

In this section, we validate the effectiveness of the IMSBA algorithm in beam allocation through road vehicle simulation experiments. The real-road trajectory is shown in Figure 4, with the road’s latitude and longitude coordinates being 51°25′56″ N and 0°03′20″ E. We set the road’s starting point as the coordinate origin [ 0 , 0 ] , with K = 4 V2V vehicles and M = 2 I2V vehicles moving uniformly along the road. The vehicle speed refers to the speed along the road direction. Table 1 provides other specific simulation parameters.
We compare the proposed IMSBA algorithm with other benchmark algorithms, detailed as follows:
  • IMSBA w/o switch: This algorithm, based on the proposed IMSBA algorithm, removes the target vehicle switching based on sensing domain sharing. As a result, the direct link is affected by communication distance and obstacles, leading to degraded communication and sensing performance.
  • IMSBA w/o allocate: This algorithm, based on the proposed IMSBA algorithm, removes the joint power optimization based on the Stackelberg game. Therefore, in this algorithm, the transmit power uses fixed values of 1 W, 10 W, and 20 W.
  • ISSBA: In this algorithm, the base station uses the DRL algorithm DQN to allocate beams to mobile drones, aiming to maximize data transmission rates [40]. We apply this algorithm to the scenario considered in this paper, replacing the base station with the V2V source vehicle and the mobile drone with the V2V target vehicle. The interaction of actions among multiple agents causes environmental instability, severely reducing the learning efficiency and convergence performance of the agents.
Unless otherwise specified, the vertical axis in the result figures represents the sum of the performance across all time slots in an episode, and the horizontal axis represents training duration. Each unit change on the horizontal axis corresponds to 14,400 time slots or 32 episodes.
First, we present the cumulative reward curves for the proposed algorithm and the other comparison algorithms, as shown in Figure 5, to validate the effectiveness and convergence of the algorithm. This figure is obtained under reward weights $w_e = w_r = w_R = 1$ and 64 transmit and receive antennas, with similar trends observed under other conditions. The IMSBA algorithm achieves the best convergence performance, with an average reward per time slot of 121.8, validating the effectiveness of the proposed algorithm. Due to the instability caused by multi-agent actions, the DQN-based ISSBA algorithm fails to learn effectively, resulting in the worst convergence performance, with an average reward per time slot of 76.2. In the related ablation experiments, the IMSBA w/o allocate algorithm, which does not perform joint power allocation and uses fixed power instead, shows inferior convergence performance compared with the IMSBA algorithm. Additionally, the convergence performance of the IMSBA w/o allocate algorithm improves with increasing transmit power, primarily because a higher transmit power enhances V2V sensing performance and I2V communication performance, outweighing the decrease in V2V energy efficiency. The IMSBA w/o switch algorithm, which does not switch target vehicles, suffers from link blockage or excessive communication distance, resulting in lower convergence performance compared with the IMSBA algorithm. Notably, the convergence performance of the IMSBA w/o switch algorithm is also lower than that of the IMSBA w/o allocate algorithm, indicating that the performance gain from switching target vehicles is greater than that from joint power allocation.
We present individual curves for the components of the cumulative reward shown in Figure 5 and compare them with the other benchmark algorithms. The energy efficiency curve is depicted in Figure 6. Because IMSBA w/o allocate fixes the transmission power at 1 W, this algorithm achieves the highest energy efficiency, which is 8.1% higher than that of the IMSBA algorithm. IMSBA w/o switch experiences significant path loss due to not switching target vehicles, resulting in a 30% lower energy efficiency compared with the IMSBA algorithm. The ISSBA algorithm exhibits a lower energy efficiency than IMSBA due to environmental instability, yet it still outperforms the IMSBA w/o allocate variants with fixed transmission powers of 10 W and 20 W. The sensing performance curve is shown in Figure 7. Unlike the energy efficiency curve, the IMSBA w/o allocate variants with fixed transmission powers of 10 W and 20 W achieve higher sensing performance, consistent with the findings in Figure 5. Notably, even with a fixed transmission power of 1 W, the IMSBA w/o allocate algorithm achieves higher sensing performance than the IMSBA w/o switch algorithm, confirming that switching target vehicles yields greater performance improvements than joint power allocation. The I2V achievable rate curve is depicted in Figure 8. Clearly, as the I2V transmission power increases, the total achievable rate also increases, consistent with the results of the IMSBA w/o allocate algorithm. The IMSBA algorithm accounts for I2V interference in the joint power allocation and hence reduces the I2V transmission power; moreover, owing to the reduced path loss from switching target vehicles, the IMSBA algorithm under the current reward weights tends to allocate more resources to V2V. Both factors result in slightly lower I2V achievable rates compared with the IMSBA w/o switch algorithm.
The weights of the various components in the MAPPO reward function influence the final convergence performance. Therefore, we adjusted the weights to observe their impact on the results. The following three figures were obtained with 64 transceiver antennas; when one weight was varied, the other weights were set to 1. The curve of energy efficiency versus the weight $w_e$ is shown in Figure 9. Clearly, increasing $w_e$ tilts resources towards energy efficiency, resulting in an upward trend in energy efficiency. The curve of sensing performance versus the weight $w_r$ is shown in Figure 10. Similar to energy efficiency, as $w_r$ increases, sensing performance also improves. Notably, sensing performance has a crucial impact on the overall performance of the IMSBA algorithm: poor sensing performance leads to inaccurate target sensing results, which, in turn, affects the agents' action selection. Therefore, when the weight $w_r$ is adjusted, the change in sensing performance is not very pronounced and must be maintained at a high level. The curve of the I2V sum rate versus the weight $w_R$ is shown in Figure 11. First, the trend still increases with $w_R$. Second, because the variation in the sum rate is relatively small, when $w_R$ is small, resources are inclined towards V2V, resulting in a downward trend in the I2V sum rate in the later stages; this downward trend gradually weakens as $w_R$ increases.
Figure 12 shows the variation of cumulative reward with the number of antennas for the IMSBA algorithm and the IMSBA w/o switch algorithm, with all experiments using a reward weight of 1. It can be observed that the cumulative reward for both algorithms increases with the number of antennas due to the larger antenna array gain, which enhances overall sensing and communication performance. Under the same number of antennas, the cumulative reward for the IMSBA w/o switch algorithm is lower than that of the IMSBA algorithm due to the higher path loss. Moreover, the gap in cumulative rewards between the two algorithms widens as the number of antennas increases. This is because, with fewer antennas, the beam width is wider, and the smaller path loss exacerbates inter-vehicle interference, leading to a reduction in the cumulative reward of the IMSBA algorithm.
To further assess the sensing performance, Figure 13 plots the CDF of the position sensing error versus the number of antennas, with all algorithms using a reward weight of 1. This error refers to the root mean square error (RMSE) between the true position and the sensed position estimate. The horizontal axis range in Figure 13 is smaller, while that in Figure 14 is larger. It can be seen that for smaller position sensing errors, more antennas result in a higher CDF. For example, when the number of antennas is 128, the probability of the position sensing error being less than $2.5 \times 10^{-3}$ m is 28%, higher than with other antenna numbers. However, for larger position sensing errors, fewer antennas result in a higher CDF. This is because the IMSBA algorithm requires target vehicle switching, which can lead to beam misalignment after switching. With more antennas, the beam width is narrower, and the beam gain drops more significantly when misaligned, leading to larger position sensing errors, as shown in Figure 14. In the stable condition without target vehicle switching, more antennas result in greater antenna array gain and smaller position sensing errors, as shown in Figure 13. Figure 15 compares the CDF of the position sensing error of the proposed IMSBA algorithm with those of the other benchmark algorithms, with all algorithms using a reward weight of 1 and 64 antennas. It can be seen that, regardless of the error magnitude, the IMSBA algorithm achieves the highest CDF, with a probability of 91.6% for the position sensing error being less than 0.2 m. The IMSBA w/o switch algorithm, constrained by a higher path loss, yields larger position sensing errors. In the ISSBA algorithm, multi-agent actions cause environmental instability, preventing effective training and leading to the largest position sensing error.
Figure 16 illustrates the performance comparison of the IMSBA algorithm under different numbers of I2V vehicles, using experimental conditions with a reward weight of 1 and 64 transmit–receive antennas. The values in the figure represent results after algorithm convergence. As established earlier, I2V employs orthogonal frequency bands for communication, with the number of I2V vehicles equating to the number of orthogonal frequency bands available. Consequently, as the number of I2V vehicles increases, V2V can reuse more orthogonal frequency bands, further reducing mutual interference between V2V pairs and between I2V and V2V. As shown in Figure 16, the overall cumulative reward increases with the number of I2V vehicles, with V2V’s energy efficiency, sensing performance, and I2V’s achievable rates following the same trend as cumulative reward. Figure 17 presents the performance comparison of the IMSBA algorithm under different numbers of V2V vehicles, similarly using experimental conditions with a reward weight of 1 and 64 transmit–receive antennas. The values in the figure represent results after algorithm convergence. As the number of V2V vehicles increases, the cumulative reward shows an upward trend. However, increased V2V vehicles exacerbate inter-vehicle interference, resulting in a certain degree of reduction in I2V total achievable rates. Moreover, due to increased inter-vehicle interference, the growth rate of cumulative reward exhibits a declining trend; the cumulative reward growth at six vehicles is 9.7% lower compared to five vehicles.
IMSBA is expected to improve the communication robustness and sensing accuracy of ISAC systems in 6G scenarios by jointly optimizing beam direction, spectrum, and power allocation through multi-agent reinforcement learning and a Stackelberg game. Meanwhile, the V2V sensing sharing and target switching mechanism expands the sensing range of each vehicle. The IMSBA architecture has good modularization characteristics, making it easy to deploy on edge computing platforms. Although it still needs to be validated under real-road conditions, this scheme has shown the potential to become a foundational framework for intelligent beam management in future ISAC systems.

5. Conclusions

In a multi-beam communication scenario with coexisting I2V and V2V communications and limited spectrum resources, this paper proposes an ISAC beam allocation scheme (IMSBA) based on multi-agent reinforcement learning to suppress interference and maximize overall ISAC performance. To mitigate environmental instability, the scheme introduces the multi-agent reinforcement learning algorithm MAPPO, employing a centralized training and distributed execution architecture, and optimizes V2V beam direction and frequency allocation using sensing information from target vehicles. Additionally, to reduce the dimensionality of the action space, the paper separately optimizes the I2V and V2V joint power allocation using the Stackelberg game method. Finally, vehicles achieve sensing domain sharing through V2V communication, utilizing sensing information from other vehicles to switch target vehicles and reduce the path loss caused by communication distance or obstruction. The simulation results show that, under conditions where all reward weights are 1 and the number of transceiver antennas is 64, the proposed IMSBA algorithm increases the V2V total energy efficiency by 92.5% and the probability of the target vehicle position-sensing RMSE being less than 0.2 m by 43.6% compared with the ISSBA algorithm, effectively solving the beam allocation problem in scenarios with coexisting I2V and V2V communications.

6. Limitations and Future Work

Although the IMSBA scheme performs well in simulations, it has several limitations. First, the Stackelberg game relies on a perfect-CSI assumption that is difficult to guarantee in practice and may degrade performance when violated. Second, MAPPO faces training-efficiency and stability challenges as the number of agents grows. In addition, the target-vehicle switching mechanism depends on high-quality sensing information and may make erroneous switching decisions at low SINR. Finally, the simulations use idealized modeling and lack validation on real roads and measured channel environments. Future work will focus on algorithm scalability, complexity reduction, and evaluation in practical deployments.
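One concrete way to probe the perfect-CSI limitation is sketched below under our own simplified gain model: inject a relative estimation error into the channel gains fed to the power allocator and track how far the estimate drifts from the truth (a full study would re-run the allocator and compare achieved utilities):

import numpy as np

def noisy_csi(g_true: np.ndarray, rel_err: float, rng) -> np.ndarray:
    """Multiply true gains by (1 + relative Gaussian estimation error)."""
    return g_true * (1.0 + rel_err * rng.standard_normal(g_true.shape))

rng = np.random.default_rng(1)
g_true = rng.uniform(0.05, 1.0, (4, 4))         # stand-in channel power gains
for rel_err in (0.0, 0.05, 0.1, 0.2):
    g_hat = noisy_csi(g_true, rel_err, rng)
    gap = float(np.abs(g_hat - g_true).mean())  # mean absolute CSI error
    print(f"rel_err={rel_err:.2f}  mean |error|={gap:.4f}")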

Author Contributions

Conceptualization, J.L., Y.Z., and D.W.; methodology, J.L., Y.Z., and D.W.; software, J.L., Y.Z., and D.W.; validation, Y.Z.; formal analysis, J.L., Y.Z., and D.W.; writing—original draft preparation, J.L., Y.Z., and D.W.; writing—review and editing, J.L., Y.Z., and D.W.; visualization, J.L.; supervision, Y.Z.; project administration, D.W.; funding acquisition, D.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of Xiamen, China (Grant number 3502Z20227177), and in part by the National Natural Science Foundation of China (Grant numbers 62171392 and 62271427).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ji, B.; Zhang, X.; Mumtaz, S.; Han, C.; Li, C.; Wen, H.; Wang, D. Survey on the internet of vehicles: Network architectures and applications. IEEE Commun. Stand. Mag. 2020, 4, 34–41.
2. Liu, F.; Yuan, W.; Masouros, C.; Yuan, J. Radar-assisted predictive beamforming for vehicular links: Communication served by sensing. IEEE Trans. Wirel. Commun. 2020, 19, 7704–7719.
3. Yuan, W.; Liu, F.; Masouros, C.; Yuan, J.; Ng, D.W.K.; González-Prelcic, N. Bayesian predictive beamforming for vehicular networks: A low-overhead joint radar-communication approach. IEEE Trans. Wirel. Commun. 2020, 20, 1442–1456.
4. Hong, W.; Jiang, Z.H.; Yu, C.; Hou, D.; Wang, H.; Guo, C.; Hu, Y.; Kuai, L.; Yu, Y.; Jiang, Z.; et al. The role of millimeter-wave technologies in 5G/6G wireless communications. IEEE J. Microwaves 2021, 1, 101–122.
5. Liu, F.; Masouros, C.; Petropulu, A.P.; Griffiths, H.; Hanzo, L. Joint radar and communication design: Applications, state-of-the-art, and the road ahead. IEEE Trans. Commun. 2020, 68, 3834–3862.
6. Uwaechia, A.N.; Mahyuddin, N.M. A comprehensive survey on millimeter wave communications for fifth-generation wireless networks: Feasibility and challenges. IEEE Access 2020, 8, 62367–62414.
7. Niu, Y.; Li, Y.; Jin, D.; Su, L.; Vasilakos, A.V. A survey of millimeter wave communications (mmWave) for 5G: Opportunities and challenges. Wirel. Netw. 2015, 21, 2657–2676.
8. Rao, S. Introduction to mmWave sensing: FMCW radars. In Texas Instruments (TI) mmWave Training Series; Texas Instruments Inc.: Dallas, TX, USA, 2017; pp. 1–11.
9. Dokhanchi, S.H.; Mysore, B.S.; Mishra, K.V.; Ottersten, B. A mmWave automotive joint radar-communications system. IEEE Trans. Aerosp. Electron. Syst. 2019, 55, 1241–1260.
10. Alkhateeb, A.; El Ayach, O.; Leus, G.; Heath, R.W. Channel estimation and hybrid precoding for millimeter wave cellular systems. IEEE J. Sel. Top. Signal Process. 2014, 8, 831–846.
11. Yang, L.; Zhang, W. Beam tracking and optimization for UAV communications. IEEE Trans. Wirel. Commun. 2019, 18, 5367–5379.
12. Va, V.; Vikalo, H.; Heath, R.W. Beam tracking for mobile millimeter wave communication systems. In Proceedings of the 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Washington, DC, USA, 7–9 December 2016; pp. 743–747.
13. Larew, S.G.; Love, D.J. Adaptive beam tracking with the unscented Kalman filter for millimeter wave communication. IEEE Signal Process. Lett. 2019, 26, 1658–1662.
14. Liu, C.; Yuan, W.; Li, S.; Liu, X.; Li, H.; Ng, D.W.K.; Li, Y. Learning-based predictive beamforming for integrated sensing and communication in vehicular networks. IEEE J. Sel. Areas Commun. 2022, 40, 2317–2334.
15. Lim, S.H.; Kim, S.; Shim, B.; Choi, J.W. Deep learning-based beam tracking for millimeter-wave communications under mobility. IEEE Trans. Commun. 2021, 69, 7458–7469.
16. Yuan, W.; Liu, C.; Liu, F.; Li, S.; Ng, D.W.K. Learning-based predictive beamforming for UAV communications with jittering. IEEE Wirel. Commun. Lett. 2020, 9, 1970–1974.
17. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
18. Wiering, M.A.; Van Otterlo, M. Reinforcement learning. Adapt. Learn. Optim. 2012, 12, 729.
19. Zhang, J.; Huang, Y.; Wang, J.; You, X. Intelligent beam training for millimeter-wave communications via deep reinforcement learning. In Proceedings of the 2019 IEEE Global Communications Conference (GLOBECOM), Big Island, HI, USA, 9–13 December 2019; pp. 1–7.
20. Zhang, J.; Huang, Y.; Wang, J.; You, X.; Masouros, C. Intelligent interactive beam training for millimeter wave communications. IEEE Trans. Wirel. Commun. 2020, 20, 2034–2048.
21. Kim, S.; Kwon, G.; Park, H. Q-learning-based low complexity beam tracking for mmWave beamforming system. In Proceedings of the 2020 International Conference on Information and Communication Technology Convergence (ICTC), Jeju, Republic of Korea, 21–23 October 2020; pp. 1451–1455.
22. Chiang, H.L.; Chen, K.C.; Rave, W.; Marandi, M.K.; Fettweis, G. Machine-learning beam tracking and weight optimization for mmWave multi-UAV links. IEEE Trans. Wirel. Commun. 2021, 20, 5481–5494.
23. Jeong, J.; Lim, S.H.; Song, Y.; Jeon, S.W. Online learning for joint beam tracking and pattern optimization in massive MIMO systems. In Proceedings of the IEEE INFOCOM 2020-IEEE Conference on Computer Communications, Virtual, 6–9 July 2020; pp. 764–773.
24. Cheng, B.; Zhao, L.; He, Z.; Zhang, P. A beam tracking scheme based on deep reinforcement learning for multiple vehicles. In Proceedings of the International Conference on Communications and Networking in China, Beijing, China, 21–22 November 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 291–305.
25. Liu, Y.; Jiang, Z.; Zhang, S.; Xu, S. Deep reinforcement learning-based beam tracking for low-latency services in vehicular networks. In Proceedings of the ICC 2020-2020 IEEE International Conference on Communications (ICC), Virtual, 7–11 June 2020; pp. 1–7.
26. Ma, D.; Shlezinger, N.; Huang, T.; Liu, Y.; Eldar, Y.C. Joint radar-communication strategies for autonomous vehicles: Combining two key automotive technologies. IEEE Signal Process. Mag. 2020, 37, 85–97.
27. Zhang, A.; Rahman, M.L.; Huang, X.; Guo, Y.J.; Chen, S.; Heath, R.W. Perceptive mobile networks: Cellular networks with radio vision via joint communication and radar sensing. IEEE Veh. Technol. Mag. 2020, 16, 20–30.
28. Cheng, X.; Duan, D.; Gao, S.; Yang, L. Integrated sensing and communications (ISAC) for vehicular communication networks (VCN). IEEE Internet Things J. 2022, 9, 23441–23451.
29. Liu, A.; Huang, Z.; Li, M.; Wan, Y.; Li, W.; Han, T.X.; Liu, C.; Du, R.; Tan, D.K.P.; Lu, J.; et al. A survey on fundamental limits of integrated sensing and communication. IEEE Commun. Surv. Tutor. 2022, 24, 994–1034.
30. Xu, Y.; Guo, Y.; Li, C.; Xia, B.; Chen, Z. Predictive beam tracking with cooperative sensing for vehicle-to-infrastructure communications. In Proceedings of the 2021 IEEE/CIC International Conference on Communications in China (ICCC), Xiamen, China, 28–30 July 2021; pp. 835–840.
31. Zhao, J.; Gao, F.; Jia, W.; Yuan, W.; Jin, W. Integrated sensing and communications for UAV communications with jittering effect. IEEE Wirel. Commun. Lett. 2023, 12, 758–762.
32. Mu, J.; Gong, Y.; Zhang, F.; Cui, Y.; Zheng, F.; Jing, X. Integrated sensing and communication-enabled predictive beamforming with deep learning in vehicular networks. IEEE Commun. Lett. 2021, 25, 3301–3304.
33. Wymeersch, H.; Seco-Granados, G.; Destino, G.; Dardari, D.; Tufvesson, F. 5G mmWave positioning for vehicular networks. IEEE Wirel. Commun. 2017, 24, 80–86.
34. Salem, A.A.; Ismail, M.H.; Ibrahim, A.S. Active reconfigurable intelligent surface-assisted MISO integrated sensing and communication systems for secure operation. IEEE Trans. Veh. Technol. 2022, 72, 4919–4931.
35. Lin, Z.; Lin, M.; Wang, J.-B.; de Cola, T.; Wang, J. Joint beamforming and power allocation for satellite-terrestrial integrated networks with non-orthogonal multiple access. IEEE J. Sel. Top. Signal Process. 2019, 13, 657–670.
36. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
37. Vucic, N.; Shi, S.; Schubert, M. DC programming approach for resource allocation in wireless networks. In Proceedings of the 8th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks, Avignon, France, 31 May–4 June 2010; pp. 380–386.
38. Jiang, Y.; Ge, H.; Bennis, M.; Zheng, F.C.; You, X. Power control via Stackelberg game for small-cell networks. Wirel. Commun. Mob. Comput. 2019, 2019, 1401469.
39. Liu, Y.; Sun, S.; Zhang, R. Sensing-assisted neighbor discovery for vehicular ad hoc networks. In Proceedings of the 2023 IEEE Wireless Communications and Networking Conference (WCNC), Glasgow, UK, 26–29 March 2023; pp. 1–6.
40. Susarla, P.; Gouda, B.; Deng, Y.; Juntti, M.; Silvén, O.; Tölli, A. DQN-based beamforming for uplink mmWave cellular-connected UAVs. In Proceedings of the 2021 IEEE Global Communications Conference (GLOBECOM), Madrid, Spain, 7–11 December 2021; pp. 1–6.
Figure 1. Diagram of mmWave ISAC coexistence between I2V and V2V communications in IoV.
Figure 2. The network model of the ISAC beam allocation scheme (IMSBA) based on MAPPO and a Stackelberg game.
Figure 3. Sensing neighbor list.
Figure 4. Real-road trajectory.
Figure 5. Cumulative reward variation curves.
Figure 6. Energy efficiency change curve.
Figure 7. Sensing performance change curve.
Figure 8. I2V achievable rate change curve.
Figure 9. Energy efficiency versus weight.
Figure 10. Sensing performance versus weight.
Figure 11. I2V achievable rate versus weight.
Figure 12. Curve of cumulative reward versus number of antennas.
Figure 13. CDF of position sensing error versus number of antennas when the position sensing error RMSE ranges from 0 to 0.01.
Figure 14. CDF of position sensing error versus number of antennas when the position sensing error RMSE ranges from 0 to 0.2.
Figure 15. CDF of position sensing error for different algorithms.
Figure 16. Performance comparison of different numbers of I2V vehicles.
Figure 17. Performance comparison of different numbers of V2V vehicles.
Table 1. Parameter settings.

Initial x-axis positions of the V2V vehicles: 30, 10, 30, 20 m
Initial x-axis positions of the I2V vehicles: 20, 5 m
Velocity of the V2V vehicles: 15, 20, 10, 15 m/s
Velocity of the I2V vehicles: 10, 15 m/s
Location of the RSU: [60, 120] m
Time slot ΔT: 20 ms
Transmit and receive antennas N_t, N_r: 64
RSU transmit antennas N_t^R: 64
Angular interval of adjacent indices in the codebook Δθ: 0.025 rad
Center frequency f_c: 30 GHz
Complex radar cross-section ε: 10 + 10j
Channel power gain α̃: 1
Noise variance: σ² = σ̃_c² = 1
Matched filter gain G: 10
Proportional coefficient: 0.01
Relative change set I: {−1, 0, 1}
Transmit power range p: [1, 20] W
Learning rate φ: 0.0005
Discount factor γ: 0.99
Experience pool size s_amp: 32
MaxStep: 450
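For convenience, a hedged Python rendering of Table 1 follows; the key names are our own, and the values mirror the table (the relative change set is written with the minus sign that the original typesetting evidently dropped):

SIM_CONFIG = {
    "v2v_init_x_m": [30, 10, 30, 20],      # initial x-axis positions, V2V
    "i2v_init_x_m": [20, 5],               # initial x-axis positions, I2V
    "v2v_speed_mps": [15, 20, 10, 15],
    "i2v_speed_mps": [10, 15],
    "rsu_location_m": (60, 120),
    "slot_ms": 20,
    "num_tx_rx_antennas": 64,
    "rsu_tx_antennas": 64,
    "codebook_angle_step_rad": 0.025,
    "carrier_ghz": 30,
    "radar_cross_section": complex(10, 10),
    "channel_power_gain": 1,
    "noise_variance": 1,
    "matched_filter_gain": 10,
    "proportional_coefficient": 0.01,
    "relative_change_set": (-1, 0, 1),
    "tx_power_range_w": (1, 20),
    "learning_rate": 5e-4,
    "discount_gamma": 0.99,
    "experience_pool_size": 32,
    "max_steps": 450,
}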