Deep Reinforcement Learning-Based Coordinated Beamforming for mmWave Massive MIMO Vehicular Networks

As a critical enabler for beyond fifth-generation (B5G) technology, millimeter wave (mmWave) beamforming has been studied for many years. Multiple-input multiple-output (MIMO) systems, which form the baseline for beamforming operation, rely heavily on multiple antennas to stream data in mmWave wireless communication systems. High-speed mmWave applications face challenges such as blockage and latency overhead. In addition, the efficiency of mobile systems is severely impacted by the high training overhead required to discover the best beamforming vectors in large-antenna-array mmWave systems. To mitigate these challenges, in this paper we propose a novel deep reinforcement learning (DRL)-based coordinated beamforming scheme in which multiple base stations (BSs) jointly serve one mobile station (MS). The constructed solution uses the proposed DRL model to predict suboptimal beamforming vectors at the BSs out of the candidate beamforming codebook. This solution enables a complete system that facilitates highly mobile mmWave applications with dependable coverage, minimal training overhead, and low latency. Numerical results demonstrate that our proposed algorithm remarkably increases the achievable sum rate in the highly mobile mmWave massive MIMO scenario while ensuring low training and latency overhead.


Introduction
With the recent advancements in 5G, it is not unrealistic to expect that 5G will enable 1000× more data traffic than the widely established 4G standards [1,2]. Foreseeing the rise in users and the increase in traffic demands, accommodating these massive numbers of users and serving high-quality cellular networks require high-frequency waves. Recently, millimeter wave (mmWave) communication has attracted significant interest in the design of 5G wireless communication systems owing to its advantages in reducing spectrum scarcity and enabling high data rates [3]. The mmWave frequency band lies between 30 GHz and 300 GHz. However, these higher frequencies travel very short distances due to their physical limitations in the spectrum and exhibit high path loss [4]. Consequently, higher frequencies require smaller cellular cells to overcome challenges such as path loss and blockage [5]. Massive multiple-input multiple-output (mMIMO) can use hundreds of antennas simultaneously to propagate signals in the same time-frequency resource and serve tens of users at the same time [6]. mMIMO techniques can be utilized to perform highly directional transmissions thanks to the short wavelength of mmWave, which makes it physically feasible to equip many antennas at the transceivers of a cellular network and can significantly improve network capacity [7,8]. Under a fairly generic channel model that considers poor channel estimation, pilot contamination, path loss, and terminal-specific antenna correlation, large-scale antenna systems significantly increase the achievable uplink and downlink rates [9]. In situations with rapid changes in propagation, large-scale antenna systems can reliably offer high throughput on both the forward and reverse links [10].
Vehicles are getting more sensors as driving becomes increasingly automated, which calls for increasingly higher data rates. Beamforming in mMIMO makes it possible to serve distant users with mmWave, even users that are not stationary. Therefore, mmWave mMIMO communication is the practical method for large-bandwidth connected automobiles [11]. As a result, mmWave mMIMO systems can serve mobile vehicles effectively, provided the proper beam is selected. Due to the fundamental differences between mmWave communications and current microwave-based communication technologies (e.g., 2.4 GHz and 5 GHz), mmWave systems present difficulties such as high sensitivity to shadowing and significant signal attenuation [12]. In this paper, to overcome these issues and enable mMIMO environments where a highly non-stationary active user is present, we introduce a coordinated beamforming scheme utilizing deep reinforcement learning (DRL) to select the optimal beam for a vehicular communication system. First, a deep Q-network (DQN) algorithm is designed to handle the beam selection problem as a Markov decision process (MDP). Then, by ensuring that the constraints of the beam selection matrix are met, our goal is to choose the best beams to maximize the sum rate for the served user.

Related Works
There have been a few standard traditional approaches to beamforming or beam selection. In [13], Gao et al. followed an exhaustive search approach for beamforming, which incurs very high complexity in the system. On the other hand, Pal et al. [14] followed a different approach that iterated through the users and beams to determine the best possible beamforming matrices. This approach also requires a high-complexity algorithm.
On the other hand, deep learning (DL)-based approaches show promising results in terms of application complexity and viability. Alkhateeb et al. [15] derived a DL-enabled coordinated beamforming scheme supporting high mobility for mmWave mMIMO in an outdoor scenario. In their design, they utilized distributed base stations (BSs) to simultaneously serve a mobile user. They predicted the optimal beams using a traditional DL approach and compared the achievable rate of their DL method with the optimal achievable beamforming rate. Zhang et al. [16] proposed a multi-user mMIMO coordinated beamforming scheme for heterogeneous networks (HetNets) focusing on energy efficiency (EE), based on a convolutional neural network (CNN) approach. They formulated a multi-user mMIMO HetNets optimization problem to maximize EE with low complexity and computation delay. In order to accomplish end-to-end autonomous beamforming, Ref. [17] introduced a constrained deep neural network-based beamforming technique. This method uses a neural network in place of the beamforming matrices used in conventional beamforming.
In [18], in-depth experiments for coordinated multipoint transmission at 73 GHz were carried out in a downtown Brooklyn urban open square setting. The analysis showed that jointly serving a user by many BSs at the same time can achieve a considerable coverage improvement. Moreover, the observation that BS coordination, in which a user is concurrently served by many BSs, can yield a significant coverage increase is also demonstrated by Maamari et al. in an analysis of the performance of heterogeneous mmWave cellular networks in [19]. Gupta et al. [20] investigated the case where users are served with at least one line-of-sight (LOS) connection. The results showed that the density of coordinating BSs should scale with the square of the blockage density in order to maintain the same LOS connection. Although [18-20] established that BS coordination significantly increases coverage, they lack an analysis of how to produce coordinated beamforming vectors.
In order to enable high-speed, long-range, and reliable transmission in mmWave 60 GHz wireless personal area networks, Wang et al. [21] introduced a beamforming approach applied in the media access control (MAC) layer on top of various physical layer (PHY) designs. Ref. [11] suggested a new strategy to lower the overhead of beam alignment by utilizing dedicated short-range communication (DSRC) and/or sensor information as side information. They then provided detailed examples of how to leverage location data from DSRC to lessen the overhead of beam alignment and tracking in mmWave vehicle-to-everything (V2X) applications. Conversely, Ref. [22] proposed an algorithm to jointly optimize the beamforming vectors and power allocation for reconfigurable intelligent surface (RIS)-based applications. Lin et al. [23] formulated solutions for the joint design and optimization of beamforming for hybrid satellite-terrestrial relay networks with RIS support, and in [24] proposed another methodology for joint beamforming in mmWave non-orthogonal multiple access (NOMA). Furthermore, the same authors investigated secure energy-efficient beamforming in multibeam satellite systems in [25].
On the other hand, Va et al. [26] proposed a multipath fingerprint database that uses the vehicle's position (for example, as determined by GPS) to obtain information on probable pointing directions for accurate beam alignment. The method uses the power loss probability as a parameter to measure misalignment precision and to enhance candidate beam selection. Moreover, two candidate beam selection techniques are created: one uses a heuristic, and the other aims to reduce the likelihood of misalignment. Cao et al. [27] proposed a latency reduction scheme for relay selection in vehicular networks. In addition, Zhou et al. [28] proposed a DQN-based algorithm to train and determine the optimal receiver beam direction with the purpose of maximizing the average received signal power.
However, there are various drawbacks to designing beamforming vectors with the stated approaches, which rely solely on location data and received signal power. First, narrow-beam systems may not function effectively with position-acquisition sensors like GPS because of their poor precision, which is typically in the range of meters. Second, these technologies cannot handle indoor applications since GPS sensors perform poorly inside structures. In addition, the beamforming vectors depend on the environment's geometry, obstructions, and so on. Furthermore, the received signal can experience severe penetration loss because of the vehicle's metal body. In this paper, we utilize a DRL-based coordinated approach that does not encounter these challenges and exhibits better results.

Contribution
In this paper, for highly mobile mmWave applications, we provide a novel DRL approach for mmWave communication architectures. As part of our suggested method, a coordinated beamforming system is used, in which a number of BSs concurrently serve a single non-stationary user. In this approach, a DRL network exclusively utilizes beam patterns and learns how to predict the BSs' beamforming vectors from the signals received at the distributed BSs. The idea behind this is that the propagated waves jointly received at the distributed BSs constitute a distinctive multi-path signature of both the user position and its surroundings. There are several benefits to the suggested approach. First, the suggested technique can accommodate both LOS and non-LOS (NLOS) scenarios without the need for specialized position-acquiring devices, because beamforming prediction is based on the uplink received signals rather than position data. Second, only received pilots, which may be retrieved with minimal training overhead, are needed for the determination of the best beams. Furthermore, because the DRL model trains on and responds to any environment, it does not need any training before deployment in the suggested system. The proposed model also inherits the coverage and reliability improvements of coordination since it is coupled with the coordinated beamforming mechanism. Even though some DRL-based beamforming solutions exist, to the best of our knowledge, no prior work addressed a coordinated beamforming solution leveraging DRL where multiple BSs jointly serve one single mobile user to achieve the highest possible data rate. The contributions of the proposed beamforming scheme are summarized as follows:
• We develop a simple coordinated beamforming scheme where several BSs employ RF beamforming and are connected to a central cloud processing unit that performs baseband processing, serving one mobile user at a time.
• To increase the platform's effective achievable rate, we define a training and design problem for the central baseband processing and for the BSs' RF beamforming vectors. The trade-off between the beamforming training overhead and the achievable sum rate of the proposed beamforming vectors is taken into account when determining the effective achievable rate for highly mobile mmWave systems.
• For the selected system, we construct a fundamental coordinated beamforming technique that relies on uplink training for creating the RF and baseband beamforming vectors. The BSs choose their RF beamforming vectors from a predetermined codebook as part of this baseline approach. The baseband beamforming is then designed by a central processor to guarantee coherent combining at the user. We demonstrate that the baseline beamforming technique achieves the best attainable rates in a few special but crucial situations.
• We introduce the system operation and machine learning modeling of a unique combined DRL and coordinated beamforming solution. In this approach, we incorporate a reverse autoencoder as the neural network of our DRL model, owing to its capability to handle raw data seamlessly and reproduce the input data as closely as possible, and use it to solve the coordinated beamforming problem. The main concept of the suggested technique is to predict the RF beamforming vectors of the coordinating BSs using just beam patterns, i.e., with very little training overhead. The proposed approach also harvests the coverage and latency gains of coordinated beamforming with minimal coordination overhead, making the method a viable solution for highly mobile mmWave applications.

System Model
In this section, we discuss the chosen frequency-selective coordinated mmWave system and channel models for our DRL-based coordinated beamforming scheme; under this designated system and channel model, the DRL model optimizes the beam selection from a set of candidate beams by utilizing the exploration and exploitation strategy of DRL. We analyze the mmWave-enabled vehicular communication architecture shown in Figure 1, where N BSs concurrently provide service to one mobile station (MS). Each BS is equipped with M antennas and is linked to a central processing unit in the cloud. In the interest of simplicity, we assume that each BS utilizes analog-only beamforming with networks of phase shifters and has a single RF chain [29]. In this paper, we assume that the MS is equipped with a single antenna.
The signals are precoded using an N × 1 digital precoder f_k ∈ C^{N×1} for subcarrier k, k = 1, ..., K. The frequency-domain signals are then converted into the time domain using N K-point inverse fast Fourier transforms (IFFTs). Afterward, each BS n performs time-domain analog beamforming and transmits the resulting signal. At the receiver end, the received signal is converted to the frequency domain using a K-point FFT, presuming perfect synchronization of frequency and carrier offset. The received signal at the kth subcarrier at the nth BS is

y_{k,n} = h_{k,n} x_{k,n} + n_k, (1)

where x_{k,n} is the transmitted complex baseband signal, h_{k,n} is the M × 1 channel vector between the MS and the nth BS, and n_k ∈ C^{M×1} is the received noise at the BS with independent and identically distributed (i.i.d.) circularly symmetric additive white Gaussian noise (AWGN) entries of zero mean and variance σ². We consider an L-clustered geometric wideband model for our mmWave cellular channel [30-32]. Each cluster l, l = 1, ..., L, is assumed to contribute one ray with a temporal delay τ_l ∈ R and azimuth/elevation angles of arrival (AoA) θ_l, φ_l. Let p(τ) be a pulse-shaping function for T_S-spaced signaling evaluated at τ seconds, and let ρ_n signify the path loss between the user and the nth BS [33]. The delay-d channel vector h_{d,n} between the user and the nth BS in this model can be expressed as

h_{d,n} = sqrt(M / ρ_n) Σ_{l=1}^{L} α_l p(d T_S − τ_l) a_n(θ_l, φ_l), (2)

where α_l is the gain and a_n(θ_l, φ_l) is the array response vector (θ_l = azimuth angle, φ_l = elevation angle) of the nth BS. Considering the delay-d channel in (2), the frequency-domain channel vector h_{k,n} for subcarrier k can be formulated as

h_{k,n} = Σ_{d=0}^{D−1} h_{d,n} e^{−j(2πk/K)d}. (3)

Our adopted block-fading channel model {h_{k,n}}_{k=1}^{K} is considered to remain constant throughout the channel coherence time, which depends on the user mobility and the channel multi-path components [34].
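The clustered geometric channel model above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's simulator: the half-wavelength UPA geometry, the nearest-tap pulse shaping, and keeping D = K delay taps are all simplifying assumptions of this sketch.

```python
import numpy as np

def upa_response(M_h, M_v, theta, phi):
    """Array response a_n(theta, phi) of an M_h x M_v uniform planar array,
    assuming half-wavelength element spacing (an assumption of this sketch)."""
    m_h = np.arange(M_h)
    m_v = np.arange(M_v)
    a_h = np.exp(1j * np.pi * m_h * np.sin(theta) * np.cos(phi))
    a_v = np.exp(1j * np.pi * m_v * np.sin(phi))
    # Normalized so that ||a|| = 1.
    return np.kron(a_v, a_h) / np.sqrt(M_h * M_v)

def freq_channel(gains, delays, thetas, phis, M_h, M_v, K, Ts, rho):
    """Frequency-domain channel h_k for all K subcarriers, following (2)-(3):
    build delay-d taps from the L clusters, then apply a K-point DFT.
    Nearest-tap rounding stands in for the pulse-shaping function p(.)."""
    M = M_h * M_v
    D = K  # number of delay taps kept (assumption)
    h_d = np.zeros((D, M), dtype=complex)
    for al, tau, th, ph in zip(gains, delays, thetas, phis):
        d = int(round(tau / Ts)) % D
        h_d[d] += np.sqrt(M / rho) * al * upa_response(M_h, M_v, th, ph)
    # h_k = sum_d h_d * exp(-j 2 pi k d / K)  -- Equation (3)
    k = np.arange(K)[:, None]
    d = np.arange(D)[None, :]
    F = np.exp(-2j * np.pi * k * d / K)
    return F @ h_d  # shape (K, M)
```

For a single LOS-like cluster with zero delay, every subcarrier sees the same steering vector, which matches the block-fading intuition in the text.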

Coordinated Beamforming
In this section, we introduce a baseline DRL coordinated beamforming approach for a highly mobile vehicular mmWave communication system, as shown in Figure 2. To present the proposed solution, we first describe the problem formulation and then derive the novel DRL-based approach for beamforming. We also present the environment setup, dataset generation, simulation parameters, and performance analysis for our proposed scheme.

Problem Statement
For a vehicular mmWave-based 5G network, serving any user or MS is challenging because of the dynamic and varying environment characteristics. When signal interference, fading effects, and network congestion are considered, which we subsequently describe as the environment dynamics [35], it becomes much more complicated to serve the receiver end while maintaining enhanced mobile broadband (eMBB), massive machine-type communication (mMTC), and ultra-reliable low-latency communication (URLLC) standards. Considering the time-varying environment of wireless communication, a DRL-based beamforming scheme is most appropriate. DL-based approaches struggle to show promising results in such time-varying environments because they lack the ability to learn from good or bad actions. To achieve the highest sum rate, reduce the overhead, and handle the large RF beamforming vector arrays, an adaptive beam selection approach such as DRL is best suited for this specific task. With this motivation, in this paper we exploit DRL's capability of tackling varying environments to maximize the achievable data rate by selecting the optimal beam for mmWave vehicular networks in a coordinated approach.
In this paper, considering a set of beamforming vectors {f^BF_n}_{n=1}^{N}, our focus is to formulate a beam selection matrix that optimizes the achievable downlink rate of the mmWave vehicular beamforming system. The user's maximum achievable rate can be derived as

R = max_{{f^BF_n}} (1/K) Σ_{k=1}^{K} log₂(1 + SNR |Σ_{n=1}^{N} h_{k,n}^T f^BF_n|²), (4)

where each f^BF_n is selected from the predetermined beamforming codebook.
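The rate objective above can be evaluated directly, and doing so exhaustively over the codebook gives the high-complexity baseline that the DRL agent is designed to avoid. A minimal sketch, assuming per-BS independent beam search and equal-power joint combining (both simplifications for illustration):

```python
import numpy as np

def achievable_rate(H_list, f_list, snr):
    """Effective rate (bps/Hz) when N BSs jointly serve one user:
    mean over subcarriers of log2(1 + SNR * |sum_n h_{k,n}^T f_n|^2).
    H_list[n] is a (K, M) channel matrix; f_list[n] is an M-dim beam."""
    combined = sum(H @ f for H, f in zip(H_list, f_list))  # length-K vector
    return float(np.mean(np.log2(1 + snr * np.abs(combined) ** 2)))

def best_beams_exhaustive(H_list, codebook, snr):
    """Per-BS exhaustive codebook search: the classical high-complexity
    beam selection baseline (one pass per BS, independent of the others)."""
    choices = []
    for H in H_list:
        rates = [np.mean(np.log2(1 + snr * np.abs(H @ f) ** 2))
                 for f in codebook]
        choices.append(int(np.argmax(rates)))
    return choices
```

With V = 64 candidate beams and N = 4 BSs, this search costs N·V rate evaluations per coherence block, which motivates learning the selection instead.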

DRL-Based Coordinated Beamforming Framework
We propose a DRL framework that utilizes a DQN to train and optimize the beam selection assignment. Typically, the DQN technique consists of an environment and an agent using a deep neural network (DNN). The agent, which corresponds to the BS in this study, engages with the environment before performing any action. In the beginning, the agent starts exploring the environment, moving from one state to another, since at that point it still lacks information about the environment. As the agent explores the environment, it gathers information and starts to take actions by exploiting the environment with the help of the reward function. At any timestep t, if the current state is S_t, the agent receives an immediate reward R_t assessing the performed action A_t using the DNN. The agent also receives the next state S_{t+1} from the environment in the same timestep.
Depending upon the performed action A_t, the agent receives a reward R_t. If the action taken achieves a reasonable sum rate, then the agent also receives a good R_t. The agent gains knowledge of its surroundings and develops an ideal beam selection strategy by foreseeing future events. The DNN learns this policy π at each timestep as it continues through subsequent timesteps. We formulate our state, action, and reward functions as follows:
• State: We utilize the channel matrices of all the BSs as the state of our environment. The complex channel matrices are constructed incorporating the bandwidth, user position, noise figure, and noise power. If the environment has Z states, each having V beams, then the state space of size Z × V can be represented as S = {S_1, S_2, S_3, ..., S_Z}.
• Action: The goal of the agent is to assign a serving beam from the action space A. In each episode for a set of S, the agent has to take Z ∈ A actions while maintaining one action per V elements of S. Out of the Z × V candidates, the target of the agent is to choose a beam that maximizes the data rate.
• Reward: In our reward function, we first derive the data rate for each channel as follows.
The data rate of the chosen channel-beam pair is

R_t = log₂(1 + SNR |h^T f|²), (5)

where h is the channel of the current state and f is the selected beam. For every action the agent takes, we calculate the data rate of the chosen action and feed it back as the reward value. Our aim is to acquire the highest possible cumulative reward R_max as the agent obtains a reward for each action, according to

R_max = max_π Σ_t γ^t R_t, (6)

where γ ∈ [0, 1] is the discount factor. With this state, action, and reward function, we propose the DNN architecture as the policy controller for the beam selection, as shown in Figure 3. The DNN takes the place of the Q-table and calculates the Q-values for each state-action pair. Deriving probabilities for each beam selection in each state space is the primary objective of the DNN, and this probability can be defined by Q(S, A) of the DQN algorithm. We select the best beam out of V = 64 candidate beams, in coordination with 4 BSs.
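The state-action-reward loop above can be condensed into a toy environment step. This is an illustrative sketch only: the cyclic state transition and the scalar-channel reward are assumptions, not the paper's full simulator.

```python
import numpy as np

def reward(h, f, snr=1.0):
    """Immediate reward R_t: the data rate of beam f on channel h,
    as in the rate expression used for the reward function."""
    return float(np.log2(1 + snr * np.abs(h @ f) ** 2))

def step(state, action, channels, codebook, snr=1.0):
    """One transition of the beam-selection MDP: in state S_t the agent
    picks beam index `action`; the environment returns R_t and S_{t+1}.
    Cycling through the Z states in order is an assumption of this sketch."""
    r = reward(channels[state], codebook[action], snr)
    next_state = (state + 1) % len(channels)
    return r, next_state
```

Picking the beam aligned with the channel yields the maximal reward, while an orthogonal beam yields zero rate, which is exactly the signal the DQN learns from.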

Reverse Autoencoder
An autoencoder is a neural network that can be trained to reconstruct its input [36]. It is a particular kind of neural network that is primarily designed to compress and meaningfully represent the input before decoding it back so that the reconstructed input is as similar to the original as possible [37]. Moreover, the autoencoder can handle raw input data without any difficulties and is viewed as a component of the unsupervised learning family [38]. The autoencoder consists of three main components: encoder, code, and decoder. In addition, in autoencoders, the number of neurons decreases as we go deeper into the hidden layers; in a reverse autoencoder, however, it increases [39]. We resort to the newly introduced reverse autoencoder for the DNN segment of our DQN model.
In the encoder, the input layer starts with 2^c neurons, and the subsequent hidden layers have 2^(c+p) neurons, where p refers to the position of the layer. In this paper, we start the hidden layers with c = 5 and use c + p = 9 for the code layer. The decoder portion ends with an output layer and is the exact mirror of the encoder portion. Because the layers are stacked one on top of the other, this form of structure is referred to as a stacked autoencoder. Additionally, each layer in the autoencoder has its own ReLU activation function.
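The layer-sizing rule above is easy to make concrete. A small sketch, assuming the widths grow by powers of two from 2^c up to the 2^(c+p) = 2^9 code layer and then mirror back down (the exact widths used in the paper are not spelled out beyond this rule):

```python
def reverse_autoencoder_sizes(c=5, code_exp=9):
    """Layer widths for the stacked reverse autoencoder: the encoder
    starts at 2^c neurons and each deeper layer doubles, up to the
    2^code_exp code layer; the decoder mirrors the encoder."""
    encoder = [2 ** (c + p) for p in range(code_exp - c + 1)]  # 32 .. 512
    decoder = encoder[-2::-1]                                   # 256 .. 32
    return encoder + decoder
```

With c = 5 and a code layer at 2^9, this gives the widening-then-narrowing profile 32-64-128-256-512-256-128-64-32, the opposite of a conventional bottleneck autoencoder.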

Performance Evaluation
In this section, we evaluate the proposed DRL-based coordinated beamforming approach in different case studies by comparing it with a traditional DL architecture [15]. In a multicell mmWave mMIMO downlink scenario, a large uniform planar array (UPA) is installed on each BS. In this paper, we select 4 BSs with 32 × 8 UPAs, resulting in M = 256 antennas for each BS.
For our methodology, we used the popular publicly available DeepMIMO dataset [40] generated by Wireless InSite [41]. The dataset contains the generated beamforming vectors, or predetermined codebooks, denoted as f^BF_n. Along with f^BF_n, we also use the corresponding channel matrices, denoted h_{k,n} in our proposed scheme, for defining the states and rewards discussed in Section 3.2 when selecting the optimal beamforming vectors.
We used an outdoor scenario of two streets and one intersection with mmWave communication operating at 60 GHz. We aim to serve the MS with the best beam in coordination with 4 BSs. In this adopted scenario, the 4 BSs are mounted on top of 4 lamp posts to concurrently provide beam coverage for one MS. The lamp posts are spaced 60 m apart along the street. Every BS is installed at a height of 6 m and has 32 × 8 antenna elements. The MS has a single antenna mounted on top of the vehicle. During uplink training, we assumed a transmit power of 30 dBm for the MS. The adopted DeepMIMO parameters for dataset generation and the simulation parameters used in this work are summarized in Tables 1 and 2, respectively.

Training
The proposed DNN is gradually trained using a set of training data in each episode. For every state space S, the state-action pair is formulated using the ε-greedy policy in accordance with the output probabilities of the DNN. An episode is considered complete when all state spaces have been processed by the DNN. For every state space, the exploitation policy [42], or the policy for taking an action, can be represented as

A_t = argmax_A Q(S_t, A; w) with probability 1 − ε, or a random action from A with probability ε. (7)

After executing A_t, the agent receives the reward according to (5) and the next state space S_{t+1}. Afterward, we first determine the loss and then tweak the DNN's parameters using back-propagation to train our model. To compute the loss, we take an approximation of the optimal Q*-values for each state-action pair of S_{t+1} from a separate DNN termed the target DNN [43]. The target DNN is identical to the policy DNN and is initialized with the policy DNN's parameters. Consequently, we use the next state space S_{t+1} as the input of the target DNN, and the agent greedily chooses the optimal Q*-values from its output. We add an experience replay memory (ERM) to the DQN to help the optimal policy converge more steadily [44]. The agent first explores its environment while saving its current states, actions, rewards, and next states (S_t, A_t, R_t, S_{t+1}) as tuples in the ERM. The agent then trains the policy DNN using a small batch of tuples from the ERM. The ERM continues to be updated with each training set of data. We summarize the system architecture and working principles of our model in Figure 4 and Algorithm 1.
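The ERM and target-network machinery can be sketched compactly. In this sketch the target network is stood in for by a plain Q-table keyed by state (an assumption for brevity); in the actual model it is a copy of the policy DNN.

```python
import random
from collections import deque
import numpy as np

class ReplayMemory:
    """Experience replay memory (ERM): stores (S_t, A_t, R_t, S_{t+1})
    tuples and serves uniform random mini-batches for training."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buf, min(batch_size, len(self.buf)))

def td_targets(batch, target_q, gamma=0.99):
    """Q*-targets from the frozen target network (here a dict of per-state
    Q-value arrays standing in for the target DNN):
    y = R_t + gamma * max_a Q_target(S_{t+1}, a)."""
    return [r + gamma * float(np.max(target_q[s_next]))
            for (_, _, r, s_next) in batch]
```

The loss is then computed between the policy network's Q(S_t, A_t) and these targets, and the target network's weights are refreshed from the policy network after the episode's time steps, as in Algorithm 1.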
In the training phase of our model, we used the Adam optimizer [45] with a learning rate of 0.0005. The DRL model minimizes the training error of the DNN using the Smooth L1 loss function [46,47]. For a batch of size B, the unreduced loss for two data points (u, w) can be described as

L(u, w) = {l_1, ..., l_B}^T, (8)

where, for any loss instance b ∈ B,

l_b = 0.5 (u_b − w_b)², if |u_b − w_b| < 1; l_b = |u_b − w_b| − 0.5, otherwise.

Algorithm 1 Proposed deep Q-learning algorithm
1: Initialize policy and target DQN with random weights w, w′
2: Initialize ERM and ε
3: for each episode do
4:   for each instance do
5:     Select a channel matrix and add it to action space A_t for present state space S_t
6:     Observe immediate reward R_t and next state space S_{t+1}
7:     Store (S_t, A_t, R_t, S_{t+1}) in the ERM
8:     Form a random mini-batch of (S_t, A_t, R_t, S_{t+1}) samples from the ERM
9:     for each tuple in the mini-batch do
10:      Calculate Q-values
11:      Approximate Q*-values using the target DNN
12:      Compute the loss from Q and Q*
13:      Optimize w of the policy DNN with the Adam optimizer
14:   w′ ← w after all time steps
Ensure: R_r ≈ R_max

Performance Analysis
In this subsection, we evaluate the achieved performance in terms of sum rate and compare our rate with the traditional ML approach. Figure 5 presents the performance analysis of our proposed model with three performance metrics: the effective achievable rate of our DRL approach, that of conventional DL, and the optimal data rate. The utilization of the reverse autoencoder in the DRL model introduces higher computational and time complexity in the learning phase of the model. However, the delays we verified in our simulations are slight in the considered system environment, and we confirmed that the performance degradation due to these delays is insignificant in the simulation results. It is clear that our proposed DRL outperforms the traditional DL model by a large margin and demonstrates near-optimal performance. In this figure, we did not consider any beam training or latency overhead. For vehicular mmWave communication with a mobile user, one of the most significant sources of communication overhead is velocity, because the connectivity between the BS and the user is affected by it. Fast-moving users require fast beam switching from the BS; otherwise, because of the delay, the user might not be served in time as it moves away from its current position.

[Figure 5: achievable rate (bps/Hz) versus predicted beams for the optimal, proposed DRL-based suboptimal, and DL-based schemes.]
In Figure 6, we compare the DRL- and DL-based beamforming performance with the optimal beamforming performance by incorporating overhead. More specifically, we compared our DRL-based achievable sum rate for all three overhead speeds side by side. The performance was very consistent throughout the plot, and the decline in achievable rate due to the increased overhead was negligible. At this stage, we consider the 64-beam training overhead with the coherence time at 40 km/h at first. It is visible that, even though our suboptimal performance experienced a slight decrease, the DRL beamforming achievable rate is still significantly higher than that of the DL approach. We also compared the achievable rate versus different user positions at 80 km/h and 120 km/h in the same Figure 6. The results followed similar trends. Our DRL-based approach outperformed the DL approach by a large margin and demonstrated suboptimal performance. As the user position moved, the achievable rate saw a slight but steady decrease. However, for the traditional DL-based approach, the performance was inconsistent.
We also compare the performance of our proposed DRL scheme under varying SNR, as shown in Figure 7. It demonstrates how the performance of our model varies at two different SNR levels: low SNR at 10 dB and high SNR at 30 dB. The previous results, obtained at an SNR of 38.65 dB, showed higher performance. We confirmed that after the SNR was reduced to 30 dB, the initial performance dropped by 14.27% in terms of the average sum rate for our DRL method. In addition, we illustrate the performance of our DRL model at an SNR of 10 dB in the same figure. It is noticeable that, for another 20 dB drop in SNR, the performance declined by approximately another 38%. Furthermore, in Figure 8, we illustrate the convergence of our proposed algorithm in terms of loss versus the time step t. After approximately 3.2 × 10^6 iterations, we confirmed that our model converged successfully. Overall, the performance of our model rises significantly as the SNR increases. Our proposed DRL architecture is robust and flexible under various conditions, such as different SNRs and velocities.

Conclusions
In this paper, we propose a sub-optimal beam selection scheme based on DRL that enables highly mobile applications in mmWave mMIMO systems. The key idea is to utilize the powerful exploration-exploitation strategy of DRL to derive the optimal beam selection policy by learning the mapping from the omni-received uplink pilots to the sub-optimal beams. Our proposed scheme guarantees achievable sum rate performance close to optimal while requiring only a small training and beam overhead. In addition, the proposed scheme ensures reliable coverage and shorter latency while steering the beam towards the highly mobile mmWave user.
Funding: This work was supported by the research fund from Chosun University, 2022.