Wi-Fi Assisted Contextual Multi-Armed Bandit for Neighbor Discovery and Selection in Millimeter Wave Device to Device Communications

The unique features of millimeter waves (mmWaves) motivate their use in future beyond-fifth-generation/sixth-generation (B5G/6G) device-to-device (D2D) communications. However, the neighbor discovery and selection (NDS) problem still needs intelligent solutions because of the trade-off between probing adjacent devices to find the optimum one and the costly beamforming training (BT) overhead that this probing incurs. In this paper, by making use of standard multiband (μW/mmWave) devices, the mmWave NDS problem is addressed using machine-learning-based contextual multi-armed bandit (CMAB) algorithms. This is done by leveraging the context information of Wi-Fi signal characteristics, i.e., the received signal strength (RSS) and its mean and variance, to further improve the NDS method. In this setup, the transmitting device acts as the player, the arms are the candidate mmWave D2D links between that device and its neighbors, and the reward is the average throughput. We examine the primary NDS trade-off and the impact of the contextual information on the overall performance. Furthermore, modified energy-aware linear upper confidence bound (EA-LinUCB) and energy-aware contextual Thompson sampling (EA-CTS) algorithms are proposed to handle the problem by accounting for the nearby devices' remaining battery levels, which reflects real scenarios. Simulation results confirm the superior efficiency of the proposed algorithms over the single-band (mmWave) energy-aware noncontextual MAB algorithms (EA-UCB and EA-TS) and traditional schemes in terms of energy efficiency and average throughput, with a reasonable convergence rate.


Introduction
The exponential growth of wireless traffic is pushing future communication standards (beyond fifth generation, B5G, and sixth generation, 6G) to shift their operating bands from the crowded sub-6 GHz band to the underused millimeter wave (mmWave) band, i.e., 30-300 GHz. Although mmWave offers a huge available spectrum, large capacity, and the ability to support high data rates and bandwidth-intensive applications, it suffers from several drawbacks that remain the main obstacle to its adoption. Millimeter-wave signals experience harsh path loss, blockage sensitivity, and absorption from the wireless environment due to their short wavelengths [1]. Consequently, directional communication using high-gain antennas and beamforming training (BT) is advocated to overcome the significant attenuation, at the expense of considerable overhead.
LinUCB [20] and CTS [21] CMAB algorithms. In the proposed EA-CMAB algorithms, only the devices whose residual energies are above a specified limit play the game, and the game ends when all devices reach the energy limit. Numerical investigations verify the outstanding performance of the EA-CMAB-based mmWave D2D NDS over both the noncontextual EA-MAB proposed in [4] and traditional techniques. To the best of our knowledge, the current work is the first to propose an ML-based context-aware bandit algorithm for mmWave D2D NDS.
The key contributions of this paper are highlighted as follows. Motivated by the standardized multiband devices, the mmWave D2D NDS optimization problem is modeled as a budget-constrained CMAB. The central device is the player, the arms are the nearby devices, the budget is the adjacent devices' residual energies for constructing the D2D links, and the reward is the throughput obtained from the selected nearby device. Finally, the context is the Wi-Fi information of the nearby devices (arms). We name the resulting algorithms EA-CMAB.
We examine the effect of Wi-Fi contextual information on the overall system performance by leveraging LinUCB [20] and CTS [21] algorithms and compare them with their noncontextual versions, i.e., UCB [22] and TS [23].
We propose EA-CMAB algorithms, e.g., EA-LinUCB and EA-CTS, for addressing the problem. In the proposed algorithms, the adjacent devices' lasting energies are considered while performing online learning for selecting the best device for creating the mmWave D2D link.
Extensive numerical investigations are carried out to evaluate the proposed EA-CMAB-based algorithms in diverse situations and to compare their performance against two standard schemes, namely conventional direct NDS and random selection. Moreover, the proposed algorithms are compared with the noncontextual EA-MAB (EA-UCB and EA-TS) ones presented in [4].
The remainder of this paper is organized as follows. Section 2 reviews the related works. Section 3 introduces the mmWave D2D system model, the utilized Wi-Fi and mmWave link models, the mmWave D2D NDS problem formulation, and the general concept of the CMAB algorithms. Section 4 discusses the proposed EA-CMAB algorithms. Section 5 gives the numerical investigations, followed by the concluding remarks in Section 6. The following table shows the nomenclature used throughout this paper.

Literature Review
MmWave D2D communications hold great promise for meeting the capacity and spectrum efficiency requirements of B5G/6G systems. Comprehensive studies on mmWave and D2D related aspects such as NDS, interference management, and network security are provided in [2,5,24], respectively. Furthermore, a comprehensive survey on D2D device discovery is provided in [25]. Designing an efficient NDS algorithm for mmWave D2D networks is more challenging due to the use of high-gain directional antennas and the BT overhead. In [26], a novel D2D neighbor discovery algorithm is presented that uses the idea of necklaces to reduce the worst-case discovery latency compared with former methods. Specifically, the authors leveraged Pólya's enumeration theorem and the Fredricksen, Kessler and Maiorana (FKM) algorithm to find shorter and more effective scanning sequences for the nodes. However, the paper focused only on the delay time and did not address maximizing the accumulated reward.
A novel distributed algorithm using stochastic geometry tools, which enables the devices to choose between the mmWave and µW bands for transmitting data by discovering unblocked mmWave LOS links, was proposed in [27]. In contrast, our proposed ML-based algorithms rely on mmWave band communications with side information from Wi-Fi and use mmWave for the whole data communication, unlike [27], which switches between Wi-Fi and mmWaves. A novel cross-technology communication-based technique for neighbor discovery called NewBee, which coordinates Wi-Fi nodes to help the neighbor discovery (ND) of Zigbee nodes, is suggested in [28]. However, the authors considered neither mmWaves nor ML solutions in their proposal. In [29], a compressed-sensing related FastND algorithm that speeds up the ND process by dynamically learning the spatial channel characteristics is discussed. Although the authors built a successful practical experimental setup, their algorithm still performs direct NDS with nearby devices and does not choose the best nearby device as in our case. In [30], the authors proposed a clustering scheme that splits the network nodes into clusters. Each cluster is assigned one separate control channel and a particular mmWave channel for beamforming only. In [31], the authors suggested exploiting the context information associated with user position, handled by a separate control channel, to improve the cell discovery process while minimizing its time delay. The schemes in [30] and [31] need an extra control channel, which increases the ND overhead, unlike our proposal, which does not require any extra control channel. Employing linear programming, the authors of [32] proposed a distributed random mmWave-based discovery algorithm, where each device finds the relevant algorithm parameters, i.e., transmission and beam steering probabilities, using the information provided from the microwave band. However, they did not consider best neighbor selection, and linear programming has high complexity, especially for numerous adjacent devices. Another context-aware approach is provided in [33], where new cell discovery supported by the context information obtained from geo-located databases in heterogeneous mmWave networks was proposed. However, the authors did not consider the mmWave D2D scenario, and their method requires access to a previously established database, which might not be updatable and takes considerable manual effort to construct. A hunting-based directional neighbor discovery (HDND) technique for mmWave-based ad hoc networks is presented in [34]. However, it neither considers the D2D scenario nor applies advanced ML techniques.
Recently, MABs have attracted significant attention in numerous sequential decision-making-based applications, especially in wireless networks [5,35-38]. In [5], we surveyed the applications of ML algorithms to different D2D communication challenges including NDS, resource allocation, power control, etc. To confirm the efficiency of ML in addressing these problems, we presented a case study applying the UCB and minimax optimal stochastic strategy (MOSS) algorithms to the mmWave NDS problem. However, the applied algorithms were neither contextual nor energy aware. The authors of [4] leveraged stochastic bandit algorithms to solve a similar problem by accounting for the nearby devices' battery levels. E-UCB1, energy-aware Kullback-Leibler UCB (E-KLUCB), and E-TS were proposed with improved system performance. Moreover, in [38] we extended the problem solution using the E-MOSS algorithm. Different from our previous works in [4,5,38], which handled mmWave NDS using noncontextual MABs, in the current work we reformulate the problem using contextual MABs while leveraging the Wi-Fi information as context. We will demonstrate the effectiveness of the proposed contextual algorithms over the noncontextual ones, owing to the valuable Wi-Fi contextual information. The authors of [39] proposed an adaptive TS (ATS) algorithm for mmWave beam alignment. ATS can precisely evaluate the best beam/rate pair without assuming any channel settings or user mobility. However, their main contribution was in beam alignment, not D2D NDS.
Contextual bandits have been applied in fundamental areas of wireless communications [40] such as machine type communications (MTC) [41], cooperative communications [42], link adaptation [43], and wireless handover optimization [44]. This motivates us to leverage CMAB to solve the critical D2D NDS problem, especially given the challenging difficulties of mmWaves. Although some related works include context-aware algorithms, this paper is inspired by the merits of the Wi-Fi signal, such as its ease of acquisition with low latency and its relationship to the mmWave signal strength.

System Model
This section presents the considered system model and the utilized Wi-Fi and mmWave link models, including the mmWave blockage model. Moreover, the optimization problem of mmWave D2D NDS is formulated, followed by a brief discussion of the CMAB concept.

Multiband D2D Network Architecture
Figure 1 shows the layout of the multiband (mmWave/Wi-Fi) D2D communication network, where multiband devices, like the QUALCOMM and Intel triband devices [13,14], are uniformly located within the zone allocated to a 4G/5G LTE-based base station (BS) (e.g., a femtocell). Multiband D2D connections can enhance the BS coverage and its traffic offloading. The 4G/5G LTE BS delivers the necessary signaling to supervise the mmWave D2D communication operation, including the devices' remaining energies and transmission characteristics. Moreover, it handles D2D broadcasting demands, switching between cellular and D2D modes, mobility management, and network caching. The processing of the individual D2D links, including NDS, is therefore carried out by the distributed devices. In a conventional direct NDS scheme, the central device explores its adjacent devices thoroughly by repeatedly performing exhaustive search BT with all the surrounding devices to obtain the best transmit/receive (TX/RX) beam pairs for reliable linking. This is performed by accounting for both LOS and non-LOS (NLOS) paths caused by obstructions, see Figure 1. Subsequently, the nearby device that offers the highest data rate in Gigabit per second (Gbps) is chosen for the mmWave D2D link setup. The conventional NDS scheme requires a considerable BT overhead, which profoundly degrades the mmWave D2D network performance.

Furthermore, most existing NDS schemes neglect the remaining energies of the adjacent devices while carrying out NDS. That is, the selected device may not have enough energy to perform the D2D functionality. Instead, in this paper, we make use of the Wi-Fi information in initializing the mmWave D2D NDS procedure. The strong relationship between Wi-Fi and mmWave link statistics, as reported in [3,14-18], along with the previous works of [45-47] that efficiently used Wi-Fi information to handle mmWave challenges, inspired us to use Wi-Fi information as context. Thus, CMAB is well suited to this problem, and it can also reflect the residual energies of the nearby devices. In our scenario, the mmWave devices are usually stationary or moving slowly, at around walking speed. Hence, device mobility is left for future studies.

Wi-Fi Linkage Model
Regarding the Wi-Fi model, we utilize the link model provided in [3,16,17], where the Wi-Fi received power $P^w_r$ at a separation distance $r$ between two devices operating at 5.25 GHz (Wi-Fi band) is formulated as [3]: where $P^w_t$ and $P^w_r$ are the transmitted and received Wi-Fi powers in dBm, respectively. The path loss exponent is $\eta_w = 2.32$, and $\chi_w \sim \mathcal{N}(0, \sigma_w)$ is the Wi-Fi log-normal shadowing with zero mean and 6 dB standard deviation, i.e., $\sigma_w = 6$ dB [3].
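As a minimal sketch of a log-distance model of this kind, the following Python function draws one Wi-Fi received power sample. The path loss exponent (2.32) and the shadowing standard deviation (6 dB) are taken from the text. The reference loss `ref_loss_db`, the reference distance `r0`, and the function name are hypothetical placeholders chosen for illustration; they are not the constants of the model in [3].

```python
import math
import random


def wifi_rx_power_dbm(p_tx_dbm, r, ref_loss_db, r0=1.0,
                      eta_w=2.32, sigma_w=6.0, rng=None):
    """Sample a Wi-Fi received power in dBm under a log-distance model.

    ref_loss_db and r0 are illustrative assumptions, not values from [3].
    """
    if r <= 0 or r0 <= 0:
        raise ValueError("distances must be positive")
    rng = rng or random
    # Zero-mean Gaussian shadowing in dB (log-normal in linear scale).
    shadowing = rng.gauss(0.0, sigma_w) if sigma_w > 0 else 0.0
    path_loss = ref_loss_db + 10.0 * eta_w * math.log10(r / r0)
    return p_tx_dbm - path_loss - shadowing
```

Setting `sigma_w=0` removes the random term, which is convenient for checking the deterministic part of the model.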

mmWave Linkage and Blockage Models
For the mmWave model, the mmWave received power, $P^m_r$, taking into account the beamforming gain and blockage effects, from an adjacent device located at a distance $r$ can be expressed as [3,4]: where $\eta(P_{LOS}(r))$ and $\beta(P_{NLOS}(r))$ are Bernoulli random variables (RVs) that reflect the blockage effect, with parameters $P_{LOS}(r)$ and $P_{NLOS}(r)$ that indicate the distance-dependent LOS and NLOS probabilities, where $P_{NLOS}(r) = 1 - P_{LOS}(r)$. $P^m_t$ is the mmWave TX power, and $\Lambda_{TX}(\vartheta)$ and $\Lambda_{RX}(\phi)$ are the transmitting and receiving beamforming gains as functions of the angle of departure (AoD), i.e., $\vartheta$, and the angle of arrival (AoA), i.e., $\phi$. $L^v_m(r)$, where $v \in \{LOS, NLOS\}$, is the distance-dependent path loss formulated in dB as [3,4]: where $\beta^v_m = 82.02 - 10\eta^v_m \log_{10}(r_0)$ is the reference path loss at the reference distance $r_0 = 5$ m, $\eta^v_m$ is the path loss exponent, and $\chi^v_m \sim \mathcal{N}(0, \sigma^v_m)$ is the log-normal shadowing with zero mean and standard deviation $\sigma^v_m$. Regarding $\Lambda_{TX}(\vartheta)$, the 2D steerable antenna formula with a Gaussian main lobe shape provided in [3,4,6] is utilized, which is modeled as: where $\vartheta$, $\vartheta_{-3dB}$, and $\Lambda_0$ represent the azimuth angle, the −3 dB beamwidth, and the maximum antenna gain, respectively. The same equation is applied for evaluating $\Lambda_{RX}(\phi)$, except that RX and $\phi$ are used instead of TX and $\vartheta$, respectively. For mmWave blockage, we utilize the blockage scenario presented in [48], which is appropriate for both indoors and outdoors. In this scenario, mmWave obstructions are represented as cylinders whose spatial distribution follows a 2D homogeneous Poisson point process (PPP). Hence, $P_{LOS}(r)$ is expressed as [6]: where $g = e^{-\pi\Delta\lambda E[\Omega^2]}$ and $\omega = 2\Delta\lambda E[\Omega]$, $\lambda$ represents the obstacle density, and $\Delta$ and $\Omega$ are the cylinder's thinning factor and radius, respectively. $E[\cdot]$ is the mean operator.
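To illustrate how the blockage and path loss pieces fit together, the sketch below evaluates a LOS probability of the form $g\,e^{-\omega r}$ and a log-distance path loss with reference loss $\beta^v_m$. Both functional forms are assumptions inferred from the definitions of $g$, $\omega$, and $\beta^v_m$ above, not a restatement of the equations in [3,4,6]. The function names and any example parameter values are hypothetical.

```python
import math
import random


def los_probability(r, density, thinning, mean_radius, mean_radius_sq):
    """Assumed form P_LOS(r) = g * exp(-omega * r), built from g and omega."""
    g = math.exp(-math.pi * thinning * density * mean_radius_sq)
    omega = 2.0 * thinning * density * mean_radius
    return g * math.exp(-omega * r)


def mmwave_path_loss_db(r, eta, sigma=0.0, r0=5.0, rng=random):
    """Assumed log-distance path loss in dB, equal to 82.02 dB at r = r0."""
    beta = 82.02 - 10.0 * eta * math.log10(r0)
    shadowing = rng.gauss(0.0, sigma) if sigma > 0 else 0.0
    return beta + 10.0 * eta * math.log10(r) + shadowing


def draw_link_state(r, rng=random, **blockage):
    """Bernoulli draw: 'LOS' with probability P_LOS(r), otherwise 'NLOS'."""
    return "LOS" if rng.random() < los_probability(r, **blockage) else "NLOS"
```

With zero obstacle density, both $g$ and the exponential term equal one, so every link is LOS, which matches the no-blocking case ($\lambda = 0$).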

mmWave D2D NDS Problem Modeling
The main aim of the mmWave D2D NDS process is to maximize the D2D link's long-term average throughput/reward while considering the remaining battery levels of the distributed nearby devices. This maximization problem is outlined as: where $N$ specifies the number of adjacent devices. $\Psi_{i,t}$ reflects the D2D link throughput in Gbps with adjacent device $i$ at round $t$. Here, $t$ points to the time instant of the mmWave D2D link request. In the NDS process, the next round comes when new frames need to be sent. More precisely, the central device's data is fragmented into frames, and at every frame duration an NDS decision is taken to select the most appropriate nearby device to transmit its data. $\Xi_{i,t}$ reflects the remaining energy of adjacent device $i$ at instant $t$ in joules, and $\Xi_{limit}$ defines the energy threshold that the device reserves for keeping its primary activities. $X_{i,t}$ is the Wi-Fi context information vector of length $d$ for device $i$ at time $t$. $\Psi_{i,t}$ is given as: where $W_m$ designates the utilized mmWave bandwidth, $T_D$ is the required time for data transmission, and $T_{BT}$ represents the BT time consumed by the central device to explore one of its adjacent devices. $V_t$ reflects the number of adjacent devices performing BT with the central device at instant $t$. Hence, $V_t$ always equals $N$ in the conventional direct NDS scheme. $Y_{i,t}$ represents the D2D link's SE in bps/Hz related to adjacent device $i$ at instant $t$, formulated as: where $P^m_{r_{i,t}}$ is the mmWave power received by nearby device $i$ at instant $t$, and $N_0$ reflects the receiver's noise power. Assume that each arm has a feature vector $X_{i,t} \in \mathbb{R}^d$, which is the Wi-Fi information in our case, expressed as: where $[\cdot]^T$ means transpose, and $P^w_{r_{i,t}}$ is the instantaneous received Wi-Fi power at nearby device $i$ from the central device at time $t$. $E[P^w_{r_{i,t}}]$ and $\mathrm{var}(P^w_{r_{i,t}})$ are its average value and variance up to instant $t$.
CMAB adopts the concept that the expected reward of an arm $i$ is linear with respect to its feature vector. Thus, to implement the proposed algorithm, the expected reward of arm/device $i$ is assumed to be linear in its $d$-dimensional context feature vector $X_{i,t}$ with unknown coefficient vector $\theta^*_i$ for all $t$, which is given as [20,21]: The CMAB game aims to estimate $\theta^*_i$ given $X_{i,t}^T$ through successive online training.
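The three quantities described above (context vector, spectral efficiency, and linear expected reward) can be sketched in Python as follows. This is an illustration under stated assumptions: the SE is taken to be the Shannon form $\log_2(1 + P/N_0)$, the running variance is the population variance, and the class and function names are hypothetical.

```python
import math


class WifiContext:
    """Tracks [instantaneous, running mean, running variance] of Wi-Fi RSS (dBm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations (Welford's method)

    def update(self, rss_dbm):
        """Add one RSS sample and return the 3-element context vector."""
        self.count += 1
        delta = rss_dbm - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (rss_dbm - self.mean)
        variance = self._m2 / self.count  # population variance (an assumption)
        return [rss_dbm, self.mean, variance]


def spectral_efficiency(p_rx_dbm, noise_dbm):
    """Shannon spectral efficiency in bps/Hz from powers given in dBm (assumed form)."""
    snr_linear = 10.0 ** ((p_rx_dbm - noise_dbm) / 10.0)
    return math.log2(1.0 + snr_linear)


def expected_reward(context, theta):
    """Linear reward model: inner product of context and coefficient vector."""
    return sum(x * w for x, w in zip(context, theta))
```

Welford's update is used so that the mean and variance can be refreshed each round without storing the full RSS history.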

CMAB Concept
To solve the optimization problem in (6), we leverage a suitable type of bandit called CMAB, in which the player accumulates rewards by taking actions (selecting arms) over a sequence of trials. During each round, the player takes an action based on both the context (feature vector) of the current round and the rewards collected in the previous trials. The player observes the reward only for the chosen arm. CMAB appears in several vital applications such as online recommendations, mobile health applications, and clinical trials [43]. The use of features to encode context comes from supervised ML, while exploration is vital for improving the learning performance, as in reinforcement learning (RL). Hence, CMAB sits midway between supervised learning and RL [49]. Usually, the CMAB problem is solved by assuming a linear relationship between the produced reward and its related contexts, as given in (10), and is addressed by the LinUCB [20] and CTS [21] algorithms.
The standard CMAB problem can be formulated as follows. Let $\mathcal{A} = \{1, \ldots, N\}$ be the set of $N$ existing independent devices/arms. Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a set of $d$-dimensional context vectors that depict players/devices and their surroundings, i.e., each member is a binary vector encoding features such as arm locations, decisions, pursuits, etc. For each round $t \in [1, T]$ and each arm $i \in \mathcal{A}$, the context vector, $X_{i,t} \in \mathcal{X}$, is given to the algorithm by the environment to select an arm. Assume that $rw_t = (rw_{1,t}, \ldots, rw_{N,t})$ is the reward vector at trial $t$, where $rw_{i,t}$ is the reward collected by selecting arm/device $i$ at round $t$, which follows some unknown Gaussian distribution in our case. $\theta_i$ is an unknown coefficient vector (to be learned) related to arm $i$ at round $t$. An assumption is made that the expected reward of an arm/device $i$ at trial $t$ is linearly related to the $d$-dimensional context vector $X_{i,t}$, as given in (10). The general CMAB protocol is summarized in Figure 2.
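The round structure described above can be sketched as a short Python loop. This is only an illustration of the generic protocol, not a reproduction of Figure 2; `run_cmab`, `RandomPolicy`, and the callback names are hypothetical. Any policy object that offers `select` and `update` methods can be plugged in.

```python
import random


def run_cmab(policy, get_contexts, get_reward, horizon):
    """Generic CMAB loop: observe contexts, pick an arm, see only its reward."""
    total = 0.0
    for t in range(1, horizon + 1):
        contexts = get_contexts(t)       # one context vector per arm
        arm = policy.select(contexts)    # decision uses contexts and history
        reward = get_reward(arm, t)      # bandit feedback: chosen arm only
        policy.update(arm, contexts[arm], reward)
        total += reward
    return total / horizon               # average reward over the horizon


class RandomPolicy:
    """Baseline that ignores contexts, similar to a random selection scheme."""

    def __init__(self, n_arms, seed=0):
        self.n_arms = n_arms
        self.rng = random.Random(seed)

    def select(self, contexts):
        return self.rng.randrange(self.n_arms)

    def update(self, arm, context, reward):
        pass  # nothing to learn
```

For example, `run_cmab(RandomPolicy(3), lambda t: [[0.0]] * 3, lambda arm, t: 1.0, 10)` averages a constant reward of one over ten rounds.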

Proposed EA-CMAB Algorithms
Herein, we discuss two proposed EA-CMAB algorithms that handle mmWave D2D NDS efficiently. Our setting concerns single-player CMAB, and the multiplayer CMAB scenario is left for future investigations. First, we explain the device's battery update equation, followed by the proposed EA-LinUCB and EA-CTS algorithms. At every round $t$, the proposed CMAB algorithm selects a nearby device, $i^*_{CMAB}$, whose updated residual energy, $\Xi_{i^*_{CMAB},t}$, is given by: where $\Xi_{i^*_{CMAB},t-1}$ is its remaining energy at instant $t - 1$. The expression reflects the energy consumed by the selected nearby device $i^*_{CMAB}$ to fetch the necessary $L_D$ data bits at a data rate of $W_m Y_{i^*_{CMAB},t}$ bps. An assumption is made that all devices have equal transmit powers to fetch data. The modified algorithms take into account the remaining energy levels of the nearby devices during their arm selection. This is done by appending the energy term, $\rho$, to the main exploration part of each algorithm to reflect the real scenario where some devices may run out of energy and be excluded from the game. The added term balances the obtained throughput against the consumed energy of each selected device. Hence, the EA-CMAB algorithms favor devices that combine high remaining energy with large throughput. Therefore, the algorithms do not get stuck on the lowest-energy device or on the highest-throughput device alone.
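A minimal sketch of the battery bookkeeping described above is given below. It assumes that the consumed energy equals the transmit power multiplied by the airtime $L_D / (W_m Y)$; this power-times-time form, the function names, and the parameter names are illustrative assumptions rather than the exact update equation.

```python
def update_energy(energy_j, data_bits, bandwidth_hz, se_bps_hz, tx_power_w):
    """Residual energy (J) after sending data_bits at rate bandwidth * SE."""
    rate_bps = bandwidth_hz * se_bps_hz
    if rate_bps <= 0:
        raise ValueError("data rate must be positive")
    airtime_s = data_bits / rate_bps
    # Assumed model: consumed energy = transmit power * transmission time.
    return energy_j - tx_power_w * airtime_s


def eligible_arms(energies_j, limit_j):
    """Indices of devices whose residual energy exceeds the threshold."""
    return [i for i, e in enumerate(energies_j) if e > limit_j]
```

A lower spectral efficiency means a longer airtime for the same data length, so a distant device drains more energy per frame than a nearby one.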

Proposed EA-LinUCB Algorithm
LinUCB [20] extends Auer's UCB algorithm in [10,22] to the contextual setting. Its main idea is to estimate each arm's expected reward by finding a linear relationship between the previous rewards of the arm and its current context vector, as given in (10). LinUCB expresses the feature vector of the current round as a linear combination of feature vectors seen in former rounds and uses the calculated coefficients and the rewards of earlier rounds to calculate the anticipated reward of the present round. Let $G_i$ be an $m \times d$ matrix at trial $t$, whose rows represent the $m$ contexts observed previously for arm/device $i$. Applying ridge regression to the training data $(G_i, b_i)$ gives an estimate of the coefficients, $\hat{\theta}_i$: where $b_i = G_i^T c_i$, and $c_i$ is the $m$-dimensional vector whose components are the past observed rewards of arm $i$. When the components of $c_i$ are independent conditioned on the corresponding rows of $G_i$, it can be shown that [20]: where $B_i = G_i^T G_i + I_d$ and $\alpha_{LinUCB} = 1 + \ln(2/\delta_{LinUCB})$ for $\delta_{LinUCB} > 0$. $Y_{i,t}$ is the SE/reward of drawing arm/device $i$ at round $t$, calculated from (8). The above inequality provides a reasonably tight UCB for the expected reward of device $i$. Similar to the UCB arm selection strategy, at each trial $t$, the best arm $i^*_t$ is selected as follows: where $r_{i,t}$ is the distance of device $i$ from the central device at instant $t$. The new term $\rho$ is added to the standard LinUCB equation to reflect the remaining energies of the distributed devices according to their locations relative to the central device. That is, for a constant data length $L_D$ in (11), a faraway device requires higher remaining energy to establish the D2D link owing to the reduction of its attainable data rate, and vice versa. Algorithm 1 provides the detailed steps of the proposed EA-LinUCB algorithm. The inputs are the threshold energy limit, $\Xi_{limit}$, the energy of the adjacent devices at $t = 1$, and the parameter $\alpha_{LinUCB}$. The arms having higher remaining energies than $\Xi_{limit}$ are involved in the game. After applying EA-LinUCB, the parameters are updated for the next round, when new data frames need to be sent by the central device, as given in Algorithm 1.
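The selection and update steps can be sketched in pure Python as follows. The ridge estimate and confidence width follow the standard disjoint LinUCB of [20]. The energy term is NOT the paper's $\rho$: the additive bonus `energy_weight * energy_j` is a hypothetical stand-in used only to show where such a term would enter, and the class and method names are likewise illustrative. The exploration weight `alpha` is passed in rather than derived.

```python
import math


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[k]] for k, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


class EnergyAwareLinUCB:
    """Disjoint LinUCB with an illustrative additive energy bonus."""

    def __init__(self, n_arms, dim, alpha, energy_weight=0.0):
        self.alpha = alpha
        self.energy_weight = energy_weight  # hypothetical stand-in for rho
        # B_i starts as the identity matrix and b_i as the zero vector.
        self.B = [[[float(r == c) for c in range(dim)] for r in range(dim)]
                  for _ in range(n_arms)]
        self.b = [[0.0] * dim for _ in range(n_arms)]

    def index(self, arm, x, energy_j):
        """Upper confidence index of one arm for context x."""
        theta = solve(self.B[arm], self.b[arm])  # ridge regression estimate
        quad = sum(xi * zi for xi, zi in zip(x, solve(self.B[arm], x)))
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(0.0, quad))
        return mean + self.alpha * width + self.energy_weight * energy_j

    def select(self, contexts, energies_j, limit_j):
        """Pick the best arm among devices above the energy threshold."""
        live = [i for i, e in enumerate(energies_j) if e > limit_j]
        if not live:
            return None  # every device has reached the energy limit
        return max(live, key=lambda i: self.index(i, contexts[i], energies_j[i]))

    def update(self, arm, x, reward):
        """Rank-one update of B_i and b_i with the observed reward."""
        for r in range(len(x)):
            self.b[arm][r] += reward * x[r]
            for c in range(len(x)):
                self.B[arm][r][c] += x[r] * x[c]
```

Solving the linear system each round avoids storing an explicit inverse, which is adequate for the small context dimension used here.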

Proposed EA-CTS Algorithm
The fundamental TS policy [23] applies a Bayesian strategy, in which the rewards are assumed to be drawn from a known probabilistic model. A simple prior distribution is assumed for the rewards of each arm based on parameter initialization. Then, within the learning process, the TS strategy updates the rewards' posterior distribution using the collected data in order to draw the arm that is most likely to be optimal. Precisely, at every round $t$, random samples are drawn from the rewards' posterior distributions, and the arm having the highest sample value is chosen. Afterward, the chosen arm's posterior distribution is updated for the upcoming round of arm choice. For CTS, we assume a slightly simpler model of the CMAB protocol given in Figure 2, where the main difference is that we assume that there exists $\theta$ such that $\theta_i = \theta$ for all arms $i \in \mathcal{A}$. The general construction of CTS for the CMAB problem includes the following elements [21]:

2. A prior distribution $P(\theta)$, which is Gaussian in our case.
3. Past observations, $D$, containing (context $X$, reward $Y$) pairs for the previous time steps.
4. $P(Y \mid X, \theta)$, the probability of reward $Y$ given a context $X$ and a parameter $\theta$.
At each round $t$, CTS pulls an arm according to its posterior probability of being optimal. This can simply be done by drawing a sample for each arm from the posterior distributions and selecting the arm with the best sample. Because the reward distribution is Gaussian due to the Gaussian noise, we utilize a Gaussian likelihood function and a Gaussian prior for our EA-CTS. Specifically, assume that the likelihood of reward $Y_{i,t}$ at time $t$, given context $X_{i,t}$, follows the normal distribution $\mathcal{N}(X_{i,t}^T \theta, \partial^2_{CTS})$. Then, if the prior distribution of $\theta$ at time $t$ is $\mathcal{N}(\hat{\theta}_t, \partial^2_{CTS} B_t^{-1})$, the posterior distribution at time $t + 1$ is $\mathcal{N}(\hat{\theta}_{t+1}, \partial^2_{CTS} B_{t+1}^{-1})$ [21]. Our modified algorithm draws a sample $\tilde{\theta}_t$ from the $\mathcal{N}(\hat{\theta}_t, \partial^2_{CTS} B_t^{-1})$ distribution and uses it to select the arm. Herein, we utilize Gaussian-based EA-CTS because of the Gaussian distribution of the reward, as shown in [4]. Algorithm 2 summarizes the main steps of EA-CTS, where the first step is to select the devices with high remaining energies inside the selection range. Then the algorithm draws a $d$-dimensional sample $\tilde{\theta}_t$ from a multivariate Gaussian distribution and solves the arm maximization problem over $i \in \mathcal{A}$. As in EA-LinUCB, the newly added term, $\rho$, reflects the remaining energies of the surrounding devices. The parameters $B$, $f$, $\hat{\theta}$, and $\Xi_{i^*}$ are updated for the next round of selection to send new data frames, as given in Algorithm 2.
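The sampling and selection steps can be sketched in pure Python as shown below, using a shared coefficient vector and a Gaussian posterior $\mathcal{N}(\hat{\theta}, \text{scale}^2 B^{-1})$ as in linear Thompson sampling [21]. As a loudly labeled assumption, the energy term is modeled as a simple additive bonus `energy_weight * energy_j`, which is a hypothetical stand-in and not the paper's $\rho$. The class, method, and parameter names are illustrative.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with a = L L^T for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = a[r][c] - sum(low[r][k] * low[c][k] for k in range(c))
            low[r][c] = math.sqrt(s) if r == c else s / low[c][c]
    return low


def forward_sub(low, rhs):
    """Solve L y = rhs for lower-triangular L."""
    y = []
    for r, row in enumerate(low):
        y.append((rhs[r] - sum(row[k] * y[k] for k in range(r))) / row[r])
    return y


def backward_sub(low, rhs):
    """Solve L^T x = rhs using the lower-triangular factor L."""
    n = len(rhs)
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(low[k][r] * x[k] for k in range(r + 1, n))
        x[r] = (rhs[r] - tail) / low[r][r]
    return x


class EnergyAwareCTS:
    """Linear Thompson sampling with an illustrative additive energy bonus."""

    def __init__(self, dim, scale, energy_weight=0.0, seed=0):
        self.scale = scale                  # posterior std-dev multiplier
        self.energy_weight = energy_weight  # hypothetical stand-in for rho
        self.B = [[float(r == c) for c in range(dim)] for r in range(dim)]
        self.f = [0.0] * dim
        self.rng = random.Random(seed)

    def sample_theta(self):
        """Draw theta from N(B^{-1} f, scale^2 * B^{-1})."""
        low = cholesky(self.B)
        theta_hat = backward_sub(low, forward_sub(low, self.f))  # B^{-1} f
        z = [self.rng.gauss(0.0, 1.0) for _ in self.f]
        noise = backward_sub(low, z)  # L^{-T} z has covariance B^{-1}
        return [m + self.scale * e for m, e in zip(theta_hat, noise)]

    def select(self, contexts, energies_j, limit_j):
        """Pick the arm with the best sampled score among eligible devices."""
        live = [i for i, e in enumerate(energies_j) if e > limit_j]
        if not live:
            return None  # every device has reached the energy limit
        theta = self.sample_theta()

        def score(i):
            mean = sum(x * w for x, w in zip(contexts[i], theta))
            return mean + self.energy_weight * energies_j[i]

        return max(live, key=score)

    def update(self, x, reward):
        """Posterior update: B += x x^T and f += reward * x."""
        for r in range(len(x)):
            self.f[r] += reward * x[r]
            for c in range(len(x)):
                self.B[r][c] += x[r] * x[c]
```

Setting `scale=0` removes the sampling noise, so the policy becomes greedy with respect to the ridge estimate, which is a convenient way to check the update logic.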

Numerical Results
This section presents the numerical simulations conducted to confirm the superior performance of the proposed EA-CMAB-based algorithms, using 10,000 rounds of Monte Carlo (MC) simulations in the MATLAB environment. Every MC round includes randomized device locations, randomized channel properties (mmWave and Wi-Fi related shadowing terms), randomized mmWave blocking patterns drawn from the tested blocking probability, and randomized battery initialization of each distributed nearby device. The proposed algorithms are compared with the most closely related noncontextual solutions [4,5], as well as with two well-known traditional techniques, namely the conventional and random selection schemes. The conventional NDS scheme searches all devices before deciding on the best one, which consumes considerable time and incurs a significant BT overhead. In random NDS, by contrast, the adjacent device is picked randomly from the surrounding devices at every round $t$ to establish the mmWave D2D link. The total average throughput is evaluated by averaging (7) over the game's time horizon $T$. Hence, $V_t = N$ for the conventional selection scheme, while $V_t = 1$ for the proposed CMAB algorithms and the random scheme. The EE is formulated as: where $\Xi_{i,1}$ is the starting energy of device $i$ and $\Xi_{i,T}$ reflects its final energy when the game is terminated. Table 1 summarizes the related simulation parameters, where around 20 to 100 devices are uniformly distributed in a region of 125 × 125 m². Moreover, ideal beam alignment is considered between the D2D devices, i.e., $\Lambda_{TX}(\vartheta) = \Lambda_{RX}(\phi) = \Lambda_0$.

Without Battery Consideration
Here, we examine the merits of CMAB algorithms over noncontextual ones in mmWave NDS. Figure 3 shows the average throughput versus the number of distributed devices with no blocking (λ = 0) for the UCB, TS, LinUCB, and CTS algorithms. The two CMAB algorithms perform similarly because of the stationary scenario considered, as shown in [50,51], and because the context vector is Wi-Fi based rather than mmWave based. The noncontextual MAB algorithms (UCB and TS) outperform the conventional and random selection methods. The CMAB algorithms (LinUCB and CTS) perform close to the optimum, where the optimal NDS performance is obtained by selecting the device with the maximum SE at the first attempt, i.e., $V_t = 1$. In the conventional direct NDS scheme, the throughput is inversely related to the number of surrounding devices because the exhaustive BT produces considerable overhead. The other compared schemes (optimal, LinUCB, CTS, TS, UCB, and random) have a small BT overhead because they perform BT with a single device every round. Among them, the random scheme performs worst because of its randomized device selection policy. Notably, the throughputs of the LinUCB and CTS schemes improve as the number of devices increases, because the context vector maximizes the long-term throughput with a small BT overhead. CMAB performance is higher than that of TS and UCB, which indicates the effectiveness of the contextual information. At 40 (80) devices, about 96.3% (97.4%), 94.7% (91%), 82.6% (67.24%), 80.5% (43.1%), and 59.3% (48.3%) of the optimal performance are obtained using the LinUCB/CTS, TS, UCB, conventional, and random schemes, respectively.
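The effect of exhaustive BT on the conventional scheme can be illustrated with simple arithmetic. The sketch below is not the paper's throughput expression (7); it assumes a hypothetical frame model in which every probed device costs one BT slot and the rest of the frame carries data. The per-device BT time of about 0.28 ms is the value quoted in the complexity discussion of this section, whereas the frame length `FRAME_MS` and the function name `data_fraction` are illustrative assumptions.

```python
T_BT_MS = 0.28    # BT time per probed device, about 0.28 ms as quoted in this section
FRAME_MS = 100.0  # hypothetical frame duration, chosen only for illustration


def data_fraction(probed_devices, t_bt_ms=T_BT_MS, frame_ms=FRAME_MS):
    """Share of a frame left for data after probing `probed_devices` neighbors."""
    return max(0.0, 1.0 - probed_devices * t_bt_ms / frame_ms)


if __name__ == "__main__":
    for n in (20, 40, 60, 80, 100):
        # Conventional scheme probes all n devices; bandit and random schemes probe one.
        print(n, round(data_fraction(n), 4), round(data_fraction(1), 4))
```

Under this assumed model, the data share of the conventional scheme shrinks linearly with the number of devices, while schemes that probe a single device per round keep an almost constant share, which matches the qualitative trend described above.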

Figure 4 presents the average throughput of the examined methods using 60 devices versus the blocking density λ. As blocking increases, the average throughput of all methods decreases because the higher NLOS (blockage) probability reduces the received power and hence the attainable data rate. The random scheme again yields the worst throughput because its randomized device selection policy may encounter abrupt blocking. In contrast, the CMAB-based NDS displays near-optimal performance. At λ of 0 (0.15), about 96.9% (95.6%), 92.8% (90.8%), 81.6% (80.8%), 56.2% (54.2%), and 51.3% (20%) of the optimal performance are obtained using the LinUCB/CTS, TS, UCB, conventional, and random schemes, respectively.

Figure 5 shows the convergence rate of the LinUCB, CTS, UCB, and TS algorithms against the optimal and random performances. TS converges faster than UCB owing to its Bayesian policy. At t = 100, both LinUCB and CTS converge to around 98% of the optimal throughput, whereas the noncontextual bandits converge more slowly: TS reaches 91% and UCB reaches 73%.

With Battery Consideration
Figure 6 shows the average throughput against the number of distributed devices with no blocking (λ = 0). The proposed EA-CMAB algorithms (i.e., EA-LinUCB and EA-CTS) outperform not only their noncontextual counterparts (i.e., EA-UCB and EA-TS [4]) but also the conventional and random selection schemes. Both EA-CMAB schemes perform similarly because LinUCB and CTS perform similarly, as previously explained, and because both include the newly added energy term [50,51]. The average throughput of the EA-LinUCB and EA-CTS schemes increases with the number of devices owing to the effective Wi-Fi context vector, which increases the long-term reward and reduces the BT cost. At 20 (100) devices, both EA-CMAB algorithms achieve throughput improvements of 1.3 (5.5) and 2.8 (5) over the conventional and random selections, respectively. The two modified EA-CMAB algorithms show small throughput fluctuations caused by the recently added remaining-energy term. This term affects the estimation of the typical CMAB algorithms by prioritizing closer devices, which reach higher achievable data rates with lower consumed energy. Moreover, EA-CMAB outperforms the noncontextual EA-MAB algorithms.

The EEs of all compared schemes increase with the number of nearby devices because more devices with higher SE values become available for setting up the mmWave D2D link, which in turn considerably reduces the energy spent by the chosen device. Furthermore, random selection shows the worst performance, while the two EA-CMAB-based NDS algorithms outperform the EA-MAB algorithms. Owing to the additional energy constraint in the formulated CMAB problem, the EA-CMAB-based NDS maximizes the long-term throughput while conserving the adjacent devices' remaining energies when constructing the D2D links by making use of Wi-Fi contexts. This improves performance over both the noncontextual EA-MAB and the conventional and random NDS. At 20 (100) devices, the EA-CMAB-based NDS achieves EE increases of 0.1 (0.5), 1.3 (2.5), and 2 (3.5) over the EA-MAB, conventional, and random schemes, respectively.

Figure 9 shows the EE versus the blocking density λ using 60 devices. In general, as λ increases, the EE of all algorithms decreases. This is due to the significant blockage effect, which reduces the available data rate and thereby extends the data transmission time, resulting in greater energy dissipation as given in (9). Still, the proposed EA-CMAB algorithms show the best EE over the whole range of λ because of the influence of the context vector and the energy constraint, whereas the random scheme shows the worst EE at all values of λ. At blocking densities of 0 (0.15), the EA-CMAB-based NDS achieves EE increases of 1.5 (1.7) and 2.9 (29) over the conventional and random schemes, respectively.

Figure 10 compares the convergence of the EA-LinUCB and EA-CTS algorithms with that of EA-MAB (EA-UCB and EA-TS), random, and optimal schemes. For the sake of comparison, the optimal scheme assumes devices with infinite energy. At nearly 100 rounds, the two proposed EA-CMAB algorithms converge to 99% of the optimum average throughput, while EA-MAB converges to 96%. The EA-CMAB algorithms thus converge slightly faster than the EA-MAB schemes, yielding a faster learning process and supporting their suitability for the problem.

For the complexity analysis, the time consumed by the compared schemes comes from two sources: the algorithm execution time and the nearby device probing time. The execution time of the proposed CMAB algorithms is of order $O(d^2 N)$ [20,21], which depends on the number of probed devices N and the size of the context vector d. As previously explained, d is fixed to three, and N is small because we only consider a small cell with few surrounding users. Moreover, under the policy of our proposed algorithms, N decreases as the number of trials grows because of the battery condition. Hence, the processing time of our algorithms can be considered constant, especially when the number of nearby devices is small. Table 2 reports the measured MATLAB R2020b execution time of the proposed algorithms against the number of devices, compared with the conventional scheme. The machine used has an i7-8565U CPU @ 1.80 GHz 1.99 GHz and 8 GB of RAM. From Table 2, the execution times of the proposed algorithms are in the range of milliseconds, which fits the 5G/6G requirement of millisecond latency. Moreover, MATLAB typically incurs long execution times because of its compiler. Hence, we expect much lower execution times than these values when the algorithms are implemented on real hardware platforms. The second source is the BT time for probing one device, which is about 0.28 ms [1]. This ensures the near-optimal performance of the proposed CMAB/EA-CMAB schemes, as given in Figures 3-10.
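To make the $O(d^2 N)$ order concrete, the following sketch scores N candidate devices with a LinUCB-style index and counts the quadratic-form terms evaluated per round. It is an illustration under assumptions, not the authors' implementation: it supposes that all arms share one inverse design matrix `b_inv` and one parameter estimate `theta`, and the names `score_arms`, `alpha` (exploration weight), and `inner_terms_per_round` are hypothetical.

```python
import math


def quadratic_form(x, m):
    """Compute x^T m x, which evaluates d * d inner terms."""
    d = len(x)
    return sum(x[r] * m[r][c] * x[c] for r in range(d) for c in range(d))


def score_arms(contexts, b_inv, theta, alpha):
    """Mean estimate plus confidence width for every arm, sharing one b_inv."""
    scores = []
    for x in contexts:
        mean = sum(x_k * t_k for x_k, t_k in zip(x, theta))
        width = alpha * math.sqrt(quadratic_form(x, b_inv))
        scores.append(mean + width)
    return scores


def inner_terms_per_round(n_devices, d):
    """Quadratic-form terms evaluated per round, growing as d^2 * N."""
    return n_devices * d * d


if __name__ == "__main__":
    for n in (20, 60, 100):
        print(n, inner_terms_per_round(n, 3))
```

With d fixed to three, the count grows only linearly in the number of candidate devices, which is consistent with the argument that the processing time stays small for a small cell.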

Conclusions
This paper addressed the NDS problem in mmWave D2D communications using ML-based CMABs. It put forward a CMAB-based online learning technique that effectively solves the NDS problem for promising future applications. This is done by using the Wi-Fi information of nearby multiband standardized devices as context information. Accordingly, the LinUCB and CTS schemes were leveraged for the NDS solution, and their performance was investigated against the UCB and TS algorithms. Afterward, we proposed EA-LinUCB and EA-CTS to accelerate the discovery process and maximize the long-term average throughput while accounting for the remaining energies of the adjacent devices. The proposed algorithms outperformed both the noncontextual MAB algorithms and the traditional mmWave D2D NDS approaches. The EA-CMABs achieved higher EE than the other schemes, with faster convergence rates. Future research will be directed towards practical experimental implementations and multiplayer scenarios using CMABs in centralized and decentralized settings. Moreover, implementing deep CMABs looks like a promising approach.

Wi-Fi and mmWave log-normal shadowing
$\Lambda_{TX}(\vartheta)$, $\Lambda_{RX}(\phi)$: Transmitting and receiving beamforming gains
$\vartheta$, $\phi$: Angle of departure (AoD) and angle of arrival (AoA)
$W_m$, $T_D$, $T_{BT}$: mmWave bandwidth, data transmission time, and BT time
$rw_{t,a_i}$: Collected reward via selecting arm/device $a_i$ at round t
$N_0$, $\vartheta_{-3dB}$, $\Lambda_0$: Noise power of the receiver, −3 dB beamwidth, and maximum antenna gain
λ, ∆, Ω: Obstacle density, cylinder's thinning factor, and radius
$\Xi_i(t)$, $\Xi_{limit}$: Remaining energy of the adjacent device i and threshold energy
$\Psi_i(t)$: D2D link throughput in Gbps with adjacent device i at round t