Leveraging Machine-Learning for D2D Communications in 5G/Beyond 5G Networks

Abstract: Device-to-device (D2D) communication is a promising paradigm for fifth generation (5G) and beyond 5G (B5G) networks. Although D2D communication provides several benefits, including limited interference, energy efficiency, reduced delay, and reduced network overhead, it faces many technical challenges, such as network architecture and neighbor discovery. The complexity of configuring D2D links and managing their interference, especially when using millimeter-wave (mmWave) bands, has inspired researchers to leverage different machine-learning (ML) techniques to address these problems and boost the performance of D2D networks. This paper presents a comprehensive survey of recent research on D2D networks, with emphasis on mmWave and ML methods. After exploring existing D2D research directions and their conventional solutions, we show how different ML techniques can enhance D2D network performance beyond conventional approaches. Then, open research directions for ML applications in D2D networks are investigated, including their essential needs. Finally, a case study applies multi-armed bandit (MAB), an efficient online ML tool, to neighbor discovery and selection (NDS) in mmWave D2D networks. This case study highlights the potential of ML solutions over conventional non-ML methods for improving the average throughput of mmWave NDS.


Introduction
Wireless data traffic keeps growing, driven by data-hungry applications such as high-definition video and virtual and augmented reality. This traffic explosion has motivated network operators and designers to race to meet market and customer expectations. At the same time, the massive connectivity expected from such applications in B5G and 6G wireless networks presents various challenges in resource allocation (RA), link management, and interference management (IM).
Device-to-device (D2D) communication represents one of the main pillars of future networks that facilitates traffic offloading and relaxes the traffic load of the whole system [1]. D2D networks harness networks, especially with novel ML techniques like federated learning (FL).
• A case study on applying MAB as an efficient online ML tool to enhance the performance of mmWave NDS is presented. In this case study, different multi-armed bandit (MAB) techniques, such as upper confidence bound (UCB) and minimax optimal stochastic strategy (MOSS), are investigated to show the effectiveness of ML tools in enhancing the average throughput of mmWave NDS over the existing traditional solutions, namely direct NDS and random selection. Besides, we show that such performance enhancements come with a sufficient learning convergence rate.
The rest of the paper is organized as follows: Section II overviews the recent research directions for D2D communications. Section III summarizes different ML techniques that can generally be employed for D2D network solutions. Section IV discusses different applications of ML algorithms in various D2D scenarios. D2D challenges, future research directions, and applications are introduced in Section V. A case study of applying MAB to mmWave D2D is presented in Section VI. Finally, Section VII concludes the work.

Recent Research Directions for D2D Communications
Before we delve into the applications of ML in D2D communication, we highlight the research directions for D2D communication in both the sub-6 GHz and mmWave bands, as summarized in Fig. 2.

Network Architectures and Standardization
D2D communication modes have been included in the 3rd generation partnership project (3GPP) standard through the definition of proximity services (ProSe) [2]. Later, the ProSe network architecture was modified to incorporate mmWave D2D applications. To that end, a message exchange entity was added to the ProSe architecture to manage the discovery, establishment, and maintenance of mmWave links. Also, a new paradigm of mmWave D2D networks was introduced in [6] based on the interworking between LTE, unlicensed µW, and mmWave bands, where a wide coverage µW band, e.g., Wi-Fi, was used to overcome the shortcomings of mmWave transmissions and assist the construction/management of the mmWave D2D links.

Neighbor Discovery and Selection (NDS)
NDS is a crucial design aspect in D2D networks, in which a device should discover its neighbor devices and select the best one for constructing the D2D link. There are two approaches to D2D neighbor discovery (ND), namely network-centric (NC) and device-centric (DC). In NC ND, the cellular network itself discovers the nearby devices, while in DC ND, the devices themselves do the job. Both NC and DC have their advantages and disadvantages: in dense user scenarios NC ND performs better than DC ND, whereas DC ND performs better in sparse ones. Quick ND is preferred because of the limited battery capacity of the devices and the need to reduce the consumed overhead. In mmWave D2D, the ND problem becomes more significant due to the use of beamforming training (BT), which consumes high energy and overhead. To overcome this problem, the authors in [6] used out-of-band ND assistance, in which the wide coverage unlicensed µW band is used to assist the discovery of the mmWave devices, thereby reducing the overhead and energy consumption.

Resource Allocation and Power Control
RA and power control (PC) have gained considerable attention in the design of in-band D2D networks, where the D2D users share the same resources with the cellular users (CUs) [7]. A variety of interference mitigation techniques exist in the literature to address this problem, which can be divided into interference avoidance, interference coordination, and interference cancellation. For interference avoidance, a limited interference area (LIA) surrounding the D2D users was introduced to prevent any CU from transmitting within the LIA. For interference coordination, optimal RA and PC were investigated between D2D users and CUs. The genetic algorithm, game theory, and optimization theory were utilized as efficient mathematical tools for coordinating the interference and controlling the power among CUs and D2D users [8], [9]. A new research direction is to use the social activities of the users in interference mitigation among D2D users and CUs. For interference cancellation, techniques based on successive interference cancellation (SIC), coordinated multi-point (CoMP), and full-duplex (FD)-based self-interference cancellation were investigated in the literature to cancel the interference that occurs between D2D users and CUs. Fortunately, for out-of-band D2D networks, including mmWave D2D, the problem of interference between CUs and D2D users does not exist. A low-complexity and bandwidth-efficient transceiver design for superimposed waveforms is provided in [10]. A new NOMA-based RA technique called superimposed multi-user shared access (MUSA), which supports many more users than conventional techniques, was proposed in [11].

Spectral Efficiency and Coverage Analysis
The main advantages of enabling D2D communications in cellular networks, besides relaxing the traffic load on the macro BS/core network, are enhancing the spectral efficiency (SE), reducing outage, and increasing the coverage of the D2D users. Several studies were performed to evaluate these metrics in conjunction with D2D networks [1]. To that end, tools from probability theory and stochastic geometry were extensively used to analyze the performance of D2D networks. Cognitive and energy harvesting D2D networks were modeled and analyzed using tools from stochastic geometry. Also, the Poisson cluster process was used to model the locations of the devices for coverage analysis of clustered D2D networks. The coverage probability, the mean number of covered receivers, and the throughput of multi-cast D2D transmission were also analyzed using tools from probability theory. Recently, an analysis of the underlaid FD D2D network was provided in terms of coverage probabilities and achievable sum-rates for both D2D users and CUs. To study the improvements in mmWave networks coming from enabling D2D links, tools from stochastic geometry were used to analyze mmWave D2D networks with respect to interference, coverage, and data rate, especially for mmWave wearables. Also, a fine-grained analysis of mmWave D2D networks using the Poisson bipolar model was given. Moreover, the locations of mmWave devices were modeled by a Poisson cluster process to investigate the performance of mmWave clustered D2D networks.

Relaying
The construction of D2D relays was investigated to extend the coverage of D2D communications, deliver the cellular connection to out-of-coverage CUs, and route around blockages in the case of mmWave D2D. Several D2D relay selection schemes can be found in the literature, considering different critical parameters when selecting the best relay, such as the end-to-end data rate, the end-to-end delay, and the remaining energy of the relaying device. Also, various relaying schemes were considered, such as decode and forward (DF), amplify and forward (AF), and demodulate and forward, in addition to both half-duplex (HD) and FD transmissions. Optimal resource allocation and power control in conjunction with D2D relaying were also investigated. Different mathematical tools were used for selecting the relays, such as optimization theory, game theory, fuzzy logic, and the genetic algorithm. For mmWave D2D, relaying is more critical, not only to extend the D2D communication range but also to route around blockages. Another unique feature of mmWave transmission is the use of BT, which makes the process of relay probing, i.e., exploring the candidate relays, time- and energy-consuming. Thus, research in mmWave D2D relaying focuses not only on optimizing the relay selection using conventional optimization techniques but also on finding the optimal number of probed relays, considering the trade-off between investigating more relays and maximizing the end-to-end throughput.

Overview of Machine-Learning Methods
ML is a branch of artificial intelligence (AI) that allows learning knowledge from examples/data without being explicitly programmed [12]. ML algorithms can find hidden patterns in massive, complex data by using different training methodologies, which can usually be categorized as follows:
• Supervised-Learning: In this category, the ML model tries to learn a function, y = f(x), that maps an input (x) to an output (y) based on a set of sample pairs (i.e., a historical data set) used for training the model. There are two sub-categories of supervised learning, namely regression and classification. Regression models, such as linear and logistic ones, predict real-valued outcomes using linear or sigmoid function approximations [12]. Other ML models, such as neural networks (NNs), random forests, and bagging and boosting meta-algorithms, can also be used for regression and exploit different techniques [12]. Classification models categorize/classify data samples into one out of several classes. Several classical classification models can be used for D2D applications, including K-nearest neighbor (KNN), support vector machines (SVM), and decision tree (DT) [12]. Additionally, recent advances in graphical processing units (GPUs) allow artificial deep NNs (DNNs) to be used for large datasets. Such DNNs have different architectures, including the multi-layer feed-forward NN (FNN), convolutional NN (CNN), recurrent NN (RNN), Hopfield networks, and Boltzmann machine, which are implemented in many new areas in communication networks [12].
• Unsupervised-Learning: Unlike supervised learning, unsupervised learning models discover and explore hidden patterns and structures of the input data without having data labels [12]. Unsupervised learning can be sub-categorized into three categories, namely clustering, density estimation, and dimension reduction. In clustering, the ML algorithm divides and labels data samples into groups/clusters, where the samples in one cluster are more similar to each other than to the samples in other clusters. Representative algorithms of this sub-category are K-means and RElative COre MErge (RECOME) clustering [12]. On the other hand, the objective of density estimation algorithms is to estimate the distribution density of data samples in the feature space to reveal the high-density regions, which usually show some essential characteristics. The Gaussian mixture model (GMM) is one of the popular algorithms in this sub-category. Finally, dimension reduction techniques, such as principal component analysis (PCA), K-means, and GGMM, transform the data from a high-dimensional space into a low-dimensional space while preserving the principal structures of the data. Such techniques are widely utilized in many applications [12].
• Reinforcement-Learning: RL is a powerful tool for dealing with real-time control problems in which supervised and unsupervised learning techniques are difficult to use. The learning methodology of RL is based on trial and error, similar to humans. An RL agent is rewarded or penalized for the actions it takes, and it aims to maximize the long-term reward. To select a proper action, recursive environmental feedback is provided to the agent in each step, and the strategy by which the agent chooses its actions is defined as a policy. The most widely used RL technique is Q-learning [12]. On the other hand, MAB is another promising RL-based general approach, which is attracting more interest, especially in communication applications. In its conventional setting, the MAB problem is expressed by a collection of arms or actions, and it captures the exploration-exploitation dilemma faced by a player. At each time step, the player/learner selects an arm and receives its corresponding reward, which can be modeled as stochastic or non-stochastic. The term bandit means that the player only observes the reward of the chosen arm, while the rewards of the other arms remain unknown at that time step. The player wishes to maximize the cumulative reward gained from a sequential selection of the arms. In other words, the player intends to minimize the regret compared with the best single arm. MAB is very helpful in sequential decision-making such as network routing [12], [13].
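To make the notion of regret concrete, the standard definition for a stochastic bandit can be written as follows. The notation is assumed here for illustration and is not taken from the paper: K is the number of arms, T the horizon, \mu_k the mean reward of arm k, and a_t the arm selected at time step t.

```latex
R_T = T\,\mu^{*} - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{a_t}\right],
\qquad \mu^{*} = \max_{1 \le k \le K} \mu_k
```

A learner that always selects the best arm has zero regret, so a slowly growing R_T indicates that exploration costs little relative to the reward that is eventually exploited.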

Applications of ML in D2D Communications
In this section, we highlight some vital ML algorithms utilized for D2D in real-life scenarios, emphasizing their advantages over traditional solutions. ML techniques can address a variety of D2D communication challenges, as shown in Fig. 3, including the traditional ones given in Fig. 2. Furthermore, Table 2 summarizes different ML applications in different D2D scenarios.

RA and Mode Selection (MS)
The RA and MS processes in D2D communications become highly complex in dense networks. Leveraging ML methods for both RA and MS offers more flexibility to cope with network dynamics. The authors of [14] proposed a deep learning (DL) based algorithm for transmission in D2D networks. They formulated a CNN that selects D2D links to transmit data, where 90% of the selected sub-optimal data are used to train the CNN-based algorithm and the remaining 10% for validation. Their numerical results showed accuracies of 85-95% from the neural network. In [9], an autoencoder is trained using a supervised learning approach to pair underlay D2D transmitters to reuse the spectrum of CUs, where the optimal pairing in a dataset is obtained using the conventional Hungarian algorithm, which minimizes the total cost of selection through combinatorial optimization formulated on a bipartite graph. The main idea is to use a DL approach that maps the cost matrices of matching different D2Ds-CUs to the corresponding optimal solutions, defined by the assignment matrix obtained from the Hungarian algorithm, with less complexity and time. In [8], the authors modeled the joint radio resource management (RRM)-MS problem as a two-stage online learning combinatorial MAB (CMAB). Their combinatorial bandit learning for MS and RA (CBMOS) algorithm achieves faster learning speed (132%) and higher performance (142%) under high channel dynamics. In [7], an RL-based latency controlled D2D connectivity (RL-LCDC) scheme for indoor D2D was proposed. RL-LCDC intelligently finds the neighbors, determines the D2D connection, and adaptively manages the communication area for the greatest network connectivity.
A distributed Q-learning algorithm can automatically allocate the spectrum to control interference in D2D-enabled multi-tier HetNets. The proposed RL-based schemes outperformed other techniques in terms of throughput, SE, signal-to-interference-plus-noise ratio (SINR), and network coverage. Additionally, clustering algorithms can be exploited to improve RA in single- and multi-cell D2D networks to obtain better SE and system performance. The authors in [15] utilized the k-means clustering algorithm to improve RA in a single-cell D2D network, yielding SE and system performance improvements.
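The mapping from a cost matrix to an assignment matrix described for [9] can be illustrated with a toy example. The sketch below is not the Hungarian algorithm and not the method of [9]: it is a brute-force search over all pairings, feasible only for very small problems, and the cost values and function names are hypothetical. It produces the kind of optimal assignment that a learned model would be trained to reproduce.

```python
from itertools import permutations


def best_pairing(cost):
    """Return (assignment, total) minimising sum of cost[d][assignment[d]].

    cost[d][c] is a hypothetical cost of letting D2D pair d reuse the
    spectrum of cellular user c. Brute force is O(n!), so this is only
    a stand-in for the Hungarian algorithm on tiny instances.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[d][perm[d]] for d in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


def to_assignment_matrix(assignment):
    """Convert an assignment list into a 0/1 assignment matrix."""
    n = len(assignment)
    return [[1 if assignment[d] == c else 0 for c in range(n)] for d in range(n)]


# Hypothetical 3x3 cost matrix (rows: D2D pairs, columns: CUs).
cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
assignment, total = best_pairing(cost)
matrix = to_assignment_matrix(assignment)
```

For larger networks the factorial search becomes impractical, which is the motivation for either the polynomial-time Hungarian algorithm or a learned approximation of it.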

ML for NDS
Single- and multi-hop neighbor probing is a severe problem in D2D communications that can be efficiently solved using ML. In [16], the authors proposed a DL-based peer discovery technique that applies information about social network relationships to reject ill-disposed devices. This not only reduces the probability of encountering malicious devices but also enhances the efficiency of peer discovery. Also, the authors of [17] and [18] formulated the problem of mmWave D2D NDS as a stochastic MAB to maximize the long-term throughput. In [4], [19], multiplayer MAB algorithms were leveraged for the selection of surrounding gateway UAVs by access UAVs in a disaster area scenario.

ML for Power Control
PC is an important interference management topic that ML can handle in D2D. In [20], two PC algorithms based on supervised and unsupervised learning were proposed for D2D scenarios. The authors demonstrated the importance of ML in D2D communication by comparing the two ML algorithms with conventional PC methods in terms of computational complexity, throughput, and energy efficiency. In [21], the problem of D2D PC is addressed for the case in which the channel gains between the two D2D users and the BSs are known, while the channel gains between the two users are unknown. A fully automatic power allocation method for internet of things (IoT)-D2D communication based on DL is proposed in [22]. The authors designed a distributed DL structure that trains the devices as a group while each device works independently, and it attained near-optimal performance. The authors of [23] suggested a mean-field multi-agent deep RL model that permits the devices to learn online PC strategies in a fully distributed manner, i.e., a selfish strategy in which each node operates independently.

ML for Interference Mitigation
D2D transmission may be a source of severe interference to other D2D links and to the CUs. A survey of popular and AI-based interference mitigation and RA approaches developed for D2D communications is provided in [24]. Additionally, the multi-agent actor-critic (MAAC) algorithm was proposed in [25] to mitigate interference through efficient distributed spectrum allocation. The same paper also proposes the neighbor-agent actor-critic (NAAC), which uses the historical information of neighbor users for centralized training, leading to outage probability reduction and sum rate improvement for D2D links. Another RL-based method for the RA problem was introduced in [26], in which the K-nearest neighbor algorithm is utilized to choose the task offloading platform. Moreover, in [15], the concept of a limited D2D communication area is proposed based on ML.

ML for Network Caching
Earlier research on D2D caching strategies assumes perfect knowledge of the content distribution. However, an ML-based caching policy that makes use of the demand history is not only highly promising but also recommended to save time and complexity. Comprehensive surveys of different ML and DL techniques for caching are provided in [27] and [28], respectively. In [29], the D2D caching problem was formulated as a multi-agent MAB to maximize the total predicted caching reward. Q-learning was leveraged to learn how to manage caching choices.
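For reference, the standard tabular Q-learning update is reproduced below. The symbols are assumed notation rather than that of [29]: s_t and a_t denote the state and action at step t (for instance, the current cache contents and a caching choice, which is an illustrative interpretation), r_{t+1} is the observed reward, \alpha is the learning rate, and \gamma is the discount factor.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```

The bracketed term is the temporal-difference error: the update moves the current estimate toward the observed reward plus the discounted value of the best next action.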

ML for D2D Security and Commercial Availability
There are still ongoing discussions on D2D commercial pricing and network security, where ML can handle such problems [1]. ML can help in addressing new D2D security challenges and threats related to device and user authentication in order to prevent unauthorized access and attacks on the complete network. As an example of such usage, a radio frequency (RF) fingerprint based identification method for D2D devices is proposed in [30]. First, the Hilbert transform (HT) and PCA are utilized to create the RF fingerprint of the D2D device. Then, cross-validation (CV)-SVM is used as the classifier. Moreover, the development of ML solutions is expected to bring related commercial products from large mobile companies such as Huawei and Samsung closer to market at reasonable prices.

Challenges and Future Research Directions
Although many researchers have leveraged ML techniques in wireless communication and D2D networks, it is crucial to identify and address different problems and challenges in practical D2D networks. In this section, we present the following exciting and challenging future research directions that are worth further study.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 5 January 2021 doi:10.20944/preprints202101.0074.v1

Fast Learning Process in Highly Dynamic D2D Networks
The implementation of ML-based algorithms requires fast learning, especially for fast-moving devices. Additionally, suitable models for high channel dynamicity should be learned and updated. However, such modeling is not always achievable in practice, which forms a bottleneck in D2D communication networks, especially for high-speed trains (HST) and vehicles. In mmWave D2D, due to the inherent complexity of the BT algorithms, proposing an optimal/sub-optimal ML-based algorithm with fast inference time is a challenging task, especially if we take into consideration the different D2D use cases in Fig. 1.

Adaptive ML for Easy/Adversarial Environment
So far, stochastic MAB algorithms have been applied in several wireless communication applications. These algorithms are designed for stochastic stationary environments and are not suitable for adversarial/dynamic environments like D2D communications. On the other hand, some ML techniques (e.g., online learning algorithms) have theoretical performance guarantees in worst-case scenarios. These techniques, however, do not necessarily perform better in practice, since environments are not always so adversarial and the techniques do not fully take advantage of such easiness in the settings. Easy data approaches in ML, as in [31], attempt to develop algorithms that perform adaptively in both the best and worst cases simultaneously. This approach would be helpful in solving different D2D problems, where the environment is sometimes stationary and at other times dynamic.
in mmWave based environment. The existing ML algorithms are applied without considering the adversarial mmWave environment, such as path blocking and spatial transmissions coming from BF. Moreover, DL-based solutions utilize continuous optimization that increases the system overhead. Also, these solutions require an extensive offline-learning phase, making them unsuitable for future mmWave B5G/6G applications. In contrast, online/discrete adaptive optimization techniques will be more suitable to cope with the mmWave nature, especially for multi-hop transmission scenarios.

Distributed Learning Information
Distributed AI can control the future D2D generations. The mutual exchange of information between the ML and communications communities is essential and strongly required for promising, unique solutions. D2D designers should provide ML designers with sufficient information about the devices, locations, speeds, and environments so that suitable algorithms can be devised that succeed in performing distributed learning.

Energy Harvesting and Cognitive Radio
Saving energy in D2D networks is an essential requirement for prolonging network lifetime. ML-based energy harvesting and stochastic optimization schemes are urgently required to mitigate harvested energy outage. The concept of cognitive radio (CR) can be intelligently implemented with the aid of ML.

Peer to Peer Internet
Employing ML with future D2D networks might help in realizing a new decentralized peer-to-peer internet. Decentralized multiplayer MAB techniques can help in solving this problem. Federated and distributed learning techniques will also be effective solutions. Future 6G systems will definitely depend on real-time/responsible AI.

D2D Networks for Decentralized Federated Learning
FL is a type of decentralized ML-based technique used to train networks by exploiting local model training and client-server communication [32]. This type of decentralized model is suitable for networks where the training data is distributed over a large number of devices, each holding a fraction of the data. Those devices exchange their locally trained models instead of exchanging their private data. Specifically, FL enables joint ML training over distributed data sets with limited disclosure of local data. In [32], the authors provided an implementation of a decentralized stochastic gradient descent (DSGD) technique for large-scale wireless D2D networks. However, exploiting such FL techniques for joint scheduling and resource allocation in D2D networks is a challenging task, especially under channel uncertainty and varying connection availability in each iteration.
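The idea of exchanging locally trained models rather than raw data can be sketched in a few lines. The example below is not the scheme of [32]: it uses full local gradients instead of stochastic ones so that the run is deterministic, it assumes an ideal error-free ring of four devices, and the scalar model, data, step size, and function names are all hypothetical.

```python
def local_gradient(w, data):
    """Gradient of the local loss 0.5 * (w * x - y)^2 averaged over local samples."""
    return sum((w * x - y) * x for x, y in data) / len(data)


def decentralized_gd(local_data, neighbors, steps=500, lr=0.1):
    """Each device averages models with its neighbors, then takes a local gradient step.

    local_data[i] holds the private (x, y) samples of device i and never
    leaves that device; only the scalar model w[i] is exchanged.
    """
    n = len(local_data)
    w = [0.0] * n
    for _ in range(steps):
        mixed = []
        for i in range(n):
            group = [i] + neighbors[i]  # device i and its D2D neighbors
            mixed.append(sum(w[j] for j in group) / len(group))
        # Gradient is evaluated at the device's own pre-averaging model.
        w = [mixed[i] - lr * local_gradient(w[i], local_data[i]) for i in range(n)]
    return w


# Hypothetical noise-free data generated from y = 2 * x, split across 4 devices.
local_data = [[(1.0, 2.0)], [(2.0, 4.0)], [(1.0, 2.0), (3.0, 6.0)], [(0.5, 1.0)]]
neighbors = [[1, 3], [0, 2], [1, 3], [2, 0]]  # ring topology
models = decentralized_gd(local_data, neighbors)
```

In this toy setting all devices agree on a model close to the true slope. In a real D2D network, neighbor links may be unavailable or noisy in a given iteration, which is the difficulty highlighted above.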

Case Study: MAB based mmWave D2D Scenario
This section demonstrates the effectiveness of ML-based methods over conventional solutions for the mmWave D2D NDS problem. Fig. 4 illustrates the simulated mmWave D2D network, where a mmWave device is located at the center of a micro-BS area of 125 × 125 m² and desires to establish a D2D link with one of its neighbor devices. Conventionally, the center device should exhaustively search over all its surrounding devices using BT and select the one maximizing the achievable data rate of the D2D link. This greatly decreases the link throughput because of the large training overhead. Instead, the mmWave NDS problem is modeled as a stochastic MAB, where the center device acts as the player aiming to maximize its long-term reward, which is the achievable data rate. This is done by playing over the surrounding devices, which serve as the arms of the bandit. Through proactive online learning, the center device converges to the device that maximizes its achievable data rate while examining one nearby device at a time, which greatly reduces the training overhead of the NDS process and consequently increases the throughput. The UCB and MOSS [33] algorithms are utilized to demonstrate the effectiveness of MAB-based mmWave NDS over conventional NDS and random selection [13]. UCB attempts to improve the confidence of the action selection every round by reducing the uncertainty, while MOSS is appropriate for both stochastic and adversarial MAB settings. Hence, both are suitable for the mmWave D2D NDS problem. In random selection, a random nearby device is selected every round for establishing the D2D link. Although it greatly relaxes the NDS overhead, it results in a poor achievable data rate and low throughput. The UCB and MOSS algorithms are modified to select the best nearby device, with maximum long-term data rate, for constructing the D2D link. The mmWave channel model and blocking formulations are given in detail in [17].
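As a rough illustration of the two index policies named above, the sketch below implements the textbook UCB1 and MOSS indices on a toy Bernoulli bandit. It is not the modified version used in this case study: the rewards normalized to [0, 1], the arm means, the horizon, and all function names are assumptions made for illustration, and the mmWave channel and blockage model of [17] is not reproduced.

```python
import math
import random


def ucb1_index(mean, pulls, t):
    """Textbook UCB1 index: empirical mean plus a confidence bonus."""
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def moss_index(mean, pulls, horizon, n_arms):
    """Textbook MOSS index; the bonus vanishes once pulls >= horizon / n_arms."""
    bonus = max(math.log(horizon / (n_arms * pulls)), 0.0)
    return mean + math.sqrt(bonus / pulls)


def run_bandit(true_means, horizon, policy, seed=0):
    """Play a Bernoulli bandit and return how often each arm was selected.

    Each arm is a stand-in for one candidate neighbor device, and the
    Bernoulli reward is a stand-in for a normalized achievable data rate.
    """
    if policy not in ("ucb", "moss"):
        raise ValueError("policy must be 'ucb' or 'moss'")
    rng = random.Random(seed)
    k = len(true_means)
    pulls = [0] * k
    means = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: probe every device once
        elif policy == "ucb":
            arm = max(range(k), key=lambda a: ucb1_index(means[a], pulls[a], t))
        else:
            arm = max(range(k), key=lambda a: moss_index(means[a], pulls[a], horizon, k))
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]  # running average
    return pulls


# Hypothetical mean rewards of three neighbor devices.
pulls_ucb = run_bandit([0.1, 0.3, 0.9], horizon=3000, policy="ucb")
pulls_moss = run_bandit([0.1, 0.3, 0.9], horizon=3000, policy="moss")
```

Only one device is probed per round, in contrast to the exhaustive search, which examines every device before each selection. Note that MOSS needs the horizon in advance, whereas UCB1 does not.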
Table 3 summarizes the simulation parameter values utilized in this case study. The MAB-based algorithms show superior performance over both the conventional (Conv) and random NDS methods. Their average throughput increases with the number of devices because MAB-based algorithms reduce the overhead, unlike traditional solutions. In contrast, the average throughput of the conventional NDS method decreases as the number of devices increases. Figure 6 presents the average throughput against the percentage of NLOS availability for the compared algorithms. It is worth noting that the MAB-based solutions have superior performance even at high LOS blockage. The figure confirms the advantage of ML-based algorithms for solving the NDS problem in mmWave D2D in harsh LOS blockage environments. One of the main challenges for MAB solutions is the convergence of the algorithm. In Fig. 7, we study the convergence rate of the utilized MAB algorithms against the horizon, where the optimal solution is added as an upper limit. It is clearly shown that the proposed MAB algorithms achieve a high convergence rate towards the optimal data rate obtained by exhaustively searching all available nearby devices. At t = 400, MOSS and UCB converge to 86% and 70% of the ideal average throughput, respectively.

Conclusion
This paper has presented a general overview of the applicability of ML algorithms in the area of D2D networks. The above investigation has identified difficulties and challenges to be addressed by the community to establish practical ML-based solutions that support D2D in B5G and 6G systems. The scope of future research when ML meets D2D is broad. Hence, we introduced a few exciting and challenging research issues that are worth additional investigation. Furthermore, we presented a case study to emphasize the effectiveness of MAB-based techniques over conventional solutions in solving the NDS problem in mmWave D2D communications.
Author Contributions: All authors contributed equally to this paper.

Conflicts of Interest:
The authors declare no conflict of interest.