Machine Learning in Beyond 5G/6G Networks—State-of-the-Art and Future Trends

Artificial Intelligence (AI) and especially Machine Learning (ML) can play a very important role in realizing and optimizing 6G network applications. In this paper, we present a brief summary of ML methods, as well as an up-to-date review of ML approaches in 6G wireless communication systems. These methods include supervised, unsupervised and reinforcement techniques. Additionally, we discuss open issues in the field of ML for 6G networks and wireless communications in general, as well as some potential future trends to motivate further research into this area.


Introduction
Wireless communication systems have experienced substantial revolutionary progress over the past years. With the rapid progress of 3GPP 5G phase 2 standardization, the 5G applications being commercially deployed all over the world cannot fully meet the challenges brought by the rapid increase of traffic and the real-time requirements of services [1]. For this reason, industry and academia are already working towards realizing the sixth generation (6G) of communication systems. ML, as part of AI, involves teaching machines to perform tasks independently by making data-driven decisions. ML can accurately estimate various parameters and support interactive decision-making. In [2], the deployment of ML techniques as potential solutions to upcoming 6G wireless communication challenges is discussed. The application of ML techniques in 6G wireless communication systems has attracted considerable interest in recent years. In this paper, we extend our earlier work [3].
The remainder of the paper is organized as follows. Section 2 briefly discusses the 6G network requirements and challenges. In Section 3, we present some basic ML algorithms. In Section 4, we present some of the emerging 6G applications and services and the role of ML. Finally, Sections 5 and 6 discuss some open issues and future trends in the application of ML algorithms in 6G and wireless communications, whereas Section 7 concludes this review paper with some remarks.

6G Network Requirements and Challenges
The global mobile traffic volume is anticipated to reach 5016 exabytes per month (EB/mo) in 2030, compared with 7.462 EB/mo in 2010 [4], so 5G will not be able to handle the traffic load. 6G will try to address the shortcomings of 5G by creating smart radio environments through Intelligent Reflecting Surfaces (IRS) and by adjusting communication to higher frequency bands (THz and mm-wave) [5]. IRS emerges as a key technology in future 6G networks. An IRS receives a signal from the base station (BS) and reflects it with induced phase changes, which are adjusted by a controller. The reflected signal can be added coherently with the signal from the BS to either boost or attenuate the overall signal at the receiver. An IRS does not amplify the signal power; it has only a minimal power requirement, for the operation of the controller and the reconfiguration of the elements, to have full control over the reflected signal.
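The coherent combination described above can be illustrated numerically. The following is a minimal sketch under simplifying assumptions that are not taken from the cited works: a single-antenna BS and receiver, a unit-magnitude direct path `h_d`, and hypothetical BS-to-IRS and IRS-to-receiver channels `h` and `g` with fixed magnitudes and random phases. It shows only that the phase shifts chosen by the controller decide whether the reflected paths add to or subtract from the direct signal.

```python
import numpy as np

def irs_received_amplitude(h_d, h, g, theta):
    """Amplitude of the direct path plus all IRS-reflected paths.

    theta holds the phase shift (in radians) applied by each IRS element.
    """
    return np.abs(h_d + np.sum(h * np.exp(1j * theta) * g))

def aligning_phases(h_d, h, g):
    """Phase shifts that rotate every reflected path onto the phase of the direct path."""
    return np.angle(h_d) - np.angle(h * g)

rng = np.random.default_rng(0)
n_elements = 32  # hypothetical number of reflecting elements

# Illustrative channels: fixed magnitudes, random phases (an assumption, not a channel model).
h_d = np.exp(1j * rng.uniform(0, 2 * np.pi))                # BS -> receiver
h = 0.1 * np.exp(1j * rng.uniform(0, 2 * np.pi, n_elements))  # BS -> IRS
g = 0.1 * np.exp(1j * rng.uniform(0, 2 * np.pi, n_elements))  # IRS -> receiver

theta_boost = aligning_phases(h_d, h, g)   # reflected paths in phase with the direct path
theta_cancel = theta_boost + np.pi         # reflected paths in anti-phase

boosted = irs_received_amplitude(h_d, h, g, theta_boost)
attenuated = irs_received_amplitude(h_d, h, g, theta_cancel)
direct_only = np.abs(h_d)
```

With these illustrative magnitudes, each reflected path contributes 0.01, so the aligned phases raise the amplitude from 1 to about 1.32, while the anti-phase setting lowers it to about 0.68.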
IRS is energy and cost efficient, induces smart radio environments, and is free from self-interference, so it can be used like other related wireless technologies such as conventional relaying, backscatter communication (BackCom), and mMIMO relaying. IRS can be a solution to the energy- and spectral-efficiency issues in 6G systems [6]. IRS will play a crucial role in 6G communication networks, similar to that of massive MIMO in 5G networks. Thus, IRS can be used to help achieve massive MIMO 2.0 in 6G networks [7].
6G networks will enhance and expand 5G applications and will meet the following requirements [8,9]:

perform any decision based on the training data. ML is usually classified into three major categories [13]: supervised, unsupervised, and reinforcement learning.

Supervised Learning
Supervised learning algorithms are trained using a labeled dataset. In the supervised approach, both the input data and the desired output to be predicted are known to the system. Supervised learning needs enough data to be applied effectively in any application [14]. It is mostly used for classification and regression problems, and some typical supervised algorithms are logistic regression, Artificial Neural Networks (ANN), k-Nearest Neighbor (kNN) [15], naive Bayes, random forest and decision tree [16].

• ANNs: ANNs are inspired by nature and try to imitate biological neural networks, and so are able to learn from complicated data. In wireless communication systems, ANNs can be used to learn the structure of the network and predict users' behavior to solve different problems such as spectrum and resource allocation, cell association, etc. [17]. Recently, deep learning has extended the applicability and capabilities of ANNs with Deep Neural Networks (DNN) [18]. Moreover, there are ANN types, such as autoencoders, that are applied for unsupervised learning, and other ANN structures that are used for reinforcement learning.
• K-Nearest Neighbor: kNN is a classification and regression algorithm based on the distance between different feature values. The classification of an unknown data sample is determined by the classes of its K nearest neighbors: if the majority of the nearest neighbors belong to a certain class, the sample is assigned to this class. The algorithm has several advantages: it is insensitive to outliers, easy to implement and suitable for multiclass classification. Its main disadvantage is that it is very time-consuming for large input datasets [16].

• Naive Bayes: This is a simple probabilistic classification model based on Bayes' theorem, which models the conditional probability of a result Y given the input condition X. Naive Bayes classifiers can effectively handle a large number of independent continuous or categorical features. This is due to their ability to transform a high-dimensional density estimation task into a one-dimensional kernel density estimation task, assuming that the features are independent of one another [19].

• Decision tree: This model imitates trees in nature. Each node of the decision tree represents a feature of the data, each branch represents the conjunction of features needed for the classification, and each leaf node represents a specific class. The model tries to maximize the information gain of each variable split. After the model is trained on the known labeled dataset, an unlabeled sample is classified by comparing its feature values with the trained nodes of the decision tree. The basic advantages of the approach include simple implementation and high classification accuracy. However, it suffers when the data include many-level variables, because information gain is biased towards multi-level features [16].
• Random Forest: A random forest consists of multiple decision trees. The method randomly selects a subset of features as the basis for constructing each decision tree. Each decision tree classifies any new data sample, and the sample is assigned to the class chosen by the majority of the decision trees [16]. The algorithm examines only part of the attributes when searching for the best split, and low correlation between trees is essential to avoid the domination of a few strong attributes [19]. Figure 1 depicts an example of a Random Forest model.

• Convolutional Neural Networks (CNN): These models are made up of neurons that can self-optimize through learning. They are mostly used for pattern recognition, especially in classification applications for image recognition. A CNN consists of three types of layers: the convolutional layer, the pooling layer, and the fully connected layer. When these layers are stacked together, the complete CNN architecture is formed [17]. CNNs can be used for either supervised or unsupervised learning, depending on the task.
• Recurrent Neural Network (RNN): An RNN is an ANN type that uses sequential or time series data. Some common applications of RNNs include ordinal or temporal problems, such as language translation, natural language processing, speech recognition, and image captioning. Long Short-Term Memory (LSTM) is an RNN type that was introduced to overcome the vanishing gradient problem observed when training traditional RNNs. LSTM networks can be applied for classification, processing and making predictions based on time series data. As with CNNs, RNNs can be applied for either supervised or unsupervised learning.
In our study, many different algorithms were applied, but all of them were based on and inspired by the previously mentioned supervised algorithms. The advantages and limitations of the most common supervised ML methods that were introduced [20][21][22][23][24] are analyzed in Table 2.
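To make the kNN majority vote described in the list above concrete, the following is a minimal sketch. The function name `knn_predict`, the toy feature vectors and the labels are assumptions chosen for illustration and do not come from the surveyed works. Note that every prediction computes the distance to all training samples, which is why the method becomes time-consuming for large datasets.

```python
from collections import Counter

import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Euclidean distance from the query to every training sample.
    distances = np.linalg.norm(train_x - query, axis=1)
    # Indices of the k closest samples.
    nearest = np.argsort(distances)[:k]
    # Majority class among those neighbours.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy, made-up data: two features per sample, two classes.
train_x = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # class "weak"
    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9],   # class "strong"
])
train_y = ["weak", "weak", "weak", "strong", "strong", "strong"]

label = knn_predict(train_x, train_y, np.array([0.15, 0.10]), k=3)
```

Here the three nearest neighbours of the query all belong to the first group, so the sample is assigned to that class.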

Unsupervised Learning
Unsupervised learning algorithms are given a set of unlabeled data from which to predict the output, which is the basic difference from the supervised learning approach. These algorithms are mostly used for clustering and aggregation problems, but can also achieve great results for regression problems. Some typical unsupervised algorithms include K-means, Self-Organizing Maps (SOMs), Hidden Markov Model (HMM), Autoencoders (AEs), Principal Component Analysis (PCA), Restricted Boltzmann Machine (RBM), fuzzy C-means, etc. Furthermore, unsupervised ML has been applied to enhance the performance of Deep Learning (DL) algorithms such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) algorithms [16].
• K-means: This is a widely used method for grouping unlabeled raw input data into different clusters. The K-means algorithm assigns each data point to a cluster based on its distance from the nearest centroid. The centroids are then updated based on the assigned data points, and the procedure is repeated until neither the assignments of the data points nor the centroids change. K represents the number of desired clusters and can greatly impact the performance of the algorithm [16].

• Self-Organizing Map (SOM): This approach is mostly used for data clustering and dimensionality reduction. The model has an input layer and a map layer, each containing many neurons, and a different weight vector is assigned to each neuron. During the training process, the SOM builds the map using an unsupervised competitive learning approach. The winning neuron of this competition determines the cluster into which any new input vector is classified [16]. Figure 2 displays the architecture of a traditional Self-Organizing Map model.

• Autoencoders: These are learning circuits that copy inputs to outputs, aiming for the least possible deviation. They achieve great results on both classification and regression problems. Autoencoders are stacked and trained bottom-up in an unsupervised way, followed by a supervised learning step. In this way, the top layer is trained on known input, thereby fine-tuning the whole architecture.
In our study, many different algorithms were applied for unsupervised ML, but all of them were based on and inspired by the previously mentioned algorithms. The advantages and limitations of the most common unsupervised ML methods that were introduced [16,25-28] are analyzed in Table 3.
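The assignment and update steps of K-means described in the list above can be sketched as follows. This is a minimal illustration, not the implementation of any surveyed work: the function name `kmeans`, the toy points and the explicit initial centroids are assumptions, and K is implied by the number of initial centroids passed in. The centroids are given explicitly so that the run is reproducible; in practice they are usually chosen at random, and the result can depend on that choice.

```python
import numpy as np

def kmeans(points, initial_centroids, n_iter=100):
    """Plain K-means: alternate assignment and centroid update until nothing changes."""
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids, and hence assignments, no longer change
        centroids = new_centroids
    return labels, centroids

# Toy, made-up data: two well-separated groups of three points each.
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])
labels, centroids = kmeans(points, initial_centroids=points[:2])
```

Even though both initial centroids lie in the first group, the procedure separates the two groups after a few iterations, with the centroids settling at the mean of each group.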

Reinforcement Learning
Reinforcement Learning (RL) is based on the principles of behaviourist psychology: the model learns in the same way as a child learns to perform a new task. RL relies on a feedback performance indicator (reward) obtained from the model's environment. The model pursues the ideal output performance by maximizing the reward. RL is a hybrid of supervised and unsupervised learning, because (indirect) supervision is required for the model to understand and learn the ideal system performance, while there is no available training dataset paired with the desired output [15]. Basically, RL is a trial-and-error procedure in which an agent interacts with the environment and, depending on whether the action tried was good or bad, gets feedback in terms of a reward or penalty. RL tries to learn the best policy, i.e., the one that enables the agent to make an optimal decision at any given state of the environment. Figure 3 displays an example of RL. RL algorithms can be categorized into value-based (e.g., Q-learning, SARSA) and policy-based algorithms (e.g., Policy Gradient (PG), Proximal Policy Optimization (PPO) and Actor-Critic (A2C)) [29].
• Q-learning: Q-learning is the most commonly used RL algorithm. It is an off-policy technique and uses a greedy approach to learn the needed Q-value. The algorithm learns the Q-value given to the agent in a certain state for a specific action. The approach creates the Q-table, where the number of rows represents the number of states and the number of columns represents the number of actions. The Q-value is the reward of the action at a certain state. Once the Q-values are learned, the agent can make quick decisions in the current state by taking the action that has the largest Q-value in the table [30].
• SARSA: This is an on-policy algorithm that learns the Q-values using, each time, the action performed by the current policy of the model [19].
• Policy Gradient (PG): The approach uses a random network, and a frame of the agent is applied to produce a random output action. This output is sent back to the agent, the agent then produces the next frame, and the procedure is repeated until a good solution is reached. During the training of the model, the network's output is sampled in order to avoid repeating loops of the action. The sampling allows the agent to randomly explore the environment and find a better solution [17].

• Actor-Critic: The actor-critic model learns a policy (actor) and a value function (critic). Actor-critic learning is always on-policy, because the critic needs to learn and correct the Temporal Difference (TD) errors from the 'actor', i.e., the policy [19].
• Deep reinforcement learning: In recent years, deep learning has significantly advanced the field of RL, with the use of deep learning algorithms within RL giving rise to the field of "deep reinforcement learning". Deep learning enables RL to operate in high-dimensional state and action spaces, so it can now be used for complex decision-making problems [31,32].
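The Q-table mechanism described in the list above can be illustrated with a small tabular Q-learning sketch. Everything specific here is an assumption made for illustration: the five-state chain environment, its `step` function, the reward of 1 at the goal state, and the values of the learning rate `alpha`, discount factor `gamma` and exploration rate `epsilon`. The behaviour policy is epsilon-greedy, while the update target uses the greedy (maximum) Q-value of the next state, which is what makes the method off-policy.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2   # actions: 0 = move left, 1 = move right
GOAL = N_STATES - 1          # reaching the last state ends the episode

def step(state, action):
    """Hypothetical chain environment: reward 1 only when the goal state is reached."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def train_q_table(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.5, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, N_ACTIONS))  # rows = states, columns = actions
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            # Epsilon-greedy behaviour policy: explore with probability epsilon.
            if rng.random() < epsilon:
                action = int(rng.integers(N_ACTIONS))
            else:
                action = int(np.argmax(q[state]))
            next_state, reward = step(state, action)
            # Off-policy target: reward plus discounted greedy value of the next state.
            target = reward + gamma * np.max(q[next_state])
            q[state, action] += alpha * (target - q[state, action])
            state = next_state
    return q

q_table = train_q_table()
# After training, the agent acts by picking the largest Q-value in each row.
greedy_policy = np.argmax(q_table, axis=1)
```

In this toy setting the learned greedy policy is to move right in every non-terminal state, and the Q-value of moving right decays by a factor of `gamma` for each additional step away from the goal.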
Some advantages and limitations of the most common RL algorithms [33][34][35][36] are listed in Table 4.
Coverage, power and capacity optimization are critical challenges in future 6G network services [16]. In [37], Random Forest and kNN algorithms are proposed to predict and optimize the Path Loss (PL). The results show higher accuracy and reduced Mean Squared Error (MSE) compared with conventional approaches. The authors in [38] propose a novel approach, namely GRL, to address the problems of joint user association and power allocation. In the proposed model, for optimization purposes, the learning process is split into two parts: the generalization-representation learning (GRL) part and the specialization-representation learning (SRL) part. The authors assume a function that can represent the connection between the network's parameters and the optimal resource allocation, and the problems are addressed by optimizing this function. In this approach, data-driven (supervised learning) and model-driven (unsupervised learning) training methods are combined to accurately predict the optimal function, and the results are satisfactory.
In [39], a supervised ANN-based algorithm, named MLP-DBA, is proposed to predict the dynamic bandwidth allocation (DBA). The authors aim to achieve bandwidth allocation close to that of optimal conventional approaches. The simulation results indicate that the proposed model can adaptively allocate the bandwidth while improving the latency performance over conventional DBA schemes. In [40], a DNN algorithm is proposed to predict the users' requirements in highly dynamic (UAV) networks. The results show better performance than the conventional Q-learning based algorithms that were mostly used. In [41], an RNN algorithm is proposed for intelligent load balancing. The proposed intelligent load balancer, named APRIL, can effectively use load forecast information to maximize server utilization. Results show that the proposed forecasting model performs by between 5.88 and 92.6 better than the alternatives. The deviation in performance arises because the user's role greatly impacts the performance of the model.
In [42], machine learning-based Cooperative Spectrum Sensing (CSS) schemes have been proposed. In the proposed approaches, some nodes send the received signal power from the users to the Fusion Center (FC), where artificial neural networks (ANNs) and a Support Vector Machine (SVM) approach are used to determine whether the channels are idle or not. The ANN is used to recognize the transmit power, while the SVM, acting as a classifier, is used to find the best decision boundary. The results show that the proposed approaches can offer great results in terms of accuracy and performance. In [43], the authors compare different supervised ML algorithms (ANN, SVM, random forest) for predicting the data rate. Results show that the random forest approach achieves the lowest prediction error. The error is lowest in the uplink transmission direction (in the downlink it is more significant). In [44], a supervised cooperative data rate prediction approach is introduced. This cooperative model reduces the average prediction error by 30%.
In [45], a combination of two well-known beamforming schemes (maximum ratio transmission and zero-forcing) is used in a K-user Multiple Input Single Output (MISO) channel. The proposed approach is based on a DNN whose input nodes take the channel vector and the transmit power and whose output returns the combining factors for the transmitter's beamforming. The model achieves 99% of the sum rate obtained with conventional approaches.
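For readers unfamiliar with the two beamforming schemes mentioned above, the following sketch computes maximum ratio transmission (MRT) and zero-forcing (ZF) beams for a K-user MISO downlink. It is an illustration under stated assumptions only: the channel matrix `H` (one row per user) is random, the function names are hypothetical, and the convex mix with a scalar `alpha` in `combined_beamformers` is merely one plausible way to combine the two schemes, since the exact combining rule of [45] is not given here. In [45] the combining factors are produced by a DNN, which this sketch does not include.

```python
import numpy as np

def mrt_beamformers(H):
    """MRT: each user's beam is its own conjugate channel, normalized to unit norm."""
    W = H.conj().T                                   # shape (n_tx, K), column k serves user k
    return W / np.linalg.norm(W, axis=0, keepdims=True)

def zf_beamformers(H):
    """ZF: right pseudo-inverse of H with unit-norm columns; nulls inter-user interference."""
    W = H.conj().T @ np.linalg.inv(H @ H.conj().T)
    return W / np.linalg.norm(W, axis=0, keepdims=True)

def combined_beamformers(H, alpha):
    """Hypothetical mix: alpha = 1 gives MRT, alpha = 0 gives ZF (an assumption, not [45])."""
    W = alpha * mrt_beamformers(H) + (1.0 - alpha) * zf_beamformers(H)
    return W / np.linalg.norm(W, axis=0, keepdims=True)

def sum_rate(H, W, power, noise=1.0):
    """Sum over users of log2(1 + SINR), with equal transmit power per user."""
    gains = np.abs(H @ W) ** 2 * power               # gains[k, j]: power user k receives from beam j
    signal = np.diag(gains)
    interference = gains.sum(axis=1) - signal
    return float(np.sum(np.log2(1.0 + signal / (interference + noise))))

rng = np.random.default_rng(0)
n_users, n_tx = 3, 4                                 # illustrative sizes
H = (rng.normal(size=(n_users, n_tx)) + 1j * rng.normal(size=(n_users, n_tx))) / np.sqrt(2)

rates = {alpha: sum_rate(H, combined_beamformers(H, alpha), power=10.0)
         for alpha in (0.0, 0.5, 1.0)}
```

MRT maximizes each user's own signal but ignores interference, whereas ZF removes interference at the cost of signal power, which is why a combination tuned to the channel and transmit power can be attractive.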
A K-means clustering model for users in THz MIMO-NOMA systems is proposed in [46]. Users are separated into different clusters based on whether they belong to the coverage of Small Cell Base Stations (SBSs) or of Macro Base Stations (MBSs). The large spreading path loss and the molecular absorption loss are two important challenges in THz systems, so an efficient clustering scheme can both reduce interference and improve the channel quality, resulting in higher throughput and Signal-to-Interference-plus-Noise Ratio (SINR). For the users' clustering, an enhanced K-means approach is proposed in the same paper. The channel correlation parameters of the different clusters are examined, and the one that maximizes the metric is used to address the issue of fluctuating clustering centers. The simulation results show the efficiency of the proposed schemes.
In [47], a machine learning based predictive DBA algorithm is proposed to address the contention for upstream bandwidth and the latency bottleneck in Passive Optical Networks (PONs). The proposed algorithm uses an ANN at the Central Office (CO) to learn the uplink latency and estimate the bandwidth demand of every unit. Using this approach, the CO can allocate the required bandwidth to forthcoming packet bursts without having them wait until the following transmission cycle. The simulation results show that the model achieves >90% accuracy in predicting the optical network's status, which improves the accuracy of estimating the bandwidth demands of the optical units. Table 5 gives a brief summary of the supervised ML models in Beyond 5G (B5G)/6G optimization problems.

Fault/Anomaly Management
In [48], the authors propose an extended SVM, called the support Tucker machine, to detect faults/outliers in IoT systems. The model improved the accuracy and efficiency of anomaly detection and was able to retain the structure of the big sensor data.
Estimation of future radio communication channels is rather challenging, due to their growing complexity [16]. In [49], data-driven supervised DNN estimators are used to predict channels, with results showing that this approach predicts channels more accurately than conventional channel estimation algorithms. The authors in [50] propose a supervised deep neural network (DNN) approach for adaptive bit allocation with imperfect Channel State Information (CSI) in heterogeneous networks. Accurate CSI estimation in heterogeneous networks can greatly impact the system's performance. Furthermore, the reduction of feedback overhead is an important challenge in heterogeneous networks. Even though many different quantization techniques have been used to address this issue, the system's performance cannot increase linearly while the number of bits increases exponentially. The bits need to be distributed to the cells and then further allocated optimally to each channel. This conventional approach is time-consuming, so the proposed method is used to enable direct allocation for the entire network. Using the supervised DNN, the optimized number of bits can be obtained directly for different numbers of bits and scenarios, leading to reduced complexity. Simulations show that the proposed method achieves closer to optimal performance than the conventional approaches.

Beam Selection
The authors in [51] propose a combined supervised ML approach for beam selection in mm-wave communications. The beam selection problem was treated as a multi-class classification problem and addressed with two supervised learning algorithms (kNN and Support Vector Classifier, SVC), with simulation results showing that the proposed ML schemes can retain 90% of the sum rate of optimal beam selection. In [52], a supervised SVM for beam selection is proposed, aiming to achieve a high sum rate at lower computational complexity. The results verified that the proposed ML approach can achieve a higher Average Sum Rate (ASR) with substantially lower computational complexity than conventional approaches. In [53], the authors propose a DNN model for beam selection in mm-wave systems, to reduce the space required for the initial beam. The results show that the proposed beam selection reduces the beam overhead by up to 79.3%. In [54], a DNN for optimal downlink beam selection in mm-wave networks is proposed, to enhance prediction accuracy and data rate. The simulation results show superior performance and robustness of the proposed model. Conventional approaches mostly rely on sub-6 GHz information, especially in the low signal-to-noise ratio (SNR) regions. In [55], a novel deep learning solution based on an RNN, namely the Gated Recurrent Unit (GRU), is proposed for beam selection. The model can predict the serving base station and beam for each drone based on its prior trajectory and locations, extending the coverage. Simulation results show that the proposed scheme can achieve more than 90% accuracy for beam prediction.

Caching/Computing
In [56], the authors use an ANN-based approach to address the issue of code caching, with results showing the effectiveness of the model. In [57], a supervised DNN is proposed to address the issue of caching in IoT systems, with results close to the optimum of conventional approaches.

Security
In [58], the authors use decision tree algorithms to boost trust management using eXplainable Artificial Intelligence (XAI) for intrusion detection. Simple decision tree algorithms are applied to split the sub-choices for the intrusion detection system (IDS), which resembles a human approach to decision-making. Results show that the accuracy of the proposed approach is comparable with state-of-the-art algorithms. The authors in [59] used a supervised LSTM algorithm for an intrusion detection model. They applied six different optimizers to investigate the performance of the model, and the results show that the LSTM model with the Nadam optimizer can achieve an accuracy of 97.5%, which outperforms conventional approaches. In [60], the authors propose a supervised CNN-based method to classify and detect malware traffic, with classification accuracy of up to 99.4%.

MIMO
In [61], the authors propose a combination of ML estimators, using a CNN with an Autoregressive Network (ARN) for predicting Channel State Information (CSI) and an RNN for channel prediction in massive MIMO systems with the channel aging property. Results show that the proposed model can improve the prediction accuracy and the users' throughput gains in both low and high mobility scenarios. In [62], the issue of channel mapping in the space and frequency domains in massive MIMO is addressed using a novel supervised deep learning approach, reducing overhead in both the training and feedback aspects.

UAV
In [63], a supervised deep learning approach is proposed for UAV systems. The proposed model uses a Clustering-based Two-layered (CBTL) algorithm to address the joint caching and trajectory prediction issue. Then, a CNN-based DL approach is used to make fast decisions online. This approach aims to maximize the network's throughput by jointly optimizing cache and trajectory. Simulation results show the effectiveness of the proposed approach in terms of accuracy. In [64], an ANN-based algorithm is proposed to detect GPS spoofing signals in UAV systems. The results show high detection accuracy for spoofing signals, and the algorithm can reduce possible false alarms in the UAV system. In [65], the authors propose an SVM-based supervised approach for detecting jamming, spoofing and intrusion attacks in UAV systems. The proposed model shows high accuracy in detecting attacks, making UAV systems safer against cyber security attacks. The authors in [66] proposed a supervised ANN approach combined with an evolutionary algorithm to predict the Received Signal Strength (RSS) in a UAV system. Moreover, in [67] an ensemble approach is selected, which exhibits satisfactory results in terms of performance and accuracy. Table 6 reports some supervised ML models used for B5G/6G problems.
Coverage, power and capacity optimization are critical challenges in future 6G network services [16]. In [68,69], an unsupervised K-means algorithm is used to address the user selection and power allocation optimization challenges in NOMA systems. Results show that the proposed model achieves great results in terms of accuracy and optimization. In [70], two Power Control (PC) algorithms, trained using both supervised and unsupervised learning, were proposed for Device-to-Device (D2D) scenarios. The comparison of the hybrid algorithms with conventional PC methods shows satisfactory results in terms of computational complexity, throughput, energy efficiency, resource allocation and power control optimization. This work is categorized as unsupervised ML because, in this approach, the supervised decision tree is derived from the unsupervised Q-learning method; the most significant factor for the final hybrid approach is therefore the performance of the unsupervised model, which defines the supervised phase and hence the final performance of the approach.
Conventional approaches to modulation recognition of received signals include several procedures such as preprocessing, classification and feature extraction. The authors in [71,72] addressed the challenge of modulation recognition by investigating the performance of different deep learning algorithms, such as CNN and LSTM, using unsupervised learning paradigms for optimization purposes. The comparison results suggest that LSTM can achieve better performance than other DL-based approaches.
CNN and LSTM are categorized as supervised learning methods, but they can be used in an unsupervised learning approach with satisfactory results. CNN is mostly a supervised ML approach, but it can also be used in an unsupervised way depending on the problem at hand [73]. The authors in [74] propose an automatic unsupervised cell event detection and classification method, which expands convolutional Long Short-Term Memory (LSTM) neural networks. The LSTM network can be trained in an unsupervised manner by using a branched structure in which one branch learns the regular appearance and movements of objects and the second learns the stochastic events, which occur rarely and without warning in a cell video sequence. Furthermore, the authors in [75] investigated anomaly detection in an unsupervised framework and introduced LSTM neural network-based algorithms with significant performance gains. The authors in [76] propose a new CNN-based architecture for extracting features from images in an unsupervised manner. The model, namely the Unsupervised Convolutional Siamese Network (UCSN), is trained to embed a set of images in a vector space in such a way that the local distance structure in the image space is preserved. The results indicate that the UCSN produces representations that are suitable for classification purposes. So, although LSTM and CNN are mainly used as supervised ML approaches, they can also be used in an unsupervised manner.

Fault Management
Fault management includes detection, identification and mitigation of any abnormal status of networks. Fault management in future 6G networks needs to be effective, due to their heterogeneous, complex and dynamic nature. The authors in [77] compared five different unsupervised learning approaches (K-means clustering, Fuzzy C-means clustering, Local Outlier Factor (LOF), Local Outlier Probabilities (LoOP) and Kohonen's Self-Organizing Maps (SOM)) for fault detection in 6G networks. The results show that the SOM-based approach outperforms Fuzzy C-means and K-means in detecting and predicting faults/abnormalities in 6G networks.
In [78], an extension of the conventional K-means clustering algorithm, named K-Aware K-means, is used for fault detection in 6G network systems. In this extended version of K-means, the model uses an unsupervised learning phase to acquire temporary expert knowledge of what the smallest cluster of the current data looks like, labels its members as outliers, and updates the temporary knowledge. In this way, the model self-optimizes the K value (K ≤ 1) and achieves a prediction accuracy of 99.7%. The authors in [79] propose an unsupervised learning approach with a SOM algorithm as the centerpiece for both fault recognition and recovery, achieving great accuracy results.

Channel Estimation
Estimation of future 6G radio communication channels is rather challenging, due to their growing complexity [16]. State-of-the-art unsupervised learning approaches (DL unsupervised model, CNN and RNN) have been used for channel detection in molecular communication [80,81]. A DL-based detector called DetNet was proposed in [82] and achieves accuracy similar to that of conventional algorithms with much lower computation time. The unsupervised DL-based detectors suggested in [81] can also outperform conventional detectors. In particular, the LSTM-based detector shows outstanding performance in molecular communication use cases when dealing with inter-symbol interference [80].

User Mobility Estimation
Predicting a user's position, movement and trajectory can improve resource allocation and reduce signal overhead in 6G networks [16]. The authors in [83] used a discrete-time Markov chain based approach to predict the next cell a user is most likely to move into. Results show that the solution can accurately predict both the movement and trajectory of the users. Furthermore, in [84] the authors used an HMM algorithm to predict the user's location. The model treats the mobile network as a state-transition graph. The efficiency and accuracy results of the approach were satisfactory. Two unsupervised algorithms for user equipment (UE) association in heterogeneous networks at RF and THz frequencies are proposed in [85]. The simulation results show that the proposed algorithms can outperform conventional approaches in both data rate and traffic load balancing.

Security
AI/ML technologies can also be considered in applications of authentication and access control to detect different kinds of attacks, such as jamming and malware attacks, Denial of Service (DoS) or Distributed DoS (DDoS) attacks. In IoT devices, it is important to address authentication and access control without leaking privacy-sensitive information such as localization. In [86], the authors use non-parametric Bayesian methods for IoT authentication, access control and malware detection, with satisfactory results. The authors in [87] propose a DRL-based approach that detects various attack possibilities through unsupervised learning to address the security issue, with results showing an extra 6 percent gain in accuracy. The authors in [88] propose an unsupervised Gaussian Mixture Model (GMM) approach for Physical Layer security, enhancing the performance of the model, whereas the authors in [89] used an unsupervised approach combining CNN and Stacked Encoders (SAE) for intrusion detection, achieving a precision of 98.44%.

UAV Networks
Future 6G networks will support high transmission data rates and wireless broadcast. Unmanned Aerial Vehicle (UAV)-assisted communication networks will be widely used to meet these challenges [90]. In UAV-NOMA systems, a UAV often acts as a flying BS to boost the capacity of an existing terrestrial network. In [91], a K-means clustering algorithm is used to spatially cluster correlated users, and a reinforcement Q-learning algorithm is then used to place the UAV as a BS in three dimensions. The authors in [92] proposed MLP and LSTM techniques to predict the optimal UAV location and optimize user throughput and system performance. The proposed model accurately predicts the UAV position and enhances user throughput and system performance.
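The clustering step mentioned for [91] can be illustrated with a minimal sketch of plain K-means (Lloyd's algorithm) on ground-user positions. This is not the implementation of [91]: the function names, the farthest-first initialization, the synthetic user positions and the idea of reading each centroid as a candidate horizontal UAV position are all assumptions made for illustration, and the Q-learning placement stage is not shown.

```python
import numpy as np

def farthest_first_init(points, k):
    """Deterministic initialization (an assumption): start at points[0], then add the farthest point."""
    centroids = [points[0]]
    for _ in range(1, k):
        chosen = np.array(centroids)
        # Distance from every point to its nearest already-chosen centroid.
        d = np.linalg.norm(points[:, None, :] - chosen[None, :, :], axis=2).min(axis=1)
        centroids.append(points[d.argmax()])
    return np.array(centroids)

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = farthest_first_init(points, k)
    for _ in range(iters):
        # Assign each user to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean position of its assigned users.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical ground-user positions in metres: two well-separated hotspots.
rng = np.random.default_rng(1)
users = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=5.0, size=(50, 2)),
    rng.normal(loc=[200.0, 200.0], scale=5.0, size=(50, 2)),
])
centroids, labels = kmeans(users, k=2)
```

In this toy setting each centroid sits near the middle of one hotspot, which is the kind of spatial grouping a subsequent placement algorithm could start from.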

MIMO
With multiple antennas at the transmitter and receiver, Multiple Input Multiple Output (MIMO) has been widely adopted in wireless systems. The authors in [93] propose an unsupervised fast beamforming DNN design method for sum-rate maximization in a MIMO single base station system. The proposed approach preserves performance while considerably improving computational speed, thus achieving results close to optimal.

Visible Light Communications
Providing effective Radio Frequency (RF) communication in indoor use cases is emerging as an important challenge in 6G networks. Visible Light Communications (VLC) is a potential technology that can offer various solutions to this issue. VLC is based on the principle of modulating the light emitted by Light-Emitting Diodes (LEDs) without affecting the human eye, giving an opportunity to exploit the existing illumination infrastructure for wireless communication. VLC technology is expected to offer the very high data-rate short-range communications needed for 6G networks [90]. 6G is expected to support transmission rates 100-1000 times higher than those of 5G, so frequency and bandwidth demands will grow. VLC can support high transmission rates and uses unlicensed bands, so it is a promising technique to replace conventional wireless local area networks for indoor communications in 6G networks [94].
Optical Wireless Communications (OWCs) will be widely used in 6G networks and, among them, VLC is the most promising because of technological advances and the extensive use of LEDs. VLC-based communications do not emit electromagnetic (EM) radiation and experience only minor interference from other potential EM interference sources. Furthermore, VLC has significant advantages in terms of communication security and privacy [95].
VLC can also be widely exploited in Vehicle to Everything (V2X) applications and especially in Vehicle to Vehicle (V2V) applications [90]. In [94], some unsupervised clustering ML techniques (K-means and clustering algorithm perception decision, CAPD) have been proposed to reduce nonlinearity in VLC systems. In 2017, CAPD was applied in a multi-band VLC system, with the results showing an improvement in the Q-factor of 1.6-2.5 dB. Furthermore, in 2018 a K-means-based pre-distorter was proposed, leading to a 50% improvement in performance [94]. The data for the unsupervised ML models used in 6G problems are listed in Table 7.

In [96], the authors propose a multi-agent deep reinforcement learning-based model, named Neighbor-Agent Actor Critic (NAAC), for spectrum allocation in 6G network D2D scenarios. This model uses information from a user's neighbors for centralized training and exploits cooperation between users to optimize the system's performance. The simulation results show that the proposed approach can improve the sum rate of D2D links and has good convergence.
In [97], a deep Q-learning based approach is proposed, namely a Generative Adversarial Network-powered Deep Distributional Q Network (GAN-DDQN), for spectrum allocation per network slice. Simulation results show enhanced performance compared with conventional deep Q-learning algorithms. In [98], the authors propose a reinforcement Q-learning-based algorithm for resource allocation. The model minimizes the outage probability of information by assigning the channel resources. The results demonstrate the superior performance and effectiveness of the proposed scheme while satisfying the average power constraint at the energy harvesting node.
In [99], the authors propose a Q-learning based algorithm for channel selection that learns the order in which channels are scanned, thereby reducing overhead and possible delays. Compared with state-of-the-art algorithms, the proposed approach achieves higher detection probability and accuracy and reduces scanning overhead and access delay, resulting in enhanced spectrum sharing. In [100], a deep Q network based algorithm is proposed for cooperative communications in 6G networks. The model aims to select the optimal relay from different nodes without needing a network model. Results show that the proposed algorithm can achieve better performance and reduced energy consumption, with lower convergence time than existing approaches.
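Several of the works above build on tabular Q-learning, so a minimal sketch of its update rule on a toy channel-selection task may help. It is not the algorithm of [99] or [100]: the environment (state is the channel sensed last, action is the channel to sense next, reward 1 if that channel is idle), the Bernoulli occupancy model, the idle probabilities and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.5):
    """One tabular Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def train(idle_prob, steps=20000, epsilon=0.1, seed=0):
    """Toy channel selection with epsilon-greedy exploration (hypothetical environment)."""
    rng = np.random.default_rng(seed)
    n = len(idle_prob)
    Q = np.zeros((n, n))   # rows: last sensed channel, columns: channel to sense next
    s = 0
    for _ in range(steps):
        # Explore a random channel with probability epsilon, otherwise act greedily.
        a = int(rng.integers(n)) if rng.random() < epsilon else int(Q[s].argmax())
        # Reward 1 if the sensed channel happens to be idle, else 0.
        r = float(rng.random() < idle_prob[a])
        q_update(Q, s, a, r, s_next=a)
        s = a
    return Q

Q = train(idle_prob=[0.1, 0.2, 0.95])
best_channel = int(Q[2].argmax())   # greedy choice after last sensing channel 2
```

The deep Q-network variants cited in this section replace the table `Q` with a neural network, but the temporal-difference target has the same form.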
In [101], a deep RL-based algorithm is developed for dynamic power allocation. Each transmitter collects CSI and QoS information from its neighbors and then adapts its transmit power accordingly. Random variations and delays in the CSI are addressed using a deep Q-learning based approach. The proposed algorithm is shown to achieve near-optimal power allocation based on delayed CSI measurements and is well suited to scenarios where the CSI is significant.
In [102], novel reinforcement learning-based transmission approaches, named Reinforcement Learning Channel-aware Transmission (RL-CAT) and Reinforcement Learning pCAT (RL-pCAT), are proposed for data rate optimization. The proposed models significantly outperform conventional probabilistic approaches and achieve data rate improvements of up to 181 in uplink and up to 270 in downlink transmission direction.
In [103], a DRL-based approach for joint mode selection and resource management is proposed. Each user equipment (UE) can operate either in cloud RAN (C-RAN) mode or in D2D mode. The network controller makes intelligent decisions on UE communications and aims to minimize the system's power consumption. The proposed approach is compared with several other models to show its effectiveness. In [104], the authors propose a DRL based model to maximize downlink SNR in Intelligent Reflecting Surface (IRS) communications. Simulation results show that the system can not only achieve almost the upper bound of received SNR, but also reduce time consumption.
In [105], a DRL actor-critic based model is used for resource allocation optimization and to solve the joint network control challenge in IoT systems. The actor-critic based algorithms reduce the data rate assigned to each IoT network and IoT device. The algorithm also chooses whether transmission will take place over the space or the terrestrial network. The proposed model outperforms conventional approaches across different network parameters and metrics.
In [106], a Single-Agent Q-learning (SAQ-learning) algorithm is proposed for resource allocation using historical experience, with satisfactory results. In the same paper, a Bayesian Learning Automata (BLA) Multi-Agent Q-learning (MAQ-learning) algorithm is proposed for the task offloading decision. The effectiveness of the proposed algorithm is confirmed by comparison with conventional algorithms in various network scenarios.

Caching/Computing
In [107], a DRL MDP-based algorithm is proposed to enhance caching and computing capabilities in cache-aided MEC networks. This approach leads to resource allocation optimization with low complexity, and is thus able to achieve quasi-optimal performance under various system setups and significantly outperform conventional methods. In [108], the authors propose a deep actor-critic reinforcement learning based model for caching (centralized and decentralized). For centralized edge caching, the model aims to maximize the cache hit rate, with both cache hit rate and transmission delay treated as performance metrics to be optimized. Results show that the proposed approach outperforms conventional approaches such as least frequently used (LFU) and least recently used (LRU). In [109], a Multi-Agent Multi-Armed Bandit (MAMAB) approach is proposed for caching in 6G networks. The proposed model learns the caching strategy online in various environments (stationary and non-stationary), whereas conventional approaches first estimate the users' preferences and needs and then try to optimize the caching. Results show high accuracy and performance for the proposed algorithm. Table 8 reports the RL models used in 6G for optimization and caching problems.
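To make the LFU and LRU baselines and the cache hit rate metric mentioned above concrete, the following is a minimal sketch of both eviction policies. The function names, the tie-breaking rule for LFU and the request trace are hypothetical and are not taken from [108].

```python
from collections import Counter, OrderedDict

def lru_hits(requests, capacity):
    """Count cache hits under least-recently-used eviction."""
    cache = OrderedDict()
    hits = 0
    for item in requests:
        if item in cache:
            hits += 1
            cache.move_to_end(item)          # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)    # evict the least recently used item
            cache[item] = True
    return hits

def lfu_hits(requests, capacity):
    """Count cache hits under least-frequently-used eviction."""
    freq = Counter()                         # request counts of the cached items only
    hits = 0
    for item in requests:
        if item in freq:
            hits += 1
            freq[item] += 1
        else:
            if len(freq) >= capacity:
                # Evict the lowest count; ties go to the item cached earliest (an assumption).
                victim = min(freq, key=freq.get)
                del freq[victim]
            freq[item] = 1
    return hits

requests = [1, 1, 1, 2, 3, 1]                # hypothetical content request trace
lru_rate = lru_hits(requests, capacity=2) / len(requests)
lfu_rate = lfu_hits(requests, capacity=2) / len(requests)
```

On this short trace LFU keeps the popular item 1 cached while LRU evicts it, which is the kind of difference in hit rate that learned caching policies are compared against.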

Channel Estimation/Allocation
In [110], the authors propose an RL-based algorithm (based on an auction theory model) for channel allocation. Each user tries to converge to the optimal allocation while achieving an optimal regret order of O(log T), where T is the length of the time horizon. The algorithm is based on a Carrier Sense Multiple Access (CSMA) implementation. Simulations show that the algorithm performs very well on realistic LTE and 5G channels and has great potential for B5G systems. In [111], a Markov decision process (MDP)-based algorithm for channel allocation is proposed. The model allocates channels in densely deployed WLANs, leading to enhanced throughput. Compared with state-of-the-art approaches, the proposed method achieves more efficient, or even optimal, channel allocation and reduces the number of changes in the system's performance.
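For readers unfamiliar with the regret criterion quoted for [110], one common generic definition is sketched below. The symbols $W^{*}$ (expected sum reward of the optimal allocation), $W(t)$ (sum reward obtained at time $t$) and $C$ are introduced here for illustration and are not taken from [110], whose exact formulation may differ.

```latex
R(T) = \sum_{t=1}^{T} \bigl( W^{*} - \mathbb{E}[W(t)] \bigr),
\qquad
R(T) \le C \log T \quad \text{for some constant } C > 0 \text{ and all sufficiently large } T .
```

A logarithmic regret order means that the average per-step loss $R(T)/T$ relative to the optimal allocation vanishes as the horizon $T$ grows.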

Energy Consumption/Harvesting
In [112], the authors propose a Q-learning algorithm and a deep Q-learning algorithm for the user devices and the Small Base Station (SBS) in cooperative networks, reflecting their different complexities. Results show greater energy savings for these approaches than for existing methods. In [113], a DRL approach is proposed for optimizing energy consumption in 6G networks. This model takes mobility into account and accelerates block verification. The reward function considers the total energy consumed for transmission and caching. The same paper also presents a security study, with the model providing security and privacy protection while maintaining low energy consumption. The proposed algorithm achieves 86% successful content caching requests, against 76% for a conventional greedy algorithm and 5% for a random content caching approach.
In [114], the authors propose two DRL-based algorithms for energy harvesting: a hybrid-decision-based actor-critic learning (Hybrid-AC) algorithm and a multi-device hybrid-AC (MD-Hybrid-AC) algorithm for dynamic computation offloading scenarios. Hybrid-AC improves on the actor-critic architecture: the actor outputs the offloading ratio and local computation capacity, and the critic evaluates these continuous outputs together with the discrete server selection. MD-Hybrid-AC applies centralized training with decentralized execution. The model constructs a centralized critic to output server selections and considers the continuous action policies of all devices for the actor. Simulation results show that the proposed algorithms achieve a significant performance improvement compared with conventional approaches and can maintain a good balance between time and energy consumption.
In [65], a Deep Q-Network (DQN) based algorithm for energy consumption is proposed. Furthermore, the authors develop an RL algorithm that minimizes the prediction error in order to address the battery energy prediction challenge. Finally, a two-layer RL network approach is developed to solve the joint access control and battery prediction problem. In this approach, the first RL layer deals with the battery's energy prediction, and the second, depending on the output of the first layer, produces the access policy of the system. Simulation results show that the three proposed RL algorithms can achieve better performance than existing approaches in terms of optimizing energy consumption and sum rate and minimizing the prediction loss.
In [115], a multi-agent DRL-based framework was proposed for power control and throughput maximization in energy-harvesting super IoT systems. Furthermore, a DNN-based distributed online power control scheme is developed to learn the policies in the system. Simulation results show the efficiency of the proposed power control policies, which outperform conventional approaches such as the Markov decision process and achieve throughput close to optimal.

Handover
In [116], the authors propose an offline RL algorithm to optimize Handover decisions. The model is able to decrease excess Handovers by up to 70% by studying the user's prolonged connectivity. The model also outperforms conventional Handover reduction approaches. In [117], a DRL framework is proposed for handover optimization and timing in mm-wave systems. The model uses camera images to predict the future data rate of mm-wave links and ensures that a proactive Handover is performed before obstacles decrease the system's data rate. The proposed approach achieves better performance than the conventional model and is also able to predict data rate degradations 500 ms before they occur. In [118], a distributed RL model for Handover optimization in mm-wave systems is proposed, with results showing a reduction in signaling overhead.

V2V
In [119], a DRL algorithm is adopted to map the correlation between observations and optimal resource allocation in V2V systems. The proposed model satisfies the latency constraints on V2V links and is able to minimize interference in the V2V system. In [120], an RL-based approach for sum rate optimization in V2V systems is introduced. The model is a distributed reinforcement Resource Allocation (RA) algorithm, modeled as a multi-agent system. Furthermore, a double deep Q-learning algorithm is applied to jointly train the agents and maximize the sum rate. Simulation results show that the proposed RL-based algorithms achieve close to optimal performance, while ensuring limited latency and accurate packet delivery on the V2V link.

UAV
In [121], the authors propose a two-stage DRL algorithm for joint content placement and trajectory design. The two stages of the proposed scheme are offline content placement and online user tracking. In the first stage, the authors maximize the users' hit rate under a cache capacity constraint. In the second stage, a Double Deep Q-Network (DDQN) is developed for online tracking of mobile users while respecting energy constraints. Simulation results show that the proposed algorithm can easily adapt to dynamic conditions, predict trajectories and provide enhanced achievable throughput. In [122], a DRL approach is proposed to maximize throughput and security metrics against jamming attacks in 6G networks. Simulation results show that the proposed approach is robust against jamming and can enhance throughput compared with conventional policies. In [123], the authors use a Markov model to deal with several advanced jamming attacks. For attacks such as swept jamming and dynamic jamming, the authors design a multi-agent reinforcement learning (MARL) algorithm for effective defense. The simulation results show that the algorithm can effectively avoid these advanced jamming attacks, thanks to the agents collaboratively sharing the spectrum. In [104], a novel DRL-based algorithm is proposed to ensure secure beamforming against eavesdroppers in dynamic IRS-aided environments. The model uses post-decision state (PDS) and prioritized experience replay (PER) approaches to boost the learning efficiency and secrecy performance of the system. The proposed approach can significantly improve the system secrecy rate and QoS (thus optimal beamforming is required) in IRS-aided secure communication systems.

Visible Light Communication
In [124], the authors propose a DQN based multi-agent multi-user algorithm for power allocation in hybrid networks. These networks are composed of radio frequency (RF) and visible light communication (VLC) access points (APs). The users are capable of multi-homing, which can link RF and VLC systems in terms of bandwidth requirements. In the proposed DQN algorithm, each AP is considered an agent, and the transmit power needed for users is optimized by an online power allocation strategy. Simulation results demonstrate a faster median convergence time in training (90 shorter than a typical Q-learning based algorithm) and a convergence rate of 96.1% (whereas the convergence rate of a conventional QL-based algorithm is 72.3%). In [125], a multi-agent Q-learning algorithm is proposed as a power allocation strategy in RF/VLC systems. In these systems, the transmit power at the APs needs to be optimized in order to ensure QoS satisfaction. Simulation results demonstrate the effectiveness of the proposed Q-learning based strategy in terms of accuracy and performance.

Fault/Anomaly Management
In [126], a deep Q-learning approach is proposed for fault detection and diagnosis in 6G networks. Simulation results show that the algorithm can use fewer features and achieve higher accuracy, up to 96.7%. Table 9 holds a brief summary of the RL models used in various 6G problems.

Open Issues
ML applications can offer new research directions and solutions in wireless communication systems and also support the realization of 6G wireless communication networks and services. Although significant research has emerged in the field of ML in wireless communication systems, there are still many challenges and open issues to be resolved:
• Time Convergence: A careful investigation of the relatively long convergence time of ML methods, as well as the factors that influence convergence, is needed. Optimizing the convergence time is critical, as long ML convergence times can undermine performance in highly dynamic wireless networks [127].

• Resource allocation: AI-enabled networks also impact e-health applications. For instance, advancing outside-of-clinic operations by using wearable sensors requires harmonizing network resource allocation across several technologies, and ML can be helpful for such harmonization [127].
• QoS and QoE: A network encompassing a large and diverse set of users will have very dynamic operation, as users may have very different QoS and QoE requirements. For example, users require high throughput and low delay in video streaming applications, at the expense of security, but when it comes to payment software, users demand high security, even at the expense of throughput. In this direction, the design of a cross-layer, action-based ML protocol for different applications is a critical issue, in order to meet the various requirements while balancing network resources [128].

Meta-learning is an exciting research direction in the field of ML. Model Agnostic Meta Learning (MAML) is a gradient-based meta-learning algorithm that is able to learn a sensitive initialization to perform fast adaptation. Compared to other meta-learning methods, MAML has much lower complexity. MAML does not depend on any specific model and only requires the use of a gradient descent algorithm to update the parameters. MAML can therefore be applied to multiple learning problems, such as regression, classification and reinforcement learning [131,132]. MAML is a field of ML that needs to be further investigated and developed, and a few studies are exploring potential solutions. For example, in [133] a MAML-based method is proposed to address the large number of samples otherwise required to train a deep neural network (DNN) in a wireless channel environment, with good results in terms of Normalized Mean Squared Error (NMSE). Furthermore, the authors in [134] propose a new decoder, namely the Model Independent Neural Decoder (MIND), based on a MAML methodology, achieving satisfactory parameter initialization in the meta-training stage and satisfactory accuracy. The authors in [135] use state-of-the-art meta-learning schemes, namely MAML, FOMAML, REPTILE and CAVIA, for IoT scenarios using offline and online meta-learning approaches. The results show the advantage of meta-learning in both offline and online cases compared with conventional ML approaches. Developing such ML methods for use in 6G networks is an interesting and ongoing direction for future work.
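The gradient-based nature of MAML described above can be summarized by its two nested updates, written here in the standard notation of the MAML literature. The symbols ($\theta$ for the shared initialization, $\alpha$ and $\beta$ for the inner and outer step sizes, $\mathcal{T}_i$ for a task drawn from a task distribution $p(\mathcal{T})$, and $\mathcal{L}_{\mathcal{T}_i}$ for its loss) do not appear elsewhere in this paper and are introduced only for this sketch.

```latex
\theta_i' = \theta - \alpha \, \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}\bigl(f_{\theta}\bigr),
\qquad
\theta \leftarrow \theta - \beta \, \nabla_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\bigl(f_{\theta_i'}\bigr)
```

The inner step adapts the model $f_{\theta}$ to a single task with one gradient step (more steps are possible), and the outer step moves the initialization so that such adapted models perform well across tasks, which is why only gradient descent is needed regardless of the model.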

Generative Adversarial Networks (GANs)
Generative Adversarial Networks (GANs) are a novel class of deep generative models in which training is a minimax zero-sum game between two networks: a Generator (G) and a Discriminator (D) [136]. These networks compete in a unified training process where the generator uses its neural network to produce samples and the discriminator tries to classify these samples as real or fake [137]. The game is played until Nash equilibrium using a gradient-based optimization technique (Simultaneous Gradient Descent), i.e., until G can generate images that resemble samples from the true distribution and D cannot differentiate between the two sets of images [136]. GANs have gained a lot of attention recently for different applications and seem to be a potential solution to various challenges. For example, the authors in [138] employ a GAN approach to pre-train a deep-RL framework that provides resource allocation for ultra reliable low latency communication (URLLC) in the downlink of a 6G wireless network, with results showing near-optimal performance within the rate-reliability-latency region, depending on the network and service requirements. Furthermore, the authors in [139] propose a GAN based joint trajectory and power optimization (GAN-JTP) algorithm for UAV trajectory prediction and power optimization, with results being close to optimal with high convergence speed. In the context of a complex 6G network system, the development of GANs seems crucial for the upcoming challenges.
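The minimax game between G and D described above is usually written as the following value function, shown here in the standard form from the GAN literature. The symbols $p_{\text{data}}$ (the true data distribution), $p_{z}$ (the noise prior fed to the generator) and $V$ are notation introduced for this sketch and are not defined elsewhere in this paper.

```latex
\min_{G} \max_{D} V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
+ \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

D is trained to assign high probability to real samples and low probability to generated ones, while G is trained to make $D(G(z))$ large, so that at equilibrium D can no longer tell the two apart.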

Conclusions
In this review, we focused on the various enhanced capabilities that 6G has to offer, and also on the solutions that ML can offer to the emerging 6G wireless communication challenges. We have summarized the state-of-the-art 6G applications and the deployment of ML algorithms in various fields and applications. The most important ML methods were explained in detail, focusing on their advantages in dealing with upcoming 6G wireless communication challenges and in enhancing different systems. Interest in exploiting ML for 6G wireless communication challenges will grow rapidly in the upcoming years, as 6G networks will soon be realized and the various challenges in these networks can be effectively addressed using ML approaches and models. Finally, we outlined a handful of open problems and directions worth future research efforts.

Table 2. Advantages and limitations of supervised ML methods.

Table 3. Advantages and limitations of unsupervised ML methods.

Table 4. Advantages and limitations of RL methods.

Beyond 5G/6G Applications and Machine Learning
6G will be able to support enhanced Mobile Broadband Communications (eMBB), Ultra-reliable Low Latency Communications (URLLC) and massive Machine Type Communications (mMTC), but with enhanced capabilities compared to 5G networks. Furthermore, it will be able to support applications such as Virtual Reality (VR), Augmented Reality (AR) and ultimately Extended Reality (XR). Depending on the problem, different ML algorithms are applied, as analyzed below.

Table 7. Unsupervised ML models in 6G problems.

Table 8. RL models in 6G optimization and caching problems.

Table 9. RL models in 6G various problems.
• UAVs as an Intelligent Service (UaaIS): UaaIS employs UAVs to intelligently provide fundamental services in terms of wireless communication, edge computing and edge caching, using advanced ML techniques. Because resources are scarce, energy-efficient ML model training and inference for UaaIS is urgently needed and remains a rather challenging open issue in the field. For example, when a UAV acts as an edge intelligence trainer, energy-efficient training strategies should be designed for all participants, and especially for UAVs with relatively limited energy [129].
• CSI Acquisition in IRS: The acquisition of timely and accurate CSI plays a crucial role in IRS-enhanced wireless systems, especially in MIMO-IRS and MISO-IRS networks. Obtaining CSI in IRS-enhanced wireless networks is a non-trivial task that requires a non-negligible training overhead. Additionally, in IRS-assisted NOMA networks, users in each cluster have to share the CSI with each other. Due to the passive nature of the IRS, CSI acquisition and exchange are non-trivial tasks. A challenging issue is the use of ML and DL approaches for exploiting CSI in cases beyond linear correlations [130].