A survey on Machine Learning-based Performance Improvement of Wireless Networks: PHY, MAC and Network layer

This paper provides a systematic and comprehensive survey that reviews the latest research efforts focused on machine learning (ML) based performance improvement of wireless networks, while considering all layers of the protocol stack (PHY, MAC and network). First, the related work and paper contributions are discussed, followed by providing the necessary background on data-driven approaches and machine learning for non-machine learning experts to understand all discussed techniques. Then, a comprehensive review is presented on works employing ML-based approaches to optimize the wireless communication parameters settings to achieve improved network quality-of-service (QoS) and quality-of-experience (QoE). We first categorize these works into: radio analysis, MAC analysis and network prediction approaches, followed by subcategories within each. Finally, open challenges and broader perspectives are discussed.


Introduction
Science and the way we undertake research is rapidly changing. The increase in data generation is present in all scientific disciplines [1], such as computer vision, speech recognition, finance (risk analytics), marketing and sales (e.g. customer churn analysis), pharmacy (e.g. drug discovery), personalized healthcare (e.g. biomarker identification in cancer research), precision agriculture (e.g. crop line detection, weed detection), politics (e.g. election campaigning), etc. Until recent years, this trend was less pronounced in the wireless networking domain, mainly due to the lack of big data and 'big' communication capacity [2]. However, with the era of the Fifth Generation (5G) cellular systems and the Internet-of-Things (IoT), the big data deluge in the wireless networking domain is under way. For instance, massive amounts of data are generated by the omnipresent sensors used in smart cities [3,4] (e.g. to monitor the availability of parking spaces, or to monitor the condition of road traffic to manage and control traffic flows), smart infrastructures (e.g. to monitor the condition of railways or bridges), precision farming [5,6] (e.g. to monitor yield status, soil temperature and humidity), environmental monitoring (e.g. pollution, temperature, precipitation sensing), IoT smart grid networks [7] (e.g. to monitor distribution grids or track energy consumption for demand forecasting), etc. It is expected that 28.5 billion devices will be connected to the Internet by 2022 [8], which will create a huge global network of things, and the demand for wireless resources will accordingly increase.

Figure 1: Architecture for wireless big data analysis

To address these challenges, machine learning (ML) is increasingly used to develop advanced approaches that can autonomously extract patterns and predict trends (e.g. at the PHY layer: interference recognition; at the MAC layer: link quality prediction; at the network layer: traffic demand estimation) based on environmental measurements and performance indicators as input. Such patterns can be used to optimize the parameter settings at different protocol layers, e.g. PHY, MAC or network layer.
For instance, consider Figure 1, which illustrates an architecture with heterogeneous wireless access technologies, capable of collecting large amounts of observations from the wireless devices, processing them and feeding them into ML algorithms, which generate patterns that can help make better decisions to optimize the operating parameters and improve the network quality-of-service (QoS) and quality-of-experience (QoE).
Obviously, there is an urgent need for the development of novel intelligent solutions to improve the wireless networking performance. This has motivated this paper and its main goal to raise awareness of the emerging interdisciplinary research area (spanning wireless networks and communications, machine learning, statistics, experimental-driven research and other research disciplines) and showcase the state-of-the-art on how to apply ML to improve the performance of wireless networks to solve the challenges that the wireless community is currently facing.
Although several survey papers exist, most of them focus on ML in a specific domain or network layer. To the best of our knowledge, this is the first survey that comprehensively reviews the latest research efforts focused on ML-based performance improvements of wireless networks while considering all layers of the protocol stack (PHY, MAC and network), whilst also providing the necessary tutorial for non-machine learning experts to understand all discussed techniques.
Paper organization: We structure this paper as shown in Figure 2. We start by discussing the related work and distinguishing our work from the state-of-the-art in Section 2. We conclude that section with a list of our contributions. In Section 3, we present a high-level introduction to data science, data mining, artificial intelligence, machine learning and deep learning. The main goal here is to define these often interchangeably used terms and explain how they relate to each other. In Section 4 we provide a tutorial focused on machine learning: we overview various types of learning paradigms and introduce several popular machine learning algorithms. Section 5 introduces four common types of data-driven problems in the context of wireless networks and provides examples of several case studies. The objective of this section is to help the reader formulate a wireless networking problem as a data-driven problem suitable for machine learning. Section 6 discusses the latest state-of-the-art on machine learning for performance improvements of wireless networks. First, we categorize these works into radio analysis, MAC analysis and network prediction approaches; then we discuss example works within each category and give an overview in tabular form, looking at various aspects including input data, learning approach and algorithm, type of wireless network, achieved performance improvement, etc. In Section 7, we discuss open challenges and present future directions for each. Section 8 concludes the paper.

Related Work
With the advances in hardware and computing power, and the ability to collect, store and process massive amounts of data, machine learning (ML) has found its way into many different scientific fields. The challenges faced by current 5G and future wireless networks have also pushed the wireless networking domain to seek innovative solutions to ensure expected network performance. To address these challenges, ML is increasingly used in wireless networks. In parallel, a growing number of surveys and tutorials are emerging on ML for future wireless networks. Table 1 provides an overview and comparison with the existing survey papers. For instance:
• In [10], the authors surveyed existing ML-based methods to address problems in Cognitive Radio Networks (CRNs).
• The authors of [11] survey ML approaches in WSNs (Wireless Sensor Networks) for various applications including localization, security, routing, data aggregation and MAC.
• Paper [17] surveys ML approaches (supervised, unsupervised and reinforcement learning) in IoT for big data analytics, event detection, data aggregation, etc.
• Paper [13] surveys ML approaches (supervised, unsupervised and reinforcement learning) for self-configuration, self-healing and self-optimization in cellular networks.
• Paper [14] surveys ML approaches for spectrum sensing and access in CRNs.
• Paper [19] surveys deep learning applications in IoT networks for big data and stream analytics.
• Paper [20] studies and surveys deep learning applications in cognitive radios for signal recognition tasks.
• The authors of [21] survey ML approaches in the context of IoT smart cities.
• Paper [22] surveys reinforcement learning applications for various applications including network access and rate control, wireless caching, data offloading, network security, traffic routing, resource sharing, etc.
Nevertheless, some of the aforementioned works focus on reviewing specific wireless networking tasks (for example, wireless signal recognition [20]), some focus on the application of specific ML techniques (for instance, deep learning [16], [15], [20]), while some focus on aspects of a specific wireless environment looking at broader applications (e.g. CRN [10], [14], [20], and IoT [17], [21]). Furthermore, we noticed that some works omit the necessary fundamentals for readers who seek to learn the basics of an area outside their specialty. Finally, no existing work focuses on the literature on how to apply ML techniques to improve wireless network performance, looking at possibilities at different layers of the network protocol stack.
To fill this gap, this paper provides a comprehensive introduction to ML for wireless networks and a survey of the latest advances in ML applications for performance improvement, addressing various challenges that future wireless networks are facing. We hope that this paper can help readers develop perspectives on, and identify trends in, this field, and foster subsequent studies on this topic.

Contributions
The main contributions of this paper are as follows: • An introduction for non-machine-learning experts to the necessary fundamentals of ML, AI, big data and data science in the context of wireless networks, with numerous examples. It examines when, why and how to use ML.
• A systematic and comprehensive survey of the state-of-the-art that i) demonstrates the diversity of challenges impacting wireless network performance that can be addressed with ML approaches, and ii) illustrates how ML is applied to improve the performance of wireless networks from various perspectives: PHY, MAC and the network layer.
• References to the latest research works (up to and including 2019) in the field of predictive ML approaches for improving the performance of wireless networks.
• Discussion on open challenges and future directions in the field.

Data Science Fundamentals
The objective of this section is to introduce the disciplines closely related to data-driven research and machine learning, and to explain how they relate to each other. Figure 3 shows a Venn diagram which illustrates the relation between data science, data mining, artificial intelligence (AI), machine learning (ML) and deep learning (DL), explained in more detail in the following subsections. This survey particularly focuses on ML/DL approaches in the context of wireless networks.

Data Science
Data science is the scientific discipline that studies everything related to data, from data acquisition, data storage, data analysis, data cleaning, data visualization, data interpretation, making decisions based on data, determining how to create value from data and how to communicate insights relevant to the business. One definition of the term data science, provided by Dhar [23], is: Data science is the study of the generalizable extraction of knowledge from data.
Data science makes use of data mining, machine learning, AI techniques and also other approaches such as: heuristics algorithms, operational research, statistics, causal inference, etc. Practitioners of data science are typically skilled in mathematics, statistics, programming, machine learning, big data tools and communicating the results.

Data Mining
Data mining aims to understand and discover new, previously unseen knowledge in the data. The term mining refers to extracting content by digging. Applying this analogy to data, it may mean to extract insights by digging into data. A simple definition of data mining is: Data mining refers to the application of algorithms for extracting patterns from data.
Compared to ML, data mining tends to focus on solving actual problems encountered in practice by exploiting algorithms developed by the ML community. For this purpose, a data-driven problem is first translated into a suitable data mining method [24], which will be discussed in detail in Section 5.

Artificial Intelligence
Artificial intelligence (AI) is concerned with making machines smart, aiming to create systems that behave like humans. This involves fields such as robotics, natural language processing, information retrieval, computer vision and machine learning. As coined by [25], AI is: The science and engineering of making intelligent machines, especially computer systems, by reproducing human intelligence through learning, reasoning and self-correction/adaptation. AI uses intelligent agents that perceive their environment and take actions that maximize their chance of successfully achieving their goals.

Machine Learning
Machine learning (ML) is a subset of AI. ML aims to develop algorithms that can learn from historical data and improve the system with experience. In fact, by feeding an algorithm with data, it is capable of adjusting its own internal parameters to become better at a certain task. As coined by [26]: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
ML experts focus on proving mathematical properties of new algorithms, compared to data mining experts, who focus on understanding the empirical properties of existing algorithms that they apply. Within the broader picture of data science, ML is the step that takes the cleaned/transformed data and predicts future outcomes. Although ML is not a new field, with the significant increase in available data and the developments in computing and hardware technology, ML has become one of the research hotspots of recent years, in both academia and industry [27].
Compared to traditional signal processing approaches (e.g. estimation and detection), machine learning models are data-driven models; they do not necessarily assume a model of the underlying physical processes that generated the data. Instead, we may say they "let the data speak", as they are able to infer or learn the model. For instance, when it is complex to model the underlying physics that generated the wireless data, and given that a sufficient amount of data is available to infer a model that generalizes well beyond what it has seen, ML may outperform traditional signal processing and expert-based systems. However, a representative amount of high-quality data is required. The advantage of ML is that the resulting models are less prone to modeling errors of the data generation process.

Deep Learning
Deep learning is a subset of ML in which data is passed through multiple non-linear transformations to calculate an output. The term deep refers to the many transformation steps in this case. A definition provided by [28] is: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.
A key advantage of deep learning over traditional ML approaches is that it can automatically extract high-level features from complex data. The learning process does not need to be designed by a human, which tremendously simplifies prior feature handcrafting [28].
However, the performance of DNNs comes at the cost of the model's interpretability. Namely, DNNs are typically seen as black boxes, and there is a lack of insight into why they make certain decisions. Further, DNNs usually suffer from complex hyper-parameter tuning, and finding their optimal configuration can be challenging and time-consuming. Furthermore, training deep learning networks is computationally demanding and may require advanced parallel computing hardware such as graphics processing units (GPUs). Hence, when deploying deep learning models on embedded or mobile devices, the energy and computing constraints of the devices should be considered.
There has been growing interest in deep learning in recent years. Figure 4 demonstrates this, showing the Google search trend for the field over the past few years.

Machine Learning Fundamentals
Wireless networks are an interesting application area for data science due to their unpredictable nature: they are influenced by both natural phenomena and man-made artifacts. This section sets up the necessary fundamentals for the reader to understand the concepts of machine learning.

The Machine Learning Pipeline
Prior to applying machine learning algorithms to a wireless networking problem, the problem first needs to be translated into a data science problem. In fact, the whole process from problem to solution may be seen as a machine learning pipeline consisting of several steps. Figure 5 illustrates those steps, which are briefly explained below: • Problem definition. In this step the problem is identified and translated into a data science problem. This is achieved by formulating the problem as a data mining task. Section 5 further elaborates on popular data mining methods, such as classification and regression, and presents case studies of wireless networking problems of each type. In this way, we hope to help the reader understand how to formulate a wireless networking problem as a data science problem.
• Data collection. In this step, the needed amount of data to solve the formulated problem is identified and collected. The result of this step is raw data.
• Data preparation. After the problem is formulated and data is collected, the raw data is preprocessed: it is cleaned and transformed into a new space where each data pattern is represented by a vector, x ∈ R^n. This is known as the feature vector, and its n elements are known as features. Through the process of feature extraction, each pattern becomes a single point in an n-dimensional space, known as the feature space or the input space. Typically, one starts with some large number P of candidate features and eventually selects the n most informative ones during the feature selection process.
• Model training. After defining the feature space in which the data lies, one has to train a machine learning algorithm to obtain a model. This process starts by forming the training data or training set, S = {(x_i, y_i), i = 1, ..., M}, where x_i ∈ R^n is the feature vector of the i-th observation and y_i is the corresponding known output value (label). In fact, various ML algorithms are trained and tuned (by tuning their hyper-parameters), the resulting models are evaluated based on standard performance metrics (e.g. mean squared error, precision, recall, accuracy, etc.), and the best performing model is chosen (i.e. model selection).
• Model deployment. The selected ML model is deployed into a practical wireless system, where it is used to make predictions. For instance, given unknown raw data, first the feature vector x is formed, and then it is fed into the ML model for making predictions. Furthermore, the deployed model is continuously monitored to observe how it behaves in the real world. To make sure it stays accurate, it may be retrained.
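As a toy illustration of the pipeline steps above, the following sketch runs through data collection, data preparation (feature extraction), model training and evaluation on synthetic RSSI traces. The data, the two-feature representation and the nearest-centroid model are all hypothetical stand-ins for a real deployment:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Data collection": hypothetical raw RSSI traces (dBm) for two link classes.
good = rng.normal(-55, 3, size=(50, 20))   # 50 traces of 20 samples each
bad = rng.normal(-85, 3, size=(50, 20))

def extract_features(trace):
    """Data preparation: map a raw trace to a feature vector x in R^2."""
    return np.array([trace.mean(), trace.std()])

X = np.array([extract_features(t) for t in np.vstack([good, bad])])
y = np.array([1] * 50 + [0] * 50)          # labels: 1 = good link, 0 = bad

# Split observations into training and test sets.
idx = rng.permutation(len(X))
train, test = idx[:70], idx[70:]

# "Model training": a minimal model, the nearest class centroid in feature space.
centroids = {c: X[train][y[train] == c].mean(axis=0) for c in (0, 1)}

def predict(x):
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# Evaluation on held-out data (standard performance metric: accuracy).
accuracy = np.mean([predict(x) == t for x, t in zip(X[test], y[test])])
```

In a real system, the chosen model would then be deployed into the network stack and monitored, as described in the model deployment step.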
Further below, the ML stage is elaborated in more detail.

Learning the model
Given a training set S, the goal of a machine learning algorithm is to learn a mathematical model for f. Here, f is some fixed but unknown function that defines the relation between x and y, that is, y = f(x). The function f is obtained by applying the selected learning method to the training set S, so that f is a good estimator for new unseen data, i.e., ŷ = f(x_new) ≈ y_new. In machine learning, f is called the predictor, because its task is to predict the outcome y_i based on the input value x_i. Two popular predictors are the regressor, where y = f(x) takes continuous values, and the classifier, where y = f(x) takes values in a discrete set of classes. In other words, when the output variable y is continuous or quantitative, the learning problem is a regression problem; if y takes a discrete or categorical value, it is a classification problem.
In the case when the predictor f is parameterized by a vector θ ∈ R^n, it describes a parametric model. In this setup, the problem of estimating f reduces to one of estimating the parameters θ = [θ_1, θ_2, ..., θ_n]^T. In most practical applications, the observed data are noisy versions of the expected values that would be obtained under ideal circumstances. These unavoidable errors prevent the extraction of the true parameters from the observations. With this in regard, the generic data model may be expressed as y = f(x) + ε, where f(x) is the model and ε represents additive measurement errors and other discrepancies. The goal of ML is to find the input-output relation that will "best" match the noisy observations. Hence, the vector θ may be estimated by solving a (convex) optimization problem. First, a loss or cost function l(x, y, θ) is set, which is a (point-wise) measure of the error between the observed data point y_i and the model prediction f(x_i) for each value of θ. However, θ is estimated on the whole training set S, not just one example. For this task, the average loss over all training examples, called the training loss, J, is calculated as J(θ) = (1/M) Σ_{i=1..M} l(x_i, y_i, θ), where the instances (x_i, y_i) come from the training set S. The vector θ* that minimizes the training loss, θ* = argmin_θ J(θ), gives the desired model. Once the model is estimated, for any given input x, the prediction for y can be made with ŷ = θ^T x (for a linear model).
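The parameter estimation described above can be made concrete with a short sketch: a synthetic linear data model y = θ^T x + ε, the average squared loss as J(θ), and plain gradient descent as the optimizer. The data, true parameters and learning rate are illustrative choices, not from the survey:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic noisy observations: y = f(x) + eps, with true theta = [2, -3].
X = np.hstack([np.ones((200, 1)), rng.uniform(-1, 1, (200, 1))])
theta_true = np.array([2.0, -3.0])
y = X @ theta_true + rng.normal(0, 0.1, 200)

def training_loss(theta):
    """J(theta): average squared loss over the whole training set S."""
    return np.mean((X @ theta - y) ** 2)

# Minimize J(theta) by gradient descent (a simple convex optimizer here).
theta = np.zeros(2)
for _ in range(2000):
    grad = 2 * X.T @ (X @ theta - y) / len(y)
    theta -= 0.1 * grad
```

The recovered theta approaches the true parameters up to the noise level, and predictions are then made with ŷ = θ^T x.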

Learning the features
The prediction accuracy of ML models heavily depends on the choice of the data representation, or features, used for training. For that reason, much effort in designing ML models goes into the composition of pre-processing and data transformation chains that result in a representation of the data that can support effective ML predictions. Informally, this is referred to as feature engineering. Feature engineering is the process of extracting, combining and manipulating features, by taking advantage of human ingenuity and prior expert knowledge, to arrive at more representative ones. The feature extractor φ transforms the data vector d ∈ R^D into a new form, x ∈ R^n, n ≤ D, more suitable for making predictions, that is, x = φ(d). For instance, the authors of [29] engineered features from the RSSI (Received Signal Strength Indication) distribution to identify wireless signals. The importance of feature engineering highlights a bottleneck of ML algorithms: their inability to automatically extract discriminative information from data. Feature learning is a branch of machine learning that moves the concept of learning from "learning the model" to "learning the features". One popular feature learning method is deep learning, discussed in detail in Section 4.3.9.
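A hypothetical feature extractor φ can be sketched as follows. This is only loosely inspired by the idea of RSSI distribution features mentioned above; the specific statistics chosen here are our own illustrative example, not those of [29]:

```python
import numpy as np

def phi(d):
    """Hypothetical feature extractor: map a raw RSSI trace d (a long
    vector of dBm readings) to a compact feature vector x of
    distribution statistics: mean, std and the 25th/75th percentiles."""
    q25, q75 = np.percentile(d, [25, 75])
    return np.array([d.mean(), d.std(), q25, q75])

trace = np.array([-70.0, -72.0, -68.0, -71.0, -69.0, -90.0])  # one deep fade
x = phi(trace)  # x lives in a lower-dimensional feature space than d
```

The extractor compresses the raw trace into a handful of features while keeping information (e.g. spread, outliers) that a downstream classifier can exploit.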

Types of learning paradigms
This section discusses various types of learning paradigms in ML, summarized in Figure 6.

Supervised vs. Unsupervised vs. Semi-Supervised Learning

Learning can be categorized by the amount of knowledge or feedback that is given to the learner as either supervised or unsupervised.
Supervised Learning. Supervised learning utilizes predefined inputs and known outputs to build a system model. The set of inputs and outputs forms the labeled training dataset that is used to teach a learning algorithm how to predict future outputs for new inputs that were not part of the training set. Supervised learning algorithms are suitable for wireless network problems where prior knowledge about the environment exists and data can be labeled. For example, predicting the location of a mobile node using an algorithm that is trained on signal propagation characteristics (inputs) at known locations (outputs). Various challenges in wireless networks have been addressed using supervised learning, such as: medium access control [30][31][32][33], routing [34], link quality estimation [35,36], node clustering in WSN [37], localization [38][39][40], adding reasoning capabilities for cognitive radios [41][42][43][44][45][46][47], etc. Supervised learning has also been extensively applied to different types of wireless network applications such as: human activity recognition [48][49][50][51][52][53], event detection [54][55][56][57][58], electricity load monitoring [59,60], security [61][62][63], etc. Some of these works will be analyzed in more detail later.
Unsupervised Learning. Unsupervised learning algorithms try to find hidden structures in unlabeled data. The learner is provided only with inputs without known outputs, while learning is performed by finding similarities in the input data. As such, these algorithms are suitable for wireless network problems where no prior knowledge about the outcomes exists, or annotating data (labelling) is difficult to realize in practice. For instance, automatic grouping of wireless sensor nodes into clusters based on their current sensed data values and geographical proximity (without knowing a priori the group membership of each node) can be solved using unsupervised learning. In the context of wireless networks, unsupervised learning algorithms are widely used for: data aggregation [64], node clustering for WSNs [64][65][66][67], data clustering [68][69][70], event detection [71] and several cognitive radio applications [72,73], dimensionality reduction [74], etc.
Semi-Supervised Learning. Several mixes between the two learning methods exist and materialize into semi-supervised learning [75]. Semi-supervised learning is used in situations where a small amount of labeled data and a large amount of unlabeled data exist. It has great practical value because it may alleviate the cost of rendering a fully labeled training set, especially in situations where it is infeasible to label all instances. For instance, in human activity recognition systems, where the activities change so fast that some activities stay unlabeled, or where the user is not willing to cooperate in the data collection process, semi-supervised learning might be the best candidate to train a recognition model [76][77][78]. Other potential use cases in wireless networks are localization systems, where it can alleviate the tedious and time-consuming process of collecting training data (calibration) in fingerprinting-based solutions [79], semi-supervised traffic classification [80], etc.

Offline vs. Online vs. Active Learning
Learning can be categorized depending on the way the information is given to the learner as either offline or online learning. In offline learning the learner is trained on the entire training data at once, while in online learning the training data becomes available in a sequential order and is used to update the representation of the learner in each iteration.
Offline Learning. Offline learning is used when the system that is being modeled does not change its properties dynamically. Offline learned models are easy to implement because the models do not have to keep on learning constantly, and they can be easily retrained and redeployed in production. For example, in [81] a learning-based link quality estimator is implemented by deploying an offline trained model into the network stack of Tmote Sky wireless nodes. The model is trained based on measurements about the current status of the wireless channel that are obtained from extensive experiment setups from a wireless testbed.
Another use case is human activity recognition systems, where an offline trained classifier is deployed to recognize actions from users. The classifier model can be trained based on information extracted from raw measurements collected by sensors integrated in a smartphone, which is at the same time the central processing unit that implements the offline learned model for online activity recognition [82].

Figure 6: Summary of types of learning paradigms
Online Learning. Online learning is useful for problems where training examples arrive one at a time, or when due to limited resources it is computationally infeasible to train over the entire dataset. For instance, in [83] a decentralized learning approach for anomaly detection in wireless sensor networks is proposed. The authors concentrate on detection methods that can be applied online (i.e., without the need of an offline learning phase) and that are characterized by a limited computational footprint, so as to accommodate the stringent hardware limitations of WSN nodes. Another example can be found in [84], where the authors propose an online outlier detection technique that can sequentially update the model and detect measurements that do not conform to the normal behavioral pattern of the sensed data, while maintaining the resource consumption of the network to a minimum.

Active Learning. In active learning, the learner interactively selects the most useful training examples and queries for their labels, aiming to reach high accuracy with as few labeled instances as possible. Active learning has been a major topic in recent years in ML, and an exhaustive literature survey is beyond the scope of this paper. We refer the reader to [86][87][88] for more details on active learning algorithms.
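As an illustrative sketch of online learning (not the specific methods of [83] or [84]), the following detector updates a running mean and variance one measurement at a time using Welford's algorithm, with O(1) memory, and flags readings that deviate strongly from the model learned so far:

```python
class OnlineOutlierDetector:
    """Minimal online learner: sequentially updates a running mean/variance
    (Welford's algorithm) and flags measurements more than `k` standard
    deviations away from the model learned so far."""

    def __init__(self, k=3.0):
        self.n, self.mean, self.m2, self.k = 0, 0.0, 0.0, k

    def update(self, x):
        outlier = False
        if self.n >= 2:
            std = (self.m2 / (self.n - 1)) ** 0.5
            outlier = abs(x - self.mean) > self.k * std
        # Sequentially update the model with the new measurement.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return outlier
```

The constant memory footprint matches the motivation above: no offline training phase and minimal resource consumption on constrained sensor nodes.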

Machine Learning Algorithms
This section reviews popular ML algorithms used in wireless networks research.

Linear Regression
Linear regression is a supervised learning technique used for modeling the relationship between a set of input (independent) variables (x) and an output (dependent) variable (y), so that the output is a linear combination of the input variables: y = θ^T x = θ_0 + θ_1 x_1 + ... + θ_n x_n, where x = [1, x_1, ..., x_n]^T, and θ = [θ_0, θ_1, ..., θ_n]^T is the parameter vector estimated from a given training set (y_i, x_i), i = 1, 2, ..., m.
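A minimal numerical sketch with hypothetical coefficients: the parameter vector θ of a linear regression model can be estimated from a training set by ordinary least squares:

```python
import numpy as np

# Synthetic training set: y = 1.5 + 0.8*x1 - 2.0*x2 (noise-free for clarity).
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, (100, 2))
y = 1.5 + 0.8 * X[:, 0] - 2.0 * X[:, 1]

# Prepend the intercept column x0 = 1 and solve the least-squares
# problem for theta = [theta0, theta1, theta2].
Xb = np.hstack([np.ones((100, 1)), X])
theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
```

With noise-free data the least-squares solution recovers the generating coefficients exactly; with noisy data it recovers them approximately.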

Nonlinear Regression
Nonlinear regression is a supervised learning technique that models the observed data by a function that is a nonlinear combination of the model parameters and one or more independent input variables. An example of nonlinear regression is the polynomial regression model, defined by: y = θ_0 + θ_1 x + θ_2 x^2 + ... + θ_n x^n.
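Note that a polynomial model, while nonlinear in the input x, is linear in the parameters θ, so it can still be fit by least squares on the basis [1, x, x², ...]. A short sketch with an assumed quadratic ground truth (the coefficients and noise level are illustrative):

```python
import numpy as np

# Noisy samples of a quadratic: y = 1 - 2x + 0.5x^2 + eps.
rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 200)
y = 1 - 2 * x + 0.5 * x**2 + rng.normal(0, 0.05, x.size)

# Fit a degree-2 polynomial; coefficients are returned in
# increasing order of degree: [theta0, theta1, theta2].
theta = np.polynomial.polynomial.polyfit(x, y, deg=2)
```

The fitted coefficients approach the generating ones as the noise shrinks or the number of samples grows.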

Logistic Regression
Logistic regression [89] is a simple supervised learning algorithm widely used for implementing linear classification models, meaning that the models define smooth linear decision boundaries between different classes. At the core of the learning algorithm is the logistic function, which is used to learn the model parameters and predict future instances. The logistic function, f(z), is given by f(z) = 1 / (1 + e^(−z)), where z = θ_0 + θ_1 x_1 + ... + θ_n x_n, and x_1, ..., x_n are the independent (input) variables that we wish to use to describe or predict the dependent (output) variable y = f(z).
The range of f (z) is between 0 and 1, regardless of the value of z, which makes it popular for classification tasks. Namely, the model is designed to describe a probability, which is always some number between 0 and 1.
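A minimal sketch of the logistic function and its use as a probability estimate; the parameter values below are hypothetical:

```python
import math

def logistic(z):
    """The logistic function f(z) = 1 / (1 + e^(-z)), with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# z is a linear combination of the inputs; f(z) is read as a class
# probability, with 0.5 as the usual decision boundary.
theta = [-1.0, 2.0]                 # hypothetical learned [theta0, theta1]

def predict_proba(x1):
    z = theta[0] + theta[1] * x1
    return logistic(z)
```

However large or small z becomes, the output stays strictly between 0 and 1, which is what makes the function suitable for describing probabilities.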

Decision Trees
Decision trees (DT) [90] are a supervised learning algorithm that creates a tree-like graph or model representing the possible outcomes or consequences of using certain input values. The tree consists of one root node, internal nodes called decision nodes, which test their input against a learned expression, and leaf nodes, which correspond to a final class or decision. The learned tree can be used to derive simple decision rules for decision problems, or to classify future instances by starting at the root node and moving through the tree until a leaf node is reached, where a class label is assigned. However, constructing an optimal decision tree is NP-complete [91], so practical learning algorithms build the tree greedily, one split at a time.
Many algorithms exist for learning such a tree, including the simple Iterative Dichotomiser 3 (ID3) and its improved version, C4.5.
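To make the decision-node mechanics concrete, the following sketch learns a depth-1 tree (a decision stump) by exhaustively searching for the feature/threshold split with the fewest misclassifications. The toy link-quality data is hypothetical, and real DT algorithms such as ID3/C4.5 use impurity measures like information gain rather than raw error counts:

```python
def fit_stump(X, y):
    """Learn a depth-1 decision tree (one decision node, two leaves):
    find the feature j and threshold t minimizing misclassifications."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            # Leaf labels: majority class on each side of the split.
            l_lab = max(set(left), key=left.count)
            r_lab = max(set(right), key=right.count)
            errors = (sum(yi != l_lab for yi in left)
                      + sum(yi != r_lab for yi in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    _, j, t, l_lab, r_lab = best
    return lambda row: l_lab if row[j] <= t else r_lab

# Toy link-quality data: [RSSI (dBm), retransmissions] -> good (1) / bad (0).
X = [[-50, 0], [-55, 1], [-60, 0], [-85, 4], [-90, 5], [-80, 3]]
y = [1, 1, 1, 0, 0, 0]
tree = fit_stump(X, y)
```

Deeper trees are grown by applying the same split search recursively to each side of the chosen split.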

Random Forest
Random forests (RF) are bagged decision trees. Bagging is a technique which involves training many classifiers and considering the average output of the ensemble. In this way, the variance of the overall ensemble classifier can be greatly reduced. Bagging is often used with DTs, as they are not very robust to errors due to variance in the input data. Random forests are created by the procedure given in Algorithm 1; Figure 7 illustrates this process.

SVM
Support Vector Machine (SVM) [92] is a learning algorithm that solves classification problems by first mapping the input data into a higher-dimensional feature space in which it becomes linearly separable by a hyperplane, which is then used for classification. In support vector regression, this hyperplane is instead used to predict a continuous-valued output. The mapping from the input space to the high-dimensional feature space is non-linear and is achieved using kernel functions. Different kernel functions suit different application domains. The most common kernel functions used in SVM are the linear kernel, the polynomial kernel and the radial basis function (RBF) kernel.

Algorithm 1: Random Forest
Input: Training set D
Output: Predicted value h(x)
Procedure:
• Sample k datasets D_1, ..., D_k from D with replacement.
• For each D_i, train a decision tree classifier h_i() to the maximum depth, and when splitting the tree only consider a random subset of l features. If d is the number of features in each training example, the parameter l ≤ d is typically set to l = √d.
• The ensemble classifier output h(x) is then the mean (for regression) or the majority vote (for classification) of the outputs of all decision trees.
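A toy sketch of this procedure, with depth-1 decision "stumps" standing in for full-depth trees (a simplification for brevity; the data and the stump's thresholding rule are invented for illustration):

```python
import random

def bootstrap_sample(data, rng):
    """Sample len(data) points from data with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Hypothetical stand-in for a full decision tree: a depth-1 'stump'
    that thresholds the single feature at the sample mean."""
    xs = [x for x, _ in sample]
    t = sum(xs) / len(xs)
    maj = lambda ys: max(set(ys), key=ys.count) if ys else 0
    left = maj([y for x, y in sample if x <= t])    # majority class on each side
    right = maj([y for x, y in sample if x > t])
    return lambda x, t=t, l=left, r=right: l if x <= t else r

def random_forest_predict(trees, x):
    """Majority vote over the ensemble."""
    votes = [tree(x) for tree in trees]
    return max(set(votes), key=votes.count)

rng = random.Random(0)
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
trees = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(random_forest_predict(trees, 0.15), random_forest_predict(trees, 0.85))
```

Each stump sees a different bootstrap sample, so the individual trees disagree slightly; averaging their votes reduces the variance of the ensemble.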

The RBF kernel is defined as K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)), where σ is a user-defined parameter.
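A minimal sketch of the RBF kernel K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)):

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel: similarity decays with the squared Euclidean distance."""
    d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))          # identical points -> 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]) < 1e-5)   # distant points -> near 0: True
```

Larger σ makes the kernel flatter, i.e., points remain "similar" over larger distances.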

k-NN
k nearest neighbors (k-NN) [93] is a learning algorithm that can solve classification and regression problems by looking into the distance (closeness) between input instances. It is called a non-parametric learning algorithm because, unlike other supervised learning algorithms, it does not learn an explicit model function from the training data. Instead, the algorithm simply memorizes all previous instances and then predicts the output by first searching the training set for the k closest instances and then: (i) for classification-predicts the majority class amongst those k nearest neighbors, while (ii) for regression-predicts the output value as the average of the values of its k nearest neighbors. Because of this approach, k-NN is considered a form of instance-based or memory-based learning.
k-NN is widely used since it is one of the simplest forms of learning. It is also considered lazy learning, as the learner stays passive until a prediction has to be performed; hence, no computation is required until the prediction task itself. The pseudocode for k-NN [94] is summarized in Algorithm 2.
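A minimal sketch of the prediction step for classification (the toy 2-D training data is invented for illustration):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify query by majority vote among the k closest
    training instances (Euclidean distance)."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

train = [((0.0, 0.0), 'A'), ((0.1, 0.2), 'A'),
         ((1.0, 1.0), 'B'), ((0.9, 1.1), 'B')]
print(knn_predict(train, (0.2, 0.1), k=3))  # 'A'
```

Note that there is no training phase at all: the "model" is simply the stored training set, which is what makes k-NN instance-based.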

k-Means
k-Means is an unsupervised learning algorithm used for clustering problems. The goal is to assign a number of points, x_1, ..., x_m, to K groups or clusters, so that the resulting intra-cluster similarity is high, while the inter-cluster similarity is low. The similarity is measured with respect to the mean value of the data points in a cluster. Figure 8 illustrates an example of k-means clustering, where K = 3 and the input dataset consists of two features, with data points plotted along the x and y axes.
On the left side of Figure 8 are the data points before k-means is applied, while on the right side are the 3 identified clusters and their centroids, represented with squares.
The pseudocode for k-means [94] is summarized in Algorithm 3.

Algorithm 3: k-means
Input: K: the number of desired clusters; X = {x_1, x_2, ..., x_m}: input dataset with m data points
Output: A set of K clusters
Procedure:
1. Set the cluster centroids µ_k, k = 1, ..., K, to arbitrary values;
2. while the centroids µ_k change do
   (a) (Re)assign each item x_i to the cluster with the closest centroid.
   (b) Update µ_k, k = 1, ..., K, as the mean value of the data points in each cluster.
   end
return K clusters
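A minimal 1-D sketch of this procedure (Lloyd's algorithm; the toy data is invented for illustration):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Alternate (a) assignment of points to the nearest centroid and
    (b) centroid update as the cluster mean, until convergence."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            i = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[i].append(x)
        new = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:     # no change -> converged
            break
        centroids = new
    return sorted(centroids)

print(kmeans([1.0, 1.2, 0.8, 10.0, 10.2, 9.8], k=2))  # two centroids near 1.0 and 10.0
```

Note that the result can depend on the arbitrary initialization; in practice k-means is often restarted several times with different seeds.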

Neural Networks
Neural networks (NN) [95], or artificial neural networks (ANN), are a supervised learning algorithm inspired by the workings of the brain, typically used to derive complex, non-linear decision boundaries for building a classification model, but also suitable for training regression models when the goal is to predict real-valued outputs (regression problems are explained in Section 5.1). Neural networks are known for their ability to identify complex trends and detect complex non-linear relationships among the input variables, at the cost of a higher computational burden. A neural network model consists of one input layer, a number of hidden layers and one output layer, as shown in Figure 9.
The formulation for a single layer is y = g(wx + b), where x is a training example input, y is the layer output, w are the layer weights, and b is the bias term. The input layer corresponds to the input data variables. Each hidden layer consists of a number of processing elements called neurons, each of which processes its inputs (the data from the previous layer) using an activation or transfer function g() that translates the input signals to an output signal. Commonly used activation functions are the unit step function, the linear function, the sigmoid function and the hyperbolic tangent function. The elements between each layer are highly connected by connections with numeric weights that are learned by the algorithm. The output layer outputs the prediction (i.e., the class) for the given inputs, according to the interconnection weights learned in the hidden layers. The algorithm is again gaining popularity in recent years because of new techniques and more powerful hardware that enable training complex models for solving complex tasks. In general, well-tuned neural networks are able to approximate any function of interest, which is why they are considered universal approximators [96]. Deep neural networks. Deep neural networks are a special type of NN consisting of multiple layers able to perform feature transformation and extraction. As opposed to a traditional NN, they have the potential to alleviate manual feature extraction, a process that depends heavily on prior knowledge and domain expertise [97].
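The single-layer computation y = g(wx + b) described above can be sketched as follows (the weights, biases and inputs are arbitrary illustrative values):

```python
import math

def dense_layer(x, W, b, g):
    """One fully-connected layer: y = g(W x + b), applied element-wise."""
    z = [sum(wij * xj for wij, xj in zip(row, x)) + bi
         for row, bi in zip(W, b)]
    return [g(zi) for zi in z]

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# Two inputs, two neurons: W is 2x2, b has one bias per neuron.
y = dense_layer([1.0, 2.0],
                W=[[0.5, -0.25], [0.1, 0.3]],
                b=[0.0, 0.1],
                g=sigmoid)
print([round(v, 3) for v in y])  # [0.5, 0.69]
```

Stacking such layers, each feeding its output into the next, yields the multi-layer (deep) networks discussed below.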
Various deep learning techniques exist, including: deep neural networks (DNN), convolutional neural networks (CNN), recurrent neural networks (RNN) and deep belief networks (DBN), which have shown success in various fields of science including computer vision, automatic speech recognition, natural language processing, bioinformatics, etc, and increasingly also in wireless networks.
Convolutional neural networks. Convolutional neural networks (CNN) perform feature learning via non-linear transformations implemented as a series of nested layers. The input data is a multidimensional data array, called a tensor, that is presented at the visible layer. This is typically data with a grid-like topological structure, e.g. time-series data, which can be seen as a 1D grid of samples taken at regular time intervals, pixels in images with a 2D layout, the 3D structure of videos, etc. Then a series of hidden layers extract increasingly abstract features. The hidden layers consist of a series of convolution, pooling and fully-connected layers, as shown in Figure 10.
Those layers are "hidden" because their values are not given. Instead, the deep learning model must determine which data representations are useful for explaining the relationships in the observed data. Each convolution layer consists of several kernels (i.e. filters) that perform a convolution over the input; therefore, they are also referred to as convolutional layers. Kernels are feature detectors, that convolve over the input and produce a transformed version of the data at the output. Those are banks of finite impulse response filters as seen in signal processing, just learned on a hierarchy of layers. The filters are usually multidimensional arrays of parameters that are learnt by the learning algorithm [98] through a training process called backpropagation.
For instance, given a two-dimensional input x, a two-dimensional kernel h computes the 2D convolution as (x ∗ h)[i, j] = Σ_m Σ_n x[m, n] h[i − m, j − n], i.e., the dot product between the kernel weights and the small region of the input they are connected to.
After the convolution, a bias term is added and a point-wise non-linearity g is applied, forming a feature map at the filter output. If we denote the l-th feature map at a given convolutional layer as h_l, whose filters are determined by the weights W_l and the bias b_l, then for input x the feature map h_l is obtained as h_l = g(W_l ∗ x + b_l), where ∗ is the 2D convolution defined above and g(·) is the activation function.
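A minimal sketch of the 2D operation (implemented here, as in most deep learning libraries, as "valid" cross-correlation, i.e., without flipping the kernel; the input and kernel values are invented for illustration):

```python
def conv2d_valid(x, h):
    """Slide kernel h over input x and take the dot product at each
    position ('valid' mode: no padding, output shrinks by kernel size - 1)."""
    n, m = len(x), len(x[0])
    kn, km = len(h), len(h[0])
    out = []
    for i in range(n - kn + 1):
        row = []
        for j in range(m - km + 1):
            row.append(sum(x[i + a][j + b] * h[a][b]
                           for a in range(kn) for b in range(km)))
        out.append(row)
    return out

x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
h = [[1, 0],
     [0, -1]]                     # a simple diagonal difference detector
print(conv2d_valid(x, h))         # [[-4, -4], [-4, -4]]
```

In a CNN, the entries of h are not hand-designed as here but learned by backpropagation.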
Common activation functions encountered in deep neural networks are the rectifier, defined as g(x) = max(0, x); the hyperbolic tangent, g(x) = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)); and the sigmoid activation, g(x) = σ(x) = 1 / (1 + e^(−x)). The sigmoid activation is rarely used because its activations saturate at either tail of 0 or 1, and they are not centered at 0 as with the tanh. The tanh normalizes the input to the range [−1, 1], but compared to the rectifier its activations also saturate, which causes unstable gradients. Therefore, the rectifier activation function is typically used for CNNs. Units using the rectifier are called ReLUs (Rectified Linear Units) and have been shown to greatly accelerate convergence during the training process compared to other activation functions. They also do not cause vanishing or exploding gradients in the optimization phase when minimizing the cost function. In addition, the ReLU simply thresholds the input, x, at zero, while other activation functions involve expensive operations.
In order to form a richer representation of the input signal, commonly, multiple filters are stacked so that each hidden layer consists of multiple feature maps, {h (l) , l = 0, ..., L} (e.g., L = 64, 128, ..., etc). The number of filters per layer is a tunable parameter or hyper-parameter. Other tunable parameters are the filter size, the number of layers, etc. The selection of values for hyper-parameters may be quite difficult, and finding it commonly is much an art as it is science. An optimal choice may only be feasible by trial and error. The filter sizes are selected according to the input data size so as to have the right level of granularity that can create abstractions at the proper scale. For instance, for a 2D square matrix input, such as spectrograms, common choices are 3 × 3, 5 × 5, 9 × 9, etc. For a wide matrix, such as a real-valued representation of the complex I and Q samples of the wireless signal in R 2×N , suitable filter sizes may be 1 × 3, 2 × 3, 2 × 5, etc.
After a convolutional layer, a pooling layer may be used to merge semantically similar features into one. In this way, the spatial size of the representation is reduced, which reduces the number of parameters and the computation in the network. Examples of pooling units are max pooling (computes the maximum value of a local patch of units in one feature map) and neighbouring pooling (takes its input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions), etc.
The penultimate layer in a CNN consists of neurons that are fully-connected with all feature maps in the preceding layer. Therefore, these layers are called fully-connected or dense layers. The very last layer is a softmax classifier, which computes the probability of each class i as ŷ_i = e^(z_i) / Σ_j e^(z_j).

That is, the scores z_i computed at the output layer, also called logits, are translated into probabilities. A loss function, l, calculated on the last fully-connected layer, measures the difference between the estimated probabilities, ŷ_i, and the one-hot encoding of the true class labels, y_i. The CNN parameters, θ, are obtained by minimizing the loss function over the training set S, where l(·) is typically the mean squared error l(y, ŷ) = ‖y − ŷ‖²₂ or the categorical cross-entropy l(y, ŷ) = Σ_{i=1}^m y_i log(ŷ_i), to which a minus sign is often added in front to get the negative log-likelihood. The softmax classifier is then trained by solving an optimization problem that minimizes the loss function. The optimal solution is the set of network parameters that fully describes the CNN model, that is, θ̂ = argmin_θ J(S, θ), where J(S, θ) denotes the total loss over the training set.
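The softmax translation of logits into probabilities, and the negative log-likelihood loss, can be sketched as follows (the logit values are invented for illustration):

```python
import math

def softmax(logits):
    """Translate raw scores (logits) into probabilities summing to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y_true_onehot, y_pred):
    """Negative log-likelihood of the true class."""
    return -sum(y * math.log(p) for y, p in zip(y_true_onehot, y_pred))

p = softmax([2.0, 1.0, 0.1])
print([round(v, 3) for v in p])           # [0.659, 0.242, 0.099]
print(round(cross_entropy([1, 0, 0], p), 3))
```

Subtracting the maximum logit before exponentiating does not change the result but avoids overflow for large scores.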
Currently, there is no consensus on the choice of the optimization algorithm. The most successful optimization algorithms appear to be stochastic gradient descent (SGD), RMSProp, Adam, AdaDelta, etc. For a comparison of these, we refer the reader to [99].
To control over-fitting, regularization is typically used in combination with dropout, an extremely effective technique that "drops out" a random set of activations in a layer. Each unit is retained with a fixed probability p, typically chosen using a validation set or simply set to 0.5, which has been shown to be close to optimal for a wide range of applications [100].
Recurrent neural networks. Recurrent neural networks (RNN) [101] are a type of neural network where connections between nodes form a directed graph along a temporal sequence. They are called recurrent because of the recurrent connections between the hidden units. This is mathematically denoted as h_t = g(W_x x_t + W_h h_(t−1) + b), where h_t is the hidden state at time step t, x_t is the input at time step t, and W_x and W_h are the input and recurrent weights. Figure 11 shows a graphical representation of RNNs. The left part of Figure 11 presents the "folded" network, while the right part shows the "unfolded" network with its recurrent connections propagating information forward in time. An activation function is applied in the hidden units, and the softmax may be used to calculate the prediction.
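A minimal sketch of the recurrent update for a scalar input and a scalar hidden state (the weights and input sequence are illustrative values):

```python
import math

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """One recurrent update: h_t = tanh(Wx*x_t + Wh*h_prev + b)."""
    return math.tanh(Wx * x_t + Wh * h_prev + b)

h = 0.0
for x in [1.0, 0.5, -1.0]:        # "unfold" the network over the sequence
    h = rnn_step(x, h, Wx=0.8, Wh=0.5, b=0.0)
print(round(h, 3))
```

The same weights Wx and Wh are reused at every time step, which is what the "unfolded" view in Figure 11 makes explicit.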
There are various extensions of RNNs. A popular extension is the LSTM, which augments the traditional RNN model by adding a self loop on the state of the network to better remember relevant information over longer periods of time.

Data Science Problems in Wireless Networks
The ultimate goal of data science is to extract knowledge from data, i.e., turn data into real value [102]. At the heart of this process are algorithms that can learn from and make predictions on data, i.e., machine learning algorithms. In the context of wireless networks, learning is a mechanism that enables context awareness and intelligence capabilities in different aspects of wireless communication. Over the last years, it has gained popularity due to its success in enhancing network-wide performance (i.e., QoS) [103], facilitating intelligent behavior by adapting to complex and dynamically changing (wireless) environments [104] and its ability to add automation for realizing concepts of self-healing and self-optimization [105]. During the past years, different data-driven approaches have been studied in the context of: mobile ad hoc networks [106], wireless sensor networks [107], wireless body area networks [50], cognitive radio networks [108,109] and cellular networks [110]. These approaches are focused on addressing various topics including: medium access control [30,111], routing [81,112], data aggregation and clustering [64,113], localization [114,115], energy harvesting communication [116], spectrum sensing [44,47], etc.
As explained in Section 4.1, prior to applying ML to a wireless networking problem, the problem first needs to be formulated as an appropriate data mining problem.
This section explains the following methods:
• Regression
• Classification
• Clustering

• Anomaly Detection
For each problem type, several wireless networking case studies are discussed together with the ML algorithms that are applied to solve the problem.

Regression
Regression is a data mining method suitable for problems that aim to predict a real-valued output variable, y, as illustrated in Figure 12. Given a training set, S, the goal is to estimate a function, f, whose graph fits the data. Once the function f is found, it can predict the output value when an unknown point arrives. This function f is known as the regressor.
Depending on the function representation, regression techniques are typically categorized into linear and non-linear regression algorithms, as explained in section 4.3. For example, linear channel equalization in wireless communication can be seen as a regression problem.

Regression Example 1: Indoor localization
In the context of wireless networks, linear regression is frequently used to derive an empirical log-distance model for the radio propagation characteristics as a linear mathematical relationship between the RSSI, usually in dBm, and the distance. This model can be used in RSSI-based indoor localization algorithms to estimate the distance towards each fixed node (i.e., anchor node) in the ranging phase of the algorithm [114].
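A sketch of this idea on synthetic data. The log-distance model RSSI = A − 10 n log10(d), the measurement values and the helper names below are illustrative assumptions, not the exact algorithm of [114]:

```python
import math

def fit_log_distance(samples):
    """Ordinary least squares fit of RSSI = A - 10*n*log10(d).
    samples: list of (distance_m, rssi_dbm). Returns (A, n)."""
    xs = [-10.0 * math.log10(d) for d, _ in samples]
    ys = [r for _, r in samples]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    n = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    A = my - n * mx
    return A, n

def estimate_distance(rssi, A, n):
    """Invert the fitted model to range to an anchor node."""
    return 10 ** ((A - rssi) / (10.0 * n))

# Synthetic measurements generated with A = -40 dBm at 1 m and n = 2:
data = [(d, -40.0 - 20.0 * math.log10(d)) for d in (1, 2, 4, 8)]
A, n = fit_log_distance(data)
print(round(A, 1), round(n, 1), round(estimate_distance(-52.0, A, n), 1))
```

On noise-free data the fit recovers A and n exactly; with real RSSI measurements the regression averages out the noise instead.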

Regression Example 2: Link Quality estimation
Non-linear regression techniques are extensively used for modeling the relation between the PRR (Packet Reception Rate) and the RSSI, as well as between PRR and the Link Quality Indicator (LQI), to build a mechanism to estimate the link quality based on observations (RSSI, LQI) [117].

Regression Example 3: Mobile traffic demand prediction
The authors in [118] use ML to optimize network resource allocation in mobile networks. Namely, each base station observes the traffic of a particular network slice in a mobile network. Then, a CNN model uses this information to predict the capacity required to accommodate the future traffic demands for services associated to each network slice. In this way, each slice gets optimal resources allocated.

Classification
A classification problem tries to understand and predict discrete values or categories. The term classification comes from the fact that it predicts the class membership of a particular input instance, as shown in Figure 13. Hence, the goal in classification is to assign an unknown pattern to one out of a number of classes that are considered to be known. For example, in digital communications, the process of demodulation can be viewed as a classification problem. Upon receiving the modulated transmitted signal, which has been impaired by propagation effects (i.e., the channel) and noise, the receiver has to decide which data symbol (out of a finite set) was originally transmitted.
Classification problems can be solved by supervised learning approaches, that aim to model boundaries between sets (i.e., classes) of similar behaving instances, based on known and labeled (i.e., with defined class membership) input values. There are many learning algorithms that can be used to classify data including decision trees, k-nearest neighbours, logistic regression, support vector machines, neural networks, convolutional neural networks, etc.

Classification Example 1: Cognitive MAC layer
We consider the problem of designing an adaptive MAC layer as an application example of decision trees in wireless networks. In [30] a self-adapting MAC layer is proposed. It is composed of two parts: (i) a reconfigurable MAC architecture that can switch between different MAC protocols at run time, and (ii) a trained MAC engine that selects the most suitable MAC protocol for the current network conditions and application requirements. The MAC engine is realized as a classification problem solved with a decision tree classifier. The tree is learned from two types of input variables: (1) the network conditions, reflected through RSSI statistics (i.e., mean and variance), and (2) the current traffic pattern, monitored through Inter-Packet Interval (IPI) statistics (i.e., mean and variance), together with the application requirements (i.e., reliability, energy consumption and latency). The output to be predicted is the MAC protocol to select.
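Once trained, such a decision tree compiles down to simple nested threshold rules. The thresholds, feature names and protocol choices below are invented for illustration and are not those learned in [30]:

```python
def select_mac(rssi_var, ipi_mean, latency_req_ms):
    """Hypothetical decision-tree MAC engine compiled to nested rules."""
    if rssi_var > 6.0:                 # noisy, contended channel
        return "TDMA"
    if ipi_mean < 10.0:                # bursty, high-rate traffic
        return "CSMA/CA" if latency_req_ms < 50 else "TDMA"
    return "duty-cycled CSMA"          # sparse traffic: save energy

print(select_mac(8.0, 5.0, 20))        # "TDMA"
```

This compactness is a practical advantage of decision trees on constrained devices: inference is a handful of comparisons with no floating-point math required.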

Classification Example 2: Intelligent routing in WSN
Liu et al. [81] improved multi-hop wireless routing by creating a data-driven learning-based radio link quality estimator. They investigated whether machine learning algorithms (e.g., logistic regression, neural networks) can perform better than traditional, manually-constructed, pre-defined estimators such as STLE (Short-Term Link Estimator) [121] and 4Bit (Four-Bit) [122]. Finally, they selected logistic regression as the most promising model for solving the following classification problem: predict whether the next packet will be successfully received, i.e., output class is 1, or lost, i.e., output class is 0, based on the current wireless channel conditions reflected by statistics of the PRR, RSSI, SNR and LQI.
While in [81] the authors used offline learning to do prediction, in their follow-up work [112] they went a step further: both training and prediction are performed online by the nodes themselves, using logistic regression with online learning (more specifically, the stochastic gradient descent online learning algorithm). The advantage of this approach is that the learning, and thus the model, adapts to changes in the wireless channel, which could otherwise be captured only by re-training the model offline and updating the implementation on the node.
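A minimal sketch of such an online update. The synthetic link statistics and feature choice (a bias term plus PRR) are illustrative assumptions, not those of [112]:

```python
import math

def sgd_logistic_step(w, x, y, lr=0.1):
    """One online logistic-regression update using the stochastic
    gradient of the log-likelihood: w <- w + lr * (y - p) * x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]

# Toy stream: feature vector [bias, PRR]; label 1 = next packet received.
w = [0.0, 0.0]
stream = [([1.0, 0.9], 1), ([1.0, 0.2], 0),
          ([1.0, 0.8], 1), ([1.0, 0.1], 0)] * 50
for x, y in stream:
    w = sgd_logistic_step(w, x, y)     # model updated one sample at a time

p_good = 1.0 / (1.0 + math.exp(-(w[0] + w[1] * 0.9)))
p_bad = 1.0 / (1.0 + math.exp(-(w[0] + w[1] * 0.1)))
print(p_good > 0.5 > p_bad)            # the model separates good from bad links
```

Because each update touches only one sample, the node never stores a training set, and the weights drift along with the channel statistics.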

Classification Example 3: Wireless Signal Classification
ML has been extensively used in cognitive radio applications to perform signal classification. For this purpose, typically flexible and reconfigurable SDR (software defined radio) platforms are used to sense the environment to obtain information about the wireless channel conditions and users' requirements, while intelligent algorithms build the cognitive learning engine that can make decisions on those reconfigurable parameters on SDR (e.g., carrier frequency, transmission power, modulation scheme).
In [44,47,123] SVMs are used as the machine learning algorithm to classify signals among a given set of possible modulation schemes. For instance, Huang et al. [47] identified four spectral correlation features that can be extracted from signals for distinction of different modulation types. Their trained SVM classifier was able to distinguish six modulation types with high accuracy: AM, ASK, FSK, PSK, MSK and QPSK.

Clustering
Clustering is a data mining method that can be used for problems where the goal is to group sets of similar instances into clusters, as shown in Figure 14.
As opposed to classification, clustering uses unsupervised learning, which means that the input dataset instances used for training are not labeled, i.e., it is unknown to which group they belong. The clusters are determined by inspecting the data structure and grouping objects that are similar according to some metric. Clustering algorithms are widely adopted in wireless sensor networks, where they are used to group sensor nodes into clusters to satisfy scalability and energy efficiency objectives, and finally to elect the head of each cluster. A significant number of node clustering algorithms has been proposed for WSNs [125]. However, these node clustering algorithms typically do not use data science clustering techniques directly. Instead, they exploit data clustering techniques to find data correlations or similarities between the data of neighboring nodes, which can be used to partition sensor nodes into clusters.
Clustering can also be used to solve other types of problems in wireless networks, such as anomaly detection (i.e., outlier detection, for example intrusion detection or event detection), different data pre-processing tasks, cognitive radio applications (e.g., identifying wireless systems [73]), etc. There are many learning algorithms that can be used for clustering, but the most commonly used is k-Means. Other popular clustering algorithms include hierarchical clustering methods such as single-linkage, complete-linkage and centroid-linkage; graph theory-based clustering such as highly connected subgraphs (HCS) and the cluster affinity search technique (CAST); kernel-based clustering such as support vector clustering (SVC), etc. A novel two-level clustering algorithm, namely TW-k-means, has been introduced by Chen et al. [113]. For a more exhaustive list of clustering algorithms and their explanation, we refer the reader to [126]. Several clustering approaches have shown promise for designing efficient data aggregation and communication strategies in resource-constrained low-power wireless sensor networks. Given that most of the energy on sensor nodes is consumed while the radio is turned on, i.e., while sending and receiving data [127], clustering may help to aggregate data in order to reduce transmissions and hence energy consumption.

Clustering Example 1: Summarizing sensor data
In [68] a distributed version of the k-Means clustering algorithm was proposed for clustering data sensed by sensor nodes. The clustered data is summarized and sent towards a sink node. Summarizing the data reduces the communication overhead, processing time and power consumption of the sensor nodes.

Clustering Example 2: Data aggregation in WSN
In [64] a data aggregation scheme is proposed for in-network data summarization to save energy and reduce computation in wireless sensor nodes. The proposed algorithm uses clustering to form clusters of nodes sensing similar values within a given threshold. Then, only one sensor reading per cluster is transmitted, which drastically lowers the number of transmissions in the wireless sensor network.

Clustering Example 3: Radio signal identification
The authors of [74] use clustering to separate and identify radio signal classes, alleviating the need for explicit class labels on examples of radio signals. First, dimensionality reduction is performed on the signal examples to transform them into a space suitable for signal clustering. Namely, given an appropriate dimensionality reduction, the signals are mapped into a space where signals of the same or similar type are separated by a small distance, while signals of differing types are separated by larger distances. Classifying radio signal types in such a space then becomes a problem of identifying clusters and associating a label with each cluster. The authors used the DBSCAN clustering algorithm [128].

Anomaly Detection
Anomaly detection (changes and deviation detection) is used when the goal is to identify unusual, unexpected or abnormal system behavior. This type of problem can be solved by supervised or unsupervised learning depending on the amount of knowledge present in the data (i.e., whether it is labeled or unlabeled, respectively).
Accordingly, classification and clustering algorithms can be used to solve anomaly detection problems. Figure 15 illustrates anomaly detection. A wireless example is the detection of suddenly occurring phenomena, such as the identification of suddenly disconnected networks due to interference or incorrect transmission power settings. It is also widely used for outliers detection in the pre-processing phase [129]. Other use-case examples include intrusion detection, fraud detection, event detection in sensor networks, etc.
Figure 15: Illustration of anomaly detection.
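As a minimal illustration of the unsupervised flavor, a simple z-score detector that flags measurements far from the mean (the threshold and RSSI values are invented for illustration):

```python
import math

def zscore_anomalies(values, threshold=3.0):
    """Flag points lying more than `threshold` standard deviations
    from the mean as anomalies."""
    mu = sum(values) / len(values)
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [v for v in values if abs(v - mu) > threshold * sd]

rssi = [-70, -71, -69, -70, -72, -70, -30, -71]   # -30 dBm is suspicious
print(zscore_anomalies(rssi, threshold=2.0))       # [-30]
```

Classification- or clustering-based detectors generalize this idea: instead of a single global mean, they model what "normal" looks like and flag whatever falls outside that model.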

Anomaly Detection Example 1: WSN attack detection
WSNs have been the target of many types of DoS attacks. The goal of a DoS attack in WSNs is to transmit as many packets as possible whenever the medium is detected to be idle, which prevents legitimate sensor nodes from transmitting their own packets. To combat DoS attacks, a secure MAC protocol based on neural networks has been proposed in [31]. The NN model is trained to detect an attack by monitoring variations of the following parameters: the collision rate R_c, the average waiting time of a packet in the MAC buffer T_w, and the arrival rate of RTS packets R_RTS. An anomaly, i.e., an attack, is identified when the monitored traffic variations exceed a preset threshold, after which the WSN node is switched off temporarily. As a result, flooding the network with untrustworthy data is prevented by blocking only the affected sensor nodes.

Anomaly Detection Example 2: System failure and intrusion detection
In [83] online learning techniques have been used to incrementally train a neural network for in-node anomaly detection in wireless sensor networks. More specifically, the Extreme Learning Machine algorithm [130] has been used to implement classifiers that are trained online on resource-constrained sensor nodes for detecting anomalies such as system failures, intrusions, or unanticipated behavior of the environment.

Anomaly Detection Example 3: Detecting wireless spectrum anomalies
In [131] wireless spectrum anomaly detection has been studied. The authors use Power Spectral Density (PSD) data to detect and localize anomalies (e.g. unwanted signals in the licensed band or the absence of an expected signal) in the wireless spectrum using a combination of Adversarial autoencoders (AAEs), CNN and LSTM.

Machine Learning for Performance Improvements in Wireless Networks
Obviously, machine learning is increasingly used in wireless networks [27]. After carefully looking at the literature, we identified two distinct categories or objectives where machine learning empowers wireless networks with the ability to learn and infer from data and extract patterns: • Performance improvements of the wireless networks based on performance indicators and environmental insights (e.g. about the radio medium) as input, acquired from the devices. These approaches exploit ML to generate patterns or make predictions, which are used to modify operating parameters at the PHY, MAC and network layer.
• Information processing of data generated by wireless devices at the application layer. This category covers various applications such as: IoT environmental monitoring applications, activity recognition, localization, precision agriculture, etc.
This section presents tasks related to each of the aforementioned objectives achieved via ML and discusses existing work in the domain. First, the works are broadly summarized in tabular form in Table 2, followed by a detailed discussion of the most important works in each domain.
The focus of this paper is on the first category related to ML for performance improvement of wireless networks, therefore, a comprehensive overview of the existing work addressing problems pertaining to communication performance by making use of ML techniques is presented in the forthcoming subsection. These works provide a promising direction towards solving problems caused by the proliferation of wireless devices, networks and technologies in the near future, including: problems with interference (co-channel interference, inter-cell interference, cross technology interference, multi user interference, etc.), non-adaptive modulation scheme, static non-application cognizant MAC, etc.

Machine Learning Research for Performance improvement
Data generated during monitoring of the wireless networking infrastructure (e.g. throughput, end-to-end delay, jitter, packet loss, etc.) and by the wireless sensor devices (e.g. spectrum monitoring), when analyzed by ML techniques, has the potential to optimize wireless network configurations, thereby improving the end-users' QoE. Various works have applied ML techniques for gaining insights that can help improve the network performance. Depending on the type of data used as input for the ML algorithms, we first categorize the researched literature into three types, summarized in Table 2:
• Radio spectrum analysis
• Medium access control (MAC) analysis

• Network prediction
Furthermore, within each of the above categories, we identified several classes of research approaches, illustrated in Figure 16. In what follows, the work in these directions is reviewed.

Radio spectrum analysis
Radio spectrum analysis refers to investigating wireless data sensed by the wireless devices to infer the radio spectrum usage. Typically, the goal is to detect unused spectrum portions in order to share them with other coexisting users within the network without causing excessive mutual interference. Namely, as wireless devices become more pervasive throughout society, the available radio spectrum, which is a scarce resource, will contain more non-cooperative signals than ever before. Therefore, collecting information about the signals within the spectrum of interest is becoming ever more important and complex. This has motivated the use of ML for analyzing the signals occupying the radio spectrum.
Perhaps the most prevalent task related to radio spectrum analysis solved using ML is automatic modulation recognition (AMR). Other related radio spectrum analysis tasks which employ ML techniques include technology recognition (TR) and signal identification (SI). Typically, the goal is to detect the presence of signals that may cause interference, so as to decide on an interference mitigation strategy. Therefore, we introduce these approaches as wireless interference identification (WII) tasks.
Automatic modulation recognition. AMR plays a key role in various civilian and military applications, where friendly signals must be securely transmitted and received, whereas hostile signals must be located, identified and jammed. In short, the goal of this task is to recognize the type of modulation scheme an emitter uses to modulate its transmitted signal, based on raw samples of the detected signal at the receiver side. This information provides insight into the types of communication systems and emitters present in the radio environment.
Traditional AMR algorithms are classified into likelihood-based (LB) approaches [230], [231], [232] and feature-based (FB) approaches [233], [234]. LB approaches are based on detection theory (i.e. hypothesis testing) [235]. They can offer good performance and are considered optimal classifiers; however, they suffer from high computational complexity. Therefore, FB approaches were developed as suboptimal classifiers suitable for practical use. Conventional FB approaches rely heavily on expert knowledge: they may perform well for specialized solutions, but they generalize poorly and are time-consuming to design. Namely, in the preprocessing phase of designing the AMR algorithm, traditional FB approaches extract complex hand-engineered features (e.g. signal parameters) computed from the raw signal and then employ an algorithm to determine the modulation scheme [236].
To remedy these problems, ML-based classifiers that learn from preprocessed received data have been adopted and have shown great advantages. ML algorithms usually generalize better to new, unseen datasets, making them preferable over purely FB approaches. For instance, the authors of [133], [134] and [143] used the support vector machine (SVM) algorithm to classify modulation schemes. While strictly FB approaches may become obsolete with the adoption of ML classifiers for AMR, hand-engineered features can still provide useful input to ML techniques. For instance, in [140] and [156], the authors engineered features from the raw received signal using expert experience and fed the designed features as input to a neural network classifier.
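To make the feature-plus-classifier pipeline concrete, the sketch below trains an SVM on two classic moment-based features extracted from synthetic baseband symbols. It is a minimal toy, not the exact features or data of [133], [134], [140] or [156]: the BPSK/QPSK generator, the noise level and the two features (the magnitude of the second-order moment and the variance of the absolute instantaneous phase) are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def synth_symbols(mod, n=256):
    """Toy noisy baseband symbols: BPSK on the real axis, QPSK on both axes."""
    if mod == "bpsk":
        s = rng.choice([-1.0, 1.0], n) + 0j
    else:  # qpsk
        s = (rng.choice([-1, 1], n) + 1j * rng.choice([-1, 1], n)) / np.sqrt(2)
    return s + 0.1 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

def features(x):
    # Two simplified hand-engineered AMR features:
    # |E[x^2]| (close to 1 for BPSK, close to 0 for QPSK) and
    # the variance of the absolute instantaneous phase.
    return [np.abs(np.mean(x ** 2)), np.var(np.abs(np.angle(x)))]

X = np.array([features(synth_symbols(m)) for m in ["bpsk", "qpsk"] for _ in range(100)])
y = np.array([m for m in ["bpsk", "qpsk"] for _ in range(100)])

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0, stratify=y)
clf = SVC(kernel="rbf").fit(Xtr, ytr)
acc = clf.score(Xte, yte)
```

On this easy synthetic task the two features separate the classes almost perfectly; the point is the workflow (expert features in, classifier decision out), not the numbers.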
Although ML methods have the advantage of better generalization, the performance of traditional classifiers still hinges on the quality of the hand-engineered features they are fed.
Recently, the wireless communication community experienced a breakthrough by adopting deep learning techniques in the wireless domain. In [142], deep convolutional neural networks (CNNs) are applied directly to complex time-domain signal data to classify modulation formats. The authors demonstrated that CNNs outperform expert-engineered features in combination with traditional ML classifiers, such as SVMs, k-Nearest Neighbors (k-NN), Decision Trees (DT), Neural Networks (NN) and Naive Bayes (NB). An alternative method is to learn the modulation format of the received signal from different representations of the raw signal. In our work in [237], CNNs are employed to learn the modulation of various signals using the in-phase and quadrature (IQ) data representation of the raw received signal and two additional data representations, without affecting the simplicity of the input. We showed that the amplitude/phase representation outperformed the other two, demonstrating the importance of the choice of wireless data representation used as input to the deep learning technique in determining the optimal mapping from the raw signal to the modulation scheme. Other follow-up works include [157], [158], [159], [160], [161], [165], [166], [168], [169], [170], [171], [172], etc.
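To illustrate how raw IQ samples enter such a CNN, the following numpy-only sketch performs one forward pass of a toy 1-D convolutional layer over a 2 x 128 IQ window, followed by global max pooling and a softmax output. It is a data-flow illustration with random untrained weights, not the architecture of [142]; the filter sizes and the number of classes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# A window of 128 raw IQ samples arranged as a 2 x 128 array
# (row 0: in-phase, row 1: quadrature), the input format used for such CNNs.
iq = rng.standard_normal((2, 128))

def conv1d(x, w, b):
    """Valid 1-D convolution over the time axis; w has shape (out_ch, in_ch, k)."""
    out_ch, _, k = w.shape
    t = x.shape[1] - k + 1
    y = np.empty((out_ch, t))
    for o in range(out_ch):
        for i in range(t):
            y[o, i] = np.sum(w[o] * x[:, i:i + k]) + b[o]
    return y

def relu(z):
    return np.maximum(z, 0.0)

# Random (untrained) filters, chosen here only to show the data flow.
w1, b1 = 0.1 * rng.standard_normal((16, 2, 8)), np.zeros(16)
feat = relu(conv1d(iq, w1, b1))   # (16, 121) feature maps over time
pooled = feat.max(axis=1)         # global max pooling -> (16,)

# A dense softmax layer maps the pooled features to per-class scores
# (11 modulation classes here is an illustrative choice).
w2 = 0.1 * rng.standard_normal((11, 16))
logits = w2 @ pooled
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

In practice the filters are of course learned end-to-end with backpropagation in a DL framework; the sketch only shows how the 2 x N IQ tensor is consumed by the convolutional front end.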
For a more comprehensive overview of the state-of-the-art work on AMR we refer the reader to Tables 4 and 5. Table 3 describes the structure used for Tables 4, 5, 6 and 7.
Wireless interference identification. WII refers to identifying the type of wireless emitters (signals or technologies) existing in the local radio environment, which is immensely helpful information for devising effective interference avoidance and coexistence mechanisms. For instance, for technologies operating in the ISM bands to coexist efficiently, it is crucial to know what other types of emitters are present in the environment (e.g. Wi-Fi, ZigBee, Bluetooth, etc.). Similar to AMR, both FB and ML approaches have been investigated for this task. For instance, the authors of [238] exploit the amplitude/phase difference representation to train a CNN model to discriminate several radar signals from Wi-Fi and LTE transmissions. Their method was able to successfully recognize radar signals even in the presence of several interfering signals (i.e. LTE and Wi-Fi) at the same time, which is a key step towards reliable spectrum monitoring.
In [160], the authors use the average magnitude spectrum representation of the raw observed signal, on a distributed architecture with low-cost spectrum sensors, together with an LSTM deep learning classifier, to discriminate between different wireless emitters such as TETRA, DVB, RADAR, LTE, GSM and WFM. Results showed that their method outperforms conventional ML approaches and a CNN-based architecture for the given task.
In [176] the authors use the time-domain quadrature (i.e. IQ) representation of the received signal and amplitude/phase vectors as input for CNN classifiers to learn the type of interfering technology present in the ISM spectrum. The results demonstrate that the proposed scheme is well suited for discriminating between Wi-Fi, ZigBee and Bluetooth signals. In [237], we introduce a methodology for end-to-end learning from various signal representations, also investigate the frequency domain (FFT) representation of the ISM signals, and demonstrate that the CNN classifier using FFT data as input outperforms the CNN models used by the authors in [176]. Similarly, the authors of [175] developed a CNN model to facilitate the detection and identification of frequency-domain signatures of 802.x standard-compliant technologies. Compared to [176], the authors in [175] use spectrum scans across the entire ISM region (80 MHz) as input to a CNN model.
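The representations discussed above can be derived from the same sensed window with a few lines of numpy. The sketch below (a toy on random samples, not the preprocessing pipeline of [175], [176] or [237]) builds the raw IQ, amplitude/phase and FFT-magnitude views of one complex window; any of the three can then serve as classifier input.

```python
import numpy as np

rng = np.random.default_rng(2)
iq = rng.standard_normal(128) + 1j * rng.standard_normal(128)  # one sensing window

# Three candidate input representations of the same window:
iq_2xn    = np.stack([iq.real, iq.imag])             # raw I/Q, shape (2, 128)
amp_phase = np.stack([np.abs(iq), np.angle(iq)])     # amplitude/phase, (2, 128)
spectrum  = np.abs(np.fft.fftshift(np.fft.fft(iq)))  # FFT magnitude, (128,)

# Per-representation normalization is usually applied before training,
# e.g. the spectrum expressed in dB relative to its peak:
spectrum_db = 20 * np.log10(spectrum / spectrum.max() + 1e-12)
```

Which representation works best is task-dependent, as the comparisons in [176] and [237] illustrate; generating all of them from the same window makes such comparisons cheap.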
In [181] the authors used a CNN model to perform recognition of LTE and Wi-Fi transmissions based on two wireless signal representations, namely, the IQ and the frequency domain representation. The motivation behind this approach was to obtain accurate information about the technologies present in the local wireless environment so as to select an appropriate mLTE-U configuration that will allow fair coexistence with Wi-Fi in the unlicensed spectrum band.
Other examples include [174], [179], [131], [180], etc. In some applications like cognitive radio (CR) and spectrum sensing, the goal is however to identify the presence or absence of a signal. Namely, spectrum sensing is a process by which unlicensed users, also known as secondary users (SUs), acquire information about the status of the radio spectrum allocated to a licensed user, also known as primary user (PU), for the purpose of accessing unused licensed bands in an opportunistic manner without causing intolerable interference to the transmissions of the licensed user [239].
For instance, in [182] four ML techniques are examined (k-NN, SVM, DT and logistic regression (LR)) to predict the presence or absence of a PU in CR applications. The authors in [185] go a step further and design a spectrum sensing framework based on CNNs that enables an SU to achieve higher sensing accuracy than conventional approaches. For more examples, we refer the reader to [183], [184] and [177].
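A comparison of these classifier families on a sensing task can be sketched as follows. This toy reproduces only the methodology (several classifiers on energy-detection style features), not the data or features of [182]: the synthetic PU tone, the SNR and the two features (window energy and peak periodogram bin) are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

def sensing_sample(pu_present, n=64, snr=1.0):
    """Energy-detection style features of one sensing window (toy model):
    total energy and the largest periodogram bin, with/without a PU tone."""
    noise = rng.standard_normal(n)
    tone = np.sqrt(snr) * np.sin(2 * np.pi * 0.2 * np.arange(n))
    x = noise + (tone if pu_present else 0)
    psd = np.abs(np.fft.rfft(x)) ** 2 / n
    return [np.mean(x ** 2), psd.max()]

X = np.array([sensing_sample(p) for p in [0, 1] for _ in range(200)])
y = np.repeat([0, 1], 200)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0, stratify=y)

models = {"k-NN": KNeighborsClassifier(), "SVM": SVC(),
          "DT": DecisionTreeClassifier(random_state=0),
          "LR": LogisticRegression()}
scores = {name: m.fit(Xtr, ytr).score(Xte, yte) for name, m in models.items()}
```

All four classifiers do well on this easy toy; on real sensing data the ranking depends on SNR, feature quality and sample size, which is precisely what studies like [182] quantify.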
The literature related to machine learning and deep learning for WII approaches is contained in Tables 4 and 5 ordered by the publishing year of the work.

Research Problem: The problem addressed in the work
Performance improvement: Performance improvement achieved in the work
Type of wireless network: The type of wireless networks considered in the work and/or for which the problem is solved
Data Type: Type of data used in the work, e.g. synthetic or real
Input Data: The data used as input for the developed machine learning algorithms
Learning Approach: Type of learning approach, e.g. traditional machine learning (ML) or deep learning (DL)
Learning Algorithm: List of learning algorithms used
Year: The year when the work was published
Reference: The reference to the analyzed work

Table 3: Description of the structure for Tables 4, 5, 6 and 7

Medium access control (MAC) analysis
Sharing the limited spectrum resources is a main concern in wireless networks [245]. One of the key functionalities of the MAC layer in wireless networks is to negotiate access to the wireless medium so as to share the limited resources. As opposed to centralized designs, where entities like base stations control and distribute resources, nodes in Wireless Ad hoc Networks (WANETs) have to coordinate resources among themselves in an ad hoc manner.
For this purpose, several MAC protocols have been proposed in the literature. Traditional MAC protocols designed for WANETs include Time Division Multiple Access (TDMA) [246], [247], Carrier Sense Multiple Access/Collision Avoidance (CSMA/CA) [248], [249], Code Division Multiple Access (CDMA) [250], [251] and hybrid approaches [252], [253]. However, given changing network and environment conditions, designing a MAC protocol that fits all possible conditions and application requirements is a challenge, especially when these conditions are not available or known a priori. This subsection investigates the advances made at the MAC layer to tackle the problem of efficient spectrum sharing with the help of machine learning. We identify three categories of MAC analysis: i) MAC identification, ii) interference recognition and iii) spectrum prediction. The reviewed MAC analysis tasks are listed in Table 6.
MAC identification. These approaches are typically employed in cognitive radio (CR) applications to foster communication and coexistence between protocol-distinct technologies. CRs rely on information gathered during spectral sensing to infer the environment conditions, presence of other technologies and spectrum holes. Spectrum holes are frequency bands that have been allocated to licensed network users but are not used at a particular time, which can be utilized by a CR user. Usually, spectrum sensing can determine the frequency range of a spectrum hole, while the timing information, which is also a channel access parameter, is unknown.
MAC protocol identification approaches may help CR users determine the timing information of a spectrum hole and accordingly tailor their packet transmission duration, which offers potential network performance improvements. For this purpose, several MAC layer characteristics can be exploited.
For example, in [187] the TDMA and slotted ALOHA MAC protocols are identified based on two features, the power mean and the power variance of the received signal, combined with an SVM classifier. The authors in [189] utilized power and time features to distinguish between four MAC protocols, namely TDMA, CSMA/CA, pure ALOHA and slotted ALOHA, using an SVM classifier. Similarly, in [190] the authors captured MAC layer temporal features of 802.11 b/g/n homogeneous and heterogeneous networks and employed k-NN and NB classifiers to distinguish between the three.
Interference recognition. Similar to the approaches for recognizing interference based on radio spectrum analysis, the goal here is to identify the type of radio interference degrading the network performance. However, compared to the previously introduced work, the works in the MAC-level analysis category focus on identifying distinct features of interfered channels and packets to detect and quantify interference, in order to assess the viability of opportunistic transmissions in interfered channels and select an appropriate strategy for coexisting with the present interference. This is realized based on information available on low-cost off-the-shelf devices, such as 802.15.4 and Wi-Fi radios, which is used as input for ML classifiers.
For instance, in [194] the authors investigated two possibilities for detecting interference: i) the energy variations during packet reception, captured by sampling the radio's RSSI register, and ii) monitoring the Link Quality Indicator (LQI) of received corrupted packets. This information is combined with a DT classifier, considered a computationally and memory-efficient candidate for implementation on 802.15.4 devices. Another work on interference identification in WSNs is [192]. The authors were able to accurately distinguish Wi-Fi, Bluetooth and microwave oven interference based on features of corrupted packets (i.e. mean normalized RSSI, LQI, RSSI range, error burst spanning and mean error burst spacing) used as input to SVM and DT classifiers.
In [191] the authors were able to detect non-Wi-Fi interference on commodity Wi-Fi hardware. They collected energy samples across the spectrum from the Wi-Fi card to extract a diverse set of features that capture the spectral and temporal properties of wireless signals (e.g. central frequency, bandwidth, spectral signature, duty cycle, pulse signature, inter-pulse timing signature, etc.). They used these features to investigate the performance of two classifiers, SVM and DT. The idea is to embed these functionalities in Wi-Fi APs and clients, which can then implement appropriate mitigation mechanisms that quickly react to the presence of significant non-Wi-Fi interference.
The authors of [193] propose an energy-efficient rendezvous mechanism for WSNs that is resilient to interference, based on ML. Due to the energy constraints of sensor nodes, it is of great importance to save energy and extend the network lifetime in WSNs. Traditional rendezvous mechanisms such as Low Power Listening (LPL) and Low Power Probe (LPP) rely on low duty cycling (scheduling the radio of a sensor node between ON and OFF, compared to always-ON methods) depending on the presence of a signal (e.g. signal strength). However, both suffer performance degradation in noisy environments with signal interference, incorrectly regarding a non-ZigBee interfering signal as a signal of interest and improperly keeping the radio ON, which increases the probability of false wake-ups. To remedy this, the approach proposed in [193] is capable of detecting potential ZigBee transmissions and accordingly deciding whether to turn the radio ON. For this purpose, the authors extracted signal features from time-domain RSSI samples (i.e. on-air time, minimum packet interval, peak-to-average power ratio and under-noise-floor) and used them as input to a DT classifier to effectively distinguish ZigBee signals from other interfering ones.
Spectrum prediction. In order to share the available spectrum in a more efficient way, there are various attempts in predicting the wireless medium availability to minimize transmission collisions and, therefore, increase the overall performance of the network.
For instance, an intelligent wireless device may monitor the medium and based on MAC-level measurements predict if the medium is likely to be busy or idle. In another variation of this approach, a device may predict the quality of the channels in terms of properties such as idle probabilities or idle durations and then select the channel with the highest quality for transmission.
For instance, the authors in [195] use NNs to predict whether a slot will be free based on its history, to minimize collisions and optimize the usage of the scarce spectrum. In their follow-up work [200], they exploit CNNs to predict the spectrum usage of neighboring networks. Their approach is aimed at devices with limited capabilities for retraining.
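The core idea of history-based slot prediction can be sketched with a small neural network on sliding windows of a binary occupancy sequence. This toy is not the model of [195]: the periodic occupancy process, the 5% noise level, the history length and the network size are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(5)

# Toy occupancy sequence: a neighboring network uses every 5th slot,
# observed with 5% random flips (1 = busy, 0 = idle).
seq = ((np.arange(3000) % 5) == 0).astype(int)
flip = rng.random(3000) < 0.05
seq = np.where(flip, 1 - seq, seq)

H = 10  # history length used to predict the next slot
X = np.array([seq[i:i + H] for i in range(len(seq) - H)])
y = seq[H:]

split = 2000
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X[:split], y[:split])
acc = clf.score(X[split:], y[split:])
```

The network only has to infer the phase of the pattern from the noisy history, so it comfortably beats the majority-class baseline; real spectrum traces are far less regular, which is where deeper models such as the CNNs of [200] come in.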
In [196], a Deep Q-Network (DQN) is proposed to predict and select a free channel for WSNs. In [199], the authors design an NN predictor to forecast the future activity of the primary user (PU) based on past channel occupancy sensing results, with the goal of improving secondary users' (SUs) throughput while alleviating collisions with the PU in full-duplex (FD) cognitive networks.
The authors of [198] consider the problem of sharing time slots among multiple time-slotted networks so as to maximize the sum throughput of all the networks. The authors utilize a ResNet and compare its performance to a plain DNN. MAC analysis approaches are listed in Table 6.

Network prediction
Network prediction refers to tasks related to inferring the wireless network performance or network traffic, given historical measurements or related data. Table 7 gives an overview of the works on machine learning for network level prediction tasks, i.e. i) Network performance prediction and ii) Network traffic prediction.
Network performance prediction. ML approaches are used extensively to create prediction models for many wireless networking applications. Typically, the goal is to forecast the performance or the optimal device parameters/settings and use this knowledge to adapt the communication parameters to changing environment conditions and application QoS requirements, so as to optimize the overall network performance.
For instance, in [207] the authors aim to select the optimal MAC parameter settings in 6LoWPAN networks to reduce excessive collisions, packet losses and latency. First, the MAC layer parameters are used as input to an NN to predict the throughput and latency, followed by an optimization algorithm to achieve high throughput with minimum delay. The authors of [203] employ NNs to predict users' QoE in cellular networks based on the average user throughput, the number of active users in a cell, the average data volume per user and channel quality indicators, demonstrating high prediction accuracy.
Given the dynamic nature of wireless communications, a traditional one-MAC-fits-all approach cannot meet the challenges posed by significant dynamics in operating conditions, network traffic and application requirements. A MAC protocol may deteriorate significantly in performance as the network load becomes heavier, while it may waste network resources when the load turns lighter. To remedy this, [30] and [204] study an adaptive MAC layer, with multiple MACs available, that is able to select the MAC protocol most suitable for the current conditions and application requirements. In [30] a MAC selection engine for WSNs based on a DT model decides which MAC protocol is best given the application QoS requirements, the current traffic pattern and the ambient interference levels as input. The candidate protocols are TDMA, BoX-MAC and RI-MAC. The authors of [204] compare the accuracy of NB, Random Forest (RF), DT and SMO [254] classifiers in deciding between the DCF and TDMA protocols to best respond to dynamic network circumstances.
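A DT-based MAC selection engine of this flavor can be sketched in a few lines. The ground-truth policy below (contention-based MAC at light load, TDMA under heavy load or interference) is an assumption for illustration only, as are the feature set and protocol names; the sketch shows the mechanism, not the engine of [30] or [204].

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(6)

def best_mac(load, interference):
    """Toy ground-truth policy (an illustrative assumption): TDMA when
    load or interference makes collisions costly, CSMA otherwise."""
    return "tdma" if load > 0.5 or interference > 0.7 else "csma"

# Labeled operating points (normalized load and interference levels).
load = rng.random(500)
interf = rng.random(500)
X = np.column_stack([load, interf])
y = np.array([best_mac(l, i) for l, i in zip(load, interf)])

engine = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
choice = engine.predict([[0.2, 0.1]])[0]  # light load, low interference
```

In a deployed system the labels would come from measured per-protocol performance rather than a hand-written rule, but the resulting tree is equally cheap to evaluate at runtime on a constrained node.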
In [120] an NN model is employed that learns how environmental measurements and the status of the network affect the performance experienced on different channels, and uses this knowledge to dynamically select the channel which is expected to yield the best performance for the user.
As an integral part of reliable communication in WSNs, accurate link estimation is essential for routing protocols, yet it is a challenging task due to the dynamic nature of wireless channels. To address this problem, the authors in [81] use ML (i.e. LR, NB and NN) to predict the link quality based on physical layer parameters of the last received packets and the PRR, demonstrating high accuracy and improved routing. The same authors go a step further in [112] and employ online machine learning to adapt their link quality prediction mechanism in real time to the notoriously dynamic wireless environment.
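The online flavor of link quality prediction can be sketched with a linear model updated one packet at a time. The drifting-channel simulator, the features and their scaling below are illustrative assumptions, not the setup of [81] or [112]; the sketch shows prequential (predict-then-learn) evaluation, a standard way to assess online learners.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(7)

def link_sample(t):
    """Toy packet stream: channel quality drifts slowly over time and the
    link is labeled 'good' when the packet SNR exceeds 10 dB. Features are
    roughly centered and scaled, which helps online SGD."""
    drift = 3 * np.sin(t / 200)
    snr = 10 + drift + 2 * rng.standard_normal()
    rssi = -80 + snr + rng.standard_normal()
    good = int(snr > 10)
    return np.array([(rssi + 70) / 5, (snr - 10) / 3]), good

clf = SGDClassifier(random_state=0)  # linear model trained one packet at a time
correct = total = 0
for t in range(2000):
    x, y = link_sample(t)
    if t > 100:  # prequential evaluation: first predict, then learn
        correct += int(clf.predict(x.reshape(1, -1))[0] == y)
        total += 1
    clf.partial_fit(x.reshape(1, -1), [y], classes=[0, 1])
acc = correct / total
```

Because the model keeps learning from every new packet, it tracks the slow drift instead of going stale, which is exactly the motivation for moving from [81] to the online scheme of [112].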
The authors in [205] develop an ML engine that predicts the packet loss rate in a WSN, as an integral part of an adaptive MAC layer.
Network traffic prediction. Accurate prediction of user traffic in cellular networks is crucial to evaluate and improve system performance. For instance, a base station sleeping mechanism may be adapted by utilizing knowledge about future traffic demands, which in [256] are predicted with an NN model. This knowledge helped reduce the overall power consumption, an increasingly important topic with the growth of the cellular industry.
In another example, consider the need for efficient management of expensive mobile network resources, such as spectrum, where the ability to predict future network use can help with network resource management and planning. A new paradigm for future 5G networks is network slicing, which enables the network infrastructure to be divided into slices devoted to different services and tailored to their needs [270]. With this paradigm, it is essential to allocate the needed resources to each slice, which requires the ability to forecast their respective demands. The authors in [118] employed a CNN model that, based on the traffic observed at base stations of a particular network slice, predicts the capacity required to accommodate the future traffic demands of the services associated with it.
In [260] LSTMs are used to model the temporal correlations of the mobile traffic distribution and perform forecasting together with stacked Auto Encoders for spatial feature extraction. Experiments with a real-world dataset demonstrate superior performance over SVM and the Autoregressive Integrated Moving Average (ARIMA) model.
Deep learning was also employed in [261], [266] and [264] where the authors utilize CNNs and LSTMs to perform mobile traffic forecasting. By effectively extracting spatio-temporal features, their proposals gain significantly higher accuracy than traditional approaches, such as ARIMA.
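The sliding-window forecasting setup shared by these works can be sketched with a small neural regressor standing in for the LSTM/CNN models (an assumption for the sake of a compact, dependency-light example). The daily-periodic synthetic traffic series, the one-day history window and the network size are likewise illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(8)

# Toy mobile-traffic series: daily periodicity (24 steps/day) plus noise.
t = np.arange(24 * 60)  # 60 days of hourly samples
traffic = 5 + 3 * np.sin(2 * np.pi * t / 24) + 0.3 * rng.standard_normal(t.size)

H = 24  # one day of history per prediction
X = np.array([traffic[i:i + H] for i in range(t.size - H)])
y = traffic[H:]

split = 24 * 50  # train on the first 50 days, test on the rest
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
model.fit(X[:split], y[:split])

pred = model.predict(X[split:])
mae = np.mean(np.abs(pred - y[split:]))
# A "same hour yesterday" baseline, analogous to the naive/ARIMA baselines
# that deep forecasters are compared against in the cited works:
naive = np.mean(np.abs(X[split:, 0] - y[split:]))
```

On real mobile traffic the gains of LSTM/CNN models come from capturing spatio-temporal structure across many base stations, which this single-series toy deliberately omits.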

Machine Learning applications for Information processing
Wireless sensor nodes and mobile applications installed on various mobile devices record application level data frequently, making them act as sensor hubs responsible for data acquisition and preprocessing and subsequently storing the data in the "cloud" for further "offline" data storage and real-time computing using big data technologies (e.g. Storm [271], Spark [272], Kafka [273], Hadoop [274], etc). Example applications are i) IoT infrastructure monitoring such as smart farming [5,6], smart mobility [4,208], smart city [7,209,210] and smart grid [211], ii) device fingerprinting, iii) localization and iv) activity recognition.
In the works of [219], [220], [221], [222], [223] and [224] ML or deep learning is employed to localize users in indoor or outdoor environments, based on different signals received from wireless devices or about the wireless channels such as amplitude and phase channel state information (CSI), RSSI, etc.
The goal in the works [225], [226], [227], [228], [229] is to identify the activity of a person based on various wireless signal properties in combination with a machine learning technique. For instance, in [226] the authors demonstrate accurate human pose estimation through walls and occlusions based on properties of Wi-Fi wireless signals and how they reflect off the human body, used as input to a CNN classifier. In [227] the authors detect intruders based on how their movement patterns affect Wi-Fi signals in combination with a Gaussian Mixture Model (GMM).
For a more thorough overview of the applications and works on wireless information processing, the reader is referred to [275].

Open Challenges and Future Directions
Previous sections presented the significant amount of research work focused on exploiting ML to address the spectrum scarcity problem in future wireless networks. However, despite the growing state-of-the-art with more and more different ML algorithms being explored and applied at various layers of the network protocol stack, there are still open challenges that need to be addressed in order to employ these paradigms in real radio environments to enable a fully intelligent wireless network in the near future.
This section discusses a set of open challenges and explores future research directions which are expected to accelerate the adoption of ML in future wireless network deployments.

Standard Datasets
To allow the comparison between different ML approaches, it is essential to have common benchmarks and standard datasets available, similar to the open MNIST dataset that is often used in computer vision. In order to learn effectively, ML algorithms require a considerable amount of data. Furthermore, standardized data generation/collection procedures should preferably be created to allow reproducing the data. Research attempts in this direction include [276,277], showing that synthetic generation of RF signals is possible; however, some wireless problems may require the data to exhibit the specifics of a real system (e.g. RF device fingerprinting). Therefore, standardizing these datasets and benchmarks remains an open challenge. Significant research efforts need to be put into building large-scale datasets and sharing them with the wireless research community.

Standard Problems
Future research initiatives should identify a set of common problems in wireless networks to facilitate researchers in benchmarking and comparing their supervised/unsupervised learning algorithms. These problems should be supported with standard datasets. For instance, in computer vision the MNIST and ImageNet datasets are typically used for benchmarking image recognition algorithms. Examples of standard problems in wireless networks may be: wireless signal identification, beamforming, spectrum management, wireless network traffic demand prediction, etc. Special research attention must be devoted to designing these problems.

Standard Data representation
DL is increasingly used in wireless networks; however, it is still unclear what the optimal data representation is. For instance, an I/Q sample may be represented as a single complex number, a tuple of real numbers, or via the amplitude and phase values of its polar coordinates. Arguably, there is no one-size-fits-all data representation for every learning problem [237]. The optimal data representation may depend, among other factors, on the DL architecture, the learning objective and the choice of the loss function [149].

Standard evaluation metrics
After identifying standard datasets and problems, future research initiatives should identify a set of standard metrics for evaluating and comparing different ML models. For instance, a set of standard metrics may be determined per standardized problem. Examples of standardized metrics might be: confusion matrix, F-score, precision, recall, accuracy, mean squared error, etc. In addition, the evaluation part may take into account other evaluation metrics such as: model complexity, memory overhead, training time, prediction time, required data size, etc.
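The classification metrics listed above are readily computed with standard tooling. The sketch below evaluates the (hypothetical) predictions of a 3-class signal identifier; the label assignment to Wi-Fi/ZigBee/Bluetooth and the prediction vector are invented for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Hypothetical predictions of a 3-class signal identifier
# (labels: 0 = Wi-Fi, 1 = ZigBee, 2 = Bluetooth).
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred)          # per-class error structure
acc = accuracy_score(y_true, y_pred)           # 8 of 10 correct -> 0.8
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")           # class-balanced summary
```

Complementary system-level metrics (model size, memory overhead, training and inference time) are not covered by such libraries and would need to be measured per platform.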

Implementation of Machine Learning models in practical wireless platforms/systems
There is no doubt that ML will play a prominent role in the evolution of future wireless networks. However, although ML is powerful, it may be a burden when running on a single device. Furthermore, DL, which has shown great success, requires a significant amount of data to perform well, which poses extra challenges for the wireless network. It is therefore of paramount importance to advance our understanding of how to simply and efficiently integrate ML/DL breakthroughs within constrained computing platforms. A second question that requires particular attention is what requirements the network needs to meet to support the collection and transfer of large volumes of data.

Constrained wireless devices
Wireless nodes, such as those seen in the IoT (e.g. phones, watches and embedded sensors), are typically inexpensive devices with scarce resources: limited storage, energy, computational capability and communication bandwidth. These device constraints bring several challenges when it comes to implementing and running complex ML models. Certainly, ML models with a large number of neurons, layers and parameters will require additional hardware and energy consumption, not just for training but also for inference.
Reducing the complexity of Machine Learning models. ML/DL is well on its way to becoming mainstream on constrained devices [278]. Promising early results are appearing across many domains, including hardware [279], systems and learning algorithms. For example, in [280] binary deep architectures are proposed that are composed solely of 1-bit weights instead of 32-bit or 16-bit parameters, allowing for smaller models and less expensive computations. However, their ability to generalize and perform well on real-world problems is still an open question.
Distributed Machine Learning implementation. Another approach to address this challenge, may be to distribute the ML computation load across multiple nodes. Some questions that need to be addressed here are: "Which part of the learning algorithms can be decomposed and distributed?", "How are the input data and output calculation results communicated among the devices?", "Which device is responsible for the assembly for the final prediction results?", etc.

Infrastructure for data collection and transfer
The tremendously increasing number of wireless devices and their traffic demands require a scalable networking architecture to support large-scale wireless transmissions. The transmission of large volumes of data is a challenging task for the following reasons: i) there are no standards/protocols that can efficiently deliver over 100 terabits of data per second, and ii) it is extremely difficult to monitor the network in real time due to the huge traffic density over short time intervals.
A promising direction in addressing this challenge is the concept of fog computing/analytics [19]. The idea of fog computing is to bring computing and analytics closer to the end-devices, which may improve the overall network performance by reducing or completely avoiding the transmission of large amounts of raw data to the cloud. Still, special efforts need to be devoted to employ these concepts in practical systems. Finally, cloud computing technologies (using virtualized resources, parallel processing and scalable data storage) may help reduce the computational cost when it comes to processing and analysis of data.

Machine Learning model accuracy in practical wireless systems

Machine learning has commonly been used in static contexts, where model speed is usually not a concern, for example when recognizing images in computer vision. While images are considered stationary data, wireless data (e.g. signals) are inherently time-varying and stochastic. Training a robust ML model on wireless data that generalizes well is a challenging task, because wireless networks are inherently dynamic environments with changing channel conditions, user traffic demands and operating parameters (e.g. due to changes from standardization bodies). Considering that stability is one of the main requirements of wireless communication systems, rigorous theoretical studies are essential to ensure ML-based approaches always work well in practical systems. The open question here is: how to efficiently train an ML model that generalizes well to unseen data in such a dynamically changing system? The following paragraphs discuss promising directions for addressing this challenge.

Transfer learning
With typical supervised learning, a learned model is applicable to a specific scenario and likely biased towards the training dataset. For instance, a model trained to recognize a set of wireless technologies recognizes only those technologies, and is also tied to the specific characteristics of the wireless environment where the data was collected. What if new technologies need to be identified? What if the conditions in the wireless environment change? Clearly, the generalization ability of trained models is still an open question. How can we efficiently adapt a model to these new circumstances?
Traditional approaches may require retraining the model on new data (i.e. incorporating new technologies or the specifics of a new environment, together with new labels). Fortunately, with recent advances in ML it turns out that fully retraining a model is not always necessary. A popular method called transfer learning may solve this. Transfer learning allows the knowledge gained from one task to be transferred to a similar task, thereby alleviating the need to train ML models from scratch [281]. The advantage of this approach is that the learning process in new environments can be sped up, with a smaller amount of data needed to train a well-performing model. In this way, wireless networking researchers may solve new but similar problems more efficiently. For instance, if a new task requires recognizing new modulation formats, the parameters of an already trained CNN model may be reused as the initialization for training the new CNN.
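The weight-reuse idea above can be illustrated with a minimal numpy sketch (an invented toy, not taken from any surveyed work): a "pretrained" feature extractor is kept frozen, and only a small output head is retrained on a handful of samples from the new, related task. The data generator, dimensions and the stand-in extractor weights are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(n, w_feat, shift):
    """Synthetic task: inputs share latent structure, tasks differ by a shift."""
    X = rng.normal(size=(n, 8)) + shift
    h = np.tanh(X @ w_feat)              # shared hidden representation
    y = (h.sum(axis=1) > 0).astype(int)  # label depends on that representation
    return X, y

def train_head(H, y, epochs=200, lr=0.5):
    """Retrain only the output layer (logistic regression) by gradient descent."""
    w = np.zeros(H.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(H @ w)))
        w -= lr * H.T @ (p - y) / len(y)
    return w

# Stand-in for a feature extractor pretrained on a large source dataset
# (e.g. the convolutional layers of an already trained CNN).
w_feat = rng.normal(size=(8, 4))

# Target task: only 40 labeled samples, with shifted input statistics.
X_tgt, y_tgt = make_task(40, w_feat, shift=0.3)

# Transfer: freeze the extractor, retrain only the small head on target data.
H_tgt = np.tanh(X_tgt @ w_feat)
w_head = train_head(H_tgt, y_tgt)

p = 1.0 / (1.0 + np.exp(-(H_tgt @ w_head)))
acc = float(((p > 0.5).astype(int) == y_tgt).mean())
print(f"target-task accuracy with transferred (frozen) features: {acc:.2f}")
```

Because the extractor is frozen, only a 4-dimensional head is trained, which is why far fewer target-task samples suffice than training the full model from scratch would require.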

Active learning
Active learning is a subfield of ML that allows a learning model to be updated on the fly in a short period of time. In wireless networks, the benefit is that updating the model according to the current wireless networking conditions keeps it accurate with respect to the current state [282]. The learning model adjusts its parameters whenever it receives new labeled data, and the learning process stops when the system achieves the desired prediction accuracy.
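The query-label-retrain loop described above can be sketched with a toy uncertainty-sampling example in numpy (the pool data, the labeling oracle and the accuracy threshold are all invented for illustration): the learner repeatedly asks an oracle to label the sample it is least certain about, retrains, and stops once the desired accuracy is reached.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unlabeled pool of 2-D samples (labeling each one is assumed to be costly).
X_pool = rng.uniform(-1.0, 1.0, size=(300, 2))
true_w = np.array([2.0, -1.5])

def oracle(X):
    """Stands in for an expert/measurement that provides labels on request."""
    return (X @ true_w > 0).astype(int)

def fit(X, y, epochs=300, lr=0.5):
    """Logistic regression trained by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Start with a few random labels, then query the most uncertain sample
# (predicted probability closest to 0.5), label it, and retrain.
labeled = list(rng.choice(len(X_pool), size=5, replace=False))
acc = 0.0
for _ in range(15):
    w = fit(X_pool[labeled], oracle(X_pool[labeled]))
    p = 1.0 / (1.0 + np.exp(-(X_pool @ w)))
    acc = float(((p > 0.5).astype(int) == oracle(X_pool)).mean())
    if acc >= 0.95:          # stop once the desired accuracy is achieved
        break
    p[labeled] = 0.0         # never re-query already labeled samples
    labeled.append(int(np.argmin(np.abs(p - 0.5))))

print(f"pool accuracy after {len(labeled)} labels: {acc:.2f}")
```

The key design choice is the query criterion: by labeling only samples near the decision boundary, the model typically reaches the target accuracy with far fewer labels than random sampling would need.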

Unsupervised/semi-supervised deep learning
Typical supervised learning approaches, especially the recently popular deep learning techniques, require a large amount of training data with corresponding labels. The disadvantage is that such data may not always be available, or may come at great expense to prepare. Labeling is an especially time-consuming task in wireless networks, where one has to wait for certain types of events to occur (e.g. the appearance of an emission from a specific wireless technology or on a specific frequency band) in order to create training instances for building robust models. At the same time, constructing labels requires significant expert knowledge, a process that is neither sufficiently automated nor generic enough for practical implementations.
To reduce the need for domain knowledge and labeled data, deep unsupervised learning [131] and semi-supervised learning [74] have recently been used. For instance, autoencoders (AEs) have become a powerful deep unsupervised learning tool [283], with the ability to compress input information by learning a lower-dimensional encoding of the input. However, these new tools require further research to reach their full potential in (practical) wireless networks.
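The compression ability of an AE can be illustrated with a minimal numpy sketch (a linear autoencoder trained by plain gradient descent; the synthetic "measurements", dimensions and learning rate are assumptions for illustration): 8-dimensional samples that actually lie near a 3-dimensional subspace are squeezed through a 3-unit bottleneck and reconstructed, with no labels involved.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic unlabeled data: 8-dim samples generated from a 3-dim latent
# structure plus a little noise, so a 3-unit bottleneck can compress them.
Z = rng.normal(size=(400, 3))
M = rng.normal(size=(3, 8))
X = Z @ M + 0.05 * rng.normal(size=(400, 8))

# Linear autoencoder: encoder W_e (8 -> 3), decoder W_d (3 -> 8), trained
# to reconstruct the input; the bottleneck is the learned low-dim encoding.
W_e = 0.1 * rng.normal(size=(8, 3))
W_d = 0.1 * rng.normal(size=(3, 8))
lr = 0.01
for _ in range(500):
    H = X @ W_e                           # encode
    err = H @ W_d - X                     # decode and compare to input
    W_d -= lr * H.T @ err / len(X)        # gradient of mean squared error
    W_e -= lr * X.T @ (err @ W_d.T) / len(X)

mse = float(np.mean(((X @ W_e) @ W_d - X) ** 2))
print(f"reconstruction MSE with 3-dim bottleneck: {mse:.3f}")
```

A linear AE like this converges toward the same subspace as PCA; deep AEs with nonlinear activations extend the idea to representations that linear methods cannot capture.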

Conclusion
With the advances in hardware and computing power, and the ability to collect, store and process massive amounts of data, machine learning (ML) has found its way into many scientific fields, including wireless networks. The challenges faced by wireless networks have pushed the wireless networking domain to seek more innovative solutions to ensure the expected network performance. To address these challenges, ML is increasingly used in wireless networks.
In parallel, a growing number of surveys and tutorials have emerged on ML applied to wireless networks. We noticed that some existing works focus on specific wireless networking tasks (e.g. wireless signal recognition), some on the usage of specific ML techniques (e.g. deep learning), and others on aspects of a specific wireless environment (e.g. IoT, WSN, CRN, etc.) across broad application scenarios (e.g. localization, security, environmental monitoring, etc.). However, none of these works elaborates on ML for optimizing the performance of wireless networks, which is critically affected by the proliferation of wireless devices, networks and technologies and by increased user traffic demands. We further noticed that some works omit the fundamentals necessary for the reader to understand ML and data-driven research in general. To fill this gap, this paper presented i) a well-structured starting point for non-machine-learning experts, providing the fundamentals of ML in an accessible manner, and ii) a systematic and comprehensive survey on ML for performance improvement of wireless networks from various perspectives of the network protocol stack. To the best of our knowledge, this is the first survey that comprehensively reviews the latest research efforts (up until and including 2019) in applying prediction-based ML techniques to improving the performance of wireless networks, while looking at all protocol layers: PHY, MAC and network. The surveyed research works are categorized into radio analysis, MAC analysis and network prediction approaches, covering various wireless networks including IoT, WSN, cellular networks and CRNs. Within radio analysis approaches we identified the following: automatic modulation recognition, and wireless interference identification (i.e. technology recognition, signal identification and emitter identification).
MAC analysis approaches are divided into MAC identification, wireless interference identification and spectrum prediction tasks. Network prediction approaches are classified into performance prediction and traffic prediction approaches.
Finally, open challenges and exciting research directions in this field were elaborated. We discussed where standardization efforts are required, including standard datasets, problems, data representations and evaluation metrics. Further, we discussed the open challenges of implementing machine learning models in practical wireless systems, with future directions at two levels: i) implementing ML on constrained wireless devices (by reducing the complexity of ML models or implementing them in a distributed fashion) and ii) adapting the infrastructure for massive data collection and transfer (via edge analytics and cloud computing). We also discussed open challenges and future directions regarding the generalization of ML models in practical wireless environments.
We hope that this article will become a source of inspiration and guide for researchers and practitioners interested in applying machine learning for complex problems related to improving the performance of wireless networks.