Machine learning (ML) is a popular subfield of AI that allows machines to learn from data. It gives devices the capability to learn and perform tasks without being explicitly programmed. ML techniques can be divided into three categories: supervised, unsupervised, and reinforcement learning (RL). In the following subsections, we present the state of the art of some machine learning techniques in cognitive sensing solutions.

#### 7.1.1. Supervised Learning Techniques

In supervised learning (SL), algorithms are trained using labeled datasets. During the training process, an algorithm computes an estimate based on the input dataset and continually updates the estimate until it achieves a predefined degree of accuracy. An SL algorithm adjusts its parameters to minimize a cost function that measures the error between the labeled and predicted outputs [118]. SL algorithms are mainly used for classification and regression tasks.

In classification tasks, data are grouped into a predetermined number of distinct labeled classes. Popular algorithms include Naïve Bayes (NB), K-Nearest Neighbor (KNN), Support Vector Machines (SVM), Decision Trees (DT), Random Forests (RF), Artificial Neural Networks (ANN), and ensembles. Some algorithms, e.g., NB and RF, are fast, accurate, and easy to implement, which makes them usable on resource-constrained nodes, while others like KNN and SVM tend toward high computational complexity, especially with large datasets. For example, the authors in ref. [119] successfully used RF on a constrained node to classify collected data before transmission and further demonstrated that the energy consumption of the AI-based process was three times lower than that of the normal sense-and-transmit approach. Other algorithms like DT, ANN, and SVM suffer from overfitting when the models are not well pruned or regularized. Common classification tasks in CS include threat classification [83], device/user classification [120], and data and request classification [121].

The operating principle of an algorithm, the type of data (size, kind, features, state, etc.), and the CS task influence the choice of algorithm. For example, a Naïve Bayes classifier assumes that data attributes are statistically independent of one another given the class. Thus, given a vector of attributes **t**, NB evaluates the probability that the vector belongs to a class using Bayes' theorem in Equation (7)

It then uses the naïve assumption given as

Subsequently, for all **i**, this relationship is simplified to the form shown in Equation (9)

This simplification makes an NB model easy to implement, fast, and capable of real-time processing in CS tasks with massive datasets that have many or irrelevant features.
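As a hedged illustration of the NB computation described above, the sketch below scores each class $c$ by $P(c)\prod_i P(t_i \mid c)$, which is the standard textbook form of the naïve factorization. The symbols $c$ and $t_i$, the toy sensor data, and the add-one (Laplace) smoothing are assumptions made for illustration and are not taken from the source. The evidence term $P(\mathbf{t})$ is identical for every class, so it is omitted when classes are compared.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples, labels):
    """Count class frequencies and per-attribute value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(class, attribute index)][value] -> count
    value_counts = defaultdict(Counter)
    for attrs, label in zip(samples, labels):
        for i, value in enumerate(attrs):
            value_counts[(label, i)][value] += 1
    # Number of distinct values per attribute, needed for smoothing.
    n_values = [len({s[i] for s in samples}) for i in range(len(samples[0]))]
    return class_counts, value_counts, n_values


def predict_nb(model, attrs):
    """Return the class maximizing log P(c) + sum_i log P(t_i | c)."""
    class_counts, value_counts, n_values = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for i, value in enumerate(attrs):
            # Add-one smoothing (an assumption) avoids zero probabilities.
            count = value_counts[(c, i)][value]
            score += math.log((count + 1) / (n_c + n_values[i]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical readings: (temperature level, motion flag) -> label.
samples = [("high", 1), ("high", 1), ("low", 0), ("low", 0), ("low", 1)]
labels = ["threat", "threat", "normal", "normal", "normal"]
model = train_nb(samples, labels)
prediction = predict_nb(model, ("high", 1))
```

Working in log space turns the product of per-attribute probabilities into a sum, which avoids numerical underflow when the attribute vector is long.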

Given an independent variable, algorithms such as linear regression, SVM, and ANN are used to predict a continuous value. Hence, they are useful for forecasting or for establishing a relationship between variables of interest. Common prediction-based tasks in CS include data/event prediction [122], energy consumption/availability forecasting [17], application load prediction [123], and threat prediction [124]. For a dependent variable y in Equation (10)

the goal of a linear regression algorithm is to find the best values for ${b}_{0}$ and ${b}_{1}$. This can be achieved by solving a minimization problem that minimizes the error between the predicted value and the actual value. The mean squared error (MSE) is obtained by squaring the error, summing it over all the data points, and dividing by the total number of data points. It is given as

A gradient descent approach is then used to iteratively update the values of ${b}_{0}$ and ${b}_{1}$ to reduce the MSE.
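The update procedure can be sketched as follows, assuming the model has the usual simple linear form $y = b_0 + b_1 x$ and that the MSE is $\frac{1}{n}\sum_{j}\left(y_j - (b_0 + b_1 x_j)\right)^2$. The learning rate, the number of steps, and the toy data are hypothetical choices for illustration only.

```python
def mse(b0, b1, xs, ys):
    """Mean squared error of the line y = b0 + b1 * x on the data."""
    n = len(xs)
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / n


def fit_linear(xs, ys, lr=0.05, steps=5000):
    """Batch gradient descent on the MSE with respect to b0 and b1."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE.
        grad_b0 = 2.0 / n * sum(errors)
        grad_b1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b0, b1


# Toy data generated from y = 1 + 2x (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_linear(xs, ys)
```

On real sensor data the learning rate usually has to be tuned, and the inputs are often rescaled first, because a step size that is too large makes the updates diverge.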

Convolutional Neural Networks (CNNs) are a class of deep neural networks commonly applied to vision-based analysis, image and video recognition systems, recommender systems, and image classification. A CNN consists of an input layer, an output layer, and hidden layers that comprise a series of convolutional layers. Obstacles to using CNNs in IoT nodes include high computational complexity, a lack of sufficient datasets, and high energy and memory usage [125]. However, these obstacles can be mitigated by scaling a large model down or by using a simplified model designed for resource-constrained environments. For example, the authors in ref. [126] proposed a simple but efficient CNN model suitable for IoT devices. The simplified model achieves its state-of-the-art performance by factorizing standard 3 × 3 convolutions into pairs of 1 × 3 and 3 × 1 standard convolutions, instead of performing depth-wise convolutions.
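To illustrate why such a factorization saves computation, the sketch below applies a 1 × 3 kernel followed by a 3 × 1 kernel and compares the result with a single 3 × 3 kernel. It is not the model from ref. [126]: the kernels and image are hypothetical, and exact equality holds only because the chosen 3 × 3 kernel is the outer product of the two smaller ones. In a trained network the factorized pair is learned directly and generally only approximates an arbitrary 3 × 3 filter.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding (CNN-style cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


row_kernel = [[1, 2, 1]]        # 1 x 3
col_kernel = [[1], [0], [-1]]   # 3 x 1
# Equivalent 3 x 3 kernel: outer product of the column and row kernels.
full_kernel = [[cv[0] * rv for rv in row_kernel[0]] for cv in col_kernel]

image = [
    [1, 2, 3, 4, 5],
    [5, 4, 3, 2, 1],
    [0, 1, 0, 1, 0],
    [2, 2, 2, 2, 2],
    [9, 8, 7, 6, 5],
]

direct = correlate2d_valid(image, full_kernel)
factorized = correlate2d_valid(correlate2d_valid(image, row_kernel), col_kernel)

# Weights per input/output channel pair.
weights_full = 3 * 3                 # 9
weights_factorized = 1 * 3 + 3 * 1   # 6
```

The factorized pair uses 6 weights instead of 9 per channel pair, which is where the reduction in parameters and multiply-accumulate operations comes from.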

The authors in ref. [127] proposed a streaming hardware accelerator for image detection using CNNs in IoT nodes. To promote energy efficiency, the accelerator avoids unnecessary data movement and uses a unique filter decomposition technique to support arbitrary convolution window sizes. Also, to improve throughput, the accelerator uses an external pooling module to provide the pooling function. Validation of the accelerator showed that it can support popular CNNs and is suitable for integration with IoT devices. The authors in ref. [128] present a CNN- and RNN-based network traffic classifier for classifying IoT traffic. The proposed method provided better detection results than alternative algorithms without the additional feature engineering common in other models.

In a related study, the authors in ref. [129] applied a compressed sensing scheme at the input layer of a CNN model for image classification to reduce resource consumption and the required number of training samples. The proposed technique was then evaluated using the public datasets MNIST and CIFAR-10, with results showing reduced training and inference time. Further, the model achieved a higher classification accuracy when compared with traditional large CNN models. In ref. [130], a CNN indoor localization framework based on RSSI measurements was developed using a 3D radio image-based region recognition process. It aims to localize a sensor node accurately by determining its location region. To achieve this, 3D radio images are constructed based on the Received Signal Strength Indicator (RSSI) fingerprints. The RSSI measurements and the kurtosis values are then used to provide new information to the network. The proposed method solved the problem of the high computational complexity of traditional methods and ensured good localization accuracy. The authors in ref. [131] developed a general-purpose CNN for image and video classification in IoT systems. To overcome the high computational cost of CNNs, the developed system distributes the computation across the units of the IoT system; this distribution is formalized as an optimization problem that minimizes the latency between the data-gathering and decision-making phases. The strength of the proposed CNN lies in its ability to support multiple IoT data sources as well as parallel execution on the same IoT system.

#### 7.1.2. Unsupervised Learning

The goal of unsupervised learning algorithms is to find unknown patterns or reduce features in unlabeled datasets. These two tasks are carried out using clustering and dimensionality reduction (DR) techniques. Popular clustering techniques include K-means, hierarchical clustering, density-based spatial clustering of applications with noise (DBSCAN), and cluster analysis, while DR tasks mainly use principal component analysis (PCA), linear discriminant analysis (LDA), non-negative matrix factorization (NNMF), and autoencoder methods. Clustering algorithms are extremely useful in CS-related tasks because they work with unlabeled data and can automate the difficult sensor data annotation process [132,133]. On the other hand, DR techniques are useful for selecting and extracting features from collected data before transmission when bandwidth is limited, or as a precursor to a supervised learning task [134,135]. Other application areas include density estimation and outlier and anomaly detection [136]. Data clustering is the process of grouping unlabeled data into clusters that share similar features. For example, given a set of measurements $\left({m}_{1},{m}_{2},\dots ,{m}_{n}\right)$, where each measurement is a g-dimensional real vector, k-means clustering partitions the n measurements into $k\left(\le n\right)$ sets S = $\left({S}_{1},{S}_{2},\dots ,{S}_{k}\right)$ to minimize the within-cluster sum of squares. The objective is shown in Equation (12) as

where ${\mu}_{i}$ is the mean of points in ${S}_{i}$. In hierarchical clustering, the objective is to build a hierarchy of clusters using either a bottom-up or a top-down approach. In the bottom-up approach, each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy, whereas in the top-down approach, all observations start in one cluster and splits are performed recursively as one moves down the hierarchy. Common metrics used to determine whether clusters are to be combined or split include the Euclidean distance ${\Vert a-b\Vert}_{2}=\sqrt{\sum_{i}{({a}_{i}-{b}_{i})}^{2}}$, the squared Euclidean distance ${\Vert a-b\Vert}_{2}^{2}=\sum_{i}{({a}_{i}-{b}_{i})}^{2}$, and the Manhattan distance ${\Vert a-b\Vert}_{1}=\sum_{i}\left|{a}_{i}-{b}_{i}\right|$.
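The k-means procedure described above can be sketched with Lloyd's algorithm, which alternates between assigning each measurement to its nearest centroid (using the squared Euclidean distance) and recomputing each centroid as the mean of its cluster. This is a minimal sketch under stated assumptions: the measurements, the initial centroids, and the function names are hypothetical, and the within-cluster sum of squares is taken in its standard form $\sum_{i=1}^{k}\sum_{m\in S_i}{\Vert m-\mu_i\Vert}^{2}$.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


def wcss(centroids, clusters):
    """Within-cluster sum of squares for a given partition."""
    return sum(squared_euclidean(p, mu)
               for mu, c in zip(centroids, clusters) for p in c)


# Hypothetical 2-D measurements forming two well-separated groups.
measurements = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
                (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(measurements,
                             centroids=[(0.0, 0.0), (10.0, 10.0)])
total_wcss = wcss(centroids, clusters)
```

Lloyd's algorithm only finds a local minimum of the objective, so the result depends on the initial centroids; practical implementations usually restart from several initializations.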

Dimensionality reduction techniques are important in CS solutions because some supervised learning algorithms struggle with large datasets and because sensor data transmission is limited by bandwidth and energy. PCA is an algorithm used mainly for DR; it performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the lower-dimensional space is maximized. A review of the DR techniques applicable to the IoT domain can be found in ref. [137].
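A minimal sketch of this variance-maximizing mapping is given below. It finds only the first principal component, using power iteration on the sample covariance matrix, and projects the centered data onto it. The data, the function name, and the iteration count are hypothetical; the sketch also assumes the data are not all identical and that the starting vector is not orthogonal to the leading component.

```python
import math


def first_principal_component(data, iters=200):
    """Return (unit direction, 1-D projections) for the top principal component."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered sample onto the principal direction.
    projections = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, projections


# Hypothetical 2-D readings that lie exactly on the line y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, projections = first_principal_component(data)
```

Here each two-dimensional reading is replaced by a single projected value, which is the kind of reduction that lowers the amount of data a node has to transmit.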

To recover missing IoT sensor data, the authors in ref. [138] propose the use of a probabilistic method and data from related sensors. The proposed method uses a K-means algorithm to split the data into clusters, based on the idea that sensors within one group will have similar patterns of measurement. After this, probabilistic matrix factorization (PMF) is carried out within each cluster to recover missing sensor data by analyzing the measurement patterns of neighboring sensors. The performance of the PMF algorithm is further enhanced by normalizing the data and limiting the probabilistic distribution of random feature matrices. Compared with other approaches that use SVM and DNN, the proposed method achieved better recovery accuracy and a lower root mean square error. The method, however, suffers from scalability issues due to the increased difficulty of determining the correlation between sensor data on large datasets.

In ref. [139], the authors proposed a node-density-based clustering and mobile collection (NDCMC) approach that combines hierarchical routing and mobile element (ME) data collection techniques. In this approach, cluster heads (CHs) collect data from their members, after which mobile elements aggregate these data by visiting the CHs. To achieve this, the work proposes a CH selection scheme based on a node density clustering algorithm so that nodes surrounded by more deployed nodes become CHs. This aims to improve the efficiency of the intracluster routing and ME data collection process. The ME then uses a low-complexity traveling track planning algorithm to collect data from all CHs. The strength of the proposed approach lies in its ability to provide more uniform power consumption among nodes. However, the difficulty in scheduling the traveling paths of the ME is an observed disadvantage. Further, in ref. [140], the authors propose a recursive principal component analysis (R-PCA)-based data analysis framework that aggregates redundant data and detects outliers. To achieve this, the principal components of aggregated sensor data are extracted at the CH, which increases energy consumption at the CH nodes.

#### 7.1.3. Reinforcement Learning

RL deals with how an agent learns while interacting with an environment through a reward system. The agent receives a delayed reward in the next time step, based on which it evaluates its previous action. There are two variants of RL: model-based and model-free. In the model-based approach, a transition probability maps a current state and an action to a resulting next state [141]. Thus, an agent’s task is to learn an optimal policy that maximizes its reward or reduces its cost as it navigates through the environment [56]. Examples of such algorithms are Dyna-Q and Monte Carlo methods. On the other hand, an agent in model-free RL relies on trial-and-error actions to update its knowledge about the environment, e.g., temporal difference learning and Q-learning. An RL problem is modeled using the Markov decision process (MDP) framework. An MDP is a 5-tuple [S, A, P, R, ${S}_{0}$], where S is the set of possible states, A is the set of corresponding actions, $P({S}_{t+1}\mid {S}_{t},{A}_{t})$ represents the dynamics, and $R({S}_{t},{A}_{t},{S}_{t+1})$ is the reward; R(s, a, ${S}_{0}$) represents the reward given to the agent at state s after performing an action a and terminating in state ${S}_{0}$ [142]. The objective of an MDP is to find an optimal control policy that maximizes a given average reward per unit time or a policy that minimizes the average cost per unit time. The value of a policy π, ${V}^{\pi}$ (i.e., the expected discounted reward when starting in some state and following policy π), can be expressed using the Bellman equation given as

whereas the optimal value function (the value function of an optimal policy ${\pi}^{\ast}$, i.e., a policy with the highest value) can be obtained using
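In the standard discounted formulation (an assumption here, with $\gamma$ denoting a discount factor that the text does not define), these relations are commonly written as ${V}^{\pi}(s)=\sum_{s'}P(s'\mid s,\pi(s))\left[R(s,\pi(s),s')+\gamma {V}^{\pi}(s')\right]$ and ${V}^{\ast}(s)=\max_{a}\sum_{s'}P(s'\mid s,a)\left[R(s,a,s')+\gamma {V}^{\ast}(s')\right]$. The sketch below applies the second relation repeatedly (value iteration) to a hypothetical two-state energy-harvesting node; the states, actions, transitions, and rewards are invented for illustration.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-10):
    """Apply the Bellman optimality backup until the values stop changing."""

    def backup(s, a, V):
        # Expected reward plus discounted value of the next state.
        return sum(p * (R[(s, a, s2)] + gamma * V[s2])
                   for s2, p in P[(s, a)].items())

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(backup(s, a, V) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(actions, key=lambda a: backup(s, a, V)) for s in states}
    return V, policy


# Hypothetical node: battery is "low" or "high"; it can "sleep" or "sense".
states = ["low", "high"]
actions = ["sleep", "sense"]
P = {  # P[(state, action)] -> {next_state: probability}
    ("low", "sleep"): {"high": 1.0},
    ("low", "sense"): {"low": 1.0},
    ("high", "sleep"): {"high": 1.0},
    ("high", "sense"): {"low": 1.0},
}
R = {  # R[(state, action, next_state)] -> reward
    ("low", "sleep", "high"): 0.0,
    ("low", "sense", "low"): 1.0,
    ("high", "sleep", "high"): 0.0,
    ("high", "sense", "low"): 5.0,
}
V, policy = value_iteration(states, actions, P, R)
```

In this toy problem the resulting policy recharges when the battery is low and senses when it is high, because sensing on a full battery earns the larger reward.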

In CS tasks, RL is used to solve planning, control, optimization, and learning-related problems, e.g., retransmission scheduling in 802.15.4e LLDN [143], intrusion detection systems [144], self-learning power control [145], power consumption scheduling in an EH IoT node [146,147], and sampling rate configuration of EH sensors [148].

A major challenge when using RL techniques for cognitive sensing tasks is the difficulty of training active agents whose drop in performance could adversely affect the overall system. Other difficulties are the memory-intensive nature of some RL algorithms and the need to limit exploratory moves during learning where the agent’s safety is paramount. The large and continuous state and action spaces of some sensing tasks are also a challenge that needs to be addressed efficiently [149].

In ref. [150], the authors develop three RL-based methods that address the user access control and battery prediction problems in a multiuser EH-based IoT system. The LSTM-DQN-based scheduling algorithm uses causal information about the channel and node battery states to find an optimal policy that maximizes the long-term discounted uplink sum rate. The battery state prediction algorithm uses a deep LSTM to minimize prediction loss. The algorithms were tested under different conditions, with results showing that they were effective in addressing these problems. In ref. [151], the authors formulate the resource allocation problem of IoT fog nodes using a Markov decision process (MDP) framework. For each request from an IoT user, the node decides whether to serve it locally at the edge using its own resources or to refer it to the cloud to conserve its valuable resources. The formulated MDP problem is then solved using several RL methods, namely Q-learning, SARSA, Expected SARSA, and Monte Carlo, which learn the optimal decision-making policies. The performance and adaptivity of the RL methods are then compared with those of the network slicing approach under various slicing thresholds. The evaluation results showed that the RL algorithms can be adapted to various IoT environments.
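To make the difference between two of the methods named above concrete, the sketch below shows the standard tabular update rules for Q-learning and SARSA. It is not the formulation of ref. [151]: the state names, action names, reward, learning rate `alpha`, and discount factor `gamma` are hypothetical values chosen for illustration.

```python
def q_learning_update(Q, s, a, reward, s_next, alpha=0.5, gamma=0.9):
    """Off-policy TD update: bootstrap from the best next action."""
    target = reward + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])


def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy TD update: bootstrap from the next action actually taken."""
    target = reward + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def make_q_table():
    """Hypothetical action values for a fog node in two load states."""
    return {
        "idle": {"serve_locally": 0.0, "offload_to_cloud": 0.0},
        "busy": {"serve_locally": 2.0, "offload_to_cloud": 1.0},
    }


# Same transition, two different update rules.
q_table_ql = make_q_table()
q_learning_update(q_table_ql, "idle", "serve_locally",
                  reward=1.0, s_next="busy")

q_table_sarsa = make_q_table()
sarsa_update(q_table_sarsa, "idle", "serve_locally",
             reward=1.0, s_next="busy", a_next="offload_to_cloud")

q_after_q_learning = q_table_ql["idle"]["serve_locally"]
q_after_sarsa = q_table_sarsa["idle"]["serve_locally"]
```

Q-learning bootstraps from the highest-valued next action regardless of what the agent does next, whereas SARSA uses the action the agent actually selects, so the two updates differ whenever the next action is exploratory.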

Table 6 presents some ML algorithms and selected works, detailing their strengths and weaknesses for cognitive sensing tasks.