A Review of Machine Learning and Deep Learning Techniques for Anomaly Detection in IoT Data

Anomaly detection has gained considerable attention in recent years. Emerging technologies, such as the Internet of Things (IoT), are among the most critical sources of data streams, producing massive amounts of data continuously from numerous applications. Examining these collected data to detect suspicious events can reduce functional threats and avoid unseen issues that cause downtime in the applications. Due to the dynamic nature of data stream characteristics, many unresolved problems persist. In the existing literature, methods have been designed and developed to evaluate certain anomalous behaviors in IoT data stream sources. However, there is a lack of comprehensive studies that discuss all aspects of IoT data processing. Thus, this paper attempts to fill this gap by providing a complete picture of the state-of-the-art techniques for the major problems and core challenges in IoT data. The nature of the data, anomaly types, learning modes, window models, datasets, and evaluation criteria are also presented. Research challenges related to evolving data, evolving features, windowing, ensemble approaches, the nature of the input data, data complexity and noise, parameter selection, data visualization, data heterogeneity, accuracy, and large-scale and high-dimensional data are investigated. Finally, the challenges that require substantial research efforts and the future directions are summarized.


Introduction
The advent of the Internet has revolutionized communication between humans. Similarly, Internet of Things (IoT) devices are reshaping how humans perceive and interact with the physical world. By 2025, IoT systems are expected to comprise nearly 75 billion connected devices, many times the global population [1].
The IoT is a network of heterogeneous objects, such as smartphones, laptops, intelligent devices, and sensors, connected to the Internet through various technologies [2]. The IoT enables various sensors and devices to communicate with each other directly without user interaction [3]. IoT has become one of the biggest data sources in the past few years [4]. Methods such as machine learning algorithms can be used to extract meaningful information from these data.
Analyzing these data efficiently remains a significant challenge. One technique that effectively analyses the collected data stream is anomaly detection [5,6]. In many cases, IoT real-time applications generate infinitely massive data streams that impose unique limitations and obstacles on machine learning algorithms [13]. These challenges require a careful design of the algorithm that processes these data [14]. Most existing data stream algorithms are less efficient and have limited capabilities [15]. Many studies have investigated the techniques used for anomaly detection, such as [16][17][18][19], addressing static data and data streams using both statistical and machine learning methods. However, these studies have not focused on evolving data streams.
Some studies have focused on the detection of anomalies in data streams, such as [16]. However, previous studies have not addressed all the requirements an algorithm must satisfy to process IoT data streams, nor the main challenges in choosing an algorithm that suits the IoT data characteristics. Many algorithms can be used to detect anomalies in a data stream. A good anomaly detection algorithm should consider the following restrictions related to data streams:

•
Data points are pushed out continuously, and the speed of arrival of data depends on the data source. Thus, it could be fast or slow.

•
The data stream could be potentially infinite, which means there could be no end to the incoming data.

•
Features and/or characteristics of the arriving data points may evolve.

•
Data points are processed in one pass, i.e., each data point can be used only once and is discarded afterwards. Thus, extracting the important data characteristics as they arrive is essential.
To achieve good-quality anomaly detection, algorithms must be able to handle the following challenges before they can be used effectively:

•
Ability to handle fast data: the anomaly detection algorithm must handle and process data in real time as data points arrive constantly from the source, since data streams can be huge and must be handled in one pass.

•
Ability to process data in limited memory: the massive amount of data should not overwhelm the stream's processing capabilities. The algorithm should not require unbounded memory for the unbounded number of arriving data points; it must process the data within the available memory.

•
Ability to handle dimensionality: high-dimensional data pose the additional problem of selecting the feature vectors or dimensions that yield better clusters, and they require extra calculation and processing. The algorithm should account for these factors so that cluster quality is not affected.

•
Ability to handle evolving data streams: data sources are numerous, change over time, and grow continuously. The data produced will therefore change, and the algorithm's outcomes can change considerably. Stream behavior is known to evolve over time and needs special methods to handle it.
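The bounded-memory and one-pass requirements above can be sketched in a few lines of standard Python. The class below is an illustrative toy, not an algorithm from the surveyed literature: it keeps only a fixed-size window of recent points (bounded memory), sees each point exactly once (one pass), and flags values far from the window statistics. The class name, window size, and z-score threshold are our own choices.

```python
from collections import deque
from statistics import mean, stdev

class SlidingWindowDetector:
    """Toy one-pass detector: memory use is bounded by the window size."""

    def __init__(self, window=50, threshold=3.0):
        self.buf = deque(maxlen=window)   # oldest points are discarded automatically
        self.threshold = threshold

    def update(self, x):
        """Consume one point; return True if it looks anomalous."""
        anomalous = False
        if len(self.buf) >= 10:           # wait for a minimal baseline
            mu, sigma = mean(self.buf), stdev(self.buf)
            if sigma > 0 and abs(x - mu) / sigma > self.threshold:
                anomalous = True
        self.buf.append(x)                # each point is seen exactly once
        return anomalous

detector = SlidingWindowDetector(window=50)
stream = [10.0] * 40 + [10.5, 9.8, 100.0, 10.1]
flags = [detector.update(x) for x in stream]   # only the 100.0 is flagged
```

Because the deque discards old points, memory stays constant no matter how long the stream runs, which is precisely the constraint that rules out storing the full stream.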
Existing research has mostly analyzed anomalies using machine learning techniques focused primarily on batch processing. In contrast, this paper focuses on machine learning techniques for anomaly detection in data streams, more specifically in evolving data streams. The major contributions of this research are as follows:
1. Examination of state-of-the-art studies centered on machine learning and deep learning techniques for anomaly detection in data streams;
2. A taxonomy that organizes the current literature by the nature of the data, anomaly types, detection learning modes, window model, dataset, and evaluation criteria;
3. Analysis of the existing techniques based on the proposed taxonomy;
4. Highlighting of the challenges that shape future research directions.
The rest of the paper is organized as follows: Section 2 provides a brief background of the study, including definitions of the main terms used in this review. Section 3 presents a taxonomy of anomaly detection techniques for IoT data streams, covering the machine learning and deep learning techniques used, the nature of the data, anomaly types, detection learning modes, window models, datasets, and evaluation criteria. Section 4 discusses the research challenges and potential future directions. Section 5 presents the research results. Finally, the conclusion is given in Section 6.

Background
With sensors invading humans' daily lives, data streams are growing exponentially. Driven by the expansion of the Internet of Things (IoT) and its connected real-time data sources, many applications generate vast amounts of critical data streams that evolve. A data stream is a continuous, infinite series of data records ordered by implicit or explicit timestamps [14]. Analyzing such data streams efficiently offers useful insights for many application domains, yet it is a significant challenge. One of the techniques that deals effectively with analyzing the collected data stream is anomaly detection [5]. Anomaly detection refers to the identification of events or patterns in a dataset that differ significantly from the majority of items or from the expected pattern; such unexpected patterns are referred to as anomalies, novelties, exceptions, noise, surprises, or deviations [7,8,20]. Anomaly detection plays a considerable role in the analysis of anomalies in many applications, and in almost every use case, early detection is useful.
Consider a device that continuously monitors a cardiac patient's heart rate: an anomaly may precede a heart attack. Detecting such an anomaly minutes in advance is far better than detecting it a few seconds ahead, or after the event occurs [21]. Detecting anomalies in the data stream thus has practical and important applications across a wide range of fields [15].
One of the main approaches used for anomaly detection is machine learning techniques. As stated in [22], machine learning enables computers to learn without explicit programming. It is intended to allow a system to learn from the past or the present and to use the knowledge to make future predictions or decisions [18]. Even though "learning" is extremely vital in machine learning, it is not the objective. The primary objective of machine learning is to create a system that can identify relevant patterns in data automatically and correctly [23].
Recent advances in deep learning techniques have also made it possible to largely improve anomaly detection performance compared with classical approaches. Deep learning is a subset of machine learning in artificial intelligence that uses networks capable of unsupervised learning from unstructured data [5].

Taxonomy of Anomaly Detection Techniques
The taxonomy defined here reflects the state-of-the-art techniques on anomaly detection using machine learning in IoT data streams. Further, this section also elaborates on the nature of data, types of anomalies and the anomaly detection learning modes, window models used for analyzing the data, the datasets used for evaluating the anomaly detection techniques, and the evaluation criteria used to measure the performance of the anomaly detection technique, as illustrated in Figure 2.

Machine Learning Techniques
Machine learning anomaly detection techniques for data streams have been proposed in recent years. These techniques can detect data stream anomalies across a broad variety of areas, including manufacturing networks, finance and banking, the military, healthcare, insurance, network protection, and the Internet of Things.
Based on Local Outlier Factor (LOF), a popular and successful anomaly detection algorithm that works only on static data, a new data stream algorithm called Cumulative Local Outlier Factor (C-LOF) was developed [14]. Experimental results suggest that C-LOF may overcome masquerading when detecting anomalies in the data stream. A serious drawback of this strategy is the high time complexity of C-LOF.
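The density intuition behind LOF-family detectors such as C-LOF can be conveyed with a much simpler k-nearest-neighbour distance score. This is a sketch of the general idea, not C-LOF itself, and the function name and parameters are ours: a point whose neighbourhood is sparse receives a high score.

```python
def knn_outlier_scores(points, k=3):
    """Average distance to the k nearest neighbours of each 1-D point.
    Larger score = sparser neighbourhood = more outlier-like."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

data = [1.0, 1.1, 0.9, 1.05, 0.95, 8.0]   # 8.0 sits far from the dense cluster
scores = knn_outlier_scores(data, k=3)
```

LOF goes further by comparing each point's local density with that of its neighbours, making the score relative rather than absolute, but the neighbourhood-distance primitive is the same.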
An evolving algorithm based on data stream clustering was introduced by Bezerra et al. [24]. The algorithm, called AutoCloud, is based on the recently introduced Typicality and Eccentricity Data Analytics (TEDA) principle used in anomaly detection research. AutoCloud is an evolving, online, and recursive approach that does not require prior knowledge of the data before processing. The algorithm can handle concept drift and concept evolution, which are inherent problems in data streams. However, the proposed technique overlooks many of the distance problems that dynamically affect the separation of the clouds.
Another evolving clustering algorithm, based on a mixture of typicalities, was proposed for data stream mining [25]. It builds on the TEDA paradigm and separates clustering into two sub-problems: micro-clusters and macro-clusters. Experimental results show that the proposed technique yields good results for data clustering and predicts its density even in the face of events that affect the data distribution parameters, such as concept drift. However, the autonomous adaptation process is not fully considered in this strategy.
Complex linear Bayesian equations have also been developed, wherein existing Bayesian dynamic linear models (BDLMs) and Rao-Blackwellized particle filter (RBPF) techniques are combined for the real-time detection of anomalies [26]. This methodology aims to quantify whether the variables and parameters are effective in detecting anomalies. However, no effort was made in the proposed technique to resolve the challenge of sensor drift or of inadequate implementation conditions.
The Hierarchical Temporal Memory (HTM) algorithm has been used in a data stream method for real-time outlier detection for space imagers [27]. The HTM-based algorithm detects both spatial and temporal outliers and fulfils the specifications for real-time, continuous, unsupervised data stream detection. The findings suggest that the algorithm can efficiently achieve real-time detection of anomalies. However, the suggested solution was not able to reduce the false-positive rate.
In 2019, an artificial neural network algorithm was used by Cauteruccio et al. [28], who constructed a novel method for automated anomaly detection in heterogeneous sensor networks based on data exploration at the edge combined with cloud data. The experimental assessment of the proposed solution was carried out on real data gathered in an indoor building environment and then distorted by separate virtual impairments. The research outcomes suggest that the proposed solution can adjust itself to shifts in the environment and identify anomalies correctly. However, the study does not consider the evolving features of the data or the definition of drift.
A new multi-source approach for the detection of multi-dimensional data anomalies (MDADM) was developed [29]. It predicts, in real time, the probability of abnormal behavior occurring in underground mining and is based on a hierarchical edge computing model. Results show that the method achieves higher detection performance and lower transmission delay than conventional methods. However, the authors overlooked the evolution of the data and the fact that its characteristics may change over time, which affects the overall accuracy of the proposed method for anomaly detection.
A new approach to classifying non-stationary data streams, addressing anomaly detection problems such as infinite length, concept drift, recurring concepts, and concept evolution, was suggested based on a multi-kernel approach [30]. The kernels are modified in the stream periodically after acquiring the precise labels of the instances. In the function spaces, newly arrived instances are classified according to their distance from the boundaries of the previously identified groups. The experimental results indicate that the proposed approach is superior to the state-of-the-art approaches in this area. However, the study does not consider evolving clusters.
For the first time, an outlier detection algorithm called xStream, which addresses feature evolution in data streams, was suggested by Manzoor et al. [31]. The proposed algorithm addresses a problem that had not been investigated previously. xStream is a density-based outlier detector with three main characteristics: (1) it is a constant-space and constant-time algorithm (per incoming update); (2) it measures anomalies at several scales or granularities; and (3) it can handle high dimensionality through distance-preserving projections, and non-stationarity, as the stream progresses, through constant-time model updates. Experiments show that xStream is effective and precise on evolving streams, with moderate space overhead, where no competing method exists. However, the challenge of evolving clusters, which affects the overall outlier detection, has not been attempted.
A regression-based strategy was developed by Farshichi et al. [32] for detecting contextual anomalies in air traffic control systems. They also provide actionable improvement details by applying change-detection algorithms and time windows to contextual anomalies. The evaluation of the proposed model shows highly accurate anomaly detection with low delay. However, the researchers did not discuss the challenge of fault detection initiatives.
An SVM algorithm was effectively designed by Bose et al. [33] to provide drivers with real-time warnings and directions about road anomalies and ensure a safe driving experience. The proposed system uses a common machine learning technique, SVM, to categorize driving events, such as acceleration and braking, and road anomalies, such as bumps and potholes, and it gives drivers real-time alerts and guidance using a local Fast Dynamic Time Warping (FastDTW) algorithm. Nevertheless, the study fails to consider the different categories and features of objects moving on the roads.
In 2018, HTM was used by Rodriguez et al. [34] to detect outliers in real-time network metrics obtained by continuously analyzing the resource utilization of executed workflow tasks. The framework can process data streams online in an unsupervised fashion and successfully adapt to changes in the data based on the underlying statistics. The experimental results illustrate the proposed model's ability to correctly catch output deviations in different resource usage parameters caused by various conflicting workloads introduced into the program. However, a weakness of this approach is that detection accuracy and latency still leave room for improvement.
Similarly, HTM has been used to create a new method of detecting anomalies based on online sequence memory [35]. The proposed method can detect spatial and temporal anomalies in predictable and noisy domains. The technique satisfies the criteria of unsupervised, real-time, continuous, online detection. The results indicate an improvement in the detection rate compared with the state of the art. However, the research did not evaluate the technique on a real dataset containing high-dimensional anomalies.
A fully online method, called CEDAS, has been introduced for clustering evolving data streams into arbitrarily shaped clusters [36]. It is a two-stage solution that is effective, noise-resistant, and efficient in memory and computation, with low latency as the number of data dimensions increases. The results revealed the proposed algorithm's ability to deal with changing data sources entirely online. The proposed algorithm compares favorably with similar techniques in speed, cluster purity, precision, and memory use. However, this technique does not consider feature evolution or the high dimensionality of the data.
On the other hand, a newer method called MuDi-Stream [37] consists of four main components spread across online and offline phases. The online phase maintains summary information about the evolving multi-density data stream in core mini-clusters, while the offline phase uses an adapted density-based clustering algorithm to generate the final clusters. A grid-based approach serves as an outlier buffer to handle both noise and multi-density data. The algorithm was evaluated on various synthetic and real-world datasets using different quality metrics, and scalability results are presented. The experimental results show that the proposed approach increases the effectiveness of clustering in multi-density environments. However, the main weakness of this method is that, as the number of empty grids increases, it cannot handle high-dimensional data, which slows down processing.
Fast anomaly detection algorithms using the Extreme Learning Machine (ELM) were suggested [38] to discover operationally significant anomalies. The method is applied to large aviation datasets to address the elevated computational training time. The authors carried out the experiments in an unsupervised fashion on a real benchmark aviation safety dataset comprising 43,000 flights of radar information, flight trajectories, and distances from nearby aircraft. Nevertheless, the authors did not use the benchmarked data to demonstrate the performance of the proposed method and compare it with state-of-the-art techniques.
Xue et al. [39] proposed a novel dynamic anomaly detection framework for time-evolving attributed networks called AMAD. AMAD takes advantage of the evolving characteristics of the underlying attributed networks and models. Specifically, AMAD progressively updates the detection results by measuring the difference in residuals between two consecutive timestamps and applying it to the previous residuals. To prevent the negative effects of unwanted and noisy features, the proposed method performs attribute selection while processing the residuals for anomaly detection. Tests performed on both synthetic and real-world datasets show that the proposed method is effective and efficient. Nevertheless, the study fails to handle the high dimensionality of the data and ignores its effect on the power and memory consumption of the framework. Table 1 presents a summary of the machine learning techniques used for anomaly detection in data streams. The table highlights the nature of the data used for the experiments, the types of anomalies found within the data, the windowing model applied, the dataset used, and, finally, the criteria used for evaluating the proposed techniques.

Deep Learning Techniques
The use of deep learning for anomaly detection is one of the latest advancements in this field. To solve the problem of imbalanced classification for non-stationary time series, a time-series anomaly detection approach for KPIs (KPI-TSAD), based on supervised deep learning models with convolutional and long short-term memory (LSTM) networks and a variational auto-encoder (VAE) oversampling algorithm, was developed by Qiu et al. [40]. Yahoo's benchmark anomaly detection datasets were used to verify the performance of KPI-TSAD, and the detector performed well on them. The proposed VAEGEN oversampling algorithm also produced better results than other common oversampling approaches. The primary drawback of this approach is that it requires labelled data to train the model, which in most cases are unavailable.
Another new anomaly detection algorithm based on neural network ensembles, called Streaming Autoencoder (SA), was developed by Dong and Japkowicz [41] for evolving data streams. It is a one-class learner that requires only normal-class data for training. It features a threaded ensemble of autoencoders with continuous learning capabilities. The results show that the new approach effectively identifies anomalies with fewer false alarms. One limitation of this technique, however, is scalability: it loses its capability on large volumes of data.
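SA's one-class training regime, learning from normal data only and flagging points the model cannot explain, can be mimicked with a deliberately tiny stand-in. The sketch below scores points by distance from a learned centroid as a toy proxy for reconstruction error; it is not the SA algorithm, and the 3x-radius rule is an arbitrary illustrative choice.

```python
from statistics import mean

class OneClassCentroid:
    """Toy one-class learner: trained on normal data only, it flags points
    that fall outside what it has learned (a stand-in for the
    reconstruction-error principle used by autoencoder detectors)."""

    def fit(self, normal_values):
        self.center = mean(normal_values)
        # Arbitrary rule: allow up to 3x the largest deviation seen in training.
        self.radius = 3 * max(abs(x - self.center) for x in normal_values)
        return self

    def is_anomaly(self, x):
        return abs(x - self.center) > self.radius

model = OneClassCentroid().fit([9.8, 10.1, 10.0, 9.9, 10.2])  # normal data only
```

An autoencoder replaces the centroid with a learned compression of the normal data, but the operating principle is the same: anomalies are the points the model reconstructs (here, locates) poorly.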
In 2020, neural networks showed promising improvements in the arena of anomaly detection. Wambura et al. [42] used a deep neural network to propose an algorithm named One sketch Fits All Time (OFAT) for accurate long-range forecasting over high-dimensional, feature-evolving time series. The algorithm is designed to address the issue created by the non-stationary nature of feature-evolving time series, in which the length of the input sequence (rows) changes as new data points arrive and their feature values (columns) evolve over time. Experiments performed on real-world datasets and rigorous evaluation showed that OFAT has fast processing time, robust performance, and accurate detection. Yet, one limitation of this approach is that it does not consider real-time interactive forecasting in data streams.
Similarly, for reliable anomaly detection in data streams, a real-time evolving spiking restricted Boltzmann machine technique named e-SREBOM was proposed by Xing et al. [43]. It is a hybrid anti-malware detection technique that is sophisticated, novel, and extremely efficient. It combines the e-SNN and REBOM algorithms, allowing the system to adjust to changes automatically. e-SREBOM can respond to high-complexity problems and provides a high degree of generalization. The proposed technique was evaluated on three-dimensional datasets with high complexity. However, the study did not consider incremental online learning with long-/short-term memory abilities to ensure greater accuracy and efficiency.
In 2020, Nawaratne et al. [44] proposed the Incremental Spatio-Temporal Learner (ISTL) to handle the issues and limitations of anomaly detection and localization in real-time video surveillance. ISTL is an unsupervised deep learning technique that uses active learning and fuzzy aggregation to continually update and discriminate between new anomalies and normal data over time. Three benchmark datasets were used to illustrate and test ISTL for precision, robustness, and computational complexity. The experimental findings confirm the efficiency of the proposed technique. However, one major drawback of this approach is that it fails to reduce false-negative detections, which affects the overall detection accuracy.
Table 2 presents a summary of the deep learning techniques used for anomaly detection in data streams. The table highlights the nature of the data used for the experiments, the types of anomalies found within the data, the windowing model applied, the dataset used, and the criteria used for evaluating the proposed techniques.

Nature of the Data
The selection of an anomaly detection technique depends on the nature of the data being analyzed. The Internet of Things (IoT) is a major source of data streams. A data stream is a continuous series of data records ordered by implicit or explicit timestamps [14]. This data sequence has three key features. First, it is a stream of continuous data flow; hence, the algorithm should process the data in a limited time. Second, the data stream is unlimited; in other words, the number of incoming data points is infinite, so storing such a massive volume of data is another major obstacle. Lastly, data streams change over time, i.e., they evolve [16]. Detecting anomalies in the data stream has practical and important applications across a wide range of fields [15]. In realistic implementations, the importance of detecting anomalies in data streams is growing rapidly, driven by the need for precision and immediacy [15]. Several learning algorithms have been proposed for anomaly detection in data streams, such as C-LOF [14], AutoCloud [24], TEDA clustering [25], evolving spiking neural networks [43], the combination of BDLMs and RBPF [26], KPI-TSAD [40], HTM [27], neural network ensembles [41], multiple kernel learning [30], xStream [31], CEDAS [36], MuDi-Stream [37], Long Short-Term Memory [15,45], density-based clustering [46], and others.

Anomaly Types
Anomalies are observations that deviate so much from other observations as to arouse suspicion that they were generated by a different mechanism [47]. Anomalies can be categorized as follows.

Point Anomaly
A point anomaly occurs when a single data point differs significantly from the expected pattern. Detecting this type of anomaly involves observing any point that stands out from the rest of the data flow. It is also referred to as an outlier [19]. Figure 3 demonstrates a point anomaly.
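A point anomaly of this kind can be flagged with a simple robust score. The sketch below uses the MAD-based modified z-score; the 0.6745 constant and the 3.5 cutoff are conventional choices for this statistic, not values taken from the surveyed works.

```python
from statistics import median

def point_anomalies(xs, cutoff=3.5):
    """Flag point anomalies via the modified z-score, which uses the median
    and the median absolute deviation (MAD) so that the anomaly itself
    does not distort the baseline."""
    med = median(xs)
    mad = median(abs(x - med) for x in xs)
    if mad == 0:
        return [False] * len(xs)
    return [abs(0.6745 * (x - med) / mad) > cutoff for x in xs]

readings = [22.1, 22.4, 21.9, 22.0, 35.0, 22.2]   # one spike in a flat series
flags = point_anomalies(readings)
```

Using the median rather than the mean keeps the baseline stable even when the outlier is extreme, which is why robust statistics are popular for point-anomaly scoring.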


Contextual Anomaly
A contextual anomaly is a data pattern that is normal in one scenario but anomalous in another. Detecting contextual anomalies requires an understanding of the context [19], and this type typically occurs in time-series data sources. A common example is heavy traffic during rush hour, which is normal, whereas the same traffic activity after midnight, perhaps caused by a crash, poor visibility, or foggy conditions, may be contextually anomalous. An example of a contextual anomaly is shown in Figure 4.
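The context dependence described above can be made concrete with a small sketch that fits a separate baseline per context, so that the same value is judged differently at rush hour and at night. The contexts, numbers, and z threshold here are all illustrative.

```python
from collections import defaultdict
from statistics import mean, stdev

def fit_baselines(history):
    """Build a (mean, stdev) baseline per context from historical records."""
    by_ctx = defaultdict(list)
    for ctx, value in history:
        by_ctx[ctx].append(value)
    return {ctx: (mean(vals), stdev(vals)) for ctx, vals in by_ctx.items()}

def is_contextual_anomaly(baselines, ctx, value, z=3.0):
    """A value is anomalous only relative to its own context's baseline."""
    mu, sigma = baselines[ctx]
    return sigma > 0 and abs(value - mu) / sigma > z

# Hypothetical vehicles-per-hour counts by time-of-day context.
history = [("rush", 900), ("rush", 950), ("rush", 870), ("rush", 920),
           ("night", 40), ("night", 35), ("night", 55), ("night", 45)]
baselines = fit_baselines(history)
# 900 vehicles/hour is normal at rush hour but anomalous at night.
```

A global threshold over all counts would miss this: 900 is well inside the overall range, and only the per-context baseline reveals the anomaly.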


Collective Anomaly
A series of observations is evaluated to recognize collective anomalous behavior: individual observations may appear normal, but their occurrence together deviates from the usual pattern over consecutive time intervals [19]. For example, a single time-interval measurement is not enough to evaluate the operation of the heart, whereas the cumulative signal can determine normal or anomalous behavior, as illustrated in Figure 5.
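A minimal sketch of this idea is shown below: each reading alone lies in a normal range, but a long run of near-constant values (e.g., a flatlining sensor) forms a collective anomaly. The signal, tolerance, and run length are illustrative assumptions, not values from the cited works.

```python
def flatline_segments(signal, min_len=4, tol=0.05):
    """Report runs of near-constant values as (start, end) index
    pairs. Each reading alone looks normal, but a long flat run is
    anomalous as a sequence, i.e., a collective anomaly."""
    runs, start = [], 0
    for i in range(1, len(signal) + 1):
        # Close the current run when the signal moves away from its
        # starting value (or when the stream ends).
        if i == len(signal) or abs(signal[i] - signal[start]) > tol:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = i
    return runs

# Hypothetical pulse-like signal: indices 3-7 flatline.
pulse = [0.8, 1.0, 0.7, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 1.1]
print(flatline_segments(pulse))  # → [(3, 7)]
```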

Anomaly Detection Learning Modes
Anomaly detection learning modes can be classified into three types based on the availability of labels in the datasets used to create baselines [48], including supervised, semi-supervised, and unsupervised anomaly detection, as shown in Figure 6.

Supervised Anomaly Detection
The supervised anomaly detection approach detects anomalies by learning a set of rules from labelled data that are then used to classify future data. One common supervised technique is classification-based anomaly detection [18]. Supervised anomaly detection can be divided into two categories, namely, classification and regression.

Classification
The Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In smart transportation [33], SVM is used to provide drivers with real-time alerts and instructions about anomalies on the streets to ensure a safe driving experience. Similarly, in smart cities [49], SVM is utilized for predicting anomalies and attacks that target IoT systems.
The Naive Bayes classification method is based on Bayes' theorem and is particularly suitable when the dimensionality of the input is high. Bayesian techniques have been used in smart homes [34] to propose a two-tier machine-learning intrusion detection system that classifies records and monitors suspicious patterns across the network in the service provider's data center. Additionally, Gunupudi et al. [50] used naive Bayes classifiers to propose a self-constructing feature clustering method for anomaly detection.
The K-Nearest Neighbor (k-NN) algorithm is a supervised machine learning method used for solving classification and regression problems by assuming similarity among devices deployed in proximate locations. The k-NN has been implemented in industrial systems by Wu et al. [12] to detect cyber-physical attacks on cyber-manufacturing systems. Similarly, k-NN was used in [50] to propose a self-constructing feature clustering method for detecting anomalies.
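A minimal k-NN classifier of the kind described above can be sketched as follows. The training points, labels, and choice of k are hypothetical illustrations, not data from the cited studies.

```python
from math import dist  # Euclidean distance, Python 3.8+
from collections import Counter

def knn_predict(train, point, k=3):
    """Classify `point` by majority vote among the labels of its
    k nearest training examples (Euclidean distance)."""
    neighbours = sorted(train, key=lambda ex: dist(ex[0], point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical labelled readings: a tight "normal" cluster and a
# distant "anomaly" cluster.
train = [((0.1, 0.2), "normal"), ((0.2, 0.1), "normal"),
         ((0.15, 0.15), "normal"), ((5.0, 5.1), "anomaly"),
         ((5.2, 4.9), "anomaly")]
print(knn_predict(train, (5.1, 5.0)))  # → anomaly
```

With k=3, a query near the anomalous cluster still gets two anomaly votes against one normal vote, so the majority rule labels it correctly.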

Regression
Regression algorithms use the input features to predict the output values of the data fed into the system. For example, in intelligent transportation, a regression-based model was developed by Farshichi et al. [32] to detect contextual anomalies in the air traffic control framework by identifying the correlation between accidents and resource measurements in the log reports.
The Decision Tree approach constructs classification or regression models in a tree structure. It separates the dataset into smaller subsets while an associated decision tree is incrementally constructed. A decision tree has been used to predict anomalies and attacks in smart cities' IoT systems [49].

Semi-Supervised Anomaly Detection
Semi-supervised anomaly detection is an approach in which a model is trained only on normal data, and any observations deviating from that model are classified as anomalies [18].
In the healthcare system, Bayesian network-based methods have been used in [27] to propose a long-term, semi-supervised monitoring system based on IoT to assess the quality of maternal sleep. Similarly, a reinforcement-learning technique was suggested in intelligent transportation by Lu et al. [51] to detect the motor outlier over the temperature data stream of the unmanned aerial vehicle. Temporal and spatial-temporal techniques have also been suggested in the smart object domain by Chen et al. [52] to detect anomalous behavior within the environmental datasets.
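The semi-supervised setting described above can be sketched as fitting a profile from normal-only data and flagging deviations from it. The temperature-like values and the three-standard-deviation rule below are assumptions for illustration, not methods from the cited works.

```python
from statistics import mean, stdev

def fit_normal_profile(normal_values):
    """Learn a simple (mean, std) profile from anomaly-free data
    only, as in the semi-supervised setting."""
    return mean(normal_values), stdev(normal_values)

def is_anomalous(value, profile, k=3.0):
    """Flag a value deviating from the normal profile by more than
    k standard deviations."""
    mu, sigma = profile
    return abs(value - mu) > k * sigma

# Hypothetical body-temperature readings known to be normal.
profile = fit_normal_profile([36.5, 36.7, 36.6, 36.8, 36.4])
print(is_anomalous(39.5, profile), is_anomalous(36.7, profile))  # → True False
```

Note that only normal examples are needed at training time; no anomalous labels are required, which matches the semi-supervised assumption.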

Unsupervised Anomaly Detection
Unsupervised anomaly detection operates on unlabeled data and does not require separate training and testing stages. Clustering-based anomaly detection is a typical example [18]. The unsupervised anomaly detection approach is grouped into one category, namely clustering, as shown below.

Clustering
The objective of using clustering algorithms is to identify the normal data in the input. The input space has a structure in which certain patterns occur more frequently than others, and the goal is to see what usually happens and what does not. In statistics, this step is called density estimation, and clustering is one method for estimating the density. For example, coupling edge data analysis with cloud data analysis enables automatic detection of anomalies in heterogeneous sensor networks.
In an intelligent transportation system, Luo and Zhong [53] built a stacked denoising autoencoder model to detect anomalies in a gas turbine engine. Furthermore, a probabilistic neural network (PNN) was evaluated in the industrial system over a real-time thermal power system dataset to explore the anomalies [54]. In smart cities, neural network algorithms have been utilized [55] to detect attacks on IoT architecture. Similarly, neural networks were used by Legrand et al. [56] to detect outliers in a large-scale smart home dataset.
One of the most frequently used unsupervised machine-learning algorithms is K-means clustering. Janakiraman and Nielsen [38] used K-means clustering to design an anomaly detection and firewall method for IoT site microservices in smart cities. A Gaussian mixture model (GMM) is a probabilistic method that assumes all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. K-means clustering combined with GMM has been used in road transport by Riveiro et al. [57] to propose a visual analytics framework for road traffic anomaly detection.
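As an illustration of clustering-based detection, the sketch below runs plain Lloyd's K-means and scores each point by its distance to the nearest centroid, so isolated points score highest. The data, the seeding by the first k points, and the scoring rule are simplifying assumptions, not the methods of the cited papers (which use richer models).

```python
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; the first k points seed the
    centroids (a deliberately naive choice; k-means++ would be
    preferable in practice)."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

def anomaly_scores(points, centroids):
    """Score each point by distance to its nearest centroid; large
    scores suggest candidate anomalies."""
    return [min(dist(p, c) for c in centroids) for p in points]

# Two tight hypothetical clusters plus one isolated reading.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (10.0, 10.0), (10.1, 10.0), (10.0, 10.1),
          (5.0, 20.0)]
scores = anomaly_scores(points, kmeans(points, k=2))
# The isolated reading (5.0, 20.0) receives the largest score.
```

Thresholding these scores turns the clustering into an unsupervised detector; the threshold itself would be application-dependent.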
In the healthcare domain, an HMM was used in [58] to design a non-intrusive sleep analyzer for detecting real-time sleep anomalies. In the industrial system, Zang et al. [59] proposed a special feature-extraction method for detecting anomalies within time series based on the transfer probability of a Markov chain. Furthermore, Kumar et al. [60] used clustering for monitoring the energy consumption of indoor office devices. In intelligent transportation, He et al. [61] proposed a structured sparse subspace clustering method to detect anomalies. Similarly, Han et al. [62] designed an ANOVA-based technique to detect a vehicle's anomalous behavior.
Clustering has also been used in database management systems for predictive modelling and anomaly detection. In this application domain, an incremental equivalence class transformation (i-Eclat) algorithm has been proposed to serve as the association rule mining database engine for testing frequent itemset mining (FIMI) datasets from an online repository [10].

Window Models
In time window modeling, the data are divided into basic windows, and these basic windows are utilized as update units. There are three types of window models, as listed below [63]:
• Fading (damped) window model: a weight is allocated to every data point based on a fading concept, with more weight allocated to the latest data than to old data. Using a damped window model reduces the impact of old data on the mining performance (Figure 7a).
• Landmark window model: the window is defined by a particular time point, called a landmark, and the current time point. It is utilized for mining throughout the history of the data stream (Figure 7b).
• Sliding window model: data are counted from a certain range in the past up to the present. The concept behind the "sliding window" is to perform a thorough review of the most recent data points together with a summarization of the old data points (Figure 7c).
A summary of the window models, including their pros and cons, is given in Table 3. The selection of a window model varies based on the requirements of the application domain [63].
Figure 7. Window models.

Table 3. Window models in clustering data streams.
• Fading (damped) window model. Definition: assigns weights to data points. Pros: compatible with applications in which old data has a significant effect on the mining results, with that effect decreasing (fading) over time. Cons: infinite time window; the window collects the entire history of the data, so its size continues to expand as time passes.
• Landmark window model. Definition: analyzes the complete data stream history. Pros: compatible with one-pass clustering algorithms. Cons: all data are treated as equally relevant, and the volume of data within the window rapidly grows to unprocessable sizes.
• Sliding window model. Definition: analyzes only the very latest data points. Pros: compatible with applications in which interest exists only in the very latest data, such as stock marketing. Cons: disregards part of the stream.
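A sliding-window detector of the kind summarized above can be sketched as follows: each new point is judged only against statistics of the last few points. The window size, the 3-sigma rule, and the readings are illustrative assumptions, not values from the surveyed techniques.

```python
from collections import deque
from statistics import mean, stdev

def sliding_window_flags(stream, window=5, k=3.0):
    """Flag each incoming point against the mean/std of only the
    last `window` points (sliding window model); older points fall
    out of the deque automatically."""
    buf, flags = deque(maxlen=window), []
    for x in stream:
        if len(buf) == window:
            mu, sigma = mean(buf), stdev(buf)
            flags.append(sigma > 0 and abs(x - mu) > k * sigma)
        else:
            flags.append(False)  # not enough history yet
        buf.append(x)
    return flags

# Hypothetical stream with one spike once the window has filled.
readings = [10.0, 11.0, 10.0, 11.0, 10.0, 50.0, 10.0]
print(sliding_window_flags(readings))
# → [False, False, False, False, False, True, False]
```

Note the point after the spike is not flagged: the spike itself has entered the window and inflated the local standard deviation, which illustrates how the window contents shape the decision.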

Datasets
To evaluate anomaly detection techniques, researchers use real, synthetic, or altered real datasets. In this review, we found that most techniques were evaluated using real datasets that are publicly available. One of the most commonly used datasets in the field of anomaly detection is KDD99, which has been used in many studies such as [30,36,37]. Similarly, the Numenta Anomaly Benchmark (NAB) dataset has recently gained popularity, as it consists of seven categories of datasets injected with anomalies, and it has been used in [35]. Yahoo is another publicly available real-world dataset and has been used in [40]. Other real-world benchmark datasets are also available and have been used by many researchers. However, when real data for some application domains are not available, researchers tend to simulate the real data and generate data that reflect the real-world environment, as in the case of [14,25,29]. With synthetic data, researchers have the advantage of labeled data, and they can inject outliers/anomalies into the data to evaluate the detection performance of the techniques used.

Evaluation Criteria
To measure the performance of anomaly detection techniques, the "accuracy" metric was used in [27][28][29]. Nevertheless, in the event of imbalanced datasets, the reported accuracy will not offer an accurate representation of the technique's efficiency. To measure the performance more reliably, metrics such as True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN), precision, recall, and F1 scores are used [34][35][36][37]. TP indicates how many positive cases are accurately detected. TN indicates how many negative cases are accurately labeled as negative. FP counts the cases falsely labeled as positive. Similarly, FN counts the cases that are positive but falsely labeled as negative. Precision is the number of class members classified accurately over the total number of cases classified as class members [14]. Recall is the number of class members classified correctly over the total number of class members [32]. In anomaly detection, both high precision and high recall are needed to develop a high-quality technique. In such scenarios, the F-measure is applied to give equal importance to precision and recall. Recently, receiver operating characteristic (ROC) analysis, such as the Area Under the Curve (AUC), has also gained popularity in measuring an anomaly detection technique's performance [39]. AUC is the two-dimensional area underneath the ROC curve and can be interpreted as the probability that the model ranks a random positive example higher than a random negative example. It offers a more robust evaluation than accuracy in cases of imbalanced datasets and is commonly used with machine learning and deep learning techniques.
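The metrics above follow directly from the confusion counts, as the short sketch below shows; the labels and predictions are hypothetical, and the helper names are ours.

```python
def confusion_counts(y_true, y_pred, positive="anomaly"):
    """Count TP, FP, FN, and TN for a chosen positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1), guarding
    against division by zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical ground truth vs. detector output.
y_true = ["anomaly", "normal", "anomaly", "normal", "normal", "anomaly"]
y_pred = ["anomaly", "anomaly", "normal", "normal", "normal", "anomaly"]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # → 2 1 1 2
```

Here both precision and recall come out to 2/3, so F1 also equals 2/3, illustrating why the F-measure balances the two.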

Results
Based on our review, many open research challenges and issues remain to be resolved despite the progress seen in anomaly detection research. Furthermore, future directions are most needed for anomaly detection approaches in the data stream setting. Therefore, the focus of this research is limited to highlighting open challenges related to anomaly detection in the data stream, which are presented in Section 5.
Since a massive volume of data arrives in the form of data streams characterized by several problems, it is necessary to address the challenge of detecting anomalies within evolving data streams efficiently. Most existing data stream algorithms for anomaly detection lose their effectiveness in the presence of high-dimensional data. Therefore, it is necessary to redesign the current models to detect anomalies accurately and efficiently. More precisely, where there are many features, there may be a set of anomalies that appear at a given time in only a subset of dimensions. The same set of anomalies may seem natural with respect to a different subset of dimensions and/or time frame.
Addressing the anomaly detection problem in a feature-evolving data stream is another primary concern. This challenge occurs when the data evolve over time and the features of the data change; new/old data dimensions appear/disappear over time. It is a promising field with several possible use cases, such as detecting anomalies in IoT devices. Moreover, another challenge is handling and processing data in real time when data points arrive constantly from the data source. Data streams can be massive and should be handled in one pass. In addition, the algorithm should be able to process data within a given memory budget, so that enormous amounts of data do not impair the stream's processing capabilities. Thus, a data stream algorithm should not require unlimited memory for the unlimited data points arriving in the system; instead, it should process the data within the available memory. Furthermore, the demand for large-scale IoT deployment is increasing rapidly, which raises significant security issues. Yet, a concern arises regarding the scalability of anomaly detection and how machine learning algorithms can handle large-scale datasets. Scalability is a significant issue faced by most existing anomaly detection techniques, some of which lose their efficiency in large-scale deployments. Table 4 summarizes the reviewed anomaly detection techniques and their capability to meet the criteria that describe a good-quality anomaly detection technique. The reviewed techniques have been evaluated on their ability to perform data projection, handle noisy data, and work within limited memory and time, as well as on their ability to address evolving data, high-dimensional data, and evolving features, and finally on their scalability.

Research Challenges and Future Directions
Even though various anomaly detection techniques have been proposed in the literature, there are still several issues to be solved for anomaly detection. Currently, there is no single best technique for the problem; rather, there are several techniques that may be more applicable to certain data types and certain application domains. Below we present a summary of the major challenges found within the state-of-the-art techniques reviewed:

1. Evolving Data Stream
As a vast volume of data arrives in the form of data streams in which anomalies may appear, efficiently addressing the challenging task of detecting anomalies in evolving data streams is necessary [25].
Data streams pose additional detection problems, such as detecting within restricted memory and limited time, updating the model as new data arrive, and managing data in a changing fashion to capture the fundamental changes while detecting anomalies [29]. Data evolution requires algorithms to adjust their configuration and parameters over time as new knowledge arises. Unlike with static records, detection algorithms often fail to adjust to complex conditions, such as the ever-changing IoT domain [64].
Furthermore, most existing techniques are less efficient at detecting anomalies in data streams and have poor capability to meet their requirements [15]. Detecting anomalies in the IoT data stream environment, known for its evolving characteristics, often results in low detection accuracy with a high false-positive rate [43]. The evolving data stream is a challenge that must be addressed in the environment of IoT anomaly detection [24,65].

2. Feature-Evolving
Another difficulty is solving the anomaly detection problem in a feature-evolving data source. The concern is that not only do the data change, but the properties of the data also shift; over time, new/old dimensions of data appear/disappear. This area is promising, with many potential applications, such as outlier detection in IoT systems in which sensors periodically go off/on (changing the number of dimensions) [31].

3. Windowing
Windowing limits precision because data are processed over short, fixed time intervals [66]. Another major challenge is determining the ideal frequency for retraining the models, because most current approaches use predefined interval timing [66,67].

4. Ensemble Approaches
Another area of growth is ensemble approaches. Ensemble methods are well established for increasing the efficacy of anomaly detection by combining multiple detectors to improve accuracy [41]. Ensemble-based anomaly detection is therefore a worthwhile course of research, showing great promise in boosting the detection accuracy of the algorithms, and more specific models can be recommended for resolving unexplored regions. Initial attempts to apply ensembles to detect anomalies within the data stream environment have been made; however, this field of study is still under-explored and requires more comprehensive models.

5. Nature of Input Data
Many existing challenges need to be solved within IoT anomaly detection. As highlighted by Azimi et al. [68], the availability of labelled data is a major issue in IoT anomaly detection because anomalies may not occur regularly. In addition, obtaining real system data is complicated and requires a lengthy process to access the operating system data [19]. A wide gap exists in formalizing the acquisition of knowledge logs and sensory data flows, developing a model, and validating it in real-life environments.
In the literature, many experiments have been reported that are linked mainly to the usual behavior of the system [19]. Most developed methods are based on training over normal behavior, and anything that differs from the normally labelled data is considered anomalous. More precise and reliable techniques are required to deal with the complicated datasets of real scenarios.
In addition, the availability of suitable public datasets for anomaly detection is generally a key issue for training and validating techniques for real-time anomaly detection [69]. Such datasets must contain a wide variety of new normal and abnormal behaviors, and they should be labelled clearly and constantly updated to cover new types of abnormal behavioral threats. Most existing datasets for anomaly detection suffer from incorrect labelling, poor diversity of attacks, and incompatibility with real-time detection [70]. New datasets for anomaly detection demand realistic environments with a variety of normal and abnormal scenarios. Additionally, ground truth that includes the anomalies must be produced to increase the credibility of the dataset when testing a new anomaly detection system.

6. Data Complexity and Noise
Data complexity, such as imbalanced datasets, unexpected noise, and redundancy within the data, is one of the main challenges in developing a model for anomaly detection [40]. Well-developed approaches for curating the datasets are required to extract useful information and knowledge.

7. Parameters Selection
IoT data streams are often generated from non-stationary environments with no advanced information on the data distribution, which affects the choice of a proper set of model parameters for detecting anomalies [25].

8. Data Visualizations
The analysis of existing work on anomaly visualization has highlighted a gap: new techniques and solutions are needed for visual analytics systems to be implemented. Therefore, these gaps in the anomaly detection process should be explored [8].

9. Heterogeneity of Data
Heterogeneous IoT sensors and devices are sources of an unlimited volume of data streams that capture all kinds of environmental characteristics, such as light, temperature, humidity, noise, electric current, voltage, and power [28]. Such data streams require instant processing to handle timely and critical scenarios, such as patient healthcare monitoring and environmental safety monitoring [71]. Every device can transmit data many times per second, and with a large number of connected devices, a dedicated data processing platform may be necessary to deal with billions of such incoming events every day [72].

10. Time Complexity
The data stream's main characteristic is its massive volume of data arriving continuously, which requires the algorithm to work in real-time. However, a major challenge would be the time complexity of detecting the anomalies [14,73,74] because there is always a trade-off between accuracy and time complexity.

11. Accuracy
Despite having the capabilities of learning algorithms to detect and identify anomalous behavior in real-time, these algorithms are still subject to optimization to improve accuracy, including the reduction in the false positive detection rate, especially in large-scale sensor networks [15,27,69,75].

12. Scalability
Scalability is another major requirement for anomaly detection algorithms because many algorithms lose their efficiency when handling a large volume of data [41].

13. High-Dimensional Data
Most current data stream algorithms for anomaly detection lose their effectiveness in the presence of high-dimensional data [25]. Therefore, existing models must be redesigned to detect outliers accurately and efficiently. More precisely, when many characteristics are observed, a set of outliers may appear at a given time in only a subset of dimensions. The same set of outliers may appear natural with respect to a different subset of dimensions and/or time frame.
The large number of features is another challenge faced by anomaly detection algorithms when selecting the most important data features [37]. Therefore, feature reduction is critical in choosing the most important features that represent the whole data.
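A crude filter-style feature reduction can be sketched as ranking columns by variance and keeping the most variable ones; the data and the variance criterion are illustrative assumptions, and real pipelines would typically use richer criteria (mutual information, PCA, etc.).

```python
from statistics import pvariance

def top_variance_features(rows, n):
    """Rank feature columns by population variance and return the
    indices of the n most variable ones, in ascending index order."""
    columns = list(zip(*rows))  # transpose rows into feature columns
    ranked = sorted(range(len(columns)),
                    key=lambda j: pvariance(columns[j]), reverse=True)
    return sorted(ranked[:n])

# Hypothetical readings: column 1 is constant, column 2 barely moves.
data = [(1.0, 5.0, 0.1), (2.0, 5.0, 0.1), (3.0, 5.0, 0.2), (4.0, 5.0, 0.1)]
print(top_variance_features(data, 1))  # → [0]
```

A near-constant feature carries little information for distinguishing anomalies, which is why variance is a common first-pass filter.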

Conclusions
Anomaly detection has attracted significant attention among researchers in recent years, due to the advancement of sensing technologies characterized by low cost and high impact in diverse application domains. Detection of anomalies greatly reduces functional threats, removes unseen complications, and prevents process downtime. Machine learning and deep learning anomaly detection techniques play important roles in detecting data stream anomalies in core implementations across a broad variety of IoT application domains. Yet, there are still several challenges to be addressed to solve the problem of anomaly detection. These challenges include evolving data streams, feature-evolving, windowing, ensemble approaches, nature of data, data complexity and noise, parameters selection, data visualizations, heterogeneity of data, time complexity, accuracy, scalability, and high dimensionality. This paper presents the results of a study on machine learning and deep learning techniques used for anomaly detection in IoT data streams. The review offers a complete overview of the developed techniques, nature of data, types of anomalies, detection learning modes, window models, and the datasets and evaluation metrics used to measure the performance of the proposed techniques. Consequently, this can help the research community to gain detailed information about the latest developed techniques related to anomaly detection in IoT data. Further, the paper also suggests some future research directions that can contribute to the development of new techniques, which could help in improving anomaly detection in IoT data.
Apart from that, the paper has some limitations that are to be considered as future work. These limitations include investigating anomaly detection techniques other than machine learning and deep learning, such as statistical techniques. Furthermore, preprocessing of the data used in anomaly detection techniques was also out of the scope of the paper. Finally, analyzing the anomaly detection techniques per use case/application domain was again out of the scope of this paper.