Safety, Security and Privacy in Machine Learning Based Internet of Things

Abstract: Recent developments in communication and information technologies, especially in the internet of things (IoT), have greatly changed and improved the human lifestyle. Due to the easy access to, and increasing demand for, smart devices, the IoT system faces new cyber-physical security and privacy attacks, such as denial of service, spoofing, phishing, obfuscation, jamming, eavesdropping, intrusions, and other unforeseen cyber threats to IoT systems. Traditional tools and techniques are not efficient at preventing and protecting against the new cyber-physical security challenges. Robust, dynamic, and up-to-date security measures are required to secure IoT systems. Machine learning (ML) is considered the most advanced and promising method, and has opened up many research directions to address new security challenges in cyber-physical systems (CPS). This research survey presents the architecture of IoT systems, investigates different attacks on IoT systems, and reviews the latest research directions to solve the safety and security of IoT systems based on machine learning techniques. Moreover, it discusses the potential future research challenges when employing security methods in IoT systems.


Introduction
The rapid advancement of the Internet of Things (IoT) has led to one of the fastest-developing computing paradigms, with an estimated 75 billion smart devices by the end of 2025 [1]. IoT devices can collect, classify, comprehend and respond to their environment [2]. IoT plays an important role in improving smart systems, such as education, homes, agriculture, farming, grid, transportation, etc. [3,4]. The adoption of this new technology has led to a number of benefits, such as efficiency, productivity, new business directions, agility and mobility and, most importantly, cost reductions. However, the increasing demand and growing deployment of IoT has led to serious new security challenges [5]. IoT systems consist of integrative arrangements of devices, which makes the system very complex, and most IoT systems work in an unsolicited and unattended environment. The devices, which typically connect over wireless networks, are often targeted by hackers and intruders, who elicit secret information using phishing, eavesdropping, denial of service, spoofing, obfuscation and jamming [6][7][8][9]. Therefore, the security risk in an IoT system is higher than that for other computing paradigms, and traditional methods may not be effective in overcoming these security issues. Hence, maintaining and managing the security of the IoT system is a very challenging task. A holistic solution is needed to satisfy the security requirements of the IoT system. The existing security methods, such as authentication, encryption, network security and access control, are more challenging to apply and are inadequate for large IoT systems with a number of controlling devices. For instance, in a DDoS attack, the attackers spoof the source IP addresses to attack a target, and legitimate users struggle to access their IoT devices due to the systems' vulnerabilities. Machine learning algorithms have a unique way of solving complex problems and are implemented in many real-time applications [10].
The basic aim of machine learning algorithms is to improve the performance of a task through training and experience, and they can analyse large datasets with extraordinary analytical ability in any cyber-physical system [11,12]. They have been used in many fields, including autonomous vehicles, medicine, image recognition, etc.
The research contributions of this survey are as follows:
• A detailed discussion on the attack surfaces, vulnerabilities, and security threats of IoT systems: we discuss various attack surfaces on IoT devices, i.e., network, physical service, cloud service, application interfaces and web service.
The rest of the paper is organized as follows: Section 2 presents the related works on the topic. Section 3 demonstrates the safety and security threats in IoT. Machine learning techniques and their applications in IoT are presented in Section 4. Issues, research challenges, and future research directions are mentioned in Section 5. Section 6 presents a discussion of the observations and results. Finally, the conclusion is given in Section 7.

An Overview of IoT System
The Internet of Things (IoT) consists of cyber-physical systems, ranging from small sensors to global positioning systems and from near-field communication sensors to radio-frequency identification (RFID) devices, including emergency alarms and detectors [13]. All these IoT devices collect, classify, communicate, comprehend, and respond to their environment in real time. These devices are used to store all types of information, such as light intensity, sound data, electricity consumption, temperature readings, chemical reactions, biological changes, etc. The IoT structure is the interconnection of heterogeneous cyber-physical systems in diverse communication forms, such as machine-to-machine interconnection, man-to-man communication, and machine-to-man interactions.
The main task of the IoT system is to convert conventional objects into smart objects using intelligent devices and processes, such as sensor networks, pervasive computing, internet protocols, communication technologies, and applications. The IoT model consists of physical devices, which interact and integrate with communication networks to deliver smart services and applications. The architecture of the IoT system can be divided into three layers, i.e., application, network, and a presentation or physical layer [14]. The architecture and analysis of IoT are shown in Figure 1.
The three-layer IoT architecture shown in Figure 1 defines the main idea of the IoT system, which is summarised as follows:
1. The presentation or physical layer consists of all the physical objects that are responsible for sensing and collecting information from their surrounding environment. The sensors in this layer identify smart objects in the environment.
2. The network layer consists of connectivity devices and is responsible for connecting and communicating with smart objects, servers, and network devices. Features of this layer are used for communicating and processing the sensor data.
3. The application layer describes the various collaborations, deployments, and applications of IoT, such as smart objects for homes, transport, cities, agriculture and farming, etc. This layer classifies the application's services to a user.

Safety and Security Threats in IoT
IoT integrates physical objects and their surroundings via an internet connection. The devices primarily work in an unsolicited and unwanted online environment. Therefore, intruders and attackers may exploit vulnerable IoT devices and expose private information and credentials from sensors by eavesdropping [15,16]. Threats can be classified as passive or active. Passive threats attempt to use data and information from the system but do not affect the system's resources, e.g., eavesdropping. Active threats refer to attempts by an attacker to alter the data and take control of the hardware. Active threats include Sybil attacks, denial of service (DoS), distributed denial of service (DDoS), Trojans, spoofing, phishing and smishing [17]. The potential security attacks that may affect the security requirements, i.e., authorisation, authentication, confidentiality, availability, integrity, and non-repudiation, are shown in Figure 2.

IoT Attack Surface(s)
Possible attack surfaces and potential threats are discussed in this section. The attack surfaces of IoT can be divided into network service, physical device, web and application services, and cloud service, as shown in Figure 3. The physical surfaces are the primary target of cyber-physical threats. This IoT surface consists of sensors, RFIDs, and actuators. The sensors are used to collect all types of data from the IoT environment. RFIDs play an important role in wireless network communication through automatic identification with a unique identifier. This surface is physically reachable and is the most vulnerable to physical threats. Most physical devices hold valuable information and are resource-constrained, making them a potential attack surface. For instance, device information can be tracked through access requests to physical devices, which may lead to threats such as DoS, DDoS or other cyber-attacks [18].

Attack Surface at Network Service
The network service of an IoT system consists of two main parts, i.e., RFID and WSN [19]. Both parts are vulnerable to cyber threats: possible attacks on RFID at the presentation layer include Sybil, synchronization, and replay attacks; possible attacks at the network layer are false routing and eavesdropping; possible attacks at the application layer are buffer overflow and injection attacks. The possible attacks on the network layer of WSN include the Sybil attack, jamming and replay attacks. These security threats arise when the IoT system is directly integrated with traditional networks, as traditional network systems are no longer secure [17][18][19].
The routing protocol is another potential attack surface on the IoT network service, which may face serious security threats. Therefore, a secure routing protocol is necessary for a safe and secure IoT system. Attackers can also attack open ports and obtain information such as the MAC address, the router's IP address and the network gateway [20].
The following graphs show IoT device trends and the losses caused by cyber security attacks in global markets. Figure 4 shows the number of IoT devices in the last two years, i.e., 2020 and 2021, and a projected estimate of more than 75 billion devices by 2025. Figure 5 presents the average losses on the global market caused by cyber security attacks in the last four years. These losses are expected to reach USD 8000 million in 2022 (source: Statista.com).

Attack Surface at Cloud Service
Cloud computing provides a distributed service for storing information and obtaining it at any time and from anywhere [21]. The integration of the cloud-computing paradigm and IoT offers great opportunities for efficient IoT systems. However, several security issues arise with this integration, as the distributed system is more vulnerable to various cyber-security threats, such as: (a) Malicious attacks that exploit flaws in data security to gain unauthorized access, e.g., cross-site scripting, SQL injection, and cross-site request forgery [22]. (b) Inadequate integrity controls at the data level, which may expose the system to security threats. (c) Security flaws in virtualisation software, which may be used for malicious attacks, e.g., weak authentication of a virtual server may permit a guest operating system to run code. Privacy and security risks differ between cloud services according to the terms and conditions agreed between the cloud user and the service provider [22]. This integration may also introduce several privacy concerns by exposing personal and confidential information, e.g., personal home-based sensor data, medical data and reports, etc. To ensure the secure and reliable integration of cloud services and IoT systems, it is necessary to minimize possible cyber attacks [22,23].

Attack Surface at Web and Application Surface
Most IoT systems provide remote access services to users through web-based or mobile applications. Smartphones and Android-based systems have become popular and captured a huge market due to the use of open-architecture APIs among developers, including malware developers. This permits users to access applications, including malicious online applications, which can store data without any security checks [24]. A malware designer can attack IoT-based systems by extracting device information, exploiting potential vulnerabilities, and creating botnets. Android applications may leak the user's private information, which can expose the user to security risks such as bluejacking, eavesdropping, smishing, bluesnarfing, tracking, and DDoS [24].

Machine Learning for IoT Security
Machine learning (ML) algorithms have a unique way of solving complex problems and are implemented in many real-time applications [25]. The basic aim of machine learning algorithms is to solve complex tasks and improve performance using training and experience. ML algorithms automatically develop and control machines through experience with low computational costs [26]. In this section, we first review the ML techniques in detail, then discuss the application of these algorithms in the field of IoT security. At the end of the section, other emerging techniques for IoT security are briefly discussed.

Machine Learning (ML) Techniques
Machine learning (ML) algorithms are categorized into supervised, unsupervised, and reinforcement learning.

Supervised Machine Learning
Supervised machine learning algorithms learn from labelled training data. They analyze the training data and infer a function that can be used to map new data samples. Supervised learning can be categorized into classification and regression. In classification, the set of output classes is finite and can be binary, as in anomaly detection, or multi-class, as in speech and face recognition. Regression is used to identify the relationship between one dependent variable and one or more independent variables, and is often used to predict future outcomes.
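As a minimal sketch of the regression case, the following fits a straight line to labelled samples by ordinary least squares and uses it to predict a new outcome. The function names and the toy readings are assumptions made for illustration; they do not come from any study cited in this survey.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to labelled samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Map a new sample to a predicted outcome using the fitted line."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical labelled data: the independent variable (e.g., hours of operation)
# and the dependent variable (e.g., energy consumed). Values are invented.
hours = [1.0, 2.0, 3.0, 4.0]
energy = [3.0, 5.0, 7.0, 9.0]

model = fit_line(hours, energy)
print(model, predict(model, 5.0))
```

The labelled pairs play the role of the training data, and the fitted line is the inferred function that maps unseen inputs to predictions.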

Unsupervised Learning
In unsupervised learning, all the input data samples are unlabeled, so the trained model does not rely on input-output classes. Typical unsupervised learning tasks are dimensionality reduction, clustering, and density estimation. The clustering method is used to group data in mathematical and statistical problems. This method is useful for unforeseen instances, such as unclassified data from any source [27].

Reinforcement Learning
Reinforcement learning (RL) is considered to lie between supervised and unsupervised learning, as it uses no label information. Instead, RL associates a reward value with each decision and aims at good decision-making that maximises the rewards. There are two main methods of doing this: value function approximation and policy search [28].
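To make the reward-driven idea concrete, the sketch below uses tabular Q-learning, one simple value-based method in which the value function is stored in a table rather than approximated. The chain environment, the reward of 1 for reaching the last state, and all parameter values are assumptions chosen for illustration; updates sweep every state-action pair deterministically instead of exploring at random.

```python
ACTIONS = (-1, +1)  # move left or move right along the chain


def q_learning(n_states=4, gamma=0.9, alpha=0.5, sweeps=100):
    """Tabular Q-learning on a hypothetical chain whose last state is the goal."""
    goal = n_states - 1
    q = {(s, a): 0.0 for s in range(n_states) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(goal):  # the goal state is terminal, so it is never updated
            for a in ACTIONS:
                nxt = min(max(s + a, 0), goal)
                reward = 1.0 if nxt == goal else 0.0
                future = 0.0 if nxt == goal else max(q[(nxt, b)] for b in ACTIONS)
                # Move the estimate toward reward + discounted best future value.
                q[(s, a)] += alpha * (reward + gamma * future - q[(s, a)])
    return q


def greedy_policy(q, n_states=4):
    """Pick, for every non-terminal state, the action with the highest value."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(n_states - 1)]


q_table = q_learning()
print(greedy_policy(q_table))
```

No labels are involved: the agent only observes rewards, and the learned values make "move right" the preferred action in every state because it reaches the rewarded goal soonest.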

Application of Basic Machine Learning (ML) Techniques
This section discusses the most commonly used machine learning techniques.

Supervised Machine Learning
The most common supervised machine learning approaches are decision trees (DT), Bayesian algorithms, k-nearest neighbor (KNN), support vector machines (SVM), random forest (RF), and the association rule (AR). Figure 6 shows the statistics of research papers on different machine learning algorithms published in different publications up to December 2020.

Decision Trees (DTs)
This method classifies the sample dataset by sorting the samples according to their feature values. Each node (vertex) of the tree represents a feature; each branch (edge) represents a value of that feature in the sample space to be classified. The samples are classified starting from the origin (root) node, which is the feature that optimally divides the sample space [29,30]. There are several ways of finding the feature that optimally splits the training samples. Entropy is an information-theoretic metric used to measure the uncertainty in a group of observations. The following formula is used to calculate entropy [30]:
$$E = -\sum_{i=1}^{C} p_i \log_2 p_i$$
where $p_i$ is the probability of the data being classified into class $i$ of the $C$ classes. The feature whose split gives the lowest entropy is used for the root node.
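The entropy criterion can be sketched as follows, using the standard Shannon entropy of the class labels and choosing as root the feature whose split leaves the lowest weighted entropy. The function names and the toy traffic records are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def split_entropy(rows, labels, feature):
    """Weighted average entropy of the subsets produced by splitting on a feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def best_root_feature(rows, labels):
    """Feature whose split leaves the least remaining entropy."""
    return min(rows[0].keys(), key=lambda f: split_entropy(rows, labels, f))


# Hypothetical traffic records (invented for illustration).
rows = [
    {"protocol": "tcp", "high_rate": True},
    {"protocol": "udp", "high_rate": True},
    {"protocol": "tcp", "high_rate": False},
    {"protocol": "udp", "high_rate": False},
]
labels = ["attack", "attack", "normal", "normal"]

print(entropy(labels), best_root_feature(rows, labels))
```

In this toy data, splitting on `high_rate` separates the two classes completely, whereas splitting on `protocol` leaves each subset evenly mixed, so `high_rate` would be placed at the root.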
There are two main phases in the DT approach, i.e., building and classification. In the building phase, the DT is constructed starting from unoccupied branches and nodes. The features split the sample data until the leaves are obtained. In the classification phase, a new sample of unknown class is passed down the tree, following the path that corresponds to its feature values at the inner nodes [31]. The process continues until a leaf is reached. At the end of the process, the predicted class is found [32].
The DT process can be simplified in five steps, as follows: (1) applying pre- and post-pruning to reduce the size of the tree, (2) adjusting the space after the search state, (3) applying the search algorithm, (4) reducing the data features by removing the replicated features and (5) converting the trees into an appropriate data structure for further data operations. DT-based classifiers are used to ensure the security and safety of data, for example in DDoS and intrusion detection, and to detect suspicious traffic sources in a network [33,34]. There are some limitations to DT approaches, which can be summarised as follows: (1) the complex nature of the construction of a tree, which requires a large storage space, (2) the DT-based approach is only feasible when few DTs are involved and becomes very complex when more DTs are involved, and (3) the high computational cost of implementing a complex model.

Support Vector Machine
This method classifies datasets by constructing a hyperplane in the attribute space that separates two or more classes. The distance between the hyperplane and the nearest data point of each class needs to be maximized for the best classification of sample data [35]. The hyperplane that separates the data is expressed as:
$$a \cdot x + b = 0$$
where $a$ is the weight vector, $x$ is the feature vector and $b$ is the bias. The product $a \cdot x$ can be written in components as $a_1 x_1 + a_2 x_2 + a_3 x_3 + \dots + a_n x_n$, where $n$ is the number of dimensions of $x$. The following equation is used to predict the class of a vector [36,37]:
$$\hat{y} = \operatorname{sign}(a \cdot x + b)$$
where the sign function returns −1 or +1 for negative or positive values, respectively. This classification method is popular due to its suitability and capability for datasets with many feature attributes [36]. The main advantages of this approach include the ability to perform real-time tasks and to dynamically update the training dataset. The SVM approach is generally used in safety and security applications. It is also proficient for storage applications, as it creates a hyperplane to divide the data points with less time and lower cost complexity [37][38][39].
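The decision rule of a linear SVM can be sketched as below: a sample is classified by the sign of its score relative to the hyperplane, and its geometric distance to the hyperplane is the quantity that training seeks to maximise for the closest samples. The weight vector and bias here are hand-picked assumptions, not values learned from data, and a score of exactly zero is assigned to the positive class by convention.

```python
import math


def decision_value(a, x, b):
    """Signed score a . x + b; zero means x lies on the hyperplane."""
    return sum(ai * xi for ai, xi in zip(a, x)) + b


def predict(a, x, b):
    """Return +1 or -1 depending on the side of the hyperplane (0 maps to +1)."""
    return 1 if decision_value(a, x, b) >= 0 else -1


def distance_to_hyperplane(a, x, b):
    """Geometric distance |a . x + b| / ||a|| from x to the hyperplane."""
    norm = math.sqrt(sum(ai * ai for ai in a))
    return abs(decision_value(a, x, b)) / norm


# Hypothetical hyperplane parameters (chosen by hand for illustration).
a = [3.0, 4.0]
b = -5.0

print(predict(a, [3.0, 4.0], b), distance_to_hyperplane(a, [3.0, 4.0], b))
print(predict(a, [0.0, 0.0], b), distance_to_hyperplane(a, [0.0, 0.0], b))
```

A trained SVM would choose `a` and `b` so that the smallest of these distances over the training samples is as large as possible.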
A study conducted in [40] on the application of SVM to securing IoT systems applied a linear SVM for malware detection. The study compared the performance and efficiency of SVM techniques with other learning algorithms for intrusion detection. The results obtained from many research studies showed that the SVM technique is more accurate and outperforms other learning techniques.
In [41], an SVM approach was applied to device security for transferring and receiving secure data, and the results showed that it is more effective for safe cryptographic techniques than the traditional method.

Bayesian Algorithms
This algorithm is used to calculate the probability of an event based on historical results. It can also be used to estimate the probability of an attack in network traffic using previous network traffic information [42].
The naive Bayes (NB) classifier treats the features used for traffic classification, such as duration, connection, connection status flag, and connection protocol, e.g., UDP and TCP, as independent of one another [43]. NB classification is used for intrusion and anomaly detection in network traffic [43,44]. This classifier is popular due to its simplicity and ease of implementation, low training data requirements, support for multi-class classification, and robustness. However, the NB classifier cannot capture useful hints from the interactions and relationships between features, which are important for classification [45].
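A minimal sketch of a categorical naive Bayes classifier is given below, assuming add-one (Laplace) smoothing to avoid zero probabilities. The feature names, flag values, and toy connection records are hypothetical and were invented for this example.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count classes and, per class, how often each feature value occurs."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature) -> value counts
    feature_values = defaultdict(set)    # feature -> set of observed values
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(label, feature)][value] += 1
            feature_values[feature].add(value)
    return class_counts, value_counts, feature_values


def predict_nb(model, row):
    """Choose the class with the highest log-posterior, treating features as independent."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior from historical frequencies
        for feature, value in row.items():
            seen = value_counts[(label, feature)][value]
            # Add-one smoothing: an assumption of this sketch.
            score += math.log((seen + 1) / (count + len(feature_values[feature])))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical historical connections (invented for illustration).
rows = [
    {"protocol": "tcp", "flag": "SF"},
    {"protocol": "tcp", "flag": "SF"},
    {"protocol": "udp", "flag": "SF"},
    {"protocol": "tcp", "flag": "S0"},
    {"protocol": "tcp", "flag": "S0"},
    {"protocol": "udp", "flag": "S0"},
]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]

nb_model = train_nb(rows, labels)
print(predict_nb(nb_model, {"protocol": "tcp", "flag": "S0"}))
```

Because each feature contributes its own factor to the score, the classifier is simple and needs little data, but it cannot represent interactions between features.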

K-Nearest Neighbor (KNN)
This is a non-parametric approach used for the classification of data samples, as shown in Figure 7. In the figure, there are three types of circular dots. The green dots represent normal behavior, the red dots represent malicious behavior, and the blue dots represent unknown data samples, which may be either normal or malicious. The class of an unknown sample is decided according to the majority class among its nearest neighbors. To check and verify the results, cross-validation is used to test different k values [46,47]. The optimal value varies depending on the data samples, and finding it can be a time-consuming and challenging task. This approach is also used in intrusion and anomaly detection [48,49].
In [50], KNN is used for intrusion detection by classifying the nodes in a wireless sensor network as exhibiting either normal or abnormal system behaviour. The proposed anomaly detection system shows accurate and efficient intrusion detection for an IoT environment.
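The majority-vote rule and the effect of the choice of k can be sketched as follows. Leave-one-out evaluation is used here as a simple form of cross-validation; the two-dimensional points and their labels are assumptions invented for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(points, labels, k):
    """Hold out each point in turn and check whether the rest classify it correctly."""
    correct = 0
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(rest_points, rest_labels, points[i], k) == labels[i]:
            correct += 1
    return correct / len(points)


# Hypothetical feature vectors for observed behaviour (invented).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["normal", "normal", "normal", "malicious", "malicious", "malicious"]

print(knn_predict(points, labels, (0.5, 0.5), k=3))
for k in (3, 5):
    print(k, leave_one_out_accuracy(points, labels, k))
```

On this toy data, k = 3 classifies every held-out point correctly, while k = 5 forces each vote to include more points from the other class than from the point's own, which illustrates why the value of k has to be tuned per dataset.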

Random Forest (RF)
This is an ensemble approach that predicts the result by combining multiple decision trees. The RF algorithm is mostly used for anomaly detection and intrusion detection [51]. RF provided better classification results than other classifier algorithms, such as SVM, KNN and ANN, when only datasets with limited features are available. RF uses the features obtained from network traffic to correctly recognize IoT devices. In one study, the researchers extracted network data from seventeen (17) IoT nodes, divided into nine categories, to train the classifier using the RF algorithm. The research studies conducted in [51,52] concluded that RF holds more practical importance for correctly identifying unauthorized IoT devices compared to other ML algorithms.
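The ensemble idea behind RF can be sketched in a heavily simplified form: each tree is reduced to a depth-one tree (a decision stump) trained on a bootstrap sample and a randomly chosen feature, and the forest predicts by majority vote. This simplification, the fixed seed, the number of trees, and the toy points are all assumptions of the sketch; practical random forests grow deeper trees and sample features at every split.

```python
import random
from collections import Counter


def train_stump(points, labels, feature):
    """Fit a depth-one tree: a single threshold on a single feature."""
    majority = Counter(labels).most_common(1)[0][0]
    # Fallback rule: predict the majority class everywhere.
    best = (feature, float("inf"), majority, majority)
    best_errors = sum(label != majority for label in labels)
    values = sorted(set(p[feature] for p in points))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [l for p, l in zip(points, labels) if p[feature] <= threshold]
        right = [l for p, l in zip(points, labels) if p[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if errors < best_errors:
            best, best_errors = (feature, threshold, left_label, right_label), errors
    return best


def stump_predict(stump, point):
    feature, threshold, left_label, right_label = stump
    return left_label if point[feature] <= threshold else right_label


def train_forest(points, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    n, n_features = len(points), len(points[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(n_features)
        forest.append(train_stump([points[i] for i in idx], [labels[i] for i in idx], feature))
    return forest


def forest_predict(forest, point):
    """Majority vote over all trees in the forest."""
    votes = Counter(stump_predict(stump, point) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical traffic feature vectors (invented for illustration).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["normal"] * 4 + ["malicious"] * 4

forest = train_forest(points, labels)
print(forest_predict(forest, (0.5, 0.5)), forest_predict(forest, (5.5, 5.5)))
```

Individual trees may be poor when their bootstrap sample is unrepresentative, but the vote across many trees smooths out such errors.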

Association Rule Algorithms
Association Rule (AR) algorithms are used to classify unknown data and variables in the training dataset by examining the relationships between variables. For instance, let P, Q and R be the variables of a dataset. The AR algorithm uses these variables, discovers their relationships, and then constructs a model for the dataset [53]. This approach is not generally used in IoT safety and security systems as compared to other ML methods; however, the AR approach can be used by other ML algorithms to efficiently and effectively ensure IoT security. The disadvantage of AR algorithms is their high time complexity.

Ensemble Learning (EL)
This is a promising learning approach that combines a collection of basic classification methods to improve the classification performance on a data sample. The main aim of the EL method is to combine multiple classifiers, of heterogeneous or homogeneous type, to obtain the best result [54]. An experimental assessment in [55] concluded that EL is the best machine learning method, depending on the application. A research study in [56] concluded that the time and cost complexity can be reduced for the limited hardware resources in IoT devices. The main aims of the framework in [56] are to (1) provide distributed and automated online learning methods to identify anomalies, and (2) help in assessing and obtaining the real data [56].

Unsupervised ML
In the following section, we discuss the general unsupervised machine learning approaches and their applications in IoT security, as well as their advantages, and disadvantages.

K-Means Clustering (KMC)
KMC is an unsupervised machine learning approach used to discover the clusters in a dataset. Clustering applies an iterative process to generate an accurate output. The input consists of a set of data samples described by their features. In the first step, k centroids are chosen and each sample is assigned to the cluster of its closest centroid, with the distances between points calculated using the Euclidean distance. Then, the cluster centroids are recalculated as the mean of the samples assigned to each cluster. The same procedure is repeated until the cluster assignments no longer change [57]. The KMC approach is applied to network anomaly detection by classifying abnormal behavior [58]. In [59], k-means clustering is used to detect anomalies in an IoT system. Unsupervised algorithms have numerous applications in the safety and security of IoT systems. A study in [60] showed that the KMC method can help to protect private data through anonymization in an IoT environment. This clustering method is used to develop a data anonymization algorithm for advanced data-exchange security. The limitations of this approach are the need to select the value of k and the assumption that the clusters contain approximately the same number of samples.
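The iterative assign-and-recompute loop can be sketched as below, together with one possible way of using the resulting centroids for anomaly detection: flagging a sample whose distance to the nearest centroid exceeds a threshold. The initial centroids, the threshold value, and the toy points are assumptions invented for this example.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid updates until assignments stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iter):
        new_assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no sample changed cluster
        assignments = new_assignments
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments


def is_anomaly(point, centroids, threshold):
    """Flag a point that is far from every cluster centre (threshold is hypothetical)."""
    return min(math.dist(point, c) for c in centroids) > threshold


# Hypothetical two-dimensional traffic features (invented for illustration).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, assignments = kmeans(points, initial_centroids=[(0, 0), (10, 10)])

print(centroids, assignments)
print(is_anomaly((5, 5), centroids, threshold=3.0))
```

The result depends on k and on the starting centroids, which is why the choice of k is listed as a limitation of the method.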

Principal Component Analysis
This is an unsupervised algorithm for feature reduction, which transforms a large set of features into a reduced set that retains most of the information embodied in the large set. It converts a number of correlated features into a smaller number of uncorrelated features, known as principal components. PCA can be applied to real-time intrusion and anomaly detection [61]. The combination of PCA and other ML classifier methods, such as KNN and SVM, provides an efficient and secure computing system, which may be used in real-time IoT systems.
The research survey is summarized in Table 1, which compares different research studies on the safety and security of IoT based on machine learning algorithms.
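As a rough sketch of how correlated features collapse onto a principal component, the following computes the first component by power iteration on the sample covariance matrix and projects the data onto it. Power iteration is one of several ways to obtain the component and is used here only to keep the example dependency-free; the toy data, in which the second feature is exactly twice the first, are an assumption invented for illustration.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return the feature means and the leading eigenvector of the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0] * d  # starting vector; assumed not orthogonal to the component
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(rows, means, component):
    """One-dimensional coordinate of each row along the component."""
    return [sum((x - m) * c for x, m, c in zip(row, means, component)) for row in rows]


# Hypothetical, perfectly correlated features (invented for illustration).
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, component = first_principal_component(rows)
reduced = project(rows, means, component)

print(component, reduced)
```

Two correlated features are replaced by a single coordinate per sample, which could then be passed to a classifier such as KNN or SVM.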

Emerging Techniques for IoT Safety and Security
Federated machine learning and generative adversarial networks are among the latest fields of machine learning and have opened new research areas. In the following sections, we briefly shed light on these emerging machine learning techniques:

Federated Learning (FL)
Federated machine learning is a decentralised collaborative learning technique used to gain more experience from large datasets held at different locations. The main property of federated machine learning is that it secures the data that are collected through different mediums. In contrast with other approaches, FL allows the model to be transferred without removing data from their origin [71]. This feature makes FL a good choice for IoT systems in terms of privacy and security. Federated learning addresses the challenges that massively distributed and private datasets create in ML by enabling on-device ML, without the migration of private end-user data to any central cloud.
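A minimal sketch of this idea, assuming a federated-averaging style of aggregation, is shown below: each client trains a small linear model on its own data, and only the model weights are sent back and averaged, weighted by the number of local samples. The model, learning rate, number of rounds, and client data are assumptions invented for illustration and do not describe the scheme of any cited work.

```python
def local_update(weights, data, lr=0.1, epochs=20):
    """Gradient descent on y = w * x + b using only the client's own data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return (w, b)


def federated_average(client_weights, client_sizes):
    """Average the client models, weighting each by its number of samples."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)


def train_federated(clients, rounds=50):
    """Server loop: send the global model out, collect updates, aggregate."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        # Raw data never leaves the client; only updated weights are returned.
        updates = [local_update(global_weights, data) for data in clients]
        global_weights = federated_average(updates, [len(data) for data in clients])
    return global_weights


# Hypothetical private datasets on two devices, both following y = 2x + 1.
client_a = [(0.0, 1.0), (1.0, 3.0)]
client_b = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

global_model = train_federated([client_a, client_b])
print(global_model)
```

The server learns a model that fits both devices' data without ever seeing a single raw sample, which is the privacy property that makes this approach attractive for IoT.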

Generative Adversarial Network (GAN)
This is a deep-learning-based generative modeling approach for training a generative model. The first GAN architecture was described in 2014 by Ian Goodfellow and colleagues in the paper titled "Generative Adversarial Networks" [72]. IoT security can be improved by using the generative modeling architecture of GAN. The neural network systems can be trained to classify any malicious and suspicious information that might be added by hackers. Research has shown that GAN-based security models have higher accuracy and precision, with a lower rate of false positives, compared to other traditional machine learning approaches [73].
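For reference, the training objective of the original GAN formulation can be written as a two-player minimax game between a generator $G$ and a discriminator $D$. The symbols below follow the common notation for this objective and are not defined elsewhere in this survey: $p_{\text{data}}$ is the distribution of real samples and $p_z$ is the noise distribution fed to the generator.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

The discriminator is rewarded for telling real samples from generated ones, while the generator is rewarded for producing samples that the discriminator accepts as real. In a security setting, the discriminator plays the role of the classifier that separates legitimate from suspicious inputs.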

Research Challenges, and Future Directions
Research issues, challenges, solutions and future directions in the field of IoT security are as follows.
• Data Security and Integrity: In machine learning techniques, reliable data for datasets and training data are very important to develop an accurate machine learning technique. A training dataset with low-quality data may interrupt the implementation of a particular learning technique. Thus, authenticated training datasets are crucial in ML techniques to secure the IoT network [74]. Furthermore, it is very easy for a hacker to learn the attack type and device vulnerabilities, and to manipulate the dataset used in the ML technique. Hence, it is a significant challenge to ascertain how the data can be secured and to detect different types of attacks and their probability of occurrence in any IoT environment. In other words, data security and integrity is a challenging future research field in IoT security.
• Backup Security Mechanisms: Usually, it is difficult to accurately state and estimate the network attack in an IoT environment in case of a "bad" defense policy in the learning process. Sometimes, this can cause disaster and drastic loss for IoT networks. Backup security mechanisms may solve this difficulty by protecting IoT systems during the exploration stage of the learning process. More mechanisms are needed to address this issue by incorporating ML-based security schemes to provide reliable, resilient and secure IoT services by reducing the risks of selecting bad policies.
• Privacy Problems: Privacy is a common issue in IoT environments. Smart devices, such as sensors and wearable devices, are used to exchange data and information, and the users are not fully aware of where and how their personal information is shared via these devices. IoT smart devices carry the private and personal information of the clients and users, which may be misused. Every IoT device has security protocols to communicate with other devices, i.e., encryption and authentication. Privacy disclosures, leakage, and threats are crucial challenges that make users hesitate to adopt these technologies [75].
• Computational Cost: Many ML-based techniques require a substantial amount of training data and a complicated process for feature extraction, creating high computational costs and increasing the complexity of the system. Therefore, it is challenging to find new ML techniques with low communication and computational costs [76].
• Infrastructure Issues: A weak infrastructure always makes it easier for attackers to hack through the software. An attack that exploits a previously unknown vulnerability is known as a zero-day attack and is very difficult to detect using traditional security suites and schemes. It is, therefore, essential to build a strong and smart infrastructure to develop a secure IoT system [77]. Safety and security features must be considered and included in every phase of the IoT system.

Discussion
This research presented a detailed overview of the Internet of Things (IoT) according to the latest research trends, focused on safety and security and based on machine learning techniques. To achieve this goal, recent, high-quality research papers on the subject were reviewed. The rapid advancements in research on the security of IoT systems are supported by simulation tools and IoT modelers. Catastrophic failures in IoT networks have been observed, due to serious attacks on the security vulnerabilities in IoT devices. Due to the continuous growth in IoT devices and the limited security measures in the devices, IoT systems will remain a soft target in the future. To avoid such failures, the cyber-physical security of the system should be considered and strong security measures should be implemented in IoT devices, such as encryption, strong authorization and authentication, and firewalls. This may be an effective means of overcoming the IoT security issues. In this research survey, the research was mainly focused on improving lightweight encryption and authentication for resource-constrained and low-power devices.
As per the IoT security architecture presented in this paper, three layers, namely, the perception/physical, network, and application layers, are considered, together with the most recent mechanisms applied to each layer of the IoT network. It is evident that unsecured physical devices and communication networks with malicious activities result in new threats to IoT networks.
This survey also demonstrates that authentication alone may not suffice. The IoT security system needs to work on lightweight and mutual authentication systems at the application and network layers. Moreover, to mitigate physical device security issues, lightweight and low-cost encryption is proposed for the physical layer.

Conclusions
IoT systems have changed human life, making it easy, smooth, and comfortable. With these advancements and the development of smart things, new challenges are arising, especially those associated with the safety and security of IoT devices. The traditional methods and tools are not effective at countering the new security issues and challenges. Machine learning is a promising method, which allows for the development of various powerful methods to enhance the safety and security of IoT. In this survey, a state-of-the-art comprehensive review of the literature is presented, focusing on the safety and security of IoT, including its architecture, detailed security threats, and attack surfaces in the IoT system. In addition, a comprehensive review of the use of machine learning methods is presented. Finally, issues, research challenges, and future research directions regarding the development of a safe and secure IoT are also presented.