Classifier Performance Evaluation for Lightweight IDS Using Fog Computing in IoT Security

In this article, a Host-Based Intrusion Detection System (HIDS) using a Modified Vector Space Representation (MVSR) N-gram and Multilayer Perceptron (MLP) model for securing the Internet of Things (IoT), based on lightweight techniques and using Fog Computing devices, is proposed. The Australian Defence Force Academy Linux Dataset (ADFA-LD), which contains exploits and attacks on various applications, is employed for the analysis. The proposed method is divided into the feature extraction stage, the feature selection stage, and classification modeling. To maintain the lightweight criteria, the feature extraction stage considers a combination of 1-gram and 2-gram for the system call encoding. In addition, a Sparse Matrix is used to reduce the space by keeping only the weight of the features that appear in the trace, thus ignoring the zero weights. Subsequently, Linear Correlation Coefficient (LCC) is utilized to compensate for any missing N-gram in the test data. In the feature selection stage, the Mutual Information (MI) method and Principle Component Analysis (PCA) are utilized and then compared to reduce the number of input features. Following the feature selection stage, the modeling and performance evaluation of various Machine Learning classifiers are conducted using a Raspberry Pi IoT device. Further analysis of the effect of MLP parameters, such as the number of nodes, number of features, activation, solver, and regularization parameters, is also conducted. From the simulation, it can be seen that different parameters affect the accuracy and lightweight evaluation. By using a single hidden layer and four nodes, the proposed method with MI can achieve 96% accuracy, 97% recall, 96% F1-Measure, 5% False Positive Rate (FPR), highest curve of Receiver Operating Characteristic (ROC), and 96% Area Under the Curve (AUC). It also achieved low CPU time usage of 4.404 (ms) milliseconds and low energy consumption of 8.809 (mj) millijoules.


Introduction
This work proposes lightweight solutions on the basis of standardized resources for the secure connection of IoT devices. This ensures a secure communication between the constrained environment devices and typical Internet hosts, using a lightweight yet standard compliant Internet security anomaly intrusion detection system. Fog Computing offers a platform that addresses issues linked with the security as well as the privacy of IoT, since the IoT devices have very limited computational resources and Evaluating the performance of different machine learning models based on time, energy consumption, and space complexity.

The Work Contribution
Regarding the security of IoT systems, issues such as suitability of high-end traditional security solutions exist due to the challenge of low storage capacity as well as low power for processing the IoT devices. Additionally, connectivity between IoT devices lasts for long periods of time without any human intervention [9][10][11]; the hosting of a Fog node could be done via a large group of devices comprising video surveillance cameras, routers, embedded servers, switches, and industry-based controllers. Most of the aforementioned devices might have insufficient performing computational power; for instance, there is a need for efficient performance as well as the provision of real-time feedback and responses by intrusion detection algorithms, data analysis, and protocols used in cryptography. Thus, lightweight protocols and algorithms that do not require high-end computing resources are needed. An, Zhou, Lü, Lin and Yang [12], with the aim of overcoming the issue of space limitation among Fog nodes, proposed a lightweight IDS, which they Evaluating the performance of different machine learning models based on time, energy consumption, and space complexity.

The Work Contribution
Regarding the security of IoT systems, issues such as suitability of high-end traditional security solutions exist due to the challenge of low storage capacity as well as low power for processing the IoT devices. Additionally, connectivity between IoT devices lasts for long periods of time without any human intervention [9][10][11]; the hosting of a Fog node could be done via a large group of devices comprising video surveillance cameras, routers, embedded servers, switches, and industry-based controllers. Most of the aforementioned devices might have insufficient performing computational power; for instance, there is a need for efficient performance as well as the provision of real-time feedback and responses by intrusion detection algorithms, data analysis, and protocols used in cryptography. Thus, lightweight protocols and algorithms that do not require high-end computing resources are needed. An, Zhou, Lü, Lin and Yang [12], with the aim of overcoming the issue of space limitation among Fog nodes, proposed a lightweight IDS, which they referred to as Sample Selected Extreme Learning Machine (SS-ELM). The dataset employed by the researchers was the KDD Cup 99, coupled with SS-ELM, which outperformed the classical Back Propagation (BP) algorithm with regard to the time spent in training and the accuracy of detection. Consequently, Hosseinpour, Vahdani Amoli, Plosila, Hämäläinen and Tenhunen [13] proposed an Artificial Immune System [14] that was lightweight and distributed IDS, with a detection accuracy reaching 98%. Furthermore, as a means of detecting the behavior of attacks on ADFA-LD, and with an FPR achievement of less than 20%, Xie, Hu, Yu and Chang [15] employed frequency-based models. Invariably, the work was proceeded by Xie, Hu and Slay [16], who implemented a single class SVM model on short sequences. Overall, a better performance was achieved; however, the research encountered a low positive rate of about 20%. In countering these challenge, another researcher [17] implemented a vector space representation and used several classifiers in attaining a detection accuracy above 95%. Among the primary lapses in the work with lightweight IDS are the use of outdated datasets. In addition, IDS models that make use of contemporary datasets are specially designed for cloud-related platforms. The aforementioned challenges reveal the need to develop lightweight smart security solutions that are easily distributed and possess the ability to last long in service. Instead of ensuring individual security for many IoT devices, it is more reasonable to ensure the implementation of network data security solutions. The artificial intelligence theories, like Machine Learning, have already proven their significance when dealing with heterogeneous data of various sizes [18,19]. Thus, in this study, a lightweight IDS via the use of a suitable host dataset was developed. Hence, the core contributions of this work are as follows: • To review security challenges in IoT. Based on the review, a general architecture of lightweight IDS based on the machine learning method is put forward. We also review the specific aspects and look at the advantages, limitations, and key challenges.

•
To explore Intrusion Detection Systems (IDSs) using limited resources, such as a Fog node (Raspberry Pi 3), using a contemporary dataset, ADFA-LD, rather than the outdated KDD Cup 99 dataset.

•
To propose a lightweight technique for Intrusion Detection Systems in a Fog Computing environment, which reduces and mitigates the overheads in the Cloud security challenges.

•
To examine and improve different feature extraction and selection methodologies that are compatible with the constrained resources of Fog Computing.

•
To evaluate the performance of different machine learning models based on time, energy consumption, and space complexity and to evaluate the performance of MLP and its different parameters via a single hidden layer perceptron for practical detection time and energy, deference's number of nodes, and reduced computational complexity for the selection and extraction of features.
The rest of the paper is structured as follows: Section 2 discusses the related works and analyzes the requirements of IoT security events and IDS in Fog Computing. It also presents a brief review of the IDS literature for a foundation of the presented work. It specifies the evolution and history of lightweight IDS based machine learning in the components of the IoT and their drawbacks. Section 3 is an overview of the methodology and the technical background of this work. Section 4 provides the analysis of the benchmark ADFA-LD dataset and details of an experiment in terms of the software and hardware used. We discuss the evaluation of the results in Section 5. Section 6 is a summary of the conclusion, and future work is presented in Section 7.

Related Work
This article is aimed at motivating researchers in the field of IoT security to develop IDSs via lightweight-based machine learning for the securing of IoT learning. It proposes suitable approaches for assessing the performance of classifiers statistically. sists of three main components: (i) end devices; (ii) Fog nodes; and (iii) Back-end cloud infrastructure. Fog computing provides computation and storage in addition to network services between the edge and the cloud servers. Fog computing nodes are heterogeneous in nature and are deployed in a diverse set of environments.

Fog Computing and IoT
Fog computing is becoming an efficient and cost-effective distributed computing paradigm to support applications on billions of connected devices forming the Internet of Things (IoT). Fog computing is proposed to help resolve issues in IoT applications that require low latency, geolocation, and mobility support in addition to location awareness. The infrastructure of Fog Computing allows applications to run in an environment close to the edge of the network and to reduce the overhead in the cloud [20]. Fog computing is a platform that, similar to the cloud, provides computing resources, storage, and application services to the end user [21]. As compared to Cloud Computing, which empowers client computers via the distribution of resources for the computing and storage of data that are resident in a remote server, or the "cloud", Fog Computing ensures the availability of such resources to connected clients closer to the data source.

Fog Computing and Similar Technologies
Though Cisco was the first to suggest the name "Fog Computing", related ideas have been researched and proposed by other parties. The subsections below provide details regarding such related technologies, with the inclusion of how they are different from Fog systems. In addition, a more comprehensive comparison of these similar technologies is available in the literature [22,23] for edge computing.
Edge Computing: this deals with performing localized device processing via PAC, a controller that helps in the handling of data communication, processing, and storage [23]. It has more advantages than Fog Computing due to its ability to reduce failure points as well as ensure the independent nature of individual devices. Nevertheless, this same feature is responsible for its difficulty in managing and accumulating data, especially in large scale networks, for example, the IoT [24].
Cloudlet: this refers to the middle aspect of the three-tier hierarchy referred to as "mobile device; cloudlet; cloud". Cloudlet comprises four main characteristics, namely the ability to totally self-manage, possession of adequate computing power, built on standard cloud technology, and less end-to-end latency [25]. However, there are differences between Fog Computing and Cloudlet, such as non-suitability of application virtualization to the environment, consumption of more resources, and the inability to work in offline modes [25,26].
Micro-Data Center: this refers to a minute fully functioning data center that houses numerous servers that are capable of supporting many virtual machines [27]. Numerous technologies, including Fog computing, are able to benefit from centers of micro data, due to their aptitude for latency reduction, conservation of bandwidth consumption, inbuilt security protocols, relative portability, and enhanced reliability, via compression, and the potential of accommodating numerous novel services.
Fog computing ensures that storage and computation are provided apart from the services occurring in the network within cloud servers and the edge. The nodes of Fog Computing are quite heterogeneous and thus are deployed in different environments.

Fog Computing Mitigating Security Challenges in IoT
The introduction of Fog Computing was as an means of extending that of cloud computing; thus, it inherited many of the privacy and security challenges faced by the cloud [28,29]. Solutions to security issues that were designed for the cloud could actually be applied in the Fog Computing area; nevertheless, limitations could exist due to the special characteristics of Fog Computing [30]. In addition, cloud computing offers a medium that helps to resolve issues linked to the privacy and security of the IoT due to the limitations in the computational resources of IoT devices as well as issues such as inability of those devices to perform their required tasks for computation [31]. Thus, Fog Computing can be likened to an extra IoT security tool, wherein it assists in reducing some concerns regarding the security of IoT [31,32]. Some of the major issues regarding the security of Fog Computing are Access Control, Authentication, Privacy, Intrusion Detection.

Advantages of Fog Computing over Traditional IoT
The features of Fog Computing [33][34][35] are as follows: (1) Low latency: less access time needed for computing and Fog node storage resources, also referred to as smart gateways. (2) Location awareness: the location of the Fog at the network edge makes it aware of where the applications are located as well as their specificity. This is of benefit as the awareness of context is a very essential aspect of IoT applications. (3) Distributed nodes: dissimilar to centralized cloud nodes, Fog nodes are usually distributed. Thus, there is a need for the deployment of multiple Fog nodes in distributed geographical areas so as to ensure the provision of services to the mobile devices in such areas. An example could be with regards to vehicular networks, which deploy Fog nodes in highways, thus providing less latency of data or even ensuring video streaming to vehicles. (4) Mobility: mobility is supported by the Fog, wherein direct communication is allowed between the smart gateways and the devices at a close proximity. (5) Real-time response: Unlike the cloud, a more instant response can be achieved via Fog nodes, as they have less latency than the cloud. (6) Interaction with the cloud: interaction can take place among Fog nodes and the cloud by communicating directly with just the data required to be received by the cloud, as seen in Table 1.  [33,34].

Critical Latency Requirements and Data Rich Mobility
Less network hopping; less concentrated load. Data located at optimal depth; local caches. Significant mitigation in the movement of data within networks, thus leading to lessened expenditure, latency, and congestion, alongside the removal of bottlenecks that occur due to centralized computing systems.

Geographic Diversity
Intelligence localized as appropriate. Provides high levels of reliability, the tolerance of faults, scalability, and the provision of sub-secondly responding to end users.

Network Bandwidth Constraints
Local processing reduces core network load. Lesser bandwidth consumption.

Reliability/Robustness and Analytics Challenges
Fast fail-over; local response in emergency. Appropriate level of analysis and storage. Elimination of the core environment for computing, thus mitigating the chances of failure and major blockages.
User Data/Geographic Privacy Fog can aggregate/ anonymize user data. The security of encrypted data can be more reliable due to its closeness to end users, thus lessening exposure to elements that are hostile. Better scalability of virtualized systems as well as encoding of data due to its close proximity to the network edge.

Relevant Work
Approaches for anomaly detection can be classified generally into two categories: supervised learning and unsupervised learning. For a supervised approach of detecting anomalies, a labelled dataset is used in constructing the system or network's normal behavior. Consequently, an unsupervised technique assumes greater frequency of normal behaviors, thus leading to the establishment of the model on assumptions, wherein there is no need of any data for training [36].
Furthermore, as informed by [37,38], a supervised anomaly detection technique has the ability of leveraging the measurement of distance as well as density of clusters for the detection of intrusions.
Consequently, some researchers developed an anomaly detection technique based on deep packets, with the sole aim of executing it on IoT devices that are resource constrained [39]. Arguments were made by the authors as to the fact that IoT devices of small sizes make use of relatively simple-to-use protocols, thus leading to network payloads that are very similar in nature. Hinging on the aforementioned idea, the researchers employed the use of a technique named bit-pattern matching for performing the selection of features. Worthy of note is the treatment of payloads as a sequence of bytes as well as the operation of feature selection via overlying byte tuples, referred to as N-grams. Thus, matching an N-gram and a bit-pattern takes place in situations wherein there is a match among corresponding bits across all positions. Thus, an experimental evaluation was proposed by the scholars via the use of dual devices enabled by Internet, though there was a low FPR for the four types of attack researched, namely directory traversal attacks, worm propagation, tunnelling, and SQL code injection. Furthermore, a brief introduction on an IoT-based internally distributed anomaly detection system was presented in another study [40]. The essence of this IDS proposal was to find the availability of network discrepancies via the monitoring of the features of one hop neighbor nodes, examples of which include data rate and the size of a packet. As informed by the authors, learning is undergone by the system, leading to the derivation of normal behaviors from the information being monitored. Nevertheless, there was no information regarding the approach used in constructing the profile of normal behaviors. Additionally, there was no clear justification as to the working techniques of the detection algorithms on low-capacity-based IoT devices. In a presentation by Pongle and Chavan [41], an IDS was designed for detecting wormhole attacks found in IoT devices. The authors assumed that the symptoms of a wormhole attack are always left on the system, for example, the exchange of a large amount of control packets within the tunnel's dual ends or the formation of a large number of neighbors subsequent to a successful attack. Employing the aforementioned logic, three algorithms were further proposed by the authors for the detection of such anomalies within the network. Based on their experiment, for the detection of wormhole, a 94% positive rate was achieved by the system, followed by 87% for the detection of both the attack and the attacker. Furthermore, a study related to the consumption of memory and power for the nodes was conducted by the authors. Thus, the proposed system is of suitability for IoT devices due to the low consumption of power and memory. However, there is a need to make comparisons between the achieved results and the literature so as to establish a baseline between both.
Security decisions are normally made with regard to the collection of data from IoT devices as well as from the environment of IoT systems. An instance of this is the learning of attack models by machine learning algorithms from the attack detection data. Thus, trustworthiness as well as the quality of the data is of importance in ensuring appropriate decisions are made. The design of reliable protocols is, however, not trivial for the collection of a high-quality dataset [42].
The deployment of IoT-based IDS mechanisms on low-capacity nodes and networks with less power is quite intricate and challenging. Worthy of consideration for signaturebased methods is the high cost of storing signatures in the databases as well as computing running learning algorithms used in checking individual signatures.
Many NIDS mechanisms have been developed in the IoT via the use of attack signatures; however, there have been issues of non-identification of unknown attacks, as well as higher FPR, also referred to as false alarms [43]. Furthermore, for monitoring traffic within a network, there is a need to connect IDS hosts to any components, such as nodes, routers, or computer devices. Dissimilar to NIDS, the system calls, file system modifications, and the operating system processes are scanned by the HIDS [44]. Nevertheless, due to the many limiting factors, HIDS is not preferable.

Lightweight IDS-Based Machine Learning
IDSs, in general, collect information about the host and network resources of a system and attempt to analyze it in order to detect any probable intrusion in the system. This provides a mechanism for security specialists in an organization to monitor activities across the computer systems and network infrastructure. Figure 1 depicts a general state of IDSs on the basis of implemented techniques for detection alongside the deployed Machine Learning lightweight devices in the IoT environment. The concept of IDS has also expanded to include the monitoring and analysis of host and network features together. The simple concept behind this convergence is to assess the interconnection of all features that may act as an indicator of compromise for the whole system. As shown in Figure 2, the implementation of lightweight IDS-based Machine Learning can be achieved via employing several techniques and approaches. Thus, different mechanisms of detection have been developed for detecting abnormalities and are placed into categories such as data mining methods, machine-learning-based methods, and statistical methods [45]. Finally, a brief discussion on the lightweight IDS-based Machine Learning approaches is included. routers, or computer devices. Dissimilar to NIDS, the system calls, file system modifications, and the operating system processes are scanned by the HIDS [44]. Nevertheless, due to the many limiting factors, HIDS is not preferable.

Lightweight IDS-Based Machine Learning
IDSs, in general, collect information about the host and network resources of a system and attempt to analyze it in order to detect any probable intrusion in the system. This provides a mechanism for security specialists in an organization to monitor activities across the computer systems and network infrastructure. Figure 1 depicts a general state of IDSs on the basis of implemented techniques for detection alongside the deployed Machine Learning lightweight devices in the IoT environment. The concept of IDS has also expanded to include the monitoring and analysis of host and network features together. The simple concept behind this convergence is to assess the interconnection of all features that may act as an indicator of compromise for the whole system. As shown in Figure 2, the implementation of lightweight IDS-based Machine Learning can be achieved via employing several techniques and approaches. Thus, different mechanisms of detection have been developed for detecting abnormalities and are placed into categories such as data mining methods, machine-learning-based methods, and statistical methods [45]. Finally, a brief discussion on the lightweight IDS-based Machine Learning approaches is included.

IDS Techniques-Location and Exploration of Source of Data
IDSs are developed into host-based intrusion detection systems (HIDS) and Network Intrusion Detection Systems (NIDS). They ensure the collection of data from diverse sources of networks and systems, thereby analyzing the data for probable threats [46]. For example, the implementation of NIDS can be done via detection techniques such as anomaly-based detection and signature-based detection [47]. For a signature-based NIDS, there

IDS Techniques-Location and Exploration of Source of Data
IDSs are developed into host-based intrusion detection systems (HIDS) and Network Intrusion Detection Systems (NIDS). They ensure the collection of data from diverse sources of networks and systems, thereby analyzing the data for probable threats [46]. For example, the implementation of NIDS can be done via detection techniques such as anomaly-based detection and signature-based detection [47]. For a signature-based NIDS, there is a limit in the detection of malicious threats that are already known. Furthermore, there is a need to apply a mix of both packet content inspection rules as well as packet header to the detection system via the specification of the signature and via the anomalous traffic flows. The design of anomaly detection techniques is such that they can have an automatic understanding of the attacks that are not known as well those that are not predictable for signature-based NIDS, as shown in Table 2 [45][46][47]. • Weak profile accuracy due to observed events being constantly changed, thus difficulty in adapting to unceasing changes in normal behaviors as well as a dynamic anomaly. • Unavailable during the rebuilding of behavior profiles, thus difficult to discover the boundaries between abnormal and normal behavior. • Difficult to trigger alerts at the right time, thus higher false-positive alarms.  Fault-Tolerant A DIDS state, resulting in more difficulty in storing consistently as well as the manner of recovery. The intrusion detection system is in a state of central storage, resulting in easier recovery after a crashing incidence.

•
Minute or no overhead is imposed on the systems, apart from those running analysis components, wherein the imposing of large loads is necessary. In addition, the hosts might be required to dedicate themselves to the task of the analysis. • Execution: there is a need for continuous running of a relatively small number of components.

•
Resist subversion: a smaller number of components are required to undergo monitoring. However, these components are larger and more complex, thus leading to more difficulty in monitoring.
• Data can be destroyed or modified by an attacker. • An intruder can modify or disable running programs on a system.

•
There is a limit to the size of the intrusion detection system with regard to its fixed number of components. However, as there is an increase in the number of monitored hosts, there will be a need for more storage and computing resources for analyzing the components, so as to ensure a balance of the load. • Dynamic Reconfiguration: a minute number of components are responsible for the data analysis.
Reconfiguring them would probably require the restart of the intrusion detection system. IDSs can be classified into two categories: Network-based IDS (NIDS) and Hostbased IDS (HIDS). In NIDS, several network segments are connected to each other, and network traffic is being monitored to detect malicious activities. Meanwhile, HIDS are attached to a computer device, whereby malicious activities that occur within the systems are monitored. Dissimilar to NIDS, analyses are made by HIDS on the system calls, file system changes, application logs, inter-process communication, and running processes, apart from just the network traffic. Furthermore, system call traces are used in detecting intrusions with HIDS. These intrusions could be in the form of anomalous sub-sequential traces. The collective anomalies are further translated into malicious programs, violators of policy, and unauthorized behaviors [48]. A system call trace refers to the system call sequences that have been ordered, performed by a process while executing, such as write or read, and open. It is important for HIDS applied anomaly detection techniques to put the data context into consideration. Moreover, there is a clear difference between a compromised system call trace and a normal program process [48]. Due to this, it is impossible to apply point anomaly detection techniques that consider personal data instance anomalously in this domain [44]. Additionally, IDS approaches can be classified based on specifications, anomaly, or signature, as shown in Table 2, which presents the advantages and disadvantages of intrusion detection techniques [38,49].
Early IDSs are mainly isolated single instances used in monitoring single networks via the conducting of local analysis in finding attacks. Thus, among stand-alone IDS instances of such a manner, there is no occurrence of interaction as well as communication. It is, however, obvious that such solutions will not be able to detect attacks that are highly distributed and sophisticated. The establishment of connectivity between isolated IDSs will not be possible within the occurrence of malicious events at once in different places, in order to ensure that these large networks and IT ecosystems are well protected [50]. A distributed IDS refers to one that denotes the performance of data analysis in different locations proportionally to the amount of monitored hosts. Previous research only considered the amount of data analysis components and locations rather than the components of data collection. Some distributed systems of intrusion detection are classified into Distributed Intrusion Detection System (DIDS) [51,52].
A centralized IDS is one that involves the analysis of data performed in a fixed number of locations, not minding the amount of monitored hosts. Similarly, for a decentralized IDS, there is no need to consider the data retrieval component location; the only location to be considered should be that of the analysis components. A comparison between centralized and decentralized intrusion detection systems, based on respective conditions, is illustrated in Table 2. The distribution of the IDS can occur across many nodes within a network. However, in the case of intrusion, decisions can be made collaboratively or in an independent way. Regarding the collaborative way, a single decision is shared by many nodes. Collaboration can employ statistical techniques, including game theory and voting; however, using an independent approach, individual nodes make their decisions separately over the network. In addition, a situation whereby all nodes work together under the same capacity is referred to as a flat infrastructure, in a distributed way, as compared to clustered infrastructure, where all nodes are under clusters of diverse capabilities, with each of them making contributions to the decisions uniquely. Thus, the location for computing is another aspect of a distributed IDS. Here, there is a centralized location for computation works on the retried data from the entire network. Dissimilar to the centralized computation location, the location of a stand-alone computation works on the local data, thus disregarding any decisions made by other nodes. A mix of both stand-alone and centralized locations is achievable via cooperative computation, wherein individual nodes are able to detect intrusions personally yet provide contributions to the final decision. Additionally, an IDS computation can be operated hierarchically, wherein all detected intrusions are sent to the root node by a cluster, after which a decision is taken [49].
Some of the research in this context is associated with detecting anomalies in industrial data [53,54] and the industrial IoT [55,56]. In this class, a real application service of IoT, like in smart homes [57] or in industry [58] is considered, and according to context and application, different techniques are employed for the detection of data anomalies. Intrusion detection refers to the activity involving the detection of actions carried out by intruders against information systems. These intruders might be either internal or external. The internal intruders refer to users within the network that possess some extent of legitimate access and try to raise their accessibility privileges, so as to cause a misuse of privileges that are unauthorized for them. On the other hand, the external intruders refer to those outside the targeted network, who try to gain access in an unauthorized manner to information in the system [38,59]. A typical IDS is made up of a system for reporting, sensors, and an engine for analysis. The deployment of sensors occurs at several network hosts. The major task of these sensors is the retrieval of data from the host or network, such as changes to file system, system call operations, packet headers, service requests, and statistics of traffic. Moreover, the sensors transmit the retrieved data to the analysis engine, whose duty is to ensure retrieved data are being investigated as well as to coordinate the detection of ongoing intrusions. Thus, when there is a detection of intrusion in the analysis engine, an alert is generated to the network administrator by the reporting system. Table 2 presents a comparison of methods used in detection, based on several criteria for IDS performance, thus classifying them into diverse IDS features such as threat to security, detection method, and placement strategy. More focus was placed on the state-ofthe-art method review with regard to the implementation of IDS [38].

Datasets in IDS
In this section, a summary is provided of the popular benchmarks in measuring a dataset in the field of intrusion detection. From previous research, it has been discovered that data are generally retrieved using three means: User Behavior, System Call Sequences (HIDS), and Network Traffic (NIDS), some researchers comprehensively evaluated existing dataset via some proposed criterions, alongside an evaluation and design framework for IPS and IDS datasets, namely CPU/memory usage, log files, network data packages, user input command sequences, low-level information of systems, sequences of system call, and error in system logs. Table 3 presents some commonly used benchmarks, all of which are used in the detection of anomalies or misuse [60]. ADFA IDS datasets are used on both Windows and Linux operating systems and are designed for evaluating system calls, such as HIDS. Whereas ADFA-LD datasets offer present day datasets of Linux for traditional HIDS-based evaluation, ADFAWD datasets offer novel datasets powered by Windows, for HIDS evaluation [60].
Numerous datasets are in existence, such as ADFA13, ISC2012, KDD99, and DARPA98, and they have been employed by scholars for evaluating intrusion detection performance and prevention techniques. Nevertheless, there is a gap in research with regard to assessing and evaluating datasets, coupled with the lack of any befitting datasets in the domain.

Machine Learning Techniques for IDS on IoT
An IDS is designed for monitoring occurring events within a computer network or system, thus leading to the analysis of possible incident signs as well as frequent prohibition of unauthorized access [57,62]. There are two methods of differentiating between a malicious and normal system behavior. They are the misuse detection approach and the anomaly detection approach. Misuse detection is capable of detecting a sequence of actions as an attack if there is a match between former complete descriptions of the series of actions, also referred to as the signature, executed by the attacker. The second approach helps in discovering data patterns, which are not conformable to the awaited behavioral expectations. A major technique that has proven to be practically valuable with regard to anomaly-based detection is machine learning. The foundational principle for machine learning approaches lies in the data prediction and learning [36]. Machine learning can be split into two categories, namely Artificial Intelligence techniques, examples of which include SVM, KNN, DT, k-means clustering and MLP, and Computational Intelligence (CI) techniques, examples of which include Fuzzy Logic, Generic Algorithms, Artificial Immune Systems (AIM), and Artificial Neural Networks (ANNs) [63].
It is important to note that the field of machine learning deals with the development of systems with the capability of automatic data learning [36,64], including the identification of patterns that are hidden automatically [65]. Machine learning algorithms are characterized based on their styles of learning as well as the functionality in the similar nature of their manner of engagement [46]. Figure 2 presents an overview of machine learning approaches based on their learning styles in lightweight techniques. The techniques of machine learning tend to be effective with regard to the improvement of detection rate and the reduction of false alarm rates and offer decreased communicating and computing costs [56,66]. The techniques of machine learning can be split into different phases, including supervised, unsupervised, and semi-supervised learning [67,68] and Reinforcement Learning (RL) [69].

Supervised Learning
Here, the algorithms are able to learn representations to predict cases that are not known, from labelled inputted data. Examples of supervised machine learning algorithms include SVMs for solving issues regarding classification and RFs for solving both regression and classification issues [64,70]. SVMs are extensively used in IDS-based research as a result of their powerful classifying and their computing practicality. Thus, this makes them of great suitability for high dimensional data, though with criticality in the selection of a reasonable kernel function; therefore they are resource insatiable and demand memory as well as processing units for computations [46]. The RF algorithm [71] is an ensemble approach of supervised learning, which powerfully deals in an effective manner with uneven data yet is prone to issues of overfitting.

Unsupervised Learning
In the unsupervised learning scheme, representations as well as the structure from inputted data that are not labelled are learned by the algorithms. The aim of this algorithm is the modeling of foundational data structures for the prediction of data that are not known [64,70]. Some examples of these unsupervised learning algorithms include techniques used in reducing features such as clustering techniques, PCAs, and Self-Organizational Maps (SOMs). A PCA refers to an algorithm used in efficiently speeding up feature learning that is not supervised [72]. Many researchers have employed PCA in their work, before classification application [73]. Thus, clustering algorithms like K-means, among other algorithms for distance learning, are employed in the detection of anomalies. Furthermore, an SOM refers to an ANN employed for the reduction of IDS payloads [74]. Clustering algorithms for anomaly detection have their own drawbacks, such as subjectivity to initial conditions, instances including centroids, and the production of high rates of false positives [75]. Furthermore, an approach was proposed by Oh, Kim and Ro [76], wherein an SOM intrusion detection system in real time is able to classify related data and at the same time ensure cluster visualization. Via correlating features, the detection can perform map labelling produced by SOMs. Moreover, there is wide use of SOMs in anomaly detection systems. Some researchers have applied SOMs as HIDS [77].

Semi-Supervised Learning
This refers to a kind of supervised learning that makes use of data that are not labelled, in training. The data used for training comprises a small amount of labelled data and a huge amount of data that are not labelled. Thus, such learning could be useful in situations where there is an absence of large amounts of labelled data; an example is the case of a photo archive where not all images are labelled, thus consisting of some unlabeled data, one of which is used to enhance the accuracy of IDS [78,79]. In another study, two semi-supervised Spectral Graph Transducers used for classification and the Gaussian Fields technique were employed for detecting attacks that are not known, while a semi-unsupervised clustering approach, known as the Metric Pairwise Constrained K-Means (MPCK-Means), was employed for improving the manner in which detection systems perform [80].

Reinforcement Learning (RL) Methods
The ability to learn from nearby environments is among the initial methods of learning experienced by humans. On a natural note, human beings commence learning through interaction with their environment. Therefore, the inspiration behind RL is based on the neuroscientific as well as psychological aspects of animal behaviors, coupled with the mechanisms through which the control of environments can be enhanced by agents [81]. Furthermore, RL has the ability to ensure an agent is able to have knowledge on mapping situations to actions in an appropriate manner, so as to attain maximal rewards [82]. This agent has no memory of the expected actions to execute, but rather needs to first learn about the actions that can produce top-notch rewards; this is done via trial and error. The trial and error feature is the unique and main RL characteristic. Thus, the agent continuously learns from its past experience so as to attain higher rewards. Among recently successful methods in RL is the "deep Q network" [81]. In the literature, there are proposals regarding extending deep Q networks, with the inclusion of double Q-learning [82], non-stop control with deep RL [83], and prioritized experience replay [84].

Machine-Learning-Based IDS Observation for IoT Security
It is quite challenging to implement conventional approaches such as machine-learningbased IDS observation due to their complex structure of technologies, protocol stacks, and layers. Thus, there is a great need for IDSs for regulation of secure and reliable networking among IoT devices. Machine learning or deep learning techniques have been employed for developing IDS, some of which are self-organizing map (SOM), Naïve Bayesian (NB), Random Forest (RF), Support Vector Machine (SVM), as well as Artificial Neural Networks (ANNs), among others [47]. A past study proposed that the system call of a Fog-based system can assist in securing IoT devices [85]. The aforementioned work utilized DT for analyzing traffic across a network, with the sole aim of detecting traffic sources of suspicion as well as consequently detecting the behavior of DDoS. DT has been explored in different scenarios that deal with IoT system security. Some studies [86] have mentioned how DT can be used for the detection of intrusions, whereas others [87] informed on DT's use in identifying types of operating systems that are not known in IoT. Another study focused on the authentication of the physical player of IoT wireless devices [7], comparing DT performance with SVM and K-means when experimenting with Machine Learning algo-rithms aimed at discovering targeted attacks on medical devices. The authors concluded by stating the following: DT possesses the highest rate of detection, low FPR first training, and speed in prediction as compared to that of K-means and SVM. Nevertheless, it was also stated that they experienced a failure in the ability of the algorithm in detecting and providing results similar to those previously achieved; in cases where there is a familiarity between the attacker and the device, it has knowledge of the schedules as well as the patterns of data. DT achieved the highest overall accuracy, and it was also noted that supervised learning performed more suitably with regard to threat detection compared to that of unsupervised learning.
Relating to the IoT environment, a study [88] developed an Android malware detection system that had the ability to provide security for IoT systems; the scholars also applied a linear SVM to the developed system. There are some studies that indicate that SVM is a commonly used approach for issues relating to IoT security [5,6,[89][90][91][92][93]. Wei, Luo, Weng, Zhong, Zhang and Yan [94] compared Naive Bayes and DT algorithms for the detection of malicious mobile malware in android applications, inclusive of an Androidetect system. The Androidetect system refers to an automated tool for detecting malicious applications. Furthermore, some researchers [90,93,95] compared Naive Bayes with other Machine Learning algorithms. The comparison of SVM, Naive Bayes, NN, and KNN in [90] in detecting on-off attack on IoT devices via the use of smart trust management approaches revealed that a high rate of precision was found in Naive Bayes as well as the recall; however, this was not the same with the F1-score. Consequently, as informed by Kotenko, Saenko and Branitskiy [93], the testing data results revealed that the algorithm with the least training time was Naive Bayes, as compared to that of SVM and ANN, which had much longer training times. The reason behind the lesser time in training for Naïve Bayes is due to the use of primitive operations. Correspondingly, as informed by Park, You and Lee [95], there is a lower accuracy in detection with Naïve Bayes, as compare to SVM, Linear, or DT, due to the kind of data used in their study.
Moreover, RF has also been considered as an integrated learning, which entails the selection of diverse input samples from actual training sets via a technique of resampling using bootstrapping [89]. In the aforementioned reference, the scholars compared RF with other techniques, such as KNN, NN, and SVM, for the detection of crypto ransoms within IoT networks. The analysis of their performance of classifying algorithms concluded that RF had the second highest detection, accuracy, precision rate, and F-measure, behind KNN. In addition, another study, after comparing RF with NN, DT, SVM, and KNN techniques, in an experiment trying to detect DDoS within consumer IoT devices, found that all algorithms performed quite high, apart from that of SVM, whose performance was slightly lower [92].
Based on the previous descriptions, it is established that anomaly detection consists of two types of techniques: namely, supervised and unsupervised. The majority of the algorithms from the literature were used in achieving excellent results for the respective techniques applied. However, the application of diverse algorithms for detecting intrusions was done for the purpose of enhancing the performance of IDS in all facets, including clustering, classifying, and the selection of features. Table 4 presents a comparison between commonly used algorithms. The need to write more codes for solving challenges above two classes.
Detection of intrusion [99].  Unsupervised algorithms are generally a good choice when generating the labelled data is difficult. k-Means clustering can be used for private data anonymization in an IoT system because it does not require labelled data.
k-Means clustering is less effective than supervised learning methods, specifically in detecting known attacks [104].
Sybil detection in industrial WSNs [105] and private data anonymization in an IoT system [106]. Machine learning can be categorized into two phases: the training and classification phases. The training phase has the ability to learn how to distribute features, while in the classification phase, features that are already learned are applied as normal profiles for the detection of abnormalities [108][109][110]. In research by Jain, Duin and Mao [111], a model of statistical pattern recognition was developed. Furthermore, data for training are normalized during processing, and noise is removed from the data. The training phase ensures the extraction of features, a representative feature set used in training the classifiers of the processed training data. However, the classification phase ensures the application of the trained classifier in assigning test data to the selected training phase features [100,112].
Training data of high quality is necessary to achieve the most outstanding performance of machine learning IDS; thus it must contain a mixture of abnormal and normal patterns [113,114]. Features refer to the vital information that is extracted from raw data; they are essential for classification purposes as well as detection, and they influence the effectiveness of a machine learning IDS.
The selection of features for IDSs processes large amounts of audit data, containing numerous features. Nevertheless, not all the features are necessary for classifying the audit data network. Consequently, the selection of features is undertaken for discovering essential features or feature sets from all audit data features. Thus, feature selection could result in the improvement of performance during classification as well as make IDSs lightweight. One study [115] offered the following benefits of feature selection: (1) Reduced dimensionality of feature space, to limit storage requirements and increase algorithm speed.
(2) The removal of noisy, redundant, or unimportant data. (3) The immediate effect of data analysis tasks speeding up, reducing the time necessary for learning by an algorithm. (4) Data quality improvement. (5) The resulting model's accuracy is increased. (6) A feature set reduction helps in saving resources in the process of utilization or for future data collection. (7) Improvement in performance for predictive accuracy gain. (8) The adequate comprehension of data for gaining knowledge of the data generation process or data visualization. Feature Selection "Feature Learning" plays a special role in the build-up of IDSs, such that the accuracy is highly affected by the selected features [116]. Thus, diverse representations of features can help in addressing several threat detection aspects. In fact, some seem to be naive, if they carry fundamental details of a network or the software, while others seems to be rich if they represent deeper information [117]. Thus, the obtaining of features can be achieved via some processes or by a mix of these processes, namely selection, construction, and extraction.
Further objectives of feature selection as informed by [115,118] include (i) increasing the speed of classifiers and reducing the capacity of storage; (ii) conserving resources for upcoming data collection processes; (iii) obtaining additional information and comprehensive knowledge of the process of data generation. In the area of IDS, feature selection has been utilized in improving processes of classification. Three common approaches for selecting features are genetic algorithm, principal component analysis (PCA), and information gain. In addition, the selection of features can hinge on the embedded, wrapper, and filter methods [66,119]. The wrapper, filter, and embedded techniques, are discussed briefly in Table 5.
Filters methods can be further categorized into two groups, namely the feature weighting procedure and subset search procedure. The feature weighting procedure assigns weights to features individually and ranks them based on their relevance to the target concept [115,120]. In filter methods, the feature selection process is optimized for the classification procedure. Therefore, filter methods are much faster than wrapper methods and are better suited to highly dimensional datasets. The computational cost is high. Longer duration. Compared with filtering and embedding methods, the risk of overfitting is higher. As the number of features increases, it becomes more computationally unfeasible. If another classifier is used to predict, the optimization of the solution cannot be guaranteed.

Embedded
Investigates interaction more deeply than wrapper method Compared to the wrapper method, it is less computationally intensive, the running time is faster, and the risk of overfitting is less. As the number of data points increases, it is better than filtering methods in terms of generalizability.
Identifying a small group of features can be problematic.
Principle components (PCs) are new variables with two properties: (1) each PC is a linear combination of the original variables; (2) the PCs are not correlated to each other, and also the redundant information is removed [122]. The major areas of PCA application comprise analysis of image, compression of data, recognition of patterns, regression, and prediction of time series, as well as visualization. However, there are limitations in using PCA, such as (1) the assumption of linear relationship between variables. (2) sensible interpretations are based on assumptions of numerically levelled scaled variables, (3) a lack of probabilistic structure for models, such as those relevant in most contexts, like Bayesian decision as well as mixture modeling. Furthermore, PCA is a technique for the reduction of features that is applicable for the transformation of large variable sets into a mitigated set, thus ensuring that the bulk of the information represented is preserved in the large set. The technique also ensures the conversion of several probably related features to a smaller number of features that are not related, referred to as "principal components" [123,124]. Therefore, the main working principle of PCA can be utilized for feature selection to realize real-time intrusion detection for IoT systems; a previous work proposed a model that uses PCA for feature reduction and adopts SoftMax regression and KNN algorithm as classifiers. The author reported that the combination of PCA with these classifiers provided a timeand computing-efficient system that can be utilized in real time in IoT environments by reducing the model features [125].

Feature Extraction
Feature extraction involves transforming original features for the generation of additional, more significant features. As informed by Zheng, Vanderbeek, Daniel, Stambolian, Maguire, Brainard and Gee [126], "Feature extraction is being defined as the construction of linear combinations αTx of continuous features which have good discriminatory power between classes". One of the major challenges for the research on Neural Networks and other areas such as Artificial Intelligence is finding suitable representative multi-variate data. Feature extraction can fill this gap by mitigating complexity issues, thus proffering a simple data representation, wherein each feature space variable is denoted as a linear combination from the initial inputted variable.
System calls are of utmost importance in any operating system, as they ensure the depicting of computer processes, thus constituting a large amount of fragmented as well as unstructured texts, used by a regular HIDS for detecting cyber-attacks and intrusions. In the literature, some researchers have considered text representation techniques for classifying behaviors of processes via system call traces. A classical machine learning method works in the direction of representing features, feature extraction, and feature engineering. Nevertheless, due to advancement in embedded approaches used in machine learning, including deep learning, the importance of feature extraction and the engineering of features cannot be overemphasized. Thus, researchers usually adopt advanced deep learning techniques and methods of representing text for the purpose of capturing information related to sequence and contexts from system calls [125]. Some feature representation methods in the area of NLP used to convert system calls into feature vectors exist in literature and are discussed here. Techniques such as term frequency inverse document frequency (TF-IDF) and Term Document Matrix (TDM) are used for the estimation of vectors in the feature. However, they have their drawbacks, which include the inability to capture system call information in a sequential order [127,128]. The N-gram method of text representation has process capability of preserving the system call information in a sequential order. For example, N size can be 1, denoting unigram; 2, denoting a bigram; 3, denoting a tri-gram [17,129]. In terms of exploiting N-gram location for intrusion detection, there exist many IDSs that use the N-gram technique to analyze network packets payload; two examples are specified in [130][131][132]. The N-grams are used to model the language that characterizes a network traffic profile, since each different N-gram is interpreted as a different feature space used to represent the traffic. The N-gram technique has been used previously in fields like information retrieval [133] and statistical NLP [134]. With this technique, it is possible to extract sequences of symbols from a given input flow by using a sliding window of length n. At each position, a sequence of length n is considered.

Clustering
Clustering refers to a learning method that is not under supervision, and its major task is that of data mining. Furthermore, it divides the data into look-alike groups for the purpose of simplifying them; however, it might lead to loss of information [135]. Some researchers, such as [136], discovered that clustering techniques perform via splitting observed data into clusters, based on the measures of distance or provided similarities. Other approaches regarding anomaly detection based on clustering are of two categories. The first category ensures that training is provided for the model of anomaly detections via the use of unlabeled data, consisting of attacks as well as normal traffic. The second approach ensures that training is provided for the model via the regular data only, thus leading to the creation of a normal activity profile. The foundation behind the first approach lies in the fact that attack or anomalous data make up a small percentage of the overall data. Thus, if the assertion is correct, the detecting of attacks as well as anomalies can be done on the basis of cluster sizes, as the large clusters normally correspond with the normal data, leaving the other data points, referred to as outliers, corresponding to attacks.

The Advantage of Feature Extraction and Selection
Methods of extracting features have been proposed as a step prior to the processing state, to ensure a diminishing impact of class noise on the process of learning. It is thus essential for future data analysis, be it visualization, recognition of patterns, de-noising, and compression of data. Diverse methods have been developed for discovering more appropriate transformations.
One of the benefits attached to feature selection is the preservation of relevant related data of individual features; however, this depends on the size of the features; if the set of diversified original features is too small, there is a tendency to lose some feature information.
Furthermore, dimensionality reduction, referred to also as the extraction of features, entails that there would be a decrease in the feature space size, without any loss of information from the initial feature space. One of the disadvantages of extracting features is the lack of interpretability and details of the lost features [115]. Some vital features that could be possessed by IDS as informed by the literature are as follows [115,118]: (1) It is tolerant to fault, and continually runs with little or no need for human supervision. Furthermore, the IDS has the ability to recover from accidental or malicious activity causing system crashes. (2) It has the ability to resist subversion, thus making it impossible for an attacker to cause disabling or modification of the IDS. In addition, the IDS has the capability of detecting forceful modifications by attackers. (3) It reduces the overhead on systems for the avoidance of interference with normal methods of system operation. (4) It is configurable, thus being able to ensure correct implementation of security policies on systems under monitoring. The IDS also possesses the capability of adapting to changes in user and system behavior. (5) It is easily deployable; this is achievable via portability to diverse operating systems and architectures by simply installing mechanisms as well as making operation easy for users. (6) It is generic for the detecting of diverse kinds of attacks and is able to distinguish legitimate activities from an attack (false positives). In addition, the IDS does not fail to recognize actual attacks, referred to as false negatives.
The use of feature selection in detecting intrusion is mainly intended to eradicate data that are not important and those that are redundant. It is seen as a process whereby a subset of important features is selected, thus giving a full description of the problem with minimal degradation in performance [137]. Using all the features of a dataset does not necessarily guarantee the best performance from the IDS. It might increase the computational cost as well as the error rate of the system [115]. As the number of selected features increases, the required computation power, which is needed to process the data, increases, and vice versa. Therefore, due to correlations between features, selected features can achieve similar or better results in comparison to using all the features of a dataset.
The irrelevant or redundant data might contain false correlations that obstruct the learning process of the classifiers. Therefore, feature selection has a significant impact on intrusion detection system performance, as it reduces the computation cost, removes information redundancy, increases the accuracy of the detection algorithm, facilitates data understanding, and improves generalization [137,138].

Key Challenges
Dissimilar to computer networks that function by users, IoT networks are object driven, thus making it challenging for applying contemporary computer network security mechanisms. Thus, there is a need to ensure the securing of specialized tools as well as ensure the management and preservation of IoT networks and devices from vulnerabilities and threats that are surfacing. Some key challenges and directions for future research as taken from the literature outlined in this chapter are as follows: (1) The design of a singularbased-mechanism IDS is inappropriate for present-day complex and heterogeneous IoT networks. Thus, the proposal of a more suitable mechanism, which is comprised of a combination of diverse approaches might produce better results. Nevertheless, it is important to note that rather than centralizing a model, it could be more reliable to use the approach of distributed attack detection. (2) There is need for specialization of the "Rule-based/Specification-based method", which deals with the definition of laid-down rules for the behavior of benign networks. Additionally, this specialization needs to be done topologically, thus resulting in the creation of an overhead at frequent changes of topologies.
(3) There is a gap in the use of signature-based intrusion detection techniques with regard to present-day evolving zero-day attacks, wherein there is no advance definition of signatures or patterns. (4) Contrary to signature and rule-based approaches, there is a need to use anomaly-based IDS with intelligent techniques for learning, which has the capability of uniquely identifying intrusions that are either known or not known, thus producing more suitable speed and accuracy. (5) Due to constraints in resources, it is quite challenging to apply machine learning as well as other intelligent techniques in IoT. There is a lack of developed approaches that make use of the aforementioned techniques for IoT; there is need to validate current approaches diligently via diverse datasets in order to verify their level of effectiveness in real-time situations using IoT. (6) It is important to note that performance might be affected by intelligent techniques, such as Machine Learning, which needs higher processing power and memory to function effectively.

Proposed Method
The proposed methods as depicted in Figure 3, are split into different phases, as follows: • Phase 1: In this work, more focus is given to a simple and efficiently computable feature extraction technique, involving the pre-processing of data, comprising methods for extracting features based on Modified Vectors Space, formed from a system call trace raw data of fixed size, referred to as N-gram. The phase also compares techniques used in selecting between PCA, MI for memory use, and CPU time. • Phase 2: This phase covers the application of various anomaly-detection algorithms on intrusion detection, for enhancing the performance of IDS at all levels, including classification, clustering, and feature selection. Anomaly detection includes both supervised and unsupervised techniques. Several algorithms are employed for the purpose of achieving more suitable results for the detection of anomalies. Various algorithms for detecting anomalies have been applied to detect intrusion, so as to enhance the performance of IDSs at all levels, including classification, selection of features, and clustering. Based on the preceding descriptions of several algorithms for detecting anomalies, Table 6 presents a comparison of the most relative algorithms. This comparison provides a summary of the pros and cons of each respective algorithm. It is important to note that weaknesses exist for knowledge-based detection techniques. Furthermore, the detection of anomalies consists of both supervised and unsupervised techniques. This research proposes an overview of the techniques used in machine learning for the detection of anomalies. Experiments carried out have demonstrated that supervised learning methods significantly outperform their counterparts, the unsupervised methods. This is based on the criterion that the attacks in the test data are not known. Thus, of the supervised methods, the non-linear methods achieved the best performance; examples of these are multilayer and SVM methods. This is the model selection phase. • Phase 3: this refers to the evaluation performance stage, with regard to the rate of detection using Accuracy, FPR, Recall, FI-measure, ROC, AUC, CPU time, and energy consumption for testing on a Raspberry Pi.    The data used in this article comprise system call traces. These traces refer to relatively shortened series of position integers, referred to as system call numbers; these are inappropriate for inputting into a regular model of machine learning. For the purpose of transforming the inputted data's format into a form of matrix, there is a need to implement the extraction of features during the stage of pre-processing.
Stage (1) In this stage, the aforementioned approach is used. For the system call encoding, only 1-gramam and 2-gram as features are considered. In this stage, vectors from features are constructed via the consideration of only N-grams, which appear in the data that is under training. Also at this stage, a dense form of the data matrix is considered, where it is discovered to be too large, including being large for ample resource-based machines. Feature extraction refers to the process whereby imputed data is being represented in a compatible format with the algorithm used for modeling. Specifically, this paper is concerned about creating a simple technique for the extraction of features, which can as well be computed in an efficient manner.
Modified Vector Space Representation (MVSR) [143] serves as an extension to prior representations and takes into consideration the unique terms, only from the data used to train, via incorporation of the mechanisms needed to handle all unforeseen terms in the course of testing.
In this work, a feature set is considered as a correspondence of vector sets into traces of system call applications. There is a need to consider tuples of terms (or calls), not just the individual term in the model of vector space. The n-gram is defined as tuple c = (c1, c2, ..., cn) of n terms where n generically denotes a positive integer of small value. This modified vector space model is recurrently used in NLP and DNA and the analysis of protein sequences [128,144]. For a given n-gram, c, its weight w(c) in trace T is given, where f (c) denotes the frequency of c in T, and |T| represents the trace length in Equation (1).
Example (1) For illustration of the above notion, the following trace is considered: (6,174,174,174,6,45,33,192,33,192,174,174,6,174) (2) Note that |T| = 14. For n = 2, we calculate, for every 2-gram, c, its frequency, and its weight (see Table 7). The advantage of using this feature type is the inexpensive nature of computationally calculating the weight of an N-gram. Consequently, the N-gram approach results in an additional issue: the data matrix becomes more sparse. Thus, if the data matrix's dense form is taken into consideration, it would tend to be very large, even for machines with ample resources.
Stage (2) The second stage in feature extraction in this study is the use of compressed matrix formation [145][146][147]. Here, the data are entered in the form of a sparse matrix, where only the weights of features are visible in the trace, thus ignoring zero weights and keeping the non-zero values. Additionally, algorithms requiring a minimal runtime are implemented. Regarding the encoding of system calls, which has been done in this research, consideration was given only to 1-gram and 2-gram, which appear in training data, to be represented as features for experimentation.

Stage (3)
The following step is used in compensating for all missing N-grams found in the test data. In this research, a Linear Correlation Coefficient (LCC) was used for the compensation test, between values of all N-grams, as follows: where the definitions of A, B, C, and D are as follows: where ω (c) denotes the mean value of n-gram; c is the positive training dataset; and the sum is over all n-grams in the training dataset.

Feature Selection
Feature selection is needed for the simplification of the model, so it can achieve a more suitable performance and ensure reduction in the time used to train. This is performed by selecting a particular subset from a series of relevant features. In this research, after the analysis and comparison of feature selection techniques, there are three main methodologies of feature selection and also a PCA approach. A Filer method was used in ranking features according to their level of relevance via several criteria, such as the value of information, correlation, and the value of χ2. Furthermore, due to the low space and complexity in time encountered in the study, a common feature selection approach, called MI, was used for classification on the basis of the information value. According to Manning, Raghavan and Schütze [127], MI is defined as follows:

Phase 2: Modeling
In this work, Multi-Layer Perceptron (MLP) modeling was used. A presentation on the comparison of deference classifiers using 1-gram and 2-gram is given. The classifier used in modeling was chosen based on its wide acceptance rate as compared to other classifier algorithms. Table 7 provides a comparison of common classifiers used. Various algorithms used in detecting anomalies have been applied for the detection of intrusion for enhancing the performance of IDS. In reference to the prior descriptions of diverse algorithms used in detecting anomalies, comparisons have been made, and such comparisons summarize the pros and cons of the respective algorithms. In addition, the comparison deduces the weaknesses of knowledge-based detection techniques. Anomaly detection comprises supervised techniques; nevertheless, the design of the classifier must be in accordance with how most classifiers used at the base are designed, such as DT, NB, SVM, RF, and KNN algorithms for classification. The classifier's design is based on the majority, with base model algorithm for classification.

Artificial Neural Network (ANN)
An ANN works in three layers. The input layer receives data, in manner similar to dendrites. Processing of the input is conducted by the hidden layer, similar to axon and soma. The output later is responsible for producing the calculated outputs, similar to the terminals within a dendrite [148,149]. There are basically three types of ANN: supervised, unsupervised, and reinforcement [69,150].
A perceptron is the founding ground of an Artificial Neural Network, based on inspirations from models relating to biological aspects of machine learning. Furthermore, a perceptron mathematically represents the neuron, which is a fundamental unit of computation within the brain. As in the nervous system of humans, input signals are received by neurons from the dendrites, while the output produces signals in line with the axons, which in turn divert and create a connection between other neurons via a synapse of the dendrites, as depicted in Figure 4. ANN refers to a paradigm for information processing, based on biological structures and complex model relationships between inputs and outputs.   Thus, it is possible for a single perceptron-based mathematical model to denote a linear classifier. Mapping of the input is as follows: x = (x 1 , ..., x n ), a real-valued vector, to an output value f (x) in accordance with the following rule: where w = (ω 1 , ..., ω n ) denotes the weights of real-values; b denotes bias (ω 0 ). Thus, generalization can be made on the perceptron via application of non-linear transformation sets, ϕ i , in concordance with the inputted vector x. Thus, the output, y, is represented as where m denotes the number of variables or numbers; g represents the function of a nonlinear activation, an example of which is a Rectified Linear Unit (relu), or a sigmoid, as depicted in Figure 4. The perceptron is a simple model, which is of much benefit with regard to the approximation of numerous non-stopping functions. A way of improving the performance of the model entails the consideration of MLP, which is achieved via the addition of a minimum of a singular layer that is hidden to the above-stated perceptron model, as shown in Figure 4. However, a commonly used learning algorithm is backpropagation, due its generalist nature regarding algorithms of minimal mean squares within a linear perceptron [151]. Thus, the sigmoid and the relu functions are two functions commonly used in activating MLP learning models and are given by

Machine Learning Shallow Models
Different algorithms used in detecting anomalies have been applied for the detection of intrusion, with the sole purpose of enhancing the performance of IDS at all levels, inclusive of classification, clustering, and selection of features. More information is provided in the previous chapter on the literature review.
The conventional models used in machine learning (also referred to as shallow models) for IDS basically comprise the ANN, support vector machine (SVM), K-nearest neighbor (KNN), naïve Bayes (NB), decision tree (DT), and clustering, as well as combined or hybrid methods. Some of these methods have been studied for many years, and their methodology is proven to perform. Their focus is not just on the effect of detection but also on practical issues, which include management of data and effectiveness of detection. The pros and cons of various shallow models are presented in Table 6.

Phase 3: Evaluation Performance
In machine learning, the measuring of performances is very important. Thus, regarding issues of classification, the following can be considered:  Table 8 presents the confusion matrix that can be used for deriving additional measures. The definitions presented above are for purposes of deriving important measures, inclusive of recalls and precision. Precision (P) refers to the positive predictive values, recall refers to the true positive rate, and F1-measure represents the harmonic mean for both recall and precision. Consequently, accuracy refers to the adequate classification ratio of the overall number of examples existing in a dataset test. In this research, all of the available contemporary evaluation metrics widely used in the information retrieval area were used [8].
Furthermore, in evaluating the performance of an IDS, several metrics for evaluation need to be calculated via the confusion matrix values [17]. Based on the above values, the most-used metrics in evaluation are given in the following formulas.

Precision or Positive Predictive Value
This is the ratio of the amount of predicted attack traces as actual attack traces, from the overall number of traces predicted as attack traces.
Precision is defined as

Recall or Sensitivity
This is also referred to as the True Positive Rate (TPR). Moreover, it is a representation of the ratio of the number of attacks being predicted as actual attack traces, from the overall number of traces that are true attacks.
Recall is defined as

Accuracy
This refers to the proportion of true results (number of attack traces and normal traces detected correctly) from the overall number of samples.
Accuracy defined as where N = TP + TN + FP + FN, is the total number of samples.

F1-Measure
This refers to a measure that combines precision and recall into a single measure. It is calculated as the harmonic mean of precision and recall.
F1-measure is defined as

FP Rate
False Positive Rate (FPR) is a measure involving the amount of normal trace labelled as attack trace by a classifier.

FP Rate is defined as
Finally, regarding the study methodology, this work considered the ROC.

Curve and AUC
• Receiver Operating Characteristics (ROC) Curve: This refers to a graph of true positive rate against FPR. Thus, it gives a representation of how binary performs, based on the variations in the discriminated threshold.

•
Area under the ROC Curve (AUC): This refers to the coverage area surrounding the ROC curve. Thus, it is consonant to the possibilities of ranking by the classifier in a randomly selected positive instance, above the randomly selected negative instance [152,153].
AUC-ROC curve refers to a measure of performance for issues of classifying binary within diverse thresholds. ROC refers to the curve of probability, while AUC is a representation of the measure or degree of separation. Furthermore, it gives detailed information as to the amount of the model needed to carry out variances across classes. Thus, the higher the AUC, the better the model becomes for the prediction of 0 s and 1 s as 1 s. Furthermore, the rate at which ROC curves are steeped is of utmost relevance due to it being the best for the maximization of true positive rate while ensuring the minimization of the FPR. Normally, ROC curves, with regard to the classification of binary, are used for the purpose of understanding the outputs from a classifier. Thus, to ensure the extension of the ROC area as well as the curves to a multi-labelled classification, there is a need for output conversion to binary. This can be done via the drawing of one ROC curve at each label, though an ROC curve can also be drawn by taking into consideration each element from the label matrix as binary predictions (or micro-averaging).
In addition, there is a need to consider the consumption of power, CPU consumption, and throughput, as these are essential metrics for evaluating IDS, which are being run on numerous hardware based on a specific setting, including the network's high speed nature or limited hardware resources.

Experimental Setup
This work was based on evaluating the results of benchmark datasets. Nevertheless, currently available datasets lack real-life properties, thus explaining why the majority of systems for detecting anomalies in the intrusion are not applicable in the production environment [154]. Consequently, this leads to an inability to adapt to non-stop variances within networks, such as changes in topology, new nodes, and variations in traffic loads, among others. Therefore, to ensure the evaluation of the proposed framework and algorithm based on their level of effectiveness for detecting attacks in IoT, this research utilized an intrusion detection dataset. According to the literature, the most preferred datasets for intrusion detection are NSLKDD and KDDCUP'99 [154][155][156]. Furthermore, in this work, a more suitable and modern dataset, the Australian Defence Force Academy Linux Dataset (ADFA-LD) [157,158], was employed to train and evaluate the performance of the system.

ADFA-LD Dataset
The collection and generation of traces in system call based on a large scale for the purposes of research for simplistic tasks, as well as the number of available datasets for evaluating IDS, are quite limited. Recently, some researchers proposed ADFA-LD datasets as a novel alternative for outdated datasets [157,159]. From their proposal, it was revealed that there exists a higher level of complexity in ADFA-LD as compared to other contemporary datasets. A preliminary performance analysis was conducted on ADFA-LD by some scholars [160,161], which revealed the ineffectiveness of Euclidean distance as a medium for separating normal behaviors from those that are instigated by attacks. Table 9 presents the distribution between system call traces, splitting them into normal and different types of attacks under three ADFA-LD groups. Table 9. ADFA-LD dataset composition based on binary classification of the types of system call traces.

Dataset
No. of Traces ADFA-LD Class of Traces Python workbench was employed for evaluating the demonstration of the modified vector space on the dataset of ADFA-LD. This workbench has the ability to host concurrent algorithms of machine learning, which also are used for the selected datasets, under different sizes of terms. A total of nine algorithms for classification were selected out of six diverse groups [17]. Furthermore, the datasets were gathered and converted to representations of modified vector space for diverse sizes of terms. The term sizes one and two were chosen for this experiment. The experiment on the ADFA-LD dataset was executed in favor of the classification of binary. Out of the two labels, one was used for each trace, either attack or normal. Additionally, each of the selected algorithms was run independently according to the chosen options for data conversion. Table 9 gives a description of a number of extracted features from the ADFA-LD dataset for the size terming, via a representation of modified spaces within the vector.
In comparison with aforementioned dataset benchmarks, ADFA-LD is seen to have the most representative nature for handling present-day cyber-attacks. In addition, it offers a well-organized framework for evaluating performance and developing intrusion detection systems (IDSs). Furthermore, many system call traces are contained in ADFA-LD, some of which are retrieved based on diverse situations for the purpose of simulating real-life scenarios. Additionally, they comprise a quite concise collection of system call traces, which represents present-day vulnerabilities on system levels as well as attacks. Thus, the generating of ADFA-LD is from a local Linux server (having a Linux kernel of 2.6.38), offering advanced as well as popularly used computer systems. ADFA-LD generated attacks comprise Webshell, Adduser, Hydra-FTP, Meterpreter, and Java-Meterpreter. Each of these attacks is capable of generating up to 8-20 traces of attack. Table 10 presents the number of retrieved traces according to individual attack type. Additional information with regard to the technical aspects of the dataset is available in the literature [157].

Software and Platform
The experiments are implemented using Python 2.7 softwar; Scipy and Numpy are used for experiments and the reprocessing phase. In addition, for the aspect of visualization, the Matplotlib library was used, while the ScikitLearn library was used for machine learning [162][163][164][165].
In this work, Python programming language is used to implement machine learning techniques. Python is a great language for machine learning for numerous reasons. First, it has a clear syntax; second, with python, the manipulation of text is very easy. Thus, the majority of researchers, including those associated with private and public organiza-tions, make use of Python, therefore leading to its rapid documentation and development. Additionally, the compilation of scientific libraries, such as Scikit Learn, an open-source library for machine learning, developed as part of a project for the Google summer code, established around 2007 by David Cournapeau [166,167]. The language used in writing this code is Python, which incorporates both the scientific and numerical libraries of python, such as NumPy, SciPy, Panda, and matplotlib. Furthermore, it also makes available provisions for effective machine learning tools, such as regression algorithms, classification, and clustering. Furthermore, it supports feature extraction approaches and offers tutorials for understanding the respective concepts [19,164].
The testing aspect of the experiment was done via Raspberry Pi 3, a small, lowpowered single board computer made by the UK Raspberry Pi Foundation. The Raspberry Pi has gained popularity among computer enthusiasts and has been employed for diverse projects on robotics and IoT, thus making it a perfect choice for IoT-device-related applications. Among a lot of applications for IoT, Raspberry Pi has been deployed in the form of a Fog node, referred to in [168][169][170]. Thus, this experiment makes use of Raspberry Pi 3, the third generation of Raspberry Pi.
Raspberry Pi 3 comes with a low-powered yet high-speed CPU, named BCM2835 (1.2 GHz 64-bit quad-core ARMv8 CPU), with approximately 40 General Purpose Input Output (GPIO) pins, which are used in connecting sensors, 1 GB of RAM, SD card 16 GB class 10, four USB ports, Video Core IV 3D graphics core, full HDMI port, and Ethernet port. Figure 5 presents the hardware components' location on the Model B of Raspberry Pi 3. Correspondingly, Raspberry Pi is installed with Debian Linux based OS, referred to as Raspbian, which serves as an optimization for the Raspberry Pi hardware. Furthermore, Raspberry Pi helps in evaluating the time taken in response by a classifier, based on specific IoT hardware [4]. Thus, the consumption of energy by Raspberry Pi is calculated via a commonly used technique, not significantly different from a conventional computer system. The voltage value of Raspberry Pi model 3 B is around +5.1 V. With regards to the requirements for current, a keyboard and HDMI port were connected to the Raspberry Pi, resulting in a 400 mA electrical current requirement. The power consumption is yielded by the combination of the products of both the Voltage (V) and the Current (I), which is then multiplied by the CPU time (t) to calculate the consumed energy [21,171].
Energy Consumption = P * t Electronics 2021, 10, x FOR PEER REVIEW 40 of 58 resulting in a 400 mA electrical current requirement. The power consumption is yielded by the combination of the products of both the Voltage (V) and the Current (I), which is then multiplied by the CPU time (t) to calculate the consumed energy [21,171].

Results and Discussion
The experiment conducted in this paper was aimed at addressing the need for the development of an HIDS anomaly based on light weight, which is responsive to most recent attacks via the usage of ADFA-LD datasets, to employ in IoT-based Fog computing. Thus, the research tries to address the following issues:

Results and Discussion
The experiment conducted in this paper was aimed at addressing the need for the development of an HIDS anomaly based on light weight, which is responsive to most recent attacks via the usage of ADFA-LD datasets, to employ in IoT-based Fog computing. Thus, the research tries to address the following issues: • Many algorithms were employed for the purpose of achieving adequate results using these techniques. Thus, the study proposes a general viewpoint for the detection of anomalies via machine learning techniques. The conducted experiments also proved that supervised learning approaches perform far better than unsupervised approaches, provided there are no unknown attacks in the test data. Previous studies revealed that out of numerous supervised approaches, the best performance can be achieved by RF, SVM-RBF, and MLP; however, there are differences with regard to their individual capabilities to detect attack class effectively. In order to compare with other machine learning methods, contrast experiments are designed at the same time. This is done in the binary classification experiments, for using more loading instances for lightweight evaluation.
The first step in this experiment involves the computation of 1-gram and 2-gram, for the system call traces, whereby selections for the top 60 and 80 features for 1-gram, followed by the top 80 and 120 features for the 2-gram, were made on the basis of the value of their information using MI, which were combined to train the model. Comprehensively, the numerals 60, 80, and 120 were selected for the reduction of the data size as well as data representation [21].
Furthermore, the classifiers were compared with regard to their accuracy, recall, F1measure, FPR, ROC Curve, and AUC. In addition, a comparison was made of the testing time for CPU usage and energy consumption between the different classifiers using 1-gram and 2-gram, which are the maximum features on the Raspberry Pi platform.
This work did not focus on memory usage testing time, because after the collection of results and analysis, it was discovered that consumers for each classifier did not exceed 350 Bytes, as shown in Figure 6, for the KNN classifier. Therefore, it is not considered important in the Raspberry Pi platform.

Performance of the Feature Extraction and Feature Selection
In this section, the performance of the feature extraction and feature selection methods are explored in terms of CPU time and memory usage. The 1-gram and 2-gram transformation modified vector spaces (feature extraction) are tested with two feature selection methods: Principle Component Analysis (PCA) and Mutual Information (MI) (filter method). Based on the test, the best feature selection method in term of its "light weight" was selected for implementation with the MLP classification method. Figure 7a,b show the CPU time and memory usage, respectively. Based on this figure, we can notice that (MI) outperforms PCA, which on the contrary has a lesser performance for CPU time with the consumption of less memory. measure, FPR, ROC Curve, and AUC. In addition, a comparison was made of the testing time for CPU usage and energy consumption between the different classifiers using 1gram and 2-gram, which are the maximum features on the Raspberry Pi platform.
This work did not focus on memory usage testing time, because after the collection of results and analysis, it was discovered that consumers for each classifier did not exceed 350 Bytes, as shown in Figure 6, for the KNN classifier. Therefore, it is not considered important in the Raspberry Pi platform.

Performance of the Feature Extraction and Feature Selection
In this section, the performance of the feature extraction and feature selection methods are explored in terms of CPU time and memory usage. The 1-gram and 2-gram transformation modified vector spaces (feature extraction) are tested with two feature selection methods: Principle Component Analysis (PCA) and Mutual Information (MI) (filter method). Based on the test, the best feature selection method in term of its "light weight" was selected for implementation with the MLP classification method. Figure 7a,b show the CPU time and memory usage, respectively. Based on this figure, we can notice that (MI) outperforms PCA, which on the contrary has a lesser performance for CPU time with the consumption of less memory. The following stage, traces in the system call, makes use of a sparse matrix formatting technique from Python to ensure reduction in the size of the memory. This approach to formatting is of much relevance, as it was able to reduce a reasonable amount of data in the matrix, as 98.2% of the 2-gram matrix entries were zero values. Figure 8 depicts the matrix's compressed form, which is around 4% of the initial dense matrix memory space.  The following stage, traces in the system call, makes use of a sparse matrix formatting technique from Python to ensure reduction in the size of the memory. This approach to formatting is of much relevance, as it was able to reduce a reasonable amount of data in the matrix, as 98.2% of the 2-gram matrix entries were zero values. Figure 8 depicts the matrix's compressed form, which is around 4% of the initial dense matrix memory space.

Performance of the Classification Methods
Subsequent to the normalization of the processed data for more suitable performance, Figure 9a presents a comparison of recall, F1-measure, and accuracy between the different classifiers, using 80 features for 1-gram and 120 features for 2-gram, thus selecting the highest number of features to provide more efficiency and to help in determining the classifiers of good performance as well those fit for lightweight techniques.
Based on Figure 9a, it is noticeable that MLP outperforms the other classifiers, in accordance with their accuracy, recall, and F1-measure. Correspondingly, the Naive Bayes and Decision Tree Models achieved the worst results for these metrics.
The following stage, traces in the system call, makes use of a sparse matrix formatting technique from Python to ensure reduction in the size of the memory. This approach to formatting is of much relevance, as it was able to reduce a reasonable amount of data in the matrix, as 98.2% of the 2-gram matrix entries were zero values. Figure 8 depicts the matrix's compressed form, which is around 4% of the initial dense matrix memory space

Performance of the Classification Methods
Subsequent to the normalization of the processed data for more suitable perfor mance, Figure 9a presents a comparison of recall, F1-measure, and accuracy between the different classifiers, using 80 features for 1-gram and 120 features for 2-gram, thus select ing the highest number of features to provide more efficiency and to help in determining the classifiers of good performance as well those fit for lightweight techniques.
Based on Figure 9a, it is noticeable that MLP outperforms the other classifiers, in ac cordance with their accuracy, recall, and F1-measure. Correspondingly, the Naive Bayes and Decision Tree Models achieved the worst results for these metrics.   In terms of FPR, Figure 9b reveals that RF, SVM-RBF, and MLP models outperform by approximately 5% compared to others, which achieved 17.13%. It is also noticeable that RF and SVM-RBF models achieved higher FPR than MLP, up to a difference of nearly 0.08% FPR. This value is meaningful with reference to the rate of detection; however, it is observed in Figure 9c, which shows a comparison between ROC curves, that it is a graph of true positive rate against FPR. It is also noticeable that MLP outperforms the other classifiers, as it has the highest range over curve. In addition, the AUC value is equivalent to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. MLP showed the highest value (95.7%), and this represents that it was able to achieve a higher performance than the other classifiers. Furthermore, Figure 9d shows a comparison of CPU usage for the testing time between the different classifiers. Based on this figure, it is noticeable that Decision Tree (DT) and MLP outperform the other classifiers in terms of the testing time. However, SVM-RBF, RF, and KNN models had the worst values for CPU usage for the testing time.
In terms of energy consumption, Figure 9e depicts the Decision Tree (DT), whereby MLP models outperformed by up to around (0.008) Joules the others models, including the KNN model (4.3 Joules). In other words, the MLP outperforms the other classifiers in terms of accuracy, recall, F1-measure, and FPR and achieved low values for CPU usage and energy consumption.
According to the above results, the best performance is given by the MLP model. Furthermore, MLP also did well with regard to the amount of CPU time and consumption of energy achieved for all instances during the testing phase, as compared to other classifiers. There were only a few cases where simple performances were provided by these techniques. However, observing the MLP model, it is obvious that it achieved a good performance and is also compatible with lightweight devices.  In terms of FPR, Figure 9b reveals that RF, SVM-RBF, and MLP models outperform by approximately 5% compared to others, which achieved 17.13%. It is also noticeable that RF and SVM-RBF models achieved higher FPR than MLP, up to a difference of nearly

Performance on the Raspberry Pi
Raspberry Pi has a significantly slower speed as compared to other regularly used computers; thus its limited resources for computing makes it an appropriate platform to test the performance of the model. The model's performance for this experiment was evaluated via a Raspberry Pi 3 of the Model Type B. The performance showed the median for each of the 10 iterations tested. Further observation showed that there were 350 instances (i.e., "normal" = 200, "attack" = 150) for the experimental testing of the test set.
Furthermore, the optimization of the CPU's energy and time could be achieved via the utilization of Fog Computing's distributive nature. It was observed that the average time used in testing was around 0.004 s, thus indicating a reasonable delay, as applicable to many IoT-inclined applications, and the energy consumption rate was 0.008 Joules.
Based on the benchmark, using the ADFA-LD training set and testing set, the experimental results show that for binary classification, the intrusion detection model of MLP-IDS through the training set and testing set has higher accuracy than other machine learning methods and maintains a high recall, F1-measure, and accuracy rate as well as the highest ROC curve with a high value of AUC, even in the case of FPR. The FPR rate was low for the other models, such as SVM-RBF and RF; however, though they outperformed, they had higher CPU time and energy consumption for the testing process and thus were inappropriate for the lightweight model. Consequently, the DT model outperformed the MLP model with respect to CPU time and energy consumption for the testing process, but maintained a lower rate of recall, F1-measure, and accuracy as well as a low ROC curve and low value of AUC; however, it also achieved a high value of FPR as compared to the other machine learning methods. Of course, the model proposed from this experiment will reduce the time and energy consumption for the testing process more effectively, as it boasts a higher accuracy than the other machine learning methods.

Performance of the MLP Methods (Parameter)
This work employed an ANN approach to serve as the computational model. This is due to its influence on biological neural network characteristics for incorporating intelligence into the proposed methods. Thus, a particular type of ANN is represented in the form of a direct graph, for the purpose of transmitting diverse information from the system over the edge and across nodes, without the formation of a cycle. To achieve this, a MLP model was adopted. However, in order to achieve an outstanding performance, it is important to find an optimal parameter when modeling an MLP [162,165].
Fully connected layer: This layer is referred to as a layer that is fully connected due to the connectivity of the layer's unit with all other units around the succeeding layer. From a general perspective, high dimension mapping can be easily achieved via layers that are fully connected. This will result in more accuracy of the output. The function for non-linear activation used in this experiment is ReLU.
Batch Normalization and Regularization: Dropout (0.01) and Batch Normalization were employed amidst layers that were fully connected, for the purpose of over-fitting obviation as well as to ensure a speed-up of the model training of the Deep Neural Network (DNN). Regarding the alternative architectures, it was discovered that DNNs were able to over-fit the data used for training very easily, without the need for regularization, including situations where training was conducted for a large number samples.
Classification: This is the last layer, and it is fully connected as it makes use of the activation function from sigmoid for the classification of binary, as well as the activation function from softmax for the classification of multi-classes. Thus, the definition of the function for prediction loss in sigmoid is achieved via binary cross entropy; moreover, the definition for the prediction loss of softmax is achieved via categorical cross entropy as follows: The estimation for the prediction loss of Binary classification is achieved via binary cross entropy, given by where pd denotes a vector of the predicted probability for all the testing dataset samples; ed denotes a vector of expected class label, for which the values lie between 0 and 1 The prediction loss for Multi-class classification is estimated using categorical cross entropy, given by where ed represents the true probability distribution; pd denotes the predicted probability distribution. Furthermore, in this research, adam was used as an optimizer for ensuring the minimization of the loss of categorical cross entropy as well as binary cross entropy [129,172].
As MLPs are parameterized, as shown in Table 11, the performance is dependent on the optimized parameters. Determining the parameter that is optimized for both MLP network topologies and network parameters was achieved via the ADF-LD dataset. For identifying which parameter is suitable for MLP, this research employed the use of an architecture of medium size, for the experimentations within specific hidden units, rate of learning, and the functions of activation, as follows: default (Activation = 'relu', Solver = 'adam', Alpha = 0.0001), a small node size, and a single hidden layer. The architecture of MLP contains three layers. The first is the input layer, the second is the hidden layer or fully connected layer, and the third is the output later. The advantages of Multi-layer Perceptron (MLP) are that it is able to (1) learn nonlinear models and (2) learn models in real-time (online learning) via partial fit.
The disadvantages of Multi-Layer Perceptron (MLP) include the following: (1) MLP has hidden layers consisting of a non-convex loss function and comprising the existence of more than a single local minimum. Thus, accuracy of diverse validations can occur due to the initialization of various random weights. (2) MLP requires the tuning of several hyperparameters, including numerous hidden neurons, iterations, and layers. (1) MLP has a sensitivity to feature scaling [129].
The evaluation performance for MLP parameters (Recall, F1-Measure, Accuracy, FPR) is illustrated in Figure 10. The evaluation of light weight is based on the following parameters, as illustrated in Figure 11: Evaluation of CPU Usage MLP Parameter; Evaluation of Memory Usage MLP Parameter; Evaluation of Energy (Consumption). This process includes exploring all factors that help to reach the goals of achieving a light weight. This study obtained performance under the default values of the MLP model that were deemed the most appropriate; however, some parameters were used to help provide the light weight. The performance from comparing both alpha values at an equal point of 10.0 shows that there is an improvement in both the processor usage and energy consumption. in addition, when using activation "identity", it contributes to the support of light weight, while the performance of both "identity" and the alpha 10.0 is poor.

Performance of the MLP Methods (N-gram)
The experiment is performed via combining the top 60 and 80 for the 1-gram, whereas for the 2-gram, it combined the top 80 and 120, with several numbers of nodes at the hidden layer, after which the results underwent evaluation. Figure 12a illustrates the model performance based on accuracy, via the use of combining four top features and their N-grams, accordingly. From the results, the best accuracy value was achieved for 4 nodes at the hidden layer, which used 80 1-gram and 120 2-gram.
With regards to the recall values, as depicted in Figure 12b, the best value achieved around 97%, which was possible under three cases, namely the 2 node model via 80 1-gram and 80 2-gram; the 4 node model via 60 1-gram and 120 2-gram; the 6 node model via 60 1-gram, and 120 2-gram. The F1 measure values for total combinations of 1-gram as well as 2-gram are illustrated in Figure 12c. The best value for the F1-measure was achieved on the 6 node model via 80 1-gram and 80 2-gram.
In terms of FPR, Figure 12d shows that when increasing the number of nodes and the number of combinations of 1-gram and 2-gram, the models outperform. Figure 12e shows that the best Area Under Curve (AUC) value achieved during the experiment was for 6 nodes at the hidden layer, with 80 1-gram and 120 2-gram, which achieve a value of 94.95%.
In terms of CPU usage and energy consumption, Figure 12f,g depicts the low values. When using a small number of nodes and small numbers for all combinations of 1-gram and 2-gram, high values of CPU usage and energy consumption result. When increasing the number of nodes and the number of combinations of 1-gram and 2-gram, the best consumption value is achieved on the 2 node model using 60 1-gram and 80 2-gram, which outperformed by up to around (0.0025) seconds and by a (0.005) Joule energy consumption rate.
Furthermore, comparisons were made among all performances, such as for accuracy, recall, F1-measure, FPR, AUC, CPU usage, and energy consumption of the models via diverse nodes. These comparisons revealed that the overall outstanding performance was attained at the point where 2 or 4 nodes were used at the hidden layer. However, it is worth noting that the 5 and 6 node models also produced good values for the recall rate, similar to that of the 4 node model. Thus, conclusions can be drawn from the above analysis as to the fact that the model can perform well, provided the hidden layer contains a minute number of nodes. The use of 4 nodes at the hidden layer is adequate for the majority of applications. Overall, it can be postulated that the 4 node model of a hidden layer performs far better than models with a greater number of nodes, working on a mix of several 1-gram and 2-gram. The implication of this is that the addition of more nodes will yield an insignificant improvement in the performance of the model.   worth noting that the 5 and 6 node models also produced good values for the recall rate, similar to that of the 4 node model. Thus, conclusions can be drawn from the above analysis as to the fact that the model can perform well, provided the hidden layer contains a minute number of nodes. The use of 4 nodes at the hidden layer is adequate for the majority of applications. Overall, it can be postulated that the 4 node model of a hidden layer performs far better than models with a greater number of nodes, working on a mix of several 1-gram and 2-gram. The implication of this is that the addition of more nodes will yield an insignificant improvement in the performance of the model.

Conclusions
In this article, a novel and intelligent architecture of an IDS is presented. To address the aforementioned limitations of current systems, the IDS presented here includes the main function, type, and normal behavior profile of each IoT device connected to the network; the type of attack deployed is classified. Evaluation of the performance of applying a supervised machine learning approach to automate each function demonstrates that the proposed architecture can successfully distinguish between IoT devices on the network and whether network activity is malicious or benign, and it can detect which attack was deployed on which device connected to the network automatically. In addition to the experimental results, this study provides resources that can further support research into automating IoT-based cyber-attack detection.
In this work, a methodology was presented to develop a lightweight IDS-based machine learning model that can be used to identify potential attacks against IoT security, considering their impacts and the ways of mitigating as well as recovering from these attacks. In this work, the Fog Computing concepts and its security challenges were discussed. • The proposed method is tested using the ADFA-LD dataset for anomaly HIDS. This study obtained performance under the default values of the MLP model that were deemed the most appropriate; however, some parameters were used to help provide the light weight. The performance from comparing both alpha values at an equal point of 10.0 shows that there is an improvement in both the processor usage and energy consumption. in addition, when using activation "identity", it contributes to the support of light weight, while the performance of both "identity" and the alpha 10.0 is poor.

•
The test is executed with 2, 3, 4, 5, and 6 nodes of the MLP single hidden layer. From the experiment, we found that 2 nodes of the MLP single hidden layer with 80 1-gram and 80 2-gram were able to achieve 95% accuracy, 97% recall rate, 94% F1-measure, and FPR of 7%, and an AUC of 95%. This is comparable to the higher number of nodes (i.e., 3 nodes to 6 nodes), which give similar results. Since the 2 node model has lower complexity, and there is no significant advantage when a higher number of nodes is implemented, the 2 node implementation is suggested for the Fog device. • Raspberry Pi, which acts as the Fog device with its limited computational resources, is a suitable platform for testing the model performance. The performance that shows the median for the CPU time usage is approximately 5 ms, and the energy consumption is approximatively 9 mj. The testing time could be optimized by utilizing the distributed nature of Fog Computing.
IoT end devices need novel solutions to protect their privacy. The current paper employed the application of the concepts of "A Lightweight Perceptron-based Intrusion Detection System" using Fog Computing to build a distributed lightweight security solution with high lifespan for IoT security.

Future Work
There is a wide gap in the area of IoT device capacity as well as intelligence, as these IoT systems are known to produce large heterogeneous data every second. Thus, efficient approaches for managing huge data produced by IoT systems is a tentative drive for future work. The application of technologies like Fog Computing, big data, and the cloud can be useful in ensuring a large exchange of information across networks. Furthermore, there is hope with regards to efficiency of the technology in being able to manage huge heterogeneous data in an efficient and secure manner. The concept of a holistic architecture for IDS in the IoT has begun to be explored. Still in its infancy, this trend will grow, with continued evaluation of IDS implementation to improve security of alert traffic and management and further development of applications such as alert correlation and autonomic management systems.