Towards a Lightweight Detection System for Cyber Attacks in the IoT Environment Using Corresponding Features

: The application of a large number of Internet of Things (IoT) devices makes our life more convenient and industries more e ﬃ cient. However, it also makes cyber-attacks much easier to occur because so many IoT devices are deployed and most of them do not have enough resources (i


Intrusion Detection System (IDS)
A vast number of Internet of Things (IoT) devices have been deployed in many applications as a result of the significant development of related technologies.At the same time, the problem of cyber-attacks has become a challenging issue.This is because most IoT devices have very limited resources (e.g., storage and computation capacity); thus, they cannot carry out complicated intrusion detection systems.It was said that global mobile data traffic increased by about 71% in 2017 over the previous year, and almost all of the mobile traffic data will originate from smart devices by 2022 [1].The number of IoT malware in 2017 increased about sevenfold compared with that in 2013.The peak of cyber-attack traffic launched by a huge number of Mirai-infected IoT devices reached 620 Gbps [2].There were 10,263 botnets hosted in different IoT devices identified in 2018 [3].Another distributed denial of services (DDoS) attack from compromised IoT devices, called "IoTroop," was discovered by Check Point in 2017 [4].There were 13,000 IoT devices involved in this powerful attack targeting financial sectors [4].
Figure 1 shows the typical intrusion detection system for the IoT environment.The detection system monitors the IoT components, and it will generate the alert to the users or the response modules to process against the suspected threats as soon as it is decided that the traffic pattern is an attack.The machine learning-based approach needs to train the detection model using a suitable learning classifier.Almost all misuse-based approaches use a pattern matching algorithm to determine whether or not the incoming traffic pattern is an attack.All the above-mentioned systems need to capture the normal traffic patterns or/and the attack signatures patterns for building their detection systems.The number of IoT malware in 2017 increased about sevenfold compared with that in 2013.The peak of cyber-attack traffic launched by a huge number of Mirai-infected IoT devices reached 620 Gbps [2].There were 10,263 botnets hosted in different IoT devices identified in 2018 [3].Another distributed denial of services (DDoS) attack from compromised IoT devices, called "IoTroop," was discovered by Check Point in 2017 [4].There were 13,000 IoT devices involved in this powerful attack targeting financial sectors [4].
Figure 1 shows the typical intrusion detection system for the IoT environment.The detection system monitors the IoT components, and it will generate the alert to the users or the response modules to process against the suspected threats as soon as it is decided that the traffic pattern is an attack.The machine learning-based approach needs to train the detection model using a suitable learning classifier.Almost all misuse-based approaches use a pattern matching algorithm to determine whether or not the incoming traffic pattern is an attack.All the above-mentioned systems need to capture the normal traffic patterns or/and the attack signatures patterns for building their detection systems.The intrusion detection system can be divided into the host-based detection system (HIDS) and the network-based detection system (NIDS).Formally, the traditional IDSs cannot be hosted on ordinary IoT devices because of limitations in the resources of such devices.Thus, the network-based IDS will be the only possible way to implement the detection system for the IoT environment.Anomaly detection and misuse-based approaches are the primary detection mechanisms in IDSs.Misuse-based detection systems are implemented using predefined attack signatures.Thus, they cannot detect new kinds of attacks and new variants of known attacks [5].Although anomaly detection systems may be effective for new kinds of attacks, the problem of the high false positive rate is one of the main challenges [6,7].Moreover, the presence of so many different natures of a wide variety of IoT devices is another challenge for implementing an anomaly detection system.Additionally, the IoT devices-based IDSs must be lightweight because of their computation and storage capacity.Most of the early IDSs were deployed in powerful personal computers or servers.However, there are numerous IoT devices rapidly developing in recent years.These devices are inexpensive, portable, and can be used everywhere.Moreover, these devices only need low power consumption.Therefore, it is not suitable to use highly powerful computers to protect against the cyber-attacks in the IoT environment because they have more power consumption rather than many IoT devices, and the IDS deployment cost can be higher than the setting up of the IoT devices for a small environment.Therefore, the researchers are more interested in deploying the IDS on the The intrusion detection system can be divided into the host-based detection system (HIDS) and the network-based detection system (NIDS).Formally, the traditional IDSs cannot be hosted on ordinary IoT devices because of limitations in the resources of such devices.Thus, the network-based IDS will be the only possible way to implement the detection system for the IoT environment.Anomaly detection and misuse-based approaches are the primary detection mechanisms in IDSs.Misuse-based detection systems are implemented using predefined attack signatures.Thus, they cannot detect new kinds of attacks and new variants of known attacks [5].Although anomaly detection systems may be effective for new kinds of attacks, the problem of the high false positive rate is one of the main challenges [6,7].Moreover, the presence of so many different natures of a wide variety of IoT devices is another challenge for implementing an anomaly detection system.Additionally, the IoT devices-based IDSs must be lightweight because of their computation and storage capacity.Most of the early IDSs were deployed in powerful personal computers or servers.However, there are numerous IoT devices rapidly developing in recent years.These devices are inexpensive, portable, and can be used everywhere.Moreover, these devices only need low power consumption.Therefore, it is not suitable to use highly powerful computers to protect against the cyber-attacks in the IoT environment because they have more power consumption rather than many IoT devices, and the IDS deployment cost can be higher than the setting up of the IoT devices for a small environment.Therefore, the researchers are more interested in deploying the IDS on the resource-constraint devices, like Raspberry Pi.In this study, we implement the machine learning-based lightweight IDS on Raspberry Pi to protect against the specific attacks on the IoT environment.We also proposed the new feature selection method, named correlated-set thresholding on gain-ratio (CST-GR), to make the system lightweight.

Common Attacks in the IoT Environment
The most challenging of IoT security is the botnet attacks, like Bashlite, Mirai, Hajime, and so on.The botnet attacks can be classified into DDoS, pushing, identify theft, information leakage, and keylogging [8].The botmasters conduct port scanning to find the vulnerabilities in them to infect these devices by port scanning, as well as to perform remote mapping networks for gathering the information by operating system fingerprinting (OS fingerprinting).Most of the botnets launched the DDoS attacks to disrupt a service, and to make it unavailable to legitimate users [6].DDoS attacks can occur in both the network layer and the application layer in the IoT environment [9].Data theft and privacy issues are the specific security issues in the application layer of the IoT environment [9].The study [10] addressed the challenges of data disclosure and privacy violations to the users.Thus, we mainly focus on the detection of the following three specific attacks in the IoT environment: 1.
Probing attacks (reconnaissance): These are malicious activities to gather information about the targets through remote scanning.They are often categorized by the two subclasses of port scanning and OS fingerprinting.

2.
DDoS attacks: These are launched collaboratively by many compromised hosts (called bots).Such attacks try to disrupt the availability of services to legitimate users.

3.
Information theft attacks: In these attacks, an adversary seeks to obtain sensitive data.They can be subcategorized into data theft and keylogging.

Our Contributions
Our main contributions in this paper are as follows.

1.
A machine learning (ML)-based lightweight IDS is proposed and implemented on a Raspberry Pi system.2.
To overcome the challenges of the resource constraint problem, a novel feature selection algorithm called correlated-set thresholding on gain-ratio (CST-GR) is proposed for selecting essential features.In our experiment, the number of features is greatly reduced by this algorithm.

3.
The essential features are selected for each specific kind of attack.Thus, good detection performance can be expected.

4.
The detection performance of our proposal is examined in detail using the botnet dataset, Bot-IoT [6], which is collected in a simulated IoT environment.We observe that the CST-GR algorithm can significantly reduce the processing time with almost no sacrifice on detection accuracy.We also observe that, without the help of the CST-GR algorithm, Raspberry Pi cannot handle the entire dataset used in our experiments.

5.
We try several tree-based classifiers-J48, Hoeffding tree (VFDT, very fast decision tree), logistic model tree (LMT), and random forest (RF)-to determine which classifier is the most suitable to the IoT environment in terms of lightweight and detection performance.According to our discussion and experimental results, the J48 algorithm is found to be the most suitable for our detection system.6.
We make sure to what degree the processing time can be decreased for training and testing if the Raspberry Pi device is used in multithreading mode.

Organization of the Paper
The rest of the paper is organized as follows.After the related works are reviewed briefly in Section 2, the new feature selection algorithm (CST-GR) and the general flow of our entire detection system are explained in Section 3. In the samet, some tree-based learning classifiers that we use in our system are introduced.In Section 4, the performance of our detection system is verified using a public dataset collected in a simulated IoT environment.Finally, we conclude the paper in Section 5.

Public IDS
Most of the existing public IDSs such as Snort, Suricata, and Bro are based on pattern matching and static approaches.Snort, provided by Cisco Systems [11], is a leading network-based IDS.The earlier version of Snort only supported single-threading architecture, but from version 2.9, it can support multi-threading.The Suricata, an open-source IDS/IPS solution [12], is fully supported for the multi-threading architecture.However, it is more suitable for a large-scale network architecture.
A study [13] also gives an introduction to the ability of Suricata.Although the Snort is lighter than Suricata, it still has a limitation on the number of rules exploited for attack detection when it is implemented on resource constraint devices, like Raspberry Pi [14].
A research work [7] proposed a machine learning-based detection system to extend Snort.This work also made a testbed performance comparison between Snort and Suricata using different types of malicious traffics, which are collected by the Metasploit framework.After detecting seven types of malicious traffics by the two kinds of public IDSs, the first conclusion was that Suricata required more memory and CPU processing capability than Snort.Another finding was that Snort's rule set could detect six out of seven kinds of attacks, but Suricata could detect only four kinds.Moreover, there were many false alarm rates in both IDS systems: Snort has 55.2% false positive rate (FPR), and Suricata generated up to 74.3% FPR.After adding their machine learning-based detection plug-in to Snort, which was done in parallel with the Snort rule detection engine, this system could reduce the FPR to 16.9% by support vector machine (SVM) and 8.6% FPR by the hybrid approach, which was done by optimized SVM with fuzzy logic.According to their experiments, the hybrid detection system constructed by the signature-based approach and ML-based approach was effective.However, their proposal was complicated by implementation in the resource constraint device, like the Raspberry Pi.

Machine Learning-Based IDS for the IoT Environment
In recent years, as more and more IoT devices are deployed, IDSs in IoT environments have attracted attention from many researchers and developers.Some studies [15,16] addressed specific types of threats targeting IoT devices.One study [15] proposed a detection system for sinkhole attacks targeting routing devices.The detection rate for sinkhole attacks was up to 92% and 72% on fixed and mobile scenarios, respectively.Another study [16] tried to prevent the three different levels of battery exhaustion attacks on the bluetooth low energy (BLE)-based mesh network.
A novel IDS for the IoT environment was proposed using not only ML methodologies, but also a rule-based architecture [17].Their effort was for both prediction of the malicious behavior and detection of the malicious IoT nodes on the network from DDoS attacks.They used the naïve Bayes as the classifier, and their performance evaluation was done on the open-source tool Weka.ML-based forensic mechanism for IoT botnets was proposed in a study [18], which also used Weka to examine their detection accuracy.
In many existing studies, the IDSs are implemented for the IoT environment using machine learning techniques.This is considered to outperform signature-based systems because the attackers can circumvent the normal signature-based detection techniques.However, the existing IDSs are mainly aimed at obtaining a high detection accuracy with a low false alarm, which is not enough for IoT environments where lightweight IDS systems are also necessary.

Raspberry Pi-Based IDS
The IDS studies [5,[19][20][21][22][23][24][25][26][27] also have proposed attack detection based on Raspberry Pi.In the work of [18], the anomaly-based detection method was proposed by capturing the previous traffic patterns asbenign data for attack detection system construction.The work of [21] has implemented a real-time DDoS detection architecture using the complex event processing (CEP) algorithm, and it allowed for real-time analysis of continuous data-streams.Some other studies [14,20,23] have used public IDSs, such as Snort and Bro, to implement the detection system on Raspberry Pi.The study [14] proposed an IDS solution on Raspberry Pi, but its results showed that the number of rules had to be limited owing to their implementation environment, Raspberry Pi.The study [22] also proposed an IDS solution on an IoT router for verifying the domain name system (DNS) traffic generated from IoT devices.An IDS/IPS solution for radio-frequency identification (RFID) using Raspberry Pi was also proposed in the work of [5].
The cyber-attack detection studies based on Raspberry Pi devices are shown in Table 1.Almost all of the Pi-based detection systems used the public IDS, and these are misuse-based systems that used the pre-defined attacks' signatures.Although some public IDS-based studies have indicated that the detection system could be implemented on Raspberry Pi, such systems are mainly simplified ones (e.g., using decreased rule-base) of IDSs in the traditional network.Therefore, such systems often do not have satisfactory detection performance.How to implement a lightweight and efficient detection system in the IoT environment has become a crucially important issue.The IDS on fog computing [24] was introduced using new modern datasets, namely Australian Defence Force Academy Linux Dataset (ADFA-LD) and ADFA-Windows Dataset (ADFA-WD) [3,25].This work [24] also used Raspberry Pi to evaluate the performance of attack detecting model.They examined its performance by setting up two workstation computers and the Raspberry Pi devices.They proposed a Raspberry Pi-based IDS for a home network using different Snort-rules sets.However, it did not focus on the DDoS attacks, which was a well-known main security challenge in the IoT environment.

Feature Selection
The feature selection process involves selecting the most important features or reducing the irrelevant features from the original feature set.The irrelevant features or redundant information should be eradicated [28,29] because of such features that can not only raise the computation cost, but also deteriorate the detection performance [30].The feature selection reduces the effects of noise or irrelevant variables.Thus, feature selection is extremely important for good detection performance [31].The main approaches for feature selection can be categorized into three groups: filter-based, wrapper-based, and embedded-based approaches.Among them, the filter-based approach is suitable to accomplish lightweight systems because it has a separated learning process that can run on a powerful node, not small IoT devices.However, efficient features may be diverse for different learning algorithms to achieve high detection performance.One simple filter-based approach is the Pearson correlation coefficient, which can only detect linear dependencies between features and the target class [32].Other approaches, mutual information (MI) [33] and gain-ratio [34], use the information-theoretic ranking criteria based on the conditional entropy.The significant difficulty of these approaches is to decide the threshold value for removing the unimportant features.Correlation-based feature selection (CFS) [35] can overcome this problem because it can extract the best features subset without having a threshold value.This approach automatically extracts the best subset with the highest value of its merit function based on two criteria: correlated with target class and uncorrelated with each other on the features.However, it can make the prediction accuracy lower if the best merit score is not good enough.
The preliminary of our lightweight IDS proposal [36] was based on the selected features by the CFS algorithm.We selected the same features for nine kinds of attacks in a public dataset, UNSW-NB15 [37], for implementing the detection system.We evaluated the performance comparison with two classifiers-J48 and naive Bayes-for learning our system.We found that the detection accuracy of J48 is better than that of naive Bayes when we used the selected features, and the CFS algorithm could significantly decrease the number of irrelevant features.In this study, we introduce a new feature selection algorithm and the features are selected corresponding to each specific kind of attack.

A New Algorithm for Feature Selection
In this subsection, we introduce our new feature selection algorithm, called CST-GR algorithm, to select the most effective features for detecting each kind of specific attack.The proposed feature selection algorithm is shown in Algorithm 1.The merit function from CFS algorithm [35] is applied to find the best-correlated feature set.However, unlike the CFS algorithm, such features do not necessarily appear in our final feature set.The gain-ratio is also calculated for each of those features to decide which ones are eventually selected.Briefly speaking, the minimum gain-ratio value in the feature set obtained using the merit function is used as the threshold of gain-ratio to select the final features from the original feature set.That is, not only the merit function, but also the gain-ratio are calculated to determine the final feature set.
The feature selection process is to select the really important features and try to reduce the number of irrelevant/unimportant features to the greatest extent.At the same time, the features should not condense the accuracy of a predictive model [38].Like the CFS algorithm, the merit function is used to evaluate the relations between the input features and output classes.A feature is regarded as redundant if one or more of the other features are highly correlated with it.In Equation (1), k is the number of features in the current subset, r c f refers to the average value of all feature-class correlations, and r f f is the average value of all feature-feature correlations.Using the merit function (M s k ), the subset of features with the highest merit value can be generated.A detailed discussion and analysis can be found in the work of [35].In the CFS algorithm, the selection result is very dependent on the merit function, which is a big problem in practice.To overcome this problem, in our CST-GR algorithm, the gain value of each feature is also taken into account.
Extract the feature set having the highest correlation based on the merit function: Equation (1), FC ⊂ FI FC = {f 1 , f 2 , . . ., f c }, c = the number of features in the subset having the best merit value 3.
Calculate the minimum gain value of the features in FC, f min (FC) 4.
Use Equation ( 2) to calculate the gain-ratio value of each feature of FI From GR, select the features whose gain-ratio values are greater than or equal to f min (FC), FS ⊂ FI FS = {r 1 , r 2 , r 3 , . . . . . ... , r s }, s = the number of finally selected features 6.
Output FS The gain value of each feature can be calculated using Equation ( 2), which is given as "GainRatioAttributeEval" in the open-source Weka [39] and is also used to overcome the overfitting problem in C4.5.Equation (3) represents the information generated by splitting the dataset (D) into the v partitions, which correspond to the outcomes on attribute A. The Gain (D) can be calculated by Entropy (S) -Entropy (S, D).
In the CST-GR algorithm, all possible features are loaded to the feature set (FI) in step 1.
Step 2 reveals the best subset (FC) that possesses the highest merit value, and the minimum gain-ratio value (f min ) of the features in FC is calculated in step 3. Step 4 calculates the gain-ratio value (GR) of each feature in the original feature set.Finally, all the features in the original feature set whose GRs are greater than or equal to the minimum gain-ratio value, and f min are selected as the final features in step 5. Finally, the best feature set (FS) is generated in step 6.That is, the minimum GR in the feature set obtained using the merit function is used as the threshold of gain-ratio to select the final features from the original feature set.Thus, our feature selection algorithm takes not only the merit values into account, but also the gain-ratio values (GRs).

The General Flow of the IDS Proposal
The general schemes of our proposal for training and testing are shown in Figures 2 and 3, respectively.There are multiple classifiers (three in Figures 2 and 3 as examples) constructed for the detection of specific attacks on the IoT environments.In the training phase shown in Figure 2, each classifier is trained using the data containing only the corresponding attacks and normal data.The feature set used in each classifier is selected independently by the CST-GR algorithm.In the test phase shown in Figure 3, the necessary features for all the classifiers are extracted from the test data.Then, the classifiers obtain the features they need.The CST-GR algorithm is used to select the most important features for each kind of attack.The experiment results presented in Section 4 indicate that the CST-GR algorithm can greatly decrease the number of features actually used in the classifiers, which makes the entire detection system lighter.More importantly, making the detection system lighter almost does not degrade its detection performance, which means that the CST-GR is able to find the really efficient features for detecting attacks.Moreover, the system can support updating each classifier or extending the additional classifiers for the detection of new kinds of attacks.

Tree-Based Classifiers
We used the well-known tree-based classifiers J48, Hoeffding tree (VFDT), logistic model tree (LMT), and random forest (RF) for implementing our detection system.Tree-based approaches are simple, and most of these classifiers are lightweight processing architecture.These are the type of supervised learning methods for prediction of the learning model by growing the tree(s), and these can be applied for regression and classification problems.

J48
J48 (C4.5) is the descendant of ID3, and it is one of the decision tree generation algorithms developed by Ross Quinlan.It is a tree-like structure that consists of the root node and leaf nodes derived from it [40].It uses information gain for splitting criteria to construct the tree.The leaf nodes represent classes or class attributes.It can handle both continuous and discrete features.Using the

Tree-Based Classifiers
We used the well-known tree-based classifiers J48, Hoeffding tree (VFDT), logistic model tree (LMT), and random forest (RF) for implementing our detection system.Tree-based approaches are simple, and most of these classifiers are lightweight processing architecture.These are the type of supervised learning methods for prediction of the learning model by growing the tree(s), and these can be applied for regression and classification problems.

J48
J48 (C4.5) is the descendant of ID3, and it is one of the decision tree generation algorithms developed by Ross Quinlan.It is a tree-like structure that consists of the root node and leaf nodes derived from it [40].It uses information gain for splitting criteria to construct the tree.The leaf nodes represent classes or class attributes.It can handle both continuous and discrete features.Using the

Tree-Based Classifiers
We used the well-known tree-based classifiers J48, Hoeffding tree (VFDT), logistic model tree (LMT), and random forest (RF) for implementing our detection system.Tree-based approaches are simple, and most of these classifiers are lightweight processing architecture.These are the type of supervised learning methods for prediction of the learning model by growing the tree(s), and these can be applied for regression and classification problems.

J48
J48 (C4.5) is the descendant of ID3, and it is one of the decision tree generation algorithms developed by Ross Quinlan.It is a tree-like structure that consists of the root node and leaf nodes derived from it [40].It uses information gain for splitting criteria to construct the tree.The leaf nodes represent classes or class attributes.It can handle both continuous and discrete features.Using the pruning techniques, the overfitting problem can be solved.It can also be used on training data that have incomplete data and different weighted features.

Hoeffding Tree
The Hoeffding tree is an incremental, anytime decision tree induction algorithm developed by Domingos and Hulten [41].It is also called VFDT, which means a very fast decision tree.Hoeffding trees exploit the fact that a small sample can be enough to choose the finest splitting attribute.It uses Hoeffding bounds to assure that its output is asymptotically nearly identical to that of a conventional learner.A detailed description can be found in the research [41,42].The Hoeffding tree can support binary and nominal class, and can also work with mission class values.This algorithm can also work well with nominal and numerical features.

Logistic Model Tree
Logistic model tree (LMT) is the classification of trees with logistic regression functions at the leaves.The algorithm can support not only binary and multi-class target variables, but also numeric, nominal attributes, and missing values.Landwehr et al. [43] proposed the LMT algorithm, and they use a stagewise fitting process to construct the logistic regression models that can select relevant attributes in the data in a natural way.Their method employs the LogitBoost algorithm [44] for building the logistic regression functions at the nodes of a tree and uses the classification and regression tree (CART) algorithm for pruning.

Random Forest
The random forest (RF) is a classifier introduced by Breiman [45] that consists of a collection of decision trees.Each tree is constructed using a random part of the whole feature vector.The final prediction of the RF is obtained by a majority vote, over the predictions of the individual trees.It is also an effective algorithm to overcome the overfitting problem, and the right kind of randomness gives them accurate classification and regression.One hundred trees on the RF were used in this study because it is claimed that the number of trees ranging between 64 and 128 trees was the best option for this algorithm [46].

Experiments
Raspberry Pi 3 Model B was used as the experiment platform to implement our lightweight IDS.This device is cheap and small, and thus can be used in many applications.The open-source Weka [39] was used to examine the detection performance of our proposal.Our system tried to capture the following three kinds of specific attacks in the IoT environment: probing attack (reconnaissance), DDoS attack, and information theft attack.The CST-GR algorithm was performed to select the really important features for detecting each specific attack.Tree-based classifiers, J48, VFDT, MLT, and RF, were each tried as the classifier in our detection system to evaluate which was the best one for our system.

Dataset
The dataset, Bot-IoT, was built and managed in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) [6], where the aforementioned three kinds of common attacks in the IoT environment were launched intentionally and the traffic data were collected in a simulated environment based on the IoT botnet scenarios [6].The node-red tool [47] was used for constructing simulated IoT devices.The Ostinato tool [48] was used to gather benign data.Hping3 [49] and Golden-eye [50] tools were used for DDoS attacks; Nmap [51], hping3 [49], and xprobe2 [52] were used for probing attacks; qnd Metasploit framework was used to find the vulnerability for data theft [6].In this dataset, 3.6 million instances were contained, and 40 features were originally extracted.

Features Selected by the CST-GR Algorithm
Although our system can deal with any specific attacks, the above-mentioned three so-called common attacks in the IoT environment are taken into account in our experiment.The training dataset was divided into three groups according to the three kinds of attacks.That is, each group contained only one specific attack and benign data, to which the CST-GR algorithm was applied.Thus, the selected features are guaranteed to be most suitable for detecting the corresponding attacks.
As the result of the CST-GR algorithm, only two features are selected for the DDoS attacks, five features for the probing attacks (reconnaissance), and three features for the information theft attack.The selected features are shown in Table 2. From Table 2, we can observe that only a very few features are selected for each kind of attack.Thus, each classifier is very small.

Performance Evaluation
The detection performance of our proposal was examined, including processing time and detection accuracy, for each of the three kinds of attacks.The true positive rate (TPR), false positive rate (FPR), precision, and F-measure were examined and presented as detection accuracy.The performance evaluation measurements are shown in Equations ( 4)- (8), where TP, TN, FP, and FN are true positive, true negative, false positive, and false negative, respectively.The Raspberry Pi device in our experiment could not handle all 3.6 million instances in the original dataset if all 40 features were used.Because the instances of the DDoS attacks and the probing attacks are too many to be handled in our platform (about 1.9 million instances of the DDoS attack and about 83,000 instances of the Probing attack are contained in the original dataset), only a randomly selected part of the instances in the entire datasets of the probing attacks and DDoS attacks was used for the sake of fairness.After selecting features by the CST-GR algorithm, this device could handle the whole dataset without any problem, because the number of selected features is very small.
Specifically, the details of the dataset used in our experiment are shown in Table 3.In order to select features and train a classifier for each of the three kinds of common attacks, the corresponding instances of each kind of attack and the normal instances in the original dataset are taken out to form a sub-dataset for the corresponding attack.For each sub-dataset, around two-thirds of instances are chosen as training data and the remaining parts as testing data.That is, we split 556 instances (367 instances for training and 189 instances for testing) in the information theft attack group, approximately 82,000 instances (54,000 instances for training and 28,000 instances for testing) in the reconnaissance (probing) attack group, and 82,000 instances (54,000 instances for training and 28,000 instances for testing) in DDoS attack group.For the reconnaissance attacks and the DDoS attacks, about 82,000 instances were selected because it was about the most prominent number for our platform to handle in the case of the original 40 features being used.Even though this device could handle more instances in the case of the CST-GR algorithm being used to select features, we must make the performance comparison fair before and after our feature selection algorithm CST-GR was used.

Detection Accuracy
The Raspberry Pi 3 Model B used in our experiment has Quad-Core 1.2 GHz 64 bit-CPU and 1 GB RAM.As mentioned above, a part of the original dataset was selected for our experiment so that the hardware could deal with it even when all the original features were used with the tree-based classifiers.However, the LMT is an exception because we want to select as many instances as possible.Thus, we used a personal computer with i7-8565U and 16 GB RAM to compare the detection performance of the case using LMT with that of other classifiers.By the way, when we applied the features selected by the CST-GR algorithm, the Raspberry Pi 3 Model B can work well for all tree-based algorithms including LMT, which also indicates that our proposed feature selection approach could efficiently guarantee the detection system being lightweight.
To evaluate the detection performance, we compared the TPR, FPR, precision, and F-measure of all the above-mentioned classifiers.Figure 4 shows the comparison of the TPR among the tree-based classifiers in the cases of using all features and using CST-GR selected features for each kind of specific attack (DDoS, reconnaissance, and theft).All these results show that the detection performance is not deteriorated even when using very few features selected by the CST-GR algorithm.Especially in the cases of J48 and RF being used as the classifier and the features being selected by the CST-GR, some TPRs reached up to 99.4% for all three attacks.That is, we can consider J48 and RF to be more suitable for our system if only the TPR is considered.Figure 5 is a comparison of the FPR, which indicates that RF and J48 are the best choices if only the FPR is considered.Comparisons on precision and F-measure are shown in Figures 6 and 7, respectively.The same as the results shown in Figure 5, the precision values and F-measures are better when the CST-GR algorithm is used to select features.These results also indicate that CST-GR is significantly good enough for selecting the most critical features for our lightweight detection system.All of our  Comparisons on precision and F-measure are shown in Figures 6 and 7, respectively.The same as the results shown in Figure 5, the precision values and F-measures are better when the CST-GR algorithm is used to select features.These results also indicate that CST-GR is significantly good enough for selecting the most critical features for our lightweight detection system.All of our Comparisons on precision and F-measure are shown in Figures 6 and 7, respectively.The same as the results shown in Figure 5, the precision values and F-measures are better when the CST-GR algorithm is used to select features.These results also indicate that CST-GR is significantly good enough for selecting the most critical features for our lightweight detection system.All of our performance comparisons indicate that the irrelevant or redundant features may not only raise the computational cost, but also deteriorate the detection performance.performance comparisons indicate that the irrelevant or redundant features may not only raise the computational cost, but also deteriorate the detection performance.
Figure 6.A comparison of precision of J48, VFDT, LMT, and RF between the cases using all features and using CST-GR selected features for each kind of attack.

Figure 7.
A comparison of F-measure of J48, VFDT, LMT, and RF between the cases using all features or using CST-GR selected features for each kind of attack.

Evaluation of Processing Time
The total times for processing all the instances of training data and testing data are presented in Table 4, including all the cases using the above-mentioned classifiers.The processing time for detecting each kind of attack was decreased greatly with the assistance of the CST-GR algorithm.performance comparisons indicate that the irrelevant or redundant features may not only raise the computational cost, but also deteriorate the detection performance.
Figure 6.A comparison of precision of J48, VFDT, LMT, and RF between the cases using all features and using CST-GR selected features for each kind of attack.

Figure 7.
A comparison of F-measure of J48, VFDT, LMT, and RF between the cases using all features or using CST-GR selected features for each kind of attack.

Evaluation of Processing Time
The total times for processing all the instances of training data and testing data are presented in Table 4, including all the cases using the above-mentioned classifiers.The processing time for detecting each kind of attack was decreased greatly with the assistance of the CST-GR algorithm.A comparison of F-measure of J48, VFDT, LMT, and RF between the cases using all features or using CST-GR selected features for each kind of attack.

Evaluation of Processing Time
The total times for processing all the instances of training data and testing data are presented in Table 4, including all the cases using the above-mentioned classifiers.The processing time for detecting each kind of attack was decreased greatly with the assistance of the CST-GR algorithm.Using the CST-GR algorithm for feature selection clearly makes the detection much faster.Table 5 shows the ranking in descending order of processing time among different classifiers.According to the experiments, the VFDT algorithm is the fastest on training, and the J48 algorithm is the fastest on testing.Although the training time of the RF is faster than the LMT, the testing time of the LMT is faster than RF in the case of using CST-GR features.Note that the training time of the LMT is longer than that of the RF, but the testing time of the LMT is faster than that of the RF when we applied our proposal on a personal computer.Our experiment platform, Raspberry Pi 3, cannot handle the processing of the LMT algorithm when all the original features are used.However, this device could process all the cases using the above-mentioned classifiers if the features selected by the CST-GR algorithm are used.According to the experiment results, the CST-GR algorithm can guarantee the detection system is able to handle all the above-mentioned classifiers on the resource constraint device, Raspberry Pi.Moreover, the J48 could be regarded as the most suitable learning method for our lightweight detection system.Our proposal can also be implemented in a parallel mode to improve the response time.A response time comparison of training time and testing time between a sequential mode and a parallel mode is shown in Table 6.We found that the Raspberry Pi can handle the parallel detection architecture if we selected the corresponding features by the CST-GR.In this architecture, the response times of all classifiers are more than two times faster than in the sequential mode.Moreover, the results show that the response time of the J48 is significantly faster than the RF and slightly faster than the other two, the VFDT and the LMT.We investigated the CPU and memory usage in the parallel detection architecture using corresponding features that are selected by the CST-GR algorithm.The htop tool [53], an interactive process-viewer of Unix systems, was used to capture the usage of memory and the CPU.Formally, the htop uses around 2% of CPU and 0.3% of the Raspberry Pi 3 Model B's memory.The case using the CST-GR algorithm for feature selection was investigated.The usage of the CPU and memory in the case of the J48 as the classifier is shown in Figures 8 and 9, respectively.Figure 8 indicates that all four cores in the CPU are working in a not-too-busy state.More specifically, around one-third of the memory (around 352 MB) was used.Note that the background processes consumed around 175 MB of memory.The maximum memory usage for four different classifiers was also investigated and results are as follows.The maximum memory usage reached 425 MB with the J48, 352 MB with the VFDT, 663 MB with the LMT, and 538 MB with the RF.process-viewer of Unix systems, was used to capture the usage of memory and the CPU.Formally, the htop uses around 2% of CPU and 0.3% of the Raspberry Pi 3 Model B's memory.The case using the CST-GR algorithm for feature selection was investigated.The usage of the CPU and memory in the case of the J48 as the classifier is shown in Figures 8 and 9, respectively.Figure 8 indicates that all four cores in the CPU are working in a not-too-busy state.More specifically, around one-third of the memory (around 352 MB) was used.Note that the background processes consumed around 175 MB of memory.The maximum memory usage for four different classifiers was also investigated and results are as follows.The maximum memory usage reached 425 MB with the J48, 352 MB with the VFDT, 663 MB with the LMT, and 538 MB with the RF.Also, we investigated that the usage of CPU and memory in the case of all 40 original features was used in the parallel mode without any feature selection.The system halted halfway, which means our platform cannot run such a system in the parallel mode if the features are not selected.From this experiment, we found that the system halted without giving any results when the usage of the CPU core reached 100% and the usage of device memory grew over 800 MB.

Observations
From the above experiments, we can observe the following: 1. Using our proposed feature selection algorithm (CST-GR) for each kind of attacks, the number of features can be greatly decreased and the detection system can be made much lighter and much faster almost without any sacrifice on detection accuracy (see Tables 2 and 4, Figures 4  and 5).Moreover, the Raspberry Pi device can handle many more instances.2. When using J48 and RF as the classifier, the detection accuracy (TPR) is still up to 99.4% even when using only very few features selected by the CST-GR algorithm.3. The detection system can be implemented in the parallel mode in Raspberry Pi.However, it cannot handle all the data in the parallel mode if the original features are used without the help of the CST-GR algorithm.4. The case of the J48 algorithm being used as the classifier has the shortest response time for detection, although the training time is a little longer than the VFDT (but still faster than the other two) and overall detection accuracy of the J48 is better than that of the VFDT. 5.Although the detection accuracy (TPR and FPR) of the RF is slightly better than that of the J48, the detection time of the J48 is around ten times faster than that of the RF.Therefore, J48 is the best choice for the classifier in our detection system.

Conclusions and Future Work
In this paper, we first pointed out the weakness of the existing detection system for IoT environments.After that, a lightweight detection system for the IoT environment using our new feature selection algorithm (called CST-GR) was designed and implemented on Raspberry Pi.If the CST-GR algorithm is conducted for each of the specific attacks, the detection system needs very few features for detecting the corresponding attacks, which makes the detection system very lightweight and fast.Moreover, this lightweight detection system has almost no sacrifice on detection performance.In addition, several well-known machine learning algorithms-J48, Hoeffding tree (VFDT), logistic model tree (LMT), and random forest (RF)-were tested to be used in our system.The performance of our detection system was examined using the public Bot-IoT dataset, which was collected in a simulated IoT environment.According to our experiment results, the CST-GR feature selection and the J48 classifier could help in the implementation of a lightweight detection system.Also, we investigated that the usage of CPU and memory in the case of all 40 original features was used in the parallel mode without any feature selection.The system halted halfway, which means our platform cannot run such a system in the parallel mode if the features are not selected.From this experiment, we found that the system halted without giving any results when the usage of the CPU core reached 100% and the usage of device memory grew over 800 MB.

Observations
From the above experiments, we can observe the following: 1.
Using our proposed feature selection algorithm (CST-GR) for each kind of attacks, the number of features can be greatly decreased and the detection system can be made much lighter and much faster almost without any sacrifice on detection accuracy (see Tables 2 and 4, Figures 4 and 5).Moreover, the Raspberry Pi device can handle many more instances.2.
When using J48 and RF as the classifier, the detection accuracy (TPR) is still up to 99.4% even when using only very few features selected by the CST-GR algorithm.3.
The detection system can be implemented in the parallel mode in Raspberry Pi.However, it cannot handle all the data in the parallel mode if the original features are used without the help of the CST-GR algorithm.4.
The case of the J48 algorithm being used as the classifier has the shortest response time for detection, although the training time is a little longer than the VFDT (but still faster than the other two) and overall detection accuracy of the J48 is better than that of the VFDT.

5.
Although the detection accuracy (TPR and FPR) of the RF is slightly better than that of the J48, the detection time of the J48 is around ten times faster than that of the RF.Therefore, J48 is the best choice for the classifier in our detection system.

Conclusions and Future Work
In this paper, we first pointed out the weakness of the existing detection system for IoT environments.After that, a lightweight detection system for the IoT environment using our new

Figure 1 .
Figure 1.Typical intrusion detection system for IoT environment.IDS, intrusion detection system.

Figure 1 .
Figure 1.Typical intrusion detection system for IoT environment.IDS, intrusion detection system.

Figure 3 .
Figure 3.The testing phase (the case of three classifiers).

Figure 3 .
Figure 3.The testing phase (the case of three classifiers).

Figure 3 .
Figure 3.The testing phase (the case of three classifiers).

Figure 4 .
Figure 4.A comparison of TPR of J48, very fast decision tree (VFDT), logistic model tree (LMT), and random forest (RF) between the cases using all features and using CST-GR selected features for each kind of attack.

Figure 5 .
Figure 5.A comparison of FPR of J48, VFDT, LMT, and RF between the cases using all features and using CST-GR selected features for each kind of attack.

Figure 4 . 19 Figure 4 .
Figure 4.A comparison of TPR of J48, very fast decision tree (VFDT), logistic model tree (LMT), and random forest (RF) between the cases using all features and using CST-GR selected features for each kind of attack.

Figure 5 .
Figure 5.A comparison of FPR of J48, VFDT, LMT, and RF between the cases using all features and using CST-GR selected features for each kind of attack.

Figure 5 .
Figure 5.A comparison of FPR of J48, VFDT, LMT, and RF between the cases using all features and using CST-GR selected features for each kind of attack.

Figure 6 .
Figure6.A comparison of precision of J48, VFDT, LMT, and RF between the cases using all features and using CST-GR selected features for each kind of attack.

Figure 7 .
Figure 7.A comparison of F-measure of J48, VFDT, LMT, and RF between the cases using all features or using CST-GR selected features for each kind of attack.

Figure 8 .
Figure 8.The consumption of each core (400% for four cores) of CPU for the parallel mode with J48 when using CST-GR selected features.

Figure 8 . 19 Figure 9 .
Figure 8.The consumption of each core (400% for four cores) of CPU for the parallel mode with J48 when using CST-GR selected features.

Figure 9 .
Figure 9.The consumption of device memory for the parallel mode with J48 when using CST-GR selected features.

Table 1 .
Raspberry Pi-based intrusion detection systems.DoS, denial of services; IoT, Internet of Things.

Table 2 .
Selected features by correlated-set thresholding on gain-ratio (CST-GR) for each kind of attack.

Table 3 .
The number of instances used in our experiment.DDoS, distributed DoS.

Table 4 .
Comparison of processing time (seconds).VFDT, very fast decision tree; LMT, logistic model tree; RF, random forest.

Table 5 .
Ranking of response time.

Table 6 .
Comparison of processing time between sequential mode and parallel mode when using CST-GR features.