Application of Machine Learning Algorithms for the Validation of a New CoAP-IoT Anomaly Detection Dataset

With the rise in smart devices, the Internet of Things (IoT) has been established as one of the preferred emerging platforms to fulfil their need for simple interconnections. The use of specific protocols such as the constrained application protocol (CoAP) has demonstrated improvements in the performance of the networks. However, power-, bandwidth-, and memory-constrained sensing devices constitute a weakness in the security of the system. One way to mitigate these security problems is through anomaly-based intrusion detection systems, which aim to estimate the behaviour of the systems based on their "normal" nature. Thus, to develop anomaly-based intrusion detection systems, it is necessary to have a suitable dataset that allows for their analysis. Due to the lack of a public dataset in the CoAP-IoT environment, this work aims to present a complete and labelled CoAP-IoT anomaly detection dataset (CIDAD) based on real-world traffic, with a sufficient trace size and diverse anomalous scenarios. The modelled data were implemented in a virtual sensor environment, including three types of anomalies in the CoAP data. The validation of the dataset was carried out using five shallow machine learning techniques: logistic regression, naive Bayes, random forest, AdaBoost, and support vector machine. Detailed analyses of the dataset, data conditioning, feature engineering, and hyperparameter tuning are presented. The evaluation metrics used in the performance comparison are accuracy, precision, recall, F1 score, and kappa score. The system achieved 99.9% accuracy for decision tree models. Random forest established itself as the best model, obtaining a 99.9% precision and F1 score, 100% recall, and a Cohen's kappa statistic of 0.99.


Introduction
The construction of Internet of Things (IoT) networks has advanced significantly in the last couple of years, adding a new dimension to the world of information and communication technologies. The versatility of sensors in IoT networks, together with reduced costs through improved process efficiency, asset utilisation, and productivity, has opened new development opportunities in environments such as smart home management, intelligent transport systems, smart energy management, and e-health.
The IoT consists of a large number of nodes, and every node that connects with a server maintains a persistent connection. Moreover, each node in the network connects with many other nodes, generating large loads on constrained IoT devices. Congestion results in packet retransmissions, increasing energy consumption, latency, and packet loss, while reducing throughput and the packet delivery ratio (PDR) [1]. The use of specialised, lightweight IoT mechanisms can ease the computational load on the system, free up resources, and improve security. Thus, one way to deal with the challenges of the IoT is the use of specially designed standards and communication protocols. At the transport layer, the main protocols are the transmission control protocol (TCP) and the user datagram protocol (UDP).
However, the requirements for message distribution may vary according to the specifications of the IoT application. Many of the diverse and popular IoT application-layer protocols available today are geared towards machine-to-machine (M2M) communication, and one of the predominant protocols is the constrained application protocol (CoAP). This protocol is a customised and compressed version of the hypertext transfer protocol (HTTP) that shares the same methods and principles while offering a light weight and low energy consumption [2].
Although CoAP has difficulties managing congestion because of low bandwidth and high network traffic, security is a hot topic, as the protocol does not have reliable standards for secure architectures [3]. Furthermore, the extensive deployment of sensor monitoring technologies, low-cost solutions, and high-impact application domains generates an enormous amount of data while monitoring physical spaces and objects. The huge data streams collected can exhibit unhealthy behaviours or anomalies that may be normal in one scenario yet abnormal in another. Therefore, to reduce functional risks, avoid unseen problems, and prevent system downtime, analysis is required [4].
Depending on the type of analysis performed, security problems can be handled through intrusion detection systems (IDSs), which are intended to strengthen the security of information and communication systems. Intrusion detection systems are classified as either signature-based or anomaly-based. Machine learning (ML) anomaly-based techniques have proven to be a reliable approach for the detection of network anomalies, including in IoT networks. These techniques make decisions based on learning anomalous/normal network behaviour. The machine learning approach is categorised into various learning paradigms: supervised, unsupervised, semi-supervised, reinforcement, and active learning. Labelled data are required to train supervised learning approaches and, as a result, their output is optimised compared to other models. Depending on the target application, various learning methodologies and algorithmic models have to be tested and evaluated [5]. Therefore, to develop a supervised ML anomaly-based IDS, it is necessary to collect a traffic dataset with correct labelling of both normal and abnormal traffic. The inclusion of various types of anomalies allows the validation and evaluation of system effectiveness, considering real trends in security vulnerabilities [6].

Motivation
Modern monitoring technology has evolved significantly: optimised systems allow data to be collected at multiple measurement points, yielding large quantities of data that, when processed with artificial intelligence, have produced deep and meaningful research results [7]. The lack of dedicated IoT datasets with specific protocols and conditions limits the emergence of novel mechanisms to recognise abnormal network behaviour. The detection of traffic anomalies can be improved by proposing new datasets that represent real systems with wireless sensor networks and lightweight communication, where IoT protocols are modified.
The implementation of models that recognise abnormal circumstances is possible thanks to representative knowledge of the environment. These facts motivated the presentation of CIDAD (available at https://github.com/dad-repository/cidad.git, accessed on 30 October 2022), a complete and labelled CoAP-IoT dataset based on the reproduction of real traffic with a sufficient number of samples, specific feature extraction, and mixed anomalous scenarios.
An incident that occurred at the Centre for Information and Communications Technology Research (CITIC) [8] served as the inspiration for the scenario design in this article. Owing to a software issue that resulted from an erroneous sensor configuration in the data centre, the temperature sensors failed to deliver accurate data to the cooling system. The system then failed due to a hardware breakdown brought on by the changes in device temperatures.
In addition, to validate CIDAD for traffic anomaly detection in CoAP-IoT networks, five shallow machine learning models were applied. ML algorithms cannot work with raw data directly; thus, as a preliminary stage, it was necessary to extract and analyse the representative features and remove redundant information. Consequently, the data had to be pre-processed so that they could be properly interpreted by the ML algorithms.

Contributions of the Work
The main objective of this work is to present a new CoAP-IoT dataset, properly labelled, easy to follow and manipulate, and designed for traffic anomaly detection in IoT environments, and to present its validation through the application of supervised learning algorithms. The main contributions of this research are:

• An exhaustive overview of existing IoT datasets and their validation using machine learning algorithms for anomaly detection.
• Application of diverse machine learning techniques and their evaluation using different traditional metrics to validate the selected dataset.

Paper Organisation
This work is structured as follows: Section 2 presents the most relevant IoT datasets and how they have been used in detecting traffic anomalies with machine learning techniques. Section 3 identifies the architectures used for both the physical and virtual settings. The structures implemented for the exchange of messages are presented in Section 4. An overview of CIDAD is detailed in Section 5. Section 6 describes the machine learning techniques used. Section 7 presents the preprocessing of the data as well as the feature engineering required to make the data perfectly understood by the classifiers. For model implementation, CIDAD was divided into training and test sets. The results of the training split are discussed in Section 8, and the assessment of the test set is detailed in Section 9. Finally, a discussion and future research directions are presented in Section 10.

Related Work
The emerging use of IoT networks, their relative novelty, and the associated security threats have led to the recent endeavour of building specialised IoT datasets.
This section presents some of the most relevant published public datasets, developed specifically for IoT sensor networks, with sufficient traces and proper labelling, and designed for IDSs using supervised machine learning. Examples of their implementation for traffic anomaly detection using machine learning are also shown.
• N-BaIoT [9]: This dataset is composed of traffic captured from two networks. The first corresponds to a video surveillance network with internet protocol (IP) cameras and includes eight different types of attacks that interrupt the availability and integrity of the video links. The second comes from an IoT network with three computers and nine IoT nodes, one of which is infected with the Mirai botnet malware. This dataset was employed to detect botnet attacks in the IoT by extracting snapshots of network behaviour and using deep autoencoders. In addition, using this dataset, Kitsune [10] devised a system for network intrusion detection that learns to detect online attacks on local networks without supervision using autoencoders. Furthermore, Abbasi [11] exploited the dataset for classification using logistic regression (LR) and artificial neural networks (ANN).
• CCD-INID-V1: Liu et al. [12] created a new dataset to identify traffic anomalies in IoT networks. In addition, they developed a hybrid intrusion detection method that incorporates an embedded model for feature selection and a convolutional neural network (CNN) for attack classification. The method has two variants: RCNN, where random forest is combined with a CNN, and XCNN, where extreme gradient boosting (XGBoost) is combined with a CNN. They compared the accuracy of their method with classical ML techniques, such as support vector machine (SVM), K-nearest neighbours (KNN), LR, and naive Bayes (NB). To validate their experiments, they also compared their proposed dataset with the N-BaIoT dataset [9] and the DoH20 dataset [13] to investigate the performance of these learning-based security models and compare their method against conventional ML techniques.
• A small automation testbed using MODBUS/TCP was employed in [20]. Using both the MODBUS and TCP protocols, the testbed emulates a cyber-physical system process that is managed by a SCADA system. Several attacks on the testbed were analysed: "ping flooding", "TCP SYN flooding", and "MODBUS query flooding - read holding registers", which target the programmable logic controller (PLC). In order to validate the dataset, four classifiers were implemented, kNN, SVM, DT, and RF, with the DT classifier achieving the best results.
• Another work combined several sources: the dataset [22] presented by Sivanathan et al. [29] for normal traffic coming from various devices; the IoT network intrusion dataset [25], a part of IoT-23 [26], which mainly involves malicious traffic; and a small automation testbed [20] for industrial control systems.
• ToN-IoT: Moustafa, at the Cyber Range Labs of UNSW Canberra [30], created an MQTT dataset named ToN-IoT that comprises heterogeneous data sources collected from a realistic and large-scale network, where more than ten IoT and IIoT sensors, including weather and industrial control sensors, were employed to capture telemetry data, presenting both regular and abnormal events. The IoT nodes (for instance, green gas IoT sensors and industrial IoT actuators) communicate using MQTT, publishing and subscribing to different topics, namely temperature and humidity. Sarhan et al. [31] intended to standardise these techniques so that they can be applied to any dataset. Six ML models (deep feed forward (DFF), CNN, recurrent neural network (RNN), DT, LR, and NB) and three feature extraction algorithms (principal component analysis (PCA), linear discriminant analysis (LDA), and autoencoder) were applied to three reference datasets, among them ToN-IoT [30].
• MedBIoT [32]: This is another semi-synthetic dataset related to IoT botnets, focused on the detection of botnet attacks such as Mirai, BashLite, and Torii. Developed in the Department of Software Science at Tallinn University of Technology, this dataset was obtained from a network with more than 80 devices, including normal and malicious traffic. The scale extension enables capturing malware spreading patterns that cannot be seen in small-sized networks, thus providing a more realistic environment. KNN, SVM, DT, and RF classification models were implemented, verifying the applicability of the proposed dataset.
• DAD [33]: This is a labelled IoT dataset containing a reproduction of certain real-world behaviours seen from the network, used for the detection of traffic anomalies in IoT sensor networks. The scenario involves four InRows, with four temperature sensors each, connected to a broker that establishes connections through MQTT messages.
The dataset presents three types of anomalies, duplication, interception, and modification, on the MQTT protocol payload, spread over 5 days. It was later validated in [34] by applying LR, RF, NB, AdaBoost, and SVM, demonstrating its applicability to anomaly detection, where the best classification results were obtained for the RF and AdaBoost models.
• IoT-Flock: Ghazanfar et al. [37] built a real-time IoT smart home system dataset using an IoT traffic generator tool that can produce normal and abnormal IoT traffic over a real-time network by means of just one physical machine. The dataset contains four types of environmental monitoring sensors (temperature, humidity, light, and motion) that communicate using the MQTT protocol through a wireless local area network (WLAN). In order to show the utility of the dataset for creating an IoT security solution, it was used with three common ML models (NB, RF, and KNN), achieving high detection rates.
• SDN-IoT: Bhayo et al. [38] developed a secure IoT framework based on a software-defined network. This framework is capable of detecting vulnerabilities in IoT devices and abnormal traffic created by IoT devices, taking into account the number of IP sessions and analysing IP payloads. In this way, it can easily detect DDoS attacks in the network by studying different parameters. The authors deployed IoT nodes in different scenarios: a large amount of traffic is generated from an exposed node, and abnormal traffic is detected and notified by the SDN controller. Given the reported results, the detection accuracy of the proposed framework for DDoS attacks is high, from 98% to 100%. These results were obtained in the early stage of the attacks, with a low false-positive rate. Nonetheless, due to the novelty of the proposed framework, it has not been tested as an ML IDS.
• IoTID20 [39]: The testbed for the IoTID20 dataset is a combination of IoT devices and interconnecting structures. The attacks are carried out on two devices that act as victims. The dataset contains eight types of attacks, belonging to the DoS, Mirai, MITM, and scan categories. The ML models used for anomaly detection are SVM, Gaussian NB, LDA, LR, DT, RF, and ensemble classifiers. The best detection rates were achieved with RF and the ensemble classifiers.

To summarise, Table 1 outlines the mentioned datasets, emphasising the type of IoT network used, the anomalies they contain, and some works that validate their use in detecting anomalies using ML. Most of the described IoT datasets are based on smart home environments or video surveillance cameras in non-specific environments. Three datasets handle environments or traces on IIoT platforms, and only a few of them use dedicated protocols. Some of them introduce wireless sensor networks, such as systems for measuring temperature, humidity, movement, etc., but none of the aforementioned datasets provide networks of lightweight temperature sensors in a data centre with an acceptable number of traces in the IoT, with a mixture of anomalies, based on the CoAP protocol. Consequently, we found the need to generate a dataset that would meet the requirements of the described environment. Thereby, the design, implementation, and application of ML for anomaly detection to a dataset under the described conditions are presented in this article.

Scenario
In order to obtain a dataset that can be used to detect traffic anomalies, a virtual scenario based on a real data centre temperature sensor network was designed and implemented. First, this section presents the structure and actual operation of the sensor network in the real data centre, including the interaction, distribution, and localisation of the sensor devices. Second, the implementation of the virtual scenario and the elements required for the construction of an IoT sensor network are described.

Physical Architecture
To build a real environmental dataset, data were obtained by modelling observations from authentic temperature sensors in the Centre for Information and Communications Technology Research (CITIC) Data Centre [8]. The data centre structure consists of three elements with sensors: the racks, the power strips (PDU), and the refrigeration machines (InRow). In the proposed scenario, only the sensors of the InRows will be considered; the other sensors in the data centre maintain static temperature values and are therefore not susceptible to analysis. Each InRow has control over four sensors: (a) unit entering fluid temperature (TFEU), which measures the temperature of the fluid entering the InRow unit; (b) unit leaving fluid temperature (TFSU), which measures the temperature of the fluid leaving the InRow back to the outside cooler; (c) unit return air temperature (TAR), which measures the temperature of the air entering the InRow air conditioning unit from the closed hot corridor; and (d) unit supply air temperature (TAS), which measures the temperature of the air coming out of the InRow into the cold corridor.
Based on how the CoAP protocol works, a client node can command each InRow node by sending a CoAP packet. Every sensor is identified by a uniform resource identifier (URI) address (e.g., www.inrow15.gal/1), where the resource indicates the InRow and the number that follows the slash corresponds to its sensor: (1) TAS, (2) TAR, (3) TFEU, and (4) TFSU. Then, when performing the virtualisation, there will be a virtual machine for each InRow, with the host machine acting as a client.
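As a minimal illustration of this addressing scheme, the mapping from a sensor URI to its InRow host and measurement can be sketched as follows (the helper function is hypothetical, not part of the dataset tooling):

```python
# Hypothetical helper illustrating the URI addressing scheme described above:
# the resource identifies the InRow, and the number after the slash selects
# the sensor: (1) TAS, (2) TAR, (3) TFEU, (4) TFSU.
SENSOR_NAMES = {1: "TAS", 2: "TAR", 3: "TFEU", 4: "TFSU"}

def parse_sensor_uri(uri):
    """Split a sensor URI into (InRow host, sensor name)."""
    host, _, sensor_id = uri.rpartition("/")
    return host, SENSOR_NAMES[int(sensor_id)]

parse_sensor_uri("www.inrow15.gal/1")  # -> ("www.inrow15.gal", "TAS")
```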

Virtual Scenario Description
To generate the dataset, a virtual infrastructure was deployed using the ESXi 6.5 hypervisor. Through the vSwitch capability provided by this hypervisor, an isolated Ethernet network, called the IoT network, was configured so that the infrastructure is not affected by external or unwanted traffic. Five virtual machines with the Ubuntu 20.04 operating system were deployed in this network. Four of them represent the four CoAP servers, called InRows, which in turn have four sensors each, while the fifth virtual machine implements the CoAP client functions and therefore sends the messages to each of the four sensors of each of the four CoAP servers, thus performing the function of a system monitor. Figure 1 shows the scenario with the different virtual machines that contain a client and four CoAP servers. The CoAP client message, sent by the client virtual machine, contains a CoAP packet built using the Scapy tool for Python [43], where the IP and UDP protocols are concatenated and the token value is randomly generated. Finally, the CoAP packet is sent asynchronously using the send method from the Scapy library. In each of the four virtual machines that simulate the operation of the InRows, a CoAP server that listens for messages from the CoAP client was implemented, using the sniff method of the Scapy library and performing a filtering process. For each message received, a reply is sent, which includes the temperature value for each node. The temperature values were calculated based on the time series implemented in [33].
Each of the samples generated by the sensors is sent every five minutes to the client through a CoAP message that contains its node identifier as a client ID (using the URI-Host and URI-Path values). The capture of the traffic exchanged between the nodes is executed on the client, and the connection information is stored to be later tagged in the dataset via Scapy. Most abnormal packets are flagged on the server, except those that belong to the interception anomaly.

Dataset Generation
CIDAD was generated from a seven-day monitoring of the situation that occurs daily in a simulated environment. Five of the seven days present anomalies injected in one of the InRows on the payload of the CoAP message. Traffic labelling is performed at the packet level, where abnormal traffic is statistically different from normal traffic. Three different anomaly scenarios are presented in the dataset, where the behaviour of the node can be affected as follows:
• Interception: this anomaly consists of arbitrarily removing packets.
• Modification: this anomaly consists of modifying the temperature reported by the sensors to a very low value. This causes the system to stop cooling, resulting in overheating of the devices.
• Duplication: this anomaly sends more packets, with their tokens, than initially planned.
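The effect of the three anomalies on a stream of sensor samples can be sketched as follows; this is a minimal illustration in which the drop probability and duplication pattern are assumed values, not the exact parameters used when generating the dataset:

```python
import random

def inject_anomaly(samples, kind, seed=0):
    """Apply one of the three CIDAD payload anomalies to a list of
    (token, temperature) samples. The 0.3 drop rate and the every-other
    duplication pattern are illustrative assumptions only."""
    rng = random.Random(seed)
    if kind == "interception":
        # arbitrarily remove packets
        return [s for s in samples if rng.random() > 0.3]
    if kind == "modification":
        # force every temperature to the same very low value
        return [(token, 11.56) for token, _ in samples]
    if kind == "duplication":
        # emit extra copies of some packets beyond those initially planned
        return samples + samples[::2]
    return samples
```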

Dataset Setup
One of the most promising protocols for small devices is CoAP. It is intended for web transfer with constrained devices and networks, such as scenarios with power constraints or lossy links. It was created for machine-to-machine communication, following a request-response model, and is capable of discovering both services and resources. Moreover, it adopts very important web concepts, such as uniform resource identifiers (URIs) and media types. For this reason, it can interoperate with HTTP; at the same time, it allows for multicast communication, very slight overhead, and simplicity in constrained scenarios. CoAP employs UDP at the transport layer, so it communicates through unreliable and connectionless datagrams [2]. The UDP port is 5683. The structure used in this scenario for the exchange of messages in each of the roles played by every host is described below:
• Client: The client sends CoAP request messages to the InRows (servers). These messages are of type NON (non-confirmable) and do not require a confirmation ACK message (the network is less overloaded and the devices consume fewer resources when processing and sending these messages). The request sent by the client has the following structure:
- Version: the value of this field will be 1 (01).
- Type: since the messages will be of type NON, the value will be 1 (01).
- Code: the messages sent from the client requesting a resource from the server use the GET method; therefore, the value of the code field will be 1 (0.01).

- Message ID: this is calculated according to the message sent; the message ID increases with each message and follows a chronological order.
- Token: each token is a unique and randomly generated field of the message. The size of this token is 4 bytes, the minimum recommended size.
- Options: the options used are URI-Host and URI-Path, which correctly identify the resource on the corresponding server. The Accept option is also added with value 0, which indicates that the resource representation format that the client accepts is plain text (text/plain).
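The header fields just listed can be assembled into the fixed 4-byte CoAP header plus token using only the standard library. The following is a minimal sketch of the NON GET request header described above (option encoding is omitted for brevity, and the function name is our own, not part of any CoAP library):

```python
import struct

def build_non_get_header(msg_id, token):
    """Pack the fixed CoAP header for a NON GET request as described above:
    version 1 (01), type NON = 1 (01), token length, code 0.01 (GET),
    followed by the 16-bit message ID and the token itself."""
    ver, mtype, tkl = 1, 1, len(token)
    code = (0 << 5) | 1                      # class 0, detail 1 -> GET (0.01)
    first = (ver << 6) | (mtype << 4) | tkl  # 0b01_01_0100 for a 4-byte token
    return struct.pack("!BBH", first, code, msg_id) + token

header = build_non_get_header(0x0001, b"\x12\x34\x56\x78")
# header[0] == 0x54: version 1, type NON, token length 4
```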

Dataset Analysis
The anomalous traffic is measured from InRow 13, with IP source 192.168.0.2, during one week, as shown in Table 2. All the anomalies presented are performed on the temperature message packets. During the interception anomaly, the data text line packets from the InRow 13 sensors are not sent to the client. For the modification anomaly, all the temperature measurements are adjusted to the same value, placing them at 11.56 °C regardless of which sensor they belong to. When a duplication anomaly occurs, additional packets are sent to the client.
As aforementioned, each sensor sends data to the client every 5 min. The CoAP data text line packets correspond to the messages from the sensors. Therefore, under normal conditions, there should be 288 daily CoAP data text line packets per sensor. Anomalous duplication and modification packets must be reflected in the number of labelled daily packets. In the case of interception, the packets are never sent from the host; thus, they do not exist in the dataset and cannot be labelled as anomalies. However, the request messages sent by the client are, in this case, labelled as anomalies.
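The expected daily packet counts under normal conditions follow directly from the sampling interval. A quick sanity check, pure arithmetic assuming the 5-minute interval and the 4 × 4 sensor layout stated in the text:

```python
# One CoAP data text line packet per sensor every 5 minutes.
SAMPLE_INTERVAL_MIN = 5
SENSORS_PER_INROW, INROWS = 4, 4

packets_per_sensor_per_day = 24 * 60 // SAMPLE_INTERVAL_MIN            # 288
responses_per_day = SENSORS_PER_INROW * INROWS * packets_per_sensor_per_day  # 4608

# Each response is paired with a client request, so deviations from these
# totals (fewer on interception days, more on duplication days) flag anomalies.
```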

General Dataset Description
Taking into account the type of communication used in the proposed scenario, a series of important features related to the sending of CoAP messages are selected, such as the number of source and destination bytes, the number of source and destination packets sent, the number of packets for each protocol, and the number of abnormal and normal packets per day.
CIDAD has a total of 88,238 packets and, recalling that the CoAP protocol runs over UDP, 73% of all the traffic corresponds to UDP packets. Detailed information is presented in Table 3. The dataset contains ARP, IP, and IPv6 traffic; however, the packets sent over IPv6 represent only 0.95% of the total, and ARP traffic 26%. All IPv6 packets correspond to ICMPv6 packets, while IPv4 packets contain UDP and ICMP datagrams. All the anomalous packets belong to the UDP protocol: 71.62% of the traffic consists of normal UDP packets, while just 1.20% of all the traffic is labelled as an anomaly. It is important to note that the packets from the sensor that are intercepted do not appear in the dataset because they were never sent; as a consequence, the request packets sent by the client that receive no response are labelled as anomalies. Knowing how CoAP carries out the exchange of messages and assuming the request/response connection, under normal conditions there should be homogeneity in the number of packets sent from each source IP address, as well as in the number of bytes sent. Therefore, alterations in data uniformity are an indication of an anomaly.
The four sensors belonging to the same InRow share the same IP address. Due to the type of architecture presented, the sensors of the IoT network only send packets to the client and do not create connections between them. A large portion of the packets, 32,249, corresponds to requests made by the client. All abnormal traffic involves InRow 13, with the IP address 192.168.0.2, over IPv4.
The lack of homogeneity in the packets sent by the sensors represents abnormal network behaviour. The frame length determines the number of bytes sent per packet. Taking into account the proposed scenario and the homogeneity that must be present, given the nature of the CoAP protocol, changes in the density function of the frame length imply anomalies in the packets sent.
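This homogeneity argument can be turned into a simple check: count packets per source address and flag any source whose daily count deviates from the typical value. The following is a sketch with a hypothetical tolerance threshold, not the feature set actually used later in the paper:

```python
from collections import Counter
from statistics import mode

def flag_nonuniform_sources(packet_sources, tolerance=0.1):
    """Count packets per source IP and flag sources whose count deviates
    from the typical (modal) count by more than `tolerance` (relative)."""
    counts = Counter(packet_sources)
    typical = mode(counts.values())
    return {src for src, c in counts.items()
            if abs(c - typical) > tolerance * typical}

# Under normal conditions every source sends the same number of packets,
# so an interception or duplication day stands out immediately:
day = ["192.168.0.1"] * 288 + ["192.168.0.2"] * 150 + ["192.168.0.3"] * 288
flag_nonuniform_sources(day)  # -> {"192.168.0.2"}
```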

UDP/CoAP Description
CIDAD presents the UDP protocol at the transport layer and CoAP at the application layer. CoAP makes use of two message types, requests and responses, using a simple, binary-based header format. In the dataset, there are two types of CoAP messages: the packets marked as CoAP represent the requests made by the client, and the CoAP data text line packets are the responses sent by each of the sensors. Therefore, the messages with the temperature data from the sensors are housed in the payload of the CoAP data text line messages.
Table 4 shows the number of packets for each type of CoAP message. As mentioned above, all anomalous packets are UDP messages. The packets labelled as anomalies in the CoAP protocol on Tuesday, Friday, and Saturday correspond to packets sent by the client that received no response, which corresponds to an interception anomaly. This anomaly is also reflected in the absence of a CoAP data text line. The modification and duplication anomalies are labelled over the CoAP data text line. In the case of Wednesday (modification anomaly), the number of packets remains unchanged, while on Thursday (duplication anomaly) there is a considerable increase in the number of packets. On days where there is a mixture of anomalies (Friday and Saturday), the number of packets and their distribution are not accurate indicators of the presence of anomalies. As a consequence of the simplicity of the structure handled by the CoAP protocol, the network presents a low, uniform, and homogeneous exchange of packets. This fact makes it necessary to find other types of features that make it possible to model network behaviour as normal or abnormal.
All the sensors of an InRow have the same IP address, but the URI-Host field makes it possible to determine the InRow to which a packet belongs, and the URI-Path field identifies the sensor. The network has a total of 16 sensors, and each sensor sends 288 packets per day to the client. The absence or presence of additional packets also gives an initial indicator of the type of anomaly the InRow presents that day. The first evidence of removed packets in this kind of lightweight environment is the lack of uniformity in the traffic. Indeed, the packet reduction presented on Tuesday at InRow 13 is a strong indicator of the presence of an interception anomaly, in contrast to the excess of packets on Thursday, indicating a duplication anomaly. On days in which there is a mixture of anomalies, such as Friday and Saturday, the presence of duplication and interception anomalies will not affect the homogeneity in the number of packets sent per day, but the number of anomalous packets will increase.

Machine Learning Techniques
The aim of a supervised machine learning approach is to train an algorithm using a dataset for which the outcome is known. From these data, the algorithm "learns" and can then make decisions about values for which the outcome is not known [44]. There are many machine learning methods, some more flexible than others; shallow models are simpler since they have a relatively small estimation range. The implementation of the algorithms was carried out using the scikit-learn classification tools in Python [45].
During the training phase, a process called hyperparameter optimisation, or tuning, takes place. It consists of training several models with different combinations of hyperparameter values and then comparing their performance to choose the best one according to a predefined metric on a cross-validation (cv) set. The principle of grid search is exhaustive searching [46].
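The grid search procedure described above can be sketched with scikit-learn's GridSearchCV. This is a minimal illustration on synthetic data standing in for CIDAD; the grid values mirror those reported later for logistic regression, but the data and outcome are not the paper's.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced toy data standing in for CIDAD features/labels.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Exhaustive search over a small hyperparameter grid; each candidate is
# scored with ROC AUC on a 5-fold cross-validation set.
grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"penalty": ["l1", "l2"], "C": [0.001, 0.1, 1, 100, 1000]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
best = grid.best_params_  # combination with the best mean cv score
```

The same pattern applies to each of the five models considered; only the estimator and the parameter grid change.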
To perform the experiment, five machine learning models were selected to validate the dataset for detecting traffic anomalies in CoAP-IoT networks: logistic regression, naive Bayes, random forest, AdaBoost, and support vector machine. Initially, logistic regression with the liblinear solver was applied, tuning the L1, L2, and C parameters to address overfitting.
To check or visualise the performance of a classification model, one of the most important evaluation metrics is the AUC (area under the curve) of the ROC (receiver operating characteristic) curve. The ROC is a probability curve, and the AUC represents the degree or measure of separability: it tells us to what extent the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting class 0 as 0 and class 1 as 1. A mathematical explanation of the metric can be found in [47].
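As a small illustration of the metric, ROC AUC can be computed directly from true labels and predicted class-1 probabilities; the toy values below are ours, not from the dataset.

```python
from sklearn.metrics import roc_auc_score

# True labels (1 = anomaly) and model-estimated probabilities of class 1.
y_true = [0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.8, 0.9]

# Every positive is ranked above every negative, so AUC = 1.0
# (perfect separability between the two classes).
auc = roc_auc_score(y_true, y_score)
```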
The type of classification performed in the experiments is binary; thus, Bernoulli naive Bayes was employed, using a single parameter, α, for optimisation. For random forest classification, the parameters considered were n_estimators and max_features. The n_estimators parameter sets the number of trees in the RF model. The max_features parameter can take the values sqrt and log2; it sets the number of features considered when searching for the best split. The fourth model selected was AdaBoost, using a decision tree classifier as the base estimator. The hyperparameters selected for tuning AdaBoost, considering the trade-off between them, were n_estimators and learning_rate. Finally, a crucial hyperparameter for a linear support vector machine (SVM) model is the regularisation penalty, C, which can severely affect the resulting shape of the decision regions for each class.
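The five models and their search spaces described above can be collected in one place. This is a hypothetical sketch of such a configuration using the grid values quoted in the training section; the study's exact instantiation may differ.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import LinearSVC

models = {
    # penalty (L1/L2) and regularisation strength C tuned against overfitting
    "LR": (LogisticRegression(solver="liblinear"),
           {"penalty": ["l1", "l2"], "C": [0.001, 0.1, 1, 100, 1000]}),
    # binary task, so Bernoulli NB with its single smoothing parameter alpha
    "NB": (BernoulliNB(), {"alpha": [0.01, 0.1, 0.5, 1, 10]}),
    # number of trees and features considered at each split
    "RF": (RandomForestClassifier(),
           {"n_estimators": [100, 200, 300],
            "max_features": ["sqrt", "log2"]}),
    # scikit-learn's default AdaBoost base estimator is a depth-1 decision
    # tree, matching the decision stump used in the paper
    "AdaBoost": (AdaBoostClassifier(),
                 {"n_estimators": [500, 1000, 2000],
                  "learning_rate": [0.001, 0.1, 0.5, 1]}),
    # linear SVM: only the regularisation penalty C is tuned
    "SVM": (LinearSVC(), {"C": [0.1, 1, 5, 10]}),
}
```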

Data Preparation
This work was developed by adapting the cross-industry standard process for data mining (CRISP-DM) [48]. The first step in a machine learning pipeline involves all the techniques adopted to clean the dataset, reduce its dimensionality, and remove noise. Algorithm performance can be adversely affected when using raw datasets. Some features determine the general behaviour of the sample, while others simply do not provide additional information. Therefore, it is important to have a clear view of the dataset and reduce or select the number of relevant features.
After the analysis carried out in Section 5, the following features were initially considered. The implementation of the selected machine learning algorithms requires numerical arrays as input data, so it is necessary to transform categorical variables into numerical ones. This process is known as discretisation. In this approach, the features are encoded using a "dummy" encoding scheme, which creates a binary column for each category and returns a sparse matrix or a dense array (depending on the sparse parameter). After mapping symbolic attributes to numeric values, if there is significant variance, scaling of the feature values is required. Feature scaling is achieved through mean normalisation.
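The encoding and scaling steps can be sketched as follows. The feature names (coap_code, pkt_len) and values are hypothetical examples, not the dataset's actual columns.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical CoAP-like categorical feature alongside a numeric one.
df = pd.DataFrame({
    "coap_code": ["GET", "GET", "2.05", "GET"],  # categorical
    "pkt_len": [45, 45, 60, 300],                # numeric
})

# "Dummy" encoding: one binary column per category of coap_code.
encoded = pd.get_dummies(df, columns=["coap_code"])

# Mean normalisation / standard scaling of the numeric feature,
# giving it zero mean and unit variance.
encoded["pkt_len"] = StandardScaler().fit_transform(encoded[["pkt_len"]])
```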
In this case, the dimensionality of the input dataset is high, and so is the complexity of every related machine learning algorithm. Feature selection is the process in which each characteristic is evaluated to determine those that affect the outcome within the dataset. Its purpose is to reduce high-dimensional data while maintaining or improving precision. The recursive feature elimination (RFE) algorithm recursively excludes features and evaluates the resulting performance [49]. For the estimation of RFE on CIDAD, a DT classifier with a 5-fold cv was used as the base model. The feature importance ranking is shown in Figure 2, where each colour represents one of the iterations.
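The RFE setup described above can be reproduced in outline with scikit-learn's RFECV, which combines recursive elimination with cross-validated scoring. The synthetic data and feature counts below are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 10 features, only some of which are informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

# RFECV recursively drops the weakest feature, scoring each subset with
# a 5-fold cross-validation of the decision tree base model.
selector = RFECV(DecisionTreeClassifier(random_state=0), cv=5)
selector.fit(X, y)

n_kept = selector.n_features_       # number of features judged relevant
mask = selector.support_            # boolean mask of the retained features
```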
Unbalanced data are a common problem in network intrusion systems, where the important cases requiring detection represent a very small portion of the data. This situation is clearly present here (see Table 3), where abnormal packets represent 1.2% of the total. When learning from extremely unbalanced data, there is a significant probability that a selected sample contains few or even no instances of the minority class, resulting in an algorithm with poor performance in predicting the minority class. One alternative to mitigate misclassifications in machine learning algorithms is stratified k-fold cv. This technique generates partitions in the data, maintaining the class balance of the samples during the training of multiple models. Another mechanism that reduces the imbalance is the use of sampling techniques. The synthetic minority oversampling technique (SMOTE) handles unbalanced classification from the minority side: instead of oversampling the minority class with copies of existing samples, it creates synthetic minority-class instances to augment the current data and help avoid overfitting [50,51].

Models Training
As a good practice, CIDAD was split into training and test sets, with a training subset of 80%. For the evaluation of the models, a nested stratified k-fold cv was chosen, with an internal and an external loop, both adjusted with k = 5. Additionally, to guarantee balanced samples, SMOTE was applied in the internal loop using five k-neighbours. The minority class was over-sampled to a ratio of about 0.4, and the majority class was under-sampled to a ratio of 0.5. This section outlines the process of hyperparameter tuning of the machine learning models over the remaining training data, computing ROC AUC scores. The results presented correspond to the average value of the inner cv, and the best parameters are chosen to be used in the test stage.
The first model considered was logistic regression with the liblinear solver. The penalty hyperparameter was adjusted to L1 and L2. To tune the C parameter, a sweep over the values 0.001, 0.1, 1, 100, and 1000 was carried out. The best model was obtained with an L2 penalty and a C value of 0.1, with an ROC AUC score of 0.984. The global mean over all iterations was 0.9845, with a standard deviation of 1.013 × 10⁻³. Normally, better performance is expected at high values of C; however, as can be seen in Figure 3, varying C does not significantly affect the performance of the model given an appropriate choice of penalty.
The tuned hyperparameters that offer the best results for LR, in all cases, are C equal to 0.1 and an L2 penalty, as shown in Table 5. This table presents the parameters that obtain the best average score in each internal cv, together with their standard deviation, among all the combinations tested. The parameters selected here are those used in the evaluation of the model on the test split.

To identify the optimal combination of hyperparameters for the Bernoulli naive Bayes model, a sweep over the values 0.01, 0.1, 0.5, 1, and 10 for the parameter α was performed. The overall mean ROC AUC value was 0.9398, with a standard deviation of 7.35 × 10⁻³. The best individual score for this model was 0.9596, with α equal to 10. Figure 4 shows the average results obtained over the selected values of α in the inner iterations. As with logistic regression, naive Bayes does not present a linear relationship with the values of α in each iteration. An average of the values obtained in the internal iterations for each parameter combination is calculated for the selection of the parameters to be used in the model. The best mean in the inner cv is adopted and defines the parameter used on the test split. Table 6 presents the results of this process for the NB model.

For RF, the hyperparameters were tuned using sqrt and log2 as max_features, tested with 100, 200, and 300 trees. This model obtains a score of 1 on average over all iterations. Figure 5 presents the average ROC AUC results by iteration. The best performance among all the models at the training stage is obtained by random forest. In tree-based models, the number of trees and features can affect the execution time and performance of the model. For this dataset, many parameter combinations achieved perfect scores; among those, the combinations achieving the same performance with fewer trees were selected. Table 7 shows the chosen parameter combinations that perform perfectly for each inner iteration.

The AdaBoost classifier, on the other hand, employs a DT base estimator initialised with max_depth = 1. In this model, tuning is performed over 500, 1000, and 2000 trees. The learning rate values tested were 0.001, 0.1, 0.5, and 1. The overall mean ROC AUC obtained was 0.99, with a standard deviation of 4.04 × 10⁻⁴.
AdaBoost's performance can be improved by increasing the learning rate and the number of trees, as shown in Figure 6. Even so, in the ideal scenario, high accuracy values are obtained with low learning rates and fewer trees in the classifier. Table 8 presents the parameters that achieve the highest average in the internal cv for the AdaBoost model. The parameter combination most frequently obtained is a learning rate of 0.5 with 500 n_estimators. Finally, in SVM, the penalty parameter C defines how much error is tolerable and regulates the trade-off between the decision boundary and the misclassification term. The parameter C was varied over 0.1, 1, 5, and 10. The linear SVM achieved a mean ROC AUC score of 0.98, with a standard deviation of 1.04 × 10⁻³.
Figure 7 shows the averages obtained in each of the internal iterations for the SVM model. For this dataset, variations in the parameter C do not improve the performance of the model. As can be seen in Table 9, the configurations of C that obtain the best means are 10 and 1, with an ROC AUC value of 0.98.

Results and Model Comparison
A classification task can be evaluated in many different ways depending on the objectives. For binary classification problems, the performance of an algorithm can be assessed based on a confusion matrix, which compares the actual target values with those predicted by the machine learning model. In practice, accuracy is the most widely used evaluation metric for both binary and multi-class classification problems. Through accuracy, the quality of the approach is evaluated as the proportion of correct predictions over total observations. F1 is also a good discriminator and often performs better than accuracy for optimising binary classification tasks. In contrast, precision and recall each evaluate a single aspect of performance (the correctness of positive predictions and the coverage of actual positives, respectively). This section presents the results of four traditional metrics for all the selected shallow machine learning models: (a) accuracy, (b) precision, (c) recall, and (d) F1. The definitions of the metrics are thoroughly explained in [47]. Furthermore, Cohen's kappa statistic is evaluated. To identify the best model, the performance of the algorithms is compared.
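All five metrics are available directly in scikit-learn; the toy labels below (1 = anomaly) are illustrative values chosen by us, not results from the dataset.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score)

# Hypothetical labels: 6 normal (0) and 4 anomalous (1) packets, with one
# false positive (index 5) and one false negative (index 9).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)      # 8 correct of 10 -> 0.8
prec = precision_score(y_true, y_pred)    # TP / (TP + FP) = 3/4 = 0.75
rec = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/4 = 0.75
f1 = f1_score(y_true, y_pred)             # harmonic mean -> 0.75
kappa = cohen_kappa_score(y_true, y_pred) # agreement beyond chance
```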
The evaluation of the models uses the remaining 20% of the dataset. The hyperparameters and configurations used in the test stage are the same as those used in the training split. Figure 8 shows the average values of the inner iterations for each of the outer iterations.
Figure 8a presents the mean accuracy values for each model. Even though the naive Bayes classifier exhibits the lowest performance for this metric, it is still an excellent result, with a detection rate above 90%. In other words, all the classifiers, in the selected configurations, predict a very high proportion of the evaluated observations correctly. The best results are achieved by the models based on decision trees, which obtain a perfect result in some iterations.
Precision refers to the ratio of correctly predicted positive instances to the total number of instances predicted as positive. Figure 8b illustrates the results obtained for this metric. Decision-tree-based classifiers achieved values of 1 in most iterations. On the contrary, the rest of the models obtained very low values, below 50%.
Since we are interested in the detection of network traffic anomalies, classifying abnormal packets as normal can imply a risk for the system, and it is therefore important to minimise false negatives. The recall metric, illustrated in Figure 8c, indicates how many of the actual positives the model correctly identifies. For this metric, NB obtains deficient values, in contrast to SVM, LR, AdaBoost, and RF, which obtain very satisfactory results. The F1 metric is the harmonic mean of precision and recall; thus, low values in either of these metrics directly affect the F1 values, as shown in Figure 8d.
The global results obtained for each model and their standard deviations are presented in Table 10. The naive Bayes classifier presents the lowest performance in all metrics. Despite having excellent results in accuracy, NB cannot be declared a reliable classifier. The linear-family models, SVM and LR, give similar results. Likewise, decision-tree-based classifiers obtain similar results and are found to be the best models, with RF being an optimal classifier for this dataset. Cohen's kappa statistic is a very useful evaluation metric when dealing with imbalanced data. It is a metric often used to evaluate the agreement between two raters and can also be used to assess the performance of a classification model. In the experiments, the NB classifier obtained a kappa score of 0.11, meaning that the classification could nearly have been obtained by random guessing. The LR and SVM classifiers have a poor score of around 0.34. On the other hand, AdaBoost and RF presented kappa statistics of 0.86 and 0.99, respectively, guaranteeing excellent classification performance.

Discussion
With the recent advances in IoT networks present in daily life and the difficulty of incorporating secure systems, intrusion detection systems have become a versatile tool for certifying their integrity and confidentiality.
Preventing or avoiding network failures using anomaly-based IDSs requires accessible datasets to analyse and detect anomalous behaviour in network traffic. The CoAP protocol is one of the most widespread protocols in the IoT environment. However, after reviewing the literature, it was not possible to find datasets developed for this protocol that fulfilled our requirements and allowed its scope and limitations to be determined.
This article presents a versatile, realistic, easy-to-handle, and fully labelled dataset designed for use in intrusion detection systems in CoAP-IoT environments. In addition, a specific analysis of the most relevant characteristics of the dataset is undertaken. It contains a total of 88,238 packets, using the ARP, IP, and IPv6 protocols. A total of 64,459 packets correspond to UDP, and all the anomalies are performed over the CoAP data. This means that all anomaly packets belong to UDP/CoAP and represent 1.64% of the UDP packets and 1.2% of the total packets in the dataset.
To validate the dataset, we selected five widely used shallow machine learning algorithms: logistic regression, naive Bayes, random forest, AdaBoost, and linear support vector machine. Three of them base their behaviour on statistical linear functions (logistic regression, naive Bayes, and SVM), and the other two are based on decision trees.
The data entered into the classifier had to be appropriately prepared so that they were perfectly understood by the classifier. Initially, data pre-processing using discretisation techniques was performed. Then, RFE was carried out for feature elimination. The SMOTE data balancing technique, along with stratified k-fold cross-validation, ensured the presence of the minority class, while hyperparameter tuning of each model sought the configuration that gave the best results for the classifier.
To evaluate the machine learning classification algorithms, the cv train-test split procedure was used. In the training stage, the ROC AUC score was considered. The best values achieved were 0.97 for LR and SVM, 0.96 for NB, and 1 for RF and AdaBoost, for different parameter combinations.
For the test stage, as is traditional for ML models, four measures were analysed to determine the quality of the classifier: accuracy, precision, recall, and F1. However, due to the type of scenario in which the study was developed, i.e., detection of traffic anomalies, the most relevant aspect is the adequate detection of positives. This is why recall is so important in this context, since it measures the ability of an algorithm to predict a positive outcome when the actual outcome is positive.
The obtained results confirm the tree-based classifiers as the best for this dataset, reaching almost perfect values (close to 1) and also presenting the lowest standard deviations, indicating the stability of the classifiers. As expected, models of the same nature obtained very similar results. Cohen's kappa supplied a more objective description of model performance, confirming RF and AdaBoost as the best classifiers. These results demonstrate that this dataset can be used for anomaly traffic detection in IoT-CoAP environments.
In conclusion, we present the first application of ML algorithms to the proposed dataset for anomaly detection in IoT networks. In future work, we will consider other datasets in order to verify the obtained results. This research focuses on the application of shallow machine learning models; in the future, the research can be expanded to the application of deep learning algorithms on the proposed dataset.

Figure 5. Random forest mean score by iteration.
Doshi et al. [14]: They developed classifiers to automatically identify attacks in IoT traffic. They demonstrated that considering IoT traffic information in the feature selection process can lead to high-accuracy distributed denial of service (DDoS) detection [14]. A pipeline was assembled to operate on network middleboxes and the corresponding devices that may be part of an ongoing botnet. Different classifiers for attack detection were implemented, such as KNN, support vector machine with linear kernel (LSVM), decision tree (DT), random forest (RF), and neural networks (NN).
All eyes on you [15]: Pahl et al. developed an IoT microservice anomaly detection system, creating and publishing two separate datasets. The datasets monitored connections between seven different virtual state layer (VSL) service types: light controllers, movement sensors, thermostats, solar batteries, washing machines, door locks, and user smartphones. Additionally, employing this dataset, Hasan et al. [16] proposed an ML system using LR, SVM, DT, RF, and ANN. The proposed ML-based models could recognise and protect the system when it is in an abnormal state; the best accuracy results were obtained in both the training and test stages. X-IIoTID [40]: This dataset is implemented using the Brown-IIoTbed [41] testbed developed at the University of New South Wales (UNSW). It consists of a variety of M2M, machine-to-human (M2H), and human-to-machine (H2M) connectivity protocols, sensors, actuators, various mobile and IT devices, means of access, application programming interfaces (APIs), and states implemented at the three levels of an IIoT system. It contains a wide variety of attacks on different layers of the system.
The data include normal and abnormal traffic, captured at various hours of the day over four non-continuous months. For dataset validation, DT, NB, kNN, SVM, LR, DNN, and GRU algorithms were used. The best results were achieved with DT in all cases, reaching an accuracy of 99.54% for binary classification. • CICIoT Dataset 2022 [42]: Simulating smart home activity, this dataset was created from 60 devices in an isolated IoT sensor network at a laboratory of the Canadian Institute for Cybersecurity (CIC). The IoT network includes WiFi, ZigBee, and Z-Wave devices. In the experiments, six different scenarios were investigated, containing two different types of attacks: a flood denial-of-service attack and a real-time streaming protocol (RTSP) brute-force attack. The IDSs were developed using 12 different classifiers: Gaussian NB, DT, LDA, AdaBoost, Ridge, Perceptron, passive-aggressive, XGBoost, kNN, RF, LSVC, and SGD, with the best results obtained by XGBoost (98.6% accuracy) and the tree-based classifiers DT and RF (98.5% accuracy), but with a lower prediction time on the tested dataset. They finally decided on an RF classifier because it is computationally efficient, does not overfit, and its performance can be improved with data and feature selection.

Table 1. IoT dataset review summary considering applications in ML-IDS.

Table 5. Best logistic regression model by iteration.

Table 6. Best naive Bayes model by iteration.

Table 7. Best random forest model by iteration.

Table 8. Best AdaBoost model by iteration.

Table 9. Best support vector machine model by iteration.

Figure 7. Support vector machine mean score by iteration.

Table 10. Best scoring model by iteration.