A Comparison of Neural-Network-Based Intrusion Detection against Signature-Based Detection in IoT Networks

Abstract: Over the last few years, a plethora of papers presenting machine-learning-based approaches for intrusion detection have been published. However, the majority of those papers do not compare their results against a proper baseline of a signature-based intrusion detection system, thus violating good machine learning practices. In order to evaluate the pros and cons of the machine-learning-based approach, we replicated a research study that uses a deep neural network model for intrusion detection. The results of our replication expose several systematic problems with the used datasets and evaluation methods. In our experiments, a signature-based intrusion detection system with a minimal setup was able to outperform the tested model even under small traffic changes. Testing the replicated neural network on a new dataset recorded in the same environment with the same attacks using the same tools showed that the accuracy of the neural network dropped to 54%. Furthermore, the often-claimed advantage of being able to detect zero-day attacks could not be observed in our experiments.


Introduction
Intrusion detection belongs to the fundamental techniques in network defense. Therefore, new methods and technologies should be adopted as soon as possible to improve network intrusion detection. Over the past years, machine learning techniques, and in particular deep neural networks, have attracted great interest in the field of network intrusion detection. Several publications show the benefit of this approach [1,2,3,4,5,6,7,8,9,10,11].
The Internet of Things (IoT) enables a wide range of applications, for example in the fields of smart homes, smart agriculture, smart cities and smart healthcare. According to estimates from the year 2022, the number of connected IoT devices was already 14.3 billion and was expected to rise to 16.7 billion in 2023 [12]. At the same time, the huge number of devices connected to the Internet enables a large number of possible attacks. In principle, two different attack scenarios have to be distinguished: in the first case, the IoT device itself is the target of the attack; in the second case, IoT devices are misused as the source of attacks, typically DDoS attacks [13,14]. However, the DDoS attack is usually preceded by an attack targeting and hijacking the IoT device. Hence, intrusion detection systems for IoT environments focus on the first case.
IDSs are classified according to their detection method into signature-based or anomaly-based systems. In a signature-based IDS, the detection of attacks is based on events or patterns which describe the attack; these are referred to as signatures. The patterns are not limited to simple character matching, but can also reflect the behavior of users in the network. A signature-based IDS can therefore detect only known attacks. Moreover, the attack patterns must be precisely defined, otherwise the risk of false alarms increases. In contrast, an anomaly-based IDS detects attacks by the fact that they differ significantly from the normal behavior of the system, i.e. they form an anomaly. The decision whether something is an anomaly can be based on the protocol used or on statistics. For this purpose, statistical variables such as the mean value or the variance of the system behavior are collected. Anomaly-based IDSs therefore offer the advantage that it is no longer necessary to define each attack signature individually. However, it cannot be guaranteed in every case that an attack differs significantly from normal behavior, which would prevent it from being detected by the IDS [15].

In recent years, new approaches have emerged alongside these classic intrusion detection systems which use machine learning methods to detect attacks and learn patterns from the data. A survey from 2021 lists about 54 different intrusion detection systems based on machine learning which address the IoT domain [16]. Especially deep learning IDSs (DLIDS) are favored in the literature [1,2,3,6,4,5,9,10,11]. The term deep learning is used for neural networks that consist of at least four layers [17]. This means that in addition to the input and output layer, at least two hidden layers are required.
Starting with the paper of Arp et al. in 2020 [18], the scientific community has realized that the pitfalls in the design, implementation, and evaluation of machine-learning-based systems also occur in the context of security systems. Pitfall number 6 stated by Arp et al. is the inappropriate baseline, which is described as follows: "The evaluation is conducted without, or with limited, baseline methods. As a result, it is impossible to demonstrate improvements against the state of the art and other security mechanisms". Regarding the research in the field of DLIDS in the IoT, this pitfall can be regularly observed in scientific publications: authors present their promising results, which are compared against a prior ML-based system, but no comparisons with state-of-the-art approaches such as the classic signature-based systems Snort, Suricata and Zeek [19,20,21] are included. As pitfall 6 describes, this approach does not allow any insight into how well the presented system performs in comparison with the state of the art. The only publication known to us where a comparison against the state of the art is presented originates from Gray et al. [8], where the authors compare a random forest classifier against the Suricata intrusion detection system. Suricata was configured only with the attack-specific sections of the Emerging Threats Open rule set [22], without any configuration for the detection of SSH brute-force attacks, DDoS or port scans. So, the comparison remains unrealistic.
Cahyo et al. published a survey about hybrid intrusion detection systems in 2020 [23]. They compare 20 papers published between 2016 and 2020, which all combine a signature-based IDS, for example Snort, with anomaly detection based on an ML approach (Gaussian Mixture Model, K-means, Decision Tree, ...). The hybrid approach obviously seems promising since it combines the best of both worlds: the signature-based approach to detect known attacks and the ML-based approach to detect anomalies. However, the adoption of these hybrid approaches in practice is still missing. For example, the well-known signature-based systems Snort and Suricata have not integrated ML-based preprocessors in their released source code.
All in all, the real benefit of ML-based approaches, especially for multi-class detection, must be questioned. To gain deeper insight into the pros and cons, we started a comparison study. Instead of creating a new DLIDS, we picked one of the best-performing DLIDS from the literature and conducted an experimental study under different benchmark scenarios. Further, we compared the DLIDS against Snort [19] as a well-known signature-based system.
The contributions of this work are:

1. Evaluation of a DLIDS in a replicated environment using the same attacks with and without slight modifications in the attack pattern. Furthermore, we evaluate the DLIDS against an unseen attack, demystifying the claim of zero-day attack detection.

2. Comparison of the DLIDS against the signature-based IDS Snort as a proper baseline.

3. Showing shortcomings of existing training sets and proposing new validation datasets (MQTT-IoT-IDS2020-UP and variations).

4. Discussion of the weaknesses of currently applied performance metrics in the context of IDS (addressing P7 - Inappropriate Performance Measures from [18]).

5. An overview of IoT datasets to support IDS researchers in selecting a suitable dataset for their research.
Related Work

Discussion of Machine Learning Pitfalls
While machine learning has shown its big potential in many different areas, several researchers are discussing pitfalls in the context of security systems [18,24,25]. The authors of [18] argue that these pitfalls may be the reason why learning-based systems are potentially still unsuited for practical deployment in security tasks.
Arp et al. [18] conduct a study of 30 papers from top-tier security conferences published within the past 10 years which shows that these pitfalls are widespread in the current security literature. Among the 30 publications were also four papers from the network intrusion detection area. The authors present 10 pitfalls which occur frequently in the reviewed papers. Some of them are particularly relevant for network intrusion detection systems (NIDS):

P1 Sampling Bias, when the datasets used for training and testing do not represent the network traffic. (see Section 3)

P6 Inappropriate Baseline, when new approaches are only compared against very similar approaches, but not against the state of the art. (see Section 5.5)

P7 Inappropriate Performance Measures: in the context of NIDS, a single performance metric like accuracy is not sufficient; precision is also important, since a high false-positive rate would render the system useless. Further, in the context of NIDS, both binary and multi-class classification are done, and a detailed definition of the performance metrics for the multi-class classification problem is often missing. (see Section 5.7)

P8 Base Rate Fallacy, which is similar to P7 but accounts for the misleading interpretation of results.

Dambra et al. discuss similar pitfalls in the context of Windows malware classification [25].
Further, there are several publications which discuss the reliability and the practical benefit of ML-based approaches in the context of network IDS [18,26,27,28]. Venturi et al. discuss the influence of temporal patterns in malicious traffic on the detection performance [26]. They experiment with an ML-based NIDS which was trained on a public dataset for bot detection [29]. They observe a drastic performance drop when just 1% of benign flows are injected at the end of the training set. The authors conclude that their findings raise questions about the reliable deployment of sequential ML-NIDS in practice [26]. A similar result was recently reported by Aldhaheri and Alhuzali [27]. The authors tested five different ML-based models (support vector machine, naïve Bayes, LSTM, logistic regression, and K-nearest neighbor) against synthetically generated adversarial attack flows. The detection rate for these slightly adapted attack flows was reduced for all five IDSs by an average of 15.93%. For example, the precision for detecting DDoS attacks dropped for the LSTM from 98% (test set from the CICIDS2017 dataset [30]) to 73% (newly generated DDoS attack traffic).
In a recent publication by Zola et al. [28], the authors state that most ML-based approaches ignore a key feature of malware: its dynamic nature, since malware is constantly evolving. They conclude that it may be more beneficial to train models on fewer but recent data and re-train them after a few months in order to maintain performance.
Further, ML-based approaches are the target of attacks themselves. Tidjon et al. have published a recent survey of threat assessment in machine learning repositories [31]. Since ML frameworks are software, they can also contain vulnerabilities. For example, for TensorFlow the authors report 5 dependencies with high severity according to the CVSS v3/v2 score (i.e., yaml, sqlite3, icu, libjpeg-9c, curl/libcurl7, psutil). Further, the model creation process itself can be targeted in order to create models that behave incorrectly. In 2023, the Open Worldwide Application Security Project (OWASP) published for the first time a Machine Learning Security Top Ten list [32]. The top 3 attacks are Input Manipulation, Data Poisoning and Model Inversion. OWASP has also started work on an AI Security and Privacy Guide [33].
In the following, we review ML-based IDS publications for the IoT domain which are relevant for our study. Alaiz-Moreton et al. trained a deep neural network with their own dataset for multi-class classification [1]. The dataset is available online. Further details about the dataset are given in Section 3.
Ciklabakkal et al. [2] trained a neural network for anomaly detection, i.e. to distinguish benign and attack traffic. They use 8 IP features, 16 TCP and 6 MQTT features. Since they also trained with source and destination IP addresses, the network will hardly perform similarly in any other environment.
Hindy et al. [3] introduced the MQTT-IoT-IDS2020 dataset and trained several classical ML models on it. The dataset is publicly available and includes benign traffic and 4 attack types: aggressive scan (Scan A), UDP scan (Scan sU), Sparta SSH brute-force (Sparta), and MQTT brute-force attack (MQTT BF). They use IP and TCP features, but drop source and destination IP addresses, the upper-layer protocol and the MQTT flags. The remaining features are 7 IP features and 10 TCP flags.
Ge et al. perform both anomaly detection and multi-class prediction [4]. They train with the publicly available Bot-IoT dataset [35]. They extracted 11,174,616 packets from the dataset and deployed custom preprocessing. The authors dropped, for example, IP destination and source addresses as well as TCP and UDP checksums. Further, they merged some features. The preprocessing introduced duplicated rows, which had to be removed. The final dataset still consists of 7,310,546 packets. However, the final set of features was not reported.
Khan et al. [5] also perform both anomaly detection and multi-class prediction. They reuse the MQTT-IoT-IDS2020 dataset from [3]. In the case of multi-class classification, the network is trained to distinguish five classes: benign traffic and the four attack types present in the MQTT-IoT-IDS2020 dataset. While the MQTT flags were removed in [3], Khan et al. did not remove them and trained the packet-based models with them.

Ullah et al. [7] propose a convolutional neural network (CNN) and gated recurrent unit (GRU) to detect and classify binary and multi-class IoT network data. The proposed model is trained and validated using four datasets: the BoT-IoT [36] and the IoT Network Intrusion dataset, both collected by Ullah and Mahmoud [37], the MQTT-IoT-IDS2020 [3], and the IoT-23 [38] dataset. It should be pointed out that the Bot-IoT dataset from Ullah and Mahmoud is different from the identically named Bot-IoT dataset from Koroniotis et al. [35]. The authors mention that they use the 48 best features, where the significance of the features was determined using a random forest classifier. The authors compare their very good detection results (over 99.9% accuracy, precision, recall, and F1 score) with previously published DLIDS, but not with a signature-based IDS.
Vaccari et al. have published the MQTTset and trained a neural network with this dataset for multi-class prediction [34]. They used 3 TCP-related and 30 MQTT-related features. The authors report a very high accuracy of over 99%. Accuracy is the ratio of the number of correct predictions to the total number of input samples. But this result is favored by the MQTTset, which consists of 11,915,716 network packets of which only 165,463 packets (about 1.4%) belong to the attacks.
Recently, federated learning was proposed to cope with privacy demands [11], but again, no comparison with an adequate baseline was done (Pitfall 6: Inappropriate Baseline).
While several very promising ML-based models have been published, they still lack the comparison with an adequate baseline. In summary, a deeper investigation of the performance of the presented models seems necessary.

Discussion of IDS Benchmarks
Already in 2000, John McHugh pointed out that the datasets are a crucial point in training and testing NIDS [39]: either it should be demonstrated "that the training mix is an accurate representation of the real-world or accurate real-world training data should be available for each deployment environment". The problem becomes even harder due to the diversity of the environments and the dynamic behavior of IT infrastructures and services. For example, if the benign traffic is well known since only a few services are available, it is much easier to obtain a realistic benign traffic set than in an environment like a cloud data center with very heterogeneous users. Further, new applications evolve and may initiate new types of traffic. With new applications, new vulnerabilities may occur, which in turn may lead to new attacks. Hence, our environments are very dynamic in nature regarding both benign and malicious traffic.
About 10 years later, Nehinbe surveyed the shortcomings of existing datasets and the problems in creating datasets [40]. Twenty years later, the research community still concludes that many benchmark datasets do not adequately represent the real problem of network intrusion detection [41,9]. In the context of ML-based intrusion detection the situation is even worse, since the datasets are used not only for evaluation but also for training.

Popular IDS Benchmarks
Since the quality of datasets is a relevant problem for IDS research, this topic is a recurring research theme (see for example [40,42,9]). Hindy et al. give an overview of datasets, network threats and attacking tools which are available and used for collecting datasets [41]. Their analysis of 85 IDS research papers from 2008-2020 shows that most systems use the KDD-99 dataset or its successor, the NSL-KDD dataset, which cover only four attack classes. For comparison, the largest number of attack classes (14) is covered by the more recent CICIDS2017 dataset.
The CICIDS2017 dataset was released by the Canadian Institute for Cybersecurity [30]. It is not a recording of traffic in a real environment; instead, the authors recorded simulated normal and malicious traffic. The authors addressed the dynamic behavior of IT environments by including the most common attacks based on the 2016 McAfee report. They recorded attacks that are common nowadays, like DoS, DDoS, brute force, XSS, SQL injection, infiltration, port scans and botnets. The CICIDS2017 dataset includes 14 different attack classes and benign traffic. The covered protocols are HTTP, HTTPS, FTP, SSH and email protocols.
While the CICIDS2017 dataset was highly appreciated by the research community and used in intrusion detection research [43,44,45], it still has some shortcomings [46,9]. Panigrahi et al. criticized the huge volume of data, which makes it hard to handle, and the high class imbalance [46]. Engelen et al. investigated the CICIDS2017 dataset in detail [9] and report that 25% of all flows in the dataset turn out to be meaningless. The reasons are, for example, the misimplementation of one of the attack tools (DoS Hulk attack) and a misunderstanding of the TCP protocol which resulted in additional, incorrectly constructed TCP flows. Engelen et al. report that the CSE-CIC-IDS2018 dataset, the successor of CICIDS2017, also reveals errors in flow construction.

It is worth mentioning that the CICIDS2017 dataset does not include recordings of the MQTT protocol, which is important for IoT environments.

Benchmarks Specific to the IoT
The Message Queuing Telemetry Transport (MQTT) protocol [47] is the most widespread protocol for edge- and cloud-based Internet of Things (IoT) solutions. Hence, an IDS for an IoT environment has to consider MQTT-specific attacks. The communication model follows the publish-subscribe style, with a broker as the central component of the architecture. The broker manages the subscriptions and forwards published messages to the subscribers. Addressing is based on so-called topic names.
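As a brief illustration of the publish-subscribe model (not part of the replicated study), the following sketch uses the paho-mqtt client library; the broker address, topic names and payload are arbitrary examples.

import time
import paho.mqtt.client as mqtt
import paho.mqtt.publish as publish

def on_message(client, userdata, msg):
    # The broker forwards every message whose topic matches a subscription.
    print(f"{msg.topic}: {msg.payload.decode()}")

# Subscriber: registers a topic filter with the broker (paho-mqtt 1.x API).
sub = mqtt.Client()
sub.on_message = on_message
sub.connect("localhost", 1883)          # 1883 is the well-known MQTT port
sub.subscribe("sensors/+/temperature")  # '+' matches exactly one topic level
sub.loop_start()

# Publisher: addresses receivers only indirectly via the topic name,
# never via the subscribers' network addresses.
publish.single("sensors/kitchen/temperature", "21.5", hostname="localhost")

time.sleep(1)  # give the broker time to forward the message
sub.loop_stop()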
Here, we survey only publicly available datasets. An overview is given in Table 1. Among the first to record a dataset with MQTT-specific attacks were Alaiz-Moreton et al. [1]. Three different attack scenarios were captured:

1. DoS: a DoS attack against the MQTT broker, saturating it with a large number of messages per second and new connections.
2. Man in the Middle (MitM): the attacker used the Kali Linux distribution and the tool Ettercap to intercept the communication between a sensor and the broker and to modify sensor data.

3. Intrusion (MQTT topic enumeration): the attacker contacts the broker on its well-known port (1883) and tests whether the broker returns all registered MQTT topics.
The dataset used in [2] is publicly available, but not documented. The only information given by the authors is that they used the MQTT malaria tool to simulate MQTT DoS attacks.

Table 1: Overview of the publicly available IoT datasets [1,2,35,34,38,3,48] with their release years (2019-2023).

The Bot-IoT dataset [35] aims to identify IoT environments which are misused as botnets. The dataset was recorded in a virtual network using the Node-RED tool [49] to simulate benign traffic. The recorded attacks are port scans, OS fingerprinting (OS f. p.), DoS from an IoT botnet (Botnet DoS), and data theft using Metasploit to exploit weaknesses in the target machine.
The pcap files of the Bot-IoT dataset were adapted by Ullah et al. [36]. The new dataset is called IoT Botnet. The authors claim that it has more general network features and more flow-based network features, but no different attack types. From their description it remains unclear which changes they made.
The MQTTset [34] consists of benign traffic and five attack classes. The authors recorded four different types of DoS attacks: flooding denial of service with the broker as target, MQTT publish flood from a single malicious IoT device, SlowITe, which initiates a large number of connections with the MQTT broker, and an attack with malformed data sent to the MQTT broker to raise exceptions. Further, they recorded a brute-force authentication attack against the MQTT broker using the rockyou word list (MQTT B. F.). The dataset consists of 11,915,716 network packets, of which only 165,463 packets (about 1.4%) belong to the attacks.
The IoT-23 dataset [38] was published in 2020 and consists of twenty-three captures of IoT network traffic: three captures of benign IoT traffic and 20 malware captures. While the benign traffic was recorded on three real IoT devices, the malicious traffic was obtained by executing different malware samples on a Raspberry Pi. Furthermore, the dataset provides 8 labels with further information about the attacks: three labels (Mirai, Okiru, Torii) indicating botnet behavior, horizontal port scan, DDoS from an infected IoT device, file download, and HeartBeat. In addition, if a C&C server is part of the attack, this is indicated by the label C&C.
The IoT-CIDDS dataset [10] focuses on an IoT network stack consisting of the following protocols: CoAP, UDP, IPv6, 6LoWPAN and IEEE 802.15.4. For this environment the authors recorded five DDoS attacks in the 6LoWPAN network, namely hello flooding (HF), UDP flooding, selective forwarding (SF), blackhole (BH), and the ICMPv6 flooding attack.
For the CICIoT2023 dataset [48], traffic from a wide range of different IoT devices (67) was recorded in a lab environment. In this environment, the following attack categories were executed, predominantly by Raspberry Pis: DoS, DDoS, Recon, Web-Based, Brute Force, Spoofing and Mirai.

Class Imbalance
Class imbalance is an important problem frequently discussed in the machine learning field [18,24]. For network intrusion detection it is especially relevant, since the amount of malicious traffic is highly dependent on the attack type. For example, brute-force attacks and DoS attacks can generate millions of packets per second, while a targeted exploit may contain only a few packets. The same distribution problem occurs for flow data after flow analysis. When recording traffic in a real environment, the amount of benign traffic surpasses the amount of malicious traffic by far, except for DoS attacks. To circumvent this problem, dataset authors generate malicious traffic from one or a few attacker machines on which popular attack tools are executed. With this approach the number of attack packets/flows is increased, but the variety and variability of the attack traffic is not captured. Furthermore, the imbalance between different malicious traffic types is not remedied, as can be seen in Table 3 for the MQTT-IoT-IDS2020 dataset: the biggest class is Sparta with more than 61% of the packets, the smallest class is Scan_SU with only 0.07%, and the benign traffic accounts for only 7%.
The usual recording approach has further drawbacks. For example, attacks are executed in distinct time windows and from a single or only a few sources. This makes it necessary to exclude all features that are influenced by the time and source of the attack. Most publications describe the features that were excluded for that reason, but features like the TTL can easily be overlooked by the authors. For example, a strong correlation between the packets' TTL value and the traffic type was observed for the CICIDS2017 dataset in [50].

Reproducibility Study
The FAIR principles were defined by a consortium of scientists and organizations in March 2016 [51]. FAIR stands for findability, accessibility, interoperability, and reusability. The intention behind the FAIR principles is to enhance the reusability of data. This can include datasets like network traces, model descriptions and/or scripts which were used in the model evaluation. If these data are publicly available together with the published research paper, this helps to ensure transparency, reproducibility, and reusability.
As the scientific community is only slowly becoming aware of the importance of the FAIR principles, it is still not common practice to publish all the relevant data. Here, we present our efforts and difficulties during our reproducibility exercise, starting only with the publication at hand.

Selection of a Representative DNN-IDS
Only papers which fulfill the following criteria were shortlisted for our investigation:

1. Only deep learning models are considered.
2. The model is trained for IoT environments, i.e. the training set also contains MQTT-specific attacks.
3. We want to compare the selected model against a proper baseline, which is a signature-based intrusion detection system. These systems perform multi-class classification; for example, a Snort rule is configured with so-called rule messages which describe the detected attack. Hence, for a fair comparison, we consider only models which are also capable of multi-class classification.

4. We restrict ourselves to recent publications which offer the best performance as far as we are aware.

Table 2 gives an overview of the publications which are most relevant for our investigation. We selected the following classification criteria: First, which datasets are used? Are both anomaly and multi-class detection investigated? Which machine learning approaches are implemented? What is the baseline used to evaluate the presented approach? And finally, is there enough material available to replicate the model? For the last question, the dataset needs to be publicly available, and the source code for creating the model must either be published or be sufficiently described in the paper.
As a first result, it can be observed that the trend goes toward multi-class detection. The reported performance metrics (accuracy, precision, recall, F1 score) are always very high. This may lead to the impression that DLIDS are very well suited for multi-class detection.
Other researchers compare against published results as a baseline, which is actually laudable. But sometimes these comparisons have to be interpreted with care, namely when the cited performance results were obtained on different datasets [5,7,4,6]. Khan et al. do both: they compare several implemented approaches with each other and also against [1] as a baseline. Mosaiyebzadeh et al. [6] also do both: they additionally compare their proposals against a decision tree, which was the best model from [3].
Even though the work of Hindy et al. [3] introduces the MQTT-IoT-IDS2020 dataset, which is used in many publications, we did not consider it for our study, since they used only classical ML-based approaches and no DLIDS.
The models in Ciklabakkal et al. [2] do not perform multi-class attack classification. Further, the IP addresses of source and destination are included in the feature set. This is not recommended in the context of NIDS, since attackers will change their IP addresses frequently. Furthermore, the code of the model is not available.
All considered publications worked on publicly available datasets, which shows the effort of the research community to publish comprehensible results. But only some groups also published the code of their model(s) and/or give a precise definition of their data pre-processing. For example, the authors of [1] are aware of the imbalance in their datasets and used Scikit-learn for resampling, but this process is not described further. In the paper of Ullah et al. [7], too many details of the data preprocessing are missing as well; for example, the list of trained features is not given. Hence, replication is not feasible.
Of the three works which seem to be reproducible ([5,3,6]), [3] was not considered, as previously mentioned.
Even though Mosaiyebzadeh et al. [6] published their code, they removed the SSH Sparta attacks from the MQTT-IoT-IDS2020 dataset without giving any reason. This makes it impossible to compare the results to other publications using the same dataset. Using only this subset of the dataset most likely increased the performance of their model, since the SSH Sparta attacks were the most difficult to detect in our experiments, as can be seen in Section 5.9. Therefore, Mosaiyebzadeh et al. were not considered further in our study.
The remaining work by Khan et al. [5] was chosen since it seems to be well documented, which helps to redo the study. Further, the model was trained with the popular MQTT-IoT-IDS2020 dataset, which includes several relevant attacks. Finally, the proposed DLIDS outperforms the other tested ML-based approaches for all three feature categories: packet-based, bi-flow and uni-flow.
Remark: The work for our reproducibility study was chosen because it seems to be one of the best papers in the field. The problems discussed in the following should not be interpreted as a finger-pointing exercise, but as a showcase of problems which are common in the field of ML-based intrusion detection.

Detection Metrics
This section describes the metrics which we used in our investigation. For all models we used 5-fold cross-validation and employed early stopping. To be able to compare the performance to the original publication, we used the same metrics as the authors. Hindy et al. [3] give a precise definition of the metrics they use to evaluate their multi-class detection experiments, but in general this is not the case. Also in the original publication, the metrics accuracy, precision, recall and F1 score are not defined for multi-class classification. For multi-class classification we opted to use macro-averaging, since the severity of an attack does not correlate with the number of flows or packets belonging to it.
To be comparable to the original publication, the normal/benign class in multi-class classification was treated like all other classes. Therefore, there are no true negatives (TN) in multi-class classification.
The metrics used for multi-class classification with $L$ classes and $n$ samples, with per-class true positives $TP_i$, false positives $FP_i$ and false negatives $FN_i$, are therefore defined as follows:

$\mathrm{Accuracy} = \frac{1}{n}\sum_{i=1}^{L} TP_i$

$\mathrm{Precision}_{macro} = \frac{1}{L}\sum_{i=1}^{L}\frac{TP_i}{TP_i + FP_i}$

$\mathrm{Recall}_{macro} = \frac{1}{L}\sum_{i=1}^{L}\frac{TP_i}{TP_i + FN_i}$

$\mathrm{F1}_{macro} = \frac{1}{L}\sum_{i=1}^{L}\frac{2 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}$

For experiments 2 and 4 in Section 5, where only one attack was analyzed, we used the binary classification metrics, in which correctly identified benign traffic is counted as true negatives (TN).
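For reference, the macro-averaged metrics can be computed with scikit-learn as in the following sketch (the labels are made up; note that the benign class, encoded here as 0, is treated like any other class):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical per-packet labels: 0 = benign, 1..4 = the four attack classes.
y_true = [0, 0, 1, 1, 2, 3, 4, 4]
y_pred = [0, 1, 1, 1, 2, 0, 4, 4]

acc = accuracy_score(y_true, y_pred)
# Macro-averaging weights every class equally, regardless of its share of
# packets, matching the rationale that attack severity does not correlate
# with packet counts.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(acc, prec, rec, f1)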

Replication Problems
The problems discussed here are representative of multiple papers, but are described only for the chosen paper of Khan et al. [5]. Even though the chosen paper seemed to be well documented, several inconsistencies were found during the replication process.
The MQTT-IoT-IDS2020 dataset contains three abstraction levels of features, namely packet-based, uni-flow and bi-flow. The packet-based feature set includes all L3 and L4 header values and the MQTT header values. The uni-flow and bi-flow sets include IP addresses, port information and flow data. For bi-flow, the flow values for both directions are provided separately. Khan et al. create models for all three feature sets.
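To illustrate the difference between the abstraction levels, the following sketch aggregates per-packet records into uni-flow records; it is a simplification under assumed column names, not the preprocessing of the dataset authors:

import pandas as pd

pkts = pd.read_csv("packets.csv")  # per-packet records; file name is an example

# Uni-flow: one record per direction of a 5-tuple. Bi-flow would merge both
# directions of a connection into one record with per-direction columns.
key = ["ip_src", "ip_dst", "port_src", "port_dst", "proto"]  # assumed names
uniflows = pkts.groupby(key).agg(
    num_pkts=("ip_len", "size"),
    mean_len=("ip_len", "mean"),
    # assuming a numeric timestamp column in seconds
    duration=("timestamp", lambda t: t.max() - t.min()),
).reset_index()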
P1 Dataset: In the first step, we aggregated all records in the MQTT-IoT-IDS2020 dataset from the packet-based CSV files. The resulting class distribution can be seen in Table 3. However, our numbers differ significantly from the results of Khan et al. After email consultation with the first author about these differences, we were informed that they "might have used techniques like random oversampling, SMOTE (Synthetic Minority Over-sampling Technique), or random undersampling [...]", without going into any specifics. To achieve a class distribution similar to that in the study of Khan et al., the classes MQTT Brute-force and Sparta were cut off: since both classes are password brute-force attacks trying passwords from a dictionary, only about the first 200,000 attack packets of each were sampled.
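A minimal pandas sketch of this truncation, assuming the aggregated records sit in a DataFrame with a class column named label and class values mqtt_bruteforce and sparta (all names are our placeholders):

import pandas as pd

df = pd.read_csv("packet_features.csv")  # aggregated packet-based records

LIMIT = 200_000  # cut-off approximating the class distribution of Khan et al.

def truncate(group):
    # Keep only the first LIMIT packets of the two brute-force classes.
    if group.name in ("mqtt_bruteforce", "sparta"):
        return group.head(LIMIT)
    return group

balanced = df.groupby("label", group_keys=False).apply(truncate)
print(balanced["label"].value_counts())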

P2 Missing Values: The missing-value strategy used by Khan et al. is to use the median of the respective feature, in order to be less susceptible to outliers. But in the packet-based data, non-TCP or non-MQTT packets simply have no data in the TCP or MQTT features, respectively. Using this strategy results in non-TCP packets having TCP options set. Therefore, we decided to train a second model in which we used the protocol's unset values instead of the median; for TCP flags, for example, this is zero.
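The two imputation strategies can be contrasted in a short sketch (column names are placeholders, not those of the dataset):

import pandas as pd

df = pd.read_csv("packet_features.csv")
tcp_cols = ["tcp_flags", "tcp_window", "tcp_options"]  # placeholder names

# Strategy of Khan et al.: median imputation. This gives non-TCP packets
# TCP option values that were never on the wire.
median_filled = df[tcp_cols].fillna(df[tcp_cols].median())

# Our second model: use the protocol's unset value instead (0 for TCP flags),
# so non-TCP packets stay recognizably empty.
zero_filled = df[tcp_cols].fillna(0)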
P3 Feature Selection: According to [5], certain features were removed from the uni-flow and bi-flow datasets, but no complete list is given. Also, the encoding and normalization are not specified for all features in the packet-based dataset. This makes it difficult to reproduce the study. Khan et al. report a feature count of 52, while, adhering to the publication, we counted 48 features after preprocessing. Hence, we were unable to replicate the feature count after preprocessing given in the paper.
P4 Feature Scaling: Furthermore, the feature encoding must not only fit the dataset but also the real-world use case. Using feature scaling based only on the values occurring in the dataset, instead of protocol-specific minimum and maximum values, may worsen the detection rate in a real deployment when packets or flows are encountered that were not present in the training dataset.
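A sketch of min-max scaling with protocol-specific limits instead of dataset minima and maxima (the feature list is illustrative):

# Fixed limits taken from the protocol definitions, not from the dataset.
PROTOCOL_LIMITS = {
    "ttl":      (0, 255),    # 8-bit IP header field
    "src_port": (0, 65535),  # 16-bit TCP/UDP field
    "dst_port": (0, 65535),
    "ip_len":   (0, 65535),  # 16-bit total length field
}

def scale(df):
    for col, (lo, hi) in PROTOCOL_LIMITS.items():
        # Min-max scaling to [0, 1]; unseen values stay in range because
        # the limits come from the protocol, not from the training data.
        df[col] = (df[col] - lo) / (hi - lo)
    return df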
The details and the code of the replication are publicly available [52].

Replication Results
In practice, complete flow data is hard to obtain at runtime. State-of-the-art IDSs like Snort [19] and Suricata [20] work on a per-packet basis and provide partial flow data that includes the information up to the current packet. However, even this partial flow data is not complete. For example, for the MQTT-IoT-IDS2020 dataset, Snort reported more than 8000 TCP stream events, signaling errors in the stream reassembly.
In recent work [8], flow data was aggregated on P4-programmable switches. The authors use a hash table to store the flow data. However, when the table fills up, entries get evicted. These partial flows then get exported, but any further packets of such a flow can no longer be associated with it and are lost.
Since complete flow data is unrealistic in practical deployments, we primarily focused on the models based on packet data. Table 4 shows the results of the replicated model. We replicated the model adhering strictly to the information given in the publication of Khan et al. The replicated model shows slightly worse accuracy and precision compared to the results reported by Khan et al. Since the original publication only mentions that the feature ip_len is scaled using feature scaling in Python, multiple features like ttl, mqtt_messagelength, mqtt_messagetype and the ports are left unscaled. Therefore, in the next step we trained a model in which all features have been scaled using min-max scaling with protocol-specific limits. The results can be seen in the row "with adopted scaling" in Table 4.
The results show a significant uplift compared to the results of Khan et al. and the replicated model. The model, however, still includes the features IP address and timestamp. In the MQTT-IoT-IDS2020 dataset, all actors have fixed addresses, as can be seen in Figure 1. Furthermore, all attacks were recorded in disjoint time intervals. Since in practice attackers will neither have a fixed address nor adhere to fixed time windows, we also removed these features. The results can be seen in the last row of Table 4. The low standard deviations in Table 5 show the small variance between the folds.
The results still do not match the reported numbers of Khan et al., being slightly better. We were thus not able to replicate the results of Khan et al. exactly, but we obtained a trustworthy substitute for further analysis and comparison in practice.

Evaluation
To validate these results in practice, we experiment with known traffic (a new recording of the MQTT-IoT-IDS2020 traffic in a similar environment), similar traffic (slightly modified network scans), and unknown traffic (Sensor Update, Zero-Day Attacks).Finally, we compare the trained model against the signature-based IDS Snort.

First Experiment: Replication of Dataset
First, we analyzed the models on newly recorded data. For this, we rebuilt the architecture of the MQTT-IoT-IDS2020 dataset using the CORE network emulator [54]. The recorded traffic resembles the original dataset as closely as possible, and therefore similar results were expected. The same tools as mentioned by [3] were used. For the brute-force attacks, the rockyou word list was used. The sensors were deployed as Docker containers using the MYNO [55] sensor emulator. The camera streamed lecture recordings using ffmpeg. The Dockerfiles, the resulting pcaps and the configuration can be found in [52].
For further validation, we trained the two best-performing models, "adopted scaling" and "w/o IP & timestamps", on the whole MQTT-IoT-IDS2020 dataset. These are our final models. We validated them against the newly recorded dataset, called MQTT-IoT-IDS2020-UP. The results of the models can be seen in rows two and four of Table 6. They are compared to the results that the best-performing model of the 5-fold cross-validation achieved on its 20% validation set of the MQTT-IoT-IDS2020 dataset (rows one and three). For the "adopted scaling" model, a significant decrease from 99.99% to 70.65% in recall is observed. This is due to the fact that the model misclassified the aggressive scan and part of the UDP scan traffic as MQTT brute-force, as can be seen in Figure 2 in the Appendix. The other model, trained without IP addresses and timestamps, misclassified 80% of the MQTT brute-force traffic as normal traffic. This model also mispredicts parts of the scans, as can be seen in Figure 3 in the Appendix.
The "adopted scaling" model performance deteriorated significantly less than the model "w/o IP & timestamps.This is due to the fact, that the attacker still used the same IP address as in the original training dataset as could be shown in the following experiment: Table 7 shows the results of changed IP addresses in the MQTT-IoT-IDS2020-UP dataset.The first address is the attacker address of the original dataset, the second address is an unused address in the same subnet, and the last address is a completely random, unseen address.The prediction results for the last IP address can be seen in Figure 4 in the Appendix, showing that the network classified almost all packets as benign traffic.This shows that the adopted scaling model has predominantly learned IP addresses, making it completely unfit for practical use.The model trained without IP addresses and timestamps in Table 6 however also shows a significant drop in overall performance and only achieving an accuracy of 54.52%.Since the same attacks and the same network topology were used for the MQTT-IoT-IDS2020-UP dataset, the significant performance drop of both models was unexpected.

Second Experiment: New Scan Variants
In this experiment, the neural network models are tested against slightly modified network scans. For this purpose, two additional port scans were recorded.
The MQTT-IoT-IDS2020 dataset used the tool nmap for a UDP scan (-sU) and an "aggressive" TCP SYN scan. The TCP SYN scan (-sS) scans the 1000 most common ports. The "aggressive" flag (-A) enables OS detection, version detection, script scanning and traceroute. The new scans were also recorded using nmap's TCP SYN scan. The first modification, "Full Port Scan", was to scan not only the 1000 most common ports but all 65535 ports (-p-). The second modification, "Sensor", was to scan a sensor instead of the MQTT broker.
The models are now evaluated on those new scans and compared against their detection performance on only the scan attacks from the MQTT-IoT-IDS2020 (original scan) and MQTT-IoT-IDS2020-UP (UP scan) datasets. The results are shown in Table 8. Rows one and five of Table 8 again show the scan results of the best model from the 5-fold cross-validation on the test portion of the MQTT-IoT-IDS2020 dataset. The other rows show the results of the final models on the scans from the new MQTT-IoT-IDS2020-UP dataset and the new scan recordings "Full Port Scan" and "Sensor".
Astonishingly, the models performed much worse on the scans from the MQTT-IoT-IDS2020-UP dataset than on the other scans, even though the attacks were recorded with the same options as in the original dataset. Note that the model "adopted scaling" misclassified the "Full Port Scan" as different attacks, but not as benign traffic, showcasing again that it has predominantly learned IP addresses. Astonishingly, this model performed admirably on the "Sensor" scan.
The model trained without IP addresses and timestamps again misclassifies about 50% of the MQTT-IoT-IDS2020-UP aggressive scan as Sparta attack, as can be seen in the second row of Figure 3. The same happens for the "Sensor" scan.
The reason why the models have trouble detecting scans correctly might be the class imbalance in the training data and thus the low sample count of the scans.
Table 8: Results of the models "adopted scaling" and "w/o IP & timestamps" for the scans from the MQTT-IoT-IDS2020 (original scan) and MQTT-IoT-IDS2020-UP (UP scan) datasets and the two described variants.

Third Experiment: Sensor Updates
In the previous experiment, the attack traffic was modified; in this experiment, the benign traffic is modified. The MQTT-IoT-IDS2020 dataset contains MQTT traffic from sensors to the broker. Each sensor periodically sends small sensor readings to the broker. For the MUP dataset, a sensor was updated using the MUP protocol [55], which transfers the firmware via MQTT. This traffic differs from the previously seen traffic in direction and size: the update contains a big payload compared to the previously sent sensor values.
The models are now evaluated on the firmware update traffic and compared against their detection performance on only the benign traffic from the MQTT-IoT-IDS2020 and MQTT-IoT-IDS2020-UP datasets. The results in Table 9 show that the models are able to detect the update traffic with high accuracy. Both models even perform better on the "MUP" dataset than on the benign traffic of the MQTT-IoT-IDS2020-UP dataset.

Fourth Experiment: Zero-Day Attack

Intrusion detection with deep learning models on zero-day attacks is a highly researched topic [56,57,58,59]. Multiple papers [57,58,59,2,60,61] claim that an advantage of machine-learning-based intrusion detection systems is the ability to detect zero-day attacks. In the next experiment, we test this claim in the context of multi-class classification.
The models are tested against an attack on which they have not been trained. Since the feed-forward model has no dedicated output for this attack, every output besides the benign label is interpreted as a true positive. The attack used in this experiment is the DoS-New-IPv6 attack from the THC toolkit [62]. This attack prevents a sensor from configuring an IPv6 address by claiming that the address is already in use by another party. It exploits the missing authentication in the neighbor discovery protocol of IPv6.
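Since the model has no output class for the unseen attack, its multi-class predictions must first be collapsed to a binary decision; a sketch of this mapping (the label encoding is our assumption):

import numpy as np

BENIGN = 0  # assumed encoding of the normal class

# Multi-class model output on attack-only traffic (made-up values).
y_pred = np.array([0, 3, 0, 1, 0])

# Any non-benign prediction counts as a true-positive detection,
# regardless of which attack class the model assigned.
detected = y_pred != BENIGN
recall = detected.mean()
print(f"zero-day recall: {recall:.2%}")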
Table 10 shows that both models classify all attack packets as benign traffic. This is, however, not surprising, since the models were not trained on this kind of attack or anything similar to it. During our experiments we retrained our models multiple times. While we observed little to no variance (std-dev: 0.0046 for recall) in the performance on the MQTT-IoT-IDS2020 and MQTT-IoT-IDS2020-UP datasets, we noticed a very high variance on the DoS-New-IPv6 attack: out of 5 runs, 3 models correctly classified DoS-New-IPv6 as an attack, while 2 models classified the attack as benign. This shows that unseen traffic is not automatically and consistently classified as an attack.

Comparison against Snort as Baseline
To evaluate the benefit of the trained model, it needs to be compared to other approaches. A realistic baseline and state of the art is a signature-based IDS which is configured to detect the same attacks as the ML-based IDS. In the sections above we showed that the model "adopted scaling" primarily learned IP addresses; it is therefore excluded from further evaluation.

Snort Configuration
One of the most commonly used signature-based intrusion detection systems is Snort [19]. For our evaluation, the latest version of Snort (snort3-3.1.66) was used. Before running Snort, variables like HOME_NET and EXTERNAL_NET need to be configured, which are used by Snort rules as placeholders. For the detection of attacks like a port scan, an IDS needs to keep some state. In the case of Snort, this is done by so-called inspectors. Therefore, thresholds for inspectors like the port_scan inspector need to be configured and enabled. If the administrator is interested in all inspector events, this can be enabled globally or on a per-event basis. Snort can also be easily extended via plugins with other inspectors and rule options. The neighbor discovery protocol inspector indp, for example, detects attacks on the IPv6 neighbor discovery protocol [63].
There are multiple rule collections available for Snort. Besides the openly available community ruleset, there are also commercial rulesets like the Talos ruleset, which becomes available for free after 30 days. To be alerted about too many SSH authentication attempts, a custom rule needs to be written, which requires a threshold to be configured per environment. An example can be seen in Listing 1, which is already given as an example in the Snort 3 Rule Writing Guide (https://docs.snort.org/rules/options/post/detection_filter). Since MQTT brokers like Mosquitto reply to a failed authentication attempt with a Connect ACK packet (CONNACK, first header byte 0x20) carrying a reason code of 5, a rule as shown in Listing 2 can easily be created. The thresholds were chosen arbitrarily and not tuned to the dataset. The Snort configuration for the evaluation includes the previously mentioned rules and the default medium port scan profile. The resulting Snort installation and the complete configuration can be found in [52].

Results for MQTT-IoT-IDS2020-UP
The results of our Snort configuration on the MQTT-IoT-IDS2020-UP dataset can be seen in Table 11. Snort was not run on the original dataset because the original MQTT-IoT-IDS2020 dataset does not provide pcap files that are prefiltered by attack classes; all pcap files of the original dataset contain benign traffic. Since the MQTT-IoT-IDS2020-UP dataset was created with the same tools, the attack signatures did not change. Therefore, the Snort results for the MQTT-IoT-IDS2020 dataset should be similar to those for MQTT-IoT-IDS2020-UP. For all 4 attacks Snort achieves a precision of 100%, showing that there were no false positives. This is, however, not surprising, since only the rules relevant to the discussed attacks were enabled and not the community ruleset. The recall of the attacks, however, varies and is below 100%. This is due to the fact that a brute-force attack is only detected after the count threshold specified in the detection filter is reached. The same principle applies to the port scan; however, for the UDP scan the recall is significantly lower, showing that Snort did not classify all UDP scan packets as attack even though the alert_all option was enabled. Because of the port scan, the victim replied with ICMP unreachable packets, which in turn were classified as a possible ICMP scan. This is correct and a result of the scan, but can be misleading to novice users.
For all classes included in the MQTT-IoT-IDS2020-UP dataset Snort outperforms the neural network model.

Results for traffic variations
The results of Snort for the additional experiments from Sections 5.2, 5.3 and 5.4 can be seen in Table 12. Snort outperforms the model in all experiments except for the new Sparta variations. These variations are somewhat artificial, since the configuration of the attacked SSH server needed to be changed: in "Sparta fast", unlimited login tries are allowed, while in "Sparta slow" only 10 login tries are allowed from the same IP address.

Discussion
Even though the model achieves an accuracy, precision and recall of over 90% on the validation split, our experiments show that the model performance deteriorates significantly when tested on similar but unseen data (see Section 5.2). This highlights two current problems of machine-learning-based intrusion detection.

First, the used datasets are too small and homogeneous. There is little to no variance in the benign traffic and the attacks. Therefore, once there is some variation in new data, the model is not able to generalize. Furthermore, even if more variance is added to the dataset, the Dimpled Manifold Model [64] indicates that the model will most likely create a "dimple" around the new variation in the decision boundary instead of generalizing.

Second, the differences between benign traffic and attacks are so significant that even arbitrarily chosen (fixed) thresholds can be used to differentiate between attacks and benign traffic. For example, the average MQTT authentication rate of benign traffic is 11.94 per second with a standard deviation of 0.47, while for the MQTT brute-force attack the average authentication rate is 660.07 per second with a standard deviation of 298.22, as illustrated by the sketch below. Machine learning models are not needed to detect such easily distinguishable events. The area of application should be scenarios where static thresholds and signature-based IDSs struggle.

This underlines again the importance of realistic datasets and of a comparison against a baseline which was configured and tuned to the environment. For example, it should be no surprise that a machine learning model can outperform a signature-based IDS on attacks that are distinguishable in flow data if the signature-based IDS only includes the Emerging Threats ruleset but no configured flow limits.
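A toy sketch of such a fixed-threshold detector, using the authentication rates reported above (the threshold of 50 is arbitrary, as is the whole detector):

# Measured on the dataset: benign MQTT authentications/s around 11.94 (std 0.47),
# brute-force around 660.07 (std 298.22). Any fixed threshold well above the
# benign mean separates the two; no learned model is required.
AUTH_RATE_THRESHOLD = 50  # arbitrary, not tuned; far above the benign mean

def is_bruteforce(auth_per_second: float) -> bool:
    return auth_per_second > AUTH_RATE_THRESHOLD

print(is_bruteforce(11.94))   # False -> benign
print(is_bruteforce(660.07))  # True  -> attack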
Since the creation of realistic datasets is another hard problem, it is important to not only trust the results of 5-fold cross-validation but to validate the final model on similar data and slight variations of it.

Discussion of Performance Metrics
As in the original paper [5], we provided the performance measures accuracy, precision, recall and F1 score based on each packet or flow analyzed. These per-packet or per-flow metrics make sense in other domains of machine learning; for example, in image recognition it is interesting how many of the provided images were classified correctly. In intrusion detection, however, this is not the right metric. Cybersecurity analysts are not interested in diagnostics on the packet/flow level but on the attack level. A scan or brute-force attack consists of thousands of packets and flows. The performance of an intrusion detection solution should therefore be evaluated on whether it detects each attack instance, not on whether it classifies each packet correctly. This difference seems insignificant at first, but it has a significant impact on the precision metric. Imagine a dataset containing 10^6 packets which are part of 3 distinct brute-force attacks. If a model detects 999,000 packets correctly with only 1000 false positives, it has an impressive precision of 99.9%. However, since the 10^6 packets are only part of 3 distinct brute-force attacks, this reduces the number of true positives to 3. The resulting precision is 3/(3+1000) ≈ 0.3%. The 1000 false positives are unchanged by this conversion, since each false positive still produces an alert. Introducing a simple threshold, such that only a certain number of true positive detections triggers an alert, is no fix, because the number of flows in an attack varies significantly with the attack type.
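The arithmetic of this example in code form:

# Packet-level view: 999,000 correctly flagged attack packets, 1,000 false alarms.
tp_packets, fp_alerts = 999_000, 1_000
packet_precision = tp_packets / (tp_packets + fp_alerts)   # 0.999

# Attack-level view: those packets belong to only 3 brute-force attacks,
# while every false positive still raises its own alert.
tp_attacks = 3
attack_precision = tp_attacks / (tp_attacks + fp_alerts)   # about 0.003
print(f"{packet_precision:.1%} vs {attack_precision:.2%}")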
All analyzed models detected at least one packet of each attack in the datasets, not counting the zero-day experiment. However, the model "w/o IP & timestamps" alone created 930,723 false-positive alerts, making it unusable in real-world scenarios.

Usability
The usability of intrusion detection systems is an important factor for practical use. Usability is multifaceted and can be divided into multiple subcategories. Here we discuss two factors: the easy integration into existing environments, and the quality and quantity of alerts. Most signature-based IDSs provide multiple input and output formats. Most publications on machine-learning-based IDSs analyze already preprocessed data. For integration into live traffic, the processing, including the flow analysis of the network traffic, needs to be done in real time. As mentioned in Section 4.4, flow analysis in real time may lead to a significant decrease in flow data quality compared to the preprocessed data. Since most SIEM systems accept a variety of input formats, the chosen model output format is generally not a problem.
The quality of alerts is another important factor. For signature-based intrusion detection systems, the quality of alerts is directly related to the quality of the rule set used. One common approach for IDS deployment is to start with a huge rule set and then remove rules that lead to false positives.
For machine-learning-based IDSs, the quality of alerts is directly related to the quality of the dataset. Creating fitting datasets is a hard problem: the effort to label real network traffic is usually too high, so attacks are most commonly reproduced in lab environments. The transferability of those models to real-world data is not guaranteed and may lead to a significant decrease in performance, as we have shown in this paper. Furthermore, in a signature-based IDS the reason for an alert is directly visible in the rule set, while for deep learning models explainability is still an open problem.

Analysis with flow data
As mentioned in Section 4.4, we focused our experiments on packet data. To ensure that the transferability problem to real-world data is also present in models trained on flow data, we also replicated the uni-flow and bi-flow models. The results of the 5-fold cross-validation on the original MQTT-IoT-IDS2020 dataset and the results of the final models on the MQTT-IoT-IDS2020-UP dataset can be seen in Table 13.
While our replicated models surpassed the results of the original paper, they failed to detect the same attacks in the MQTT-IoT-IDS2020-UP dataset. The flow models correctly identified the scan attacks, since these are easy to detect in flow data, but had problems with the brute-force attacks, as can be seen in Figure 5 in the Appendix.

Conclusion
We presented experiments with a deep neural network model for intrusion detection. While the model achieves high accuracy, precision and recall under 5-fold cross-validation, the performance on similar datasets is significantly worse. The most likely reason is the low variance in the synthetically generated training datasets. Furthermore, the long-asserted claim that machine-learning-based intrusion detection systems are superior to signature-based IDSs could not be confirmed. It is worth mentioning, though, that even where the models' recall was not high, they mostly mispredicted an attack as a different attack and not as benign traffic; in the case of binary classification, those models would have performed significantly better. We conclude that a suitable area of application for ML-based intrusion detection is most probably anomaly detection.
The importance of a good baseline is also highlighted by our experiments. New machine learning models should always be compared to classic signature-based IDSs to show that the problem is not trivially solvable with a static threshold or already solved by their rule sets. For this comparison, it is important to configure the baseline with equivalent effort in order to realistically show where a machine learning model can outperform state-of-the-art solutions.
We also argued that performance metrics should be based on detected attacks and not on correctly classified flows/packets, since practitioners are not interested in whether all packets of an attack were correctly attributed to it, but in whether the attack as a whole was detected. This is particularly important in real-world detection, since every false positive results in an alert.
Last but not least, we want to thank all researchers who publish all research artifacts, including datasets and source code, and stress how important the FAIR principles are for easy replication and for trust in publications.

Table 2: Classification of reviewed publications.

Listing 1: Snort rule to detect too many SSH authentication attempts.

alert tcp any any -> $HOME_NET 22 ( flow:established,to_server; content:"SSH",nocase,offset 0,depth 4; detection_filter:track by_src, count 30, seconds 60; msg:"ssh brute force"; sid:1000002; )

This rule creates an alert when more than 30 SSH connections are established from the same source IP to the local network within 60 seconds. It does not count the actual failed SSH authentication attempts, since this information is encrypted. Since most SSH configurations allow only 3 authentication attempts per connection, this is a good substitute.

Listing 2: Snort rule to detect too many MQTT authentication attempts.

alert tcp any 1883 -> any any ( flow:established,to_client; content:"|20|",offset 0,depth 1; content:"|05|",offset 3,depth 1; detection_filter:track by_src, count 30, seconds 60; msg:"MQTT Bruteforce"; sid:1000001; )

Figure 4: Results of the model "with adopted scaling" for attacks originating from a completely unseen attacker address.

Figure 5: Confusion matrix of the uni-flow model validated on the MQTT-IoT-IDS2020-UP dataset.

Table 4: Mean of the 5-fold cross-validation results for multi-class classification.

Table 5: Standard deviation of the 5-fold cross-validation results for multi-class classification.

Table 6: Comparison of the results for multi-class classification on MQTT-IoT-IDS2020 and the newly recorded MQTT-IoT-IDS2020-UP dataset.

Table 7: Evaluation of the model "adopted scaling" on the MQTT-IoT-IDS2020-UP dataset with different attacker IP addresses.

Table 9: Results of the models "adopted scaling" and "w/o IP & timestamps" for benign traffic in both datasets and the MQTT firmware update (MUP).

Table 10: Results of the models "adopted scaling" and "w/o IP & timestamps" for a previously unseen DoS attack.

Table 11: Comparison of results for Snort and the model trained without timestamps and IP addresses on the MQTT-IoT-IDS2020-UP dataset.

Table 12: Comparison of results for Snort and the model trained without timestamps and IP addresses on the datasets with traffic variations.

Table 13: Results for the replicated models trained on uni-flow and bi-flow data from the MQTT-IoT-IDS2020 dataset, compared with the results reported by Khan et al. The replicated models were also evaluated on the MQTT-IoT-IDS2020-UP dataset.