The Proposition and Evaluation of the RoEduNet-SIMARGL2021 Network Intrusion Detection Dataset

Cybersecurity is an arms race in which defenders and adversaries continually attempt to outsmart one another, devising new attacks, new ways to defend against those attacks, and then new ways to circumvent those defences. This situation creates a constant need for novel, realistic cybersecurity datasets. This paper examines the use of machine-learning-based intrusion detection methods on network traffic coming from a real-life architecture. The main contribution of this work is a dataset coming from a real-world, academic network. Real-life traffic was collected and, after performing a series of attacks, a dataset was assembled. The dataset contains 44 network features and an unbalanced distribution of classes. In this work, the suitability of the dataset for building machine-learning-based models was experimentally evaluated. To investigate the stability of the obtained models, cross-validation was performed, and an array of detection metrics was reported. The gathered dataset is part of an effort to bring security against novel cyberthreats and was completed in the SIMARGL project.


Introduction
The number of devices communicating with one another over the Internet is expected to reach 50 billion by the end of the decade [1]. This expansion of the Internet makes network security and cyberthreats a global problem.
Increasingly frequent data leaks erode users' confidence that their data is being kept secure. Furthermore, attacks on critical infrastructure, such as water treatment plants or power stations, can have dire consequences [2,3]. This is why the development of appropriate mechanisms to defend against hackers and malware is crucial. Intrusion Detection Systems (IDS) are among the mechanisms at the forefront of attack detection. The constant evolution of malware drives further development of IDS [4]. One of the most important aspects of state-of-the-art IDS is the use of machine learning (ML) technologies. Apart from the influence of hyperparameter setups [5], these methods are only as good as the data used in the training phase. In cybersecurity, data acquisition is particularly hard. The traffic needs to represent the behavior of realistic and current network architectures and feature contemporary attacks. This, in conjunction with numerous privacy and technological issues, creates a vacuum and a constant need for new cybersecurity datasets. This paper is a preliminary description of the creation of the RoEduNet2021 dataset, along with the initial tests performed with the use of ML benchmark algorithms.
The proposed dataset helps build, train, and evaluate ML algorithms used for intrusion detection on data that is relevant to contemporary network environments. The characteristics of network traffic change over time; not only are there novel attacks, but the nature of the benign traffic also fluctuates as new services become popular. Thus, even the datasets that gain popularity in the research community become less and less relevant over time.
The contribution of this work is the collection of relevant, real-life traffic that will be publicly available, and the testing of its usability for supervised machine learning methods. In addition, in contrast to other datasets available to researchers, the dataset contains a set of features that allows for the formulation of a machine learning model which can be used for detection in live traffic.
The work described in this paper is part of the SIMARGL project, which is co-financed by the European Union under the Horizon 2020 program. The main goal of this project is to fight malware and other novel cybersecurity challenges. This is achieved by finding new solutions that can effectively deal with the detection and prevention of, among others, network anomalies, stegomalware, ransomware, and mobile malware. This research was conducted with the participation of the Romanian Education Network (RoEduNet), a national education and research network in Romania. RoEduNet collected and provided real-life data from its own networks, along with records of attacks. The dataset is geared towards cybersecurity researchers who are interested in examining their methods on contemporary, realistic traffic. The dataset also contains features which can be calculated from live traffic, which makes it highly usable for the construction of deployable network intrusion detection systems.
The paper is structured as follows: Section 2 enumerates the related works in intrusion detection and recent datasets in the domain. Section 3 gives an overview of the proposed dataset, explaining the sources of traffic, the attacks used, and the extracted network features. Section 4 contains the description of the proposed methodology, Section 5 encompasses the experiments and their results, and the conclusions are given in Section 6.

Related Work
Developments in network traffic analysis, in techniques for detecting anomalies and threats in computer networks, and in IDS themselves have resulted in the publication of many research papers. The ML algorithms vary in terms of their sophistication and the types of features used [6,7]. This section examines state-of-the-art publications from the fields related to network traffic threat detection.

Intrusion Detection Systems
Attacks on networks are becoming more sophisticated and pose a serious threat to various types of infrastructure. Unavailability of services due to various types of attacks drastically reduces confidence in the security of the stored data.
Intrusion Detection Systems (IDS) are used to defend against unwanted activities. The authors of Reference [8] present a cross-section of modern IDS and compare current datasets used to evaluate them. The paper divides IDS into two categories: signature-based (SIDS; an overview and taxonomy are given in Reference [9]) and anomaly-based (AIDS; characteristics and examples are presented in Reference [10]).
The literature features a myriad of examples of various types of IDS. One example comes from Reference [11], where the authors propose a solution to the problem of attack detection for minority classes. They point out the problem of long learning and detection times of deep neural networks. As a result, they propose a solution that is based on adaptive synthetic (ADASYN) oversampling [12] and LightGBM (Light Gradient Boosted Machines) [13]. The authors first normalize and encode the original data and then increase the number of samples of minority classes using the ADASYN data balancing technique. Finally, a LightGBM model is trained on the data prepared in this way. The NSL-KDD [14], UNSW-NB15 [15], and CICIDS2017 [16] datasets are used to verify the performance of the proposed solution. This approach results in a precision of 92.57%, 89.56%, and 99.91% on the three test sets, respectively.
The authors of Reference [17] touch on the problem of detecting zero-day attacks. The authors point out that even systems with frequent data updates are unable to detect zero-day attacks due to the lack of an adequate signature database. Zero-day attacks in their early stages are able to bypass signature-based network intrusion detection systems. The authors of this paper propose to solve this problem by using RNNIDS, which uses recurrent neural networks (RNNs) to find complex patterns in attacks and create similar samples. Using this approach results in an improved NIDS detection rate.
The IDS developed in Reference [18] is based on a multi-layer perceptron. The research and testing were carried out on the UNSW-NB15 dataset, from which 30 features were selected using the gain factor method. The binarization discretization technique was applied. The model achieved 76.96% accuracy.
Another example comes from Reference [19]. The paper presents the design of an IDS that detects unwanted network traffic using feature selection and ensemble learning. In the first step, the authors reduce the dimensionality of the data using the CFS-BA algorithm, which selects the best feature subset using feature correlation. The data is then subjected to ensemble learning with two algorithms: Random Forest (RF) and Forest by Penalizing Attributes (Forest PA). The final labeling decision is made using the voting technique. Training and testing are carried out on the following datasets: NSL-KDD [14], AWID [20], and CIC-IDS2017 [16]. The experimental results for these datasets oscillate around 99% in terms of accuracy.

Overview of Existing Datasets
Detecting anomalies in network traffic is a challenging endeavor, as more and more threats enter contemporary networks every day. Many contemporary tools are based on machine learning. The effectiveness of ML-based methods depends directly on the quality of the data on which the model is trained, because data quality conditions the subsequent correct detection of attacks. In this section, the main features of selected benchmark network datasets (CTU-13 [21], CICIDS 2018 [16], IoT-23, LITNET-2020 [22]) are presented.

CTU-13
In 2011, a team of researchers from the Czech Technical University in Prague created the CTU-13 dataset. The authors decided to establish a benchmark consisting of anomalies, along with reshuffled background traffic. The entire dataset is composed of thirteen subsets representing real network traffic. The set includes over twenty million samples. The distribution of labels in the dataset for each scenario is presented in Table 1. The initial collection was captured in a PCAP file [23]; then, during processing, the unidirectional Netflow traffic was separated and converted to bidirectional Netflow. With this transformation, more features were obtained, and the client-server traffic was distinguished. The list of features with their descriptions can be found in Table 2.

CICIDS 2018
The creators provided both normal and infected traffic within each subset; the infected traffic covers the following attack types: Brute-force, Heartbleed, Botnet, DoS, DDoS, web attacks, and network infiltration from within. All records were produced using the B-Profile tool, which profiles abstract human behavioral interactions and produces naturalistic, smooth background traffic. The features were extracted using the CICFlowMeter tool. The tool, created by the Canadian Institute for Cybersecurity, generates a bidirectional flow of network traffic by determining the direction from source to target and from target to source using the first packet. With this approach, the developers extracted as many as 83 features, such as Duration, Number of packets, Number of bytes, and Packet length.

IoT-23
The IoT-23 dataset was entirely developed by the Stratosphere Laboratory in the Czech Republic and was published in 2020. The dataset contains infected and normal traffic and features twenty malicious captures and three benign captures. The researchers collected the traffic from IoT devices and made it available for developing machine learning algorithms that will effectively defend against these threats. The traffic has been divided into twenty-three scenarios, each containing a different type of malware or attack. The traffic distribution is shown in Table 3. The schema consists of 20 features and a label. The most frequent anomalies in this set are PartOfAHorizontalPortScan (213,852,924 samples), Okiru [24] (47,381,241 samples), and DDoS (19,538,713 samples), while the least frequent anomalies are C&C-Mirai [25] (2 samples), PartOfAHorizontalPortScan-Attack (5 samples), and C&C-HeartBeat-FileDownload (11 samples). The list with the description of all features is shown in Table 4.

LITNET-2020
LITNET-2020, which leveraged an academic network, was published in 2020. The traffic was collected in real-life scenarios. The collection period lasted for ten months, during which 12 different types of anomalies were extracted; the data structure itself contains as many as 85 different features. Table 5 shows the specific number of samples by attack class. In total, the dataset contains 45,492,310 flows.

Proposition of the RoEduNet2021 Dataset
In this section, a dataset derived from a real data flow in an academic network is proposed. The network data schema is in the Netflow v9 format, and it contains 44 unique features and a label describing each frame. The entire flow contains two different types of Denial-of-Service (DoS) attacks [26] and a PortScan attack [27], in addition to normal traffic. The following subsections provide descriptions of the network and the overall infrastructure. Furthermore, all the collected features are presented, and the dataset is tested using machine learning methods.

Overview of RoEduNet's Client Infrastructure
For generating and capturing normal and malicious network traffic, the topology presented in Figure 1 was used:

• Target network: contains the systems used in the research laboratories and for educational purposes by one of RoEduNet's clients. To this network, we added vulnerable systems that are attacked using different attack scenarios and vectors. This network contains hosts that run Ubuntu, CentOS, or Windows as their operating system. The traffic that is generated by the target network (normal and malicious traffic) is represented in the dataset.
• Attacker network: contains systems that are used to generate attacks against the vulnerable systems and applications. This network contains virtual machines that run Kali Linux [28] as the operating system. To train the machine learning algorithms, the malicious traffic must be labeled; thus, the source of the attacks must be known. The attacker network is controlled to create and monitor the traffic, and to label the attack vectors.
• Clients (legitimate traffic): represents the traffic flowing through the target network and labeled as normal. Besides malicious traffic, the vulnerable systems receive legitimate traffic, as well.
• Internet: the network architecture is connected to the Internet, since the research and education systems rely heavily on applications that require Internet connectivity.
• Router: all the previously mentioned network architecture components converge on the same router. All the traffic that flows through the target network is mirrored, using nProbe and ntopng, and captured.
In the topology presented in Figure 1, the vulnerable systems are the ones that are targeted when running the attacks. Mainly, they contain eLearning platforms (Moodle) that are used in the research and education field. The eLearning platforms were chosen because they are an important part of university and school activities (especially during online classes) and may represent a target for entities that want to harm the educational process. The goal is to protect those assets against attacks.
The network used for collecting the data consists of multiple physical and logical elements. The physical elements include the core router, which is comprised of a pair (VSS, Virtual Switching System) of Cisco WS-4506-E with Cisco Catalyst 4500E Supervisor Engine 7-E, to which the nProbe node has been connected. The core router is also connected to all the switches that bridge the servers from which the data of the hosted services is collected. In addition, multiple links to the university campus buildings are connected to this core router (from which data generated by RoEduNet's end-users is collected). The data is collected using Catalyst Switched Port Analyzer (SPAN) from the VLAN interfaces, which are the gateways for the services mentioned above.
nProbe runs on a CentOS 7 machine that processes the data sent by the SPAN. The services presented in Figure 1 run on top of an OpenStack cloud deployment. OpenStack uses logical links and switches to connect the virtual machines using the Neutron service. The logical links are overlay networks on top of the physical network, implemented using Open vSwitch.

Traffic and Attack Orchestration
To manage vulnerable servers and to generate legitimate or malicious traffic, virtual machines orchestrated using OpenStack [29] were used. OpenStack is an open-source set of tools that can be used to manage a cloud environment.
For the attacker network, a template image based on Kali Linux was created that contains the tools necessary for the attacks. In addition, the scripts that can be used to start the attacks were configured. Based on the Kali template, multiple virtual machines were created from which the attacks can be performed.
Even though the attacks are run in a research and education network, and the tests are run during working and class hours, more legitimate traffic was desired. Thus, virtual machines (using OpenStack) were added to the Clients (legitimate traffic) network; these use the services alongside students and researchers. Based on an Ubuntu 20.04 server template, multiple workers, along with a Kubernetes orchestrator [30], were created and became clients of the vulnerable servers, in addition to students and researchers. Since the platforms used are Moodle instances (eLearning platform), JMeter scripts were run to generate legitimate traffic by simulating a user's activity: logging in, checking courses and assignments, and logging out. The traffic generated by JMeter is not intensive and does not affect the process of generating malicious traffic.

Attack Scenarios
To replicate real-life use cases, the following attacks were selected to be run in the pilot network: network scanning (reconnaissance) and denial of service.
Usually, network scanning and reconnaissance (commonly implemented using network port scanning) is the first step taken by an attacker to detect the network-connected devices and their configuration details: operating system, open ports, and the versions of the running applications and their vulnerabilities. Thus, one of the attacks run against the network is network scanning. For running this attack, tools such as nmap [31] or Masscan [32] were used. To generate network scanning traffic, scanning applications are run from the attacker network against the IP networks contained in the target network.
A SYN Scan attack [33] is one of the fastest methods of detecting a port's state. It relies on the TCP three-way handshake, where the attacker sends a SYN packet to the desired port. Based on the response (or the lack of it), the attacker can determine whether the port is open, closed, or protected by active firewall filters.
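The decision logic behind a SYN scan can be sketched as follows. This is only an illustration of how a reply is commonly interpreted, not the behavior of nmap or Masscan; the flag-string representation ("SA" for SYN+ACK, "R" for RST) and the function name are assumptions made for this example.

```python
from enum import Enum
from typing import Optional


class PortState(Enum):
    OPEN = "open"
    CLOSED = "closed"
    FILTERED = "filtered"


def classify_syn_response(tcp_flags: Optional[str]) -> PortState:
    """Map the reply to a SYN probe onto a port state.

    tcp_flags is a hypothetical string such as "SA" (SYN+ACK) or "R" (RST);
    None means no reply arrived before the timeout.
    """
    if tcp_flags is None:
        return PortState.FILTERED      # silence: probe likely dropped by a filter
    flags = set(tcp_flags.upper())
    if {"S", "A"} <= flags:
        return PortState.OPEN          # SYN/ACK: a service is listening
    if "R" in flags:
        return PortState.CLOSED        # RST: host reachable, nothing listening
    return PortState.FILTERED          # anything else is treated as filtered here
```

Because the scanner never completes the handshake, each probe leaves a very short flow, which is one reason such traffic tends to be distinguishable in flow-level features.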
The Denial-of-Service attack category usually renders the system inaccessible or increases its response time to requests. There are multiple attack methods that can lead to denial of service. The following two types were chosen:
• Denial-of-Service using SlowLoris [34]: this type of DoS attack opens many HTTP connections to the target and sends incomplete, but legitimate, HTTP requests to the target in a very slow manner, keeping the connections alive for a long period of time. Since the HTTP messages are correct and not delivered very fast, they do not flood the target (as most Denial-of-Service attacks do). The traffic can be considered legitimate and the attacker a slow client. Due to the large number of connections that are opened and the slow pace of communication, this type of attack can cause the target to respond very slowly to normal clients, or even to become unresponsive.
• Denial-of-Service using R-U-Dead-Yet (RUDY) [35]: this is also a DoS attack that works in a slow manner, occupying the target's processing capacity by opening and keeping alive many connections and sending data slowly. The main difference between the two is that RUDY sends many small HTTP POST messages (usually, 1 byte of data), while SlowLoris sends only HTTP header messages.
For generating Denial-of-Service network traffic, the attacks from the attacker network are started and target the vulnerable servers in the target network.
When the attacks are conducted, the following things need to be taken into consideration to provide a reliable dataset that can be used to train machine learning algorithms:

• The attackers' IP addresses;
• The targets' IP addresses;
• The attack's start and end date.
This information helps properly identify the network packets that should be considered malicious. As shown in Figure 1, all the data that flows through the target network is collected using nProbe. After the raw logged data is collected and stored (as JSON), a Python script is used to convert it into the format required by ML algorithms. The script does the following:

• Adds a new key named "LABEL" to each record. This field specifies whether the traffic is considered to be normal ("Normal flow") or malicious ("SYN Scan", "Denial of Service SlowLoris", or "Denial of Service R-U-Dead-Yet").
• Modifies the key fields from the JSON to match the names described in subsection "Features and Labels" (nProbe saves an index for each field, and we replace the index with its name, based on the NetFlow v9/IPFIX format).
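The two steps above could look roughly like the sketch below. It is not the script used by the authors: the attack log, the IP addresses, the choice of fields, and the assumption that the flow start time is exported as epoch seconds are all hypothetical and serve only to illustrate labeling by attacker address, target address, and attack time window.

```python
import json
from datetime import datetime, timezone
from ipaddress import ip_address

# Illustrative index-to-name map; a real one would cover the whole NetFlow v9/IPFIX template.
FIELD_NAMES = {"8": "IPV4_SRC_ADDR", "12": "IPV4_DST_ADDR", "22": "FIRST_SWITCHED"}

# Hypothetical attack log: who attacked whom, with which attack, and when (UTC).
ATTACKS = [
    {
        "label": "SYN Scan",
        "attacker": ip_address("10.0.0.5"),
        "target": ip_address("10.0.1.20"),
        "start": datetime(2021, 3, 1, 10, 0, tzinfo=timezone.utc),
        "end": datetime(2021, 3, 1, 10, 30, tzinfo=timezone.utc),
    },
]


def rename_fields(record: dict) -> dict:
    """Replace numeric field indices with their names; unknown keys are kept."""
    return {FIELD_NAMES.get(key, key): value for key, value in record.items()}


def label_record(record: dict) -> dict:
    """Add a LABEL key based on source, destination, and flow start time.

    Simplification: only the attacker-to-target direction is matched, and
    FIRST_SWITCHED is assumed to hold epoch seconds.
    """
    src = ip_address(record["IPV4_SRC_ADDR"])
    dst = ip_address(record["IPV4_DST_ADDR"])
    seen = datetime.fromtimestamp(record["FIRST_SWITCHED"], tz=timezone.utc)
    label = "Normal flow"
    for attack in ATTACKS:
        if (src == attack["attacker"] and dst == attack["target"]
                and attack["start"] <= seen <= attack["end"]):
            label = attack["label"]
            break
    return {**record, "LABEL": label}


def convert(json_lines):
    """Convert an iterable of JSON strings into renamed, labeled records."""
    return [label_record(rename_fields(json.loads(line))) for line in json_lines]
```

Keeping the attack log separate from the labeling function means that a new attack campaign only requires a new log entry, not a change to the conversion code.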

Features and Labels
The set of features that was collected from the network infrastructure by the collector and stored in JSON files is based on the Netflow data schema. Netflow is a network protocol developed by Cisco for collecting and monitoring network flows. During the data collection process, 44 features were extracted that may be needed to correctly analyze network flows and detect anomalies. All the collected features are summarized in Table 6.
In addition, each flow carries a label that specifies its exact type, classifying it as anomalous or not. There are two DoS attacks (Slowloris and R-U-Dead-Yet) and one PortScan attack type (SYN Scan) in the dataset. The distribution of these types is as follows: 6,570,058 flows represent the non-infected base traffic, 2,496,814 flows contain the SYN Scan attack, 2,276,947 flows contain the Denial of Service R-U-Dead-Yet attack, and 864,054 flows contain the Denial of Service Slowloris attack. In summary, the collection contains 6,570,058 flows of normal traffic and 5,637,815 flows that are labeled as anomalies.
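As a quick arithmetic check of the class distribution, the short sketch below sums the per-class counts given in the text and prints each class's share of the whole dataset. Only the counts come from the paper; the variable names are illustrative.

```python
# Flow counts per class as reported in the text.
counts = {
    "Normal flow": 6_570_058,
    "SYN Scan": 2_496_814,
    "Denial of Service R-U-Dead-Yet": 2_276_947,
    "Denial of Service SlowLoris": 864_054,
}

total = sum(counts.values())
malicious = total - counts["Normal flow"]

# Print the absolute count and the percentage share of every class.
for label, n in counts.items():
    print(f"{label:32s} {n:>10,d}  {100 * n / total:5.1f}%")
print(f"malicious flows: {malicious:,d} of {total:,d}")
```

The three attack classes add up to the 5,637,815 anomalous flows stated in the summary, and the smallest class (SlowLoris) is several times smaller than the normal class, which motivates the balancing step described later.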

Proposed Methodology
In this section, the architecture of the created system is presented. This section also describes the data preparation process, detailing all the steps needed to obtain the final version of the data schema that is later used to prepare and train the model.

Architecture Solution
The process of network intrusion detection occurs in the network environment and can be described by three steps: Collecting Traffic, Delivering Traffic to the Stream, and Verification. Figure 2 shows a simplified diagram of the relationship between the key modules of the system. To properly run real-time stream anomaly detection on the delivered network traffic, it is necessary to train the model in advance. This process is done offline, and an initial collection of labeled data is required. Once the model is created and stored, one can move to the next step, which is to perform the live detection. The entire process starts with collecting data and delivering it to the Kafka [36] stream. The detector is set up to work with network data that is delivered in the Netflow version 9 format. The detector is developed to work in a real-life situation of real-time network intrusion detection, where the traffic from an environment is collected with a probe, such as ntopng [37], which provides the ability to collect and transport network traffic to the stream in any form. The use of the Apache Kafka software in the detection environment is dictated by a number of necessities of real-time network intrusion detection. These are, among others, high-throughput and low-latency message queuing services. Kafka uses the Publish and Subscribe message handling model and stores partitioned data streams securely in a distributed, replicated cluster. Kafka scales linearly as throughput increases.
The traffic delivered to the Kafka stream is received in real time by the detection engine. All features of a single frame are prepared for verification by a suite of machine learning algorithms. After passing through the verification system, the frame is assigned a label. Clean traffic which does not bear any signatures of an attack is labeled as "Normal Flow". Infected traffic receives a specific label corresponding to the attack type. The intricacies of stream-based network intrusion detection have been presented in Reference [38]. The final step is to prepare the tagged frame for sending to the Elasticsearch database.
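The verification step can be sketched as a generator that consumes serialized flow records and attaches a label to each. This is a simplified illustration rather than the SIMARGL detection engine: the message source is any iterable (in a deployment it would be a Kafka consumer), the feature subset is an assumption, and `toy_model` is a hypothetical stand-in for the trained classifier.

```python
import json
from typing import Callable, Iterable, Iterator, List

# Assumed subset of NetFlow v9 field names used as model input.
FEATURES = ["PROTOCOL", "IN_BYTES", "IN_PKTS"]


def detect(messages: Iterable[bytes],
           predict: Callable[[List[float]], str]) -> Iterator[dict]:
    """Label each incoming flow record and yield it for storage."""
    for raw in messages:
        flow = json.loads(raw)
        vector = [float(flow.get(name, 0)) for name in FEATURES]
        flow["LABEL"] = predict(vector)   # "Normal flow" or an attack name
        yield flow                        # a later stage would store the tagged record


def toy_model(vector: List[float]) -> str:
    """Stand-in for the trained classifier; the threshold is arbitrary."""
    return "SYN Scan" if vector[2] <= 1 else "Normal flow"
```

Passing the predictor in as a callable keeps the streaming loop independent of the model, so a retrained classifier can be swapped in without touching the consumer code.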

Data Preparation
The data preparation process is a crucial step in the ML pipeline. The data preparation steps performed are presented in Figure 3. The listed elements of the data preparation process are described below.

Feature Selection
The first step in the process of preparing the final data shape is feature selection. As the name suggests, feature selection is about choosing, from among all features, only those that contribute to the effectiveness of the model. Feature selection reduces the computational cost and, in many cases, improves the model performance [39]. Feature selection methods evaluate the relationship between each input variable and the target variable.
For this research, the SelectKBest method was used for feature selection, with the score function set to chi2. The Chi-Square test determines whether the occurrence of a particular feature and the occurrence of a particular class are independent. It can be expressed by the following formula:

$$\chi^2(t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{\left(N_{e_t e_c} - E_{e_t e_c}\right)^2}{E_{e_t e_c}} \tag{1}$$

Here, $N$ is the observed frequency and $E$ the expected frequency. $e_t$ takes the value 1 if the document (here, a sample) contains the term (here, a feature) $t$, and 0 otherwise. $e_c$ takes the value 1 if the document belongs to class $c$, and 0 otherwise. A feature that receives a low Chi-Square score can be discarded, as a low score means that the feature and the class are independent. Conversely, a high score means that the class and the feature are dependent, so the highest-scoring features are retained. In Figure 4, the distributions of the 15 most important features in the dataset are shown.
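To make the statistic in Equation (1) concrete, the sketch below evaluates it for a 2x2 contingency table of observed counts, deriving the expected counts under independence from the row and column totals. It is a worked illustration of the formula, not the scikit-learn chi2 implementation, and the two example tables are made up.

```python
def chi_square(observed):
    """Chi-square statistic for a 2x2 table indexed as observed[e_t][e_c]."""
    total = sum(sum(row) for row in observed)
    row_sums = [sum(row) for row in observed]
    col_sums = [observed[0][j] + observed[1][j] for j in range(2)]
    score = 0.0
    for e_t in range(2):
        for e_c in range(2):
            expected = row_sums[e_t] * col_sums[e_c] / total  # E under independence
            score += (observed[e_t][e_c] - expected) ** 2 / expected
    return score


# Made-up counts: rows are feature absent/present, columns are class absent/present.
dependent = [[90, 10], [10, 90]]     # feature almost always co-occurs with the class
independent = [[50, 50], [50, 50]]   # feature carries no information about the class

print(chi_square(dependent))    # large value: feature and class are dependent
print(chi_square(independent))  # zero: feature and class are independent
```

A feature whose presence is strongly tied to a class produces a large statistic, whereas a feature distributed identically across classes produces a value near zero.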

Resolving the Data Imbalance Problem
Uneven distribution of classes is a known ML challenge [40][41][42]. Many ML algorithms can under-perform on imbalanced data, experiencing issues such as misclassification of samples from minority classes as majority classes.
To solve the imbalance issue, the SMOTE (Synthetic Minority Oversampling Technique) method was used at the data preparation stage. SMOTE is one of the most commonly used oversampling methods. This technique was first defined and presented in Reference [43]. It aims to balance the class distribution by increasing the number of minority class instances with the use of an adaptation of the nearest neighbors algorithm.
To create a synthetic instance, it finds the K nearest neighbors of a minority sample, randomly selects one of them, and then computes a linear interpolation between the two to create a new minority sample.
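The interpolation step can be illustrated with a few lines of Python. This is a sketch of plain SMOTE for continuous features only, written for this explanation; it is not the SMOTE-ENC variant and not a library implementation, and it assumes the minority class holds at least two samples.

```python
import math
import random


def smote_sample(minority, k=5, rng=random):
    """Create one synthetic point from a list of minority-class vectors."""
    base = rng.choice(minority)
    # k nearest neighbours of the base point, excluding the point itself
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment base -> neighbour
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]
```

Because every synthetic point lies on a segment between two existing minority samples, the new data stays inside the region already occupied by the minority class.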
For this research, one of the extensions of SMOTE, SMOTE-ENC (Encoded Nominal and Continuous Synthetic Minority Oversampling Technique), was used. This particular method was chosen because the data schema contains categorical values. The authors of Reference [44] confirm that the method produces correct results on categorical values. In SMOTE-ENC, if the value of a categorical attribute of a sample differs from that of its nearest neighbors, then a constant value is added during distance calculation. This allows SMOTE to be used on datasets containing both continuous and categorical features.

Feature Standardization
After the dataset was balanced, all samples were subjected to the standardization process. The values were standardized by removing the mean and scaling to unit variance with the use of the scikit-learn StandardScaler. To maintain consistency in our tests, we have also centered and scaled the features for the tree-based methods, even though RandomForest can handle both scaled and unscaled features.
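The transformation itself is z = (x − mean) / std, applied per feature. A dependency-free sketch is given below for illustration; it mirrors the arithmetic of StandardScaler (which uses the population standard deviation) but is not that class, and the function name is made up.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        return [0.0 for _ in column]   # constant column: nothing to scale
    return [(x - mean) / std for x in column]
```

After the transformation, every feature has zero mean and unit variance, so features measured in bytes and features measured in seconds contribute on a comparable scale.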

Label Encoding
Categorical and textual data is a fairly common occurrence in datasets. In our case, fields such as protocol and label are of the categorical type. Some ML algorithms can handle categorical features, but most expect only numeric values. Therefore, all categorical values in the dataset are converted to numeric values. There are multiple ways to perform this conversion; in this work, two were used: One-Hot-Encoding and Label-Encoding. The Label-Encoder method converts each value in the column to a number, assigning a value according to the order of appearance, and it is suitable for conversion of the dependent variable. The One-Hot-Encoding approach creates a new column for each category and fills it with zeros (False), assigning ones (True) only to samples where the particular value of the feature is present, which makes it suitable for use on features.
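The two encodings can be illustrated with the following sketch, which follows the description in the text (integers assigned in order of first appearance). Note that scikit-learn's LabelEncoder assigns integers by sorted class order instead, so the exact integer values may differ from this illustration; the function names here are hypothetical.

```python
def label_encode(values):
    """Map each distinct value to an integer in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one 0/1 column per distinct category, keyed by category."""
    categories = list(dict.fromkeys(values))   # distinct values, order preserved
    return {c: [1 if v == c else 0 for v in values] for c in categories}
```

Label encoding imposes an artificial ordering on the categories, which is harmless for the target variable but can mislead distance- or gradient-based models when applied to input features; one-hot encoding avoids this at the cost of extra columns.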

Experiments and Results
This section describes the tests in detail and provides the results of the study. The metrics by which the machine learning and neural network algorithms were tested and compared on the dataset are specified.

Evaluation Metrics
In this paper, a standard set of well-known metrics was used to evaluate the approach: Accuracy (ACC), Precision (Pr), Recall (Re), F1-Score, Matthews correlation coefficient (MCC) [45], and Balanced accuracy (BCC).
The metrics are calculated with the use of the confusion matrix. The following values are featured in the confusion matrix: True Positives (TP), which are correctly predicted positive values, and True Negatives (TN), which are correctly predicted negative values. The other two values are False Positives (FP), where the actual class is negative but the predicted class is positive, and False Negatives (FN), where the actual class is positive but the predicted class is negative. Presented below are the individual formulas that were considered in the process of evaluating the performance of the algorithm, given here in their standard forms:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
$$Pr = \frac{TP}{TP + FP}$$
$$Re = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \cdot Pr \cdot Re}{Pr + Re}$$
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
$$BCC = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$
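The metrics named above can be computed directly from the four confusion-matrix counts. The sketch below uses the standard binary definitions; for the multi-class labels of this dataset, they would typically be applied per class in a one-vs-rest fashion and then averaged, which is an assumption rather than a detail stated in the paper.

```python
import math


def metrics(tp, tn, fp, fn):
    """Compute the six evaluation metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "Pr": precision,
        "Re": recall,
        "F1": 2 * precision * recall / (precision + recall),
        "MCC": (tp * tn - fp * fn) / mcc_den,
        "BCC": (recall + specificity) / 2,   # mean of recall and specificity
    }
```

Accuracy alone can look favorable on an unbalanced dataset, which is why MCC and balanced accuracy, both of which account for performance on the negative class, are reported alongside it.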

Results
In this section, the results achieved in detecting the malicious traffic in the dataset collected from an academic network are presented. First, the data schema provided in Section 3.4 was subjected to feature selection, and the 15 features with the greatest impact on the effectiveness of the model were extracted. A summary of these features can be found in Figure 5. The Y-axis features the names of the 15 parameters with the highest scores. The feature with the strongest influence on the result of the classification, according to the SelectKBest method [46], is the duration of the data flow.
The remaining part of the research focused on the following ML methods: a deep neural network [47][48][49], the Random Forest Classifier [50,51], the AdaBoost Classifier [52], and the Gradient Boosted Trees Classifier [53].
The choice of these algorithms was dictated by the following factors. Random Forest has performed consistently well in multiple studies on network attack detection [4], and the authors obtained promising results with this algorithm in earlier work [54,55]. The Gradient Boosted Trees (GBT) algorithm shares the advantages of tree ensembles such as Random Forest and additionally uses gradient information when building successive trees. Artificial neural networks (ANNs) were used because they can continue to learn even when the other methods have reached their full potential. Thus, adding ANNs offers an opportunity to improve results with larger amounts of data. The AdaBoost algorithm was selected to check its potential for the real-world implementation of the NIDS component in the SIMARGL project: the algorithm is fast, simple to use, and does not need extensive hyperparameter tuning.
The hyperparameters of the algorithms were selected using grid search, which performs an exhaustive search over the chosen hyperparameter space. The hyperparameter setting can be a decisive factor for the results obtained by machine-learning methods, as presented in Reference [5].
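An exhaustive grid search of this kind can be sketched as follows, assuming scikit-learn's `GridSearchCV`. The grid values, the 3-fold inner validation, and the synthetic data are hypothetical; the searched hyperparameter space is not listed in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 15 selected features (not the real dataset).
X, y = make_classification(n_samples=300, n_features=15, random_state=0)

# Hypothetical grid: every combination (2 x 2 = 4) is trained and scored.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,  # assumed number of inner validation folds
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because the cost grows with the product of the grid sizes, the grid is usually kept small for the slower models.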
Each classifier was trained on a training set. Cross-validation, a procedure that resamples the data, was used to test and evaluate the models more accurately. The number of groups (folds) into which the set is divided is defined by the K parameter, which in this case was set to 10. Therefore, each result in the summary table of a given test contains 10 records.
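The 10-fold procedure can be sketched as follows, assuming scikit-learn. Stratified, shuffled folds, the macro-averaged metrics, and the synthetic unbalanced data are assumptions made for the illustration; only K = 10 comes from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, unbalanced stand-in data (not the real dataset).
X, y = make_classification(
    n_samples=400, n_features=15, weights=[0.8, 0.2], random_state=0
)

# K = 10: each sample is used for testing exactly once.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = cross_validate(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X,
    y,
    cv=cv,
    scoring=["accuracy", "precision_macro", "recall_macro", "f1_macro"],
)

# One value per fold, i.e. the 10 records of a per-model summary table.
fold_accuracy = results["test_accuracy"]
print(len(fold_accuracy), fold_accuracy.mean())
```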
The first classifier is a deep neural network. Its architecture consists of an input layer with the number of neurons corresponding to the number of features used and the Rectified Linear Unit (ReLU) activation function. This is followed by a dropout layer with the dropout rate set to 0.01, and a hidden layer with 16 neurons and the ReLU activation function. The network closes with a softmax output layer. The loss function was set to "categorical_crossentropy", and the chosen optimization algorithm was Adaptive Moment Estimation (Adam) [56]. The model was trained for eleven epochs with a batch size of five. The test results for this model can be found in Table 7.
The next classifier used to detect malicious samples in the network traffic was Random Forest. Its settings were as follows: n_estimators, which signifies the number of trees used, was set to 100; the maximum tree depth was set to ten; and the minimum number of samples required to split an internal node was set to two. The remaining settings were left at their default values. Test results for this model can be found in Table 8.
The Gradient Boosted Trees classifier is another model that was selected for test-running the dataset. The preparation of this classifier consisted of setting the learning rate to 0.5, the number of boosting stages to 100, the fraction of samples used for fitting each base learner to 0.5, the maximum depth of each regression estimator to 2, and the number of features considered when searching for the best split to 2. Test results for this model can be found in Table 9.
The last classifier utilized in the study was AdaBoost. Its parameter configuration was as follows: the maximum number of estimators at which boosting is terminated was set to 50, the weight applied to each classifier in each boosting iteration (the learning rate) was set to 1, and the base estimator from which the boosted ensemble is built was set to DecisionTreeClassifier. Test results for this model can be found in Table 10.
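The three tree-based configurations described above can be written down as follows. This is a sketch that assumes the scikit-learn estimators, which the parameter names in the text (such as n_estimators and DecisionTreeClassifier) suggest; the mapping of each described setting to a keyword argument is an interpretation, and the dictionary name `models` is introduced only for the example.

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

models = {
    # 100 trees, depth limited to 10, at least 2 samples to split a node.
    "RandomForest": RandomForestClassifier(
        n_estimators=100, max_depth=10, min_samples_split=2
    ),
    # 100 boosting stages, each fitted on a random half of the samples.
    "GradientBoostedTrees": GradientBoostingClassifier(
        learning_rate=0.5,
        n_estimators=100,
        subsample=0.5,
        max_depth=2,
        max_features=2,
    ),
    # The base estimator is left at the library default, which is a
    # depth-1 DecisionTreeClassifier.
    "AdaBoost": AdaBoostClassifier(n_estimators=50, learning_rate=1.0),
}

for name, model in models.items():
    print(name, model.get_params()["n_estimators"])
```

All other parameters stay at their library defaults, which depend on the installed version.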
To summarize the results of the experiment, the measured metrics for all the algorithms used are gathered in Table 11. The Random Forest classifier achieved the best metrics. For comparison, the results of the classifiers without SMOTE data balancing applied are provided in Table 12. The results are given for a 60/40 train/test split.
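The interplay of the 60/40 split and the balancing step can be illustrated as follows. The helper `smote_like_oversample` is a hypothetical, simplified re-implementation of the SMOTE idea (interpolation between a minority sample and one of its nearest minority neighbours), not the library routine used in the study. The binary labels, the synthetic data, and oversampling only the training part are likewise assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Synthetic, unbalanced stand-in data: class 1 is the minority.
X, y = make_classification(
    n_samples=1000, n_features=15, weights=[0.9, 0.1], random_state=0
)

# 60/40 train/test split, preserving the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

# Balance the training part only; the test part keeps its distribution.
minority = X_train[y_train == 1]
n_new = int((y_train == 0).sum() - (y_train == 1).sum())
X_syn = smote_like_oversample(minority, n_new)
X_bal = np.vstack([X_train, X_syn])
y_bal = np.concatenate([y_train, np.ones(n_new, dtype=y_train.dtype)])
print(np.bincount(y_train), np.bincount(y_bal))
```

Keeping the synthetic samples out of the test part avoids reporting metrics on artificial traffic.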
To compare the accuracy of the different classifiers, a statistical method based on the paired Wilcoxon test [57] was applied. The results of these tests are presented in Table 13. It can be seen that the AdaBoost algorithm loses to every other algorithm, while the best choice is the DNN or the GBT algorithm, whose results are comparable.
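A pairwise comparison of this kind can be sketched as follows, assuming `scipy.stats.wilcoxon` applied to the per-fold accuracies of two models. The accuracy values are hypothetical placeholders and are not results from this study.

```python
from scipy.stats import wilcoxon

# Hypothetical per-fold accuracies for two models (10 folds each),
# paired by fold. These are NOT values from the paper.
acc_a = [0.991, 0.989, 0.993, 0.990, 0.992, 0.988, 0.994, 0.991, 0.990, 0.993]
acc_b = [0.970, 0.975, 0.968, 0.974, 0.969, 0.976, 0.972, 0.973, 0.966, 0.978]

# Two-sided Wilcoxon signed-rank test on the paired differences.
statistic, p_value = wilcoxon(acc_a, acc_b)

alpha = 0.05  # significance level, as in Table 13
if p_value < alpha:
    print("The difference in accuracy is statistically significant.")
else:
    print("No significant difference was detected.")
```

Repeating the test for every pair of models yields a win/loss matrix of the kind summarized in Table 13.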

Conclusions
The work presented in this paper provides the results of efficient detection of anomalies in network traffic coming from a real-life architecture. As part of the presented research, traffic from a real-world academic network was collected and, after performing a series of attacks, a dataset was formed. The dataset contains 44 network features and an unbalanced distribution of classes. The traffic captures were annotated accordingly. The efficacy of the dataset for training machine learning algorithms was experimentally evaluated. To investigate the stability of the obtained ML models, cross-validation was performed, and a series of standard detection metrics were reported. The utility of the obtained dataset has been evaluated for the following ML algorithms: the Random Forest Classifier, the Gradient Boosting Classifier, the AdaBoost Classifier, and a Neural Network. The obtained dataset is part of an ongoing endeavor to provide security against novel cyberthreats, executed in the SIMARGL project.
Although the proposed infrastructure generates attacks and collects and labels the traffic, it can be improved. The current approach is to generate one attack at a time. However, in a real-life environment, multiple attacks may be run simultaneously to destabilize various services, such as DNS, email, and e-learning platforms. Thus, as future work, more complex scenarios may help researchers train their machine learning algorithms using datasets that are even closer to real-life network traffic.
Moreover, due to physical resource limitations, the proposed infrastructure does not scale well, since larger amounts of data cannot be generated without affecting the functionality of RoEduNet's client network. The most important limitations that were encountered are the limited disk storage for logged data collected by NProbe, the traffic generated through port mirroring that is sent to NProbe for processing, and the large datasets that must be manually transferred from the source to the BDE Platform. Thus, in the future, a more scalable infrastructure should be implemented, as well as an integration procedure that delivers data directly from the source to the BDE Platform.
In the current implementation, the generated attacks must be manually started and stopped at well-established moments (each attack runs in a well-defined time interval so that the traffic can be labeled accordingly). A further improvement would be to run and label attacks automatically based on a given schedule.
For future work, more types of attacks are going to be added to the dataset. The number and variety of normal traffic samples are also going to be increased. In addition, this collection is set to become publicly available to give more researchers the ability to test and improve their cybersecurity solutions on contemporary and realistic traffic. Within the scope of the SIMARGL project, the aim is to provide RoEduNet with a NIDS solution that suits their needs. Future work is also dedicated to integrating more machine learning concepts and algorithms, including online learning, lifelong learning, and unsupervised anomaly detection.

Figure 1. Network architecture for the "Reconnaissance and Denial-of-Service attacks".

Figure 3. Process of preparing the collected data.

Figure 4. Distribution plots of the 15 most important features.

Figure 5. Result of feature selection.

Table 1. Distribution of network traffic in the CTU-13 dataset for each scenario.

Table 3. Representation of the traffic content of the IoT-23 dataset by executed attacks.

Table 5. Representation of the traffic content of the LITNET-2020 dataset by executed attacks.

Table 6. The list of the collected network features.

Table 7. Summary of the results for the deep neural network.

Table 8. Summary of the results for the Random Forest Classifier.

Table 9. Summary of the results for the Gradient Boosting Classifier.

Table 10. Summary of the results for the AdaBoost Classifier.

Table 11. Comparison of the models used and their prediction results on test data with SMOTE.

Table 12. Comparison of the models used and their prediction results on test data without SMOTE.

Table 13. Statistical analysis of the classifiers by accuracy of the model based on paired Wilcoxon test with p-value 0.05.