LITNET-2020: An Annotated Real-World Network Flow Dataset for Network Intrusion Detection

Abstract: Network intrusion detection is one of the main problems in ensuring the security of modern computer networks, Wireless Sensor Networks (WSN), and the Internet-of-Things (IoT). In order to develop efficient network-intrusion-detection methods, realistic and up-to-date network flow datasets are required. Despite several recent efforts, there is still a lack of real-world network-based datasets which can capture modern network traffic cases and provide examples of many different types of network attacks and intrusions. To alleviate this need, we present LITNET-2020, a new annotated network benchmark dataset obtained from the real-world academic network. The dataset presents real-world examples of normal and under-attack network traffic. We describe and analyze 85 network flow features of the dataset and 12 attack types. We present the analysis of the dataset features by using statistical analysis and clustering methods. Our results show that the proposed feature set can be effectively used to identify different attack classes in the dataset. The presented network dataset is made freely available for research purposes.


Introduction
Network attacks are a set of network traffic events which are aimed at undermining the availability, authority, confidentiality, integrity, and other critical properties of networked computer systems [1]. Various types of cyber-attacks, such as IP spoofing [2,3] and Distributed Denial-of-Service (DDoS) flooding attacks [4], have been recognized as a serious security problem. With the increased scope, type and complexity of computer systems and communication networks, as well as with the emergence of new types of distributed computing technologies (such as the Internet-of-Things (IoT), Edge Computing, Fog Computing, etc.), new types of threats continue to arise against usual user requirements for privacy, security, and trust [5][6][7][8]. Despite numerous research studies in intrusion detection [9,10], a large number of successful cyber-attacks are still registered each year. These attacks not only affect the daily operation of businesses and governments but can also cripple critical infrastructures [11], cloud-based IoT environments [12], cloud storage services [13], wireless sensor networks (WSN) [14], wireless body area networks (WBAN) in telemedicine systems [15], wireless ad-hoc networks [16], in-vehicle networks [17], software defined networks (SDN) [18,19], industrial IoT networks [20], cyber-physical systems [21], Internet of Drones (IoD) [22], Industry 4.0 smart factories [23], Internet of Medical Things (IoMT) such as implantable medical devices [24], IoT edge devices [25], vehicular ad hoc networks (VANET) [26], fifth-generation (5G) mobile networks [27], fog computing services [28], and user smartphones [29].

UNSW-NB15
The UNSW-NB 15 dataset was generated by the IXIA PerfectStorm tool in a small network environment (only 45 unique IP addresses) over a short (31 h) period of time, and it includes a mix of real typical activities and artificial attack behaviors of the network traffic, resulting in 175,341 records for training and 82,332 records for testing. The IXIA tool simulated nine types of attacks. The dataset provides 49 features for analysis, which include basic features, content features (based on the content of packets), time features (based on time characteristics of packet flow), and additional generated features based on the statistical characteristics of connections.

CICIDS 2017
The dataset was made public by the Canadian Institute for Cybersecurity. The creation methodology used two types of usage profiles and multistage attacks, such as Heartbleed, and a variety of DoS and DDoS attacks. It has 80 network traffic features that are extracted by using the CICFlowMeter tool. User profiles were based on the abstract human behavior of 25 users working with the HTTP, HTTPS, FTP, SSH, and email protocols, aiming to generate the background traffic. The traffic was generated for a short (5 days) span of time.

UGR'16
This dataset was created by the University of Granada (Spain) and is intended for the assessment of cyclostationary NIDSs. The dataset was acquired from a tier-3 Internet Service Provider (ISP) over four months and has 16,900 million unidirectional flows. The real network traffic was mixed with synthetically generated malicious attack flows captured in a controlled network environment, which somewhat decreases the quality of the dataset. It has 13 types of malware, including annotated botnet, SSH scan, and SPAM attacks, as well as background and normal network traffic, where background means that it is not known whether the traffic is malicious. The dataset was labeled by using the logs from the honeypot system.

NSL-KDD
NSL-KDD is an enhancement of the KDD dataset. In the KDD dataset, classification was biased toward more recurring records. However, in the NSL-KDD dataset, redundant items were removed, preventing the classifiers from achieving unreasonably high detection rates due to reoccurring records. The dataset includes 4 classes of attack: Denial of Service (DoS), Probe, User to Root (U2R), and Remote to Local (R2L). The training set has 4,898,431 records, and the testing set has 311,027 records.

CSE-CIC-IDS2018
The dataset covers six types of network attacks: Botnet, brute-force, Denial of Service (DoS), Distributed DoS (DDoS), infiltration, and web attacks. The dataset was generated based on the synthetic user profiles, which capture abstract representations of network events and behaviors. Fifty network nodes were used to organize an attack on the victim infrastructure with 420 computers and 30 servers. The dataset includes 84 network traffic features extracted from the network traffic, using the CICFlowMeter-V3 tool.

Summary
The comparison of the analyzed datasets by the attack types they represent is summarized in Table 1. Here, Fuzzer aims to suspend a network node by transmitting random data to it. Virus is a self-replicating malicious program that intrudes on the computer system without the knowledge of the user. Worm spreads through the network without the user's permission, while consuming network bandwidth resources. Trojan is a malicious program that causes security problems in the network, while masquerading as a useful program. DoS aims to deny other users access to network nodes or resources. Network Attack is an attempt to endanger network security at any layer from the data link layer to the application layer. Physical Attack attempts to cripple the physical units of computers or networks. Password Attack aims to obtain a password by repeated login attempts, and can be discovered by several login failures. Information Gathering Attack searches for known security holes by scanning or probing network nodes. User to Root (U2R) attack aims to take advantage of vulnerabilities of a network system in order to gain privileges as the super-user of the system. Remote to Local (R2L) attack dispatches packets to a remote computer system, without having a valid account on that system, aiming to obtain access either as a user or as root. Probe attack scans the networks aiming to find valid IP addresses and to collect private data about the host, in order to start an attack on a selected set of systems and services.
The comparison of old reference (DARPA'98 [70] and KDDCup'99 [71]) datasets and more recent network intrusion datasets is presented in Table 2.

Conclusion of Dataset Analysis
New cyberthreats and types of attack continue to emerge. As a result, new realistic network datasets are needed to keep the development and benchmarking of network intrusion detection methods up-to-date. This has motivated us to collect network flow data from a real-world network and to present it as an open benchmark dataset that can be used freely by the research community in the cyber security domain.
Electronics 2020, 9, 800

Proposed Dataset
In this section, we describe the network environment used for the collection of network traffic data, provide the description of network attacks that are present in the dataset, provide the descriptive characteristics of the dataset, and discuss dataset preprocessing and availability.

Network Environment
The infrastructure consists of connecting nodes with operating equipment and of the communication lines connecting the NetFlow exporters (CAPACITY, CITY1, KTU UNIVERSITY 1, KTU UNIVERSITY 2, and FIREWALL). The LITNET NetFlow topology consists of two main parts (senders and collector). The NetFlow senders are Cisco (Cisco Systems Inc., San Jose, CA, USA) routers. Fortigate (FG-1500D) high-performance next-generation firewalls analyze the data that has passed through them, process it, and send it to one or more NetFlow server collectors. The NetFlow server (collector) is a server with appropriate software (nfcapd, nfdump, nfexpire, nfprofile, nfreplay, and nftrack; version 1.6.15), which is responsible for receiving, storing, and filtering data. We used the 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 x86_64 GNU/Linux operating system; a 4-core Intel Xeon (Skylake, IBRS) CPU; 10 GB of disk space for the system and 30 TB of hard disk space for collecting data; and 8 GB of RAM.
Each of the NetFlow exporters (CAPACITY, CITY1, KTU UNIVERSITY 1, KTU UNIVERSITY 2, and FIREWALL) continuously monitors the flows passing through it (a flow is a sequence of data packets sent in one direction from a specific sender to a specific recipient) and caches them when it receives new traffic. The main ring connects the five largest Lithuanian cities (see Figure 1): Kaunas (CITY1), with Vytautas Magnus University and Kaunas Technological University, which is the administrator of the Lithuanian Research and Education Network (LITNET) and is responsible for the maintenance and development of its connecting nodes; Vilnius (CAPACITY), with Vilnius University and Vilnius Gediminas Technical University; Klaipeda University (CITY2); Siauliai University (CITY3); and KTU Panevezys Faculty of Technologies and Business (CITY4). For efficiency reasons, the NetFlow sender inspects only the first packet of a new flow and saves the corresponding values; subsequent packets of the same flow are then processed according to the same policy, which reduces the load on the network device. Kaunas University of Technology (FIREWALL) has a high-availability infrastructure and a perimeter Fortigate 1500D firewall running the FortiOS operating system, with 80 Gbps network, 13 Gbps IPS (intrusion prevention system), 7 Gbps NGFW (next-generation firewall), and 5 Gbps Threat Protection bandwidth. CITY1 has an exit to the broadband networks NORDUNET and GEANT. CITY1 and CITY2 have a peering connection, that is, an arrangement by which two Internet networks connect and exchange traffic, so they can transfer data traffic directly between each other's LITNET users. The other cities (CITY2, CITY3, and CITY4) have end users, such as schools, municipalities, and other organizations.
We used a 5 min time interval to create an nfcap file for each of KTU UNIVERSITY1, KTU UNIVERSITY2, CITY1, and CAPACITY (nfcap is an application programming interface (API) for capturing network traffic, with files named in the format nfcapd.YYYYMMDDHHMM) and sent it to the NetFlow server for processing. Within each time interval, we counted the number of packets that satisfied the attack rules in order to distinguish the type of attack.
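The 5 min capture interval and the per-interval counting of rule-matching packets can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: the `Flow` record, the `is_syn_only` rule, and the sample values are hypothetical, and the sketch assumes that the timestamp in the file name marks the start of the interval. Only the `nfcapd.YYYYMMDDHHMM` naming pattern and the 5 min interval come from the text.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta

INTERVAL_MINUTES = 5  # capture interval stated in the text


@dataclass
class Flow:
    """Hypothetical, simplified flow record (not the dataset schema)."""
    start: datetime
    packets: int
    tcp_flags: str  # set flags as letters, e.g. "S" or "AS"


def capture_name(ts: datetime) -> str:
    """Name of the capture file whose 5 min interval contains ts."""
    floored = ts - timedelta(minutes=ts.minute % INTERVAL_MINUTES,
                             seconds=ts.second,
                             microseconds=ts.microsecond)
    return "nfcapd." + floored.strftime("%Y%m%d%H%M")


def is_syn_only(flow: Flow) -> bool:
    """Illustrative rule: the SYN flag is the only TCP flag set."""
    return flow.tcp_flags == "S"


def packets_per_interval(flows, rule):
    """Sum the packets of rule-matching flows per capture interval."""
    counts = Counter()
    for flow in flows:
        if rule(flow):
            counts[capture_name(flow.start)] += flow.packets
    return dict(counts)


flows = [
    Flow(datetime(2019, 10, 7, 9, 31, 12), 40, "S"),
    Flow(datetime(2019, 10, 7, 9, 34, 59), 60, "S"),
    Flow(datetime(2019, 10, 7, 9, 35, 0), 5, "AS"),
]
print(packets_per_interval(flows, is_syn_only))
```

A threshold on the resulting per-interval counts would then flag suspicious intervals; the thresholds themselves are not given in the text and are therefore left out.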

Description of Network Attacks
We describe the attack types in the proposed dataset as follows. Smurf attack keeps sending the Internet Control Message Protocol (ICMP) broadcast requests to the network on behalf of the target node, aiming to flood the node with network traffic in order to slow down the targeted node.
An Internet Control Message Protocol (ICMP) flood attack is a DoS attack which aims to overwhelm a targeted network node with ICMP echo-requests (pings).
UDP-flood attack is a DoS attack using the User Datagram Protocol (UDP). A DNS Flood Attack (DNS Flooding) is an application-specific variant of a UDP flood, which is characterized by network packets sent to any IP address, using UDP protocol and port 53 as the target.
TCP SYN-flood attack is a Distributed DoS (DDoS) attack that misuses a part of the ordinary Transmission Control Protocol (TCP) three-way handshake to drain resources on the victim node and make it unresponsive. The attack packets have the S (SYN) flag set but none of the A, F, R, P, or U flags.
HTTP-flood attack is a DDoS attack that exploits seemingly legitimate Hyper Text Transfer Protocol (HTTP) GET or POST requests to attack a web server or application. As a complex Layer 7 attack, an HTTP flood does not use ill-formed packets, spoofing, or reflection, and it needs less bandwidth than other types of attacks to disable the victim server or site. The attack packets are directed only to port 80.
LAND attack is a Layer 4 DoS attack in which the malicious node sets the same source and destination data in a TCP segment. The attack packets have the S flag set and use the TCP protocol. An attacked node will hang because the same packet is repeatedly processed by the TCP stack.
W32.Blaster Worm attack spreads by exploiting the buffer overrun vulnerability of the Microsoft Windows DCOM RPC interface. The attacks are directed only to ports 135, 69 (TFTP), and 4444 (Kerberos).
Code Red Worm attack aims to cause a buffer overflow on a target node, so that it begins to overwrite the adjacent memory. The packets are directed to source IP and only to port 80 (no Secure Sockets Layer (SSL)), where the HTTP GET method is applied.
Spam bot attack dispatches spam messages or posts spam on social media platforms or forums. The packets are directed only to port 25 (no SSL). The attack is characterized by an excessively large number of SMTP connections from one address.
Reaper Worm attack begins its last phase of scanning once the IP is passed to the exploit process. The Reaper attack is directed at TCP ports 81, 82, 83, 84, 88, 1080, 3000, 3749, 8001, 8060, 8080, 8081, 8090, 8443, 8880, and 10000. An attack is recorded only when the packets belong to a TCP stream and do not use the UDP, ICMP, or ICMP6 protocols.
Port Scanning/Spread attack dispatches client requests to a range of server port addresses, aiming to discover an active port and take advantage of a known security hole. It is characterized by an abnormal number of connections from one host to one or more other hosts: several ports on one address, or a single port on multiple addresses.
Packet fragmentation attack is a kind of DoS attack, in which the attacker overloads a network by taking advantage of the datagram fragmentation.
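The protocol, port, and flag criteria in the descriptions above can be read as per-flow signature checks. The sketch below is a hedged illustration of that reading and not the rule set used to label the dataset: the `Flow` record and the label strings are hypothetical, and the packet-count thresholds that turn a matching flow into a recorded attack are not given in the text and are omitted. Because several descriptions overlap (for example, both HTTP-flood and Code Red target port 80), the function returns all candidate labels.

```python
from dataclasses import dataclass

# Ports listed in the attack descriptions.
REAPER_PORTS = {81, 82, 83, 84, 88, 1080, 3000, 3749, 8001, 8060,
                8080, 8081, 8090, 8443, 8880, 10000}
BLASTER_PORTS = {135, 69, 4444}


@dataclass(frozen=True)
class Flow:
    """Hypothetical, simplified flow record (not the dataset schema)."""
    proto: str           # "TCP", "UDP", "ICMP", ...
    src_ip: str
    dst_ip: str
    dst_port: int
    tcp_flags: str = ""  # set flags as letters, e.g. "S" or "AS"


def candidate_attacks(flow: Flow) -> list:
    """Return the attack types whose signature the flow matches."""
    flags = set(flow.tcp_flags)
    labels = []
    if flow.proto == "TCP":
        if "S" in flags and flow.src_ip == flow.dst_ip:
            labels.append("LAND")
        if "S" in flags and not flags & set("AFRPU"):
            labels.append("TCP SYN-flood")
        if flow.dst_port == 80:
            labels.append("HTTP-flood or Code Red")
        if flow.dst_port == 25:
            labels.append("Spam bot")
        if flow.dst_port in REAPER_PORTS:
            labels.append("Reaper Worm")
    if flow.proto == "UDP":
        labels.append("DNS-flood" if flow.dst_port == 53 else "UDP-flood")
    if flow.proto == "ICMP":
        labels.append("ICMP-flood or Smurf")
    if flow.proto in ("TCP", "UDP") and flow.dst_port in BLASTER_PORTS:
        labels.append("W32.Blaster")
    return labels


print(candidate_attacks(Flow("TCP", "10.0.0.1", "10.0.0.2", 8080, "S")))
```

A single matching flow is of course not an attack; in practice such predicates would be combined with counts per time interval and per source address.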

Descriptive Characteristics
The traffic analysis is described for the cumulative flows while generating the dataset. The descriptive characteristics of the dataset are given in Table 3. Table 4 represents the distribution of all data instances of the proposed dataset. All instances are categorized into ordinary data and attack data. The attack instances are further categorized into nine classes, according to the type of the network attack. A DoS attack with SYN packets can be explained in a simple way as a flow of illegitimate traffic to network resources from one IP address or a set of IP addresses that results in a lack of network resources. The attackers disrupt the three-way handshake by not responding to the SYN-ACK from the server, or they constantly send SYN packets from a non-existent IP address. The server maintains a queue of connections for which a SYN-ACK has been sent; because there is no response from the clients, the queue overflows, and the server is no longer available. This is called a SYN attack or SYN flood. An example of network traffic flows is shown in Figure 2.
Here, we can see that there is an obvious anomaly in the sending of SYN packets on the Kaunas (CITY1) channel. As the graph shows, this attack lasted two days (it started on 2019-10-05 and ended on 2019-10-07) and occupied 14 Mb/s of data traffic on the Kaunas (CITY1) channel. In the presented real case, an attack peak of TCP SYN packets occurred on 2019-10-07 at 09:30. This is evidenced by the huge number of packets in the data stream layout and the exceptional increase in data traffic on the graph.
For visualization of NetFlow data, rules based on our proposed attack detection (see Figure 2; Profile: SYN) were developed for automatic notification of a possible cyber incident. For analysis of data stored in the collector (NetFlow server), we used flow-tools, nfstat, flowd, nfsen 1. 3

Dataset Availability and Preprocessing
For the collection of data, we use the methodology suggested in [61]. The network traffic data are captured in nfcapd binary format files. The nfcapd files are collected in a single file per week for two capture periods. The mean file size is about 1.35 GB (compressed). The nfcapd files have all NetFlow features, extended with 19 custom attack detection features. The capture starts with the first flow on 2019-03-06 and ends with the last flow on 2020-01-31. The IP addresses of network nodes have been anonymized. Information from senders (NetFlow raw data files) to the collector is received in NetFlow v9 format (RFC 3954). All values are transferred to the MySQL database (NetFlow database SQL server).
All dataset files can be freely downloaded from our website: https://dataset.litnet.lt. The dataset formation is summarized in Figure 3. The data preprocessor selects 49 attributes that are specific to the NetFlow v9 (RFC 3954) protocol to form a dataset. The data extender expands the generated dataset with additional time and TCP flag fields, which are later used to identify attacks. The extended dataset is thus supplemented by a set of 15 attributes. The generator creates 19 additional attributes for attack type recognition (see Table 5). The combinations of these and the NetFlow attributes are used to detect attacks. We also added two fields that indicate whether a record is assigned to an attack or to normal network traffic and, for attack records, the specific type of attack. Therefore, we have a total of 85 attributes.
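The attribute total can be checked directly from the counts given above. In this short sketch the group names are illustrative labels, not field names from the dataset; the counts are those stated in the text.

```python
# Attribute groups and their sizes as stated in the text.
ATTRIBUTE_GROUPS = {
    "netflow_v9": 49,        # selected by the data preprocessor
    "extended": 15,          # added by the data extender (time, TCP flags)
    "attack_detection": 19,  # created by the generator
    "labels": 2,             # attack/normal marker and attack type
}

total = sum(ATTRIBUTE_GROUPS.values())
print(total)  # 85
```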

Summary
The proposed dataset was collected in a real-world network over an extended period of time (10 months) and contains real network attacks on a country-wide network infrastructure with servers in four geographically distributed locations (cities). As such, the proposed network flow dataset has an advantage over some of its counterparts (such as the UNSW-NB 15 dataset [59]), in which the attacks were generated artificially and which thus do not contain realistic data.

Description and Statistical Analysis of Dataset Features
This section presents the description and analysis of dataset features. First, we formulate the requirements for dataset features. Then, we present the description of different classes of features. Next, we analyze the statistical distribution of the feature values and illustrate the results by figures. Finally, we summarize the results.

Requirements for A Dataset and Its Features
We formed the dataset by using the requirements for NIDS evaluation datasets outlined in [61] as follows. The dataset features should include network flow characteristics, such as IP addresses and port numbers, number of packets and bytes, flow duration, and flags. The dataset records should be correctly labeled as malicious or not, and in case of attack records, they should also include the type of attack. The dataset should cover several different periods of network activity, such as daytime/nighttime and weekdays/weekends.

Description of Features
In describing the features, we follow the description scheme suggested in [72] that considers the flow, basic, content, general purpose time slice, and connection features. These are summarized in Tables 5-7. The source_IP, target_IP, and time are noted as key for intrusion detection [73]. Source and destination ports, source and destination IP addresses, and time were mentioned as the most informative features for intrusion alerting [19]. Similar time-slice features based on the calculation of unique IP addresses within a time window were successfully used for network-attack detection before [74]. Additionally, we provide two network-attack attributes labeled by the network security experts, in Table 8.

Table 7. Time-slice-based connection features.
No.  Description
6    The largest count of connections from the same destination IP address and port in previous 10,000 connections
7    The largest count of connections from the same source port in previous 10,000 connections
8    The largest count of connections from the same destination port in previous 10,000 connections
9    The average count of connections from the same destination IP address and port in previous 10,000 connections
10   The average count of connections from the same source port in previous 10,000 connections
11   No. of connections with unique source IP-destination IP address pairs in previous 10,000 connections
12   The largest count of connections with the same pair of source IP-destination IP addresses in previous 10,000 connections
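The time-slice connection features listed in Table 7 can be sketched as counts over the previous connections. This is a minimal, unoptimized illustration under stated assumptions: the `Conn` record and the feature names are hypothetical, and "average count" is interpreted here as the mean number of connections per distinct key (for example, per distinct source port), which the text does not spell out.

```python
from collections import Counter, deque
from typing import NamedTuple

WINDOW = 10_000  # number of previous connections, as in Table 7


class Conn(NamedTuple):
    """Hypothetical, simplified connection record."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def _largest(counter):
    return max(counter.values(), default=0)


def _average(counter):
    return sum(counter.values()) / len(counter) if counter else 0.0


def slice_features(previous):
    """Compute the features over an iterable of previous connections."""
    previous = list(previous)
    by_dst = Counter((c.dst_ip, c.dst_port) for c in previous)
    by_sport = Counter(c.src_port for c in previous)
    by_dport = Counter(c.dst_port for c in previous)
    by_pair = Counter((c.src_ip, c.dst_ip) for c in previous)
    return {
        "max_dst_ip_port": _largest(by_dst),
        "max_src_port": _largest(by_sport),
        "max_dst_port": _largest(by_dport),
        "avg_dst_ip_port": _average(by_dst),
        "avg_src_port": _average(by_sport),
        "unique_ip_pairs": len(by_pair),
        "max_ip_pair": _largest(by_pair),
    }


def features_per_connection(conns, window=WINDOW):
    """Yield, for each connection, the features of the window before it."""
    history = deque(maxlen=window)
    for conn in conns:
        yield slice_features(history)
        history.append(conn)
```

Recomputing the counters for every connection costs time proportional to the window size; an incremental version would update the counters as connections enter and leave the window.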

Analysis of Features
Following [75], we calculated the mean and standard deviation of all features. We studied the feature variance, using the cumulative distribution function (CDF), as was suggested in [76]. The results are shown in Figure 4. These feature distributions are heavy tailed, and small values account for most cases. In the SYN attack subset of the dataset, 95% of source network nodes connect to fewer than 29 unique destination IPs, only 1% of source nodes connect to more than 185 unique IPs, and only 0.1% connect to more than 2900 unique IPs. Such outliers help in identifying malicious behavior in the network. Feature value distributions also reveal possible correlations between features. For example, the Pearson correlation between input packets and input bytes is 0.92 when an attacker performs the SYN attack. Such high levels of correlation allow us to identify features containing duplicate information.
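The two statistics discussed above, the empirical CDF of the number of unique destination IPs per source node and the Pearson correlation between two features, can be computed as in the following sketch. The function names and the toy data are hypothetical and serve only to illustrate the calculations; they do not reproduce the reported values.

```python
from collections import defaultdict
from math import sqrt


def fan_out(flows):
    """Number of unique destination IPs contacted by each source IP.

    flows: iterable of (src_ip, dst_ip) pairs.
    """
    dests = defaultdict(set)
    for src, dst in flows:
        dests[src].add(dst)
    return {src: len(d) for src, d in dests.items()}


def empirical_cdf(values, x):
    """Fraction of values that are less than or equal to x."""
    values = list(values)
    return sum(v <= x for v in values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences.

    Undefined (division by zero) if either sequence is constant.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


flows = [("a", "x"), ("a", "y"), ("a", "x"), ("b", "x"), ("c", "z"), ("d", "x")]
counts = fan_out(flows)
print(counts, empirical_cdf(counts.values(), 1))
```

Reading the CDF at a threshold gives statements of the form "a given fraction of source nodes connect to at most that many unique destination IPs", and a correlation close to 1 between two features suggests that one of them is redundant.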
To analyze the dynamical changes in the network flows, we applied window slices and observed the change of unique source and destination IPs over time. Here, we used a window of 10,000 NetFlows moved with a step of 5000 NetFlows. The results are presented in Figure 5 for the IP addresses and in Figure 6 for the port connections. One can see sharp changes in the behavior of network nodes, which may be indicative of the network attack.
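The window slicing described above (a window of 10,000 NetFlows moved with a step of 5000) can be sketched as follows. The function name and the input layout, a sequence of (source IP, destination IP) pairs in capture order, are assumptions made for illustration; trailing flows that do not fill a complete window are ignored in this sketch.

```python
def unique_ips_per_window(flows, size=10_000, step=5_000):
    """Count unique source and destination IPs in each window slice.

    flows: sequence of (src_ip, dst_ip) pairs in capture order.
    Returns a list of (start_index, n_unique_src, n_unique_dst).
    """
    out = []
    for start in range(0, max(len(flows) - size, 0) + 1, step):
        chunk = flows[start:start + size]
        out.append((start,
                    len({s for s, _ in chunk}),
                    len({d for _, d in chunk})))
    return out


flows = [("a", "x"), ("a", "y"), ("b", "y"), ("b", "z"), ("c", "z"), ("c", "z")]
print(unique_ips_per_window(flows, size=4, step=2))
```

A sudden jump in the number of unique destination IPs from one slice to the next is the kind of sharp change that the figures illustrate.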
The distribution of data according to the protocol types is presented in Figure 7. The most common protocols in the dataset are TCP and UDP.
The temporal frequency of the source and destination ports in the NetFlow connections is presented in Figure 8. As a baseline, a reference frequency is given that corresponds to a uniform distribution of connections over the ports. Note that connections from/to some ports are much more frequent, e.g., the most frequent source ports are 54,438 and 444, while the most frequent destination ports are 444 and 54.
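A comparison of observed port frequencies with a uniform reference can be sketched as below. The exact baseline used for the figure is not specified in the text; this sketch assumes a uniform distribution over the full 16-bit port space, and the function name and sample data are hypothetical.

```python
from collections import Counter

N_PORTS = 65_536  # size of the 16-bit port number space (assumed baseline)


def port_frequencies(ports, top=2):
    """Return (port, relative frequency, ratio to uniform baseline)
    for the `top` most frequent ports."""
    counts = Counter(ports)
    total = sum(counts.values())
    baseline = 1 / N_PORTS
    return [(port, n / total, (n / total) / baseline)
            for port, n in counts.most_common(top)]


sample = [444] * 6 + [54] * 3 + [80]
for port, freq, ratio in port_frequencies(sample):
    print(port, round(freq, 3), round(ratio, 1))
```

Ports whose ratio to the baseline is far above 1 are the ones that stand out in such a plot.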
To analyze the dynamical changes in the network flows, we applied window slices and observed the change of unique source and destination IPs over time. Here, we used a window of 10,000 NetFlows moved with a step of 5000 NetFlows. The results are presented in Figure 5 for the IP addresses and in Figure 6 for the port connections. One can see sharp changes in the behavior of network nodes, which may be indicative of the network attack.  The distribution of data according to the protocol types is presented in Figure 7. The most common protocols in the dataset are TCP and UDP. The statistical distribution of the connection flags in the TCP SYN-flood subset of the dataset is presented in Figure 9. It shows that most of the network connections had the S (TCP SYN) flag.
In case of the scan spread attack, the attacker scans for new IPs in the network. As a result, the number of unique destination IPs in a NetFlow window slice grows steadily during the attack (see Figure 10, left). Moreover, the attacker performs the port scan on the attack nodes, which can be seen from the distribution of port numbers in time (see Figure 10, right). The temporal frequency of the source and destination ports in the NetFlow connections are presented in Figure 8. As a baseline, a reference frequency is given, if the distribution of connections to ports would be uniform. Note that connections from/to some ports are much more frequent, e.g., the most frequent source ports are 54,438 and 444, while the most frequent destination ports are 444 and 54. The statistical distribution of the connection flags in the TCP SYN-flood subset of the dataset is presented in Figure 9. It shows that most of the network connections had the S (TCP SYN) flag.
. Figure 9. The statistical distribution of connection flags (TCP SYN-flood).
In case of the scan spread attack, the attacker scans for new IPs in the network. As a result, the number of unique destination IPs in a NetFlow window slice grows steadily during the attack (see Figure 10, left). Moreover, the attacker performs the port scan on the attack nodes, which can be seen from the distribution of port numbers in time (see Figure 10, right). The temporal frequency of the source and destination ports in the NetFlow connections are presented in Figure 8. As a baseline, a reference frequency is given, if the distribution of connections to ports would be uniform. Note that connections from/to some ports are much more frequent, e.g., the most frequent source ports are 54,438 and 444, while the most frequent destination ports are 444 and 54. The statistical distribution of the connection flags in the TCP SYN-flood subset of the dataset is presented in Figure 9. It shows that most of the network connections had the S (TCP SYN) flag.
. Figure 9. The statistical distribution of connection flags (TCP SYN-flood).
In case of the scan spread attack, the attacker scans for new IPs in the network. As a result, the number of unique destination IPs in a NetFlow window slice grows steadily during the attack (see Figure 10, left). Moreover, the attacker performs the port scan on the attack nodes, which can be seen from the distribution of port numbers in time (see Figure 10, right). The values of the time-slice-based connection features are presented in Figure 11. Note the sharp changes and peaks in the values, which may be indicative of network attacks. The statistical distribution of feature values is highly skewed and shows a considerable difference between the values. We used the violin plot, which was already used to visualize the distribution of networktraffic features before, in [58]. See the violin plot of feature-value distribution presented in Figure 12. The values of the time-slice-based connection features are presented in Figure 11. Note the sharp changes and peaks in the values, which may be indicative of network attacks. The statistical distribution of feature values is highly skewed and shows a considerable difference between the values. We used the violin plot, which was already used to visualize the distribution of network-traffic features before, in [58]. See the violin plot of feature-value distribution presented in Figure 12.
The values of the time-slice-based connection features are presented in Figure 11. Note the sharp changes and peaks in the values, which may be indicative of network attacks. The statistical distribution of feature values is highly skewed, with considerable differences between the values. We used the violin plot, which has previously been used to visualize the distribution of network-traffic features [58]. See the violin plot of the feature-value distribution in Figure 12.
Figure 11. Values of time-slice connection features (from Table 7).
For the analysis and unsupervised clustering, we first normalized the dataset features. In unsupervised analysis, dimensionality reduction methods, such as Principal Component Analysis (PCA), are often used [77]. Here, we applied the t-distributed Stochastic Neighbor Embedding (t-SNE) method [78] to reduce the dimensionality to two dimensions. Clusters in the resulting low-dimensional embedding were assigned by the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [79]. As a result, we obtained 12 clusters corresponding to different network-attack types, which are shown in Figure 13.
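The normalization and clustering steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: min-max scaling is assumed as the normalization, the t-SNE step is omitted (it requires an external library), so the normalized two-column rows merely stand in for a two-dimensional embedding, and the `eps` and `min_pts` values are hypothetical.

```python
import math

def min_max_normalize(rows):
    """Scale every column of a list of equal-length rows to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi))
        for row in rows
    ]

def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 marking noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)   # j is a core point, keep expanding
    return labels

# Hypothetical 2D points standing in for a t-SNE embedding.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
            (10.0, 0.0)]
scaled = min_max_normalize(features)
labels = dbscan(scaled, eps=0.1, min_pts=3)
```

The two dense groups receive separate cluster labels and the isolated point is marked as noise, which is the behavior that lets DBSCAN determine the number of clusters from the embedding itself rather than from a preset count.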
To analyze the differences between clusters, we adopted a pairwise two-sample multivariate Kolmogorov-Smirnov (KS) test, which is used to determine whether two sets of data arise from the same or different distributions. The null hypothesis was that the data in each pair of compared clusters are drawn from the same continuous distribution. We used a two-dimensional version of the KS test because the low-dimensional embedding was two-dimensional. The null hypothesis was rejected (p < 0.001) for all pairs of clusters.
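A two-sample, two-dimensional KS comparison between a pair of clusters could look like the sketch below. The text does not say which variant of the 2D test was used; this sketch assumes a statistic in the style of Fasano and Franceschini (data points as quadrant origins) and swaps in a permutation test for the p-value instead of an analytic approximation. All names and the sample points are hypothetical.

```python
import random

def _quadrant_fractions(origin, sample):
    """Fraction of `sample` falling in each of the four quadrants around `origin`."""
    x0, y0 = origin
    counts = [0, 0, 0, 0]
    for x, y in sample:
        if y > y0:
            counts[0 if x > x0 else 1] += 1
        else:
            counts[3 if x > x0 else 2] += 1
    n = len(sample)
    return [c / n for c in counts]

def _max_diff(origins, a, b):
    """Largest quadrant-fraction difference between a and b over all origins."""
    best = 0.0
    for origin in origins:
        fa = _quadrant_fractions(origin, a)
        fb = _quadrant_fractions(origin, b)
        best = max(best, max(abs(p - q) for p, q in zip(fa, fb)))
    return best

def ks2d_statistic(a, b):
    """Average of the maxima obtained with origins taken from a and from b."""
    return 0.5 * (_max_diff(a, a, b) + _max_diff(b, a, b))

def ks2d_permutation_test(a, b, n_perm=200, seed=0):
    """Return (statistic, permutation p-value); assumption: labels are exchangeable under H0."""
    rng = random.Random(seed)
    observed = ks2d_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks2d_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Two hypothetical, well-separated clusters on a small grid.
cluster_a = [(i % 5 * 0.1, i // 5 * 0.1) for i in range(20)]
cluster_b = [(x + 3.0, y + 3.0) for x, y in cluster_a]
d, p = ks2d_permutation_test(cluster_a, cluster_b)
```

With `n_perm` permutations the smallest attainable p-value is 1/(n_perm + 1), so resolving a threshold such as p < 0.001 would require at least 1000 permutations.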
The distribution of feature values according to the attack-behavior clusters in time can be seen in Figure 14. To allow for better comparison, all feature values were normalized to (0,1).
To evaluate the significance of each feature, we used two tests. First, we applied t-test-based feature ranking, using each cluster vs. all other clusters as a dependent variable. In the second test, we used the split-half approach, splitting the set of clusters in half randomly and performing feature ranking; the procedure was repeated N = 1000 times. The results of feature ranking were analyzed by using the non-parametric rank-based Friedman test, and the result was statistically significant (p < 0.05). Finally, the post hoc Nemenyi test was applied, and its results are presented by using the significance diagram [80] (see Figure 15). Note that, according to the one-versus-all splitting, there is no statistically significant difference between the feature ranks (Figure 15a); therefore, all features are significant and contribute to the construction of clusters. For the split-half testing, the feature-ranking results (Figure 15b) show that features F6, F9, F1, and F10 have the highest rank among all features, which is statistically significant (p < 0.001).
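The one-versus-all, t-test-based ranking can be illustrated with the following sketch. It assumes Welch's unequal-variance t statistic and ranks features by its absolute value; the text does not specify the t-test variant, and the feature rows, labels, and function names are hypothetical. The Friedman and Nemenyi steps are not shown.

```python
import math
from statistics import mean, variance

def welch_t(xs, ys):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    return (mean(xs) - mean(ys)) / se

def rank_features_one_vs_rest(rows, labels, target):
    """Rank feature indices by |t| between cluster `target` and all other clusters."""
    inside = [r for r, l in zip(rows, labels) if l == target]
    outside = [r for r, l in zip(rows, labels) if l != target]
    scores = []
    for f in range(len(rows[0])):
        t = welch_t([r[f] for r in inside], [r[f] for r in outside])
        scores.append((abs(t), f))
    # Largest |t| first: the feature that best separates the target cluster.
    return [f for _score, f in sorted(scores, reverse=True)]

# Hypothetical normalized feature rows for two clusters (three features each).
rows = [
    (0.50, 0.10, 0.30),
    (0.55, 0.12, 0.35),
    (0.45, 0.11, 0.32),
    (0.52, 0.90, 0.50),
    (0.48, 0.88, 0.55),
    (0.51, 0.91, 0.48),
]
cluster_labels = [0, 0, 0, 1, 1, 1]
ranking = rank_features_one_vs_rest(rows, cluster_labels, target=0)
```

In this toy example the second feature separates the clusters most strongly, the third less so, and the first hardly at all, giving the ranking `[1, 2, 0]`. Repeating the ranking for every cluster yields the per-cluster rank lists that a Friedman test would then compare.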

Conclusions of Dataset Analysis
The statistical analysis of the features included in the proposed dataset shows that the features can be used for detecting various types of network attacks. In particular, the time-slice connection features can be embedded into a lower-dimensional space, where clustering methods can be applied to map the low-dimensional representation of features to network-attack classes.

Comparison with Other Datasets
Table 9 compares the proposed dataset with the other analyzed datasets according to the number of networks, the number of unique IP addresses, the period of data collection, the attack vectors, and the number of features for each dataset. The proposed network dataset was collected over a longer period of time (10 months) than the other analyzed datasets; it covers more network-attack classes (12) and contains more features (85). Therefore, the proposed dataset could be a valuable contribution to the research community and enrich the available set of datasets for the development and improvement of network-attack recognition methods.

Conclusions
Known network-intrusion benchmark datasets usually do not realistically represent modern network-traffic and network-attack scenarios. In contrast, the proposed dataset contains real-world network traffic data and annotated attack examples, rather than artificially simulated attacks executed in a sandboxed network environment.
To facilitate the improvement of existing network-intrusion-detection methods and the development of new ones, we have proposed a new network flow dataset. The dataset has 85 features that can be used to recognize 12 different types of network attacks. We compared the proposed dataset with two classical and four other modern datasets by key features and described its advantages and limitations. Our dataset contains real network traffic captured over 10 months. This provides an advantage over synthetically generated datasets, because an artificial synthesis of network traffic might lead to incorrect network-attack models and behaviors.
We expect that the proposed dataset will be helpful to researchers working in the cybersecurity domain and can be used as a modern benchmark network-intrusion dataset.

Conflicts of Interest:
The authors declare no conflict of interest.