Introducing UWF-ZeekData22: A Comprehensive Network Traffic Dataset Based on the MITRE ATT&CK Framework

Abstract: With the rapid rate at which networking technologies are changing, there is a need to regularly update network activity datasets to accurately reflect the current state of network infrastructure/traffic. The uniqueness of this work was that this was the first network dataset collected using Zeek and labelled using the MITRE ATT&CK framework. In addition to identifying attack traffic, the MITRE ATT&CK framework allows for the detection of adversary behavior leading to an attack. It can also be used to develop user profiles of groups intending to perform attacks. This paper also outlined how both the cyber range and Hadoop's big data platform were used for creating this network traffic data repository. The data was collected using Security Onion in two formats: Zeek and PCAPs. Mission logs, which contained the MITRE ATT&CK data, were used to label the network attack data. The data was transferred daily from the Security Onion virtual machine running on a cyber range to the big-data platform, Hadoop's distributed file system. This dataset, UWF-ZeekData22, is publicly available at datasets.uwf.edu.


Introduction
As the variety of cyberattacks grows by the day, targeting everything from corporations (large and small), municipalities, healthcare institutions, and educational institutions to critical infrastructure, it is no longer sufficient to just analyze attacks after they happen. Though analyzing attacks after they happen will provide some insight, attackers are constantly finding new ways to attack different systems and infrastructures. In addition to network intrusion detection, other aspects such as threat hunting, intelligence hunting, and risk management are equally important for corporations (large and small) as well as other systems and infrastructures. Hence, developing a good cybersecurity dataset is a major challenge in today's world. In addition to network intrusion detection capabilities, a good network dataset has to be able to provide intelligence and be responsive to new threats. This motivated our choice of the MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) framework in the development of this new network dataset, UWF-ZeekData22, available at datasets.uwf.edu [1]. The MITRE ATT&CK framework has a knowledge base that can be expanded to respond quickly to newer threats.
This paper describes the creation of a one-of-a-kind modern real (not simulated) network data repository, created using Zeek [2] and labeled using the MITRE ATT&CK framework [3]. The MITRE ATT&CK® framework, originally created in 2013 and continually updated since, is a knowledge base of adversary tactics and techniques based on real-world observations. This knowledge base serves as a foundation for the development of threat models used in the private sector as well as government. The present dataset:

• Can be used to detect adversary behavior leading up to an attack;
• Can be used to develop a profile of users or user groups intending to perform attacks;
• Can also be used to identify attack traffic and attacks.
The rest of the paper is organized as follows. The next section presents the related works and provides some background on the present state of network datasets. Section 3 presents the architectural framework used for collecting UWF-ZeekData22. Section 4 describes the process of generating and collecting the data. Section 5 explains the data. Section 6 presents the mapping and correlation (labelling) of the data. Section 7 presents a traffic analysis, and, finally, Section 8 presents the conclusion.

Background and Related Work
To develop strong and robust automated network risk detection and mitigation solutions, the first necessity is to have a modern network traffic dataset, which is presently lacking. Though several network intrusion datasets have been developed over the past 25 years, researchers are still looking for better datasets that can be used to build robust solutions. Table 1 presents a comparison of many of the major network intrusion datasets built to date, starting with KDDCup99. The datasets were compared based on the following parameters: duration of data collected, whether the data was simulated or real, number of attack families, format of the data collected, the number of networks the data was collected from, number of distinct IP addresses, extraction tools used, number of features, number of files in the data, and the framework. Next, the datasets are briefly discussed.
While other network attack datasets have been proposed [4][5][6][7], older datasets such as DARPA'98 and KDDCup99 [8,9] are still being used in current research. The KDDCup99 dataset has been a very widely studied network intrusion dataset. This simulated dataset was mainly built off DARPA'98 and inherited the problems of DARPA'98. For example, an analysis of the attacks in the DARPA dataset revealed that many did not fit any of the attack categories and were likely caused by simulation artifacts [10]. KDDCup99 has additional problems of its own. The dataset is disproportionately distributed, which makes it a poor fit for machine learners, and it is known to have repeating records. To solve the issues of the KDDCup99 dataset, the NSL-KDD dataset was developed. In the NSL-KDD dataset, redundant or duplicate records were removed, and the dataset was more balanced [10]. This dataset has four types of attacks: DoS, probe, user-to-root, and remote-to-local. It has 5,209,458 records.
The DDoS 2016 dataset (not included in Table 1) was developed using the Network Simulator NS2 [11]. This dataset has 27 features and 734,627 records. It includes four types of attacks: HTTP flood, UDP flood, DDoS using SQL injection, and Smurf.
The UNSW-NB15 dataset was developed using the IXIA PerfectStorm tool in a network with 45 IP addresses over 31 h [12]. This dataset, which has 49 features and 175,341 records, includes both typical activities and injected attack behaviors.
The University of Granada 2016 (UGR16) dataset acquired network data from a tier-three Internet Service Provider (ISP) over four months, and it was labeled using the logs from a honeypot system [13]. This dataset includes three types of malware: annotated botnet, SSH scan, and SPAM attacks.
The CICIDS 2017 dataset, collected by the Canadian Institute for Cybersecurity [14], used the CICFlowMeter for the extraction of network data from twenty-five users over five days. This dataset includes Heartbleed, DoS, and DDoS attacks and has 80 features.
The CSE-CIC-IDS 2018 dataset uses synthetic user profiles, which abstractly represent network events and behaviors of 420 computers and 30 servers collected with CICFlowMeter-V3 [15]. This dataset has eighty-four features and includes four types of attacks: botnet, brute force, denial-of-service, and distributed DoS.
ToN-IoT, published in 2021 by researchers at the University of New South Wales, includes Zeek netflow data, system operation logs from both Windows and Linux operating systems, and telemetry datasets from a collection of seven IoT and IIoT devices. The dataset has nine attack families, and the authors specifically emphasized the need for a standardization of feature descriptions and cyberattack classes. This dataset is the first that combines netflow, IoT telemetry, and operating system data [16].
Additional comparisons of network attack datasets can be found in [17,18]. These comparisons are based largely on network statistics, types of attacks, whether the data is synthetic or not, and the size of the dataset. Hence, from the literature it is apparent that, to date, there is no modern labelled network dataset using the MITRE ATT&CK framework, as is created in this work.

Overall Architectural Framework
Figure 1 presents the overall architectural framework for data collection. It shows how both the cyber range and big data platform were used for this research. Cybersecurity attacks generated in the University of West Florida (UWF)'s cybersecurity classes were collected using Security Onion in two formats: Zeek and PCAPs. Mission logs, which contain the MITRE ATT&CK data, were collected and used to label the network attack data. The data was transferred daily from the Security Onion virtual machine running on the cyber range to the big-data platform, Hadoop's distributed file system (HDFS).

The UWF Cyber Range
The UWF cyber range is an internet-accessible VMware vCenter, reachable from an HTML5 browser, that consists of one vCenter server appliance and three ESXi servers. The cyber range allows for the development of a full spectrum of cybersecurity skills, including malware analysis, offensive cyber operations, and defensive cyber operations, in the safety of a sandbox environment. The software stack consists of virtualization, router/firewall, penetration testing/cybersecurity, intrusion detection system/threat hunting, insecure target application servers, and insecure target platform servers (Windows and Linux). Specifically, it is built on VMware vSphere, VMware's virtualization platform, which transforms data centers into aggregated computing infrastructures that include CPU, storage, and networking resources [19]. vSphere manages these infrastructures as a unified operating environment and provides tools to administer the data centers that participate in this environment. The range runs VMware vCenter Server Essentials Version 6.7. Figure 1 presents UWF's cyber range, and its specifications are presented in Table 2.


UWF's Hadoop Cluster
UWF's big-data platform is an HTML5-accessible JupyterLab, reached via secure shell protocol (SSH) tunneling for security, as shown in Figure 2. The software stack consists of HDFS and the Spark distributed computing system.


• Red Hat Enterprise Linux: Red Hat Enterprise Linux is the world's leading enterprise Linux platform, which is certified on hundreds of clouds and with thousands of hardware and software vendors [20]. Red Hat Enterprise Linux can be purchased to support specific use cases such as edge computing or SAP workloads, but every subscription includes these core benefits.
• Podman: Podman is a daemonless container engine for developing, managing, and running OCI containers on Linux systems [21]. Containers can either be run as root or in a rootless mode.
• Apache HDFS: Apache HDFS is a distributed file system that provides high-throughput access to application data [22].
• Apache Spark: Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters [23].
• JupyterLab: JupyterLab is the latest web-based interactive development environment for notebooks, code, and data [24]. Its flexible interface allows users to configure and arrange workflows in data science, scientific computing, computational journalism, and machine learning. A modular design allows extensions to expand and enrich functionality.
The hardware stack includes nine servers. UWF's Hadoop cluster consists of one Hadoop name node and five Hadoop worker nodes. The Apache Hadoop software library is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models [22]. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. The cluster runs Apache Hadoop Version 3.3.1-RC3 on Red Hat Enterprise Release 8 (Figure 3). The cluster has a storage capacity of 214.88 TB.

UWF's Spark Cluster
UWF's Spark cluster consists of one Spark master and five Spark workers. Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters [23]. The cluster runs Apache Spark Version

Generating and Collecting the Data
These data were collected from a cyber wargaming course, designed and offered at the University of West Florida, Pensacola, Florida, USA. The theme of the course was that every organization, whether government or private, needs IT personnel to defend its networks against attack. The most effective way to provide this experience was to recreate scenarios that students will see in the real world. Exercises were provided in which students performed specific roles in attacking and defending IT infrastructures. The cyber wargaming courses used UWF's cyber range, as presented in Figure 5.

This VMware vCenter allowed access to virtualized networks from the Internet via a hypertext markup language version 5 (HTML5)-compatible web browser. Network principles were practiced with the aid of pfSense, which acted as a router with a built-in firewall. Offensive cyber operations (OCO) (e.g., red team or penetration testing) were practiced in the safety of the closed virtualized networks provided by the UWF cyber range using Kali Linux, implementing the full Lockheed Martin kill chain (e.g., the use of EternalBlue) [25]. Kali Linux is an open-source penetration-testing, security-research, computer-forensics, and reverse-engineering distribution based on Debian Linux [26]. Defensive cyber operations (DCO) and network operations (NetOps) (e.g., blue team, hunt, network monitoring) were conducted using Security Onion with both built-in and custom IDS rules via Snort and Suricata, and analytics using Elastic Stack dashboards and visualizations were used to detect events. Security Onion is a free and open-source threat-hunting, network-security-monitoring, and log-management platform including best-of-breed open-source tools (e.g., Zeek, Wazuh, and Elastic Stack) [27]. A target-rich and diverse environment was provided by both Windows and Linux variants of Metasploitable. Metasploitable is a virtual machine with numerous built-in security vulnerabilities (e.g., security vulnerabilities found in GlassFish, Apache Struts, Tomcat, Jenkins, IIS FTP, IIS HTTP, psexec, SSH, WinRM, Chinese caidao, ManageEngine, ElasticSearch, Apache Axis2, WebDAV, SNMP, MySQL, JMX, WordPress, SMB, Remote Desktop, phpMyAdmin), which are intended to be exploited using Metasploit, such as the Metasploit Framework found in Kali Linux [18]. These VMs formed the basis of our virtualized networks, but many other VMs and services have found their way into the network for research and educational purposes (e.g., Splunk, CentOS, Windows XP, Windows 7, Windows 10, lightweight directory access protocol (LDAP), active directory (AD)). pfSense, Kali Linux, Security Onion, and Metasploitable were arranged in such a manner that each group had their own network (Figure 5).
The cyber wargaming stages of the cyber operation topics (i.e., reconnaissance, gaining access, hiding presence, establishing persistence, execution, and assessment) were assessed through labs conducted within UWF's cyber range (e.g., conduct a reconnaissance offensive cyber operation on your target(s)). The attacks were recorded using the Security Onion VM, producing Zeek logs and PCAP files. Mission logs that contained the MITRE ATT&CK data were collected and used to label the network attack data. The data was transferred daily from the Security Onion virtual machine running on the cyber range to the big-data platform, HDFS.
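The daily transfer from the Security Onion VM to HDFS could be scripted along the following lines. This is only a minimal sketch: the paper does not describe the transfer mechanism, so the local log directory, the HDFS target directory, and the function name are all hypothetical, and the sketch merely builds an `hdfs dfs -put` command rather than running it.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical locations: the paper does not state the actual paths.
LOCAL_ZEEK_DIR = PurePosixPath("/nsm/zeek/logs")
HDFS_ROOT = PurePosixPath("/datasets/uwf-zeekdata22")


def build_daily_put_command(day: date) -> list[str]:
    """Build the command that copies one day's log directory into HDFS.

    The day's directory is copied into HDFS_ROOT, which is assumed to
    exist already, so it ends up at HDFS_ROOT/<YYYY-MM-DD>.
    """
    source = LOCAL_ZEEK_DIR / day.isoformat()
    # -f overwrites files that already exist at the destination.
    return ["hdfs", "dfs", "-put", "-f", str(source), str(HDFS_ROOT)]


command = build_daily_put_command(date(2022, 2, 10))
```

The resulting list can be handed to `subprocess.run(command, check=True)` from a scheduled job on a machine where the Hadoop client is installed.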

To date, 208.62 GB of Zeek logs and PCAPs have been collected. A total of 16 weeks of network traffic data have been collected using 81 subnets.

The Data
This dataset contains several files that include nominal and numeric as well as object variables.To completely understand this dataset, it is necessary to have a good understanding of Zeek as well as the MITRE ATT&CK framework, both of which are fairly complex structures.

Zeek
Zeek, in many ways, exceeds the capabilities of other network-monitoring tools and is also highly customizable. Zeek produces an extensive set of logs that describe the network activity. These logs not only document every connection but also document application-layer transcripts such as DNS requests with replies. Table 4 shows the Zeek log files collected in this experimentation process, the total count of records in each file, and a description of each of the files. The field names of each of the files are given in Appendix A. For further information on the files, for example, the field types, null values, and other information, please refer to [1].
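To make the structure of a Zeek log concrete, the sketch below parses a log in Zeek's default tab-separated text format, where a `#fields` header line names the columns and `-` marks an unset value. It is an illustration only: the sample is a heavily trimmed, made-up DNS log, the function name is hypothetical, and the published dataset files may be stored in a different format (see [1]).

```python
import io

# Made-up, trimmed sample in Zeek's default TSV layout (real dns.log files
# contain many more fields).
SAMPLE_DNS_LOG = (
    "#separator \\x09\n"
    "#fields\tts\tuid\tid.orig_h\tid.resp_h\tquery\n"
    "#types\ttime\tstring\taddr\taddr\tstring\n"
    "1644500000.000000\tCabc123\t10.0.0.5\t10.0.0.1\texample.com\n"
    "1644500001.500000\tCdef456\t10.0.0.6\t10.0.0.1\t-\n"
    "#close\t2022-02-10-14-00-00\n"
)


def read_zeek_tsv(stream):
    """Return the data rows of a Zeek TSV log as a list of dicts."""
    fields = []
    records = []
    for line in stream:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # Column names follow the "#fields" tag, separated by tabs.
            fields = line.split("\t")[1:]
        elif line.startswith("#") or not line:
            continue  # other metadata lines and blank lines carry no data
        else:
            # Zeek writes "-" for a field that was not set.
            values = [None if v == "-" else v for v in line.split("\t")]
            records.append(dict(zip(fields, values)))
    return records


records = read_zeek_tsv(io.StringIO(SAMPLE_DNS_LOG))
```

Each record is keyed by the Zeek field names, so the `uid` value that links a DNS record to its connection record is available as `records[0]["uid"]`.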

MITRE ATT&CK Framework
ATT&CK is a behavioral model consisting of tactics, techniques, and sub-techniques. It documents known adversary behavior. The first ATT&CK model, created in September 2013, was refined and released in May 2015 with ninety-six techniques organized under nine tactics. Since then, the ATT&CK model has experienced tremendous growth based on contributions from the cybersecurity community and has had several updated versions. The April 2021 version, used for the creation of UWF-ZeekData22, has 14 tactics as well as 191 techniques and 358 sub-techniques, for a grand total of 576 techniques. In order to keep the techniques at a manageable level as well as address some of the new abstractions of the techniques, sub-techniques were added to the knowledge base in 2020.
The MITRE ATT&CK framework reflects various phases of an adversary's attack lifecycle. Tactics represent an adversary's goals for an attack. Tactics are the ways that adversaries perform an operation, such as persist, discover information, move laterally, or execute files.
Tactics have techniques, and techniques have sub-techniques. A technique or sub-technique can be used to perform one or multiple tactics, and there can be multiple techniques for each tactic. Likewise, there are multiple ways to perform a technique, so there can be multiple sub-techniques for each technique. However, not all techniques have sub-techniques.
Techniques or sub-techniques show what an adversary intends to do. For example, for the discovery tactic, the technique or sub-technique may show what type of information an adversary is after. Techniques and sub-techniques, which are actions for a tactic, are implemented with procedures and used for achieving tactical goals. Sub-techniques further break down behaviors described by techniques [28].
The MITRE ATT&CK framework responds quickly to newly identified adversarial behavior, either by adding a new technique or sub-technique or by making an existing technique or sub-technique inclusive of the new behavior. Some techniques were originally very broad, leaving room to add sub-techniques and hence limiting the need to create new techniques every time. However, each sub-technique has a relationship with only a single parent technique, and the sub-technique is not required to fall under all tactics that its technique is in. This is shown in Figure 6 with the blue lines. Sub-technique2 falls under Technique1 and Tactic1, but not Tactic2, although Technique1 falls under both Tactic1 and Tactic2. Similarly, Sub-technique3 falls under Technique3 and Tactic2, though Technique3 falls under both Tactic2 and Tactic3. This presents an interesting problem for the data structure and how the data is stored, which is discussed briefly in the latter part of this paper. Hence, since the ATT&CK model is just as much about the mindset of the user as the process of using it (with a combination of various techniques and sub-techniques), it can be used to develop profiles of adversary groups, which can be used to improve defensive measures [29].
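The relationships described for Figure 6 can be captured in a small data structure. The sketch below is illustrative and is not the storage layout used for the dataset: the class and method names are hypothetical, and it assumes that a sub-technique's tactics are always a subset of its parent technique's tactics, which is how we read the description above.

```python
from dataclasses import dataclass, field


@dataclass
class SubTechnique:
    name: str
    parent: str        # exactly one parent technique
    tactics: set[str]  # may be a strict subset of the parent's tactics


@dataclass
class Technique:
    name: str
    tactics: set[str]
    sub_techniques: dict[str, SubTechnique] = field(default_factory=dict)

    def add_sub_technique(self, name, tactics):
        """Attach a sub-technique, checking it stays within the parent's tactics."""
        tactics = set(tactics)
        if not tactics <= self.tactics:
            raise ValueError(f"{name} uses tactics outside those of {self.name}")
        sub = SubTechnique(name=name, parent=self.name, tactics=tactics)
        self.sub_techniques[name] = sub
        return sub


# The two cases highlighted in Figure 6.
technique1 = Technique("Technique1", {"Tactic1", "Tactic2"})
sub2 = technique1.add_sub_technique("Sub-technique2", {"Tactic1"})

technique3 = Technique("Technique3", {"Tactic2", "Tactic3"})
sub3 = technique3.add_sub_technique("Sub-technique3", {"Tactic2"})
```

Storing the tactic set on each sub-technique, rather than inheriting it from the parent, is what lets Sub-technique2 appear under Tactic1 without also appearing under Tactic2.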

Tactics Available in UWF-ZeekData22
All 14 tactics presently available in the MITRE ATT&CK framework are now available in this dataset, UWF-ZeekData22. Table 5 presents the tactics found in UWF-ZeekData22. Not all tactics are disruptive to information systems. For example, some tactics such as initial access, discovery, and credential access are mainly focused on breaching the confidentiality of information and can be used to gain information and obtain more access within an environment, with the eventual goal of obtaining information through collection and exfiltration.

| Attack Type | Description |
| --- | --- |
| Reconnaissance | Active or passive tactics for gathering information that can be used to plan future operations. |
| Discovery | Tactics that may be used to gain knowledge about the system and internal network. |
| Credential access | Tactics for stealing credentials such as account names and passwords. |
| Privilege escalation | Tactics used to gain higher-level permissions on systems or networks. |
| Exfiltration | Tactics that may be used to steal data from the network. |
| Lateral movement | Tactics used to enter and control remote systems on networks. |
| Resource development | Tactics to try to establish resources that can be used to support operations. |
| Initial access | Tactics that use various entry vectors to gain an initial foothold within the network. |
| Persistence | Tactics used to keep access to systems across restarts, changed credentials, and other interruptions. |
| Defense evasion | Tactics used to avoid detection throughout their compromise. |
| Execution | Tactics to try to run malicious code. |
| Collection | Tactics to try to gather data to reach a goal. |
| Command and control | Tactics to try to communicate with compromised systems to control them. |
| Impact | Tactics to try to manipulate, interrupt, or destroy systems and data. |
Table 6 presents the distribution of malicious traffic in the UWF-ZeekData22 dataset, labelled as per the MITRE ATT&CK framework. There were 10 attack types (or tactics), but reconnaissance made up 99.97% of the attacks in this dataset. Defense evasion, by contrast, appears in a single record (listed in Table 6 as 1.07749 × 10^−7).

Mapping and Labeling the Data
One of the major challenges faced in creating this dataset was mapping and labelling the attacks or Zeek logs as per the MITRE ATT&CK framework. This was conducted with the help of the mission logs and is detailed in the next two sections. We present the pseudo-algorithms used to label two data files: the DNS data file and a similar sub-set of the DNS mappings that were used to label the Conn data files. The slop factor (Figure 8) was used to account for any human error that might have occurred in the recording of the mission logs. Since the manual timings entered in the mission logs might be off by plus or minus a few minutes, a time interval of plus or minus a few minutes was used when matching the actual time of the attack.
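The effect of the slop factor can be sketched as a simple time-window test. The snippet below is a minimal illustration and not the authors' code: the five-minute value and the function name are assumptions (the text only says "a few minutes"), and it assumes the slop widens the window on both sides, i.e., it is subtracted from the start time and added to the end time.

```python
from datetime import datetime, timedelta

# Assumed value: the paper specifies only "plus or minus a few minutes".
SLOP = timedelta(minutes=5)


def within_mission_window(event_time, mission_start, mission_end, slop=SLOP):
    """Return True if the event falls inside the mission window widened by slop."""
    return mission_start - slop <= event_time <= mission_end + slop


start = datetime(2022, 2, 10, 14, 0)
end = datetime(2022, 2, 10, 14, 30)

# Logged two minutes before the recorded start: still treated as part of the mission.
early_hit = within_mission_window(datetime(2022, 2, 10, 13, 58), start, end)
# Logged twenty minutes after the recorded end: outside the widened window.
late_miss = within_mission_window(datetime(2022, 2, 10, 14, 50), start, end)
```

A larger slop tolerates sloppier manual timekeeping but also raises the chance of labelling unrelated traffic between the same endpoints as malicious.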

Labeling the DNS Data File
Since there were multiple techniques for each tactic and multiple sub-techniques for each technique, the STIX data contained many array-type attributes. We chose to flatten these; Table 7 shows a generic base case with array data, and Table 8 shows the method we used to flatten our data applied to the generic case from Table 7.
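Flattening of this kind can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of Tables 7 and 8: the record layout and function name are hypothetical, and when a record has several array-valued fields the sketch emits one row per combination of values, which is one common way to flatten such data.

```python
from itertools import product


def flatten_record(record, array_fields):
    """Return one row per combination of values in the array-valued fields."""
    scalar_part = {k: v for k, v in record.items() if k not in array_fields}
    # An empty array still yields one row, with None in that column.
    value_lists = [record[name] or [None] for name in array_fields]
    rows = []
    for combo in product(*value_lists):
        row = dict(scalar_part)
        row.update(zip(array_fields, combo))
        rows.append(row)
    return rows


# Hypothetical unflattened record with two array columns.
example = {
    "uid": "C1",
    "ip": ["10.0.0.5", "10.0.0.6"],
    "mitre_attack": ["Technique1", "Technique2"],
}
flat_rows = flatten_record(example, ["ip", "mitre_attack"])
```

Flattening trades storage for simplicity: the row count grows with the array lengths, but each row then holds scalar values only and can be joined or filtered directly.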



Traffic Analysis
Table 9 shows the distribution of the malicious vs. non-malicious traffic in this new dataset, UWF-ZeekData22. The non-malicious traffic was collected at a time when there was no possibility of an attack.

Traffic Analysis of Cumulative Flows
A summary traffic analysis is presented for the cumulative flows during the period of data collection while generating the UWF-ZeekData22 dataset. Table 11 presents the dataset statistics, which show the number of flows, total source bytes, total destination bytes, number of source packets, number of destination packets, protocol types, number of normal and abnormal records, and the number of unique source/destination IP addresses for the data collection period.
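Statistics of this kind can be derived from labeled Conn records with a short aggregation. The sketch below is illustrative only and uses made-up records: it relies on the standard Zeek conn.log field names (`id.orig_h`, `id.resp_h`, `proto`, `orig_bytes`, `resp_bytes`, `orig_pkts`, `resp_pkts`), assumes normal records carry `label_tactic == "none"` as in the labeling step, and the function name is hypothetical.

```python
from collections import Counter


def summarize_flows(conn_rows):
    """Compute cumulative flow statistics over labeled Conn records."""
    return {
        "flows": len(conn_rows),
        "src_bytes": sum(r["orig_bytes"] for r in conn_rows),
        "dst_bytes": sum(r["resp_bytes"] for r in conn_rows),
        "src_pkts": sum(r["orig_pkts"] for r in conn_rows),
        "dst_pkts": sum(r["resp_pkts"] for r in conn_rows),
        "protocols": Counter(r["proto"] for r in conn_rows),
        "normal": sum(1 for r in conn_rows if r["label_tactic"] == "none"),
        "abnormal": sum(1 for r in conn_rows if r["label_tactic"] != "none"),
        "unique_src_ips": len({r["id.orig_h"] for r in conn_rows}),
        "unique_dst_ips": len({r["id.resp_h"] for r in conn_rows}),
    }


# Made-up records for illustration.
sample_rows = [
    {"id.orig_h": "10.0.0.5", "id.resp_h": "10.0.0.9", "proto": "tcp",
     "orig_bytes": 100, "resp_bytes": 200, "orig_pkts": 3, "resp_pkts": 4,
     "label_tactic": "Reconnaissance"},
    {"id.orig_h": "10.0.0.6", "id.resp_h": "10.0.0.9", "proto": "udp",
     "orig_bytes": 50, "resp_bytes": 80, "orig_pkts": 1, "resp_pkts": 1,
     "label_tactic": "none"},
    {"id.orig_h": "10.0.0.5", "id.resp_h": "10.0.0.1", "proto": "tcp",
     "orig_bytes": 0, "resp_bytes": 0, "orig_pkts": 1, "resp_pkts": 0,
     "label_tactic": "none"},
]
summary = summarize_flows(sample_rows)
```

On the full dataset the same aggregation would more naturally be run on the Spark cluster, since in-memory Python lists do not scale to the collected volume.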

Conclusions
In conclusion, UWF-ZeekData22 can be considered a modern NIDS benchmark dataset and will be useful to the NIDS research community. Since it is based on the MITRE ATT&CK framework, in addition to the network traffic analysis that is usually carried out using machine learning, other aspects of adversarial behavior can also be studied using this dataset, which is available at datasets.uwf.edu [1].

Figure 1. UWF cyber range and big-data platform.

Figure 3. Hadoop cluster user interface.

UWF's Hadoop and Spark clusters share the same hardware. The clusters:
• Use one Dell PowerEdge R730 with 40 cores, 128 GB memory, and a minimal amount of storage as the Hadoop name node and Spark master (Table 3);
• Use five Dell PowerEdge R730xd, while maximizing the storage, as the Hadoop worker nodes and Spark workers;
• The cluster is interconnected using two bonded 10 Gbps

1 (Apache Software Foundation, United States) on Red Hat Enterprise Release 8 (Figure 4). The cluster has a computing capacity of 460 GHz/200 cores and 640 GB of memory.

Data 2023, 8

Figure 7 presents a flowchart of the process used to map and label the DNS data file. The numbers in the figure correspond to the numbered list below.

Figure 7. Process to label the DNS data file.

3. Join mission logs and preprocessed Conn file on the following:
   a. Time (see Figure 8 for specifics on slop factor): Conn datetime ≥ mission log start time (±slop factor) AND Conn datetime ≤ mission log end time (±slop factor); AND
   b. IP: Conn src ip == mission log src ip AND Conn dest ip == mission log dest ip; AND
   c. Port: Conn src port == mission log src port AND Conn dest port == mission log dest port;
4. Join labeled Conn and STIX data;
   a. Flatten array columns (IP and MITRE attacks). Unflattened and flattened data are shown in Tables 7 and 8, respectively;
   b. Map MITRE technique (already in Conn) to MITRE tactic (with mappings from STIX data);
      i. Some techniques map to multiple tactics. This is handled by flattening the array;
5. Mix benign data;
   a. Label with mitre_attack == none, label_tactic == none;
6. Join labeled Conn with raw DNS to produce labeled DNS;
   a. FROM Conn SELECT uid, mitre_attack, label_tactic; FROM dns SELECT all; Join on conn.uid == dns.uid.
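Steps 3 to 6 above can be sketched in plain Python as shown below. This is a simplified illustration under stated assumptions, not the authors' implementation (which ran on the big-data platform): the record layouts, function names, and the five-minute slop are hypothetical, only `uid`, `mitre_attack`, and `label_tactic` are taken from the steps above, and the technique-to-tactic dictionary stands in for the mappings extracted from the STIX data.

```python
from datetime import datetime, timedelta

SLOP = timedelta(minutes=5)  # assumed value for the slop factor
KEYS = ("src_ip", "dest_ip", "src_port", "dest_port")


def label_attack_conn(conn_rows, missions, technique_to_tactics, slop=SLOP):
    """Steps 3 and 4: join Conn with mission logs, then map techniques to tactics."""
    labeled = []
    for conn in conn_rows:
        for m in missions:
            in_window = m["start"] - slop <= conn["ts"] <= m["end"] + slop
            if in_window and all(conn[k] == m[k] for k in KEYS):
                for technique in m["techniques"]:
                    # A technique may map to several tactics: one row per tactic.
                    for tactic in technique_to_tactics.get(technique, []):
                        labeled.append({**conn, "mitre_attack": technique,
                                        "label_tactic": tactic})
    return labeled


def label_benign(conn_rows):
    """Step 5: benign traffic is labeled with 'none' in both columns."""
    return [{**c, "mitre_attack": "none", "label_tactic": "none"} for c in conn_rows]


def join_dns(labeled_conn, dns_rows):
    """Step 6: copy labels from Conn onto DNS records that share the same uid."""
    labels_by_uid = {}
    for row in labeled_conn:
        labels_by_uid.setdefault(row["uid"], []).append(
            (row["mitre_attack"], row["label_tactic"]))
    return [{**dns, "mitre_attack": attack, "label_tactic": tactic}
            for dns in dns_rows
            for attack, tactic in labels_by_uid.get(dns["uid"], [])]


# Made-up inputs for illustration.
endpoints = {"src_ip": "10.0.0.5", "dest_ip": "10.0.0.9",
             "src_port": 40000, "dest_port": 53}
missions = [{"start": datetime(2022, 2, 10, 14, 0),
             "end": datetime(2022, 2, 10, 14, 30),
             "techniques": ["T1595"], **endpoints}]
technique_to_tactics = {"T1595": ["Reconnaissance"]}
attack_conn = [{"uid": "C1", "ts": datetime(2022, 2, 10, 13, 58), **endpoints},
               {"uid": "C2", "ts": datetime(2022, 2, 10, 15, 0), **endpoints}]
benign_conn = [{"uid": "C3", "ts": datetime(2022, 2, 11, 9, 0), **endpoints}]
dns_rows = [{"uid": "C1", "query": "target.example"},
            {"uid": "C3", "query": "updates.example"},
            {"uid": "C9", "query": "unrelated.example"}]

labeled_conn = (label_attack_conn(attack_conn, missions, technique_to_tactics)
                + label_benign(benign_conn))
labeled_dns = join_dns(labeled_conn, dns_rows)
```

In this example, connection C1 is labeled because it falls within the widened mission window, C2 is dropped from the attack set because it falls outside it, and DNS record C9 is left out of the result because no labeled connection shares its uid.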

Figure 9 presents a flowchart of the process to map and label the Conn data file. The numbers in the figure correspond to the numbered list in Section 6.1.

Figure 9. Process to label the Conn data file.

Table 1. Comparing major network intrusion detection datasets.

Table 6. Distribution of malicious traffic in UWF-ZeekData22.

Table 9. Malicious vs. non-malicious traffic.

A total of 60 different tactics and techniques are available in UWF-ZeekData22. More information about the specific tactics and techniques (both flattened and unflattened) is available in [1]. Table 10 presents an unflattened count of each of the different tactics and techniques in UWF-ZeekData22.