Generating Network Intrusion Detection Dataset Based on Real and Encrypted Synthetic Attack Traffic

Abstract: The lack of publicly available up-to-date datasets contributes to the difficulty in evaluating intrusion detection systems. This paper introduces HIKARI-2021, a dataset that contains encrypted synthetic attacks and benign traffic. This dataset conforms to two requirements: the content requirements, which focus on the produced dataset, and the process requirements, which focus on how the dataset is built. We compile these requirements to enable future dataset developments and we make the HIKARI-2021 dataset, along with the procedures to build it, available for the public.


Introduction
It is challenging to estimate how much malicious traffic detection methods have improved in the intrusion detection system (IDS) field. Training IDSs that employ machine learning depends on the available datasets, but obtaining a reliable dataset for comparison is difficult. Among the factors that make it difficult to compare datasets are a lack of proper documentation of the methods [1], a lack of comparison methodology [2], and a lack of important properties, such as ground-truth labels, public availability, and traffic from a real-world environment. Furthermore, most network traffic is now encrypted for communication security and privacy, and very few datasets reflect this situation.
The dataset is an essential part of building machine learning-based IDS models. The process starts with capturing traffic from the internet, either as packets or as flows. Afterward, the captured traffic is compiled into a specific type of data containing network-related features, including labels. A general machine learning-based IDS is shown in Figure 1. Labeling is a crucial process for the dataset. Handling ground-truth is a real challenge, especially when experts cannot determine whether the traffic is an attack or benign. This is one reason why researchers use synthetic traffic. However, this implies that the generated traffic is not representative of the real-world environment. In a nutshell, the process of making a dataset starts with capturing traffic and ends with the final preprocessing phase. The final result of the preprocessing phase is a labeled dataset in which each data point is classified as malicious or benign. The file contains tabular data in a human-readable format, such as a CSV file, or in binary form, such as an IDX file. The number of detected malicious instances or false alarms can be used to benchmark the dataset.
Existing datasets lack reliable encrypted traces and are impractical to reproduce, which limits their use as a basis for building comprehensive models that detect new attacks. Most of the existing research that employs encrypted traffic is focused on different scopes, such as traffic classification and analysis [3]. Although such research exists [4], the dataset is not publicly available, due to the sensitivity of the data. Benchmark datasets are an important basis to evaluate and compare the quality of different IDSs. Based on the detection methods, there are three types of IDS: signature-based, anomaly-based, and a combination of signature-based and anomaly-based. These three types of IDS benchmark their systems with the KDD99 dataset, which is obsolete. Research on the signature-based type focuses on building automatic signature generation [5], while research on the anomaly-based type focuses on observing outliers from the legitimate profile [6]. The signature-based type relies on a pattern-matching method that attempts to match traffic against a signature database. When an attack attempt matches a signature, an alert is raised. The signature-based type has the highest accuracy and lowest false alarm rate, but it cannot detect unknown attacks. The anomaly-based type might detect unknown attacks by comparing abnormal traffic with normal traffic, but its false alarm rate remains high.
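The difference between the two detection styles can be illustrated with a toy sketch. The signature patterns, the z-score rule, and the threshold below are assumptions chosen for illustration only; they are not taken from any IDS discussed in this paper.

```python
import statistics

# Hypothetical signature database: byte patterns assumed to indicate an attack.
SIGNATURES = {
    "sql-injection": b"' OR 1=1",
    "path-traversal": b"../../",
}


def signature_alerts(payload: bytes) -> list:
    """Return the names of all signatures whose pattern occurs in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]


def is_anomalous(value: float, normal_profile: list, threshold: float = 3.0) -> bool:
    """Flag a value lying more than `threshold` standard deviations
    away from the mean of the normal (legitimate) profile."""
    mean = statistics.mean(normal_profile)
    stdev = statistics.stdev(normal_profile)
    return abs(value - mean) > threshold * stdev


if __name__ == "__main__":
    # Signature matching needs a readable payload.
    print(signature_alerts(b"GET /index.php?id=1' OR 1=1 --"))
    # Anomaly detection compares a flow statistic with the normal profile.
    normal_bytes_per_flow = [480.0, 510.0, 495.0, 505.0, 520.0, 490.0]
    print(is_anomalous(50_000.0, normal_bytes_per_flow))
```

Note that payload pattern matching of this kind sees only ciphertext when the traffic is carried over TLS, which is one reason flow-level features matter for encrypted traffic.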
In this paper, we present a tool and requirements for making a new dataset created by generating encrypted network traffic in a real-world environment. Our contributions are two-fold. First, we propose new requirements for creating new datasets. Second, we create a new IDS dataset that covers the network traffic with encrypted traces. The dataset is labeled with attacks, such as brute force login and probing. The packet traces with payload are provided along with the background traffic and ground-truth data. We extract and adopt more than 80 features from the CICIDS-2017 dataset for the ground-truth, benign traffic, and malicious traffic by using Zeek [7], an open source network security monitoring tool.
The paper is organized as follows. In Section 2, we review the existing datasets and we provide the most important features from their dataset, such as the duration of capturing of the network traffic, what kind of attack they implemented, and what format of data they used. From the review, we summarize the requirements that need to be satisfied to build a practical, implemented dataset and compare it among the existing datasets in Section 3. In Section 4, we describe the dataset generation methodology along with the attack traffic generation and explain the characteristics of the attack traffic. Subsequently, we describe the network configuration for generating network traffic, the scenarios, the tools and code we used to generate, and the duration of capturing the network features. In Section 5, we analyze the dataset and provide information on how the labeling works. Finally, the last section concludes this paper.

Review of Existing Datasets
Many researchers have published papers based on generated IDS datasets, such as ISCX [8], UNSW-NB15 [9], and UGR'16 [10]. In this section, we introduce several IDS datasets with their characteristics. We highlight several important requirements from their perspective.

KDD99
The KDD99 dataset was created in 1999, using tcpdump, and was built based on the data captured by the DARPA 98 IDS evaluation program [11]. The training data are around four gigabytes of compressed TCP data from seven weeks of network traffic. The network traffic contains attack traffic and normal traffic. The capture of the network traffic was done in a simulated environment. The dataset contains a total of 24 attack types, which fall into four main categories: Denial of Service (DoS), Remote to Local (R2L), User to Root (U2R), and probing. KDD99 has been used extensively in IDS research. The report [12] showed that during 2010-2015, 125 published papers performed IDS evaluation using KDD99. While this dataset is considered inadequate for evaluation, for example because it records many redundant instances, the main problem is that it is not up to date with the current situation and attack vectors. Although many researchers are already aware of these shortcomings, other groups of researchers argued that this dataset is the most widely used for benchmarking [13] or limited their study to KDD99 [14].

MAWILab
MAWI was built in 2001 and consists of a set of labels locating traffic anomalies in the MAWI archive [15]. This dataset contains tcpdump packet traces captured from an operational testbed network on a link between Japan and the United States. The archive contains 15 min of daily traces, and it is huge because it covers a very long period. The labeled MAWI archive is known as MAWILab, obtained from a graph-based methodology that combines different and independent anomaly detectors [16]. Because the labeling of the MAWI archive is based on inference, no ground-truth traffic is available for evaluation. The label has three classes: anomalous, for a true anomaly; suspicious, for traffic that is likely to be anomalous; and notice, for traffic reported as an anomaly without a consensus among all anomaly detectors. Several researchers used MAWILab for anomaly detection [17] and for generating labeled flows [18]. The limitation of this dataset is that the packet traces are captured for only 15 min each day. The header information is available in the packet traces but the payload is removed.

CAIDA (Center for Applied Internet Data Analysis)
CAIDA has several different types of datasets, categorized as ongoing, one-time snapshot, and complete [19]. CAIDA collects the data from different locations, and each of the datasets has different characteristics, such as Distributed Denial of Service (DDoS) attack, UDP probing, BGP monitoring, IPv4 census with passive traffic traces captured from a darknet, an academic ISP, and a residential BGP with active measurements of ICMP ping, HTTP GET and traceroutes. In most of the datasets, the IP addresses and the payload are anonymized, which severely reduces their usefulness. Based on their catalog, during 2017-2020, most of the papers related to IDS and security focused on Denial of Service (DoS) [20,21], Distributed Denial of Service (DDoS) [22], DNS security [23], and the Network Telescope Daily Randomly and Uniformly Spoofed Denial-of-Service (RSDoS) Attack Metadata. Each record of the RSDoS metadata contains the IP address of the attack victim, the number of distinct attacker IPs in the attack, the number of distinct attacker ports and target ports, the cumulative number of packets observed in the attack, the cumulative number of bytes seen for the attack, the maximum packet rate seen in the attack as the average per minute, the timestamp of the first and the last observed packet of the attack, the autonomous system number of target_IP at the time of the attack, and the country and continent geolocation of target_IP at the time of the attack. This dataset is updated every day.

SimpleWeb
SimpleWeb is a dataset collected from the network of the University of Twente [24]. This dataset contains packet headers of all packets that are transmitted over the uplink of access to the internet. The packets are captured with tcpdump and Netflow version 5. The payload from the packets is removed because it contains sensitive information, such as HTTP requests or conversations of instant messaging applications. The labeled dataset for suspicious traffic is collected by using a honeypot server. Despite no ground-truth data being available, researchers still use it to compare with the different real-world environment (e.g., campus network, backbone link) [25], while others employ it for background traffic for botnet detection [26], and to evaluate publicly available dataset for similarity searches to detect network threats [27].

NSL-KDD
NSL-KDD is an updated dataset that tries to solve some of the inherent problems in the KDD99 dataset [28]. The NSL-KDD dataset contains features and labels indicating either normal or an attack, with specific types of attacks. Every instance in the training set describes a single connection session, whose features are divided into four groups: basic features from the network connection, content-related features, time-related features, and host-based traffic features. Each instance is labeled either as normal or attack. These attacks are categorized into four groups: Denial of Service (DoS), User to Root (U2R), Remote to Local (R2L), and Probing. Many researchers use it as a benchmark to compare the performance of their intrusion detection systems. Several groups of researchers applied it to different scopes, such as IoT-based networks [29] and Vehicular Ad Hoc Networks (VANET) [30]. The former is for SYN flood, UDP flood, and Ping of Death (PoD) detection, while the latter is mostly for DDoS detection. Other researchers switched from conventional machine learning to deep learning based methods [31][32][33].

IMPACT
IMPACT is a marketplace of cyber-risk data. The data distribution and tool repository are provided by multiple providers and stored and accessed from multiple hosting sites [34]. The datasets related to cyber-attacks, such as the daily feed of network flow data produced by Georgia Tech Information Security Center's malware analysis system, are updated once a year. These datasets are only open to specific countries, subject to approval by the Department of Homeland Security (DHS).

UMass
UMass is a trace repository provided by the University of Massachusetts Amherst [35]. The network-attack-relevant data are provided as various types of data, such as traffic flow from the TOR network [36], a trace of attack simulation on a peer-to-peer data sharing network [37], and a passive localization attack simulation with the reality mining dataset [38], which contains sensor data (proximity, location, location labels, etc.) and survey data (personal attributes, research group, position, neighborhood of apartment, and lifestyle).

Kyoto
This dataset was created between 2006 and 2015 by Kyoto University through honeypot servers. This dataset was created using Bro 2.4 (the former name of Zeek) with 24 statistical features consisting of 14 features extracted based on the KDD99 dataset and an additional 10 features, such as IDS_detection, Malware_detection, Ashula_detection, Label, Source_IP_Address, Source_Port_Number, Destination_IP_Address, Destination_Port_Number, Start_Time, and Protocol [39]. The information is limited to the attack information targeting the honeypot server. There are no packet traces or information about the payload. Furthermore, the information on how to label the dataset is not found [40]. Several published papers using the Kyoto dataset focused on anomaly detection, especially on the feature analysis [41], feature dimensionality reduction and ensemble classifier [42].

IRSC
This dataset was created by Indian River State College and consists of network flows and full packet capture [43]. The dataset represents a real-world environment in which the attack traffic has two different types: the controlled version, which is synthetically created by the team, and the uncontrolled version, which consists of real attacks. The flow-based traffic was created with SiLK [44] and the full packet capture was created with the Snort IDS [45]. An additional source of flow data was produced from the Cisco firewall using NetFlow version 9. While the authors stated that the dataset is a complete capture with payload and flow data, unfortunately, it is not publicly available.

UNSW-NB15
UNSW-NB15 was created using a commercial penetration tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS). This tool can generate a synthetic hybrid of modern normal activities and contemporary attack behaviors from network traffic [9]. They collected tcpdump traces for a total duration of 31 h. From these network traces, they extracted 49 features categorized into five groups: flow features, basic features, content features, time features, and additional generated features. Feature and statistical analyses are the most common methods used in several published papers employing UNSW-NB15 [46][47][48]. While [46] could obtain 97% accuracy by using 23 features, [47] incorporated the XGBoost algorithm for feature reduction and used several traditional machine learning algorithms for evaluation, such as Artificial Neural Network (ANN), Logistic Regression (LR), k-Nearest Neighbor (kNN), Support Vector Machine (SVM) and Decision Tree (DT).

UGR'16
This dataset was created from several NetFlow v9 collectors located in the network of a Spanish ISP [10]. It is composed of two different types of datasets that are split in weeks. First, the calibration set contains real background traffic data, and second, the test data contain real background traffic and synthetically generated traffic data with well-known types of attacks. Due to the nature of the NetFlow data, payloads from the network traffic were not included. The types of attacks implemented in this dataset are Low-rate DoS, Port scanning, and Botnet traffic. Between 2017 and 2021, we found mixed methods in several published papers, such as [49,50]. Rajagopal et al. [49] argued that conventional machine learning methods were ineffective and instead used stacking ensembles to improve performance and obtain reliable predictions, while [50] proposed a hybridized multi-model system to improve the accuracy of intrusion detection. Ref. [51] addressed imbalanced data problems by producing synthetic data with the Generative Adversarial Network (GAN).

CICIDS-2017
This dataset was created by the Canadian Institute for Cybersecurity at the University of New Brunswick in 2017. The purpose of CICIDS-2017 was intrusion detection, and it consisted of several attack scenarios. In this dataset, the attack profiles were produced using publicly available tools and codes. Six attack profiles were implemented, such as brute force, heartbleed, botnet, DoS, DDoS, web attack, and infiltration attack. The realistic background traffic was generated, using a B-Profile system [52]. The B-Profile system extracted 25 behaviors of users based on several protocols, such as HTTP, HTTPS, FTP, SSH, and SMTP. The network traffic features were captured with CICFlowMeter [53], which extracted 80 features from the pcap file. The flow label included SourceIP, SourcePort, DestinationIP, DestinationPort, and Protocol. Published work has applied mixed methods to CICIDS-2017 to detect specific attacks, such as DoS attacks [54] by using feature reduction, web attacks [55], and anomalous web traffic [56] with an ensemble architecture and feature reduction. Others improved the AdaBoost-based method [57] to counter the imbalance of the training data [58], and combined feature selection and information gain to find relevant and significant features and to improve accuracy and execution time.

Dataset Requirements
While the authors of ISCX [8], UGR'16 [10], and CICIDS-2017 [53] each introduce a new dataset and provide extensive requirements for it, their works have different research objectives and scopes. Our work complements these earlier datasets by filling the gaps left by their requirements.

Requirements for IDS Evaluation Datasets
Generally, different datasets have different assets and requirements. Shiravi et al. [8] focused on accurate labeling in the dataset by building a systematic profile to generate the dataset. They argued that the network traffic should be as realistic as possible, so a complete capture in a realistic network must be satisfied. This, however, affects anonymity and can lead to privacy issues. Fernandez et al. [10] provided only flow information and focused on the duration of the capturing. Furthermore, a flow format with only the 5-tuple is not enough and needs additional features if the malicious traffic is delivered via an encrypted protocol, such as HTTPS. We found that the requirements to build an IDS dataset from Sharafaldin et al. [52] are extensive. Unfortunately, their generated traffic comes from an emulated network, which lacks a realistic environment. In addition, the information about ground-truth and how the labeling works was not found in their paper and, thus, the labels have the potential to be inaccurate and unreliable for analysis. Cordero et al. [59] created a tool called ID2T and we found that their requirements are better in practical terms. They categorized the requirements into functional and non-functional ones. Functional requirements focus on what is needed to construct datasets, while the non-functional requirements specify several criteria that need to be satisfied to be of practical use.
All of these requirements are highly similar. However, none of the works highlighted the importance of encrypted traffic in the dataset, and this is one of the emphases in our requirements. We derived our requirements for datasets from the above works as well as from reviewing the existing datasets, which showed that the quality of the dataset strongly affects the outcome of the NIDS. We classified the requirements into content requirements and a process requirement. The content requirements are similar to the functional requirements of [59], which focus on what is needed to construct a dataset, and to those of [8] on complete network traces and realistic network traffic capture. The process requirement is similar to that of [10] in the documentation point. This alone is not enough, however, because detailed and practical information on how to produce a new dataset does not exist.
The proposed requirements try to fill the gap of information from previous datasets. Based on our content requirements, we found at least four missing points: (1) Most of the datasets are not anonymized, such as KDD99, SimpleWeb, NSL-KDD, Kyoto, IRSC, and UNSW-NB15. We chose to preserve privacy by anonymizing only a specific part of the background traffic with the Crypto-PAn algorithm. (2) The majority of the datasets are impractical to generate, such as KDD99, CAIDA, NSL-KDD, IMPACT, UMass, IRSC, UNSW-NB15, and CICIDS-2017.
As for encryption information, most of the datasets contain non-encrypted traffic, except for MAWILab, UGR'16, and CICIDS-2017. These datasets neither focused on nor classified encrypted traffic. However, HIKARI-2021 is focused on encrypted traffic.
The content requirements focus on the assets of the dataset to achieve a practical way to produce a dataset, while the process requirement specifies the information on how the dataset is built, so a new dataset can be built in the future using the same process. We list these requirements below along with some descriptions of each item.

Content Requirements
(1) Complete capture: complete capture of the network traffic, such as communication between hosts, broadcast messages, domain lookup queries, and the protocols being used. The most important point of complete capture is that both flow data and pcap should be available.
(2) Payload: payload is not needed for a flow-based approach. However, having comprehensive information and extracting the most out of the data is important. HIKARI-2021 is a dataset that provides labeled encrypted traffic, while the well-known datasets do not focus on encrypted traffic. There is a possibility that a fully captured payload might be useful in the future.
(3) Anonymity: synthetic traffic should provide full packet capture, while real traffic must anonymize certain packets to preserve privacy.
(4) Ground-truth: the datasets should provide realistic traffic from a real production network, compared with the synthetic traffic, and ensure no unlabeled attack in the ground-truth.
(5) Up to date: both packet traces from flow data and pcap should be always accessible by repeating the capturing process of the network traffic. Because the data are subject to change over time, repeating the procedures guarantees that the dataset always obtains the latest information.
(6) Labeled dataset: correctly labeling data as malicious or benign is important for accurate and reliable analysis. The labeling process is a manual task and is determined per flow from the combination of the source IP address, source port, destination IP address, destination port, and protocol.
(7) Encryption information: information on how to establish benign or malicious traffic must be stated. We are focused on application layer attacks, such as brute force and probing, that employ HTTPS with TLS version 1.2 to deliver the attacks.

Process Requirement
Methods: producing a new dataset with specific requirements and practical implementation is important. Therefore, the methods should cover information on how to generate the dataset, how to generate the benign and attack traffic, how the background traffic is being captured, how the labeling process works, and how to implement it in the network. Furthermore, we need to determine what scenarios and how to deliver the synthetic attack in the network. In addition, the information of what features and how many can be extracted from the packet traces should be declared. Information on how to make a new dataset should be available in detail and practical to generate.

Comparison of the Existing Datasets against the above Requirements
Comparisons between IDS datasets are shown in Table 1, where we assess the datasets in Section 2, based on the requirements that we set in Section 3.1. We were unable to find the information regarding the anonymity of the UMass dataset; therefore, no indicator was given. As for the IMPACT dataset, this platform has many datasets, some parts of which are anonymized, while others are not. In the CICIDS-2017 dataset, one part of the traffic has samples for encrypted traffic with benign and attack profiles.
We have four observations from the above comparisons. First, there is a need for encrypted samples of benign and attack traffic. We found that the dataset of [15] indicates whether the traffic is anomalous or suspicious, but this depends on the anomaly detectors. The payload from the packet traces was not included. This limits the capability of an IDS because many attacks cannot be detected from network flows with only 5-tuple attributes. In addition, the dataset of [53] includes traffic from benign and attack profiles over SSH. While this is beneficial, the diversity of the attacks needs to be expanded to applications, such as browser attacks, or to different protocols, such as HTTPS, which we did not find in their dataset. Second, we found that most of the datasets are not anonymized. The reason is probably that their test beds are in a controlled environment or that they have consent for their activity. The former is the best option, with the consequence that the share of synthetic traffic increases while the share of real traffic decreases. The latter is preferred if they can preserve privacy. Furthermore, privacy can be maintained by anonymizing the traffic, but heavy anonymization may degrade the results of the analysis [8,60,61]. Third, we found that most of the datasets do not have ground-truth data and background traffic, which limits the analysis to their own model. Fourth, there is a need for a methodology on how to create a new dataset, because the network environment is subject to change over time. A practical way to create new datasets is important, so that researchers can make their own dataset and evaluate it in their own environment. This methodology can be a guideline for IDS researchers to follow for making a practical dataset.

HIKARI-2021 Generation Methodology
In this section, we explain our methodology for producing our dataset, which we call HIKARI-2021. The process starts with creating a victim network, where background traffic is captured, and attackers generate synthetic benign traffic, using a benign profile (details in Section 4.3), and malicious traffic, using an attacker profile (details in Section 4.4). The attacker traffic is captured in the attacker network. We do this to differentiate between synthetic benign and malicious traffic. Distinguishing between benign and malicious traffic is based on several criteria (details in Section 4.4). We then process the packet traces to anonymize the background traffic and extract the features. The packet traces and extracted features, as well as the documentation, constitute the produced dataset.
We are focused on application layer attacks that employ HTTPS. Based on the report from the 2021 Data Breach Investigation, 80% of the attack vectors come from application-layer attacks. There are many attacks on the internet but we are not focused on how many attacks we can generate. Based on the survey from netcraft.com and websitesetup.org, WordPress, Joomla, and Drupal are among the ten most popular open-source CMSs, with a combined market share of almost 50%. Based on the information from CVE, more than 300 vulnerabilities existed for WordPress from 2006 to 2021, 92 vulnerabilities for Joomla from 2004 to 2021, and 202 vulnerabilities for Drupal from 2002 to 2021. More than half of the vulnerabilities from these three CMSs are part of Brute Force and Probing. Furthermore, the goal of this research is not attack diversity but what kind of attack we can deliver in the encrypted network. We therefore decided to focus on common application-layer attacks, such as brute force and probing. In addition, IDS researchers may build their own scripts based on our tool to add further attacks, such as SQL Injection, Denial of Service, etc.

Background Profile
Generating realistic data is important. For the background traffic, we captured all the data without any filter or firewall in the victim network. Therefore, there is a possibility that the background traffic may contain malicious traffic or attacks. To preserve privacy without degrading the result of the analysis, we anonymized several pieces of information, such as IP address and the payload.
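The idea behind prefix-preserving address anonymization, which is the property that Crypto-PAn provides, can be sketched as follows. This is a simplified illustration and not the Crypto-PAn implementation used for the dataset: it swaps in HMAC-SHA256 as the keyed pseudorandom function where Crypto-PAn uses a block cipher, and the function names and key are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def _prf_bit(key: bytes, prefix_bits: str) -> int:
    """Derive one pseudorandom bit from the key and the address bits seen so far."""
    digest = hmac.new(key, prefix_bits.encode("ascii"), hashlib.sha256).digest()
    return digest[0] & 1


def anonymize_ipv4(address: str, key: bytes) -> str:
    """Prefix-preserving anonymization of an IPv4 address.

    Output bit i equals input bit i XOR a keyed function of input bits 0..i-1.
    Two addresses that share a k-bit prefix therefore still share a k-bit
    prefix after anonymization, so subnet structure survives.
    """
    bits = format(int(ipaddress.IPv4Address(address)), "032b")
    out = []
    for i, bit in enumerate(bits):
        flip = _prf_bit(key, bits[:i])
        out.append(str(int(bit) ^ flip))
    return str(ipaddress.IPv4Address(int("".join(out), 2)))


if __name__ == "__main__":
    secret = b"hypothetical-secret-key"
    for addr in ("192.168.1.10", "192.168.1.77", "10.0.0.1"):
        print(addr, "->", anonymize_ipv4(addr, secret))
```

Because the mapping is deterministic for a given key, the same host keeps the same pseudonym across the whole trace, which keeps flows analyzable while hiding the real addresses.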

Benign Profile
To generate the benign profile, we considered using a profile similar to human behavior. To achieve it, we used Selenium [62], which runs two headless browsers: Google Chrome and Mozilla Firefox. These two browsers act like humans by clicking random links from multiple websites, signing up as a user, signing in, posting an article to the target victim's server, and signing out. To behave like a human and to avoid being detected as a bot or web spider, we used several configurations, such as a user-agent and a random delay, for every sequence of actions. The addresses of the websites are taken from Alexa's list of the top 1 million sites [63]. The benign profile was developed as a Python script; this activity simulates benign traffic. All benign traffic is captured without anonymizing the payload. The type of traffic generated is HTTPS only.
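A minimal sketch of such a benign session is given below. It is not the authors' script: the user-agent strings, delay bounds, and function names are assumptions, only headless Chrome is shown, and the sign-up, sign-in, and posting steps are represented in the plan but not implemented because they depend on the target CMS.

```python
import random
import time

# Hypothetical values; the paper does not list its user-agents or delay bounds.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
]


def plan_session(sites, rng, min_delay=1.0, max_delay=5.0):
    """Pick a site and a user-agent, and draw a random pause before each action."""
    actions = ["browse", "sign_up", "sign_in", "post_article", "sign_out"]
    return {
        "site": rng.choice(sites),
        "user_agent": rng.choice(USER_AGENTS),
        "steps": [(action, rng.uniform(min_delay, max_delay)) for action in actions],
    }


def browse_random_link(url, user_agent, rng, delay):
    """Open `url` in headless Chrome and follow one random HTTPS link."""
    # Imported lazily so the planning code runs without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument(f"--user-agent={user_agent}")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(delay)  # random pause supplied by the session plan
        hrefs = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
        https_links = [h for h in hrefs if h and h.startswith("https://")]
        if https_links:
            driver.get(rng.choice(https_links))
    finally:
        driver.quit()


if __name__ == "__main__":
    rng = random.Random()
    plan = plan_session(["https://example.org"], rng)
    print(plan)
```

Separating the plan from the browser control keeps the random choices reproducible (by seeding `rng`) and lets the same plan drive either browser.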

Attacker Profile
The attack traffic is generated synthetically, first by targeting a specific page for user login of the CMSs, and second by scanning their vulnerability. Both of the attacks are delivered via the HTTPS protocol. The attacks are delivered on different days with different scenarios (details in Section 4.5). The types of attacks are as follows:
(1) Brute force attack: this is the best-known attack for cracking passwords. The attacker repeatedly tries to gain access to the target by trying all possible combinations from a dictionary of common passwords [64]. We developed a script that mimics a brute force attack, using a browser to deliver the attack. We intentionally added a user to the three different CMSs with the admin role and a password taken randomly from [64]. The purpose is to make sure that the brute force attack is delivered successfully.
(2) Brute force attack with different attack vectors: while the first type of attack uses the browser as the attack vector, the second attack uses a different attack vector, XMLRPC. We developed a script that accesses XMLRPC to gain valid credentials.
(3) Probing: this is also called vulnerability probing. This script scans the web applications, such as Joomla, WordPress, and Drupal, to find their vulnerabilities. The tools for vulnerability scanning are publicly available. For this dataset, the scripts used these probing tools: droopescan [65] for WordPress and Drupal, and joomscan [66] for Joomla.
We provide a template script for customizing the attack profile, so researchers may use it to make custom attacks with different vectors. Distinguishing between an attack profile and a benign profile is based on the source IP address, source port, destination IP address, destination port, protocol, and the day on which each profile was generated. In addition, any destination address in the Alexa list is considered benign.
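The distinguishing criteria above can be sketched as a small rule function. All addresses, dates, host names, and class names below are hypothetical placeholders; the actual rules and values used for HIKARI-2021 are not given in this section.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Flow:
    src_ip: str
    src_port: int      # kept for completeness; usually ephemeral, so not matched here
    dst_ip: str
    dst_port: int
    protocol: str
    day: date
    dst_host: str = ""


# Hypothetical placeholders (documentation address ranges), not the paper's values.
ATTACKER_IPS = {"203.0.113.10"}
BENIGN_GENERATOR_IPS = {"203.0.113.20"}
VICTIM_IPS = {"198.51.100.5"}
ATTACK_DAYS = {date(2021, 4, 10)}
ALEXA_HOSTS = {"example.com", "example.org"}


def classify(flow: Flow) -> str:
    """Return 'attack', 'benign', or 'background' for one flow."""
    to_victim_https = (
        flow.dst_ip in VICTIM_IPS and flow.dst_port == 443 and flow.protocol == "tcp"
    )
    if flow.src_ip in ATTACKER_IPS and to_victim_https and flow.day in ATTACK_DAYS:
        return "attack"
    if flow.src_ip in BENIGN_GENERATOR_IPS and to_victim_https:
        return "benign"
    if flow.dst_host in ALEXA_HOSTS:
        return "benign"
    return "background"


if __name__ == "__main__":
    f = Flow("203.0.113.10", 51000, "198.51.100.5", 443, "tcp", date(2021, 4, 10))
    print(classify(f))
```

Anything that matches none of the synthetic rules falls through to the background class, which mirrors the idea that background traffic is whatever the generators did not produce.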

Scenarios
We captured the traffic non-consecutively between 28 March and 4 May 2021, with each capture session lasting for 3 to 5 hours. In the first scenario, no attack traffic was generated, and only background traffic was captured. In the second scenario, brute force attack traffic was generated for 2 days. In the third scenario, brute force traffic with different attack vectors was generated. In the last scenario, vulnerability scans of WordPress, Joomla, and Drupal were generated.

Dataset Preprocessing
The traces were captured using tcpdump with full packet capture. The background traffic was also captured in full, but we then anonymized it with a Crypto-PAn based algorithm [67] to preserve privacy. The complete dataset contains several files: pcap files from the background traffic and from the synthetic attacks. The flowmeter files in pkl and CSV formats are available for download [68]. The preprocessing flow from pcap files into CSV files is presented in Figure 3.
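To illustrate what prefix-preserving anonymization in the style of Crypto-PAn does, the following sketch anonymizes IPv4 addresses bit by bit: each output bit is the input bit XORed with one pseudorandom bit derived from the preceding input bits. It is a simplified illustration under stated assumptions, not the tool used for the dataset: HMAC-SHA256 is swapped in for the AES-based pseudorandom function of the original Crypto-PAn design, and the function name and key are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def anonymize_ipv4(address: str, key: bytes) -> str:
    """Prefix-preserving anonymization of a single IPv4 address."""
    original = int(ipaddress.IPv4Address(address))
    anonymized = 0
    for i in range(32):
        # The i most significant input bits, tagged with their length so that
        # prefixes of different lengths never produce the same message.
        prefix = original >> (32 - i)
        message = i.to_bytes(1, "big") + prefix.to_bytes(4, "big")
        digest = hmac.new(key, message, hashlib.sha256).digest()
        flip = digest[0] >> 7              # one pseudorandom bit per prefix
        bit = (original >> (31 - i)) & 1   # i-th input bit, MSB first
        anonymized = (anonymized << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(anonymized))


# Hypothetical key; a real deployment would keep it secret and random.
key = b"example-key-not-for-production"
print(anonymize_ipv4("192.0.2.17", key))
print(anonymize_ipv4("192.0.2.200", key))
```

Because output bit i depends only on input bits up to i, two addresses that share a k-bit prefix are mapped to addresses that share exactly a k-bit prefix, so subnet structure survives anonymization while the actual addresses do not.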

Labeling Process
During background traffic validation, we found malicious cryptomining traffic. The Zeek rules flagged some of the traffic as malicious cryptomining, such as XMRIGCC. We separated this traffic and added it as a new attack, which we categorized as XMRIGCC CryptoMiner. Labels were applied on a per-flow basis. In the background traffic, we did not find any attack besides the cryptomining. For all traffic other than background, labeling was based on the rules of the generated synthetic traffic, namely source IP address, source port, destination IP address, destination port, and protocol. The dataset carries two labels: traffic_category and label. The former holds the name of the traffic category, while the latter is a single value, with 0 representing Benign and 1 representing Attack, as shown in Table 2. Table 3 shows the features, while Figure 4 displays a statistical description of the features. Most of the features were adopted from CICIDS-2017, while uid, originh, originp, responh, responp, traffic_category, and Label were derived from Zeek.
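The relationship between the two labels can be sketched as a simple mapping from category name to binary value. The category names follow those listed in this paper (two benign categories and four attack categories); their exact spelling in the published CSV files, and the helper name `binary_label`, are assumptions of this example.

```python
# Category names as listed in the paper; exact CSV spellings are an assumption.
BENIGN_CATEGORIES = {"Benign", "Background"}
ATTACK_CATEGORIES = {
    "Bruteforce",
    "Bruteforce-XML",
    "Probing",
    "XMRIGCC CryptoMiner",
}


def binary_label(traffic_category: str) -> int:
    """Map a traffic category to the binary label: 0 = Benign, 1 = Attack."""
    if traffic_category in ATTACK_CATEGORIES:
        return 1
    if traffic_category in BENIGN_CATEGORIES:
        return 0
    # Fail loudly rather than silently mislabel an unknown category.
    raise ValueError(f"unknown traffic category: {traffic_category!r}")


categories = ["Benign", "Probing", "Background", "XMRIGCC CryptoMiner"]
labels = [binary_label(c) for c in categories]
print(labels)
```

Raising on an unknown category, instead of defaulting to 0, keeps an unexpected category from entering the ground truth as benign.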

Performance Analysis
We conducted a basic performance analysis using four machine learning algorithms. Table 4 displays the results in terms of Accuracy, Balanced Accuracy, Precision, Recall, and F1. Table 5 compares KDD99, UNSW-NB15, CICIDS-2017, and HIKARI-2021. The table consists of seven parameters: the number of unique IP addresses, simulation, duration of the capture, format of the collected data, attack categories, feature extraction tools, and the number of features extracted from each dataset. For CICIDS-2017 and HIKARI-2021, the number of unique IP addresses was taken from the unique destination IP addresses in the dataset. Partial means that the dataset mixes simulated (synthetic) traffic with traffic from a real-network environment.
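For reference, the five reported metrics can be computed from the counts of true and false positives and negatives. The sketch below assumes the binary label with Attack (1) as the positive class; the paper does not state here how the metrics were averaged, so this is an illustration of the standard definitions rather than the exact evaluation procedure, and the function name is hypothetical.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Standard binary metrics, with label 1 (Attack) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator: float, denominator: float) -> float:
        # Convention used here: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    recall = ratio(tp, tp + fn)        # true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy example: 8 benign flows and 2 attack flows.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy example the accuracy is 0.8 while the balanced accuracy is 0.6875, which shows why balanced accuracy is worth reporting when benign flows greatly outnumber attack flows.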

Conclusions and Future Work
Publicly available, up-to-date datasets for benchmarking and comparing IDSs are important, especially as network traffic changes over time. This paper makes two main contributions. First, we compiled requirements for building new datasets that existing datasets fail to meet, covering anonymization, payload, ground-truth, and encryption, together with a practical method to implement them. Anonymizing certain data prevents privacy issues, while capturing the payload enriches the information that can be collected for detecting malicious traffic within encrypted traffic. Providing the ground-truth data is crucial, so that no unlabeled attack is recorded in the dataset. Our concern is the lack of existing datasets with encrypted traffic, even though most present-day traffic is encrypted and attacks are delivered over it. Second, we generated a new IDS dataset called HIKARI-2021, which covers network traffic with encrypted traces. The dataset was produced with ground-truth data, which are missing in the existing IDS datasets. The dataset is publicly available [68]. We adopted more than 80 features from CICIDS-2017 and added more features as a reference, such as the source IP address (originh), source port (originp), destination IP address (responh), and destination port (responp). We labeled each flow as benign or attack, where benign has two categories (Benign or Background), while attack has four (Bruteforce, Bruteforce-XML, Probing, and XMRIGCC CryptoMiner).
We want to highlight what makes our dataset different from the existing IDS datasets, based on our proposed ideal requirements. The first group is the content requirements. For complete capture, we provide all traces as pcap files (background, benign, and attack traffic). The payload is provided, with the exception that we anonymize the background traffic, since anonymity is itself a requirement for preserving privacy.
The ground-truth and labels are manually evaluated based on the source IP address, source port, destination IP address, destination port, and protocol. This process makes sure that no unlabeled attack is in the ground-truth. The last content requirement is encryption. This is one of the most important requirements, as unknown malicious traffic is known to use encrypted channels as a vector for delivering attacks.
The second group is the process requirements, which ensure that researchers can follow the guidelines to create their own datasets. The information on how to generate the synthetic attacks and on the network configuration should be available. We provide the scripts for capturing traffic and for generating the synthetic attacks from the attack profile. The tools for mimicking human interaction, such as browsing and clicking random links, are also available. The two profiles, the attack profile and the benign profile, are important for producing new data if researchers want to add more attack vectors or adapt the traffic to their own needs. The labeling script that produces the ground-truth data is provided as well. The process requirements can be implemented in a controlled environment, so that researchers can make new datasets based on their own network configuration. For a basic evaluation, we examined the performance of the HIKARI-2021 dataset in terms of Accuracy, Balanced Accuracy, Precision, Recall, and F1, using four machine learning algorithms.
In the future, we would like to extend our observations to the background traffic and add an evaluation of it. Because background traffic is uncertain and not labeled in the data, a possible approach for this evaluation is unsupervised machine learning. Furthermore, we would like to make performance comparisons with the existing datasets and proceed with the analysis of application identification. This is important because malicious traffic may be disguised using reserved ports to bypass firewalls or IDSs and blend in with normal network activity.