The VNF Cybersecurity Dataset for Research (VNFCYBERDATA)

Believe Ayodele; Victor Buttigieg

doi:10.3390/data9110132

Department of Communications and Computer Engineering, Faculty of ICT, University of Malta, Msida MSD 2080, Malta

Author to whom correspondence should be addressed.

Data 2024, 9(11), 132; https://doi.org/10.3390/data9110132

Abstract

Virtualisation has been widely adopted and deployed across a broad range of enterprises and industries over the years. Network Function Virtualisation (NFV) is a technical concept for dynamically delivering network functions as virtualised software components. Virtualised Network Functions (VNFs) have distinct advantages, but they also face serious security challenges. Cyberattacks such as Denial of Service (DoS), malware/rootkit injection, and port scanning can target VNF appliances just like any other network infrastructure. Training machine or deep learning (ML/DL) models to combat cyberattacks on VNFs requires a suitable dataset that reflects, as closely as reasonably possible, the problem the ML/DL model is meant to address. This article describes VNFCYBERDATA, a real VNF dataset that contains over seven million data points and twenty-five cyberattacks generated from five VNF appliances. To facilitate a realistic examination of VNF traffic, the dataset includes both benign and malicious traffic.

Dataset: https://figshare.com/s/f065022eb85701278c58.

Dataset License: Creative Commons Attribution 4.0 International.

Keywords:

virtualised network function; VNF; dataset; cybersecurity; network function virtualisation; NFV

1. Introduction

The emergence of virtualised network functions (VNFs) has given the computing world agility, flexibility, cost effectiveness, resource optimisation, and scalability in offering network functions or services on demand by allowing VNF appliances to run on commodity hardware, decoupling them from dedicated, proprietary hardware. VNF appliances can perform tasks such as routing, firewalling, intrusion detection, and load balancing. There is therefore a significant need to protect these functions, because the bulk of VNF appliances are deployed at the network’s edge, where malicious actors can easily target them [1,2,3,4,5]. Current research on combating attacks using machine or deep learning (ML/DL) models is frequently based on well-known datasets, many of which capture network traffic patterns during attack and non-attack scenarios. Several existing datasets, such as BETH [6], ISOT CID [7], CIC-IDS2017 [8], CSE-CIC-IDS [9], NSL-KDD [10], and CIC-Bell-DNS-EXF-2021 [11], have been used over time to train ML/DL models to prevent intrusions and attacks. However, these datasets are not focused on network traffic in a VNF-enabled system. This work introduces VNFCYBERDATA, the first VNF dataset for cybersecurity and the first session-level dataset in this domain, with the potential to inspire new research directions.

A VNF-based dataset is necessary due to the observed discrepancies, such as differences in throughput, packet delay, resource contention, and the hypervisor obscuring important information in the case of anomaly detection, which can change the overall behaviour of network traffic [12,13,14,15,16,17]. Whiteaker et al. [13] found that virtualisation can increase latency due to additional abstraction layers, especially in high-throughput scenarios with significant packet processing demands. Chung and Wang [12] further illustrate that while high-performance cloud systems can mitigate some latency issues through optimised architectures, the fundamental overhead of virtualisation remains a challenge. Wang and Ng [14] investigated the impact of virtualisation on network performance in Amazon EC2 data centres, observing that resource sharing among different VNFs might result in unexpected performance owing to contention for CPU, memory, and I/O resources. This contrasts with Physical Network Functions (PNFs), which typically have dedicated resources, resulting in more predictable and stable performance metrics. Gogunska et al. [15] further explored the implications of measuring traffic in a virtualised environment, emphasising that the overhead associated with virtualisation can complicate traffic measurement and analysis. This overhead can introduce additional latency, jitter, and packet processing delays, altering the behaviour or pattern of a virtualised appliance compared to the PNF counterpart. Lee et al. [16] investigated traffic anomaly characteristics in a virtualised network testbed, revealing that VNFs may exhibit different traffic patterns than PNFs. The authors note that the virtualisation layer can obscure certain traffic behaviours, making it more challenging to detect anomalies. 
Collectively, these studies demonstrated that the bottlenecks introduced by virtualisation, such as increased packet delays, resource contention, and hypervisor complexity, can have a considerable impact on network traffic performance when compared with PNFs. As a result, a VNF-based dataset is required to properly capture and investigate the behaviour of virtualised network devices under these described constraints, as traditional PNF-based datasets may not reflect the unique problems and performance characteristics introduced by virtualisation.

The BETH dataset [6] contains two sensor logs (kernel-level process calls and network traffic); however, it does not provide adequate network features for traffic analysis, since its fourteen features offer little diversity for investigating network traffic. The ISOT CID dataset [7] offers an even smaller number of features than BETH. It also labels traffic generically, without specifying the actual malicious behaviour that was captured. The ISOT CID dataset [7] is similar to this work in that it captures performance metrics such as CPU utilisation. The CIC-IDS2017 dataset [8] is an improvement in terms of diversity and volume; however, it is limited in the number of attacks (eight cyberattacks) and in the number of protocols represented in the dataset (FTP, SSH, brute force SSH, DoS, etc.). The work described here captured seventy-seven protocol sets and twenty-nine cyberattacks. A work similar to [8], the dataset in [9], also offers an improved dataset for anomaly detection that provides a benchmark for intrusion detection based on creating user profiles with about eighty features; however, it is limited in the number of cyberattacks.

The VNFCYBERDATA dataset provides researchers with several data points to develop ML/DL models to safeguard VNF appliances. It could also serve as a guide to study or investigate the network behaviour of VNF appliances in comparison with physical network functions. During the collection of the dataset, the following considerations and actions were taken:

All VNF environments were completely updated/upgraded to the most recent version/build before collecting the dataset; therefore, no known vulnerabilities were present.
Not all attacks were carried out at the same time.
Attacks were initiated from both inside (LAN) and outside (Internet) the network.
Not all initiated attacks were successful in exploiting the VNF appliances; the aim of generating the dataset is to monitor and collect network behaviour during such attacks and regular operation, not necessarily exploit the VNF appliances.

2. Data Description

The VNFCYBERDATA dataset comprises over seven million anonymised data points with forty attributes and a target (Label) column (as seen in Table 1) collected between 03 December 2023 and 28 March 2024. The target column is multiclass in nature, with normal traffic classified as “Benign” and malicious/attack traffic classified as shown in Table 2. Table 3 shows the list of entries for each attack and the overall percentage. The link to archived data is available online.

Table 1. The VNFCYBERDATA dataset features with description.

Table 2. Malignant traffic classification label.

Table 3. Total number of entries and corresponding percentages for benign traffic and each attack type.

As shown in Figure 1, the folder is organised to keep the data for each VNF appliance distinct. The dataset for each appliance is further divided into multiple files, which avoids a single huge file and eases analysis and transfer. The vLoad Balancer, vProxy, vIDS, vRouter, and vDNS data are 20.64, 8.01, 6.68, 4.92, and 3.70 gigabytes in size, respectively. The total size of the dataset is 43.94 gigabytes. The VNFCYBERDATA dataset includes the following file formats:

Figure 1. Folder structure of the dataset.

PCAP format: This format includes network packet data that can be used to analyse the network properties. This comprises information about the communication, such as port numbers, IP addresses, payload sizes, protocols, and TCP flags. The format facilitates further analysis of the gathered traffic.
CSV format: The collected packets are transformed into CSV files, which can be readily viewed in a spreadsheet application such as Microsoft Excel, where traffic information is shown in rows and columns.

The naming structure for network traffic datasets is “sessions_#_*.csv”, where “#” denotes the numerical sequence of the file (starting from 1 to last serial number) and “*” represents the type of VNF, such as vIDS. Additionally, the dataset folders include the “performance” folder, which contains system performance measurements in CSV format. The performance data were collected for vDNS, vLoad Balancer, vRouter_Firewall, and vProxy, containing the CPU, memory, disk, and network utilisation of the VNF appliances in benign and attack scenarios.
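The naming convention above lends itself to programmatic handling. The following is a minimal sketch, assuming file names follow the “sessions_#_*.csv” pattern exactly; the function names and the sample file names are hypothetical and are not part of the dataset tooling.

```python
import re

# Pattern for "sessions_<sequence>_<appliance>.csv".
# The appliance part may itself contain underscores (e.g. vRouter_Firewall).
SESSION_FILE = re.compile(r"^sessions_(\d+)_(.+)\.csv$")


def parse_session_filename(name):
    """Return (sequence_number, appliance), or None if the name does not match."""
    match = SESSION_FILE.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def order_session_files(names):
    """Keep only session files and sort them by numeric sequence, then appliance."""
    matched = [n for n in names if parse_session_filename(n) is not None]
    return sorted(matched, key=parse_session_filename)
```

Sorting on the parsed integer rather than on the raw string keeps, for example, file 2 ahead of file 10 when the parts of one appliance are processed in order.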

Additionally, Table 4 shows the number of sessions captured for each protocol type/set in the dataset for each appliance. Arkime keeps track of a protocol set for a given session.

Table 4. The number of sessions captured for each protocol type/set.

3. Methods

This section provides a detailed explanation of the setup used to capture the VNFCYBERDATA dataset, including the hypervisors, VNF appliances, network-related details, and types of cyberattacks.

3.1. VNF Environment

The architecture in Figure 2 shows the layout of the environment with the various VNF appliances deployed for traffic capture, while Table 5 shows details of the network addresses, operating systems, and VNF appliances deployed. The environment consists of two Type 1 hypervisors, ESXi-Server1 and ESXi-Server2, each deployed on a separate HPE blade server (ProLiant BL460c Gen 8). A firewall device connects to the Internet Service Provider (outside interface) and to the LAN (inside interface). The firewall also connects to the HPE Virtual Connect FlexFabric 10 Gb/24-port module on the HPE c7000 enclosure, which in turn connects to the mezzanine card on the BL460c blade server and then to the hypervisor’s virtual switch (vSwitch).

Figure 2. The environment for capturing VNFCYBERDATA dataset.

Table 5. Network information of VNF appliances deployed.

Furthermore, two vSwitches are configured at the hypervisor level for all appliances in this setup. As shown in Table 6, each VNF appliance is connected to two vSwitches: the “VM Network” vSwitch handles standard VNF traffic with an IPv4 address assigned to the interface, while the “Mirroring” vSwitch is configured to intercept and read all network packets and therefore operates in promiscuous mode. For vRouter_vFirewall and vIDS, an additional vSwitch is set up to carry network traffic for an internal subnet (LAN interface), as seen in Figure 3.

Table 6. vSwitch connecting the appliances and capture solutions.

Figure 3. Image showing an example of how the vSwitches are connected to a VNF appliance.

A capture solution, Arkime [18], was deployed to capture and subsequently extract network information in PCAP and CSV formats. Arkime is installed separately on an Ubuntu 22.04 LTS operating system and linked to the “Mirroring” vSwitch.

Inbound traffic to the Arkime host is blocked using a host-based firewall (Ubuntu ufw), leaving only outbound traffic for collecting network traffic data. Figure 3 shows an example of the connection of an appliance to the vSwitches in Table 6.

In addition to the information in Table 5, the following resources were deployed or used:

Public DNS: Google DNS (8.8.8.8), Cloudflare DNS (1.1.1.1), Quad9 DNS (9.9.9.9), OpenDNS (208.67.222.222, 208.67.220.220). vDNS is sometimes used for name resolution with the previous public DNS setup as forwarders.
Kali Linux VMs: 10.59.32.52, 10.59.32.53, 10.59.32.51, 10.59.32.50
VM for traffic capturing (Arkime): 10.59.32.41.
Internal subnet connected to vRouter_vFirewall: 192.168.100.0/24.
Internal subnet connected to vIDS: 192.168.200.0/24.
Internet port forwarding from public IP to 10.59.32.32 (vLoad Balancer) on ports 443 and 80.
Internet port forwarding from public IP to 10.59.32.37 (vDNS) on port 53.
vProxy is set up to allow client access on port 8080.
Domain name registration: techtechinfra.tech and vnfdataset.info, including subdomains such as www.techtechinfra.tech, vpn.techtechinfra.tech, and kali.techtechinfra.tech.
Public SSL certificate installed on vLoad Balancer and web servers.

3.2. Benign Traffic

For all the VNF appliances, the traffic capture process began with benign traffic and progressed to malicious traffic capture; however, benign traffic continued to be generated while attacks were being simulated. The following describes how benign traffic was generated for each VNF appliance.

vLoad Balancer: Benign traffic was generated using Apache JMeter [19] and two custom scripts [20]. HTTP traffic was generated from both inside and outside (Internet) the network using JMeter and the custom scripts. Internet traffic was based on port forwarding. Inbound traffic for the domain name www.techtechinfra.tech on ports 443 and 80 was forwarded to the vLoad Balancer.
vDNS: All internal name resolution was performed using vDNS, which generated a large volume of benign traffic. Further name resolution was carried out using a custom script found in [21]. In addition, a port forwarding rule was created to redirect all DNS traffic on the well-known port 53 to the vDNS appliance.
vRouter_vFirewall, vProxy, and vIDS: The vRouter and the vIDS were set up with two interfaces each, with one interface serving as the internal interface that connects to internal machines (Figure 1). For vProxy, all the devices in the lab environment were set up to connect to the Internet through the vProxy on port 8080.

3.3. Details of Attack

For all VNF appliances, the traffic capture process began with benign traffic and progressed to malicious traffic capture. Malicious traffic capture was carried out in two stages: stage one involved attacks without deploying malware samples, and stage two involved the deployment of several types of malware on each VNF appliance or corresponding environment (such as VMs attached to the VNF appliances). Table 7 summarises the malware samples and their hashes, and Table 8 summarises each attack on the VNF appliances, including the number of data points recorded. The malware samples were carefully chosen to reflect MITRE ATT&CK [22] tactics displayed by conventional malware, such as persistence, directory discovery, defence evasion, reconnaissance, lateral movement, command and control, privilege escalation, file encryption, and collection.

Table 7. The type of malware attack, behaviour, and corresponding hash function for each VNF appliance.

Table 8. The entries of each of the VNF appliances.

As previously stated in Section 1, the primary goal of this dataset is to record the network’s behaviour throughout different attacks. It is worth noting that not all attacks successfully exploited the VNF appliances. The following gives a summary of the attack states observed:

Malware Attacks: All malware attacks were carried out internally by intentionally infecting the VNF appliances with malware samples of various types, as shown in Table 7. These attacks usually stemmed from internal malicious content or were delivered via malicious emails.
DoS Attacks: The DoS attacks against the vLoad Balancer, vRouter, vIDS, and vProxy did not consistently drain system resources. The vLoad Balancer’s performance was captured in the performance dataset, allowing its performance under attack scenarios to be compared with everyday operation. Aligning the timestamps in the attack and performance datasets can provide insight into the overall behaviour of the network during attack and normal operations.
DNS Attacks: DNS-related attacks, including DNS exfiltration, DNS spoofing, and DNS amplification attacks, were carried out successfully.
Web Application Vulnerabilities: Attacks such as directory traversal, injection, SQL injection, unrestricted file uploads, and remote code execution were successfully exploited due to vulnerabilities in the web application.
Unsuccessful Attacks: XSS and brute force attempts were unsuccessful.
ARP Spoofing and MiTM Attacks: Both ARP spoofing and Man-in-the-Middle (MiTM) attacks were successfully executed within the network. In some situations, these attacks disrupted connectivity across the entire hypervisor, leaving devices unresponsive for a period of time.
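The timestamp alignment mentioned for the DoS attacks can be sketched as follows. This is an illustrative example only: it assumes that the performance samples carry a timestamp and a CPU utilisation value and that each session has a start and stop time (as in the Start Time and Stop Time features). The function name and the sample values are hypothetical, not taken from the dataset.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


def mean_metric_in_window(samples, start, stop):
    """Average a metric over the samples whose timestamp lies in [start, stop].

    samples: list of (datetime, value) pairs sorted by timestamp.
    Returns None when no sample falls inside the window.
    """
    times = [t for t, _ in samples]
    lo = bisect_left(times, start)    # first sample at or after the session start
    hi = bisect_right(times, stop)    # one past the last sample at or before the stop
    window = [value for _, value in samples[lo:hi]]
    return sum(window) / len(window) if window else None


# Hypothetical performance samples: (timestamp, CPU utilisation in percent).
perf = [
    (datetime(2024, 1, 10, 12, 0, 0), 12.0),
    (datetime(2024, 1, 10, 12, 0, 5), 15.0),
    (datetime(2024, 1, 10, 12, 0, 10), 91.0),
    (datetime(2024, 1, 10, 12, 0, 15), 95.0),
    (datetime(2024, 1, 10, 12, 0, 20), 20.0),
]

# Hypothetical session window taken from a network record.
session_start = datetime(2024, 1, 10, 12, 0, 8)
session_stop = datetime(2024, 1, 10, 12, 0, 16)
cpu_during_session = mean_metric_in_window(perf, session_start, session_stop)
```

Binary search on the sorted timestamps keeps each lookup cheap, which matters when millions of sessions are matched against a long performance log.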

3.4. Labelling of Dataset

Labelling the VNFCYBERDATA dataset entails precisely establishing the ground truth, that is, assigning each entry either the benign label or one of the attack class labels identified in Table 2. The labelling of the dataset was carried out using the following approaches:

Manual inspection and scripting: Labelling traffic begins with manual inspection to discover patterns. Once a pattern is identified, a Python script [23] labels the matching entries in the resulting CSV file. This procedure is repeated until all patterns are labelled. When the script cannot be used to classify a pattern, the traffic is labelled manually based on the observed pattern behaviour. Furthermore, sessions were assessed using Arkime and Wireshark [24] for easy traffic pattern identification.
External threat intelligence: External threat intelligence was used to analyse traffic behaviour in addition to the manual inspection of traffic. Tools such as VirusTotal [25], AbuseIPDB [26], Who.is [27], Wireshark, MxToolbox [28], and Shodan [29] were leveraged to identify known malicious IPs, subdomain and domain names, and attack patterns. External intelligence was mostly used in situations where an observed pattern could not be established and required further study.
VNF appliance logs: In some cases, logs from VNF appliances were correlated with traffic captured in Arkime to fully understand the overall traffic behaviour. This approach gave better insight into traffic whose patterns could not be identified easily. It was applied to Nginx logs (vLoad Balancer) and Snort logs (vIDS).
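The pattern-then-script workflow described above can be illustrated with a small rule-based labeller. This is a sketch under stated assumptions and is not the authors’ script [23]: the rules, the packet threshold of 1000, and the sample rows are hypothetical. The column names follow Table 1, and the attacker addresses are the Kali Linux VM addresses listed in Section 3.1.

```python
import csv
import io

# Attacker hosts: the Kali Linux VM addresses listed in Section 3.1.
KALI_HOSTS = {"10.59.32.50", "10.59.32.51", "10.59.32.52", "10.59.32.53"}


def looks_like_icmp_flood(row):
    # Hypothetical rule: many ICMP packets in one session from an attacker host.
    return (row["Src IP"] in KALI_HOSTS
            and row["IP Protocols"] == "icmp"
            and int(row["Packets"]) >= 1000)


def looks_like_udp_flood(row):
    # Hypothetical rule: many UDP packets in one session from an attacker host.
    return (row["Src IP"] in KALI_HOSTS
            and row["IP Protocols"] == "udp"
            and int(row["Packets"]) >= 1000)


# Ordered (rule, label) pairs; the first matching rule wins.
RULES = [
    (looks_like_icmp_flood, "DoS–ICMP Flood Attack"),
    (looks_like_udp_flood, "UDP Flood Attack"),
]


def label_rows(rows):
    """Split rows into labelled rows and rows left for manual inspection."""
    labelled, unmatched = [], []
    for row in rows:
        label = next((name for rule, name in RULES if rule(row)), None)
        if label is None:
            unmatched.append(row)
        else:
            labelled.append({**row, "Label": label})
    return labelled, unmatched


# Hypothetical session records using a subset of the Table 1 columns.
SAMPLE = """Src IP,IP Protocols,Packets
10.59.32.52,icmp,48210
10.59.32.41,tcp,14
10.59.32.50,udp,3900
"""

labelled, unmatched = label_rows(csv.DictReader(io.StringIO(SAMPLE)))
```

Rows that no rule matches are returned separately rather than defaulting to “Benign”, which mirrors the fallback to manual labelling described in the text.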

It is worth noting that Arkime does not have the capacity to collect system performance measures, so a custom script [30] was created to acquire the required information. The timestamp generated in the collected system performance dataset allows the performance metrics dataset to be correlated with the network dataset. In addition, to protect privacy, public IP addresses were anonymised using private class B IP addresses, with the exception of public IP addresses for DNS resolutions such as Google DNS and OpenDNS.
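One possible way to perform this kind of anonymisation is sketched below. The source does not describe the mapping scheme that was used, so everything here is an assumption made for illustration: the replacement pool (the 172.16.0.0/12 private range), the sequential assignment, and the class and variable names. The retained resolver addresses are the public DNS servers listed in Section 3.1.

```python
import ipaddress

# Private (RFC 1918) ranges are left untouched because they are not public addresses.
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

# Public DNS resolvers from Section 3.1 that are kept as-is.
KEEP = {"8.8.8.8", "1.1.1.1", "9.9.9.9", "208.67.222.222", "208.67.220.220"}


class Anonymiser:
    """Map each distinct public IPv4 address to a stable private replacement."""

    def __init__(self, pool="172.16.0.0/12"):
        # Assumed replacement pool; addresses are handed out sequentially.
        self._hosts = iter(ipaddress.ip_network(pool).hosts())
        self._mapping = {}

    def anonymise(self, ip):
        addr = ipaddress.ip_address(ip)
        if ip in KEEP or any(addr in net for net in RFC1918):
            return ip
        if ip not in self._mapping:
            self._mapping[ip] = str(next(self._hosts))
        return self._mapping[ip]


anonymiser = Anonymiser()
first = anonymiser.anonymise("203.0.113.5")     # documentation address used as a stand-in
repeat = anonymiser.anonymise("203.0.113.5")    # the same input maps to the same output
second = anonymiser.anonymise("198.51.100.7")
```

Keeping the mapping consistent across files preserves per-host traffic patterns, which a random replacement per row would destroy.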

Author Contributions

B.A. helped in the conceptualisation, methodology, investigation, validation, traffic collection, labelling of the dataset, analysis and writing of the original draft. V.B. helped in the conceptualisation, review, and supervision. All authors have read and agreed to the published version of the manuscript.

Funding

The authors thank the Malta Digital Innovation Authority (MDIA) and Tertiary Education Scholarship Scheme (TESS) from the Ministry of Education, Sport, Youth, Research and Innovation Malta for funding the tuition of the first author.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Dataset is available online at https://figshare.com/s/f065022eb85701278c58 (accessed on 2 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Chaudhry, S.R.; Liu, P.; Wang, X.; Cahill, V.; Collier, M. A measurement study of offloading virtual network functions to the edge. J. Supercomput. 2022, 78, 1565–1582.
2. Emu, M.; Yan, P.; Choudhury, S. Latency Aware VNF Deployment at Edge Devices for IoT Services: An Artificial Neural Network Based Approach. In Proceedings of the 2020 IEEE International Conference on Communications Workshops (ICC Workshops), Dublin, Ireland, 7–11 June 2020; pp. 1–6.
3. Leivadeas, A.; Kesidis, G.; Ibnkahla, M.; Lambadaris, I. VNF Placement Optimization at the Edge and Cloud. Future Internet 2019, 11, 69.
4. Battisti, A.L.É.; Macedo, E.L.C.; Josué, M.I.P.; Barbalho, H.; Delicato, F.C.; Muchaluat-Saade, D.C.; Pires, P.F.; Mattos, D.P.d.; Oliveira, A.C.B.d. A Novel Strategy for VNF Placement in Edge Computing Environments. Future Internet 2022, 14, 361.
5. Vieira, J.L.; Battisti, A.L.; Macedo, E.L.; Pires, P.F.; Muchaluat-Saade, D.C.; Delicato, F.C.; Oliveira, A.C. Dynamic and Mobility-Aware VNF Placement in 5G-Edge Computing Environments. In Proceedings of the 2023 IEEE 9th International Conference on Network Softwarization (NetSoft), Madrid, Spain, 19–23 June 2023; pp. 53–61.
6. Highnam, K.; Arulkumaran, K.; Hanif, Z.; Jennings, N.R. BETH Dataset: Real Cybersecurity Data for Anomaly Detection Research. CEUR Workshop Proc. 2021, 3095, 1–12.
7. Aldribi, A.; Traoré, I.; Moa, B.; Nwamuo, O. Hypervisor-based cloud intrusion detection through online multivariate statistical change tracking. Comput. Secur. 2020, 88, 101646.
8. Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. In Proceedings of the 4th International Conference on Information Systems Security and Privacy (ICISSP), Funchal, Portugal, 22–24 January 2018; Available online: https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 9 January 2024).
9. UNB. IDS 2018 Datasets. Available online: https://www.unb.ca/cic/datasets/ids-2018.html (accessed on 24 January 2024).
10. Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A. A Detailed Analysis of the KDD CUP 99 Data Set. Available online: https://www.unb.ca/cic/datasets/nsl.html (accessed on 11 July 2023).
11. Mahdavifar, S.; Salem, A.; Victor, P.; Razavi, A.; Garzon, M.; Hellberg, N.; Habibi Lashkari, A. Lightweight Hybrid Detection of Data Exfiltration using DNS based on Machine Learning. In Proceedings of the 2021 the 11th International Conference on Communication and Network Security, Weihai, China, 3–5 December 2021; pp. 80–86.
12. Chung, W.-C.; Wang, Y.-H. The Effects of High-Performance Cloud System for Network Function Virtualization. Appl. Sci. 2022, 12, 10315.
13. Whiteaker, J.; Schneider, F.; Teixeira, R. Explaining Packet Delays under Virtualization. Comput. Commun. Rev. 2011, 41, 38–44.
14. Wang, G.; Ng, T.S.E. The Impact of Virtualization on Network Performance of Amazon EC2 Data Center. In Proceedings of the IEEE INFOCOM Conference, San Diego, CA, USA, 14–19 March 2010; pp. 1–9.
15. Gogunska, K.; Barakat, C.; Urvoy-Keller, G.; Lopez-Pacheco, D. On the Cost of Measuring Traffic in a Virtualized Environment. In Proceedings of the 2018 IEEE 7th International Conference on Cloud Networking (CloudNet), Tokyo, Japan, 22–24 October 2018; pp. 1–6.
16. Lee, C.; Abe, H.; Hirotsu, T.; Umemura, K. Traffic Anomaly Analysis and Characteristics on a Virtualized Network Testbed. IEICE Trans. Inf. Syst. 2011, 94, 2353–2361.
17. Nedyalkov, I. Performance comparison between virtual MPLS IP network and real IP network without MPLS. Int. J. Electr. Comput. Eng. Syst. 2021, 12, 83–90.
18. Arkime. Arkime - Full Capture Solution. Available online: http://arkime.com (accessed on 11 July 2023).
19. Apache. Apache JMeter - User’s Manual: Building a Web Test Plan. Available online: https://jmeter.apache.org/usermanual/build-web-test-plan.html (accessed on 13 July 2023).
20. Ayodele, B. BelieveDjango/http_request. 2024. Available online: https://github.com/BelieveDjango/http_request (accessed on 17 January 2024).
21. Ayodele, B. BelieveDjango/dns_request. 2024. Available online: https://github.com/BelieveDjango/dns_request (accessed on 20 February 2024).
22. MITRE ATT&CK®. Available online: https://attack.mitre.org/ (accessed on 25 February 2024).
23. Ayodele, B. BelieveDjango/labelling_dataset_python. 2024. Available online: https://github.com/BelieveDjango/labelling_dataset_python (accessed on 9 January 2024).
24. Wireshark Go Deep. Available online: https://www.wireshark.org/ (accessed on 9 January 2024).
25. Virustotal. Home. Available online: https://www.virustotal.com/gui/home/upload (accessed on 17 July 2023).
26. AbuseIPDB. IP Address Abuse Reports - Making the Internet Safer, One IP at a Time. Available online: https://www.abuseipdb.com/ (accessed on 9 January 2024).
27. WHOIS. Search, Domain Name, Website, and IP Tools. Available online: https://who.is/ (accessed on 9 January 2024).
28. MxToolbox. DNS Lookup Tool. Available online: https://mxtoolbox.com/Public/Content/Toolhandler.aspx?command=a (accessed on 9 January 2024).
29. Shodan. Available online: https://www.shodan.io (accessed on 9 January 2024).
30. Ayodele, B. BelieveDjango/system_performance_metrics. 2024. Available online: https://github.com/BelieveDjango/system_performance_metrics (accessed on 24 January 2024).

Table 1. The VNFCYBERDATA dataset features with description.

Features	Data Types	Description
Start Time	Date/Time	Session start time
Stop Time	Date/Time	Session stop time
Src IP	IP	Source IP
Src Country	Upper case string	Source country
Src Port	Integer	Source port
Dst IP	IP	Destination IP
Dst Country	Upper case string	Destination country
Dst Port	Integer	Destination port
Packets	Integer	Total number of packets sent AND received in a session
Protocols	Mixed case string	Protocols set for session
IP Protocols	Lower case string	IP protocol number or friendly name
Data bytes	Integer	Total number of data bytes sent AND received in a session
Src databytes	Integer	Total number of data bytes sent by source in a session
Dst databytes	Integer	Total number of data bytes sent by destination in a session
Bytes	Integer	Total number of raw bytes sent AND received in a session
Src Bytes	Integer	Total number of raw bytes sent by source in a session
Dst Bytes	Integer	Total number of raw bytes sent by destination in a session
Payload Src UTF8	Lower case string	First 8 bytes of source payload in hex
Payload Dst UTF8	Lower case string	First 8 bytes of destination payload in hex
TCP Flag SYN	Integer	Count of packets with SYN and no ACK flag set
TCP Flag SYN-ACK	Integer	Count of packets with SYN and ACK flag set
TCP Flag ACK	Integer	Count of packets with only the ACK flag set
TCP Flag PSH	Integer	Count of packets with PSH flag set
TCP Flag FIN	Integer	Count of packets with FIN flag set
TCP Flag RST	Integer	Count of packets with RST flag set
TCP Flag URG	Integer	Count of packets with URG flag set
Session Segments	Integer	Number of segments in session
Session Length	Integer	Session Length in milliseconds
Src MAC	Lower case string	The MAC address of the originating traffic
ICMP Type	Integer	ICMP type field values
ICMP Code	Integer	ICMP code field values
Dst ASN	Upper case string	GeoIP ASN string calculated from the destination IP
Version	Mixed case string	SSL/TLS version field
URI	Mixed case string	URIs for request
Host	Lower case string	DNS lookup hostname
Alt Name	Lower case string	Certificate alternative names
Initial RTT	Integer	Initial round trip time, difference between SYN and ACK timestamp divided by 2 in ms
Type	Upper case string	BGP Type field
GEO	Upper case string	GeoIP country string calculated from the IP from DNS result
Hostname	Lower case string	QUIC host header field
Label	Mixed case string	The class label of the traffic
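The TCP flag counters in Table 1 can be made concrete with a short example. The sketch below assumes the standard TCP flag bit values and the following counting rules: SYN counts packets with SYN and no ACK, SYN-ACK counts packets with both SYN and ACK, ACK counts packets with only the ACK flag set, and the remaining counters count any packet with that flag set. The function name and the sample packet sequence are hypothetical.

```python
# Standard TCP flag bit values.
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20


def count_tcp_flags(flag_bytes):
    """Compute per-session flag counters from the flag byte of each packet."""
    counts = {"SYN": 0, "SYN-ACK": 0, "ACK": 0, "PSH": 0, "FIN": 0, "RST": 0, "URG": 0}
    for flags in flag_bytes:
        if flags & SYN:
            # SYN with ACK is counted separately from a bare SYN.
            counts["SYN-ACK" if flags & ACK else "SYN"] += 1
        if flags == ACK:
            # Only packets carrying ACK and nothing else.
            counts["ACK"] += 1
        for bit, name in ((PSH, "PSH"), (FIN, "FIN"), (RST, "RST"), (URG, "URG")):
            if flags & bit:
                counts[name] += 1
    return counts


# Hypothetical short session: handshake, one data packet, and a FIN.
session = [SYN, SYN | ACK, ACK, PSH | ACK, FIN | ACK, ACK]
session_counts = count_tcp_flags(session)

# A single "Xmas" packet sets the FIN, PSH, and URG flags at once.
xmas_counts = count_tcp_flags([FIN | PSH | URG])
```

Under these rules, a SYN flood would show a high SYN counter with few or no SYN-ACK and ACK packets in the same session, while an Xmas attack raises the FIN, PSH, and URG counters together.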

Table 2. Malignant traffic classification label.

Attack Class Label	Description
Scanning	Port scanning activities, which include scans for vulnerabilities and weaknesses, and brute force attacks
Flood Attack	Includes attacks such as UDP, TCP, and ICMP flooding (combined)
UDP Flood Attack	A type of DoS attack in which UDP packets are targeted at a server
DoS–RST Attack	A TCP reset attack is a form of DoS attack that uses bogus TCP reset packets to terminate an established TCP connection between two parties.
DoS–SYN Attack	A SYN flood is a type of DoS attack in which a huge number of SYN requests are sent to a server in order to overrun the open connections.
DoS–ICMP Flood Attack	This involves using a huge number of ICMP ping or echo-request packets to attack or overload a server
DoS–Xmas Attack	A Christmas attack involves setting the Urgent, Push, and FIN flags of a TCP communication and flooding the network with such traffic
DDoS	DDoS attacks use multiple compromised computer systems as attack traffic sources to attack a target.
ICMP Tunnelling Attack	An ICMP tunnel uses ICMP echo request and reply packets to build a hidden connection between two remote computers (a client and a proxy).
Brute Force	Attacks on authentication and the discovery of hidden content/pages within a web application/application.
DNS Exfiltration	This attack involves removing confidential or sensitive data by embedding it within DNS packets
DNS Amplification	This attack involves sending out a large DNS request and receiving an extremely large response back
DNS Spoofing	This attack involves manipulating DNS records to redirect users toward a fraudulent website or environment
SQL Injection	This attack involves inserting or injecting an SQL query into input data from a front-end application/website, tricking the web or application system into executing SQL commands
Malware Attack	This involves malicious code designed to harm or damage a computing environment
XSS	Cross Site Scripting (XSS) occurs when an attacker uses a web application to send malicious code, usually in the form of a browser-side script
Man-in-the-middle attack	This occurs when an attacker intercepts and potentially alters a communication between two or more parties
ARP spoofing	This involves manipulating the ARP cache of a victim
Remote code execution	This allows attackers to execute arbitrary code on a target system or remote system.
Injection Attack	This allows an attacker to inject malicious code/script into a target system. This is differentiated from SQL injection in this work
Unrestricted file upload	This allows an attacker to upload files without proper validation and controls on a web application.
Directory Traversal Attack	This allows an attacker to navigate through the file system to access files and directories outside the intended directory.

Table 3. Total number of entries and corresponding percentages for benign traffic and each attack type.

Attacks	Total Entries	Percentage
DoS–ICMP Flood Attack	1,376,923	19.61%
DNS Amplification	1,227,407	17.48%
Benign	935,832	13.33%
UDP Flood Attack	734,261	10.46%
DoS–SYN Attack	690,862	9.84%
HTTP Flood Attack	595,995	8.49%
Scanning	547,157	7.79%
Flood Attack	461,734	6.58%
Malware Attack	279,756	3.98%
DNS Spoofing Attack	163,658	2.33%
DNS Exfiltration	2842	0.04%
Directory Traversal Attack	1447	0.02%
DoS–RST Attack	1374	0.02%
Injection Attack	330	0.00%
DoS–Xmas Attack	239	0.00%
DoS	151	0.00%
SQL Injection	129	0.00%
SYN/RST Attack	112	0.00%
XSS	112	0.00%
MiTM	110	0.00%
ARP Spoofing	52	0.00%
Remote Code Execution	43	0.00%
DDoS	12	0.00%
Unrestricted File Upload	10	0.00%
Brute Force	8	0.00%
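The Percentage column of Table 3 can be recomputed from the Total Entries column, and the same counts show how unevenly the classes are represented. The following is a minimal sketch rather than part of the dataset tooling: the variable names are illustrative, and the inverse-frequency weighting (total / (number of classes × class count)) is one common heuristic for imbalanced training data, assumed here for illustration only.

```python
# Per-class entry counts copied from Table 3.
table3_counts = {
    "DoS–ICMP Flood Attack": 1_376_923,
    "DNS Amplification": 1_227_407,
    "Benign": 935_832,
    "UDP Flood Attack": 734_261,
    "DoS–SYN Attack": 690_862,
    "HTTP Flood Attack": 595_995,
    "Scanning": 547_157,
    "Flood Attack": 461_734,
    "Malware Attack": 279_756,
    "DNS Spoofing Attack": 163_658,
    "DNS Exfiltration": 2_842,
    "Directory Traversal Attack": 1_447,
    "DoS–RST Attack": 1_374,
    "Injection Attack": 330,
    "DoS–Xmas Attack": 239,
    "DoS": 151,
    "SQL Injection": 129,
    "SYN/RST Attack": 112,
    "XSS": 112,
    "MiTM": 110,
    "ARP Spoofing": 52,
    "Remote Code Execution": 43,
    "DDoS": 12,
    "Unrestricted File Upload": 10,
    "Brute Force": 8,
}

total_entries = sum(table3_counts.values())
n_classes = len(table3_counts)

# Share of each class in percent, rounded to two decimals as in Table 3.
percentages = {
    label: round(100 * count / total_entries, 2)
    for label, count in table3_counts.items()
}

# Illustrative inverse-frequency class weights (an assumption, not the
# authors' method): rare classes receive proportionally larger weights.
class_weights = {
    label: total_entries / (n_classes * count)
    for label, count in table3_counts.items()
}

for label, count in table3_counts.items():
    print(f"{label:28s} {count:>9,d} {percentages[label]:6.2f}%")
```

Classes with only a handful of entries round to 0.00%, which is why a weighting or resampling step is often considered before training on data of this shape.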

Table 4. The number of sessions captured for each protocol type/set.

Protocol Set	vLoad Balancer	vIDS	vDNS	vRouter	vProxy
http, tcp	129,906	19,358	376	222	7917
http2, tcp, tls	389	-	492	2331	1491
icmp	1733	781,475	1715	1,065,293	5484
igmp	4	15	11	12	16
ssdp, udp	5	1200	43	-	754
tcp	16,241	895,791	2042	198,670	16,690
tcp, tls	58,455	15,421	2238	3548	6790
udp	366	1298	898,431	763	1527
udp, dns	48,821	410,559	957,211	-	98,912
udp, mdns	5764	208	29	-	223
udp, ntp	3627	5489	29	-	114
dns, udp	12,798	4558	13,257	102,877	5203
mdns, udp	1083	77	52	185	216
quic, udp	42	2	34	42	8
tcp, tls, http2	5863	-	35	-	-
ntp, udp	723	114	36	2858	253
tls, http2, tcp	2	-	4	-	-
tls, tcp	91	-	719	421	693
tls, tcp, http2	5	-	-	-	165
tcp, http	1,057,362	14,919	487	1481	1001
ssh, tcp	22,603	11,812	1	4781	4
udp, quic	50	509	414	-	11
tcp, http2, tls	908	1762	153	-	-
tcp, http, tls	16	16	-	-	-
tcp, ssh	39,607	1355	-	7330	377
tcp, tls, http	5	2	-	-	-
tcp, rdp, http	1	-	-	-	-
http2, tcp, http	6	-	-	-	-
rdp, tcp, http	6	-	-	-	-
dnp3, dns, udp	1	-	-	-	-
dhcp, udp	-	163	73	107	115
http, tcp, tls	-	2973	301	-	3563
tcp, dns	-	14,234	37	-	9
udp, dns, dnp3	-	9	20	-	-
udp, isakmp	-	4	11	-	250
llmnr, udp	-	21	15	-	19
udp, dhcp	-	267	183	239	272
udp, ssdp	-	4	542	665	-
lldp	-	1	-	-	-
udp, llmnr	-	40	45	28	9
tcp, http, dns	-	7	-	-	-
tcp, http, http2	-	1	-	-	-
tcp, tls, dns	-	14	-	-	-
pim	-	3	-	2	2
portmap, udp	-	2	-	-	-
tcp, tls, ssh	-	10	-	-	-
udp, bjnp	-	6	-	22	6
udp, rip	-	4	-	11	4
ospf	-	1	-	-	2
radius, udp	-	2	-	-	2
snmp, udp	-	231	-	-	-
snmp, udp, ldap	-	1	-	-	-
stun, udp	-	4	-	-	-
tcp, dcerpc, dns	-	1	-	-	-
tcp, dns, ldap	-	1	-	-	-
tcp, dns, smb	-	2	-	-	-
tcp, dns, thrift	-	1	-	-	-
tcp, http, ldap	-	1	-	-	-
tcp, nfs, dns	-	1	-	-	-
tcp, postgresql, dns	-	4	-	-	-
tcp, rdp, dns	-	2	-	-	-
tcp, redis, dns	-	1	-	-	-
udp, dns, ldap	-	9	-	-	2
udp, krb5	-	4	-	-	-
valve-a2s, udp	-	1	-	-	-
http, tcp, rdp	-	1	-	-	2
rip, udp	-	2	-	-	-
ssh, tcp, tls	-	3	-	-	-
udp, portmap	-	4	-	22	2
tcp, dns, http	-	2	-	-	-
tcp, dns, tls	-	5	-	-	-
tcp, http2, http	-	3	-	-	-
bjnp, udp	-	4	-	-	-
http2, http, tcp, tls	-	-	83	-	510
udp, ldap, dns	-	-	6	-	-
smb, tcp	-	-	7	-	-
http2, tls, tcp	-	-	238	160	1
http, tcp, tls, http2	-	-	3	-	-
dns, tcp	-	-	-	769	-
ftp, tcp	-	-	-	3659	-
smtp, tcp	-	-	-	1	-
smtp, tcp, tls	-	-	-	1005	-
dns, dnp3, udp	-	-	-	1	-
http, tcp, gh0st	-	-	-	-	1
tcp, ldap	-	-	-	-	2
udp, snmp	-	-	-	-	12
tcp, irc	-	-	-	-	13
http, http2, tcp	-	-	-	-	2
http, http2, tcp, tls	-	-	-	-	59
tcp, ssh, tls	-	-	-	-	20
tls, tcp, http	-	-	-	-	352
udp, dnp3, dns	-	-	-	-	2
udp, stun	-	-	-	-	4
tls, http, tcp	-	-	-	-	13
tls, tcp, http2, http	-	-	-	-	102
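Several rows of Table 4 list the same protocols in a different order (for example "http, tcp" and "tcp, http", or "udp, dns" and "dns, udp"). Whether the order carries meaning is not stated here, so merging such rows is an analysis choice rather than a property of the dataset. Under that assumption, the sketch below sorts the protocol names to obtain an order-independent key and sums the session counts; the function name and the sample rows (taken from the vLoad Balancer column) are illustrative.

```python
from collections import Counter


def canonical_protocol_set(protocol_set: str) -> str:
    """Return the protocol names sorted alphabetically and re-joined."""
    parts = sorted(p.strip().lower() for p in protocol_set.split(",") if p.strip())
    return ", ".join(parts)


# A few vLoad Balancer session counts copied from Table 4.
vlb_sessions = {
    "http, tcp": 129_906,
    "tcp, http": 1_057_362,
    "udp, dns": 48_821,
    "dns, udp": 12_798,
    "icmp": 1_733,
}

# Sum the counts of rows that share the same canonical protocol set.
merged = Counter()
for protocol_set, sessions in vlb_sessions.items():
    merged[canonical_protocol_set(protocol_set)] += sessions

for protocol_set, sessions in merged.most_common():
    print(f"{protocol_set:12s} {sessions:>9,d}")
```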

Table 5. Network information of VNF appliances deployed.

Hypervisor	VNF IP Address	VNF Appliance Type	Device/Application Connected to VNF
ESXi-Server1; 10.59.32.27; Intel(R) Xeon(R) CPU E5-2609 @ 2.40 GHz × 2 sockets	10.59.32.35	vRouter_vFirewall (pfSense); OS: FreeBSD; VNF resource: 16 GB RAM, 4 vCPU	VMs
	10.59.32.36	vIDS (Snort); OS: Ubuntu 22.04; VNF resource: 16 GB RAM, 2 vCPU	Application server, database
	10.59.32.37	vDNS (MaraDNS); OS: Ubuntu 22.04; VNF resource: 16 GB RAM, 2 vCPU	VMs
ESXi-Server2; 10.59.32.28; Intel(R) Xeon(R) CPU E5-2609 @ 2.40 GHz × 2 sockets	10.59.32.32	vLoad Balancer (Nginx); OS: Ubuntu 22.04; VNF resource: 16 GB RAM, 2 vCPU	Web servers
	10.59.32.55	vProxy (Squid); OS: Ubuntu 22.04; VNF resource: 16 GB RAM, 4 vCPU	VMs

OS—operating system, VM—virtual machine.

Table 6. vSwitch connecting the appliances and capture solutions.

Port Groups	vSwitch	Uplink	Promiscuous Mode	Appliance Connected
VM Network	vSwitch0	vmnic0 (10,000 Mbps, Full duplex)	NO	All VNF Appliances
LAN	LAN	vmnic1 (10,000 Mbps, Full duplex)	NO	vRouter, vIDS
Mirroring	Mirroring-vSwitch	vmnic2 (10,000 Mbps, Full duplex)	YES	All VNF Appliances

Table 7. The type of malware attack, its behaviour, and the corresponding SHA-256 hash for each VNF appliance.

VNF Type	Malware Type	Behaviour	Malware Hash (SHA-256)
vLoad Balancer	Trojan Gafgyt	Execution, Defence Evasion, C2	864533db99aade7897c872cffb6e991e166adb370bbad3c0ec969bf646d92dcc
vLoad Balancer	Trojan Dropper	Discovery, C2, credential access, persistence, defence evasion, transfer of information, stop services	41ae4ee74cd60c0cdfa99ae870359d774b2e7cc583a94f1fff7b430634ef8b3b
vIDS	Trojan Okrov, malgo	Defence evasion, C2, discovery	f5ab886589558a8a265c216f6754d1477c19ca46d8ed4d57a1ee975c590e4aab
	Trojan Shellcode, Mettle	Obfuscation, C2	c7b3d3da745510a14e3cc3ea75328b5bd948e1bd1b7d629c8fb348ace00af2fe
	Trojan Memz	Execution, evasion, discovery, C2	8621177f7208c8fd4447010f3b0c45ef7b8aa9ca2900e989cb1d3c8e3054d838
	Trojan Mirai	C2, brute force, persistence, botnet	fb7670ca5c5ef55a0b481a53d9ad2629a95dad7f34b0f904f5379ac275520167
vDNS	Trojan Ransomware, worm, BadRabbit	Initial access, execution, persistence, privilege escalation, evasion, credential access, lateral movement, C2, transfer of information, encryption	630325cac09ac3fab908f903e3b00d0dadd5fdaa0875ed8496fcbb97a558d0da
	Trojan Ransomware, shellcode	Execution, persistence, privilege escalation, defence evasion, discovery and C2	6afc9fe877c4656707db249ff2ea63536ec3726f8bc3e49fb30e085a6439f106
	Trojan Asyncrat, msil	Defence evasion, privilege escalation, collection, C2	b99b8c52dd67d2a9d4b8a58664056b7ce64f271e25efe3a3b8adf33c70d3db46
	Trojan Rrat, revengerant	System shutdown/reboot, initial access, execution, persistence, privilege escalation, credential access, discovery, C2	a71916846ff796a16a2305782a656adbc4b21be2343773c8832ae73d2a7a9e6f
	Trojan Gulpix, plugx	Execution, persistence, privilege escalation, defence evasion, discovery and C2	996b9e029e0d93efa265c69b2cf1cfac64b3b848ae936b2c43f2decdd591757e
vRouter	Trojan Rrat, revengerant	System shutdown/reboot, initial access, execution, persistence, privilege escalation, credential access, discovery, C2	a71916846ff796a16a2305782a656adbc4b21be2343773c8832ae73d2a7a9e6f
	Trojan Gulpix, plugx	Execution, persistence, privilege escalation, defence evasion, discovery and C2	996b9e029e0d93efa265c69b2cf1cfac64b3b848ae936b2c43f2decdd591757e
	Trojan Gafgyt	C2, brute force, persistence, botnet, download and execution	fb7670ca5c5ef55a0b481a53d9ad2629a95dad7f34b0f904f5379ac275520167
vProxy	Trojan Redcap, wingo	Execution, discovery	1fe0e05682324decfe2b44c1c44301212ecd2cdedddd19926fc39eb8e41917f7
	Trojan Aqorh, qwiogr	Execution, discovery	6ed9979a6496f56229320348e35b7dbb0f2b5d0760abe7ff339bd369b619097c
	Trojan Tsunami, flooder	Execution, persistence, privilege escalation, defence evasion, discovery and C2	7b35cd2a604c911aae05d41d2e4fe0d28ee435902db341ed4e00e6d9712b7c9d
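When the hashes in Table 7 are copied into scripts or threat-intelligence lookups, a quick format check helps catch transcription errors: a SHA-256 digest in hexadecimal is exactly 64 characters long. The sketch below validates that format and groups rows by hash to show which malware-type labels share a digest. The helper name and the three sample rows (copied from Table 7) are illustrative choices, not part of the dataset.

```python
import re
from collections import defaultdict

# A SHA-256 digest written in lowercase hexadecimal has exactly 64 characters.
SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def is_sha256_hex(value: str) -> bool:
    """Return True if value looks like a lowercase hex SHA-256 digest."""
    return SHA256_HEX.fullmatch(value) is not None


# (VNF type, malware type, hash) rows copied from Table 7.
rows = [
    ("vLoad Balancer", "Trojan Gafgyt",
     "864533db99aade7897c872cffb6e991e166adb370bbad3c0ec969bf646d92dcc"),
    ("vIDS", "Trojan Mirai",
     "fb7670ca5c5ef55a0b481a53d9ad2629a95dad7f34b0f904f5379ac275520167"),
    ("vRouter", "Trojan Gafgyt",
     "fb7670ca5c5ef55a0b481a53d9ad2629a95dad7f34b0f904f5379ac275520167"),
]

# Collect rows whose hash does not have the expected format.
malformed = [row for row in rows if not is_sha256_hex(row[2])]

# Group malware-type labels by hash, keeping hashes with more than one label.
labels_by_hash = defaultdict(set)
for _vnf, malware_type, digest in rows:
    labels_by_hash[digest].add(malware_type)
shared = {h: labels for h, labels in labels_by_hash.items() if len(labels) > 1}

print("malformed rows:", malformed)
print("hashes listed under several labels:", shared)
```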

Table 8. The number of entries per class label for each VNF appliance.

VNF Appliance	Class label	Total Entries	Percentage
vLoad Balancer	Benign	696,861	49.55%
	Scanning	113,537	8.07%
	Malware Attack	38	0.00%
	Directory Traversal Attack	43	0.00%
	Code Injection	9	0.00%
	HTTP Flood Attack	595,995	42.37%
vIDS	Benign	52,379	2.40%
	Scanning	237,606	10.88%
	DoS–ICMP Flood Attack	778,732	35.66%
	UDP Flood Attack	119	0.01%
	DNS Exfiltration	2171	0.10%
	DNS Amplification	234,688	10.75%
	DoS–SYN Attack	690,862	31.63%
	ARP Spoofing	52	0.00%
	Brute Force	8	0.00%
	Directory Traversal Attack	1255	0.06%
	DNS Amplification	1	0.00%
	Injection Attack	330	0.02%
	Malware	185,521	8.49%
	MiTM	15	0.00%
	Remote Code Execution	6	0.00%
	SQL Injection	127	0.01%
	Unrestricted File Upload	10	0.00%
	XSS	112	0.01%
vDNS	Benign	53,149	2.83%
	Scanning	434	0.02%
	Malware Attack	6075	0.32%
	UDP Flood Attack	734,142	39.06%
	DNS Spoofing Attack	163,658	8.71%
	DNS Exfiltration	671	0.04%
	DNS Amplification	921,244	49.02%
vRouter_vFirewall	Benign	71,041	5.08%
	Scanning	180,855	12.94%
	Flood Attack	461,734	33.04%
	Malware Attack	85,445	6.11%
	DoS–ICMP Flood Attack	598,191	42.80%
	DoS–Xmas Attack	239	0.02%
vProxy	Benign	62,402	40.73%
	Scanning	14,725	9.61%
	DNS Amplification	71,474	46.65%
	Directory Traversal Attack	149	0.10%
	Malware	2677	1.75%
	MiTM	95	0.06%
	Remote Code Execution	27	0.02%
	SQL Injection	2	0.00%
	DoS	151	0.10%
	DoS–RST Attack	1374	0.90%
	Remote Code Injection	1	0.00%
	SYN/RST Attack	112	0.07%
	DDoS	12	0.01%
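The per-appliance counts in Table 8 can be cross-checked against the dataset-wide totals in Table 3 by summing each class label across the five appliances. The sketch below does this for two labels, Benign and Scanning. The dictionary layout and variable names are illustrative. Some labels are spelled differently in Table 8 (for example "Malware" and "Malware Attack"), so a label-normalisation step, not shown here, would be needed to extend the check to every class.

```python
# Per-appliance counts for two class labels, copied from Table 8.
table8_counts = {
    "vLoad Balancer": {"Benign": 696_861, "Scanning": 113_537},
    "vIDS": {"Benign": 52_379, "Scanning": 237_606},
    "vDNS": {"Benign": 53_149, "Scanning": 434},
    "vRouter_vFirewall": {"Benign": 71_041, "Scanning": 180_855},
    "vProxy": {"Benign": 62_402, "Scanning": 14_725},
}

# Dataset-wide totals for the same labels, copied from Table 3.
table3_totals = {"Benign": 935_832, "Scanning": 547_157}

# Sum each label over all appliances.
summed = {
    label: sum(per_class.get(label, 0) for per_class in table8_counts.values())
    for label in table3_totals
}

# Keep only the labels whose summed count differs from the Table 3 total.
mismatches = {
    label: (summed[label], expected)
    for label, expected in table3_totals.items()
    if summed[label] != expected
}

print("summed:", summed)
print("mismatches:", mismatches)
```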

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
