Introducing the UWF-ZeekDataFall22 Dataset to Classify Attack Tactics from Zeek Conn Logs Using Spark’s Machine Learning in a Big Data Framework

Abstract: This study introduces UWF-ZeekDataFall22, a newly created dataset labeled using the MITRE ATT&CK framework. Although the focus of this research is on classifying the never-before-classified resource development tactic, the reconnaissance and discovery tactics were also classified. The results were also compared to a similarly created dataset, UWF-ZeekData22, created in 2022. Both of these datasets, UWF-ZeekDataFall22 and UWF-ZeekData22, created using Zeek Conn logs, were stored in a Big Data framework, Hadoop, and Apache Spark was used in this framework for machine learning classification. To summarize, the uniqueness of this work is its focus on classifying attack tactics. For UWF-ZeekDataFall22, the binary as well as the multinomial classifier results were compared, and overall, the results of the binary classifiers were better than those of the multinomial classifiers. In the binary classification, the tree-based classifiers performed better than the other classifiers, and the decision tree and random forest algorithms performed almost equally well in the multinomial classification too. Taking training time into consideration, decision trees can be considered the most efficient classifier.


Introduction
Due to the remarkable advancements in technology and the increase in the number of internet users in recent times, there has been an unprecedented surge in the amount of data generated and exchanged over the internet. Millions of financial transactions are conducted on the internet, and cloud-based systems store a plethora of sensitive data, including personal, health, and other confidential information. Given the ever-increasing magnitude of data collected and utilized, ensuring robust security measures and safeguarding against unauthorized network intrusions have become equally imperative. Also, due to the rise in network attacks, it has become very important to detect intrusions (attacks) before they happen rather than after they happen. Therefore, in order to develop stronger network intrusion systems that detect attacks before they happen, it is important to understand what an adversary is planning and how the adversary is planning to attack. The MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) framework defines tactics as being the motives of the adversary. The MITRE ATT&CK framework, which is a foundation of threat models and methodologies, is based on 14 tactics and several techniques, with some techniques belonging to more than one tactic [1]. Hence, in this work, data are labeled as per the MITRE ATT&CK framework.
In one related study, the data labeled as per the MITRE ATT&CK framework had approximately 12 million packets and 84 characteristics. It was found that the accuracy of logistic regression was slightly lower than the accuracy of SNN; however, the training time for SNN was longer than that for logistic regression.
In another study by Kejriwal et al. (2022) [12], the intrusion detection accuracy of different algorithms, such as logistic regression, random forest, KNN, XGBoost, Gaussian naïve Bayes, and a multi-layer perceptron (MLP) classifier, was compared. The dataset used in their study was CIC-IDS2017. Of all the ML algorithms, the random forest algorithm was found to have the best results. Logistic regression was found to perform better than naïve Bayes, which had the lowest performance of all.
In yet another comparative study, Disha and Waheed (2021) [13] used the GBT classifier for binary classification to determine network intrusions using the UNSW-NB15 dataset. A Chi-squared test was used to remove irrelevant features, and GBT was found to have the second-highest accuracy, after decision trees.
In Swamy et al. (2012) [14], the decision tree (DT) classifier and min-max normalization algorithms were used as part of the intrusion-detection process using the KDD99 dataset. The DT classifier had good accuracy, with low false positive and false negative rates.
In another study, Mulay et al. (2010) [15] compared the performance of the decision tree classifier with support vector machines (SVMs). The accuracy of decision trees was found to be high, and their training and testing times were low compared to SVMs.
Jha and Ragha (2013) [16] used the SVM classifier on the NSL-KDD dataset, which consists of selected records from the KDD99 dataset. The information gain algorithm was used in their study to extract relevant features. It was found that the reduced dataset increased the detection rate of the SVM and also reduced the training and testing times. Belouch et al. (2018) [17] compared four well-known classification algorithms, SVM, naïve Bayes, decision tree, and random forest, using Apache Spark with the UNSW-NB15 dataset. They found that random forest gave the best performance, followed by decision tree and naïve Bayes.
All the above works performed classification on network attacks. There are no works on the classification of tactics, which is the novelty of our work. Moreover, none of the above works performed classification on tactics using the two new uniquely created datasets, UWF-ZeekData22 and UWF-ZeekDataFall22.

The Datasets
The Zeek Conn log MITRE ATT&CK framework labeled datasets, UWF-ZeekData22 and UWF-ZeekDataFall22, available in [5] and generated using the Cyber Range at the University of West Florida (UWF), were used for this analysis. UWF-ZeekData22, created in 2022, had 9,280,869 attack records and 9,281,599 benign records [18]; a breakdown of this dataset's tactics is presented in Bagui et al. (2023) [18]. The new dataset, UWF-ZeekDataFall22, created in 2023, has 350,001 attack records and 350,339 benign records [5]. The breakdown of the attack tactics in the UWF-ZeekDataFall22 dataset is presented in Table 1. This dataset had the highest number of resource development tactic records, followed by reconnaissance and discovery; hence, in this work, we focus on these three tactics.
The resource development tactic encompasses multiple techniques, with a focus on obtaining resources to help support targeting. These techniques involve creating, purchasing, or illicitly obtaining resources, including, but not limited to, infrastructure, accounts, or capabilities [4].
The reconnaissance tactic involves various techniques that are conducted to facilitate both active and passive information gathering for target development. Reconnaissance techniques could involve trying to identify potential attack surfaces and entry points by scanning ports, mapping the topology of the network, and researching the target's online digital footprint [2].
The discovery tactic involves various directed techniques meant to determine network infrastructure details, such as identifying the exact services and versions running on remote hosts or collecting technical details about the local network architecture. Going beyond the reconnaissance tactic, the discovery tactic is set on gathering precise information about the target that can be used to plan and execute specific attacks. It allows attackers or security professionals to identify vulnerabilities, misconfigurations, and targets that can be exploited [2].

The Zeek Conn Log Files of the UWF-ZeekDataFall22 Dataset
The Zeek Conn log files of the UWF-ZeekDataFall22 dataset were used in this work. The Zeek Conn log files, explained in [5,19], track the protocols and associated information such as IP addresses, duration, two-way bytes, states, packets, and tunnel information. In short, the Conn log files provide all the data regarding the connection between two points. The full list of the attributes of the Conn log files is available in [5,19]. The "mitre_attack" attribute was added to label the data.

Experimentation
The datasets, stored in parquet format on HDFS, were read into a Spark DataFrame for processing. A Spark DataFrame, organized in the form of rows and columns, allows for the easier processing of large amounts of data.
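As an illustration, a minimal PySpark sketch of this loading step is given below; the HDFS path and application name are placeholders, not the authors' actual configuration:

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; the application name is arbitrary.
spark = SparkSession.builder.appName("uwf-zeekdatafall22").getOrCreate()

# Read the parquet-formatted Zeek Conn logs from HDFS into a Spark DataFrame.
# The HDFS path below is a hypothetical placeholder.
conn_df = spark.read.parquet("hdfs:///datasets/UWF-ZeekDataFall22/conn.parquet")

conn_df.printSchema()    # inspect the Conn log attributes
print(conn_df.count())   # number of records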

Overview of Processing UWF-ZeekDataFall22
First, the input dataset, UWF-ZeekDataFall22, was binned. Since this dataset has several continuous attributes, these attributes were binned to smooth the data and remove noise. Binning was also used to prepare the data for algorithms that require discrete values, like the decision tree and random forest algorithms as well as gradient-boosting trees. Binning also helps to address any over-fitting issues [20].
Figure 1 presents an overview of the experimentation. As presented in Figure 1, the binned data were then used to calculate information gain. Information gain was used to determine the relevance of the features or attributes. Machine learning (ML) using Spark was performed on this preprocessed dataset, and Spark's optimum parameters were determined before ML was run. For the ML algorithms, the input dataset was divided into training and testing sets with a ratio of 80:20. Apache Spark's machine learning libraries for naïve Bayes, logistic regression, random forest, gradient-boosting trees, support vector machines, and decision trees were used to train and detect the network tactics. To assess the effectiveness of the machine learning classifiers, various evaluation metrics, such as accuracy, precision, and recall, were utilized for comparison.
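The split and the timing of a single classifier run can be sketched as follows; the "features" and "label" columns are assumed to have been prepared beforehand (e.g., with a VectorAssembler), and the seed is arbitrary:

import time
from pyspark.ml.classification import DecisionTreeClassifier

# 80:20 train/test split, as described above.
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# Train one classifier and record the training and testing times separately,
# since both are reported in the results tables.
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")

start = time.time()
model = dt.fit(train_df)
train_time = time.time() - start

start = time.time()
predictions = model.transform(test_df)
predictions.count()      # force evaluation of the lazy transformation
test_time = time.time() - start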

Preprocessing Using Binning
From the Zeek Conn log files of the UWF-ZeekDataFall22 dataset, the following features were binned:
• dest_ip and src_ip for IP addresses;
• dest_port and src_port for port numbers;
• local_orig and local_resp, which are Boolean data types;
• protocol, conn_state, history, and service, all of which are nominal attributes;
• duration, orig_bytes, orig_pkts, orig_ip_bytes, resp_bytes, resp_pkts, resp_ip_bytes, and missed_bytes, which are all continuous valued attributes.

Binning IP Addresses
For binning the IP addresses, the commonly recognized network classifications [21] of A, B, C, D, and E were used, each of which pertains to a specific range of the first octet in the IP address. Null and non-applicable values are assigned a value of 0, first-octet values of 0-127 are assigned a value of 1, and so on. Table 2 presents the classification after binning the IP addresses.
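A sketch of this first-octet binning in PySpark is shown below, assuming IPv4 address strings in the src_ip and dest_ip columns; the boundaries are the conventional classful ranges implied by Table 2, and the helper name is illustrative:

from pyspark.sql import functions as F

def bin_ip(col_name):
    # Extract the first octet of the dotted-quad address as an integer.
    octet = F.split(F.col(col_name), r"\.").getItem(0).cast("int")
    # Nulls/non-applicable values -> 0; classes A-E -> 1-5, as per Table 2.
    return (F.when(F.col(col_name).isNull(), 0)
             .when(octet <= 127, 1)   # class A
             .when(octet <= 191, 2)   # class B
             .when(octet <= 223, 3)   # class C
             .when(octet <= 239, 4)   # class D
             .otherwise(5))           # class E

df = (conn_df.withColumn("src_ip_bin", bin_ip("src_ip"))
             .withColumn("dest_ip_bin", bin_ip("dest_ip")))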

Binning Port Numbers
Port numbers were binned following the Internet Assigned Numbers Authority (IANA) [22], which administers ports 0 through 65,535. Binning was performed as per well-known ports, registered ports, and dynamic/private ports. Table 3 presents the binning of port numbers in UWF-ZeekDataFall22.

Binning Boolean and Nominal Attributes
Booleans and nominal values are the non-numeric data in the dataset. The StringIndexer method from MLlib [23], Apache Spark's scalable machine learning library, was used to convert the non-numeric values into numbers. This method also converts invalid or null values to integer values for binning. The columns binned using this method were local_orig, local_resp, protocol, conn_state, history, and service, as presented in Table 4. Table 4 shows the number of bins generated for each of these attributes.

Binning Continuous Valued Attributes
Continuous values were binned as per [19]. Null values were dropped from the columns, and then a 10% trim was performed from both ends. The trimmed version was used to calculate the mean and standard deviation, and the edges of the bins were generated as per Algorithm 1, adopted from [19]. The Bucketizer function in PySpark was then used to generate bins using these edges. To maintain the desired number of bins and to avoid redundant bin ranges, the moving-mean logic, presented in Algorithm 2 and also adopted from [19], was applied while establishing the edges. The numbers of bins generated for all continuous valued attributes are presented in Table 5.
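A condensed sketch of these two steps is given below; the list of nominal columns follows the text above, while the bin edges for the continuous column are placeholders standing in for the edges produced by Algorithms 1 and 2:

from pyspark.ml.feature import StringIndexer, Bucketizer

# StringIndexer maps each nominal/Boolean column to numeric indices;
# handleInvalid="keep" sends null/unseen values to an extra index, which
# approximates the handling of invalid values described above.
for c in ["local_orig", "local_resp", "protocol", "conn_state", "history", "service"]:
    indexer = StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
    df = indexer.fit(df).transform(df)

# Bucketizer assigns each continuous value to a bin given precomputed edges.
# The edges below are illustrative only; the paper derives them from the
# trimmed mean and standard deviation (Algorithm 1) with the moving-mean
# adjustment (Algorithm 2).
edges = [float("-inf"), 0.5, 1.5, 4.0, 16.0, float("inf")]
bucketizer = Bucketizer(splits=edges, inputCol="duration",
                        outputCol="duration_bin", handleInvalid="keep")
df = bucketizer.transform(df)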

Information Gain
After the binning process was completed, information gain was used to assess the relevance of each of the 18 features from the Zeek Conn log files of the UWF-ZeekDataFall22 dataset using the binned data. The information gain algorithm was used to extract the relevant attributes.
Information gain is the difference between a class's entropy and the entropy of the class given a selected feature split, with entropy measuring the extent of randomness in the dataset [24]. It is an assessment of the usefulness of a feature in the classification.
The following calculations [24] were performed on each attribute to produce the information gain values used for ranking:

InfoGain(Class, Attribute) = Entropy(Class) − Entropy(Class|Attribute) (1)

where

Entropy(Class) = −Σ_c p(c) log₂ p(c) (2)

and

Entropy(Class|Attribute) = Σ_v (|S_v|/|S|) × Entropy(Class_v) (3)

where p(c) is the proportion of records belonging to class c, S_v is the subset of records for which the attribute takes the value v, and Class_v is the class distribution within S_v.

Information gain was calculated for the full dataset and for each of the three tactics, resource development, discovery, and reconnaissance, individually. The information gain values for the attributes in the Zeek Conn logs using the full dataset are presented in Table 6, and those for resource development, discovery, and reconnaissance in Tables 7, 8, and 9, respectively. An analysis of the results in the different information gain tables shows that the attribute rankings are very similar, although a couple of attributes might have been flipped. The last three attributes (those with an information gain value of zero) were the same in all cases.
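As an illustration, the calculation above can be sketched in PySpark as follows. The function computes the information gain of one binned column against the label; "mitre_attack" is the label attribute of this dataset, and the implementation is a simplified sketch rather than the authors' code:

import math
from pyspark.sql import DataFrame

def entropy(counts):
    # Entropy of a discrete distribution given raw counts.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(df: DataFrame, feature: str, label: str = "mitre_attack") -> float:
    # Entropy of the class distribution, Equation (2).
    class_counts = [r["count"] for r in df.groupBy(label).count().collect()]
    h_class = entropy(class_counts)
    total = sum(class_counts)

    # Expected entropy after splitting on the feature, Equation (3).
    by_value = {}
    for r in df.groupBy(feature, label).count().collect():
        by_value.setdefault(r[feature], []).append(r["count"])
    h_split = sum((sum(c) / total) * entropy(c) for c in by_value.values())

    # Information gain, Equation (1).
    return h_class - h_split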

Machine Learning Algorithms
Based on a review of the literature on the most commonly used machine learning algorithms for classification analysis and comparison, the following supervised machine learning algorithms were used in this work: decision tree, support vector machine, random forest, naïve Bayes, logistic regression, and gradient-boosting tree.

Decision Trees
A decision tree (DT) algorithm follows a tree-like model. At the root of the tree is the attribute or feature with the highest information gain. The next level of the tree is determined by the attribute with the next-highest information gain, and so on. Thus, the algorithm works by recursively splitting the data into subsets based on the most significant feature at each node of the tree.

Support Vector Machines
Support vector machines (SVM) work by mapping data to a high-dimensional feature space so that data points can be categorized even when the data are not otherwise linearly separable. A separator between the categories is found. Then, the data are transformed in such a way that the separator can be drawn as a hyperplane which separates the data into classes.

Random Forest
Random forest (RF) algorithms grow multiple decision trees, which are merged together for a more accurate prediction. The logic behind the random forest model is that multiple uncorrelated models (the individual decision trees) perform better as a group. When using the random forest for classification, each tree gives a classification or "vote"; the forest chooses the classification with the majority of "votes".

Naïve Bayes
Naïve Bayes (NB), which is probabilistic in nature and based on the Bayes theorem, is commonly used for classification tasks. It is based on two key assumptions: that the features are independent of one another, and that each feature contributes equally to the outcome.

Logistic Regression
Logistic regression is a statistical model used for binary classification, which predicts the probability of an object belonging to one of two groups based on the input variables. It assumes a linear relationship between the features and the log-odds of the outcome and can handle both continuous and discrete variables.

Gradient-Boosting Trees
Gradient-boosting trees are a popular classification algorithm due in part to their ease of interpretation. A gradient-boosting model utilizes a large number of trees, each splitting the data into segments, often starting with the most informative attribute. The gradient aspect stems from building the trees sequentially, with a higher emphasis placed on correcting the errors of the earlier trees. Taken together, this provides an extensive review of large quantities of data with a clear delineation of the most accurate pathway through the data [25].
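For reference, the six classifiers can be instantiated from Apache Spark's MLlib as sketched below; the feature columns listed are illustrative (e.g., the top attributes by information gain), and MLlib default hyperparameters are assumed:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import (DecisionTreeClassifier, RandomForestClassifier,
                                       NaiveBayes, LogisticRegression,
                                       GBTClassifier, LinearSVC)

# Assemble the (binned) attributes into a single feature vector.
assembler = VectorAssembler(
    inputCols=["src_ip_bin", "dest_ip_bin", "src_port_bin"],  # illustrative subset
    outputCol="features")

# The six classifiers compared in this work; LinearSVC is MLlib's linear SVM.
classifiers = {
    "DT":  DecisionTreeClassifier(labelCol="label", featuresCol="features"),
    "RF":  RandomForestClassifier(labelCol="label", featuresCol="features"),
    "NB":  NaiveBayes(labelCol="label", featuresCol="features"),
    "LR":  LogisticRegression(labelCol="label", featuresCol="features"),
    "GBT": GBTClassifier(labelCol="label", featuresCol="features"),
    "SVM": LinearSVC(labelCol="label", featuresCol="features"),
}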

Results
The first step was to determine the best set of configuration parameters for Spark in relation to the UWF-ZeekDataFall22 dataset. Since DT is a robust classifier [26], the Spark configuration parameters were tested using the DT classifier. Finally, all six classifiers were run based on the best set of configuration parameters for Spark.

Determining Spark's Best Configuration Parameters
To determine Spark's optimum parameters, the DT classifier was run using all 18 attributes from the Conn log files of the UWF-ZeekDataFall22 dataset while varying the following Spark configuration parameters: executor count, executor core count, total executor cores, executor memory, and total executor memory.
Executor count defines the number of executors available for each node [23].
Executor cores represent the computational power of the CPU. This parameter defines the number of cores available for each executor and, hence, the number of concurrent tasks that can be run on each executor [23].
Spark's executor memory is the amount of memory provided for each executor to complete its tasks. Defining the executor memory controls the executor heap size and reduces garbage collection delays [23].
The driver runs the main logic that triggers a Spark job. It is responsible for submitting the job and coordinating the execution of the Spark application across the cluster nodes. By default, the number of cores available for the driver program is 1. Using the --driver-cores parameter, the number of driver cores required for the Spark application can be set [23].
Driver memory is the amount of memory allocated to the driver; it depends largely on how often data are retrieved to the driver and typically ranges from 2 GB to 4 GB [23].
Shuffle partitions configure the number of partitions to be allocated for shuffling the data while performing RDD operations, such as joins and aggregations. This parameter has no effect when the Spark application has only DataFrame operations [23].
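As an illustration, these parameters can be set when building a Spark session, as sketched below; the values shown echo the better-performing runs described next (12 executors, 96 total cores, 24 shuffle partitions), but the per-executor memory is an arbitrary placeholder:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("uwf-zeek-classification")
         .config("spark.executor.instances", "12")      # executor count
         .config("spark.executor.cores", "8")           # cores per executor (12 x 8 = 96 total)
         .config("spark.executor.memory", "10g")        # memory per executor (placeholder)
         .config("spark.driver.cores", "1")             # driver cores
         .config("spark.driver.memory", "4g")           # driver memory
         .config("spark.sql.shuffle.partitions", "24")  # shuffle partitions
         .getOrCreate())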
Table 10 shows the impact of varying these parameters on the binning time, training time, and testing time (in seconds). As can be seen from Table 10, Test 19's Spark parameters performed the best in terms of binning time, training time, and testing time. Table 11 shows the effect of varying the driver cores, driver memory, and shuffle partitions on the total time taken by the decision tree classifier, where the total time is the sum of the binning, training, and testing times. Here, too, Test 19 performed the best.

Figures 2 and 3 show the effects of the total executor memory (in GB) and the total number of executor cores, respectively, on the total processing time (in seconds). According to Figure 2, there does not appear to be a strong correlation between total executor memory and total processing time; in most runs, increasing the executor memory from 20 GB to 120 GB did not reduce the processing time significantly. However, as per Figure 3, the total processing time decreased when the total number of executor cores increased from 10 to 96. The total number of executor cores can be obtained by multiplying the number of executors by the number of cores per executor.

Figure 4 shows how the total number of executor cores and the number of executors affect the processing time. The size of a bubble indicates the processing time; the smaller the bubble, the shorter the total processing time. As the total number of executor cores increases, the bubbles become smaller. The grey bubbles, for example, are large and closer to the x-axis when the number of executor cores is low, indicating higher processing times. The processing time was shortest for the red bubble, with 96 total executor cores and 12 executors.
According to Figure 5, the processing time increases as the shuffle partitions increase. This could be due to the overhead associated with distributing and collecting data over a network. From Figure 5, it can also be seen that the processing time was lowest for the red bubble, with 24 shuffle partitions and 12 executors. Hence, Test 18's Spark configuration parameters from Table 11 were used for testing the classifiers, since this allows for the use of fewer partitions.

Analyzing Training Time
The DT classifier was run with 6, 9, 12, and 18 attributes, and the training times were recorded as shown in Figure 6. The six-attribute runs used the top six attributes from the respective information gain tables, the nine-attribute runs used the top nine attributes, and so on. The number of attributes used made no significant difference to the training time. The training time for resource development was high because of the larger number of records (82,543). The training times for reconnaissance and discovery, however, were low, since there were fewer records.


Performance of the Machine Learning Classifiers Using UWF-ZeekDataFall22
Six different machine learning classifiers, decision tree, random forest, naïve Bayes, logistic regression, support vector machines, and gradient-boosting trees, were used for classifying the data. Testing was performed on two datasets, UWF-ZeekDataFall22 and UWF-ZeekData22, both available in [5]. The results of the binary classification are presented in Tables 12-14. The results of the multinomial classification are presented in Table 15. The multinomial classification was only performed for the UWF-ZeekDataFall22 dataset.

The classifiers were compared in terms of accuracy; precision; recall; the false positive rate; the F-measure; preprocessing time, in terms of binning time; training time; and testing time. Below are the commonly used terminologies used in the evaluation metrics:
True Positive (TP): Number of correct positive predictions.
True Negative (TN): Number of correct negative predictions.
False Positive (FP): Number of negative instances which were incorrectly classified as positive.
False Negative (FN): Number of positive instances which were incorrectly classified as negative.
Accuracy: Accuracy is the number of correct predictions divided by the total number of predictions:
Accuracy = (TP + TN)/(TP + TN + FP + FN) (4)
Precision: Precision is the number of correct positive predictions divided by the total number of positive predictions:
Precision = TP/(TP + FP) (5)
Recall: Recall can also be defined as the true positive rate. It is calculated as the number of true positives divided by all real positives:
Recall = TP/(TP + FN) (6)
False Positive Rate (FPR): The FPR is the total number of incorrect positive predictions divided by all real negatives:
FPR = FP/(TN + FP) (7)
F-Measure: The F-measure is the harmonic mean of precision and recall:
F-Measure = 2 × (Precision × Recall)/(Precision + Recall) (8)
Training Time: The time taken by a classifier to train, in seconds.
Testing Time: The time taken by the classifier to perform the predictions, in seconds.
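These metrics can be computed directly from a predictions DataFrame, as sketched below; the "label" and "prediction" columns (with 1 = attack and 0 = benign) are an assumed convention:

from pyspark.sql import functions as F

def binary_metrics(pred_df):
    # Confusion-matrix counts in a single aggregation pass.
    agg = pred_df.agg(
        F.sum(((F.col("label") == 1) & (F.col("prediction") == 1)).cast("int")).alias("tp"),
        F.sum(((F.col("label") == 0) & (F.col("prediction") == 0)).cast("int")).alias("tn"),
        F.sum(((F.col("label") == 0) & (F.col("prediction") == 1)).cast("int")).alias("fp"),
        F.sum(((F.col("label") == 1) & (F.col("prediction") == 0)).cast("int")).alias("fn"),
    ).first()
    tp, tn, fp, fn = agg["tp"], agg["tn"], agg["fp"], agg["fn"]

    accuracy  = (tp + tn) / (tp + tn + fp + fn)                 # Equation (4)
    precision = tp / (tp + fp)                                  # Equation (5)
    recall    = tp / (tp + fn)                                  # Equation (6)
    fpr       = fp / (tn + fp)                                  # Equation (7)
    f_measure = 2 * precision * recall / (precision + recall)   # Equation (8)
    return accuracy, precision, recall, fpr, f_measure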

Machine Learning Classifier Results for Binary Classification
Tables 12-14 present the binary classification results for all six classifiers using the UWF-ZeekDataFall22 dataset for the reconnaissance, discovery, and resource development tactics, respectively. Binary classification implies using a combination of the respective tactic data with the benign data.

The Reconnaissance Tactic
The results of the binary classification for the reconnaissance tactic using the six different classifiers, LR, NB, RF, GBT, DT, and SVM, are presented in Table 12.
From Table 12, it can be noted that RF, GBT, and DT performed the best for all combinations of attributes in terms of accuracy, precision, recall, and the F-measure (all at 100%), although the results for accuracy, precision, and recall were also very close for NB and LR. Except for 18 attributes, SVM did not perform as well, especially for 12 attributes, where it performed quite poorly in terms of precision, recall, and the F-measure.
The false positive rates were also relatively high for SVM, especially for 12 attributes, as shown in Figure 7. Again, RF, GBT, and DT had a 0.00% FPR, and LR and NB had FPRs only slightly higher than 0.00% (as shown in Table 12 and Figure 7). In terms of training time, as shown in Figure 8 and Table 12, SVM had the highest overall training time; the second highest was GBT, followed by RF and then DT. NB had the lowest training time. Figure 9 shows the averages for all algorithms by the number of features for the reconnaissance tactic. From Figure 9, the number of features does not seem to have an effect on accuracy, precision, recall, or the F-measure.

The Discovery Tactic
Results of the binary classification for the discovery tactic using the six different classifiers, LR, NB, RF, GBT, DT, and SVM, are presented in Table 13.
From Table 13, it can be noted that GBT and DT performed the best for all combinations of attributes in terms of accuracy, precision, recall, and the F-measure, although the results for accuracy, precision, and recall were also very close for RF and LR, followed by NB. SVM performed somewhat randomly: it performed almost as well for 18 attributes, moderately well for nine attributes, and worst for 12 attributes.
In terms of recall, all algorithms performed well for all numbers of attributes except for SVM; in fact, SVM with 12 attributes performed very poorly.
In terms of FPRs, GBT and DT had a rate of 0.00% for all attribute combinations, as shown in Figure 10 as well as in Table 13. The other FPRs were also relatively low, except for SVM with six attributes, which had a FPR of 11%. In terms of training time, SVM had the highest overall training time, followed by GBT and then RF, as shown in Figure 11 as well as in Table 13. NB had the lowest training time.
Figure 12 shows the averages for all algorithms according to the number of features for the discovery tactic. From Figure 12, it can be noted that six attributes were relatively low in terms of average precision, and 12 and 18 attributes were relatively low in terms of average recall and the F-measure, respectively.

The Resource Development Tactic
The results of the binary classification for the resource development tactic using the six different classifiers, LR, NB, RF, GBT, DT, and SVM, are presented in Table 14.
Both GBT and DT had 100% accuracy, precision, and recall when tested with the UWF-ZeekDataFall22 dataset using the resource development tactic. In terms of accuracy, the results of LR, NB, and RF were closely behind GBT and DT, but SVM with nine and 12 attributes performed very poorly compared to the rest. The same pattern can be seen for recall, and the FPRs are shown in Figure 13.
In terms of training time, as can be seen in Figure 14, SVM, followed by GBT, showed the poorest performance. Here, NB had the overall lowest training time. Figure 15 shows the average accuracy, precision, recall, and F-measure of all the algorithms with 6, 9, 12, and 18 attributes. The averages were relatively higher for six attributes, and the other attribute combinations performed almost the same.

Machine Learning Classifier Results for Multinomial Classification Using the UWF-ZeekDataFall22 Dataset
For multi-classification, the data from the three tactics (reconnaissance, discovery, and resource development) and the benign data were used. Weighted accuracy, precision, recall, and F-measure were used due to the imbalanced nature of the data.
Table 15 presents the multi-classification results for the four classifiers, LR, NB, RF, and DT, using the UWF-ZeekDataFall22 dataset with the four different attribute combinations. Currently, PySpark's MLlib does not support multi-class classification for SVM and GBT; therefore, these algorithms were not included in this testing.
DT and RF had the highest performance in terms of weighted precision, weighted recall, weighted F-measure, and weighted accuracy as well as FPR. However, the training time was the highest for the RF classifier. NB had the poorest performance in terms of weighted precision, weighted recall, weighted F-measure, and weighted accuracy.
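For reference, the weighted metrics reported in Table 15 can be obtained with PySpark's MulticlassClassificationEvaluator, as sketched below; pred_df is a hypothetical predictions DataFrame, and the weightedFalsePositiveRate metric assumes Spark 3.0 or later:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction")

# Evaluate each weighted metric by overriding the evaluator's metricName.
metrics = {m: evaluator.evaluate(pred_df, {evaluator.metricName: m})
           for m in ["accuracy", "weightedPrecision", "weightedRecall",
                     "f1", "weightedFalsePositiveRate"]}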
Binary classification outperformed multi-classification for all classifiers. Naïve Bayes exhibited very poor performance in the multi-class scenario but achieved an average accuracy of 99.3% with binary classification. The RF and DT classifiers demonstrated good performance in both binary and multi-class classification, making them suitable for both types of classification.

Comparing the UWF-ZeekDataFall22 and UWF-ZeekData22 Datasets
This section compares the results of the two datasets, UWF-ZeekDataFall22 and UWF-ZeekData22. The results for the UWF-ZeekData22 dataset can be found in a previously published study [19]. Figures 16 and 17 compare the average binary classification accuracy and precision, respectively, for the reconnaissance tactic using the two different datasets. Figures 18 and 19 compare the average binary classification accuracy and precision, respectively, for the discovery tactic. These are the averages over all four attribute combinations (6, 9, 12, and 18) for all the classifiers. Although the results were very close for both reconnaissance and discovery, the average accuracy and precision of all the classifiers except SVM were slightly better using the new UWF-ZeekDataFall22 dataset. RF and DT performed the best for both datasets.


Limitations of This Study
One major limitation of this study is that, since other tactic datasets are not available, only two datasets could be compared.

Summary of Key Findings
Several conclusions can be drawn from this paper. First, in terms of testing Spark's configuration parameters, there does not appear to be a strong correlation between total executor memory and total processing time. Also, the processing time increases as the shuffle partitions increase; this could be due to the overhead associated with distributing and collecting data over a network. From Figure 5, it can be seen that the processing time was lowest with 24 shuffle partitions and 12 executors.
In terms of classification results, it can be noted that, overall, the binary classifiers performed better than the multinomial classifiers, though DT and RF performed almost equally well in the binary as well as the multi-class classifications. For both reconnaissance and discovery, the average accuracy and precision of all the classifiers except SVM were better with the new UWF-ZeekDataFall22 dataset. On average, the multinomial classifiers took longer than the binary classifiers. Finally, overall, the tree-based classifiers performed best in both the binary and multi-class classifications.
Of the three tree-based classifiers, random forest, decision tree, and gradient-boosting tree, the decision tree had the shortest training time for binary as well as multinomial classification for all tactics, and this was the case for all four combinations of attributes. Hence, taking training time into consideration, decision trees can be considered the most efficient classifier using Apache Spark for the UWF-ZeekDataFall22 dataset, labeled as per the MITRE ATT&CK framework.
Elaborating on the high results obtained by the tree-based classifiers, which were at or close to 100%, these results could also be attributed to the successful preprocessing, which reduced the noise in the data. However, it can be noted that SVM performed inconsistently compared to the other classifiers. Using the UWF-ZeekDataFall22 dataset, for all tactics, SVM performed almost as well as the other classifiers for 18 attributes and moderately well for 6 or 9 attributes, but it performed very poorly for 12 attributes. This needs to be examined further in a future study.
In terms of the number of attributes that performed the best, the top six attributes (as per the information gain measure) performed the best for the resource development tactic. The results for the discovery tactic were not conclusive, since both 9 and 18 attributes seemed to perform equally well, in which case 9 attributes would be selected for classification. For the reconnaissance tactic, 6 as well as 18 attributes seemed to perform almost equally well, in which case 6 attributes would be selected for the classification.

Figure 1. Overview of the experimentation on the UWF-ZeekDataFall22 dataset.

Figure 4. UWF-ZeekDataFall22: Processing times varying the numbers of executors and executor cores.

Figure 5. UWF-ZeekDataFall22: Processing time varying the number of executors and shuffle partitions.

Figure 7. UWF-ZeekDataFall22: Reconnaissance: FPR of algorithms according to the number of features used.

Figure 8. UWF-ZeekDataFall22: Reconnaissance: Training time of algorithms according to the number of features used.

Figure 9. UWF-ZeekDataFall22: Reconnaissance: Averages for all algorithms according to the number of features.

Figure 10. UWF-ZeekDataFall22: Discovery: FPR of algorithms according to the number of features used.

Figure 11. UWF-ZeekDataFall22: Discovery: Training time of algorithms according to the number of features used.

Figure 12. UWF-ZeekDataFall22: Discovery: Averages for all algorithms according to the number of features.

Figure 13. UWF-ZeekDataFall22: Resource development: FPRs of algorithms according to the number of features used.

Figure 14. UWF-ZeekDataFall22: Resource development: Training time of algorithms according to the number of features used.

Figure 15. UWF-ZeekDataFall22: Resource development: Averages for all algorithms according to the number of features.

Table 3. Binning port numbers in the UWF-ZeekDataFall22 dataset.

Table 4. Bins generated for Boolean and nominal attributes for the UWF-ZeekDataFall22 dataset.

Table 5. Number of bins for continuous valued attributes in the UWF-ZeekDataFall22 dataset.

Table 6. Information gain values for the full dataset, UWF-ZeekDataFall22.

Table 11. UWF-ZeekDataFall22: Total time for additional Spark configuration parameters.

Table 13. UWF-ZeekDataFall22: Binary classification results for the discovery tactic.

Table 14. UWF-ZeekDataFall22: Binary classification results for the resource development tactic.

Table 15. UWF-ZeekDataFall22: Multinomial classification results.