Article

Resampling Imbalanced Network Intrusion Datasets to Identify Rare Attacks

1
Department of Computer Science, University of West Florida, Pensacola, FL 32514, USA
2
Department of Mathematics and Statistics, University of West Florida, Pensacola, FL 32514, USA
*
Author to whom correspondence should be addressed.
Future Internet 2023, 15(4), 130; https://doi.org/10.3390/fi15040130
Submission received: 12 March 2023 / Revised: 22 March 2023 / Accepted: 27 March 2023 / Published: 29 March 2023
(This article belongs to the Section Cybersecurity)

Abstract

This study, focusing on identifying rare attacks in imbalanced network intrusion datasets, explored the effect of using different ratios of oversampled to undersampled data for binary classification. Two designs were compared: random undersampling before splitting the training and testing data and random undersampling after splitting the training and testing data. This study also examined how oversampling/undersampling ratios affect random forest classification rates in datasets with minority data or rare attacks. The results suggest that random undersampling before splitting gives better classification rates; however, random undersampling after oversampling with Borderline SMOTE (BSMOTE) allows for the use of lower ratios of oversampled data.

1. Introduction

The internet generates traffic at a rate of 6.59 billion GB per second [1]. Approximately 1–3% of this traffic is malicious [2]. Advances in machine learning (ML) have allowed us to detect some of these attacks but not all. The challenge that cyber security experts face is not one of having enough data but one of having enough of the right type of data. Some organizations never see anomalous data on their network; most network traffic is generated by routine workplace business. Given this scenario, it is hard to know what to look for since we have such a small sample size of the attacks. Some types of attacks are more frequent than others, and it is only in the infrequent or rare attacks, or minority data, that this detection problem lies. Machine learning models, by their very nature, are good at detecting patterns where there is more data (in majority data); hence, detection of rare attacks (minority data) is challenging for machine learning models.
Minority data is a tiny percentage of network intrusion datasets, and for the purposes of this work, we are defining minority data as less than 0.1% of the data. Various oversampling techniques, including different SMOTE methods, random oversampling, ADASYN, and others, have been used by researchers to try to model minority data better. Oftentimes a combination of oversampling (of minority data) and undersampling (of majority data) is used [3].
When oversampling and undersampling are used, researchers have to determine how much undersampling should occur, how much oversampling should occur, and when the resampling should occur in the process. This research presents two different design methodologies, one performing random undersampling before splitting the training and testing data and the second performing random undersampling after splitting the training and testing data. In addition, for each of these two designs, the paper compares different ratios of random undersampling to oversampling using BSMOTE. Finally, binary classification using random forest is used to classify the various combinations of the data. For this research, two datasets have been used—the first being a widely used network intrusion dataset, UNSW-NB15 [4], and the second being a newly created network intrusion dataset based on the MITRE ATT&CK framework, UWF-ZeekData22 [5].
The rest of this paper is organized as follows. Section 2 presents the background; Section 3 presents works related to oversampling and undersampling; Section 4 explains the datasets used in this study; Section 5 presents the experimental designs; Section 6 presents the hardware and software configurations; Section 7 presents the metrics used for assessment of the classification results; Section 8 presents the results and discussion; Section 9 presents the conclusions; and Section 10 presents the future works.

2. Background

To adequately address resampling, the resampling techniques used in this paper are briefly explained next. Oversampling and undersampling change the ratio between the minority and majority classes by enlarging the former and reducing the latter, respectively. This is particularly important when the minority classes are minimal, that is, in the case of rare attacks.
A minority class refers to very low-occurring data categories. In this work, since a binary classification was performed, the dataset for each experiment contained only two data types: one attack category and benign data. For example, for the analysis of worms, the dataset contained only worms and benign data. Out of these two categories of data, worms would be considered the minority class. Table 1, which presents the UNSW-NB15 dataset, shows that worms account for 0.006% of the total data and 0.007% relative to the benign data. The raw data count for worms is 174 compared to the benign data count of 2,218,761.
Oversampling is the process of generating samples to increase the size of the minority class. Random oversampling involves randomly selecting samples from the minority class, with replacement, and adding them to the dataset [6]. Random oversampling can also lead to overfitting the data [3].
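As a minimal sketch of random oversampling as described above, the following standard-library example duplicates minority samples drawn with replacement. The function name `random_oversample`, the `target_size` parameter, and the toy data are hypothetical and are not taken from the study.

```python
import random

def random_oversample(minority, target_size, seed=0):
    """Grow the minority class to target_size by drawing duplicates with replacement."""
    rng = random.Random(seed)
    n_extra = max(0, target_size - len(minority))
    extra = rng.choices(minority, k=n_extra)  # sampling with replacement
    return list(minority) + extra

# Toy minority class with three two-feature samples (illustrative values only)
minority = [(0.1, 0.2), (0.4, 0.1), (0.3, 0.9)]
oversampled = random_oversample(minority, target_size=10)
```

Because every added row is an exact copy of an existing one, a classifier can memorize these points, which is consistent with the overfitting risk noted above.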
Various oversampling techniques are available, including various types of the Synthetic Minority Oversampling Technique (SMOTE) and ADASYN. SMOTE generates synthetic samples rather than resampling with replacement. These samples are generated along the line segments joining a minority sample to its k nearest minority class neighbors [7]. In Borderline SMOTE (BSMOTE), a variant of the original SMOTE, borderline minority samples are identified first and synthetic samples are then generated from them [8].
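The interpolation step of SMOTE can be sketched as follows. This is a simplified illustration of plain SMOTE, not of BSMOTE (which would additionally restrict the base points to borderline minority samples) and not the imblearn implementation. The function name `smote_samples` and the toy points are hypothetical.

```python
import math
import random

def smote_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding the point itself
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Four minority points on the corners of the unit square (illustrative only)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_samples(minority, n_new=5)
```

With `k=2`, each corner's nearest neighbors are the two adjacent corners, so every synthetic point in this toy example lies on an edge of the square.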
ADASYN [9] is a pseudo-probabilistic oversampling technique that uses a weighted distribution for different minority data points. This weighted distribution is based on the level of difficulty in learning, and more synthetic data are generated for minority class examples that are harder to learn than minority examples that are easier to learn [10].
Random undersampling samples the majority class by randomly picking samples with or without replacement from the majority class [11]. After random undersampling, the number of cases of the majority class decreases, significantly reducing the model’s training time. However, the removed data points may include significant information leading to a decrease in classification results [3].
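Random undersampling without replacement can be sketched in a few lines. The function name `random_undersample` and the `fraction` parameter are hypothetical; the value 0.5 is used here only because it matches the 50% undersampling level used later in this paper.

```python
import random

def random_undersample(majority, fraction, seed=0):
    """Keep a random fraction of the majority class, sampled without replacement."""
    rng = random.Random(seed)
    n_keep = round(len(majority) * fraction)
    return rng.sample(majority, n_keep)

majority = list(range(1000))          # stand-in for 1000 benign records
kept = random_undersample(majority, fraction=0.5)
```

The discarded half is never seen by the classifier, which is where the potential loss of significant information comes from.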

3. Related Works

For cybersecurity or network intrusion detection analysis, it is difficult to obtain a good workable dataset; among the few available, most are highly imbalanced due to the nature of the attacks. Certain attacks are rare in the real world, but measures must be taken to safeguard against them. The rarer the attack, the lesser the chance of it getting detected by a machine learning model.
Oversampling and undersampling are standard methods used by researchers to tackle class imbalance problems [3,7,12]. Oversampling generates synthetic samples from the existing minority class to balance the data or to have more instances for the classifier. Undersampling, on the other hand, reduces the majority class instances to bring the imbalance scale to normalcy. Though both these methods have inherent advantages and disadvantages, they can be used separately or combined with balancing ratios. It is possible to selectively downsize only negative examples and keep all positive examples in the training set [13]. A major disadvantage of undersampling is the loss of data, and hence possibly critical information.
Oversampling increases the training time due to an increase in the training set [12], and may overfit the model [14]. Ref. [14] found that oversampling minority data before partitioning resulted in 40% to 50% AUC score improvement. When the minority oversampling is applied after the split, the actual AUC improvement is 4% to 10%. This behavior is due to what is termed as data leakage, caused by generating training samples correlated with the original data points that end up in the testing set [14]. Ref. [15] found that the synthetic instance creation approach plays a more significant role than the minority instance selection approach. The critical difference between these two papers is that the first process creates instances along the line connecting the two minority instances, while the second approach creates synthetic minority samples in the bounding rectangle created by joining the two minority instances.
Hence, both oversampling and undersampling can have potential overfitting or underfitting, respectively. Studies have been carried out with various combinations of oversampling and undersampling [12,16,17]. Random undersampling was also combined with each oversampling technique—SMOTE, ADASYN, Borderline, SVM-SMOTE, and random oversampling in [18]. Various percentages were used from each of these techniques to study the effect on minority class predictions. The results of each combination are highly dependent on the domain and distribution of the dataset [19]. In one domain, undersampling might help more than oversampling, but in another domain, it may be vice versa.
Some researchers have combined synthetic oversampling with other techniques, such as Grid Search (GS), to improve the prediction of attack traffic [20].
While the above papers present novel ways to examine the class imbalance classification problem, none directly addresses the problem of identifying rare attacks using different ratios of the sampling methods. The novelty of this research is that it considers different ratios of majority data to minority data when identifying rare attacks. Additionally, two different designs are compared—random undersampling before splitting the data and random undersampling after splitting the data, with different oversampling ratios.

4. The Datasets

Two datasets were used in this study: (i) a well-known network intrusion dataset, UNSW-NB15 [4] and (ii) a new network intrusion dataset created based on the MITRE ATT&CK framework, UWF-ZeekData22 [21].

4.1. UNSW-NB15

UNSW-NB15 [4], published in 2015, is a hybrid of real-world network data and simulated network attacks, comprising 49 features and 2.5 million rows. There are 2.2 million rows of normal or benign traffic, and the other 300,000 rows comprise nine different modern attack categories: Fuzzers, Reconnaissance, Shellcode, Analysis, Backdoors, DOS, Exploits, Worms, and Generic. Some attack categories, such as Worms, Shellcode, and Backdoors, that comprised only 0.006%, 0.059%, and 0.091%, of the total traffic, respectively, can be considered rare attacks. These are the three attack categories of particular interest in this research, and the rare attacks will be considered the minority classes. Table 1 presents the distribution of the attack families in this dataset, ordered from the smallest category to the largest category.

4.2. UWF-ZeekData22

UWF-ZeekData22 [5,21], published in 2022, is developed from data collected from Zeek, an open-source network-monitoring tool, and labeled based on the MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) framework. This dataset has approximately 9.280 million attack records and 9.281 million benign records [22]. The breakdown of the data, that is, the percent of attack data for each attack type, is presented in Table 2, ordered from the smallest category to the largest category. For this analysis, two rare tactics were used: Privilege_escalation and Credential_access. Privilege_escalation and Credential_access form 0.00007% and 0.00016% of the total network traffic, respectively, and hence can be classified as rare attacks. These rare attacks will be treated as the minority classes.

5. Experimental Design

The experimental design compares two approaches to solve the problem of having significantly small samples of minority data: (a) resampling or random undersampling before a stratified split (Figure 1a) and (b) resampling or random undersampling after a stratified split (Figure 1b). Since the objective of this work is to identify minority data or rare attacks, stratified sampling was used to guarantee that the training and testing data included minority data. However, the question is, when is it best to perform stratified sampling? In stratified sampling, the data are divided into groups and a certain percentage of samples are taken randomly from each group. This guarantees that there will be samples from the minority class in the data. For the purposes of this study, significantly small minority data are defined as <0.1% of the total sample.
The two approaches differ in the preprocessing sequence prior to the machine learning model execution and have different approaches to creating training and testing data. In the first design, resampling before splitting, the data are preprocessed, and then random undersampling is performed. This is followed by a stratified split and oversampling using Borderline SMOTE (BSMOTE). Finally, these data were used for training and testing the machine learning model using random forest.
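A stratified split can be sketched with the standard library as below. The function name `stratified_split`, the 0.3 test fraction, and the toy labels are illustrative assumptions; in scikit-learn the same effect is obtained by passing `stratify=y` to `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) with every class split at the same fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one test sample per class (assumes each class has >= 2 members)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy data: 990 benign records and 10 records of a rare attack
labels = ["benign"] * 990 + ["worm"] * 10
train_idx, test_idx = stratified_split(labels)
```

Because each class is sampled separately, the rare class appears in both partitions, which a purely random split would not guarantee.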
In the second design, resampling after splitting, after the initial preprocessing steps, the data are split (using stratified sampling). This method retains the stratified ratio of the majority to minority class since the stratified split was performed on the whole data. This is followed by oversampling using BSMOTE and random undersampling. Finally, these data were used to train and test the machine learning model using random forest.
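One consequence of the ordering can be illustrated with simple count arithmetic. The sketch below rests on assumptions that are not stated in the text: undersampling keeps 50% of the benign records, the split holds out 30% of each class, and in the second design undersampling is applied to the training partition only. The function name `held_out_counts` and the design labels are hypothetical; the record counts are those reported for Worms in UNSW-NB15.

```python
def held_out_counts(n_benign, n_attack, design, keep_fraction=0.5, test_fraction=0.3):
    """Benign and attack counts in the test partition under each design (assumed)."""
    if design == "undersample_before_split":
        # Majority class is reduced first, so the test partition is drawn
        # from the already-undersampled benign data.
        n_benign = round(n_benign * keep_fraction)
    elif design != "undersample_after_split":
        raise ValueError(f"unknown design: {design}")
    # Stratified split: each class contributes the same fraction to the test set.
    return round(n_benign * test_fraction), round(n_attack * test_fraction)

before = held_out_counts(2_218_761, 174, "undersample_before_split")
after = held_out_counts(2_218_761, 174, "undersample_after_split")
```

Under these assumptions both designs hold out the same number of attack records, but the first design holds out roughly half as many benign records, so the attack share of its test partition is about twice as large. This may be worth keeping in mind when comparing metrics across the two designs.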

5.1. The Classifier Used: Random Forest

Random forest (RF) is a highly used machine learning classifier that is basically an ensemble way of classifying records. In the RF algorithm, the decision of multiple decision trees taken together is used to come up with the final classification label [23].
The RF algorithm works by first generating each decision tree by randomly selecting a bootstrap sample. A bootstrap sample is a randomly selected sample from a dataset with replacement [24]. Each decision tree is trained with a separate bootstrap sample. Hence, if a random forest has N trees, N bootstrap samples will be required. Randomization is introduced during tree training. Features are randomly selected when a decision tree node splits, and of the randomly selected features, the best feature is selected based on statistical measures, such as information gain and Gini index [24]. Once forest training is complete, N-trained decision trees are created. A classification is made on one or more samples. RF classifies samples by querying each decision tree with the sample. A tally is kept to aggregate all classifications made by the decision trees [24]. Once all trees have voted with a classification, the label with the most votes is chosen. Hence, RF is basically an ensemble technique based on decision trees.
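The two ingredients described above, bootstrap sampling and majority voting, can be sketched as follows. Tree training itself is omitted; the function names and the example votes are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(100))                                     # stand-in for training rows
samples = [bootstrap_sample(rows, rng) for _ in range(5)]   # one sample per tree, N = 5

# Hypothetical votes from five trained trees for a single record
label = majority_vote(["attack", "benign", "attack", "attack", "benign"])
```

Here three of the five trees vote for the attack label, so the forest classifies the record as an attack.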

5.2. Preprocessing

With regard to preprocessing, each dataset was handled differently, but information gain was used on both datasets to identify the relevant features. First, the information gain algorithm is explained, and then the preprocessing performed in each dataset is presented.

5.2.1. Information Gain

Information gain (IG) is used to assess the relative relevance of features in a dataset and is useful for classification. Information gain measures how much the randomness in the dataset, quantified by the entropy of the class labels, is reduced when the data are partitioned on a feature [23].
The following calculations were performed on each feature to produce information gain values for ranking purposes [23].
$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$
where
$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
$$\mathrm{Info}_A(D) = \sum_{j=1}^{V} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$
where:
  • Info(D) is the average amount of information needed to identify the class label of a tuple in the data, D;
  • Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by attribute A;
  • p_i is the nonzero probability that an arbitrary tuple in D belongs to class i, for i = 1, ..., m, where m is the number of classes;
  • |Dj|/|D| is the weight of the jth partition;
  • V is the number of distinct values in attribute A.
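The formulas above can be turned into a short computation for a single categorical feature. This is a minimal sketch rather than the code used in the study; the function names and the toy feature values are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Gain(A) = Info(D) - Info_A(D) for one categorical feature A."""
    partitions = defaultdict(list)
    for value, label in zip(feature_values, labels):
        partitions[value].append(label)
    total = len(labels)
    # Info_A(D): entropy of each partition D_j weighted by |D_j| / |D|
    info_a = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a

labels = ["attack", "attack", "benign", "benign"]
perfect = information_gain(["tcp", "tcp", "udp", "udp"], labels)  # separates the classes
useless = information_gain(["tcp", "udp", "tcp", "udp"], labels)  # carries no class signal
```

A feature that separates the two classes completely recovers the full entropy of the labels (1 bit here), while a feature that is independent of the class has a gain of zero and would rank last.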

5.2.2. Preprocessing UNSW-NB15

For preprocessing UNSW-NB15, first, the following columns were dropped:
  • ct_flw_http_mthd and is_ftp_login;
  • Unique identifiers and time stamps;
  • IP addresses.
Other preprocessing that was performed:
  • The attack categories, NaN, were filled with zeros;
  • Categorical data were turned into numeric representation: protocol, state, and attack category;
  • A normalization technique was used on continuous data for all numeric variables:
$$X = \frac{x_i - \mu}{s}$$
where μ is the column mean and s is the column standard deviation.
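The normalization above can be sketched per column as follows. Whether the study used the sample or the population standard deviation is not stated; the sample standard deviation is assumed here, and the function name and values are hypothetical.

```python
import statistics

def standardize(column):
    """Return (x_i - mean) / s for every value in the column."""
    mu = statistics.mean(column)
    s = statistics.stdev(column)  # sample standard deviation (an assumption)
    return [(x - mu) / s for x in column]

scaled = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After this transformation the column has a mean of zero and a standard deviation of one.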
Finally, IG was calculated on the remaining columns of this dataset, and columns with low information gain were not used. Columns dropped due to low information gain were service, dloss, stepd, dtcpb, res_bdy_len, trans_depth, and is_sm_ips_ports.

5.2.3. Preprocessing UWF-ZeekData22

For UWF-ZeekData22, the following preprocessing was performed following [22]:
  • Continuous features, duration, orig_bytes, orig_pkts, orig_ip_bytes, resp_bytes, resp_pkts, resp_ip_bytes, and missed_bytes were binned using a moving mean;
  • Nominal features, that is, features that contain non-numeric data, were converted to numbers using the StringIndexer method from MLlib [25], Apache Spark’s scalable machine learning library. The nominal features in this dataset were proto, conn_state, local_orig, history, and service;
  • The IP address columns were categorized using the commonly recognized network classifications [26];
  • Port numbers were binned as per the Internet Assigned Numbers Authority (IANA) [27].
Following the binning, IG was calculated on the binned dataset. Attributes with low IG were removed and not used for classification.

6. Hardware and Software Configurations

Table 3 and Table 4 present the hardware and software configurations and Python libraries, respectively, used in this research.

6.1. Hardware and Software Used in Random Undersampling before Stratified Splitting

The simulations for random undersampling before stratified splitting were run on an x64-based processor with a 64-bit operating system. This Windows Home machine was running version 22H2 build 22621.819 and had an AMD Ryzen 7 5700 with 32 GB of RAM and an RTX 3060 video card. CUDA for Python was installed; however, none of the trials used the parallel processing abilities of the machine.

6.2. Python Libraries Used in Random Undersampling before Stratified Splitting

Python version 3.9 was used with Jupyter Notebooks on Anaconda version 2022.10. Packages included pandas 1.5.2, scikit-learn 1.9.3, NumPy 1.23.5, and imblearn 0.10.0. The random forest machine learning algorithm was implemented using the scikit-learn RandomForestRegressor module. Borderline SMOTE was implemented using the BorderlineSMOTE module of the imblearn.over_sampling package.

6.3. Hardware and Software Used in Random Undersampling after Stratified Splitting

The simulations for random undersampling after stratified splitting were run on a machine with Windows Home version 21H2 with OS build 22000.1219 run by an Intel Core i7 1165G7 processor and 16 GB RAM.

6.4. Python Libraries Used in Random Undersampling after Stratified Splitting

Python version 3.10.4 was used with Jupyter Notebooks on Anaconda version 2021.05. Packages included pandas 1.5.0, scikit-learn 1.0.2, NumPy 1.23.4, and imblearn 0.0.

6.5. Stratified Sampling

Scikit-learn in Python was used to generate the training and testing stratified splits.

7. Metrics Used for the Assessment of Results

The objective of this research is to show how a combination of undersampling and oversampling techniques helps build a model which classifies minority classes more accurately. Two approaches were taken for preparing the data: random undersampling before splitting and random undersampling after splitting, and different proportions of oversampling were used (from 0.1 to 1.0, incremented at intervals of 0.1), while the undersampling was maintained at a constant of 0.5 (50%).

7.1. Classification Metrics Used

For evaluating the classification of the random forest runs, the following metrics were used: accuracy, precision, recall, F-Score, and macro precision.
Accuracy: Accuracy is the number of correctly classified instances (i.e., True Positives and True Negatives) divided by the total number of instances [28].
Accuracy = [True Positives (TP) + True Negatives (TN)]/[TP + False Positives + TN + False Negatives]
Precision: Precision is the proportion of predicted positive cases that are correctly labeled as positive [29]. Precision by label considers only one class and measures the number of times a specific label was predicted correctly, normalized by the number of times that label appears in the output.
Precision = Positive Predictive Value = [True Positives]/[True Positives + False Positives]
Recall: Recall is the ability of the classifier to find all the positive samples, or the True Positive Rate (TPR). Recall is also known as sensitivity, and is the proportion of Real Positive cases that are correctly predicted as Positive (TP) [29].
  • All Real Positives = [True Positives + False Negatives]
  • All Real Negatives = [True Negatives + False Positives]
Recall = True Positive Rate = [True Positives]/[All Real Positives]
F-Score: The F-score is the harmonic mean of a prediction’s precision and recall metrics. It is another overall measure of the test’s accuracy [29].
F-Score = 2 ∗ [Precision ∗ Recall]/[Precision + Recall]
Recall, sensitivity, and TPR denote the same measure.
Macro Precision: Macro precision finds the unweighted mean of the precision values. This does not take label imbalance into account [30].
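The metric definitions above can be computed directly from the four confusion-matrix counts. The sketch below assumes the attack class is the positive class; the function names and the counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F-score for the positive (attack) class."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return accuracy, precision, recall, f_score

def macro_precision(tp, fp, tn, fn):
    """Unweighted mean of the per-class precisions (attack and benign)."""
    attack_precision = tp / (tp + fp) if tp + fp else 0.0
    benign_precision = tn / (tn + fn) if tn + fn else 0.0
    return (attack_precision + benign_precision) / 2

# Hypothetical confusion counts for a test set with 50 attack records
acc, prec, rec, f1 = classification_metrics(tp=40, fp=10, tn=9940, fn=10)
macro = macro_precision(tp=40, fp=10, tn=9940, fn=10)
```

In this example accuracy is close to 1 even though a fifth of the attacks are missed, which illustrates why precision, recall, and F-score are reported alongside accuracy for rare attacks.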

7.2. Welch’s t-Tests

Welch’s t-tests were used to find the differences in the means of the different oversampling percentages. When comparing metrics across two successive percentage runs, the increase or decrease in the metric being evaluated has to be determined, and hence a one-tailed Welch’s t-test was used. Welch’s t-test is calculated using the formula:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$
where $\bar{x}_1$ and $\bar{x}_2$ are the sample means of the metrics, $s_1^2$ and $s_2^2$ are the sample variances, $n_1$ and $n_2$ are the sample sizes, and the degrees of freedom $\nu$ are calculated using the Satterthwaite approximation.
The mean of the individual runs for each metric for one oversampling percentage is compared with another, for example, 0.1 vs. 0.2. The larger the absolute value of the t-score, the greater the difference between the two means. In order to determine whether this difference is significant, the p-value was calculated for each t-score. The significance level was set at 0.1, which makes the test more sensitive to differences than the conventional 0.05 level. If the p-value is less than this threshold, then the difference between the two means is statistically significant.
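The t-score and the Satterthwaite degrees of freedom can be computed as in the sketch below. The function name `welch_t` and the recall values are hypothetical and are not results from this study; the p-value step is left out because it requires the t-distribution.

```python
import math
import statistics

def welch_t(sample1, sample2):
    """Return (t, df) for Welch's t-test with Satterthwaite degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    v1 = statistics.variance(sample1) / n1   # s1^2 / n1
    v2 = statistics.variance(sample2) / n2   # s2^2 / n2
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical recall values from repeated runs at two oversampling ratios
recall_01 = [0.90, 0.92, 0.91, 0.93, 0.89]
recall_03 = [0.85, 0.86, 0.88, 0.84, 0.87]
t, df = welch_t(recall_01, recall_03)
```

A positive t-score here means the first sample has the higher mean. One option for obtaining the p-value is `scipy.stats.ttest_ind` with `equal_var=False`, which performs Welch's test.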

8. Results and Discussion

This section presents the statistical results, followed by a discussion. Since several oversampling techniques are available, the first step was to perform a study to select an oversampling technique. Then, the results of the random undersampling before stratified splitting are presented, followed by the results of the random undersampling after stratified splitting.

8.1. Selection of an Oversampling Technique

An initial analysis was conducted to determine the best synthetic oversampling technique among a few commonly used oversampling techniques, SMOTE, Borderline SMOTE, and ADASYN.
Traditionally, a 70-30 training-testing split is performed on the data. However, in this case, since the occurrences of the minority class are minimal, there is no guarantee that samples from the minority class will be present in both the training data and the testing data without sample stratification by category. Hence, stratified sampling was used to split the data into a 70-30 training-testing ratio.
From the evaluation metrics shown in Figure 2, it is evident that BSMOTE performs better than the other resampling techniques in terms of precision, F-score, and macro precision. SMOTE and ADASYN had better recall than BSMOTE, but the latter performed better overall.
Hence, all future analyses in this work use BSMOTE as the primary oversampling technique, combined with random undersampling. Several researchers have confirmed that neither oversampling nor undersampling alone can successfully classify imbalanced datasets [12,19], let alone identify rare attacks. Hence, this study looks at the effects of varying oversampling percentages.

8.2. Resampling before and after Splitting

For random undersampling before stratified splitting, as well as random undersampling after stratified splitting, the percent of undersampling was kept at 0.5 (50%) and the percent of oversampling varied from 0.1 (10%) of the data to 1.0 (100%) of the data, at increments of 0.1 (10%). That is, 50% of the instances were selected at random from the majority class, and from the minority class, first 0.1 (10%) oversampling was used, then 0.2 (20%), then 0.3 (30%), and so on, until 1.0 (100%).
For UNSW-NB15, the three smallest attack categories, Worms, Shellcode, and Backdoors, were used. Worms, Shellcode, and Backdoors comprise 0.006%, 0.059%, and 0.091% of the total data, respectively. For UWF-ZeekData22, two tactics were used: credential access and privilege escalation. Privilege escalation and credential access comprise 0.00007% and 0.00016% of the total data, respectively.
For each of the attack categories, for random undersampling before stratified splitting, as well as random undersampling after stratified splitting, ten runs were performed and the results were averaged. Values are presented for accuracy, precision, recall, F-score, and macro precision for random undersampling of the majority data at 0.5 (50%), with oversampling of the minority data using BSMOTE varied from 0.1 (10%) to 1.0 (100%), at increments of 0.1 (10%). The standard deviations (SDs) are also presented. Since only binary classification is performed in this study, the majority data were the benign or non-attack data and the minority data were the respective attack category, for example, worms or credential access.

8.2.1. Random Undersampling before Stratified Splitting

Table 5 presents the classification results for Random Undersampling Before Stratified Splitting for Worms (UNSW-NB15) for the various oversampling percentages (0.1 to 1.0, at intervals of 0.1). The best results are highlighted in green.
To determine the best results among the various oversampling percentages, Welch’s t-tests were performed. Results of Welch’s t-tests for UNSW-NB15, Worms Random Undersampling before Stratified Splitting (Table 5), are presented in Table 6. The results presented in Table 6 show the t-test comparisons between the results of various oversampling runs. In the first row of comparison between 0.1 and 0.2, we observed that the p-values across all metrics are above 0.1, the significance level. Hence, there is no statistical difference between the compared results. However, in the second row of comparison between 0.1 and 0.3, the recall and F-score had significant differences, and both the t-score values were positive. This implies that 10% oversampling performed better than 30%.
Hence, based on the analysis in Table 6, the best results were obtained at 0.5 oversampling for Worms (highlighted in green in Table 5). Although other sampling ratios are statistically equivalent to 0.5, as shown in Table 6, an oversampling of 0.5 was chosen as the best since, among these, it has the smallest amount of oversampled data and thus would take the least computational time.
Table 7 presents the classification results for Random Undersampling Before Splitting for Shellcode (UNSW-NB15) for the various oversampling percentages (0.1 to 1.0, at intervals of 0.1). The best results are highlighted in green.
Results of Welch’s t-tests for UNSW-NB15 Shellcode Random Undersampling before Splitting (Table 7) are presented in Table 8. Based on the analysis in Table 8, the best results were obtained at 0.4 oversampling for Shellcode (highlighted in green in Table 7).
Table 9 presents the classification results for Random Undersampling Before Splitting for Backdoors (UNSW-NB15) for the various oversampling percentages (0.1 to 1.0, at intervals of 0.1). The best results are highlighted in green.
Results of Welch’s t-tests for UNSW-NB15 Backdoors Random Undersampling before Splitting (Table 9) are presented in Table 10. Based on the analysis in Table 10, the best results were obtained at 0.2 oversampling for Backdoors (highlighted in green in Table 9).
Table 11 presents the classification results for Random Undersampling Before Splitting for Credential Access (UWF-ZeekData22) for the various oversampling percentages (0.1 to 1.0, at intervals of 0.1). The best results are highlighted in green.
Results of Welch’s t-tests for UWF-ZeekData22 Credential Access Random Undersampling before Splitting (Table 11) are presented in Table 12. Based on the analysis in Table 12, the best results were obtained at 0.5 oversampling for credential access (highlighted in green in Table 11).
Table 13 presents the classification results for Random Undersampling Before Splitting for Privilege Escalation (UWF-ZeekData22) for the various oversampling percentages (0.1 to 1.0, at intervals of 0.1). The best results are highlighted in green.
Results of Welch’s t-tests for UWF-ZeekData22 Privilege Escalation Random Undersampling before Splitting (Table 13) are presented in Table 14. Based on the analysis in Table 14, the best results were obtained at 0.1 oversampling for privilege escalation (highlighted in green in Table 13). Though there are sampling ratios that are statistically equivalent to 0.1 (as shown in Table 14), 0.1 was chosen as the best since 0.1 has the smallest amount of oversampled data, thus taking the least computational time.
After analyzing the classification results using Welch’s t-tests, for Random Undersampling Before Stratified Splitting, it appears that a random undersampling of 0.5 of the majority data before stratified splitting gives the best results when the BSMOTE oversampling is also at 0.5 for two attack categories: Worms and credential access. The best results for Shellcode were achieved at a random undersampling of 0.5 and a BSMOTE oversampling of 0.4, and at a BSMOTE oversampling of 0.2 and 0.1 for Backdoors and privilege escalation, respectively.

8.2.2. Random Undersampling after Stratified Splitting

Table 15 presents the classification results for Random Undersampling After Splitting for Worms (UNSW-NB15) for the various oversampling percentages (0.1 to 1.0, at intervals of 0.1). The best results are highlighted in green.
Results of Welch’s t-tests for UNSW-NB15, Worms for Random Undersampling after Splitting (Table 15), are presented in Table 16. Consider the t-test comparison between 0.1 vs. 0.3 in Table 16. It was found that the t-values of precision, recall, and macro precision have statistical significance. Since precision and macro precision have positive t-values, it implies that 0.1 oversampling performed better in these two metrics. However, the t-value for recall was negative, which means that 0.3 had better true positive predictions than 0.1. If the p-value is higher than 0.1, then there is statistically no significant difference between the two means. This is the case in the comparison of 0.1 vs. 0.2 oversampling in Table 16.
Based on the analysis in Table 16, the best results were obtained at 0.1 oversampling for Worms (highlighted in green in Table 15). Though there are sampling ratios that are statistically equivalent to 0.1 (as shown in Table 16), 0.1 was chosen as the best since it has the smallest amount of oversampled data, thus taking the least computational time.
Table 17 presents the classification results for Random Undersampling After Splitting for Shellcode (UNSW-NB15) for the various oversampling percentages (0.1 to 1.0, at intervals of 0.1). The best results are highlighted in green.
Results of Welch’s t-tests for Random Undersampling after Splitting for Shellcode (UNSW-NB15) (Table 17) are presented in Table 18. Based on the analysis in Table 18, the best results were obtained at 0.1 oversampling for Shellcode (highlighted in green in Table 17). Though there are sampling ratios that are statistically equivalent to 0.1 (as shown in Table 18), 0.1 was chosen as the best since it has the smallest amount of oversampled data, thus taking the least computational time.
Table 19 presents the classification results for Random Undersampling After Splitting for Backdoors (UNSW-NB15) for the various oversampling percentages (0.1 to 1.0, at intervals of 0.1). The best results are highlighted in green.
Results of Welch’s t-tests for UNSW-NB15 Backdoors for Random Undersampling after Splitting (Table 19) are presented in Table 20. Based on the analysis in Table 20, the best results were obtained at 0.1 oversampling for Backdoors (highlighted in green in Table 19). Again, although there are sampling ratios that are statistically equivalent to 0.1 (as shown in Table 20), 0.1 was chosen as the best since it has the smallest amount of oversampled data, thus taking the least computational time.
Table 21 presents the classification results for Random Undersampling After Splitting for Credential Access (UWF-ZeekData22) for the various oversampling percentages (0.1 to 1.0, at intervals of 0.1). The best results are highlighted in green.
Results of Welch’s t-tests for UWF-ZeekData22 Credential Access for Random Undersampling after Splitting (Table 21) are presented in Table 22. Based on the analysis in Table 22, the best results were obtained at 0.5 oversampling for credential access (highlighted in green in Table 21). Again, although there are sampling ratios that are statistically equivalent to 0.5 (as shown in Table 22), 0.5 was chosen as the best since it has the smallest amount of oversampled data, thus taking the least computational time.
Table 23 presents the classification results for Random Undersampling After Splitting for Privilege Escalation (UWF-ZeekData22) for the various oversampling percentages (0.1 to 1.0, at intervals of 0.1). The best results are highlighted in green.
Results of Welch’s t-tests for UWF-ZeekData22 Privilege Escalation for Random Undersampling after Splitting (Table 23) are presented in Table 24. Based on the analysis in Table 24, the best results were obtained at 0.1 oversampling for privilege escalation (highlighted in green in Table 23). Again, although there are sampling ratios that are statistically equivalent to 0.1 (as shown in Table 24), 0.1 was chosen as the best since it has the smallest amount of oversampled data, thus taking the least computational time.
After analyzing the classification results using Welch’s t-tests for Random Undersampling After Stratified Splitting, it can be seen that a random undersampling of 0.5 of the majority data after stratified splitting gives the best results when the BSMOTE oversampling is at 0.1 for four of the five datasets. All three UNSW-NB15 datasets performed best at an oversampling of 0.1, as did one of the UWF-ZeekData22 datasets, privilege escalation. Credential access, however, performed best at 0.5. This means that, for four out of the five datasets, generating synthetic minority class samples beyond 0.1 does not result in better predictions by the model.
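The difference between the two orderings can be illustrated on toy labels. This is a sketch under stated assumptions rather than the study's pipeline: the helper functions, the 10,000/100 class counts, and the 0.3 test fraction are all hypothetical, the BSMOTE oversampling step is omitted, and the "after splitting" design is assumed to undersample only the training portion.

```python
import random


def stratified_split(indices, labels, test_frac, rng):
    """Split `indices` into (train, test), keeping class shares approximately equal."""
    train, test = [], []
    for cls in sorted({labels[i] for i in indices}):
        members = [i for i in indices if labels[i] == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_frac)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def undersample_majority(indices, labels, keep_frac, rng, majority=0):
    """Randomly keep `keep_frac` of the majority-class rows and all other rows."""
    majority_idx = [i for i in indices if labels[i] == majority]
    other_idx = [i for i in indices if labels[i] != majority]
    kept = rng.sample(majority_idx, round(len(majority_idx) * keep_frac))
    return kept + other_idx


def minority_share(indices, labels):
    """Fraction of the selected rows that carry the minority label (1)."""
    return sum(labels[i] for i in indices) / len(indices)


rng = random.Random(0)
labels = [0] * 10_000 + [1] * 100  # toy data: 0 = benign, 1 = rare attack
all_idx = list(range(len(labels)))

# Design (a): undersample the majority class first, then split.
reduced = undersample_majority(all_idx, labels, 0.5, rng)
train_a, test_a = stratified_split(reduced, labels, 0.3, rng)

# Design (b): split first, then undersample only the training portion (assumption).
train_b, test_b = stratified_split(all_idx, labels, 0.3, rng)
train_b = undersample_majority(train_b, labels, 0.5, rng)

print(f"original minority share:   {minority_share(all_idx, labels):.4f}")
print(f"test share, before design: {minority_share(test_a, labels):.4f}")
print(f"test share, after design:  {minority_share(test_b, labels):.4f}")
```

In this sketch, only the split-first design leaves the test set at the original class ratio; undersampling first roughly doubles the minority share in the test set.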

9. Conclusions

In this paper, two different designs that address the issue of class imbalance in network intrusion or cybersecurity datasets were compared using resampling techniques. The objective was to see how combinations of undersampling and oversampling help to better predict the minority classes in highly imbalanced datasets. Comparing both design approaches, we found that the ratio of 0.5 random undersampling to 0.1–0.5 oversampling using BSMOTE works best (based on the dataset) for random undersampling before stratified splitting of the training and testing data. On the other hand, the ratio of 0.5 random undersampling to 0.1 oversampling using BSMOTE works best (in most cases) for random undersampling after stratified splitting of the training and testing data. Random undersampling after oversampling using BSMOTE allows for the use of lower ratios of oversampled data. However, although the average accuracy would appear comparable for both methods, the average precision, recall, and other measures were higher in the random undersampling before splitting. This can be attributed to stratified train/test splitting before random undersampling, ensuring that the train/test samples mimic the actual ratios of the majority to minority classes.
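To see why accuracy can look comparable across designs while the other measures differ, the metrics can be computed from a single confusion matrix. The sketch below uses hypothetical counts (they are not taken from the study's runs) and a hypothetical helper, `binary_metrics`. Macro precision is taken as the unweighted mean of the attack-class and benign-class precisions, which is consistent with the values reported in the classification tables.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute the reported metrics, treating the rare attack as the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # share of predicted attacks that are real
    recall = tp / (tp + fn)             # share of real attacks that are detected
    f_score = 2 * precision * recall / (precision + recall)
    benign_precision = tn / (tn + fn)   # precision of the benign (majority) class
    macro_precision = (precision + benign_precision) / 2
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "macro_precision": macro_precision,
    }


# Hypothetical test set: 10 attacks among 10,000 records.
m = binary_metrics(tp=7, fp=5, fn=3, tn=9985)
for name, value in m.items():
    print(f"{name:16s}{value:.4f}")
```

With so few attack records, a handful of errors barely moves accuracy but changes precision and recall substantially, which is why the minority-class metrics are the ones compared throughout.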

10. Future Work

In this study, random undersampling was fixed at 0.5 of the majority data, and only the oversampling ratio was varied. Future work will vary both the undersampling and the oversampling ratios. Additionally, we would like to extend this approach to other datasets with small minority classes and compare the results across other classifiers.

Author Contributions

This work was conceptualized by S.B. (Sikha Bagui), D.M., S.B. (Subhash Bagui), S.S. and D.W.; methodology was performed by S.B. (Sikha Bagui), D.M., S.B. (Subhash Bagui), S.S. and D.W.; validation was performed by S.B. (Sikha Bagui), S.B. (Subhash Bagui), S.S. and D.W.; formal analysis was performed by S.S. and D.W.; investigation was performed by S.S. and D.W.; resources were provided by S.B. (Sikha Bagui), D.M., S.S. and D.W.; data curation was performed by S.S. and D.W.; original draft preparation was performed by S.B. (Sikha Bagui), S.S. and D.W.; reviewing and editing was performed by S.B. (Sikha Bagui), D.M., S.B. (Subhash Bagui), S.S. and D.W.; visualizations were performed by S.B. (Sikha Bagui), S.S. and D.W.; supervision was performed by S.B. (Sikha Bagui), D.M. and S.B. (Subhash Bagui); project administration was performed by S.B. (Sikha Bagui), D.M., S.B. (Subhash Bagui), S.S. and D.W.; funding acquisition was performed by S.B. (Sikha Bagui), D.M. and S.B. (Subhash Bagui). All authors have read and agreed to the published version of the manuscript.

Funding

The research was funded by 2021 NCAE-C-002: Cyber Research Innovation Grant Program, grant number H98230-21-1-0170.

Data Availability Statement

UWF-ZeekData22 is available at datasets.uwf.edu (accessed on 1 March 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zippia, How Many People Use the Internet? Available online: https://www.zippia.com/advice/how-many-people-use-the-internet/ (accessed on 1 March 2023).
  2. CSO, Up to Three Percent of Internet Traffic is Malicious, Researcher Says. Available online: https://www.csoonline.com/article/2122506/up-to-three-percent-of-internet-traffic-is-malicious--researcher-says.html (accessed on 15 February 2023).
  3. Bagui, S.; Li, K. Resampling Imbalanced Data for Network Intrusion Detection Datasets. J. Big Data 2021, 8, 6.
  4. Moustafa, N.; Slay, J. UNSW-NB15: A Comprehensive Data Set for Network Intrusion Detection Systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, ACT, Australia, 10–12 November 2015; pp. 1–6.
  5. UWF-ZeekData22 Dataset. Available online: Datasets.uwf.edu (accessed on 1 February 2023).
  6. Machine Learning Mastery Random Oversampling and Undersampling for Imbalanced Classification. Available online: https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html#imblearn.under_sampling.RandomUnderSampler (accessed on 12 December 2022).
  7. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
  8. Han, H.; Wang, W.-Y.; Mao, B.-G. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005.
  9. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328.
  10. Abdi, L.; Hashemi, S. To Combat Multi-class Imbalanced Problems by Means of Over-sampling Techniques. IEEE Trans. Knowl. Data Eng. 2016, 28, 238–251.
  11. Imbalanced-Learn, RandomUnderSampler. Available online: https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html (accessed on 5 January 2023).
  12. Shamsudin, H.; Yusof, U.; Jayalakshmi, A.; Akmal Khalid, M. Combining Oversampling and Undersampling Techniques for Imbalanced Classification: A Comparative Study Using Credit Card Fraudulent Transaction Dataset. In Proceedings of the 2020 IEEE 16th International Conference on Control & Automation, Singapore, 9–11 October 2020.
  13. Barandela, R.; Sánchez, J.S.; García, V.; Rangel, E. Strategies for Learning in Class Imbalance Problems. Pattern Recognit. 2003, 36, 849–851.
  14. Vandewiele, G.; Dehaene, I.; Kovács, G.; Sterckx, L.; Janssens, O.; Ongenae, F.; De Backere, F.; De Turck, F.; Roelens, K.; Decruyenaere, J.; et al. Overly Optimistic Prediction Results on Imbalanced Data: Flaws and benefits of Applying Over-sampling. Artif. Intell. Med. 2020. preprint.
  15. Bajer, D.; Zonć, B.; Dudjak, M.; Martinović, G. Performance Analysis of SMOTE-based Oversampling Techniques When Dealing with Data Imbalance. In Proceedings of the 2019 International Conference on Systems, Signals and Image Processing (IWSSIP), Osijek, Croatia, 5–7 June 2019; pp. 265–271.
  16. Bagui, S.; Simonds, J.; Plenkers, R.; Bennett, T.A.; Bagui, S. Classifying UNSW-NB15 Network Traffic in the Big Data Framework Using Random Forest in Spark. Int. J. Big Data Intell. Appl. 2021, 2, 39–61.
  17. Koziarski, M. CSMOUTE: Combined Synthetic Oversampling and Undersampling Technique for Imbalanced Data Classification. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8.
  18. Liu, A.Y. The Effect of Oversampling and Undersampling on Classifying Imbalanced Text Datasets. Ph.D. Thesis, The University of Texas at Austin, Austin, TX, USA, 2004.
  19. Estabrooks, A.; Jo, T.; Japkowicz, N. A Multiple Resampling Method for Learning from Imbalanced Data Sets. Comput. Intell. 2004, 20, 18–36.
  20. Gonzalez-Cuautle, D.; Hernandez-Suarez, A.; Sanchez-Perez, G.; Toscano-Medina, L.K.; Portillo-Portillo, J.; Olivares-Mercado, J.; Perez-Meana, H.M.; Sandoval-Orozco, A.L. Synthetic Minority Oversampling Technique for Optimizing Classification Tasks in Botnet and Intrusion-Detection-System Datasets. Appl. Sci. 2020, 10, 794.
  21. Bagui, S.S.; Mink, D.; Bagui, S.C.; Ghosh, T.; Plenkers, R.; McElroy, T.; Dulaney, S.; Shabanali, S. Introducing UWF-ZeekData22: A Comprehensive Network Traffic Dataset Based on the MITRE ATT&CK Framework. Data 2023, 8, 18.
  22. Bagui, S.; Mink, D.; Bagui, S.; Ghosh, T.; McElroy, T.; Paredes, E.; Khasnavis, N.; Plenkers, R. Detecting Reconnaissance and Discovery Tactics from the MITRE ATT&CK Framework in Zeek Conn Logs Using Spark’s Machine Learning in the Big Data Framework. Sensors 2022, 22, 7999.
  23. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques; Morgan Kaufmann: Burlington, MA, USA, 2022.
  24. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  25. Apache Spark StringIndexer. Available online: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StringIndexer.html (accessed on 1 March 2023).
  26. Understand TCP/IP Addressing and Subnetting Basics. Available online: https://docs.microsoft.com/en-us/troubleshoot/windows-client/networking/tcpip-addressing-and-subnetting (accessed on 1 March 2023).
  27. Service Name and Transport Protocol Port Number Registry. Available online: https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml (accessed on 2 March 2023).
  28. Scikit Learn 3.3 Metrics and Scoring: Quantifying the Quality of Predictions. Available online: https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score (accessed on 12 February 2023).
  29. Powers, D.M.W. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. J. Mach. Learn. Technol. 2011, 2, 37–63.
  30. sklearn.metrics.precision_recall_fscore_support. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html (accessed on 12 February 2023).
Figure 1. Experimental design: (a) Resampling Before Splitting; (b) Resampling After Splitting.
Figure 2. Comparison of oversampling techniques.
Table 1. UNSW-NB15: distribution of attack families [4].

| Type of Attack | Count | % of Attack Data | % of Benign Data | % of Total Data |
|---|---|---|---|---|
| Worms | 174 | 0.054 | 0.007 | 0.006 |
| Shellcode | 1511 | 0.47 | 0.068 | 0.059 |
| Backdoors | 2329 | 0.724 | 0.104 | 0.091 |
| Analysis | 2677 | 0.833 | 0.12 | 0.105 |
| Reconnaissance | 13,987 | 4.353 | 0.63 | 0.55 |
| DoS | 16,353 | 5.089 | 0.737 | 0.643 |
| Fuzzers | 24,246 | 7.546 | 1.092 | 0.954 |
| Exploits | 44,525 | 13.858 | 2.006 | 1.752 |
| Generic | 215,481 | 67.068 | 9.711 | 8.483 |
| Total attack data | 321,283 | - | - | - |
| Benign data | 2,218,761 | - | - | 87.351 |
| Total | 2,540,044 | - | - | - |
Table 2. UWF-ZeekData22: distribution of MITRE ATT&CK tactics [5,21].

| Label_Tactic | Count | % of Attack Data | % of Benign Data | % of Total Data |
|---|---|---|---|---|
| Persistence | 1 | 0.00001 | 0.00001 | 0.000005 |
| Initial_access | 1 | 0.00001 | 0.00001 | 0.000005 |
| Defense_evasion | 1 | 0.00001 | 0.00001 | 0.000005 |
| Resource_development | 3 | 0.00003 | 0.00003 | 0.00001 |
| Lateral_movement | 4 | 0.00004 | 0.00004 | 0.00002 |
| Exfiltration | 7 | 0.00007 | 0.00007 | 0.00003 |
| Privilege_escalation | 13 | 0.00014 | 0.00014 | 0.00007 |
| Credential_access | 31 | 0.00033 | 0.00033 | 0.00016 |
| Discovery | 2086 | 0.02247 | 0.02247 | 0.01123 |
| Reconnaissance | 9,278,722 | 99.97686 | 99.969 | 49.98646 |
| Total attack data | 9,280,869 | - | - | - |
| Benign_data | 9,281,599 | - | - | 50.00196 |
| Total | 18,562,468 | - | - | - |
Table 3. Hardware and software configurations.

| | Random Undersampling before Stratified Splitting | Random Undersampling after Stratified Splitting |
|---|---|---|
| Processor | AMD Ryzen 7 5700 | Intel Core i7 1165G7 |
| RAM | 32 GB | 16 GB |
| OS | Windows 11 Home | Windows 11 Home |
| OS Version | 22 H2 | 21 H2 |
| OS Build | 22621.819 | 22000.1219 |
| GPU | RTX 3060 | NA |
Table 4. Python library versions.

| | Random Undersampling before Stratified Splitting | Random Undersampling after Stratified Splitting |
|---|---|---|
| Python | 3.9 | 3.10.4 |
| Anaconda | 2022.1 | 2021.5 |
| Pandas | 1.5.2 | 1.5.0 |
| Scikit-learn | 1.9.3 | 1.0.2 |
| Numpy | 1.23.5 | 1.23.4 |
| Imblearn | 0.10.0 | 0 |
Table 5. UNSW-NB15 (Worms): classification results for random undersampling before splitting.

| Oversampling % | Statistic | Accuracy | Precision | Recall | F-Score | Macro Precision |
|---|---|---|---|---|---|---|
| 0.1 | Avg | 0.9999 | 0.769 | 0.794 | 0.778 | 0.884 |
| | SD | | 0.056 | 0.063 | 0.034 | 0.028 |
| 0.2 | Avg | 0.99991 | 0.806 | 0.779 | 0.789 | 0.902 |
| | SD | | 0.065 | 0.09 | 0.058 | 0.032 |
| 0.3 | Avg | 0.9999 | 0.799 | 0.717 | 0.752 | 0.899 |
| | SD | | 0.059 | 0.075 | 0.047 | 0.029 |
| 0.4 | Avg | 0.99991 | 0.802 | 0.783 | 0.787 | 0.901 |
| | SD | | 0.068 | 0.076 | 0.043 | 0.034 |
| 0.5 | Avg | 0.999 | 0.836 | 0.871 | 0.852 | 0.918 |
| | SD | | 0.051 | 0.037 | 0.033 | 0.026 |
| 0.6 | Avg | 0.999 | 0.828 | 0.851 | 0.838 | 0.914 |
| | SD | | 0.057 | 0.057 | 0.042 | 0.029 |
| 0.7 | Avg | 0.999 | 0.849 | 0.88 | 0.862 | 0.924 |
| | SD | | 0.053 | 0.051 | 0.026 | 0.027 |
| 0.8 | Avg | 0.999 | 0.847 | 0.851 | 0.847 | 0.924 |
| | SD | | 0.056 | 0.051 | 0.039 | 0.028 |
| 0.9 | Avg | 0.999 | 0.807 | 0.892 | 0.845 | 0.903 |
| | SD | | 0.056 | 0.043 | 0.034 | 0.028 |
| 1.0 | Avg | 0.999 | 0.818 | 0.75 | 0.782 | 0.909 |
| | SD | | 0.069 | 0.051 | 0.055 | 0.034 |
| Averages | | | 0.8161 | 0.8168 | 0.8132 | 0.9078 |
Table 6. Welch’s t-test results for UNSW-NB15 (Worms): random undersampling before splitting.

| Welch’s t-Test Results (p < 0.10) | Precision t-Value | Recall t-Value | F-Score t-Value | Macro Precision t-Value | Analysis |
|---|---|---|---|---|---|
| 0.1 vs. 0.2 | −1.366 | 0.441 | −0.451 | −1.366 | 0.1 and 0.2 are statistically equal |
| 0.1 vs. 0.3 | −1.167 | 2.489 | 1.423 | −1.167 | 0.1 is better |
| 0.1 vs. 0.4 | −1.179 | 0.368 | −0.516 | −1.180 | 0.1 and 0.4 are statistically equal |
| 0.1 vs. 0.5 | −2.793 | −3.315 | −4.955 | −2.795 | 0.5 is better than 0.1 |
| 0.5 vs. 0.6 | 0.326 | 0.949 | 0.863 | 0.326 | 0.5 and 0.6 are statistically equal |
| 0.5 vs. 0.7 | −0.551 | −0.423 | −0.722 | −0.551 | 0.5 and 0.7 are statistically equal |
| 0.5 vs. 0.8 | −0.467 | 1.022 | 0.291 | −0.466 | 0.5 and 0.8 are statistically equal |
| 0.5 vs. 0.9 | 1.213 | −1.130 | 0.442 | 1.213 | 0.5 and 0.9 are statistically equal |
| 0.5 vs. 1.0 | 0.649 | 6.075 | 3.451 | 0.651 | 0.5 is better than 1.0 |
Table 7. UNSW-NB15 (Shellcode): classification results for random undersampling before splitting.

| Oversampling % | Statistic | Accuracy | Precision | Recall | F-Score | Macro Precision |
|---|---|---|---|---|---|---|
| 0.1 | Avg | 0.9996 | 0.849 | 0.941 | 0.892 | 0.924 |
| | SD | | 0.013 | 0.014 | 0.012 | 0.006 |
| 0.2 | Avg | 0.9997 | 0.889 | 0.969 | 0.927 | 0.944 |
| | SD | | 0.015 | 0.01 | 0.007 | 0.007 |
| 0.3 | Avg | 0.9997 | 0.896 | 0.962 | 0.927 | 0.948 |
| | SD | | 0.013 | 0.012 | 0.007 | 0.006 |
| 0.4 | Avg | 0.9997 | 0.898 | 0.958 | 0.927 | 0.949 |
| | SD | | 0.008 | 0.008 | 0.007 | 0.004 |
| 0.5 | Avg | 0.9996 | 0.887 | 0.964 | 0.924 | 0.944 |
| | SD | | 0.014 | 0.007 | 0.007 | 0.007 |
| 0.6 | Avg | 0.9996 | 0.876 | 0.964 | 0.918 | 0.938 |
| | SD | | 0.012 | 0.014 | 0.008 | 0.006 |
| 0.7 | Avg | 0.9995 | 0.844 | 0.925 | 0.883 | 0.922 |
| | SD | | 0.012 | 0.013 | 0.009 | 0.006 |
| 0.8 | Avg | 0.9995 | 0.846 | 0.932 | 0.887 | 0.923 |
| | SD | | 0.012 | 0.013 | 0.005 | 0.006 |
| 0.9 | Avg | 0.9995 | 0.849 | 0.929 | 0.887 | 0.924 |
| | SD | | 0.013 | 0.01 | 0.007 | 0.006 |
| 1 | Avg | 0.9996 | 0.855 | 0.939 | 0.895 | 0.928 |
| | SD | | 0.014 | 0.014 | 0.008 | 0.007 |
| Averages | | 0.9996 | 0.8689 | 0.9483 | 0.9067 | 0.9344 |
Table 8. Welch’s t-test results for UNSW-NB15 (Shellcode): random undersampling before splitting.

| Welch’s t-Test Results (p < 0.10) | Precision t-Value | Recall t-Value | F-Score t-Value | Macro Precision t-Value | Analysis |
|---|---|---|---|---|---|
| 0.1 vs. 0.2 | −6.530 | −5.250 | −8.049 | −6.537 | 0.2 is better than 0.1 |
| 0.2 vs. 0.3 | −1.093 | 1.438 | −0.107 | −1.092 | 0.2 is better than 0.3 |
| 0.2 vs. 0.4 | −1.613 | 2.616 | −0.005 | −1.609 | 0.2 has better recall |
| 0.4 vs. 0.5 | 2.015 | −1.784 | 0.888 | 2.012 | 0.5 has better recall |
| 0.4 vs. 0.6 | 4.557 | −1.176 | 2.507 | 4.553 | 0.4 is better than 0.6 |
| 0.4 vs. 0.7 | 11.157 | 6.763 | 11.799 | 11.163 | 0.4 is better than 0.7 |
| 0.4 vs. 0.8 | 11.097 | 5.620 | 13.965 | 11.110 | 0.4 is better than 0.8 |
| 0.4 vs. 0.9 | 10.104 | 7.201 | 12.712 | 10.113 | 0.4 is better than 0.9 |
| 0.4 vs. 1 | 8.290 | 3.908 | 9.362 | 8.296 | 0.4 is better than 1.0 |
Table 9. UNSW-NB15 (Backdoors): classification results for random undersampling before splitting.

| Oversampling % | Statistic | Accuracy | Precision | Recall | F-Score | Macro Precision |
|---|---|---|---|---|---|---|
| 0.1 | Avg | 0.9998 | 0.962 | 0.96 | 0.961 | 0.981 |
| | SD | | 0.006 | 0.006 | 0.003 | 0.003 |
| 0.2 | Avg | 0.9998 | 0.969 | 0.964 | 0.966 | 0.984 |
| | SD | | 0.009 | 0.007 | 0.006 | 0.004 |
| 0.3 | Avg | 0.9997 | 0.962 | 0.95 | 0.956 | 0.981 |
| | SD | | 0.005 | 0.01 | 0.005 | 0.003 |
| 0.4 | Avg | 0.9998 | 0.97 | 0.956 | 0.963 | 0.985 |
| | SD | | 0.007 | 0.009 | 0.005 | 0.003 |
| 0.5 | Avg | 0.9998 | 0.97 | 0.951 | 0.961 | 0.985 |
| | SD | | 0.007 | 0.008 | 0.005 | 0.003 |
| 0.6 | Avg | 0.9998 | 0.965 | 0.954 | 0.96 | 0.982 |
| | SD | | 0.007 | 0.009 | 0.005 | 0.003 |
| 0.7 | Avg | 0.9997 | 0.968 | 0.946 | 0.957 | 0.984 |
| | SD | | 0.006 | 0.011 | 0.006 | 0.003 |
| 0.8 | Avg | 0.9998 | 0.966 | 0.95 | 0.958 | 0.983 |
| | SD | | 0.005 | 0.01 | 0.006 | 0.002 |
| 0.9 | Avg | 0.9998 | 0.967 | 0.957 | 0.962 | 0.983 |
| | SD | | 0.008 | 0.006 | 0.004 | 0.004 |
| 1 | Avg | 0.9998 | 0.968 | 0.952 | 0.96 | 0.984 |
| | SD | | 0.009 | 0.01 | 0.008 | 0.004 |
| Averages | | 0.99978 | 0.9667 | 0.954 | 0.9604 | 0.9832 |
Table 10. Welch’s t-test results for UNSW-NB15 (Backdoors): random undersampling before splitting.

| Welch’s t-Test Results (p < 0.10) | Precision t-Value | Recall t-Value | F-Score t-Value | Macro Precision t-Value | Analysis |
|---|---|---|---|---|---|
| 0.1 vs. 0.2 | −1.813 | −1.602 | −2.697 | −1.818 | 0.2 is better than 0.1 across all metrics |
| 0.2 vs. 0.3 | 1.955 | 3.627 | 4.492 | 1.970 | 0.2 is better than 0.3 across all metrics |
| 0.2 vs. 0.4 | −0.404 | 2.338 | 1.491 | −0.397 | 0.2 is better than 0.4 in recall and F-score |
| 0.2 vs. 0.5 | −0.474 | 3.866 | 2.359 | −0.464 | 0.2 is better than 0.5 in recall and F-score |
| 0.2 vs. 0.6 | 1.070 | 2.621 | 2.861 | 1.079 | 0.2 is better than 0.6 in recall and F-score |
| 0.2 vs. 0.7 | 0.189 | 4.415 | 3.563 | 0.206 | 0.2 is better than 0.7 in recall and F-score |
| 0.2 vs. 0.8 | 0.829 | 3.661 | 3.254 | 0.842 | 0.2 is better than 0.8 in recall and F-score |
| 0.2 vs. 0.9 | 0.524 | 2.378 | 2.136 | 0.531 | 0.2 is better than 0.9 in recall and F-score |
| 0.2 vs. 1.0 | 0.180 | 3.043 | 2.149 | 0.189 | 0.2 is better than 1.0 in recall and F-score |
Table 11. UWF-ZeekData22 (credential access): classification results for random undersampling before splitting.

| Oversampling % | Statistic | Accuracy | Precision | Recall | F-Score | Macro Precision |
|---|---|---|---|---|---|---|
| 0.1 | Avg | 0.999 | 0.742 | 0.889 | 0.801 | 0.871 |
| | SD | | 0.134 | 0.165 | 0.126 | 0.067 |
| 0.2 | Avg | 0.999 | 0.885 | 0.911 | 0.891 | 0.942 |
| | SD | | 0.05 | 0.139 | 0.075 | 0.024 |
| 0.3 | Avg | 0.999 | 0.847 | 0.867 | 0.849 | 0.923 |
| | SD | | 0.101 | 0.147 | 0.107 | 0.05 |
| 0.4 | Avg | 0.999 | 0.836 | 0.856 | 0.83 | 0.918 |
| | SD | | 0.106 | 0.165 | 0.107 | 0.053 |
| 0.5 | Avg | 0.999 | 0.936 | 0.911 | 0.919 | 0.968 |
| | SD | | 0.069 | 0.109 | 0.072 | 0.034 |
| 0.6 | Avg | 0.999 | 0.906 | 0.867 | 0.87 | 0.953 |
| | SD | | 0.103 | 0.163 | 0.094 | 0.052 |
| 0.7 | Avg | 0.999 | 0.882 | 0.922 | 0.894 | 0.941 |
| | SD | | 0.067 | 0.122 | 0.06 | 0.033 |
| 0.8 | Avg | 0.999 | 0.826 | 0.911 | 0.853 | 0.913 |
| | SD | | 0.143 | 0.097 | 0.084 | 0.071 |
| 0.9 | Avg | 0.999 | 0.829 | 0.967 | 0.889 | 0.915 |
| | SD | | 0.079 | 0.071 | 0.054 | 0.04 |
| 1 | Avg | 0.999998 | 0.832 | 0.911 | 0.864 | 0.916 |
| | SD | | 0.109 | 0.097 | 0.076 | 0.054 |
| Averages | | 0.9991 | 0.8521 | 0.9012 | 0.866 | 0.926 |
Table 12. Welch’s t-test results for UWF-ZeekData22 (credential access): random undersampling before splitting.

| Welch’s t-Test Results (p < 0.10) | Precision t-Value | Recall t-Value | F-Score t-Value | Macro Precision t-Value | Analysis |
|---|---|---|---|---|---|
| 0.1 vs. 0.2 | −3.162 | −0.322 | −1.941 | −3.155 | 0.2 is better than 0.1 in precision, F-score, and macro precision |
| 0.2 vs. 0.3 | 1.066 | 0.688 | 1.016 | 1.083 | 0.2 and 0.3 are statistically the same |
| 0.2 vs. 0.4 | 1.322 | 0.806 | 1.476 | 1.304 | 0.2 is better than 0.4 in F-score |
| 0.2 vs. 0.5 | −1.893 | 0.000 | −0.852 | −1.976 | 0.5 is better than 0.2 in precision and macro precision |
| 0.5 vs. 0.6 | 0.765 | 0.710 | 1.309 | 0.763 | 0.5 and 0.6 are statistically equal |
| 0.5 vs. 0.7 | 0.113 | −0.188 | −0.099 | 0.077 | 0.5 and 0.7 are statistically equal |
| 0.5 vs. 0.8 | 2.191 | 0.000 | 1.886 | 2.209 | 0.5 is better than 0.8 in precision, F-score, and macro precision |
| 0.5 vs. 0.9 | 3.226 | −1.361 | 1.054 | 3.193 | 0.5 is better than 0.9 in precision and macro precision |
| 0.5 vs. 1.0 | 1.398 | 0.000 | 0.800 | 1.391 | 0.5 is better than 1.0 in precision and macro precision |
Table 13. UWF-ZeekData22 (privilege escalation): classification results for random undersampling before splitting.

| Oversampling % | Statistic | Accuracy | Precision | Recall | F-Score | Macro Precision |
|---|---|---|---|---|---|---|
| 0.1 | Avg | 0.9999 | 0.902 | 0.775 | 0.81 | 0.951 |
| | SD | | 0.125 | 0.236 | 0.156 | 0.063 |
| 0.2 | Avg | 0.999996 | 0.904 | 0.8 | 0.828 | 0.952 |
| | SD | | 0.138 | 0.232 | 0.165 | 0.069 |
| 0.3 | Avg | 0.999996 | 0.895 | 0.825 | 0.841 | 0.947 |
| | SD | | 0.15 | 0.243 | 0.185 | 0.075 |
| 0.4 | Avg | 0.999996 | 0.9 | 0.794 | 0.824 | 0.95 |
| | SD | | 0.147 | 0.249 | 0.185 | 0.074 |
| 0.5 | Avg | 0.999996 | 0.908 | 0.815 | 0.839 | 0.954 |
| | SD | | 0.139 | 0.244 | 0.177 | 0.069 |
| 0.6 | Avg | 0.999996 | 0.907 | 0.846 | 0.857 | 0.954 |
| | SD | | 0.136 | 0.233 | 0.169 | 0.068 |
| 0.7 | Avg | 0.999996 | 0.905 | 0.843 | 0.853 | 0.952 |
| | SD | | 0.139 | 0.24 | 0.175 | 0.07 |
| 0.8 | Avg | 0.999996 | 0.906 | 0.834 | 0.848 | 0.953 |
| | SD | | 0.139 | 0.247 | 0.182 | 0.069 |
| 0.9 | Avg | 0.999996 | 0.899 | 0.844 | 0.851 | 0.95 |
| | SD | | 0.14 | 0.24 | 0.178 | 0.07 |
| 1 | Avg | 0.999996 | 0.899 | 0.848 | 0.852 | 0.949 |
| | SD | | 0.14 | 0.242 | 0.181 | 0.07 |
| Averages | | 0.9999959 | 0.9025 | 0.8224 | 0.8403 | 0.9512 |
Table 14. Welch’s t-test results for UWF-ZeekData22 (privilege escalation): random undersampling before splitting.

| Welch’s t-Test Results (p < 0.10) | Precision t-Value | Recall t-Value | F-Score t-Value | Macro Precision t-Value | Analysis |
|---|---|---|---|---|---|
| 0.1 vs. 0.2 | −0.042 | −0.239 | −0.252 | −0.042 | 0.1 and 0.2 are statistically equal |
| 0.1 vs. 0.3 | 0.108 | −0.467 | −0.411 | 0.108 | 0.1 and 0.3 are statistically equal |
| 0.1 vs. 0.4 | 0.034 | −0.173 | −0.178 | 0.034 | 0.1 and 0.4 are statistically equal |
| 0.1 vs. 0.5 | −0.102 | −0.373 | −0.387 | −0.102 | 0.1 and 0.5 are statistically equal |
| 0.1 vs. 0.6 | −0.100 | −0.675 | −0.644 | −0.100 | 0.1 and 0.6 are statistically equal |
| 0.1 vs. 0.7 | −0.056 | −0.638 | −0.587 | −0.056 | 0.1 and 0.7 are statistically equal |
| 0.1 vs. 0.8 | −0.074 | −0.550 | −0.502 | −0.074 | 0.1 and 0.8 are statistically equal |
| 0.1 vs. 0.9 | 0.037 | −0.652 | −0.555 | 0.037 | 0.1 and 0.9 are statistically equal |
| 0.1 vs. 1.0 | 0.048 | −0.678 | −0.556 | 0.048 | 0.1 and 1.0 are statistically equal |
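Table 15. UNSW-NB15 (Worms): classification results for random undersampling after splitting.

| Oversampling % | Statistic | Accuracy | Precision | Recall | F-Score | Macro Precision |
|---|---|---|---|---|---|---|
| 0.1 | Avg | 0.999 | 0.608 | 0.737 | 0.665 | 0.804 |
| | SD | NA | 0.067 | 0.044 | 0.0495 | 0.034 |
| 0.2 | Avg | 0.999 | 0.601 | 0.712 | 0.646 | 0.8 |
| | SD | NA | 0.108 | 0.089 | 0.083 | 0.054 |
| 0.3 | Avg | 0.999 | 0.566 | 0.773 | 0.651 | 0.783 |
| | SD | NA | 0.066 | 0.059 | 0.051 | 0.033 |
| 0.4 | Avg | 0.999 | 0.566 | 0.781 | 0.654 | 0.783 |
| | SD | NA | 0.035 | 0.063 | 0.026 | 0.018 |
| 0.5 | Avg | 0.999 | 0.581 | 0.738 | 0.65 | 0.791 |
| | SD | NA | 0.078 | 0.082 | 0.079 | 0.039 |
| 0.6 | Avg | 0.999 | 0.587 | 0.76 | 0.656 | 0.793 |
| | SD | NA | 0.097 | 0.044 | 0.061 | 0.049 |
| 0.7 | Avg | 0.999 | 0.62 | 0.753 | 0.679 | 0.81 |
| | SD | NA | 0.053 | 0.046 | 0.041 | 0.026 |
| 0.8 | Avg | 0.999 | 0.54 | 0.719 | 0.614 | 0.77 |
| | SD | NA | 0.081 | 0.036 | 0.041 | 0.018 |
| 0.9 | Avg | 0.999 | 0.573 | 0.711 | 0.629 | 0.787 |
| | SD | NA | 0.117 | 0.097 | 0.089 | 0.058 |
| 1 | Avg | 0.999 | 0.601 | 0.75 | 0.666 | 0.801 |
| | SD | NA | 0.062 | 0.081 | 0.06 | 0.031 |
| Averages | | 0.999 | 0.5843 | 0.7434 | 0.651 | 0.7922 |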
Table 16. Welch’s t-test results for UNSW-NB15 (Worms): random undersampling after splitting.

| Welch’s t-Test Results (p < 0.10) | Precision t-Value | Recall t-Value | F-Score t-Value | Macro Precision t-Value | Analysis |
|---|---|---|---|---|---|
| 0.1 vs. 0.2 | 0.183 | 0.799 | 0.612 | 0.183 | No significant difference |
| 0.1 vs. 0.3 | 1.413 | −1.563 | 0.592 | 1.413 | 0.1 is better than 0.3 in precision and macro precision, but 0.3 is better in recall |
| 0.1 vs. 0.4 | 1.773 | −1.826 | 0.628 | 1.773 | 0.1 is better than 0.4 in precision and macro precision, but 0.4 is better in recall |
| 0.1 vs. 0.5 | 0.836 | −0.065 | 0.506 | 0.836 | 0.1 and 0.5 are statistically equal |
| 0.1 vs. 0.6 | 0.451 | −0.957 | 0.264 | 0.451 | 0.1 and 0.6 are statistically equal |
| 0.1 vs. 0.7 | −0.431 | −0.859 | −0.706 | −0.431 | 0.1 and 0.7 are statistically equal |
| 0.1 vs. 0.8 | 2.052 | 0.959 | 2.5 | 2.826 | 0.1 is better than 0.8 except for recall, where the two are statistically equal |
| 0.1 vs. 0.9 | 0.821 | 0.745 | 1.131 | 0.821 | 0.1 and 0.9 are statistically equal |
| 0.1 vs. 1 | 0.219 | −0.746 | −0.033 | 0.302 | 0.1 and 1 are statistically equal |
Table 17. UNSW-NB15 (Shellcode): classification results for random undersampling after splitting.

| Oversampling % | Statistic | Accuracy | Precision | Recall | F-Score | Macro Precision |
|---|---|---|---|---|---|---|
| 0.1 | Avg | 0.999 | 0.698 | 0.906 | 0.788 | 0.849 |
| | SD | NA | 0.016 | 0.017 | 0.010 | 0.008 |
| 0.2 | Avg | 0.999 | 0.694 | 0.907 | 0.786 | 0.847 |
| | SD | NA | 0.016 | 0.015 | 0.011 | 0.008 |
| 0.3 | Avg | 0.999 | 0.691 | 0.911 | 0.786 | 0.846 |
| | SD | NA | 0.017 | 0.015 | 0.011 | 0.008 |
| 0.4 | Avg | 0.999 | 0.686 | 0.905 | 0.780 | 0.843 |
| | SD | NA | 0.014 | 0.011 | 0.010 | 0.007 |
| 0.5 | Avg | 0.999 | 0.678 | 0.906 | 0.776 | 0.839 |
| | SD | NA | 0.011 | 0.014 | 0.008 | 0.005 |
| 0.6 | Avg | 0.999 | 0.699 | 0.908 | 0.790 | 0.849 |
| | SD | NA | 0.020 | 0.003 | 0.013 | 0.010 |
| 0.7 | Avg | 0.999 | 0.699 | 0.908 | 0.790 | 0.849 |
| | SD | NA | 0.021 | 0.016 | 0.016 | 0.011 |
| 0.8 | Avg | 0.999 | 0.692 | 0.911 | 0.787 | 0.846 |
| | SD | NA | 0.011 | 0.021 | 0.013 | 0.006 |
| 0.9 | Avg | 0.999 | 0.688 | 0.901 | 0.780 | 0.844 |
| | SD | NA | 0.021 | 0.010 | 0.015 | 0.011 |
| 1.0 | Avg | 0.999 | 0.688 | 0.888 | 0.775 | 0.844 |
| | SD | NA | 0.018 | 0.015 | 0.016 | 0.009 |
| Averages | | 0.999 | 0.6913 | 0.9051 | 0.7838 | 0.8456 |
Table 18. Welch’s t-test results for UNSW-NB15 (Shellcode): random undersampling after splitting.

| Welch’s t-Test Results (p < 0.10) | Precision t-Value | Recall t-Value | F-Score t-Value | Macro Precision t-Value | Analysis |
|---|---|---|---|---|---|
| 0.1 vs. 0.2 | 0.566 | −0.276 | 0.386 | 0.566 | 0.1 and 0.2 are statistically equal |
| 0.1 vs. 0.3 | 0.874 | −0.706 | 0.442 | 0.874 | 0.1 and 0.3 are statistically equal |
| 0.1 vs. 0.4 | 1.805 | 0.07 | 1.702 | 1.806 | 0.1 is better than 0.4 in precision, F-score, and macro precision |
| 0.1 vs. 0.5 | 3.349 | −0.128 | 3.104 | 3.349 | 0.1 is better than 0.5 in precision, F-score, and macro precision |
| 0.1 vs. 0.6 | −0.084 | −0.489 | −0.284 | −0.084 | 0.1 and 0.6 are statistically equal |
| 0.1 vs. 0.7 | 0.684 | −0.73 | 0.293 | 0.683 | 0.1 and 0.7 are statistically equal |
| 0.1 vs. 0.8 | 1.62 | 0.52 | 1.48 | 1.62 | 0.1 is better than 0.8 in precision, F-score, and macro precision |
| 0.1 vs. 0.9 | 1.202 | 2.814 | 2.187 | 1.204 | 0.1 is better than 0.9 in recall and F-score |
| 0.1 vs. 1 | 2.101 | 0.246 | 1.832 | 2.101 | 0.1 is better than 1 in all metrics except recall |
Table 19. UNSW-NB15 (Backdoors): classification results for random undersampling after splitting.

| Oversampling % | Statistic | Accuracy | Precision | Recall | F-Score | Macro Precision |
|---|---|---|---|---|---|---|
| 0.1 | Avg | 0.999 | 0.939 | 0.951 | 0.945 | 0.969 |
| | SD | NA | 0.01 | 0.005 | 0.004 | 0.005 |
| 0.2 | Avg | 0.999 | 0.938 | 0.948 | 0.943 | 0.969 |
| | SD | NA | 0.013 | 0.008 | 0.006 | 0.006 |
| 0.3 | Avg | 0.999 | 0.936 | 0.945 | 0.941 | 0.968 |
| | SD | NA | 0.009 | 0.011 | 0.007 | 0.004 |
| 0.4 | Avg | 0.999 | 0.939 | 0.95 | 0.944 | 0.969 |
| | SD | NA | 0.01 | 0.008 | 0.007 | 0.005 |
| 0.5 | Avg | 0.999 | 0.938 | 0.948 | 0.943 | 0.969 |
| | SD | NA | 0.01 | 0.007 | 0.006 | 0.005 |
| 0.6 | Avg | 0.999 | 0.941 | 0.945 | 0.943 | 0.97 |
| | SD | NA | 0.006 | 0.011 | 0.005 | 0.003 |
| 0.7 | Avg | 0.999 | 0.946 | 0.948 | 0.947 | 0.973 |
| | SD | NA | 0.007 | 0.003 | 0.003 | 0.003 |
| 0.8 | Avg | 0.999 | 0.946 | 0.943 | 0.944 | 0.973 |
| | SD | NA | 0.012 | 0.009 | 0.01 | 0.006 |
| 0.9 | Avg | 0.999 | 0.94 | 0.949 | 0.944 | 0.97 |
| | SD | NA | 0.007 | 0.011 | 0.003 | 0.003 |
| 1 | Avg | 0.999 | 0.944 | 0.943 | 0.943 | 0.972 |
| | SD | NA | 0.012 | 0.007 | 0.005 | 0.006 |
| Averages | | 0.999 | 0.9407 | 0.947 | 0.9437 | 0.9702 |
Table 20. Welch’s t-test results for UNSW-NB15 (Backdoors): random undersampling after splitting.

| Welch’s t-Test Results (p < 0.10) | Precision t-Value | Recall t-Value | F-Score t-Value | Macro Precision t-Value | Analysis |
|---|---|---|---|---|---|
| 0.1 vs. 0.2 | 0.013 | 0.812 | 0.555 | 0.013 | 0.1 and 0.2 are statistically equal |
| 0.1 vs. 0.3 | 0.493 | 1.461 | 1.579 | 0.495 | 0.1 is better than 0.3 in recall and F-score |
| 0.1 vs. 0.4 | −0.036 | 0.225 | 0.097 | −0.036 | 0.1 and 0.4 are statistically equal |
| 0.1 vs. 0.5 | 0.102 | 0.978 | 0.68 | 0.103 | 0.1 and 0.5 are statistically equal |
| 0.1 vs. 0.6 | −0.676 | 1.296 | 0.657 | −0.675 | 0.1 and 0.6 are statistically equal |
| 0.1 vs. 0.7 | −1.738 | 1.318 | −1.316 | −1.738 | 0.1 and 0.7 are statistically equal |
| 0.1 vs. 0.8 | −1.389 | 2.264 | 0.063 | −1.387 | 0.8 is better than 0.1 in precision and F-score while recall is better in 0.1 |
| 0.8 vs. 0.9 | 1.336 | −1.37 | −0.011 | 1.334 | 0.8 and 0.9 are statistically equal but 0.9 has better recall |
| 0.8 vs. 1 | 0.353 | 0 | 0.282 | 0.353 | 0.8 and 1 are statistically equal |
Table 21. UWF-ZeekData22 (credential access): classification results for random undersampling after splitting.

| Oversampling % | Statistic | Accuracy | Precision | Recall | F-Score | Macro Precision |
|---|---|---|---|---|---|---|
| 0.1 | Avg | 0.999 | 0.822 | 0.878 | 0.833 | 0.911 |
| | SD | NA | 0.176 | 0.116 | 0.112 | 0.088 |
| 0.2 | Avg | 0.999 | 0.732 | 0.900 | 0.788 | 0.867 |
| | SD | NA | 0.141 | 0.168 | 0.123 | 0.071 |
| 0.3 | Avg | 0.999 | 0.799 | 0.911 | 0.847 | 0.899 |
| | SD | NA | 0.112 | 0.120 | 0.101 | 0.056 |
| 0.4 | Avg | 0.999 | 0.744 | 0.911 | 0.804 | 0.872 |
| | SD | NA | 0.162 | 0.097 | 0.100 | 0.081 |
| 0.5 | Avg | 0.999 | 0.770 | 0.944 | 0.835 | 0.885 |
| | SD | NA | 0.154 | 0.075 | 0.085 | 0.077 |
| 0.6 | Avg | 0.999 | 0.696 | 0.944 | 0.793 | 0.848 |
| | SD | NA | 0.116 | 0.102 | 0.092 | 0.056 |
| 0.7 | Avg | 0.999 | 0.713 | 0.922 | 0.793 | 0.857 |
| | SD | NA | 0.155 | 0.122 | 0.116 | 0.0777 |
| 0.8 | Avg | 0.999 | 0.639 | 0.933 | 0.749 | 0.820 |
| | SD | NA | 0.067 | 0.133 | 0.058 | 0.034 |
| 0.9 | Avg | 0.999 | 0.722 | 0.922 | 0.800 | 0.861 |
| | SD | NA | 0.110 | 0.100 | 0.052 | 0.055 |
| 1.0 | Avg | 0.999 | 0.742 | 0.889 | 0.789 | 0.871 |
| | SD | NA | 0.146 | 0.131 | 0.083 | 0.073 |
| Averages | | 0.999 | 0.7379 | 0.9154 | 0.8031 | 0.8691 |
Table 22. Welch’s t-test results: UWF-ZeekData22: credential access—random undersampling after splitting.
Table 22. Welch’s t-test results: UWF-ZeekData22: credential access—random undersampling after splitting.
| Welch’s t-Test Results (p < 0.10) | Precision t-Value | Recall t-Value | F-Score t-Value | Macro Precision t-Value | Analysis |
|---|---|---|---|---|---|
| 0.1 vs. 0.2 | 1.252 | −0.344 | 0.858 | 1.252 | 0.1 and 0.2 are statistically equal |
| 0.1 vs. 0.3 | 0.348 | −0.632 | −0.286 | 0.348 | 0.1 and 0.3 are statistically equal |
| 0.1 vs. 0.4 | 1.023 | −0.697 | 0.611 | 1.023 | 0.1 and 0.4 are statistically equal |
| 0.1 vs. 0.5 | 0.704 | −1.528 | −0.047 | 0.704 | 0.1 and 0.5 are statistically equal, but 0.5 has better recall |
| 0.1 vs. 0.6 | 1.204 | 0 | 1.06 | 1.204 | 0.5 and 0.6 are statistically equal |
| 0.1 vs. 0.7 | 0.814 | 0.49 | 0.936 | 0.814 | 0.5 and 0.7 are statistically equal |
| 0.1 vs. 0.8 | 2.461 | 0.23 | 2.653 | 2.461 | 0.5 is better than 0.8 except for recall |
| 0.1 vs. 0.9 | 0.786 | 0.563 | 1.112 | 0.786 | 0.5 and 0.9 are statistically equal |
| 0.1 vs. 1 | 0.408 | 1.162 | 1.22 | 0.408 | 0.5 and 1.0 are statistically equal |
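
The t-values in Table 22 can be approximated from the averages and standard deviations in Table 21. The following is a minimal sketch of Welch's unequal-variance t-statistic computed from summary statistics. The function name `welch_t` is illustrative, and the number of runs per oversampling ratio (`RUNS = 10`) is an assumption, since it is not stated in this section.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (t, df) for Welch's unequal-variance t-test from summary statistics."""
    # Squared standard error contributed by each group.
    se2_a = sd_a ** 2 / n_a
    se2_b = sd_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Precision at oversampling 0.1 vs. 0.2, taken from Table 21 (Avg and SD rows).
# RUNS = 10 is an assumption; the number of runs is not stated in this section.
RUNS = 10
t_value, dof = welch_t(0.822, 0.176, RUNS, 0.732, 0.141, RUNS)
print(f"t = {t_value:.3f}, df = {dof:.1f}")
```

Under the assumed ten runs, this gives a precision t-value of about 1.26 for 0.1 vs. 0.2, close to the 1.252 reported in Table 22. The small gap would be consistent with the tabulated averages and standard deviations being rounded to three decimals.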
Table 23. UWF-ZeekData22: privilege escalation—classification results for random undersampling after splitting.
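| Oversampling % | Statistic | Accuracy | Precision | Recall | F-Score | Macro Precision |
|---|---|---|---|---|---|---|
| 0.1 | Avg | 0.999 | 0.900 | 0.750 | 0.779 | 0.949 |
| | SD | NA | 0.213 | 0.273 | 0.226 | 0.106 |
| 0.2 | Avg | 0.999 | 0.813 | 0.725 | 0.712 | 0.906 |
| | SD | NA | 0.203 | 0.343 | 0.260 | 0.101 |
| 0.3 | Avg | 0.999 | 0.820 | 0.825 | 0.768 | 0.909 |
| | SD | NA | 0.130 | 0.275 | 0.147 | 0.065 |
| 0.4 | Avg | 0.999 | 0.800 | 0.800 | 0.788 | 0.899 |
| | SD | NA | 0.322 | 0.331 | 0.316 | 0.161 |
| 0.5 | Avg | 0.999 | 0.843 | 0.825 | 0.811 | 0.921 |
| | SD | NA | 0.253 | 0.275 | 0.238 | 0.126 |
| 0.6 | Avg | 0.999 | 0.745 | 0.899 | 0.804 | 0.872 |
| | SD | NA | 0.091 | 0.135 | 0.068 | 0.045 |
| 0.7 | Avg | 0.999 | 0.704 | 0.911 | 0.785 | 0.852 |
| | SD | NA | 0.121 | 0.147 | 0.107 | 0.060 |
| 0.8 | Avg | 0.999 | 0.707 | 0.888 | 0.781 | 0.853 |
| | SD | NA | 0.094 | 0.157 | 0.099 | 0.047 |
| 0.9 | Avg | 0.999 | 0.738 | 0.855 | 0.778 | 0.869 |
| | SD | NA | 0.138 | 0.131 | 0.087 | 0.069 |
| 1.0 | Avg | 0.999 | 0.759 | 0.966 | 0.843 | 0.879 |
| | SD | NA | 0.131 | 0.071 | 0.087 | 0.065 |
| Averages | | 0.999 | 0.7833 | 0.845 | 0.7853 | 0.8915 |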
Table 24. Welch’s t-test results: UWF-ZeekData22: privilege escalation—random undersampling after splitting.
| Welch’s t-Test Results (p < 0.10) | Precision t-Value | Recall t-Value | F-Score t-Value | Macro Precision t-Value | Analysis |
|---|---|---|---|---|---|
| 0.1 vs. 0.2 | 0.929 | 0.179 | 0.611 | 0.929 | 0.1 and 0.2 are statistically equal |
| 0.1 vs. 0.3 | 1.012 | −0.611 | 0.118 | 1.012 | 0.1 and 0.3 are statistically equal |
| 0.1 vs. 0.4 | 0.817 | −0.367 | −0.08 | 0.817 | 0.1 and 0.4 are statistically equal |
| 0.1 vs. 0.5 | 0.541 | −0.611 | −0.307 | 0.541 | 0.1 and 0.5 are statistically equal |
| 0.1 vs. 0.6 | 2.1 | −1.552 | −0.344 | 2.1 | 0.1 is better than 0.6 except for recall |
| 0.1 vs. 0.7 | 2.52 | −1.638 | −0.086 | 2.52 | 0.1 is better than 0.7 except for recall |
| 0.1 vs. 0.8 | 2.606 | −1.391 | −0.027 | 2.606 | 0.1 is better than 0.8 except for recall |
| 0.1 vs. 0.9 | 1.999 | −1.098 | 0.011 | 1.999 | 0.1 is better than 0.9 in precision and macro precision |
| 0.1 vs. 1 | 1.77 | −2.421 | −0.834 | 1.77 | 0.1 is better than 1 except for recall |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

MDPI and ACS Style

Bagui, S.; Mink, D.; Bagui, S.; Subramaniam, S.; Wallace, D. Resampling Imbalanced Network Intrusion Datasets to Identify Rare Attacks. Future Internet 2023, 15, 130. https://doi.org/10.3390/fi15040130
