Feature Selection for Improving ANN and CNN Models for Attack Detection in Zeek Network Data

Bagui, Sikha S.; Elbatouty, Mohamed; Mink, Dustin; Bagui, Subhash C.

doi:10.3390/fi18070333


¹ Department of Computer Science, University of West Florida, Pensacola, FL 32514, USA

² Department of Cybersecurity, University of West Florida, Pensacola, FL 32514, USA

³ Department of Mathematics and Statistics, University of West Florida, Pensacola, FL 32514, USA

* Author to whom correspondence should be addressed.

Future Internet 2026, 18(7), 333; https://doi.org/10.3390/fi18070333 (registering DOI)

Submission received: 18 May 2026 / Revised: 15 June 2026 / Accepted: 22 June 2026 / Published: 24 June 2026

(This article belongs to the Special Issue State-of-the-Art Future Internet Technology in USA 2026–2027)


Abstract

In the past few years, cyber-attacks have risen at an exponential rate across all sectors, and both private and public institutions have faced increasingly sophisticated threats. As this upward trend continues, advanced and efficient threat detection systems are essential. This paper investigates the use of feature importance (FI) coefficients to improve Artificial Neural Network (ANN) and Convolutional Neural Network (CNN) models, leveraging feature selection to enhance model interpretability and optimize performance. By systematically filtering out the weaker features, we examine the impact of the reduced feature set on model accuracy, precision, recall, and F1 score. Experiments were conducted on two new datasets, UWF-ZeekDataSum2025-1 and UWF-ZeekDataSum2025-2, using a baseline ANN/CNN architecture and multiple architectural variants. The results on UWF-ZeekDataSum2025-1 show a clear performance gain for certain feature importance thresholds, with models such as ANN-Minimal, ANN-Overfit-Wide, ANN-Shallow-Low-Optimization, CNN-Shallow, and CNN-Very-Shallow outperforming the baseline after reducing the feature space from seventeen features to fewer than four. For UWF-ZeekDataSum2025-2, improvements occur across a broader range of thresholds, with models including ANN-Deep-Sub-Conv, ANN-Shallow-Low-Opt, CNN-Shallow, CNN-Very-Shallow, and ANN-Minimal exceeding 95% performance around the 0.25–0.28 thresholds, with additional gains at 0.31–0.32 for some architectures. These findings demonstrate that strategically leveraging feature importance coefficient thresholds can significantly enhance neural network intrusion detection systems, offering a reproducible pathway for adapting these methods to similar environments.

Keywords: cybersecurity; artificial neural networks; convolutional neural networks; feature importance; PySpark; intrusion detection; Zeek data; binary classification

1. Introduction

As technology has advanced, so has the dependence on storing information digitally. Passcodes, emails, bank accounts, and websites used on a day-to-day basis store information for quick retrieval. In 2023, 900 million malware executables existed, compared to 50 million in 2010 [1]. Cybercrime is a global epidemic that costs users around half a trillion dollars annually. It is important for governments, companies, and individuals to learn and understand how to protect themselves from threats. The first step is the basics: firewalls, antiviruses, and intrusion detection [2]. These may help defend against the most elementary attacks; however, the need to adapt to new cyber-attacks is constant. In this new age of technology, machine learning (ML) and Artificial Intelligence (AI) can play an important role in keeping sensitive data secure. Although threats over the internet may each contain some unique elements, varying attacks tend to share common, reusable elements. AI/ML techniques can exploit these shared patterns to support decision-making and threat identification in cybersecurity.

This paper investigates the use of feature importance (FI) coefficients to improve Artificial Neural Network (ANN) and Convolutional Neural Network (CNN) models by leveraging feature selection not only to enhance model interpretability but also to optimize performance. By systematically filtering out weaker features, we examine how reducing the feature set impacts model accuracy, precision, recall, and the F1 score. Experiments were conducted on two new datasets, UWF-ZeekDataSum2025-1 and UWF-ZeekDataSum2025-2, available at [3], using a baseline ANN/CNN architecture and multiple architectural variants.

Although weight-based feature importance has been investigated in prior neural network research, the primary contribution of this work is not the introduction of a new feature importance metric. Rather, this study presents a scalable feature selection framework that leverages Multilayer Perceptron Classifier (MLPC)-derived feature importance (FI) coefficients within the PySpark MLlib ecosystem and applies the framework to the problem of MITRE ATT&CK tactic classification. Existing weight-based feature selection approaches are typically evaluated on traditional benchmark datasets and implemented in standalone machine learning environments. In contrast, this work demonstrates how FI-based feature selection can be integrated into a distributed big data processing framework, enabling efficient analysis of large-scale cybersecurity datasets.

Furthermore, this study demonstrates that the selected subsets can effectively be transferred across multiple deep learning architectures. Rather than evaluating feature selection solely within the MLPC model used to derive the FI coefficients, the reduced feature sets are used to train and evaluate both Artificial Neural Network (ANN) and Convolutional Neural Network (CNN) classifiers. This provides evidence that the extracted features capture meaningful characteristics of the underlying data and are not merely artifacts of a single model architecture. The experimental results show that FI-based feature reduction can maintain or improve classification performance while simultaneously decreasing the number of input features, reducing training time, and lowering computational overhead.

Therefore, the novelty of this work lies in the development of a scalable, MLPC-driven feature selection framework for cybersecurity analytics, its integration within the PySpark MLlib environment, and its empirical validation across multiple deep learning architectures for MITRE ATT&CK tactic classification.

The rest of this paper is organized as follows. Section 2 presents the background needed to understand the work in this paper, that is, ANNs, CNNs, and feature selection; Section 3 presents the related works; Section 4 presents the data; Section 5 presents the model specifics; Section 6 presents the methodology; Section 7 presents the results; Section 8 presents the conclusion; and Section 9 presents future work.

2. Background

2.1. Artificial Neural Networks

Artificial Neural Networks (ANNs) are adaptive systems inspired by the structure and function of the human brain. ANNs consist of interconnected nodes that process and transmit information. The adaptive quality comes from the model’s ability to adjust internal weights and biases in response to data, enabling learning over time. Due to their ability to derive internal “rules” that guide mapping from inputs to outputs, ANNs are particularly effective for solving problems involving complex, nonlinear relationships [4]. At their core, ANNs are organized into layers that process and transform data. The process begins with the input layer, which receives the raw features of a dataset. Each neuron corresponds to a single input variable. This data is then passed to one or more hidden layers, which serve as the network’s computational engine. Within these hidden layers, nonlinear transformations are applied, allowing the network to learn complex and abstract representations of the input data [5]. The transformed information is subsequently passed to the output layer, which produces the final prediction or classification. Communication between neurons across all layers is governed by weights and biases. Weights quantify the strength and direction of influence between connected nodes, while biases act as offsets that allow neurons to activate even when input values are zero. Inside each neuron, the weighted sum of inputs and the bias are passed through an activation function to introduce nonlinearity, enabling the network to model complex, real-world relationships [6]. Figure 1 presents the architecture of an ANN.

The knowledge within an ANN is stored entirely in its weights, which are learned during training. The goal of training is to adjust these weights to minimize the discrepancy between the network’s predicted output and the actual value, a difference quantified by the loss function. This adjustment is achieved through a multi-step process. First, forward propagation generates a prediction that the loss function uses to calculate the error. Then, the backpropagation algorithm calculates the gradient of the loss with respect to every weight in the network, which determines how much each weight contributes to the error. An optimizer then uses these gradients to update the weights, moving them in a direction that systematically reduces the loss. This iterative optimization cycle continues over the desired number of epochs until performance converges. The result should be a model capable of making accurate, generalized predictions.
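The training cycle described above can be illustrated with a minimal sketch of a single sigmoid neuron trained by plain gradient descent. This is not the MLPC used in this paper; the toy data, learning rate, and epoch count are assumptions chosen only to make the forward pass, loss, gradient, and update steps visible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, labels, lr=0.5, epochs=200):
    """Train one sigmoid neuron with plain gradient descent on log loss."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    losses = []
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        total_loss = 0.0
        for x, y in zip(samples, labels):
            # Forward propagation: weighted sum plus bias, then activation.
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            # Loss: binary cross-entropy for this sample.
            total_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            # Backpropagation: for sigmoid + cross-entropy, dLoss/dz = p - y.
            err = p - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        # Optimizer step: move each parameter against its average gradient.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        losses.append(total_loss / n)
    return weights, bias, losses

# Toy data (assumed): the label is 1 only when the first feature is large.
X = [[0.0, 1.0], [0.2, 0.0], [0.9, 1.0], [1.0, 0.0]]
y = [0, 0, 1, 1]
weights, bias, losses = train_neuron(X, y)
```

The recorded `losses` list plays the role of a training-loss curve: it falls quickly at first and then flattens as the updates become small.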

2.2. Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are architecturally composed of a sequence of specialized layers designed to exploit spatial locality and hierarchical feature learning [7]. At their core, CNNs are convolutional layers. These layers apply multiple learnable kernels (filters) that convolve across the input feature maps using defined strides and padding schemes [8]. Each kernel performs a dot product over localized receptive fields, producing activation maps that highlight the presence of learned spatial patterns such as edges, corners, and textures [7]. The use of shared weights across spatial locations significantly reduces the parameter count compared to fully connected layers, while preserving translational equivariance [9]. Padding strategies (e.g., same or valid padding) control spatial dimensionality, and stride selection governs the granularity of feature extraction, directly influencing both computational complexity and representational resolution [8]. These architectural design choices collectively allow CNNs to efficiently model spatial correlations and encode higher-level abstractions when depth increases [10]. Figure 2 presents a CNN architecture.

Following convolutional operations, nonlinear activation functions are applied to introduce nonlinearity and enable the network to approximate complex decision boundaries [9]. Pooling layers, such as max pooling or average pooling, are typically interleaved between convolutional blocks to perform spatial downsampling. This is used to reduce dimensionality and enhance translation invariance by retaining the most salient features within local neighborhoods [8]. Batch normalization layers are frequently integrated to stabilize gradient flow, accelerate convergence, and mitigate internal covariate shift during training [10]. Together, these components form convolutional blocks that serve as the fundamental building units of CNN architectures, systematically transforming raw input data into compact, high-level feature representations suitable for downstream classification or regression tasks [7].
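As a minimal sketch of the convolution, activation, and pooling steps described above, the following one-dimensional example slides a single kernel over a signal, applies ReLU, and then max-pools. The signal, kernel, and function names are illustrative assumptions; real CNN layers learn many kernels and usually operate on two-dimensional inputs.

```python
def conv1d(signal, kernel, stride=1, padding=0):
    """Slide one kernel over a 1-D signal (cross-correlation, no kernel flip)."""
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    k = len(kernel)
    return [
        sum(padded[start + j] * kernel[j] for j in range(k))
        for start in range(0, len(padded) - k + 1, stride)
    ]

def relu(values):
    """Nonlinear activation: negative responses are set to zero."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size=2):
    """Keep the largest activation in each non-overlapping window."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

signal = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
edge_kernel = [-1.0, 1.0]  # responds to a rising edge (0 -> 1)

feature_map = conv1d(signal, edge_kernel)  # "valid" padding, stride 1
activated = relu(feature_map)              # the falling-edge response (-1) is zeroed
pooled = max_pool1d(activated, size=2)     # downsampled summary of the activations
```

The same two kernel weights are reused at every position, which is the weight sharing that keeps the parameter count low, and the pooled output still records that a rising edge occurred even though its exact position is coarsened.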

CNNs often transition from convolutional feature extraction to fully connected (dense) layers, which integrate the learned spatial features into a global representation for final prediction [8]. Modern CNN designs extend this basic pipeline through advanced architectural motifs such as residual connections, which enable gradient propagation across very deep networks [9]. Inception-style modules parallelize convolutions of varying kernel sizes to capture multi-scale features within the same layer [10]. Additionally, global average pooling layers are increasingly used in place of large fully connected layers to reduce overfitting and parameter overhead while preserving effective spatial summarization [8]. These architectural innovations collectively enhance scalability, training stability, and representational richness, allowing CNNs to thrive in high-dimensional perceptual learning tasks and contemporary deep learning (DL) systems [9].

2.3. Feature Importance

Feature importance is a crucial concept in ML, aimed at quantifying how much an input feature contributes to a model’s prediction or overall performance. In complex, nonlinear models like neural networks (ANNs and CNNs) used in PySpark applications, determining feature importance helps transform a “black box” into a transparent system [11]. This transparency is vital, particularly in critical domains like cybersecurity where justifying model behavior, such as why certain network characteristics are flagged as attack indicators, is essential for trust, auditability, and domain knowledge discovery. Modern frameworks like SHapley Additive exPlanations (SHAP) [12] provide a unified, rigorous approach derived from cooperative game theory to assign credit accurately and fairly to individual features, making SHAP the gold standard in quantitative model interpretation.

The need for deep learning interpretability has driven research into various model-specific and model-agnostic techniques. Model-agnostic approaches, such as the generalized permutation FI method popularized by Fisher, Rudin, and Dominici [13], work by measuring the decrease in prediction accuracy when a feature’s values are randomly shuffled. This tests how strongly the model relies on that feature, regardless of the underlying model structure. For neural networks specifically, powerful model-specific techniques have emerged that analyze the propagation of the prediction signal [14]. Further developed by Montavon [15], this line of work provides a way to backpropagate the final output relevance signal through the network layers, attributing a score to each input feature that reflects its contribution to the final prediction. These sophisticated methods represent the cutting edge of feature importance in deep learning.
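The permutation approach can be sketched in a few lines. The example below is a simplified illustration rather than the exact procedure of [13]: the fixed rule standing in for a trained model, the toy rows, and the number of repeats are all assumptions.

```python
import random

def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for col in range(len(rows[0])):
        total = 0.0
        for _ in range(n_repeats):
            # Shuffle one column while leaving the other columns untouched.
            shuffled_col = [r[col] for r in rows]
            rng.shuffle(shuffled_col)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled_col)]
            total += baseline - accuracy(model, permuted, labels)
        drops.append(total / n_repeats)
    return drops

# Hypothetical stand-in for a trained model: it uses feature 0 and ignores feature 1.
model = lambda row: int(row[0] > 0.5)
rows = [[0.1, 5.0], [0.9, 1.0], [0.2, 3.0], [0.8, 4.0], [0.3, 2.0], [0.7, 6.0]]
labels = [0, 1, 0, 1, 0, 1]
importances = permutation_importance(model, rows, labels)
```

Shuffling the ignored feature leaves accuracy unchanged, so its importance is zero, while shuffling the feature the rule depends on lowers accuracy and yields a positive score.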

Our Application of Feature Importance

Feature importance (FI) coefficients are derived from a trained Multilayer Perceptron Classifier (MLPC) through analyzing the connection weights between the input layer and first hidden layer. After model training, the MLPC produces a vector containing all learned parameters. From this vector, the weights associated with the connections between the input features and the neurons in the first hidden layer are extracted and reshaped into a weight matrix (W), where each row corresponds to an input feature and each column corresponds to a hidden-layer neuron.

The underlying assumption is that the first hidden layer directly receives information from the original input features. Therefore, the magnitude of the weights connecting an input feature to the hidden neurons reflects the influence of that feature on the network’s learned representation. To quantify this influence, an FI coefficient is computed for each input feature by aggregating the absolute values of all outgoing weights from that feature to the neurons in the first hidden layer. Specifically, the FI coefficient for feature (i) is calculated as follows:

S_{i} = \frac{1}{|H_{1}|} \sum_{j \in H_{1}} |W_{i,j}| \qquad (1)

where W_{i,j} denotes the weight connecting input feature i to hidden neuron j, and |H_{1}| is the total number of neurons in the first hidden layer. Taking the absolute value ensures that both positive and negative contributions are treated as indicators of feature influence, while averaging across all hidden neurons provides a single FI coefficient for each feature.

Once the FI coefficients are computed for all input features, they are ranked in descending order. A predefined FI threshold is then applied, and only features with coefficients greater than or equal to the threshold are retained for subsequent model training and evaluation. Features with FI coefficients below the threshold are removed from the dataset. This procedure provides a computationally efficient, model-specific feature selection mechanism that can be executed directly within the PySpark MLlib framework.
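A minimal sketch of the FI computation and threshold filter is given below in plain Python. The memory layout of the flat weight vector is an assumption made for illustration and should be checked against the MLlib version in use; the feature names and weight values are hypothetical.

```python
def fi_coefficients(flat_weights, n_inputs, n_hidden):
    """Equation (1): mean absolute input-to-first-hidden-layer weight per feature.

    Assumption: the first n_inputs * n_hidden entries of flat_weights hold the
    first-layer weights, stored so that the n_hidden weights leaving one input
    feature are contiguous (row i of W). Any remaining entries are ignored.
    """
    scores = []
    for i in range(n_inputs):
        row = flat_weights[i * n_hidden:(i + 1) * n_hidden]
        scores.append(sum(abs(w) for w in row) / n_hidden)
    return scores

def select_features(names, scores, threshold):
    """Rank by FI coefficient (descending) and keep those >= threshold."""
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]

# Hypothetical example: 3 input features, 2 hidden neurons, then 2 bias terms.
names = ["duration", "orig_bytes", "resp_pkts"]
flat = [0.5, -0.25, 0.125, 0.125, -0.75, 0.25, 0.0, 0.0]
scores = fi_coefficients(flat, n_inputs=3, n_hidden=2)  # [0.375, 0.125, 0.5]
kept = select_features(names, scores, threshold=0.25)   # ['resp_pkts', 'duration']
```

Because the absolute value is taken before averaging, a feature with large negative weights ranks as highly as one with large positive weights.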

3. Related Works

There is a multitude of models for cyber-attack detection. Studies have applied Random Forest, Support Vector Machines (SVMs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs) in the field, but basic testing on accuracy, precision, recall, and F1 score still leaves much inconclusive. In [16], SVM scored highest overall, but the study says little about the features used and does not report a confusion matrix.

ML algorithms have also been used to detect IoT network attacks on the publicly available DARPA 98, KDD99, UNSW-NB15, ISCX, CICIDS2017, and N-BaIoT datasets [17]. The experiments were run using ML classifiers such as KNN and Random Forest (RF), with an emphasis throughout on feature extraction, data preprocessing, data splitting, feature selection, and implementation of the ML algorithms. The results showed that RF performed very well, with the drawback that it took much more time than other models due to its decision-making architecture.

An intelligent approach for anomaly intrusion detection has also been proposed in [18], combining SVMs, Decision Trees (DTs), and Simulated Annealing (SA). In this study, SVMs are integrated with SA to identify the most effective candidate features for intrusion detection, while DTs paired with SA are leveraged to generate decision rules for new attack types and enhance overall classification performance. The SA method is further employed to automatically optimize parameters for both SVMs and DTs, with the model evaluated using ten-fold cross-validation to ensure robustness. The proposed method demonstrates an impressive 99.96% accuracy with a minimal set of selected features, outperforming other state-of-the-art approaches [18].

A recent study explored the use of multiple classifiers [19], including SVMs with an RBF kernel, RF, DTs, and MLPC, to distinguish normal traffic from malicious intrusions. The experimental design began with data preprocessing, which involved removing redundant features, addressing missing values, and applying normalization and standardization. Following this, the dataset is divided into training and validation sets, with each classifier trained and tested independently. To further validate robustness, the models are evaluated across three distinct feature subsets derived from the NSL-KDD benchmark dataset. The findings reveal that the SVM classifier consistently achieved an average accuracy of 98% across all subsets, outperforming both DT and MLP classifiers [19].

A group funded by the French Government used feature selection and feature extraction methods to improve the performance of ML algorithms for patient classification [20]. The paper showed that, independently of the feature extraction technique applied, feature selection is a necessary step to improve performance, mainly when supervised methods are used. That paper lays the groundwork for the present work; hence, by correctly applying feature extraction, we could see better results.

Feature selection techniques enhanced by deep learning (DL) and reinforcement learning (RL) have also provided superior performance. In [21], neural networks (NNs) were used for feature ranking to identify the most relevant features, producing a noticeable increase in model accuracy with reduced dimensionality. This real-time feature selection process is especially advantageous in dynamic or evolving datasets, where feature importance may shift over time. AI-driven methods demonstrate clear advantages, but also come with some drawbacks. The most commonly observed limitation is the increased computational cost associated with the training and real-time adaptation required. The time required for training models with AI-based preprocessing and feature selection was longer compared to traditional methods. These methods also require large amounts of data to achieve optimal results, and in cases where data was scarce or noisy, the performance of AI models lagged.

A group in Mississippi reviewed explainable AI (XAI) for IDS. There were three main takeaways from their research. They stated the need to define explainability for IDS, the need to create explanations tailored to various stakeholders, and the need to design metrics to evaluate explanations [22]. They found this through two approaches; the first was to make the model inherently interpretable. The second approach required post hoc techniques such as Local Interpretable Model-Agnostic Explanations (LIME) and SHAP. While the former approach provided a more detailed explanation to assist decision-making, its prediction performance is in general outperformed by the latter. It was determined that the field of IDS requires a high degree of precision to prevent attacks and avoid false positives [22].

More recently, a Network Intrusion Detection System (NIDS) was designed for a cloud environment to mitigate evolving threats while also minimizing false positives. The algorithm combines the fundamental aspects of network intrusion detection with the attention mechanism intrinsic to the Transformer model, relating input features to diverse intrusion types and bolstering detection accuracy. The accuracy of the NIDS model was over 93%, which was comparable to that of a CNN-LSTM model [23].

4. Datasets

This work uses the Zeek Conn log [24] and MITRE ATT&CK framework [25] labeled datasets, UWF-ZeekDataSum2025-1 and UWF-ZeekDataSum2025-2, available in [3], generated using the Cyber Range [26] at the University of West Florida (UWF). Zeek is an efficient, widely used network monitoring tool that produces extensive logs to describe network activity [24,27]. The MITRE ATT&CK framework, presently composed of 14 tactics, is a knowledgebase that serves as a foundation for the development of modern threat models used in the private sector as well as government [25,27].

The distribution of attack tactics in UWF-ZeekDataSum2025-1 and UWF-ZeekDataSum2025-2 is presented in Table 1 and Table 2, respectively. Table 3 presents a brief description of the features of the attack tactics.

Tactic Descriptions

Reconnaissance: The primary goal of reconnaissance is to gather information for future operations and attacks. This information may include details about the organization, infrastructure, and staff, and can be used to assist in gaining initial access, prioritizing objectives, or guiding additional reconnaissance efforts [28].

Privilege Escalation: Privilege Escalation is used to gain higher-level permissions. This attack exploits system weaknesses, misconfigurations, and vulnerabilities. Elevated access can consist of a user account with admin access, local admin, or even user accounts with specific system access [29].

Defense Evasion: Defense Evasion is used to avoid detection throughout the attack. This includes disabling any security software or encrypting data and scripts. Defense evasion abuses what one might believe to be a trusted process to hide malicious software [30].

Discovery: Discovery is a technique used to gain knowledge about systems and networks. This allows attackers to observe the environment to best plan a course of action. One may think of discovery as the learning phase [31].

Lateral Movement: Lateral Movement allows one to enter and control remote systems on a network. To find the target, one may need to pivot and search through multiple systems. An attacker may install their own remote access or even use real credentials for the system to move around [32].

Initial Access: Initial Access uses various entry vectors to get that foot in the door of the network. Techniques consist of targeted spearphishing and abusing weaknesses on public-facing web servers. Accounts gained from initial access may allow for continuous access, only becoming limited when the password is changed [33].

5. The Models

Five different variants of the ANN model were tested: ANN-Deep-Sub-Conv, ANN-Minimal, ANN-Overfit-Wide, ANN-Shallow-Low-Optimization (ANN-Shallow-Low-Opt), and ANN-Wide-Sub-Conv. Four different variants of the CNN model were tested: CNN-Deep, CNN-Shallow, CNN-Very-Shallow, and CNN-Very-Wide/Deep. Table 4 presents details of the configurations used for the various ANN and CNN variants.

Due to the limitations of PySpark’s MLlib, which lacks complex DL models, both architectures are realized using its DL estimator, the Multilayer Perceptron Classifier (MLPC). The main distinction between the ANN and CNN groups is the configuration of the layer dimensions, rather than the presence of convolutional or recurrent mathematical operations. The ANN configurations serve as a feed-forward baseline, while the CNN configurations structurally simulate the final dense layers found in complex DL models.

The ANN configurations establish a spectrum of standard feed-forward network complexity. This group includes very elementary architectures, such as the “Minimal (Standard)” model with a single, narrow hidden layer, suitable for identifying linear or simple nonlinear feature interactions efficiently. It also contains deeper and wider baseline models, such as Overfit-Wide and Deep-Sub-Conv, which test the impact of added capacity. For optimization, the L-BFGS solver is employed for smaller and simpler model configurations due to its fast convergence and robustness in well-conditioned problem spaces. In contrast, Gradient Descent with conservative step sizes is applied to deeper or wider architectures to ensure stable learning dynamics in more complex feature environments. This combination of solver strategies enables a controlled comparison between efficient optimization for lower-capacity models and cautious exploration of higher-capacity networks.
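As a rough sketch of how such variants might be parameterized, the helper below assembles keyword arguments named after the `layers`, `solver`, `maxIter`, and `stepSize` parameters of PySpark's `MultilayerPerceptronClassifier`. The helper itself, the variant names, and the hidden-layer sizes are hypothetical; the actual configurations are those listed in Table 4.

```python
def mlpc_params(n_features, hidden_layers, solver, max_iter=10, step_size=None, n_classes=2):
    """Build keyword arguments in the shape MultilayerPerceptronClassifier accepts."""
    if solver not in ("l-bfgs", "gd"):
        raise ValueError("solver must be 'l-bfgs' or 'gd'")
    params = {
        # Input layer, hidden layers, then one output node per class.
        "layers": [n_features, *hidden_layers, n_classes],
        "solver": solver,
        "maxIter": max_iter,
    }
    if step_size is not None:
        # Step size only matters for the gradient descent solver.
        params["stepSize"] = step_size
    return params

# Hypothetical hidden-layer sizes, for illustration only (see Table 4 for real values).
variants = {
    "minimal-like": mlpc_params(17, [8], solver="l-bfgs"),
    "deep-like": mlpc_params(17, [64, 32, 16], solver="gd", step_size=0.01),
}
```

With PySpark installed, a dictionary such as `variants["minimal-like"]` could be unpacked into the estimator's constructor; after feature selection, only the first entry of `layers` needs to change to match the reduced feature count.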

The CNN configurations are designed to match the high capacity required to process features extracted from convolutional or pooling layers in a traditional sequential task. This group exhibits a wider range of layer dimensions to test robustness across varying structural complexities. The CNN-Very-Shallow model, built with one narrow hidden layer, serves as a minimum-capacity counterpart to the simple ANN baselines. The CNN-Very-Wide/Deep configuration features four hidden layers and increased node counts, providing substantial capacity for complex nonlinear mapping.

Although the models were trained for only 10 iterations (15 for CNN-Very-Wide/Deep), all models had converged by the end of training. Figure 3 and Figure 4 present the training loss at each iteration per model; the flattening of each curve indicates convergence. Given the extreme size of the dataset, it was important to use computing resources efficiently. Additional iterations beyond these limits were tested and did not produce meaningful improvements in performance metrics, indicating that further training would have provided no additional benefit.

Performance Metrics

The following performance metrics were used to assess the results of the FI coefficients to best train the ML models: accuracy, precision, recall, and F1 score.

Accuracy is an evaluation metric used in classification tasks due to its ability to provide a general indication of a model’s overall predictive capability. It measures the proportion of correctly classified instances, positive and negative, relative to the total number of predictions [34]. Accuracy is defined as follows:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)

Accuracy is useful for obtaining a quick overview of performance. However, its interpretability diminishes significantly in the presence of class imbalance. When one class dominates the dataset, a model may return high accuracy simply by predicting the majority class more frequently, meaning it will fail to meaningfully detect the minority class. Due to this, accuracy alone may obscure poor performance on the instances of greatest concern [35].

Precision (Positive Predictive Value) focuses on a model’s positive predictions. It specifically highlights how many instances classified as positive are labeled correctly, making it highly relevant in situations where false positives may negatively impact use of resources. Precision is calculated as follows:

\text{Precision} = \frac{TP}{TP + FP} \qquad (3)

A model with high precision produces few false alarms, which is important in cybersecurity threat detection. In these contexts, incorrectly flagging normal instances as positive may lead to unnecessary interventions. Precision provides insight into the discriminative power of a classifier and its ability to avoid incorrectly labeling negative samples as positive.

Recall (Sensitivity or True Positive Rate) evaluates a model’s ability to detect positive instances in a dataset. It measures the proportion of actual positives that are correctly identified, capturing the model’s detection effectiveness [36]. The formula for recall is as follows:

\text{Recall} = \frac{TP}{TP + FN} \qquad (4)

High recall gives us confidence that the model can identify the majority of positive cases. Recall complements precision by ensuring that positive events are not overlooked.

F1 score provides a unified measure of performance by combining precision and recall into a single value using the harmonic mean [37]. This makes the F1 score effective when precision and recall must be balanced, or when a classifier’s performance varies across classes. It is defined as follows:

F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (5)

Since the F1 score penalizes disparities between precision and recall, it is well-suited for evaluating models operating on imbalanced datasets. Instead of relying on accuracy, which may fail to reflect minority-class performance, the F1 score provides a robust assessment of a model’s predictive performance.
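The four metrics in Equations (2) to (5) can be computed directly from confusion-matrix counts. The sketch below uses invented counts for an imbalanced test set to show why accuracy alone can mislead; returning 0.0 when a denominator is zero is a convention assumed here, not one stated in this paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical imbalanced test set: 990 normal flows and 10 attacks.
# A classifier that labels everything "normal" misses every attack.
always_normal = classification_metrics(tp=0, tn=990, fp=0, fn=10)
# A classifier that finds 8 of the 10 attacks at the cost of 12 false alarms.
detector = classification_metrics(tp=8, tn=978, fp=12, fn=2)
```

Here the always-normal classifier reaches 99% accuracy with an F1 score of zero, while the detector has slightly lower accuracy (98.6%) but a recall of 0.8 and an F1 score of about 0.53, which better reflects its usefulness for attack detection.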

6. Methodology

The experimental methodology employs a structured two-step process to evaluate the impact of FI coefficient-based feature selection (FS) on the performance of the ANN and CNN in a binary classification task.

6.1. Flowchart and Explanation

The methodology is structured as an iterative, two-step feature selection (FS) process for network intrusion detection. The pipeline, as presented in Figure 5, begins by loading the raw Zeek data and performing the necessary preprocessing: mean imputation for numerical values, handling Boolean flags, converting categorical features to indices using StringIndexer, and creating the binary Normal/Attack label by mapping all non-‘none’ tactics to the Attack class. This clean, preprocessed data is then used to assemble the full feature vector. The core experimentation is controlled by a loop that iterates over a set of FI coefficient thresholds to systematically test different feature subsets.

Within each iteration, a model is first trained on the full feature set (Step 1: Train FS Model) to determine feature relevance. The relevance is calculated through the FI coefficients described in Section 2.3. Features are then selected only if their FI coefficient meets or exceeds the current threshold, resulting in the reduced input vector. A new, independent model with an adjusted input layer size is then trained on the reduced feature set in Step 2 (Train Final Model). This is followed by standard evaluation using accuracy, F1 score, precision, and recall. This entire two-step process repeats until all predefined threshold values have been tested, culminating in a single summary file that combines all threshold results for comprehensive analysis.
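The control flow of this loop can be sketched as follows. The two training callables are placeholders for the real PySpark training code, the scores are invented, and the `>=` comparison follows the selection rule given in Section 2.3; skipping a threshold that leaves no features is an assumption of this sketch.

```python
def threshold_sweep(feature_names, thresholds, train_fs_model, train_final_model):
    """Two-step loop: score features on the full set, then retrain on the kept subset.

    train_fs_model(features) -> {feature_name: fi_score}
    train_final_model(features) -> {metric_name: value}
    Both callables are placeholders for the real training code.
    """
    summary = []
    for threshold in thresholds:
        # Step 1: train on all features and read off their importance scores.
        scores = train_fs_model(feature_names)
        kept = [f for f in feature_names if scores[f] >= threshold]
        if not kept:
            # Nothing survives this threshold, so there is no model to train.
            summary.append({"threshold": threshold, "n_features": 0})
            continue
        # Step 2: train a fresh model whose input layer matches len(kept).
        metrics = train_final_model(kept)
        summary.append({"threshold": threshold, "n_features": len(kept), **metrics})
    return summary

# Stand-in callables with made-up scores, used only to exercise the loop.
fake_scores = {"feat_a": 0.40, "feat_b": 0.26, "feat_c": 0.10}
results = threshold_sweep(
    feature_names=list(fake_scores),
    thresholds=[0.05, 0.25, 0.50],
    train_fs_model=lambda feats: fake_scores,
    train_final_model=lambda feats: {"evaluated": True},
)
```

Each entry of `results` corresponds to one row of the combined summary: the threshold, the number of surviving features, and whatever metrics the final model reports.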

6.2. Preprocessing

Initial data preprocessing is crucial for preparing the raw network data for ML. The process begins with dropping identifying or irrelevant columns such as connection timestamps, IDs, and IP addresses. Missing numerical values are handled by imputation, specifically by filling them with the calculated mean of their respective columns. Boolean columns are first handled by imputing False and then casting the column data type to DoubleType for numerical processing. Categorical features (service, protocol, conn_state, history) are converted into numerical indices using the PySpark StringIndexer, resulting in the initial set of features. No class weight adjustments, random undersampling, random oversampling, synthetic oversampling techniques (e.g., SMOTE), or other balancing interventions were applied. This design choice was intentional to ensure that the observed performance variations were attributable to the proposed FI-based feature selection framework and neural network architectures rather than to external balancing mechanisms.
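The three column-level transformations described above can be sketched in plain Python. This is an illustrative stand-in for the PySpark operations, not the pipeline itself: the helper names are hypothetical, the sample values are invented, and ordering categories by descending frequency (ties broken alphabetically) is an assumption meant to mirror the default behavior of StringIndexer.

```python
from collections import Counter


def impute_mean(values):
    """Replace missing (None) entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)  # assumes at least one observed value
    return [mean if v is None else float(v) for v in values]


def boolean_to_double(values):
    """Treat missing flags as False, then cast every flag to 0.0 or 1.0."""
    return [1.0 if v else 0.0 for v in values]


def string_index(values):
    """Map each category to a numeric index, most frequent category first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    mapping = {category: float(i) for i, category in enumerate(ordered)}
    return [mapping[v] for v in values]


duration = impute_mean([1.0, None, 3.0])
local_orig = boolean_to_double([True, None, False])
proto = string_index(["tcp", "udp", "tcp"])
print(duration, local_orig, proto)
```

Note that the numeric indices produced for categorical features carry no ordinal meaning; they only give the network a numeric input.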

6.3. Training and Testing

The core of the experiment relies on a two-step training paradigm to explicitly separate the feature selection process from the final evaluation. In both steps, the dataset is split using an 80/20 ratio for training and testing with a fixed seed (42) to ensure reproducibility. Step 1 (feature selection round) involves training the models on the full, scaled feature vector. The resulting trained model is then queried to extract feature importance scores (FI coefficients or weights). Step 2 (final evaluation) involves re-assembling and re-scaling the data using only the features that passed the FI coefficient threshold (lambda). A new instance of the model, now configured with a smaller input layer corresponding to the number of selected features, is trained on this filtered 80% training set and evaluated on the filtered 20% test set. All models are trained with a maximum of 10 iterations (max_iter = 10) except CNN Wide/Deep at 15 iterations (max_iter = 15) to maintain consistent training effort across configurations.
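The role of the fixed seed can be shown with a small sketch. It is written in plain Python rather than PySpark so that it runs anywhere; in PySpark, `DataFrame.randomSplit([0.8, 0.2], seed=42)` would be the natural counterpart, though whether the authors used that call is an assumption. The function name `split_80_20` is hypothetical.

```python
import random


def split_80_20(rows, seed=42):
    """Shuffle with a fixed seed, then cut into 80% training and 20% test rows."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 8 // 10  # integer arithmetic avoids rounding surprises
    return shuffled[:cut], shuffled[cut:]


# Two calls with the same seed yield exactly the same partition.
train_a, test_a = split_80_20(range(100))
train_b, test_b = split_80_20(range(100))
print(len(train_a), len(test_a), train_a == train_b)
```

Because the partition depends only on the seed and the row order, Step 1 and Step 2 can be evaluated on the same held-out rows even though their feature vectors differ.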

6.4. Binary Classification

Although the original dataset contains multi-class labels related to various MITRE ATT&CK tactics, the pipeline performs binary classification. This design choice simplifies the task to one of anomaly detection. All instances are mapped to either class index 0.0 (Non-Attack) or 1.0 (Attack). The MLPC model is therefore configured with an output layer size of 2, aligning with this binary label structure. Performance metrics are computed specifically for this binary separation, focusing on the model’s ability to discriminate between the “normal” and “malicious” traffic categories.
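As an illustration of this mapping, the sketch below collapses the tactic counts reported in Table 1 for UWF-ZeekDataSum25-1 into the two classes. The helper names are hypothetical, and treating both "none" and "Benign" as the normal class is an assumption made here to reconcile the label used in the pipeline description with the label used in Table 1.

```python
# Tactic counts as reported in Table 1 (UWF-ZeekDataSum25-1).
TACTIC_COUNTS = {
    "Reconnaissance": 568260,
    "Privilege Escalation": 3532,
    "Defense Evasion": 3532,
    "Discovery": 1099,
    "Lateral Movement": 326,
    "Initial Access": 1,
    "Benign": 576759,
}

# Assumption: these spellings denote non-attack traffic.
NORMAL_LABELS = {"none", "benign"}


def to_binary_label(tactic):
    """Return 0.0 for normal traffic and 1.0 for any attack tactic."""
    return 0.0 if tactic.strip().lower() in NORMAL_LABELS else 1.0


def binary_counts(tactic_counts):
    """Sum the per-tactic counts into the two binary classes."""
    totals = {0.0: 0, 1.0: 0}
    for tactic, count in tactic_counts.items():
        totals[to_binary_label(tactic)] += count
    return totals


counts = binary_counts(TACTIC_COUNTS)
print(counts)  # 576,759 normal rows and 576,750 attack rows
```

Under this mapping the two classes in Table 1 are almost equal in size, even though the individual attack tactics are heavily skewed toward Reconnaissance.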

6.5. Hardware and Software Used

The experiment was run in a virtualized environment hosted on a Virtual Machine (VM). The VM was configured to provide consistent resources for the duration of the experiment. The software stack for the analysis was Python version 3.14.6, with PySpark serving as the framework. PySpark was primarily used for data manipulation and for implementing machine learning models such as MLPC. Standard packages such as Pandas 3.0 and Matplotlib 3.11.0 were used for tabulation and visualization. The Spark clusters were configured to interface with a dedicated Hadoop Distributed File System [38], which stored the raw Zeek network traffic datasets.

The hardware configuration of the VM allocated 4 CPUs with 8 GB of RAM. The VM ran on hardware version 21. This environment provides a controlled and reproducible infrastructure for executing the work.

7. Results

7.1. Calculating Feature Importance

Table 5 and Table 6 present the feature importance score of each feature, calculated as per Equation (1) for each respective ANN and CNN model, using UWF-ZeekDataSum2025-1 and UWF-ZeekDataSum2025-2, respectively. In each case, feature importance is calculated using the full dataset.

7.2. Determining the Features to Be Used in Training Using Coefficient Thresholds

Table 7 and Table 8 present the number of features used to train each respective ANN and CNN model at each FI coefficient threshold. That is, in Table 7, the entry for an FI coefficient threshold is the number of features with an FI score (from Table 5) greater than the threshold that were used to train each respective ANN and CNN model using UWF-ZeekDataSum2025-1. For all the models, a feature count of seventeen means that all the features were used. If the number of features increased compared to previous thresholds and went back to seventeen, this means that no feature exceeded the current threshold, so all features were used and the results are the same as the baseline. The training is done using 80% of the data.
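The way such feature counts arise, including the return to the full feature set when no score exceeds the threshold, can be sketched as follows. The function name `select_features` and the four toy scores are hypothetical; the real tables are built from seventeen features and the FI scores in Table 5 and Table 6.

```python
def select_features(fi_scores, threshold):
    """Keep features scoring above the threshold; fall back to all if none do."""
    selected = [name for name, score in fi_scores.items() if score > threshold]
    return selected if selected else list(fi_scores)


# Invented scores for four features, used only to show the counting behavior.
toy_scores = {"duration": 0.36, "history": 0.34, "conn_state": 0.30, "proto": 0.20}
counts = {t: len(select_features(toy_scores, t)) for t in (0.25, 0.33, 0.35, 0.40)}
print(counts)  # the count shrinks as the threshold rises, then jumps back to 4
```

In this toy case the count falls from three to one as the threshold rises and then returns to four at 0.40, which is the same pattern as a table entry going back to seventeen.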

Likewise, in Table 8, the entry for an FI coefficient threshold is the number of features with an FI value (from Table 6) greater than the threshold that were used to train each respective ANN and CNN model using UWF-ZeekDataSum2025-2. The training is done using 80% of the data.

7.3. Testing

Testing was performed for each of the respective ANN and CNN models, at the various FI coefficient thresholds (presented in Table 7 and Table 8), using the number of features specified for each of the respective ANN and CNN models. In total, 20% of the data was used for testing. The evaluation metrics (accuracy, precision, recall, and F1 score) were recorded, and the results presented are an average of three runs.

7.3.1. Analysis of Accuracy Results

The results are presented by dataset, and finally, the two dataset results are compared.

UWF-ZeekDataSum2025-1

The accuracy of the various ANN and CNN models at different FI coefficient thresholds using UWF-ZeekDataSum2025-1 is presented in Table 9. The highest accuracies (higher than an accuracy of 95%) are bolded in Table 9. Graphically, the results are presented in Figure 6.

The accuracy results are summarized below and the best resulting features for each model are bolded in Table 10.

ANN-Minimal achieved the highest accuracy of 99.84% at the FI coefficient threshold of 0.35, with two features, duration and history.
ANN-Overfit-Wide achieved its highest accuracy of 99.83% at an FI coefficient threshold of 0.33, with one feature, history. Performance was poor at other thresholds using more features, hovering around 50% accuracy.
ANN-Shallow-Low-Opt had the highest performance of 99.84% at the FI coefficient thresholds of 0.33 and 0.34 and 99.83% at the FI coefficient threshold of 0.35, with one or two features, duration and history, respectively. Performance was poor at other thresholds, hovering around 50% accuracy.
CNN-Shallow had the highest accuracy of 99.83% at FI coefficient thresholds of 0.33, 0.34 and 0.35 and with one or two features, duration and history. Performance was poor at other thresholds, hovering around 50% accuracy.
CNN-Very-Shallow had the highest accuracy of 99.84% at an FI coefficient threshold of 0.35 and with two features, duration and history.
ANN-Deep-Sub-Conv, ANN-Wide-Sub-Conv, CNN-Deep, and CNN-Very-Wide/Deep performed poorly, hovering around an accuracy of 50%.

Overall, for UWF-ZeekDataSum25-1, as presented in Table 9, the best accuracy results for ANN-Minimal, ANN-Overfit-Wide, ANN-Shallow-Low-Opt, CNN-Shallow and CNN-Very-Shallow were mostly for FI coefficient thresholds between 0.33 and 0.35. ANN-Minimal, ANN-Shallow-Low-Opt and CNN-Very-Shallow had slightly higher accuracy at 99.84%. This is also reflected in Figure 6. From Table 10, it can be noted that the most important features for most of the models were duration and history.

UWF-ZeekDataSum2025-2

The accuracy of the various ANN and CNN models at different FI coefficient thresholds using UWF-ZeekDataSum2025-2 is presented in Table 11. The highest accuracies are bolded. Models with highest accuracies below 95% are not bolded. Graphically, the results are presented in Figure 7.

The accuracy results are summarized below and the best resulting features for each model are bolded in Table 12 as follows:

ANN-Minimal achieved the best accuracy of 99.73% at an FI coefficient threshold of 0.26 with 11 features, though the FI coefficient threshold of 0.28 was also very close in accuracy at 99.72% using seven features.
ANN-Overfit-Wide achieved its highest accuracy of 95.79% at an FI coefficient threshold of 0.29 using eight features.
ANN-Shallow-Low-Opt had the highest performance of 97.49% at the FI coefficient threshold of 0.27 with eight features.
CNN-Shallow achieved the highest accuracy of 97.50% at an FI coefficient threshold of 0.29 with five features.
CNN-Very-Shallow had the highest accuracy of 99.76% at an FI coefficient threshold of 0.26 with 11 features.
ANN-Wide-Sub-Conv, CNN-Deep, and CNN-Very-Wide/Deep performed poorly across most thresholds, regardless of the number of features used. Though ANN-Wide-Sub-Conv reached accuracies above 75% at some thresholds, the other two models hovered around 50% accuracy.

Overall, for UWF-ZeekDataSum2025-2, the best accuracy results for ANN-Deep-Sub-Conv, ANN-Minimal, ANN-Overfit-Wide, ANN-Shallow-Low-Opt, CNN-Shallow, and CNN-Very-Shallow occurred for FI coefficient thresholds between 0.26 and 0.29, but, as can also be seen from Figure 7, there were also some good results around thresholds 0.31 and 0.32. ANN-Minimal and CNN-Very-Shallow achieved the highest accuracies at 99.73% and 99.76%, respectively, both at FI coefficient thresholds of 0.26 with the same 11 features.

Comparing UWF-ZeekDataSum2025-1 and UWF-ZeekDataSum2025-2 in terms of accuracy, UWF-ZeekDataSum2025-1 generally achieved a higher accuracy with fewer features, while UWF-ZeekDataSum2025-2 generally needed more features. The highest accuracy for UWF-ZeekDataSum2025-1 was slightly higher. Also, though only five models performed well using UWF-ZeekDataSum2025-1, more models performed well with UWF-ZeekDataSum2025-2.

Table 13 presents the percent increase in accuracy for the various ANN and CNN models for each of the datasets. For every model, the baseline for the percent increase is the accuracy obtained using all the features; the table therefore shows the best accuracy achievable with a reduced number of features relative to that baseline. CNN-Deep and CNN-Very-Wide/Deep had no increase in accuracy in either of the two datasets with reduced features. ANN-Minimal, ANN-Overfit-Wide, ANN-Shallow-Low-Opt and CNN-Shallow had a very high increase, compared to the baseline, using a reduced number of features. ANN-Deep-Sub-Conv and ANN-Wide-Sub-Conv had no improvement with UWF-ZeekDataSum2025-1, but had 91% and 51% improvement in accuracy, respectively, with UWF-ZeekDataSum2025-2 using a reduced number of features.
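One way to compute such a percent increase is sketched below. The text does not state whether the increase is relative to the baseline value or measured in percentage points, so both are shown; treating the table as a relative increase is an assumption. The function names are hypothetical, and the baseline of 50% is only an illustrative value in line with the "around 50%" baseline behavior described above.

```python
def percent_increase(baseline, best):
    """Relative gain of `best` over `baseline`, expressed in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (best - baseline) / baseline * 100.0


def point_increase(baseline, best):
    """Absolute gain in percentage points."""
    return best - baseline


# Illustrative values: a 50% all-features baseline and a 99.84% best result.
rel = percent_increase(50.0, 99.84)
pts = point_increase(50.0, 99.84)
print(round(rel, 2), round(pts, 2))  # 99.68 relative, 49.84 points
```

The two conventions differ by roughly a factor of two when the baseline sits near 50%, so it matters which one a reader assumes when interpreting the reported gains.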

7.3.2. Analysis of Precision Results

Firstly, an analysis is done by dataset, and then the overall analysis is presented.

UWF-ZeekDataSum2025-1

The precision results of the various ANN and CNN models at different FI coefficient thresholds using UWF-ZeekDataSum2025-1 are presented in Table 14. Due to the emphasis on positive predictions, this metric will test our models to see if they can detect attacks efficiently. The highest precision for each model is bolded in Table 14. Models with highest accuracies below 95% are not bolded.

Graphically, the results are presented in Figure 8.

The precision results for UWF-ZeekDataSum2025-1, from Table 14, can be summarized as follows:

ANN-Minimal achieved a very high precision of 99.84% at the FI coefficient threshold of 0.35 using two features, duration and history.
ANN-Overfit-Wide achieved the highest precision of 99.83% at the FI coefficient threshold of 0.33, using just one feature, history.
ANN-Shallow-Low-Opt had the highest precision of 99.84% at the FI coefficient thresholds of 0.33 and 0.34 using two features, duration and history, and precision of 99.83% at 0.35, the neighboring higher threshold, using one feature.
CNN-Shallow had the highest precision of 99.83% at the FI coefficient thresholds of 0.33, 0.34 and 0.35 using two features. However, the FI coefficient thresholds of 0.33 and 0.34 actually achieved the same precision with one feature.
CNN-Very-Shallow had the highest precision of 99.84% at the FI coefficient threshold of 0.35 using two features.

Overall, for UWF-ZeekDataSum2025-1, the best precision results for ANN-Minimal, ANN-Overfit-Wide, ANN-Shallow-Low-Opt, CNN-Shallow and CNN-Very-Shallow occurred mostly between FI coefficient thresholds of 0.33 and 0.35, using one or two features. This is also reflected in Figure 8. Three models, ANN-Minimal, ANN-Shallow-Low-Opt and CNN-Very-Shallow, had the highest precision at 99.84%. As expected, these results are very similar to the accuracy results.

Table 15 presents the best features for the various ANN and CNN models for UWF-ZeekDataSum2025-1. From Table 15, it can be noted that the most important features for most of the models were duration and history.

UWF-ZeekDataSum2025-2

The precision of the various ANN and CNN models at different FI coefficient thresholds using UWF-ZeekDataSum2025-2 is presented in Table 16. The highest precision for each model is bolded, though models with highest accuracies below 95% are not bolded. Graphically, the results are presented in Figure 9.

The precision results for UWF-ZeekDataSum2025-2, from Table 16, can be summarized as follows:

ANN-Deep-Sub-Conv achieved a high precision of 96.14% at the FI coefficient threshold of 0.28 using seven features. Precision at the other FI coefficient thresholds was very low, hovering around 25%.
ANN-Minimal achieved the best precision of 99.73% at the FI coefficient threshold of 0.26 using 11 features. The FI coefficient threshold of 0.28 came very close with a precision of 99.72%, using seven features.
ANN-Overfit-Wide achieved the highest precision of 96.11% at the FI coefficient threshold of 0.29 using eight features.
ANN-Shallow-Low-Opt had the highest precision of 97.50% at the FI coefficient threshold of 0.27, also using eight features.
CNN-Shallow had the highest precision of 97.51% at the FI coefficient threshold of 0.29, using five features.
CNN-Very-Shallow had the highest precision of 99.76% at the FI coefficient threshold of 0.26, using 11 features.
CNN-Deep and CNN-Very-Wide/Deep performed poorly, hovering around a precision of 25%. ANN-Wide-Sub-Conv performed moderately, with the highest precision at 81.17% and a wide range of precision values across thresholds.

Overall, for UWF-ZeekDataSum2025-2, the best precision results for ANN-Deep-Sub-Conv, ANN-Minimal, ANN-Overfit-Wide, ANN-Shallow-Low-Opt, CNN-Shallow and CNN-Very-Shallow occurred for FI coefficient thresholds between 0.26 and 0.29. From Figure 9, we can also see some good results at a threshold of 0.31.

Table 17 presents the best features for the various ANN and CNN models for UWF-ZeekDataSum2025-2. From Table 17, it can also be noted that ANN-Minimal and CNN-Very-Shallow achieved the highest precision at 99.73% and 99.76%, respectively, both at FI coefficient thresholds of 0.26 with the same 11 features. As expected, these results are very close to the accuracy results.

Comparing UWF-ZeekDataSum2025-1 and UWF-ZeekDataSum2025-2 in terms of precision, UWF-ZeekDataSum2025-1 generally achieved a higher precision with fewer features, while UWF-ZeekDataSum2025-2 generally needed more features. The highest precision for UWF-ZeekDataSum2025-1 was slightly higher. Also, though only five models performed well using UWF-ZeekDataSum2025-1, more models performed well with UWF-ZeekDataSum2025-2. Again, these results are very similar to the accuracy results.

Table 18 presents the percent increase in precision for the various ANN and CNN models for each of the datasets. Overall, UWF-ZeekDataSum2025-1 had less of an increase than UWF-ZeekDataSum2025-2. ANN-Deep-Sub-Conv, ANN-Wide-Sub-Conv, CNN-Deep and CNN-Very-Wide/Deep had no increase at all for UWF-ZeekDataSum2025-1, and the latter two models had no increase in either of the two datasets.

7.3.3. Analysis of Recall Results

The recall results for the various ANN and CNN models at different FI coefficient thresholds for UWF-ZeekDataSum2025-1 and UWF-ZeekDataSum2025-2 are presented in Table 19 and Table 20 and in Figure 10 and Figure 11, respectively. The best recall results match the accuracy and precision results; hence, a detailed analysis is not presented.

UWF-ZeekDataSum2025-1

The recall results of the various ANN and CNN models at different FI coefficient thresholds using UWF-ZeekDataSum2025-1 are presented in Table 19. Due to the emphasis on positive predictions, this metric will test our models to see if they can detect attacks efficiently. The highest recall for each model is bolded in Table 19. Models with highest accuracies below 95% are not bolded. Graphically, the results are presented in Figure 10.

As can be observed from Table 19 and Figure 10, for UWF-ZeekDataSum2025-1, the best recall results are very close to the accuracy and precision results. The highest recall decreased by less than one percent relative to the precision results for ANN-Deep-Sub-Conv, ANN-Overfit-Wide, ANN-Shallow-Low-Opt, and CNN-Shallow, but the FI coefficient thresholds remained the same; hence, a detailed analysis is not presented.

UWF-ZeekDataSum2025-2

The recall results of the various ANN and CNN models at different FI coefficient thresholds using UWF-ZeekDataSum2025-2 are presented in Table 20. Due to the emphasis on positive predictions, this metric will test our models to see if they can detect attacks efficiently. The highest recall for each model is bolded in Table 20. Models with highest accuracies below 95% are not bolded. Graphically, the results are presented in Figure 11.

Table 21 presents the percent increase in recall for the various ANN and CNN models for each of the datasets. ANN-Deep-Sub-Conv, ANN-Wide-Sub-Conv, CNN-Deep and CNN-Very-Wide/Deep had no increase at all for UWF-ZeekDataSum2025-1, and the latter two models had no increase in either of the two datasets. ANN-Minimal, ANN-Overfit-Wide, ANN-Shallow-Low-Opt, CNN-Shallow and CNN-Very-Shallow had a good percent increase in recall using fewer features.

7.3.4. Analysis of F1 Results

The results are presented by dataset, and then the two dataset results are compared.

UWF-ZeekDataSum2025-1

The F1 scores for the various ANN and CNN models at different FI coefficient thresholds for UWF-ZeekDataSum2025-1 and UWF-ZeekDataSum2025-2 are presented in Table 22 and Table 23 and in Figure 12 and Figure 13, respectively. Again, the best F1 score results for UWF-ZeekDataSum2025-1 exactly match the accuracy and precision results; hence, a detailed analysis is not presented.

UWF-ZeekDataSum2025-2

For UWF-ZeekDataSum2025-2, as with the recall results, the F1 score results are very close to the accuracy and precision results. The highest F1 score decreased by less than one percent relative to the precision results for ANN-Deep-Sub-Conv, ANN-Overfit-Wide, ANN-Shallow-Low-Opt, CNN-Shallow, and CNN-Very-Shallow, but the FI coefficient thresholds remained the same; hence, a detailed analysis is not presented.

Table 24 presents the percent increase in F1 score for the various ANN and CNN models for each of the datasets. ANN-Deep-Sub-Conv, ANN-Wide-Sub-Conv, CNN-Deep and CNN-Very-Wide/Deep had no increase at all with fewer features for UWF-ZeekDataSum2025-1, and the latter two models had no increase in either of the two datasets. ANN-Minimal, ANN-Overfit-Wide, ANN-Shallow-Low-Opt, CNN-Shallow and CNN-Very-Shallow had a very high percent increase in F1 score using fewer features.

7.4. Summary of Key Findings

Several conclusions can be drawn from this paper about using feature importance to fine-tune ANN and CNN models. The results of the runs show that without feature selection, the results of most models in both the datasets hovered between 25% and 75%. Significantly higher results were obtained using features with higher feature importance scores.

Using the UWF-ZeekDataSum2025-1 dataset, all the evaluation metrics for five of the nine models, ANN-Minimal, ANN-Overfit-Wide, ANN-Shallow-Low-Opt, CNN-Shallow and CNN-Very-Shallow, presented the best results (greater than 99.5%) using just one or two features. Using up to four features, some of these models achieved accuracy greater than 96%.

Using the UWF-ZeekDataSum2025-2 dataset, all the evaluation metrics for six of the nine models, ANN-Deep-Sub-Conv, ANN-Minimal, ANN-Overfit-Wide, ANN-Shallow-Low-Opt, CNN-Shallow, and CNN-Very-Shallow, presented results greater than 95%, reaching up to 99.76%, using 5–11 features. This is still fewer than the original 17 features in this dataset, showing that filtering some features can dramatically improve the evaluation metrics.

8. Conclusions

The findings of this study demonstrate that feature importance-driven optimization offers a highly effective pathway for improving neural network-based intrusion detection systems. Across both datasets, strategically filtering out low-value features boosted the performance of several ANN and CNN variants. In UWF-ZeekDataSum2025-1, models achieved their strongest results after reducing the feature set to fewer than four features, highlighting the value of eliminating noise in high-dimensional security data. The more complex UWF-ZeekDataSum2025-2 dataset showed improvements across a wider set of thresholds, but similarly revealed that targeted feature reduction enabled multiple architectures to surpass 95% accuracy.

Feature sensitivity analysis revealed an unexpected pattern in which most models retained chance-level performance across many feature removal thresholds before exhibiting an increase in classification accuracy when only a small number of features remained. In particular, the Zeek “history” feature enabled the ANN-Overfit-Wide model to achieve 99.83% accuracy as a single predictor on UWF-ZeekDataSum2025-1. This result is plausible because the history field encodes compact representations of TCP connection behavior, including connection establishment, acknowledgements, payload transmission, and termination events. As documented in Zeek’s connection logs, the “history” attribute summarizes the sequence of actions observed between communicating hosts, thereby capturing behavioral characteristics that can be highly discriminative for intrusion detection. Nevertheless, the exceptionally high performance obtained from a single feature warrants further investigation to determine whether it reflects true predictive power across datasets, dataset-specific traffic patterns, or potential information leakage.

Overall, this work illustrates that integrating feature importance coefficients into the model development pipeline can strengthen ANN and CNN performance, reduce computational overhead, and enhance the reliability of intrusion detection systems in operational environments.

9. Future Works

A limitation of the present study is that the proposed FI coefficient metric is not directly compared against established feature importance techniques such as SHAP (SHapley Additive exPlanations), permutation importance, or mutual information. While SHAP is widely regarded as a benchmark method for interpreting machine learning models, its computational cost can become prohibitive for large-scale distributed experiments involving numerous model configurations and feature selection thresholds. The current work focused on evaluating a weight-based FI coefficient approach within the PySpark MLlib environment. Future research will compare the FI coefficient metric against established feature importance methods, including SHAP, permutation importance, and mutual information. This evaluation will provide validation of the proposed approach, quantify the degree of agreement between methods, and assess the trade-offs in large-scale network intrusion detection applications.

Author Contributions

Conceptualization, M.E., S.S.B., D.M., and S.C.B.; methodology, M.E., S.S.B., D.M., and S.C.B.; software, M.E.; validation, S.S.B., D.M., and S.C.B.; formal analysis, M.E.; investigation, M.E. and S.S.B.; resources, S.S.B., D.M., and S.C.B.; data curation, M.E. and D.M.; writing—original draft preparation, M.E.; writing—review and editing, S.S.B., D.M., and S.C.B.; visualization, M.E.; supervision, S.S.B., D.M., and S.C.B.; project administration, S.S.B., D.M., and S.C.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets are available at https://datasets.uwf.edu/ (accessed on 8 April 2026).

Acknowledgments

This research was also partially supported by the Askew Institute at the University of West Florida.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Razaulla, S.; Fachkha, C.; Markarian, C.; Gawanmeh, A.; Mansoor, W.; Fung, B.C.; Assi, C. The age of ransomware: A survey on the evolution, taxonomy, and research directions. IEEE Access 2023, 11, 40698–40723.
2. SDIWC. Optimizing Threat Intelligence Strategies. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 11.
3. UWF Datasets. Available online: https://datasets.uwf.edu/ (accessed on 2 May 2026).
4. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
5. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
6. Haykin, S. Neural Networks: A Comprehensive Foundation, 2nd ed.; Prentice Hall: Upper Saddle River, NJ, USA, 1999.
7. Krichen, M. Convolutional Neural Networks: A Survey. Computers 2023, 12, 151.
8. Zhao, X.; Wang, L.; Zhang, Y.; Han, X.; Deveci, M.; Parmar, M. A review of convolutional neural networks in computer vision. Artif. Intell. Rev. 2024, 57, 99.
9. Raj, R.; Kos, A. An Extensive Study of Convolutional Neural Networks: Applications in Computer Vision for Improved Robotics Perceptions. Sensors 2025, 25, 1033.
10. Younesi, A.; Ansari, M.; Fazli, M.; Ejlali, A.; Shafique, M.; Henkel, J. A Comprehensive Survey of Convolutions in Deep Learning: Applications, Challenges, and Future Trends. arXiv 2024, arXiv:2402.15490.
11. Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable; Springer International Publishing: Cham, Switzerland, 2020.
12. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. arXiv 2017, arXiv:1705.07874v2.
13. Fisher, A.; Rudin, C.; Dominici, F. All models are wrong, but many are useful: Learning a variable’s importance by predicting what will happen when it is missing. J. Am. Stat. Assoc. 2019, 114, 1019–1031.
14. Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.-R.; Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 2015, 10, e0130149.
15. Montavon, G.; Bach, S.; Binder, A.; Müller, K.-R.; Samek, W. Explaining non-linear classification decisions with deep Taylor decomposition. J. Mach. Learn. Res. 2017, 17, 3738–3773.
16. Priya, A.S.; Sandhiya, A. Machine Learning Based Cyber Attack Detection on Internet Traffic. Int. J. Sci. Res. Autom. 2024, 11, 619–624.
17. Khan, M.A.; Alsamiri, J. Internet of Things Cyber Attacks Detection. Int. J. Adv. Comput. Sci. Appl. 2019, 10, 12.
18. Lin, S.-W.; Ying, K.-C.; Lee, C.-Y.; Lee, Z.-J. An intelligent algorithm with feature selection and decision rules applied to anomaly intrusion detection. Appl. Soft Comput. 2012, 12, 3285–3290.
19. Ahanger, A.S.; Khan, M.S.; Masoodi, F. An effective intrusion detection system using supervised machine learning techniques. In Proceedings of the 5th International Conference on Computing Methodologies and Communication (ICCMC); IEEE: Piscataway, NJ, USA, 2021; pp. 1639–1644.
20. Labory, J. Benchmarking feature selection and extraction methods to improve classification performance. Comput. Struct. Biotechnol. J. 2024, 23, 1274–1287.
21. Azeroual, O. AI Meets Data Science: Preprocessing and Feature Selection Reimagined. In Proceedings of the 3rd Cognitive Models and Artificial Intelligence Conference (AICCONF); IEEE: Piscataway, NJ, USA, 2025.
22. Neupane, S.; Ables, J.; Anderson, W.; Mittal, S.; Rahimi, S.; Banicescu, I.; Seale, M. Explainable Intrusion Detection Systems (X-IDS): A Survey of Current Methods, Challenges, and Opportunities. IEEE Access 2022, 10, 112392–112415.
23. Long, Z.; Yan, H.; Shen, G.; Zhang, X.; He, H.; Cheng, L. A Transformer-Based Network Intrusion Detection Approach for Cloud Security. J. Cloud Comput. 2024, 13, 5.
24. Zeek Project. Zeek conn.log and Log Format Documentation. Available online: https://docs.zeek.org/en/current/log-formats.html (accessed on 2 May 2026).
25. MITRE ATT&CK® Framework. Available online: https://attack.mitre.org/ (accessed on 2 May 2026).
26. Miller, E.; Mink, D.; Spellings, P.; Bagui, S.S.; Bagui, S.C. Classifying Cyber Ranges: A Case-Based Analysis Using the UWF Cyber Range. Encyclopedia 2025, 5, 162.
27. Bagui, S.; Mink, D.; Bagui, S.; Ghosh, T.; McElroy, T.; Paredes, E.; Khasnavis, N.; Plenkers, R. Detecting Reconnaissance and Discovery Tactics from the MITRE ATT&CK Framework in Zeek Conn Logs Using Spark’s Machine Learning in the Big Data Framework. Sensors 2022, 22, 7999.
28. MITRE ATT&CK. Reconnaissance, Tactic TA0043—Enterprise. 2020. Available online: https://attack.mitre.org/tactics/TA0043/ (accessed on 8 August 2025).
29. MITRE ATT&CK. Privilege Escalation, Tactic TA0004—Enterprise. 2018. Available online: https://attack.mitre.org/tactics/TA0004/ (accessed on 8 August 2025).
30. MITRE ATT&CK. Defense Evasion, Tactic TA0005—Enterprise. 2018. Available online: https://attack.mitre.org/tactics/TA0005/ (accessed on 8 August 2025).
31. MITRE ATT&CK. Discovery, Tactic TA0007—Enterprise. 2018. Available online: https://attack.mitre.org/tactics/TA0007/ (accessed on 8 August 2025).
32. MITRE ATT&CK. Lateral Movement, Tactic TA0008—Enterprise. 2018. Available online: https://attack.mitre.org/tactics/TA0008/ (accessed on 8 August 2025).
33. MITRE ATT&CK. Initial Access, Tactic TA0001—Enterprise. 2018. Available online: https://attack.mitre.org/tactics/TA0001/ (accessed on 8 August 2025).
34. Scikit-Learn. Model Evaluation: Accuracy Score. 2026. Available online: https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score (accessed on 25 February 2026).
35. Scikit-Learn. Precision Score. 2026. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html (accessed on 25 February 2026).
36. Scikit-Learn. Recall Score. 2026. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html (accessed on 25 February 2026).
37. Scikit-Learn. F1 Score. 2026. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html (accessed on 25 February 2026).
38. Bagui, S.; Spratlin, S. A Review of Data Mining Algorithms on Hadoop’s MapReduce. Int. J. Data Sci. 2018, 3, 146–169.

Figure 1. Artificial Neural Network architecture.

Figure 2. Convolutional Neural Network architecture.

Figure 3. Training loss vs. iterations per model for UWF-ZeekData2025-1.

Figure 4. Training loss vs. iterations per model for UWF-ZeekData2025-2.

Figure 5. Experimental workflow.

Figure 6. UWF-ZeekDataSum2025-1: Accuracy vs. feature importance threshold by model.

Figure 7. UWF-ZeekDataSum2025-2: Accuracy vs. feature importance threshold by model.

Figure 8. UWF-ZeekDataSum2025-1: Precision vs. feature importance threshold by model.

Figure 9. UWF-ZeekDataSum2025-2: Precision vs. feature importance threshold by model.

Figure 10. UWF-ZeekDataSum2025-1: Recall vs. feature importance threshold by model.

Figure 11. UWF-ZeekDataSum2025-2: Recall vs. feature importance threshold by model.

Figure 12. UWF-ZeekDataSum2025-1: F1 score vs. feature importance threshold by model.

Figure 13. UWF-ZeekDataSum2025-2: F1 score vs. feature importance threshold by model.

Table 1. UWF-ZeekDataSum2025-1: Distribution of attack tactics.

UWF-ZeekDataSum2025-1
Tactic	Count
Reconnaissance	568,260
Privilege Escalation	3532
Defense Evasion	3532
Discovery	1099
Lateral Movement	326
Initial Access	1
Benign	576,759

Table 2. UWF-ZeekDataSum2025-2: Distribution of attack tactics.

UWF-ZeekDataSum2025-2
Tactic	Count
Reconnaissance	457,182
Discovery	477,271
Benign	934,226

Table 3. Description of features [26].

Field	Definition
community_id	A standardized hash created from the flow’s tuple to uniquely identify a network connection
conn_state	A symbolic Zeek code summarizing how the connection progressed (e.g., SYN seen, fully established, reset)
duration	The total time in seconds that the connection remained active
history	A compact sequence showing packet-level events such as SYNs, FINs, ACKs, and RSTs
src_ip_zeek	The IP address of the host that initiated the connection
src_port_zeek	The source port used by the initiating system
dest_ip_zeek	The IP of the host receiving the connection
dest_port_zeek	The service port on the destination host targeted by the flow
local_orig	Indicates whether Zeek classifies the originating host as part of the internally monitored network
local_resp	Indicates whether the responding host belongs to the local network
missed_bytes	Number of payload bytes Zeek could not process due to packet loss or truncation
orig_bytes	Total application-layer bytes transmitted by the originator
orig_ip_bytes	Total bytes including IP headers sent by the originator
orig_pkts	Number of packets sent by the originator
proto	The transport protocol used in the connection (TCP/UDP/ICMP)
resp_bytes	Application-layer bytes transmitted by the responder
resp_ip_bytes	Total bytes including IP headers sent by the responder
resp_pkts	Number of packets sent by the responder
service	The application-layer protocol that Zeek infers (e.g., HTTP, DNS)
ts	Epoch timestamp when Zeek first detected the connection
uid	Unique identifier Zeek assigns to group all logs for a given connection
datetime	A human-readable timestamp converted from the Zeek epoch-based ts value
label_tactic	The MITRE ATT&CK tactic describing the high-level adversarial goal (e.g., Discovery, Persistence)
label_technique	The MITRE ATT&CK technique describing the specific malicious behavior observed
label_binary	A binary classification label indicating whether the flow is benign (0) or malicious (1)
label_cve	The CVE identifier mapped to the malicious activity when applicable

Table 4. ANN and CNN configurations by model.

Model Type	Model Name	Layer Configuration	Max Iter	Block Size	Solver	Step Size
ANN	Overfit Wide	[128, 64, 32]	10	128	l-bfgs	0.03
ANN	Shallow Low Opt	[32, 16]	10	64	l-bfgs	0.1
ANN	Wide Sub-Conv	[150, 150]	10	256	gd	0.005
ANN	Deep Sub-Conv	[50, 50, 50]	10	512	gd	0.005
ANN	Minimal	[16]	10	128	l-bfgs	0.03
CNN	Shallow	[32, 16]	10	128	l-bfgs	0.03
CNN	Deep	[64, 32, 16]	10	128	gd	0.005
CNN	Very Shallow	[16]	10	64	l-bfgs	0.05
CNN	Very Wide/Deep	[128, 64, 64, 32]	15	256	gd	0.001

Table 5. UWF-ZeekDataSum2025-1: Feature importance scores.

Feature	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
community_id	0.3015	0.295	0.315	0.305	0.285	0.255	0.24	0.24	0.275
conn_state	0.2589	0.2817	0.2986	0.2356	0.3036	0.2665	0.2356	0.2817	0.2986
duration	0.3218	0.3892	0.2977	0.3489	0.3005	0.3017	0.3489	0.3892	0.2977
history	0.3538	0.3719	0.3317	0.3716	0.3186	0.3462	0.3716	0.3719	0.3317
src_port_zeek	0.263	0.2181	0.2738	0.2538	0.2724	0.2717	0.2538	0.2181	0.2738
dest_port_zeek	0.2943	0.3281	0.3029	0.3255	0.2967	0.3068	0.3255	0.3281	0.3029
local_orig	0.2734	0.238	0.3058	0.2696	0.3106	0.2783	0.2696	0.238	0.3058
local_resp	0.2676	0.2435	0.2752	0.2779	0.2783	0.2651	0.2779	0.2435	0.2752
missed_bytes	0.2762	0.2685	0.2983	0.2691	0.298	0.2554	0.2691	0.2685	0.2983
orig_bytes	0.2664	0.2095	0.2825	0.2595	0.2852	0.2562	0.2595	0.2095	0.2825
orig_ip_bytes	0.2606	0.2215	0.304	0.2116	0.3045	0.2716	0.2116	0.2215	0.304
orig_pkts	0.284	0.2187	0.2904	0.2419	0.2893	0.2694	0.2419	0.2187	0.2904
resp_bytes	0.2987	0.3144	0.2895	0.2792	0.289	0.3067	0.2792	0.3144	0.2895
resp_ip_bytes	0.3205	0.2983	0.3102	0.2975	0.3046	0.3131	0.2975	0.2983	0.3102
resp_pkts	0.2985	0.2518	0.2789	0.3009	0.2896	0.2761	0.3009	0.2518	0.2789
service	0.2881	0.3432	0.288	0.301	0.2829	0.2832	0.301	0.3432	0.288
ts	0.3052	0.315	0.3209	0.3158	0.3108	0.3167	0.3158	0.315	0.3209

Table 6. UWF-ZeekDataSum2025-2: Feature importance scores.

Feature	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
community_id	0.271	0.2828	0.2752	0.2784	0.2833	0.2675	0.2784	0.2828	0.2752
conn_state	0.3517	0.4122	0.3301	0.3532	0.3142	0.3415	0.3532	0.4122	0.3301
duration	0.2558	0.2119	0.2834	0.2435	0.2794	0.2536	0.2435	0.2119	0.2834
history	0.2761	0.3021	0.2866	0.2621	0.2987	0.2779	0.2621	0.3021	0.2866
src_port_zeek	0.2716	0.2578	0.2641	0.2968	0.2653	0.2669	0.2968	0.2578	0.2641
dest_port_zeek	0.3208	0.3013	0.2811	0.3104	0.276	0.3111	0.3104	0.3013	0.2811
local_orig	0.2656	0.2294	0.2958	0.2473	0.2937	0.2636	0.2473	0.2294	0.2958
local_resp	0.261	0.2515	0.2592	0.2482	0.2599	0.2623	0.2482	0.2515	0.2592
missed_bytes	0.2789	0.2703	0.295	0.2592	0.2993	0.282	0.2592	0.2703	0.295
orig_bytes	0.2943	0.2855	0.2953	0.2754	0.3035	0.2901	0.2754	0.2855	0.2953
orig_ip_bytes	0.2525	0.2787	0.2734	0.2417	0.2657	0.2633	0.2417	0.2787	0.2734
orig_pkts	0.24	0.238	0.2949	0.2314	0.3034	0.2621	0.2314	0.238	0.2949
resp_bytes	0.3067	0.3234	0.3171	0.2957	0.3109	0.3132	0.2957	0.3234	0.3171
resp_ip_bytes	0.2854	0.2708	0.2913	0.3006	0.2814	0.2854	0.3006	0.2708	0.2913
resp_pkts	0.3024	0.2658	0.2979	0.2688	0.2973	0.2812	0.2688	0.2658	0.2979
service	0.286	0.3128	0.2743	0.2838	0.2761	0.266	0.2838	0.3128	0.2743
ts	0.2461	0.2176	0.2819	0.243	0.2768	0.2575	0.243	0.2176	0.2819

Table 7. UWF-ZeekDataSum2025-1: Features trained per model at each feature coefficient threshold.

FI Coefficient Threshold	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
0	17	17	17	17	17	17	17	17	17
0.25	17	10	17	13	17	17	13	10	17
0.26	15	9	17	11	17	14	11	9	17
0.27	11	8	17	9	17	11	9	8	17
0.28	9	8	13	7	14	7	7	8	13
0.29	7	7	10	7	9	6	7	7	10
0.3	4	6	6	6	7	6	6	6	6
0.31	3	6	3	4	3	3	4	6	3
0.32	3	4	2	3	17	1	3	4	2
0.33	1	3	1	2	17	1	2	3	1
0.34	1	3	17	2	17	1	2	3	17
0.35	1	2	17	1	17	17	1	2	17
0.4	17	17	17	17	17	17	17	17	17
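
The counts in Table 7 are consistent with a simple filter: keep each feature whose importance score meets the threshold, and fall back to all seventeen features when no feature qualifies (every model returns to 17 features at the 0.4 threshold). The following is a minimal Python sketch of that reading, not the authors' code. The function name `select_features`, the inclusive comparison (`>=`), and the fallback rule are assumptions inferred from the tables; the scores are the ANN-Minimal column of Table 5.

```python
# ANN-Minimal column of Table 5 (feature importance scores).
ANN_MINIMAL_FI = {
    "community_id": 0.295, "conn_state": 0.2817, "duration": 0.3892,
    "history": 0.3719, "src_port_zeek": 0.2181, "dest_port_zeek": 0.3281,
    "local_orig": 0.238, "local_resp": 0.2435, "missed_bytes": 0.2685,
    "orig_bytes": 0.2095, "orig_ip_bytes": 0.2215, "orig_pkts": 0.2187,
    "resp_bytes": 0.3144, "resp_ip_bytes": 0.2983, "resp_pkts": 0.2518,
    "service": 0.3432, "ts": 0.315,
}


def select_features(scores, threshold):
    """Return the features whose score is at least `threshold`.

    Assumption: if no feature qualifies, all features are kept, which is
    how the 17-feature rows at high thresholds in Table 7 are read here.
    """
    kept = [name for name, score in scores.items() if score >= threshold]
    return kept if kept else list(scores)


for t in (0.0, 0.32, 0.33, 0.35, 0.4):
    print(t, len(select_features(ANN_MINIMAL_FI, t)))
```

At the 0.35 threshold the sketch keeps only duration and history, which agrees with the 2 features listed for ANN-Minimal in Table 7. It is not claimed to reproduce every cell of the table; at some lower thresholds its counts differ from the table by one.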

Table 8. UWF-ZeekDataSum2025-2: Features trained per model at each feature coefficient threshold.

FI Coefficient Threshold	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
0	17	17	17	17	17	17	17	17	17
0.25	15	13	17	11	17	17	11	13	17
0.26	13	11	16	10	16	15	10	11	16
0.27	11	10	15	8	14	8	8	10	15
0.28	7	7	12	6	10	7	6	7	12
0.29	5	5	8	5	8	4	5	5	8
0.3	4	5	2	3	4	3	3	5	2
0.31	2	3	2	2	2	3	2	3	2
0.32	2	2	1	1	17	1	1	2	1
0.33	1	1	1	1	17	1	1	1	1
0.34	1	1	17	1	17	1	1	1	17
0.35	1	1	17	1	17	17	1	1	17
0.4	17	1	17	17	17	17	17	1	17

Table 9. Accuracy results for the various ANN and CNN models for UWF-ZeekDataSum2025-1.

FI Coefficient Threshold	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
0	50.01%	49.99%	50.00%	50.00%	49.99%	50.01%	50.00%	49.99%	49.99%
0.25	50.01%	50.01%	50.00%	50.00%	49.99%	50.01%	50.00%	50.01%	49.99%
0.26	50.01%	49.99%	50.00%	49.99%	49.99%	49.99%	49.99%	49.99%	49.99%
0.27	49.99%	49.99%	50.00%	49.99%	49.99%	50.01%	49.99%	49.99%	49.99%
0.28	49.99%	49.99%	49.99%	49.99%	50.00%	49.99%	49.99%	49.99%	49.99%
0.29	50.01%	50.01%	49.99%	49.99%	49.99%	50.01%	49.99%	50.01%	49.99%
0.3	50.01%	50.01%	49.99%	49.99%	49.99%	50.01%	49.99%	50.01%	50.01%
0.31	50.01%	50.01%	49.99%	49.99%	50.01%	50.01%	49.99%	50.01%	49.99%
0.32	50.01%	75.10%	49.99%	50.01%	49.99%	49.99%	50.01%	95.93%	50.01%
0.33	49.99%	99.50%	99.83%	99.84%	49.99%	49.99%	99.83%	99.50%	50.01%
0.34	49.99%	99.50%	50.00%	99.84%	49.99%	49.99%	99.83%	99.50%	49.99%
0.35	49.99%	99.84%	50.00%	99.83%	49.99%	50.01%	99.83%	99.84%	49.99%
0.4	50.01%	49.99%	50.00%	50.00%	49.99%	50.01%	50.00%	49.99%	49.99%

Table 10. Best features for the various ANN and CNN models for UWF-ZeekDataSum2025-1.

Feature	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
community_id	0.3015	0.295	0.315	0.305	0.285	0.255	0.24	0.24	0.275
conn_state	0.2589	0.2817	0.2986	0.2356	0.3036	0.2665	0.2356	0.2817	0.2986
duration	0.3218	0.3892	0.2977	0.3489	0.3005	0.3017	0.3489	0.3892	0.2977
history	0.3538	0.3719	0.3317	0.3716	0.3186	0.3462	0.3716	0.3719	0.3317
src_port_zeek	0.263	0.2181	0.2738	0.2538	0.2724	0.2717	0.2538	0.2181	0.2738
dest_port_zeek	0.2943	0.3281	0.3029	0.3255	0.2967	0.3068	0.3255	0.3281	0.3029
local_orig	0.2734	0.238	0.3058	0.2696	0.3106	0.2783	0.2696	0.238	0.3058
local_resp	0.2676	0.2435	0.2752	0.2779	0.2783	0.2651	0.2779	0.2435	0.2752
missed_bytes	0.2762	0.2685	0.2983	0.2691	0.298	0.2554	0.2691	0.2685	0.2983
orig_bytes	0.2664	0.2095	0.2825	0.2595	0.2852	0.2562	0.2595	0.2095	0.2825
orig_ip_bytes	0.2606	0.2215	0.304	0.2116	0.3045	0.2716	0.2116	0.2215	0.304
orig_pkts	0.284	0.2187	0.2904	0.2419	0.2893	0.2694	0.2419	0.2187	0.2904
resp_bytes	0.2987	0.3144	0.2895	0.2792	0.289	0.3067	0.2792	0.3144	0.2895
resp_ip_bytes	0.3205	0.2983	0.3102	0.2975	0.3046	0.3131	0.2975	0.2983	0.3102
resp_pkts	0.2985	0.2518	0.2789	0.3009	0.2896	0.2761	0.3009	0.2518	0.2789
service	0.2881	0.3432	0.288	0.301	0.2829	0.2832	0.301	0.3432	0.288
ts	0.3052	0.315	0.3209	0.3158	0.3108	0.3167	0.3158	0.315	0.3209

Table 11. Accuracy results for the various ANN and CNN models for UWF-ZeekDataSum2025-2.

FI Coefficient Threshold	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
0	50.10%	49.90%	49.90%	49.90%	49.90%	50.10%	49.90%	49.90%	50.10%
0.25	49.90%	96.99%	49.90%	95.51%	49.90%	50.10%	97.05%	97.05%	50.10%
0.26	50.10%	99.73%	49.90%	96.87%	50.10%	49.90%	95.50%	99.76%	50.10%
0.27	50.10%	99.69%	49.90%	97.49%	50.10%	50.10%	97.19%	99.69%	50.10%
0.28	95.83%	99.72%	49.90%	96.62%	75.15%	50.10%	85.18%	99.68%	49.90%
0.29	50.10%	49.75%	95.79%	95.12%	75.17%	50.10%	97.50%	49.75%	50.10%
0.3	49.90%	49.75%	95.63%	49.90%	75.52%	49.90%	49.90%	49.75%	49.90%
0.31	49.90%	99.16%	95.63%	49.90%	70.57%	49.90%	49.87%	99.16%	49.90%
0.32	49.90%	95.63%	70.14%	70.14%	49.90%	50.10%	70.14%	95.63%	49.90%
0.33	50.10%	70.14%	70.14%	70.14%	49.90%	50.10%	70.14%	70.14%	49.90%
0.34	50.10%	70.14%	49.90%	70.14%	49.90%	50.10%	70.14%	70.14%	50.10%
0.35	50.10%	70.14%	49.90%	70.14%	49.90%	50.10%	70.14%	70.14%	50.10%
0.4	50.10%	70.14%	49.90%	49.90%	49.90%	50.10%	49.90%	70.14%	50.10%

Table 12. Best features for the various ANN and CNN models for UWF-ZeekDataSum2025-2.

Feature	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
community_id	0.271	0.2828	0.2752	0.2784	0.2833	0.2675	0.2784	0.2828	0.2752
conn_state	0.3517	0.4122	0.3301	0.3532	0.3142	0.3415	0.3532	0.4122	0.3301
duration	0.2558	0.2119	0.2834	0.2435	0.2794	0.2536	0.2435	0.2119	0.2834
history	0.2761	0.3021	0.2866	0.2621	0.2987	0.2779	0.2621	0.3021	0.2866
src_port_zeek	0.2716	0.2578	0.2641	0.2968	0.2653	0.2669	0.2968	0.2578	0.2641
dest_port_zeek	0.3208	0.3013	0.2811	0.3104	0.276	0.3111	0.3104	0.3013	0.2811
local_orig	0.2656	0.2294	0.2958	0.2473	0.2937	0.2636	0.2473	0.2294	0.2958
local_resp	0.261	0.2515	0.2592	0.2482	0.2599	0.2623	0.2482	0.2515	0.2592
missed_bytes	0.2789	0.2703	0.295	0.2592	0.2993	0.282	0.2592	0.2703	0.295
orig_bytes	0.2943	0.2855	0.2953	0.2754	0.3035	0.2901	0.2754	0.2855	0.2953
orig_ip_bytes	0.2525	0.2787	0.2734	0.2417	0.2657	0.2633	0.2417	0.2787	0.2734
orig_pkts	0.24	0.238	0.2949	0.2314	0.3034	0.2621	0.2314	0.238	0.2949
resp_bytes	0.3067	0.3234	0.3171	0.2957	0.3109	0.3132	0.2957	0.3234	0.3171
resp_ip_bytes	0.2854	0.2708	0.2913	0.3006	0.2814	0.2854	0.3006	0.2708	0.2913
resp_pkts	0.3024	0.2658	0.2979	0.2688	0.2973	0.2812	0.2688	0.2658	0.2979
service	0.286	0.3128	0.2743	0.2838	0.2761	0.266	0.2838	0.3128	0.2743
ts	0.2461	0.2176	0.2819	0.243	0.2768	0.2575	0.243	0.2176	0.2819

Table 13. Percent increase in accuracy for the ANN and CNN models.

Dataset	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
UWF-ZeekDataSum25-1	0%	99%	99%	99%	0%	0%	99%	99%	0%
UWF-ZeekDataSum25-2	91%	99%	92%	95%	51%	0%	95%	99%	0%
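
The entries in Table 13 appear to be relative gains of the best accuracy over the baseline accuracy at threshold 0. The short Python sketch below works through that arithmetic for two columns of Table 11. The formula, the function names `percent_increase` and `best_gain`, and the choice of the threshold-0 row as the baseline are assumptions inferred from the tables; the rounding convention used in Table 13 is not stated, so the values are left unrounded.

```python
def percent_increase(baseline, best):
    """Relative gain of `best` over `baseline`, in percent."""
    return (best - baseline) / baseline * 100.0


def best_gain(column):
    """Gain of the best value in `column` over its first (threshold 0) value."""
    return percent_increase(column[0], max(column))


# Accuracy (%) by threshold, copied from Table 11 (UWF-ZeekDataSum2025-2).
ann_deep_sub_conv = [50.10, 49.90, 50.10, 50.10, 95.83, 50.10, 49.90,
                     49.90, 49.90, 50.10, 50.10, 50.10, 50.10]
cnn_deep = [50.10, 50.10, 49.90, 50.10, 50.10, 50.10, 49.90,
            49.90, 50.10, 50.10, 50.10, 50.10, 50.10]

print(f"ANN-Deep-Sub-Conv: {best_gain(ann_deep_sub_conv):.2f}%")
print(f"CNN-Deep: {best_gain(cnn_deep):.2f}%")
```

Under these assumptions, ANN-Deep-Sub-Conv gains about 91% and CNN-Deep gains nothing, in line with the corresponding cells of Table 13.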

Table 14. Precision results for the various ANN and CNN models for UWF-ZeekDataSum2025-1.

FI Coefficient Threshold	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
0	25.01%	75.00%	75.00%	75.00%	75.00%	25.01%	75.00%	75.00%	24.99%
0.25	25.01%	25.01%	75.00%	75.00%	75.00%	25.01%	75.00%	25.01%	24.99%
0.26	25.01%	75.00%	75.00%	75.00%	75.00%	24.99%	75.00%	75.00%	24.99%
0.27	24.99%	75.00%	75.00%	24.99%	75.00%	25.01%	24.99%	75.00%	24.99%
0.28	24.99%	75.00%	75.00%	24.99%	25.01%	24.99%	24.99%	75.00%	24.99%
0.29	25.01%	25.01%	24.99%	24.99%	24.99%	25.01%	24.99%	25.01%	24.99%
0.3	25.01%	25.01%	24.99%	24.99%	24.99%	25.01%	24.99%	25.01%	25.01%
0.31	25.01%	25.01%	24.99%	24.99%	25.01%	25.01%	24.99%	25.01%	24.99%
0.32	25.01%	82.64%	24.99%	25.01%	75.00%	24.99%	25.01%	96.11%	25.01%
0.33	24.99%	99.50%	99.83%	99.84%	75.00%	24.99%	99.83%	99.50%	25.01%
0.34	24.99%	99.50%	75.00%	99.84%	75.00%	24.99%	99.83%	99.50%	24.99%
0.35	24.99%	99.84%	75.00%	99.83%	75.00%	25.01%	99.83%	99.84%	24.99%
0.4	25.01%	75.00%	75.00%	75.00%	75.00%	25.01%	75.00%	75.00%	24.99%
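
Many cells in the metric tables sit near 50% accuracy and recall, 25% or 75% precision, and 33.33% F1. One reading, offered here as an assumption and not as something the tables state, is that these cells come from a model that predicts a single class on a roughly balanced test set, scored with support-weighted averages. The sketch below computes those averages in plain Python. The function `weighted_scores` and its `zero_division` parameter are illustrative names; the parameter sets the precision assigned to a class that is never predicted.

```python
from collections import Counter


def weighted_scores(y_true, y_pred, zero_division=0.0):
    """Support-weighted precision, recall, and F1 over the classes in y_true."""
    n = len(y_true)
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = correct[label]
        # Precision is undefined for a class that is never predicted.
        p = tp / predicted[label] if predicted[label] else zero_division
        r = tp / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


# Balanced toy labels; the model predicts class 0 for every sample.
y_true = [0] * 5 + [1] * 5
y_pred = [0] * 10

collapsed = weighted_scores(y_true, y_pred)                         # undefined precision counted as 0
collapsed_alt = weighted_scores(y_true, y_pred, zero_division=1.0)  # counted as 1
print(collapsed, collapsed_alt)
```

With undefined precision counted as 0, the toy case gives 25% precision, 50% recall, and 33.33% F1; counting it as 1 raises precision to 75% and leaves the other two unchanged. Support-weighted recall is also algebraically equal to accuracy, which fits the identical values in Tables 9 and 19 and in Tables 11 and 20.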

Table 15. Best features for the various ANN and CNN models for UWF-ZeekDataSum2025-1.

Feature	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
community_id	0.3015	0.295	0.315	0.305	0.285	0.255	0.24	0.24	0.275
conn_state	0.2589	0.2817	0.2986	0.2356	0.3036	0.2665	0.2356	0.2817	0.2986
duration	0.3218	0.3892	0.2977	0.3489	0.3005	0.3017	0.3489	0.3892	0.2977
history	0.3538	0.3719	0.3317	0.3716	0.3186	0.3462	0.3716	0.3719	0.3317
src_port_zeek	0.263	0.2181	0.2738	0.2538	0.2724	0.2717	0.2538	0.2181	0.2738
dest_port_zeek	0.2943	0.3281	0.3029	0.3255	0.2967	0.3068	0.3255	0.3281	0.3029
local_orig	0.2734	0.238	0.3058	0.2696	0.3106	0.2783	0.2696	0.238	0.3058
local_resp	0.2676	0.2435	0.2752	0.2779	0.2783	0.2651	0.2779	0.2435	0.2752
missed_bytes	0.2762	0.2685	0.2983	0.2691	0.298	0.2554	0.2691	0.2685	0.2983
orig_bytes	0.2664	0.2095	0.2825	0.2595	0.2852	0.2562	0.2595	0.2095	0.2825
orig_ip_bytes	0.2606	0.2215	0.304	0.2116	0.3045	0.2716	0.2116	0.2215	0.304
orig_pkts	0.284	0.2187	0.2904	0.2419	0.2893	0.2694	0.2419	0.2187	0.2904
resp_bytes	0.2987	0.3144	0.2895	0.2792	0.289	0.3067	0.2792	0.3144	0.2895
resp_ip_bytes	0.3205	0.2983	0.3102	0.2975	0.3046	0.3131	0.2975	0.2983	0.3102
resp_pkts	0.2985	0.2518	0.2789	0.3009	0.2896	0.2761	0.3009	0.2518	0.2789
service	0.2881	0.3432	0.288	0.301	0.2829	0.2832	0.301	0.3432	0.288
ts	0.3052	0.315	0.3209	0.3158	0.3108	0.3167	0.3158	0.315	0.3209

Table 16. Precision results for the various ANN and CNN models for UWF-ZeekDataSum2025-2.

FI Coefficient Threshold	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
0	25.10%	75.00%	24.90%	24.90%	24.90%	25.10%	75.00%	75.00%	25.10%
0.25	24.90%	96.99%	24.90%	95.75%	24.90%	25.10%	97.05%	97.11%	25.10%
0.26	25.10%	99.73%	75.00%	96.87%	25.10%	24.90%	95.63%	99.76%	25.10%
0.27	25.10%	99.69%	75.00%	97.50%	25.10%	25.10%	97.20%	99.69%	25.10%
0.28	96.14%	99.72%	75.00%	96.63%	78.15%	25.10%	88.01%	99.68%	24.90%
0.29	25.10%	24.86%	96.11%	95.37%	78.16%	25.10%	97.51%	24.86%	25.10%
0.3	24.90%	24.86%	95.98%	24.90%	78.72%	24.90%	24.90%	24.86%	24.90%
0.31	24.90%	99.17%	95.98%	24.90%	81.17%	24.90%	25.33%	99.17%	24.90%
0.32	24.90%	95.98%	80.00%	80.00%	24.90%	25.10%	80.00%	95.98%	24.90%
0.33	25.10%	80.00%	80.00%	80.00%	24.90%	25.10%	80.00%	80.00%	24.90%
0.34	25.10%	80.00%	24.90%	80.00%	24.90%	25.10%	80.00%	80.00%	25.10%
0.35	25.10%	80.00%	24.90%	80.00%	24.90%	25.10%	80.00%	80.00%	25.10%
0.4	25.10%	80.00%	24.90%	24.90%	24.90%	25.10%	75.00%	80.00%	25.10%

Table 17. Best features for the various ANN and CNN models for UWF-ZeekDataSum2025-2.

Feature	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
community_id	0.271	0.2828	0.2752	0.2784	0.2833	0.2675	0.2784	0.2828	0.2752
conn_state	0.3517	0.4122	0.3301	0.3532	0.3142	0.3415	0.3532	0.4122	0.3301
duration	0.2558	0.2119	0.2834	0.2435	0.2794	0.2536	0.2435	0.2119	0.2834
history	0.2761	0.3021	0.2866	0.2621	0.2987	0.2779	0.2621	0.3021	0.2866
src_port_zeek	0.2716	0.2578	0.2641	0.2968	0.2653	0.2669	0.2968	0.2578	0.2641
dest_port_zeek	0.3208	0.3013	0.2811	0.3104	0.276	0.3111	0.3104	0.3013	0.2811
local_orig	0.2656	0.2294	0.2958	0.2473	0.2937	0.2636	0.2473	0.2294	0.2958
local_resp	0.261	0.2515	0.2592	0.2482	0.2599	0.2623	0.2482	0.2515	0.2592
missed_bytes	0.2789	0.2703	0.295	0.2592	0.2993	0.282	0.2592	0.2703	0.295
orig_bytes	0.2943	0.2855	0.2953	0.2754	0.3035	0.2901	0.2754	0.2855	0.2953
orig_ip_bytes	0.2525	0.2787	0.2734	0.2417	0.2657	0.2633	0.2417	0.2787	0.2734
orig_pkts	0.24	0.238	0.2949	0.2314	0.3034	0.2621	0.2314	0.238	0.2949
resp_bytes	0.3067	0.3234	0.3171	0.2957	0.3109	0.3132	0.2957	0.3234	0.3171
resp_ip_bytes	0.2854	0.2708	0.2913	0.3006	0.2814	0.2854	0.3006	0.2708	0.2913
resp_pkts	0.3024	0.2658	0.2979	0.2688	0.2973	0.2812	0.2688	0.2658	0.2979
service	0.286	0.3128	0.2743	0.2838	0.2761	0.266	0.2838	0.3128	0.2743
ts	0.2461	0.2176	0.2819	0.243	0.2768	0.2575	0.243	0.2176	0.2819

Table 18. Percent increase in precision for the ANN and CNN models.

Dataset	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
UWF-ZeekDataSum25-1	0%	33%	33%	33%	0%	0%	33%	35%	0%
UWF-ZeekDataSum25-2	284%	33%	286%	291%	226%	0%	30%	33%	0%

Table 19. Recall results for the various ANN and CNN models for UWF-ZeekDataSum2025-1.

FI Coefficient Threshold	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
0	50.01%	49.99%	50.00%	50.00%	49.99%	50.01%	50.00%	49.99%	49.99%
0.25	50.01%	50.01%	50.00%	50.00%	49.99%	50.01%	50.00%	50.01%	49.99%
0.26	50.01%	49.99%	50.00%	49.99%	49.99%	49.99%	49.99%	49.99%	49.99%
0.27	49.99%	49.99%	50.00%	49.99%	49.99%	50.01%	49.99%	49.99%	49.99%
0.28	49.99%	49.99%	49.99%	49.99%	50.00%	49.99%	49.99%	49.99%	49.99%
0.29	50.01%	50.01%	49.99%	49.99%	49.99%	50.01%	49.99%	50.01%	49.99%
0.3	50.01%	50.01%	49.99%	49.99%	49.99%	50.01%	49.99%	50.01%	50.01%
0.31	50.01%	50.01%	49.99%	49.99%	50.01%	50.01%	49.99%	50.01%	49.99%
0.32	50.01%	75.10%	49.99%	50.01%	49.99%	49.99%	50.01%	95.93%	50.01%
0.33	49.99%	99.50%	99.83%	99.84%	49.99%	49.99%	99.83%	99.50%	50.01%
0.34	49.99%	99.50%	50.00%	99.84%	49.99%	49.99%	99.83%	99.50%	49.99%
0.35	49.99%	99.84%	50.00%	99.83%	49.99%	50.01%	99.83%	99.84%	49.99%
0.4	50.01%	49.99%	50.00%	50.00%	49.99%	50.01%	50.00%	49.99%	49.99%

Table 20. Recall results for the various ANN and CNN models for UWF-ZeekDataSum2025-2.

FI Coefficient Threshold	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
0	50.10%	49.90%	49.90%	49.90%	49.90%	50.10%	49.90%	49.90%	50.10%
0.25	49.90%	96.99%	49.90%	95.51%	49.90%	50.10%	97.05%	97.05%	50.10%
0.26	50.10%	99.73%	49.90%	96.87%	50.10%	49.90%	95.50%	99.76%	50.10%
0.27	50.10%	99.69%	49.90%	97.49%	50.10%	50.10%	97.19%	99.69%	50.10%
0.28	95.83%	99.72%	49.90%	96.62%	75.15%	50.10%	85.18%	99.68%	49.90%
0.29	50.10%	49.75%	95.79%	95.12%	75.17%	50.10%	97.50%	49.75%	50.10%
0.3	49.90%	49.75%	95.63%	49.90%	75.52%	49.90%	49.90%	49.75%	49.90%
0.31	49.90%	99.16%	95.63%	49.90%	70.57%	49.90%	49.87%	99.16%	49.90%
0.32	49.90%	95.63%	70.14%	70.14%	49.90%	50.10%	70.14%	95.63%	49.90%
0.33	50.10%	70.14%	70.14%	70.14%	49.90%	50.10%	70.14%	70.14%	49.90%
0.34	50.10%	70.14%	49.90%	70.14%	49.90%	50.10%	70.14%	70.14%	50.10%
0.35	50.10%	70.14%	49.90%	70.14%	49.90%	50.10%	70.14%	70.14%	50.10%
0.4	50.10%	70.14%	49.90%	49.90%	49.90%	50.10%	49.90%	70.14%	50.10%

Table 21. Percent increase in recall for the ANN and CNN models.

Dataset	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
UWF-ZeekDataSum25-1	0%	99%	99%	99%	0%	0%	99%	99%	0%
UWF-ZeekDataSum25-2	91%	99%	92%	95%	51%	0%	95%	99%	0%

Table 22. F1 score results for the various ANN and CNN models for UWF-ZeekDataSum2025-1.

FI Coefficient Threshold	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
0	33.35%	33.33%	33.34%	33.34%	33.33%	33.35%	33.34%	33.33%	33.32%
0.25	33.35%	33.34%	33.34%	33.34%	33.33%	33.35%	33.34%	33.34%	33.32%
0.26	33.35%	33.33%	33.34%	33.32%	33.33%	33.32%	33.32%	33.33%	33.32%
0.27	33.32%	33.32%	33.34%	33.32%	33.33%	33.35%	33.32%	33.32%	33.32%
0.28	33.32%	33.32%	33.33%	33.32%	33.34%	33.32%	33.32%	33.32%	33.32%
0.29	33.35%	33.35%	33.32%	33.32%	33.32%	33.35%	33.32%	33.35%	33.32%
0.3	33.35%	33.35%	33.32%	33.32%	33.32%	33.35%	33.32%	33.35%	33.35%
0.31	33.35%	33.35%	33.32%	33.32%	33.35%	33.35%	33.32%	33.35%	33.32%
0.32	33.35%	73.58%	33.32%	33.35%	33.33%	33.32%	33.35%	95.93%	33.35%
0.33	33.32%	99.50%	99.83%	99.84%	33.33%	33.32%	99.83%	99.50%	33.35%
0.34	33.32%	99.50%	33.34%	99.84%	33.33%	33.32%	99.83%	99.50%	33.32%
0.35	33.32%	99.84%	33.34%	99.83%	33.33%	33.35%	99.83%	99.84%	33.32%
0.4	33.35%	33.33%	33.34%	33.34%	33.33%	33.35%	33.34%	33.33%	33.32%

Table 23. F1 score results for the various ANN and CNN models for UWF-ZeekDataSum2025-2.

FI Coefficient Threshold	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
0	33.44%	33.22%	33.22%	33.22%	33.22%	33.44%	33.22%	33.22%	33.44%
0.25	33.22%	96.99%	33.22%	95.51%	33.22%	33.44%	97.05%	97.05%	33.44%
0.26	33.44%	99.73%	33.22%	96.87%	33.44%	33.22%	95.50%	99.76%	33.44%
0.27	33.44%	99.69%	33.22%	97.49%	33.44%	33.44%	97.19%	99.69%	33.44%
0.28	95.82%	99.72%	33.22%	96.62%	74.46%	33.44%	84.90%	99.68%	33.22%
0.29	33.44%	33.15%	95.78%	95.11%	74.48%	33.44%	97.50%	33.15%	33.44%
0.3	33.22%	33.15%	95.62%	33.22%	74.81%	33.22%	33.22%	33.15%	33.22%
0.31	33.22%	99.16%	95.62%	33.22%	67.82%	33.22%	33.21%	99.16%	33.22%
0.32	33.22%	95.62%	67.45%	67.45%	33.22%	33.44%	67.45%	95.62%	33.22%
0.33	33.44%	67.45%	67.45%	67.45%	33.22%	33.44%	67.45%	67.45%	33.22%
0.34	33.44%	67.45%	33.22%	67.45%	33.22%	33.44%	67.45%	67.45%	33.44%
0.35	33.44%	67.45%	33.22%	67.45%	33.22%	33.44%	67.45%	67.45%	33.44%
0.4	33.44%	67.45%	33.22%	33.22%	33.22%	33.44%	33.22%	67.45%	33.44%

Table 24. Percent increase in F1 score for the ANN and CNN models.

Dataset	ANN-Deep-Sub-Conv	ANN-Minimal	ANN-Overfit-Wide	ANN-Shallow-Low-Opt	ANN-Wide-Sub-Conv	CNN-Deep	CNN-Shallow	CNN-Very-Shallow	CNN-Very-Wide/Deep
UWF-ZeekDataSum25-1	0%	200%	200%	200%	0%	0%	200%	200%	0%
UWF-ZeekDataSum25-2	187%	199%	187%	192%	124%	0%	193%	200%	0%

