Comparative Analysis of Supervised and Unsupervised Learning for Intrusion Detection in Network Logs

Castro, Paulo; Santos, Fernando; Lopes, Pedro

doi:10.3390/computation14040092


by Paulo Castro ¹,*, Fernando Santos ² and Pedro Lopes ²

¹ Escola Superior de Tecnologia e Gestão de Lamego, Instituto Politécnico de Viseu, 5100-074 Lamego, Portugal

² CISeD – Research Centre in Digital Services, Instituto Politécnico de Viseu, 3504-510 Viseu, Portugal

* Author to whom correspondence should be addressed.

Computation 2026, 14(4), 92; https://doi.org/10.3390/computation14040092

Submission received: 13 March 2026 / Revised: 9 April 2026 / Accepted: 13 April 2026 / Published: 15 April 2026

(This article belongs to the Section Computational Engineering)


Abstract

The escalating complexity of network infrastructures and the growing sophistication of cyber threats require robust and automated Intrusion Detection Systems (IDS). This article presents a comparative investigation of the effectiveness of various Machine Learning and Deep Learning architectures in detecting anomalies in network logs. The methodology ranged from classic supervised and ensemble algorithms, such as Random Forest and XGBoost, to sequential Deep Learning approaches (LSTM, GRU) and unsupervised models based on latent reconstruction (VAE, DeepLog). The results demonstrate that supervised approaches significantly outperformed unsupervised methods in the analyzed context. The optimized XGBoost model established a performance benchmark, achieving a Recall of 0.96 and a Precision of 0.85, thereby offering an optimal balance between detecting rare threats and minimizing false alarms. In contrast, unsupervised models revealed critical limitations, suggesting that statistical mimicry between normal and anomalous traffic hinders detection based solely on reconstruction error. Additionally, the study documents the technical interoperability challenges encountered when attempting to integrate state-of-the-art language models, such as BERT. In conclusion, this work validates the effectiveness of Gradient Boosting algorithms and recurrent networks as viable and scalable solutions for critical network security, providing guidelines for model selection in real monitoring environments.

Keywords:

cybersecurity; anomaly detection; machine learning; XGBoost; deep learning; recurrent networks


1. Introduction

The exponential growth of network infrastructures and the escalating sophistication of cyber threats have exposed systems to unprecedented vulnerabilities, rendering traditional defense strategies inadequate against modern attack vectors [1,2]. Signature-based and static rule-based systems, while fundamental, demonstrate critical limitations in identifying zero-day attacks and generate high volumes of false positives, compromising the response capabilities of security teams.

In this context, intelligent analysis of network logs using Artificial Intelligence (AI) emerges as a promising solution. However, this approach faces complex challenges, including high data heterogeneity and extreme class imbalance between legitimate activities and intrusions. The motivation for this research lies in the necessity to optimize the reliability of detection models, prioritizing the reduction in false negatives to ensure operational continuity in critical infrastructures.

This article proposes a comparative methodology evaluating supervised and unsupervised models applied to real data from an institutional environment. The primary contributions of this study are as follows:

The design of a robust pipeline for processing network logs, ensuring data integrity for predictive analysis;
The implementation of optimized feature engineering techniques to ensure interoperability between Machine Learning and Deep Learning algorithms;
A rigorous evaluation of model performance under severe class imbalance conditions, with a strategic focus on maximizing Recall.

The remainder of this paper is organized as follows: Section 2 reviews related work on intrusion detection using Machine Learning and Deep Learning approaches; Section 3 describes the computing environment and hardware specifications; Section 4 details the experimental methodology, including preprocessing and feature engineering; Section 5 discusses the model evaluation strategy and analyzes the results for supervised and unsupervised architectures; and Section 6 concludes the paper with final remarks.

2. Related Works

Intrusion detection based on log and network traffic analysis has been extensively investigated using machine learning and deep learning approaches. The evolution and sophistication of cyber threats have driven the transition from traditional signature-based systems to intelligent models capable of identifying anomalous patterns and zero-day attacks with greater effectiveness than classic misuse detection methods [3,4,5]. However, the practical implementation of these models faces critical challenges, namely extreme class imbalance, where anomalies represent a minimal fraction of the records, and the heterogeneous and sequential nature of the data, which necessitates advanced preprocessing strategies for the preservation of temporal information [6].

In the field of Ensemble Learning, algorithms such as XGBoost and Random Forest stand out for their robustness and generalization capacity [7,8]. Recent studies identify XGBoost as one of the most effective models overall, often reporting accuracy values above 98% in the identification of complex attacks [7,9]. Nevertheless, the literature recognizes that this high performance in global metrics does not always translate into operational effectiveness, since an excessive focus on accuracy can neglect Recall, a vital metric in security contexts where a false negative can compromise the entire infrastructure [8].

Simultaneously, deep neural networks have emerged as key solutions due to their ability to model long-term temporal dependencies [10,11,12,13]. The gate mechanism present in architectures such as LSTM and GRU allows for selective information retention, resulting in effective learning of network event sequences. Recent work indicates that the combination of convolutional networks with recurrent units (CNN + GRU) offers an optimized balance between accuracy and computational efficiency, making it particularly suitable for real-time applications [11,12].

Beyond traditional centralized architectures, recent advancements in IDS have explored Federated Learning (FL) to enhance data privacy, allowing models to be trained across distributed nodes without sharing raw network logs [14].

Furthermore, the integration of Explainable Artificial Intelligence (XAI) techniques, such as Shapley Additive exPlanations (SHAP) or Local Interpretable Model-agnostic Explanations (LIME), has become crucial for providing transparency in Deep Learning decisions, addressing the ‘black-box’ nature of complex neural networks in security-critical environments [15]. Hybrid IDS frameworks, which combine signature-based methods with anomaly-based machine learning, are also gaining traction to mitigate the limitations of individual approaches [16].

While this study utilizes a private institutional dataset to ensure practical relevance [1], the literature frequently employs standard benchmarks such as CIC-IDS2017 and UNSW-NB15 to validate model generalizability across diverse attack vectors [17].

To better contextualize the current landscape of IDS, Table 1 presents a structured taxonomy of the primary methodologies discussed in the literature, highlighting their core mechanisms and typical applications.

Despite these advances, significant gaps remain in current research. Most studies prioritize global metrics rather than minimizing false negatives and lack validation in simulated scenarios that test model generalization against artificially reproduced intrusions. This study seeks to address these gaps by presenting a comparative analysis focused on reducing false negatives and evaluating the model’s robustness in conditions close to real environments.

3. Computing Environment

The practical implementation and comparative evaluation of the algorithms were conducted on a workstation equipped with an Intel® Core™ i7-10870H CPU @ 2.20 GHz (8 cores, 16 threads) (Intel Corporation, Santa Clara, CA, USA), configured to optimize the parallel processing required by decision tree-based models. The system features 16 GB of DDR4 RAM for efficient handling of large volumes of in-memory data and an NVIDIA GeForce RTX 2060 (6 GB GDDR6) graphics processing unit (GPU), which is essential for hardware acceleration during the training of Deep Learning architectures. Storage and input/output (I/O) operations were managed by an NVMe SSD drive to minimize latency within the data pipeline.

Under this hardware configuration, the system demonstrated high efficiency for real-time monitoring. The average inference latency was measured at 0.15 ms per record for tree-based models (XGBoost) and 0.42 ms for Deep Learning architectures. Training sessions, accelerated by GPU, were completed within 30 to 45 min, utilizing EarlyStopping (with a patience of 3 epochs) to ensure computational efficiency and prevent overfitting.
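The early-stopping rule mentioned above can be illustrated with a minimal, framework-independent sketch. The function name `early_stopping_epoch` and the validation losses are hypothetical; the experiments presumably relied on the EarlyStopping callback of the Deep Learning framework in use, whose configuration beyond the patience value is not stated.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; if that never happens, the last epoch
    is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses, for illustration only.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
stop_epoch = early_stopping_epoch(losses, patience=3)
```

In this toy run the loss stops improving after epoch 3, so training halts at epoch 6 and the later improvement at epoch 7 is never reached, which is the usual trade-off of a short patience window.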

4. Methodology and Results

The systematic evaluation of the proposed models is grounded in the use of real security data provided by the Polytechnic Institute of Viseu. This dataset, characterized by its high dimensionality and complexity, allows for a direct comparison between empirical results and theoretical assumptions found in the literature, ensuring the relevance of the results in a practical application scenario.

To provide a formal mathematical foundation for our intrusion detection framework, we define the detection task as a mapping function $f: \mathcal{X} \to \mathcal{Y}$. Let $\mathcal{X} = \{x_1, x_2, \dots, x_n\}$ be the set of $d$-dimensional feature vectors extracted from network logs, where $x_i \in \mathbb{R}^d$. The target space is defined as $\mathcal{Y} = \{0, 1\}$, representing normal traffic and anomalies, respectively. The objective is to determine an optimal set of parameters $\theta$ for the model $f(x; \theta)$ that minimizes a specific loss function $L$ based on the observed data.

Given the critical nature of network security, the methodology favors Recall as the central performance metric. This choice is justified by the imperative need to minimize false negatives, since the failure to detect a real intrusion entails significantly more serious consequences than the occurrence of false positives.

Preprocessing and Feature Engineering

The original dataset, comprising multiple CSV files, presented significant challenges of high dimensionality, data type heterogeneity, and a high incidence of Not a Number (NaN) values. The data preparation process was structured to ensure data integrity and maximize the computational efficiency of the models.

Data Cleaning and Consolidation: After concatenating the records, rigorous filtering was applied based on the informational relevance of the variables. Columns in which more than 70% of the values were null were discarded, while the remaining NaN values were imputed using the mode for categorical variables and the median for numeric variables. In addition, metadata normalization was performed to remove encoding inconsistencies and invisible characters in column names, ensuring referential consistency throughout the pipeline.
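The cleaning rules above can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the function `clean_columns`, the column names, and the toy values are hypothetical, and the original pipeline most likely operated on data frames rather than plain dictionaries.

```python
from statistics import median, mode


def clean_columns(columns, max_null_fraction=0.70):
    """Drop sparse columns and impute the rest.

    `columns` maps a column name to a list of values in which None marks
    a missing entry. Columns whose share of missing values exceeds
    `max_null_fraction` are dropped. Remaining gaps are filled with the
    median (numeric columns) or the mode (all other columns).
    """
    cleaned = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        null_fraction = 1 - len(present) / len(values)
        if null_fraction > max_null_fraction or not present:
            continue
        is_numeric = all(isinstance(v, (int, float)) for v in present)
        fill = median(present) if is_numeric else mode(present)
        # Normalize the header: remove a byte-order mark and stray whitespace.
        clean_name = name.replace("\ufeff", "").strip()
        cleaned[clean_name] = [fill if v is None else v for v in values]
    return cleaned


# Hypothetical toy log table.
raw = {
    "\ufeffbytes ": [100, None, 300, 500],
    "protocol": ["tcp", "tcp", None, "udp"],
    "comment": [None, None, None, "x"],  # 75% missing, so it is dropped
}
cleaned = clean_columns(raw)
```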

Feature Engineering and Temporal Transformation: Extracting knowledge from temporal attributes was crucial for capturing the dynamic behavior of the network. Timestamp and duration variables were decomposed into discrete components (e.g., year, month, day of the week, hour, and second) and converted to the Epoch format. This process enabled the transformation of raw date/time data into numerical features interpreted by learning algorithms.
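As a minimal sketch of this decomposition, the following function turns one timestamp into numeric features. The ISO 8601 input format, the UTC interpretation of naive timestamps, and the feature names are assumptions made for illustration; the format of the institutional logs is not specified.

```python
from datetime import datetime, timezone


def temporal_features(timestamp_text):
    """Decompose an ISO 8601 timestamp into numeric features."""
    dt = datetime.fromisoformat(timestamp_text)
    if dt.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return {
        "year": dt.year,
        "month": dt.month,
        "day_of_week": dt.weekday(),  # Monday = 0, Sunday = 6
        "hour": dt.hour,
        "second": dt.second,
        "epoch": int(dt.timestamp()),  # seconds since 1970-01-01 UTC
    }


# Hypothetical timestamp, for illustration only.
features = temporal_features("2024-01-01 08:30:15")
```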

Dimensionality Control and Encoding: To mitigate memory explosion and redundancy issues, a multilevel dimensionality reduction strategy was implemented:

Identifier Filtering: Removal of high-cardinality variables with no predictive value, such as registration IDs;
Hashing Encoding: Applied to variables such as Source and Destination to map thousands of categories into a fixed and manageable vector space [18];
Rare Category Clustering: Classes with less than 0.5% representation were consolidated into a generic category, reducing noise in subsequent encoding.
One-Hot Encoding and Variance: After binarizing the categorical variables, low-variance features (where more than 99.9% of the values were identical) were eliminated, optimizing the dataset for the training phase.
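The encoding steps listed above can be sketched as follows. This is a simplified illustration, not the pipeline used in the study: the helper names, the bucket count of 16, and the toy values are hypothetical, and the hashing step uses a plain one-hot bucket rather than the signed hashing found in some library implementations.

```python
import hashlib
from collections import Counter


def group_rare(values, min_share=0.005, other="OTHER"):
    """Replace categories covering less than `min_share` of the rows with one label."""
    counts = Counter(values)
    total = len(values)
    return [v if counts[v] / total >= min_share else other for v in values]


def hash_encode(value, n_buckets=16):
    """Map a category string to a one-hot vector over a fixed number of buckets."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    vector = [0] * n_buckets
    vector[int(digest, 16) % n_buckets] = 1
    return vector


def drop_near_constant(rows, max_share=0.999):
    """Drop columns in which a single value covers more than `max_share` of the rows."""
    keep = []
    for j in range(len(rows[0])):
        top_count = Counter(row[j] for row in rows).most_common(1)[0][1]
        if top_count / len(rows) <= max_share:
            keep.append(j)
    return [[row[j] for j in keep] for row in rows], keep


# Hypothetical values, for illustration only.
services = ["http"] * 995 + ["ssh"] * 4 + ["telnet"]
grouped = group_rare(services)      # "ssh" (0.4%) and "telnet" (0.1%) are merged
encoded = hash_encode("10.0.0.7")   # fixed-length vector regardless of cardinality
```

The design point of the hashing step is that the vector length is fixed in advance, so thousands of distinct source or destination values never enlarge the feature space, at the cost of occasional collisions between unrelated values.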

Target Variable Definition and Class Imbalance: The target variable was defined based on the “Severity” column rather than the “Action” column. This methodological decision is predicated on the premise that severity reflects the intrinsic risk of the event, regardless of the system’s reactive response (blocking or acceptance). By categorizing events of ‘High’ and ‘Critical’ severity as anomalies (1) and the rest as normal traffic (0), the model can identify threats that may have bypassed existing security rules.
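A minimal sketch of this labeling rule is shown below. Only the ‘High’ and ‘Critical’ levels are taken from the text; the other severity values, the action values, and the record layout are hypothetical.

```python
ANOMALOUS_SEVERITIES = {"high", "critical"}


def severity_to_label(severity):
    """Return 1 for 'High' or 'Critical' events and 0 for everything else."""
    return 1 if severity.strip().lower() in ANOMALOUS_SEVERITIES else 0


# Hypothetical records; the 'Action' field is deliberately ignored.
events = [
    {"Severity": "Low", "Action": "accept"},
    {"Severity": "Critical", "Action": "accept"},
    {"Severity": "High", "Action": "block"},
    {"Severity": "Medium", "Action": "block"},
]
labels = [severity_to_label(event["Severity"]) for event in events]
```

The second record shows the case the methodology is designed for: a critical event that the firewall accepted is still labeled as an anomaly.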

The final dataset is extremely unbalanced, with the anomalous class representing only 0.02% of the total records. This characteristic requires rigorous evaluation metrics focused on minimizing false negatives.

The anomalous class encompasses a variety of real-world network threats, including unauthorized access attempts, brute-force attacks and port scanning. Although these events represent only 0.02% of the total records, their inclusion is vital for training models capable of identifying high-impact security breaches that often bypass traditional static rules.

5. Model Evaluation and Selection Strategy

Given the substantial volume of the dataset and the diversity of architectures explored, a phased optimization methodology was adopted to ensure the computational feasibility of the study. The experimental procedure was structured in two main stages:

Baseline Evaluation: All models were initially trained and evaluated using their default configurations. This phase enabled a direct comparison of the intrinsic performance of each algorithm relative to the specific nature of the network data, functioning as a feasibility filter to identify the most robust candidates.
Selective Optimization: Only the algorithms demonstrating the most promising and relevant performance in the context of intrusion detection, specifically the balance between Recall and latency, were selected for a subsequent phase of hyperparameter fine-tuning. This approach allows computational resources to be concentrated on architectures with the greatest potential for practical application in real-time security environments.

5.1. Supervised Machine Learning Classification

This section presents the experimental procedures conducted with supervised Machine Learning models. To formalize this approach, we define each model as a function $f(x; \theta)$ that maps the input features $x$ to a predicted label $\hat{y}$. Given the severe class imbalance (0.02% anomalies), the models are optimized using a cost-sensitive Binary Cross-Entropy loss function. This approach ensures that the minority class (intrusions) is prioritized to maximize Recall. The loss to be minimized is defined as follows:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ w \cdot y_i \log f(x_i; \theta) + (1 - y_i) \log\left(1 - f(x_i; \theta)\right) \right]$$

where $w$ represents the weight assigned to the anomalous class to prevent the decision boundary from being biased towards normal traffic.
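The effect of the class weight can be made concrete with a small sketch of the cost-sensitive loss, written as a quantity to be minimized (that is, with the leading negative sign of the standard cross-entropy). The function name, the clipping constant, and the toy predictions are assumptions for illustration.

```python
import math


def weighted_bce(y_true, y_prob, w, eps=1e-12):
    """Mean cost-sensitive binary cross-entropy.

    Positive (anomalous) samples contribute w * -log(p); negative samples
    contribute -log(1 - p). Probabilities are clipped to avoid log(0).
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(w * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical batch: one anomaly that the model scores poorly (p = 0.2).
y_true = [1, 0, 0, 0]
y_prob = [0.2, 0.1, 0.1, 0.1]
unweighted = weighted_bce(y_true, y_prob, w=1.0)
weighted = weighted_bce(y_true, y_prob, w=10.0)
```

With `w = 10` the missed anomaly dominates the loss, so an optimizer is pushed to raise its predicted probability even though anomalies are rare.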

For model optimization, specifically for XGBoost, a RandomizedSearchCV was employed, exploring 50 distinct parameter combinations with a 3-fold Stratified Cross-Validation. This rigorous process ensured that the model’s high Recall (0.96) was achieved through a robust search for the optimal trade-off between sensitivity and precision.

Each model is trained, evaluated, and compared on the same dataset and with the same metrics to ensure a consistent performance analysis.

5.1.1. Random Forest

The implementation of the Random Forest algorithm was structured to reconcile computational feasibility with statistical robustness, given the substantial volume of the dataset (2,250,300 records). The experimental protocol began with an exploratory phase using a stratified sample of 10%, allowing the sensitivity of the extracted network metrics to be validated before expanding to full training.

When processing the complete dataset, an 80/20 split was applied to the training and test sets, respectively. The preservation of the minority class proportion in both subsets was ensured through stratified sampling (‘stratify = y’). Prior to training, the numerical variables underwent a ‘StandardScaler’ process, ensuring that discrepancies in magnitude between traffic characteristics did not introduce biases during the construction of decision trees.
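The stratified 80/20 partition can be sketched without external libraries as shown below. The study itself appears to rely on a library split with the ‘stratify = y’ option; the function name, the seed, and the toy labels here are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.20, seed=42):
    """Return (train_indices, test_indices) preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut the same fraction from each.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical toy labels: 10 anomalies among 1,000 records.
labels = [1] * 10 + [0] * 990
train_idx, test_idx = stratified_split(labels)
```

Splitting each class separately guarantees that the handful of anomalies is not left out of the test set by chance, which a purely random split cannot promise at this level of imbalance.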

The strategic choice of the ‘class_weight = balanced’ parameter proved to be decisive for the effectiveness of the model. By assigning a higher penalty to classification errors in the minority class, this technique mitigated extreme imbalance without resorting to synthetic oversampling methods such as SMOTE. This choice preserves the integrity of the original data and avoids introducing artificial variance, a critical factor in institutional network security analyses where synthetic noise can mask real intrusion patterns.
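For reference, scikit-learn documents the ‘balanced’ mode as weighting each class by n_samples / (n_classes × class count). The sketch below reproduces that formula on hypothetical toy labels chosen to match the 0.02% anomaly share; the function name is an assumption.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Class weights following n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical toy labels: 2 anomalies among 10,000 records (0.02%).
labels = [0] * 9998 + [1] * 2
weights = balanced_class_weights(labels)
```

Here a misclassified anomaly weighs roughly 5,000 times more than a misclassified normal record, which is how the penalty described above arises without generating synthetic samples.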

After training, the model obtained the results displayed in Figure 1:

The results obtained revealed high effectiveness in filtering false positives, with a Precision of 0.98. The F1-score of 0.90 and the PR-AUC of 0.97 confirm the robustness of the model in distinguishing between legitimate and anomalous traffic. However, the analysis focused on Recall (0.83) identified a limitation, namely the failure to detect approximately 17% of actual intrusions (false negatives). This sensitivity deficit, although acceptable in generic contexts, is considered high for critical infrastructure cybersecurity systems, motivating the exploration of gradient boosting algorithms with greater adjustment capacity.

To address this limitation, a hyperparameter optimization phase for the Random Forest model was conducted using Grid Search. Tuning focused on key parameters such as ‘n_estimators’ (ranging from 100 to 500) and ‘max_depth’ (from 10 to 30). Despite these adjustments, the performance gains in Recall were marginal, and the model continued to struggle with the extreme class imbalance compared to boosting architectures. Consequently, the study prioritized the extensive refinement of the XGBoost and Stacked GRU models, which demonstrated a superior capacity to minimize false negatives in this specific institutional environment.

5.1.2. XGBoost

The XGBoost algorithm was selected as representative of Gradient Boosting architectures, due to its recognized efficiency in handling high-dimensional tabular data. The technical implementation followed a binary classification protocol using the ‘binary:logistic’ objective. Formally, the XGBoost model optimizes a regularized objective function $\mathrm{Obj}(\theta)$ that combines a specific loss function $L(\theta)$ with a penalty term $\Omega(\theta)$ to prevent overfitting:

$$\mathrm{Obj}(\theta) = \sum_{i} L(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)$$

where $L$ represents the cost-sensitive Binary Cross-Entropy and $\Omega$ penalizes the complexity of the regression trees $f_k$.

To control complexity and prevent overfitting, the learning rate was set to 0.1, and training was monitored via the logloss metric over 100 boosting iterations.

Given the severely unbalanced nature of the dataset, the model configuration focused on the cost-sensitive learning technique. To this end, the ‘scale_pos_weight’ parameter was implemented, with its value calculated as the ratio between the normal and anomalous classes. This parameterization allows the algorithm to adjust its decision boundary, assigning greater importance to the correct classification of the minority class, without requiring external data manipulation, such as oversampling.
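As a rough worked illustration, and assuming the training split preserves the class proportions reported for the test set in this paper (48 anomalies among 450,060 records), the resulting weight would be on the order of:

```latex
\text{scale\_pos\_weight}
  = \frac{N_{\text{normal}}}{N_{\text{anomalous}}}
  \approx \frac{450{,}060 - 48}{48}
  \approx 9375
```

The exact value used in the experiments is not reported; this figure only indicates the order of magnitude implied by the imbalance.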

To optimize performance, a RandomizedSearchCV with 3-fold Stratified Cross-Validation was employed. Furthermore, a feature importance analysis based on SHAP principles was conducted, revealing that Destination Port, Protocol, and Flow Duration were the most decisive features. This interpretability layer ensures that model predictions are congruent with established network security heuristics.

After training, the performance metrics obtained for the XGBoost algorithm, presented in Figure 2, were as follows:

5.1.3. Stochastic Gradient Descent

The exploration of linear decision boundaries using Support Vector Machines (SVM) proved computationally infeasible given the substantial volume of data (approx. 1.8 million samples). Traditional SVM algorithms, such as Sequential Minimal Optimization (SMO), have a complexity that typically scales between $O(n^{2})$ and $O(n^{3})$, requiring prohibitive RAM resources and processing times.

To circumvent this limitation, we opted for Stochastic Gradient Descent (SGD), configured with a Hinge loss function to emulate a linear SVM classifier. Unlike batch methods, SGD updates the model parameters iteratively using individual samples, reducing the complexity to $O(n)$. This approach ensures the scalability needed to handle the high dimensionality of the dataset without compromising memory efficiency.
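The per-sample update that gives SGD its linear cost per pass can be sketched as follows. This is a didactic implementation of SGD on the L2-regularized hinge loss, not the classifier used in the study; the learning rate, regularization strength, epoch count, and toy data are hypothetical.

```python
import random


def train_hinge_sgd(X, y, epochs=20, lr=0.1, alpha=0.01, seed=0):
    """Train a linear classifier with per-sample SGD on the L2-regularized hinge loss.

    Labels must be -1 or +1. Each epoch visits every sample once, so the
    cost of one pass grows linearly with the number of samples.
    """
    rng = random.Random(seed)
    weights = [0.0] * len(X[0])
    bias = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = sum(w * v for w, v in zip(weights, X[i])) + bias
            margin = y[i] * score
            # Gradient of the L2 penalty: shrink the weights on every step.
            weights = [w * (1 - lr * alpha) for w in weights]
            if margin < 1:
                # Hinge loss is active: push the boundary away from this sample.
                weights = [w + lr * y[i] * v for w, v in zip(weights, X[i])]
                bias += lr * y[i]
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 depending on the side of the decision boundary."""
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias >= 0 else -1


# Hypothetical, linearly separable toy data.
X = [[2.0, 2.0], [3.0, 2.5], [2.5, 3.0], [-2.0, -2.0], [-3.0, -2.5], [-2.5, -3.0]]
y = [1, 1, 1, -1, -1, -1]
weights, bias = train_hinge_sgd(X, y)
predictions = [predict(weights, bias, x) for x in X]
```

Only one sample is held in the update at a time, which is why memory use stays flat as the dataset grows.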

As a result, the model after training achieved the following results, presented in Figure 3:

The results obtained by the SGD classifier reveal a statistical phenomenon highly relevant to intrusion detection. The model achieved a Recall of 1.00, successfully identifying all 48 anomalies in the test set. However, this performance was accompanied by an extremely low Precision of 0.02, due to the generation of 2623 false positives.

This discrepancy highlights the inherent risk of relying on ROC-AUC as the sole evaluation metric in cybersecurity. Although the ROC-AUC of 0.993 suggests an almost perfect theoretical separation, the metric is insensitive to the impact of false positives when the minority class is extremely small. The low F1-score (0.04) and PR-AUC (0.20) confirm that, despite its overall sensitivity, the model lacks specificity for practical implementation in a real monitoring system.
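The gap between the two views of the same classifier can be reproduced from the reported counts. The sketch below assumes a test set of 450,060 records (the size given elsewhere in this paper) in order to derive the number of true negatives; the function name is hypothetical.

```python
def detection_metrics(tp, fp, fn, tn):
    """Precision, Recall, F1 and false-positive rate from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Counts reported for the SGD classifier: 48 true positives, 2623 false positives.
# Assumption: true negatives are derived from a 450,060-record test set.
test_size = 450_060
tp, fp, fn = 48, 2623, 0
tn = test_size - tp - fp - fn
sgd = detection_metrics(tp, fp, fn, tn)
```

Under this assumption the false-positive rate stays below 1%, which is what the ROC curve rewards, while fewer than 2 in 100 alerts correspond to a real intrusion, which is what Precision and PR-AUC expose.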

5.1.4. Supervised Machine Learning—Analytical Comparison

The comparative analysis of supervised models, summarized in Table 2, shows a clear dichotomy between theoretical sensitivity and practical applicability. The SGD classifier, despite reaching the maximum Recall limit (1.00), proved to be operationally infeasible. The excessive generation of false alerts (2623 occurrences) would result in “alert fatigue” for cybersecurity teams, neutralizing the practical usefulness of the detection system. The F1-score and PR-AUC metrics corroborate this inadequacy, demonstrating that total sensitivity does not compensate for the critical loss of Precision.

Regarding the balance between tree-based models, there is a distinct technical trade-off between Random Forest and XGBoost:

Random Forest favored specificity, presenting the lowest false alarm rate but failing to identify 8 critical anomalies (false negatives).
XGBoost demonstrated superior sensitivity, capturing 46 of the 48 real threats. Although this performance resulted in a residual increase in false positives (11 cases), the operational cost of this inaccuracy is largely offset by the benefits of risk mitigation.

In conclusion, XGBoost’s Recall is 12.5 percentage points higher than that of Random Forest (46 versus 40 of the 48 anomalies detected), which positions it as the most resilient solution for the domain under study. In critical infrastructure contexts, the overriding priority is threat visibility. Therefore, XGBoost’s robustness in reducing false negatives provides a decisive strategic advantage for integration into proactive monitoring systems.

5.1.5. XGBoost Optimization

After identifying XGBoost as the most promising model, a systematic optimization phase was conducted to maximize the balance between Precision and Recall. To ensure computational feasibility given the vast dimensionality, the Randomized Search Cross-Validation (RandomizedSearchCV) technique was employed, exploring 50 different combinations of hyperparameters. The process was supported by stratified cross-validation (Stratified K-Fold, K = 3), ensuring that the representativeness of the minority class was preserved across all partitions. The target metric for optimization was the F1-score of the minority class, as it represents the most robust compromise between detection sensitivity and operational reliability.
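The search procedure can be outlined with a standard-library sketch: 50 distinct configurations are drawn at random and each is scored on 3 stratified folds. The search space below is hypothetical, since the tuned hyperparameters and their ranges are not listed, and `score_fn` stands in for a routine that would train on the remaining folds and return the minority-class F1-score on the given validation fold.

```python
import random

# Hypothetical search space; the paper does not list the tuned ranges.
SEARCH_SPACE = {
    "max_depth": [3, 4, 6, 8, 10],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "n_estimators": [100, 200, 300, 500],
    "subsample": [0.6, 0.8, 1.0],
}


def sample_configurations(space, n_iter=50, seed=0):
    """Draw `n_iter` distinct hyperparameter combinations uniformly at random."""
    rng = random.Random(seed)
    seen, configs = set(), []
    while len(configs) < n_iter:
        config = {name: rng.choice(options) for name, options in space.items()}
        key = tuple(sorted(config.items()))
        if key not in seen:
            seen.add(key)
            configs.append(config)
    return configs


def stratified_folds(labels, k=3, seed=0):
    """Assign each sample index to one of `k` folds, spreading every class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def random_search(space, labels, score_fn, n_iter=50, k=3):
    """Return the configuration with the best mean score over the validation folds."""
    folds = stratified_folds(labels, k)
    best_config, best_score = None, float("-inf")
    for config in sample_configurations(space, n_iter):
        mean_score = sum(score_fn(config, fold) for fold in folds) / k
        if mean_score > best_score:
            best_config, best_score = config, mean_score
    return best_config, best_score
```

Sampling 50 of the 240 combinations in this toy space costs about a fifth of an exhaustive grid, which is the feasibility argument made above.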

As shown in Table 3, the comparison between the baseline model and the optimized version reveals that high detection capacity was maintained, with Recall set at 0.96 (identification of 46 out of 48 anomalies). This result confirms that the refinement of the parameters did not compromise the fundamental sensitivity of the algorithm.

The most significant gain was observed in Precision, which rose from 0.81 to 0.85. This increase translates into a reduction of approximately 27% in the volume of false positives (from 11 to 8 occurrences). In a production environment, this refinement is crucial, as it reduces the cognitive load on security analysts and reinforces confidence in the alerts generated by the system. The F1-score followed this evolution, reaching 0.90, while the ROC-AUC and PR-AUC metrics remained at excellent levels, validating the stability and generalization of the final optimized model.
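These figures can be reproduced from the reported counts, assuming 46 true positives in both configurations (consistent with the unchanged Recall):

```latex
\begin{align*}
\text{Precision}_{\text{baseline}} &= \frac{46}{46 + 11} \approx 0.81, &
\text{Precision}_{\text{optimized}} &= \frac{46}{46 + 8} \approx 0.85, \\
\text{FP reduction} &= \frac{11 - 8}{11} \approx 27\%, &
\text{F1}_{\text{optimized}} &= \frac{2 \cdot 0.85 \cdot 0.96}{0.85 + 0.96} \approx 0.90 .
\end{align*}
```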

The superiority of the optimized XGBoost model is demonstrated by its practical impact on threat visibility within the institutional network. While the Random Forest model failed to identify 8 critical anomalies (False Negatives), the optimized XGBoost reduced this gap significantly, capturing 46 out of 48 real threats. In the context of critical infrastructure cybersecurity, this increase of 12.5 percentage points in Recall (from 40 to 46 of the 48 anomalies) is more decisive than global statistical metrics, as each undetected ‘High/Critical’ event represents a high-risk security breach.

5.2. Unsupervised Machine Learning

This section presents the experimental procedures performed with the Unsupervised Machine Learning models. These algorithms were applied under the same methodological conditions defined for the previous models, enabling an evaluation to identify anomalous patterns without relying on labeled data.

In this unsupervised context, the objective is to model the underlying distribution $P(X)$ of the network data without the guidance of prior labels. Anomaly detection is mathematically formulated as identifying observations $x_i$ that significantly deviate from the learned nominal clusters or density regions. Formally, we define an anomaly score $S(x)$, where an alert is triggered if $S(x) > \lambda$, with $\lambda$ being a decision threshold derived from the statistical properties of the legitimate traffic distribution.

5.2.1. Isolation Forest

The unsupervised approach began with the Isolation Forest algorithm, aiming to identify divergent patterns without the guidance of the target variable. Unlike supervised Machine Learning models, this architecture is based on the premise that anomalies are rare observations and are more susceptible to isolation within decision tree structures.

Model initialization included 100 estimators, and a sensitivity analysis focused on the ‘contamination’ parameter, which defines the expected proportion of outliers. Initially, the actual anomaly ratio (approximately 0.000106 as a fraction, i.e., about 0.01% of the records) was applied, resulting in a conservative convergence that classified all records as legitimate traffic. To diagnose the model’s response to different decision thresholds, the contamination rate was raised to 0.01 (1%), forcing a more comprehensive decision boundary.

The results demonstrated a critical flaw in the applicability of this architecture to the institutional dataset. The model recorded Precision, Recall, and F1-score values of 0.00, failing to identify any of the 48 anomalies present in the test set. Simultaneously, the generation of 50 false positives introduced unjustified noise into the system. The PR-AUC value (0.0004) confirms the practical inability of the algorithm to establish an effective anomaly score.

After completing the training, the algorithm was evaluated on the test set, with the performance metrics detailed in Table 4:

The analysis also revealed a discrepancy between the configured contamination parameter (1%) and the volume of instances isolated (50 out of 450,060). This behavior suggests that the anomalies present in the network logs do not manifest themselves as obvious global outliers in terms of distance or isolation, but rather as subtle patterns integrated into the normal traffic density. In summary, Isolation Forest’s inability to detect any anomalies, coupled with the generation of false positives, makes it infeasible for the research objective.

5.2.2. K-Means

The application of the K-means algorithm was based on spatial density analysis, starting from the premise that instances significantly distant from the centroids of the dominant clusters represent anomalous behavior. Given the sensitivity of the technique to the specification of the number of groups ($k$), an exploratory strategy was implemented, varying the granularity of the grouping to assess the ability to isolate threats in the feature hyperspace.

Initially, $k = 8$ was used to consolidate legitimate traffic into macro behavioral patterns. Given the low sensitivity observed, the hyperparameter was increased to $k = 50$, allowing for more detailed segmentation of the data space and testing the hypothesis that anomalies would be associated with small residual clusters. Anomaly classification was determined by calculating the Euclidean distance between each observation and its respective centroid, utilizing decision thresholds based on the 99.99% and 99.9% distance percentiles.
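The distance-based decision rule can be sketched as follows. The centroids are fixed by hand here, whereas in the study they result from fitting K-means; the helper names, the nearest-rank percentile definition, and the toy points are assumptions for illustration.

```python
import math


def nearest_centroid_distance(x, centroids):
    """Euclidean distance from `x` to its closest centroid."""
    return min(math.dist(x, c) for c in centroids)


def percentile_threshold(values, percentile):
    """Nearest-rank percentile of `values`."""
    ordered = sorted(values)
    # Rounding guards against floating-point noise before taking the ceiling.
    rank = max(1, math.ceil(round(percentile / 100 * len(ordered), 6)))
    return ordered[rank - 1]


def flag_anomalies(points, centroids, percentile=99.9):
    """Flag points whose centroid distance exceeds the chosen percentile."""
    distances = [nearest_centroid_distance(p, centroids) for p in points]
    threshold = percentile_threshold(distances, percentile)
    return [d > threshold for d in distances], threshold


# Hypothetical data: 999 points close to two centroids and one distant point.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.0, 0.1)] * 500 + [(10.0, 10.1)] * 499 + [(50.0, 50.0)]
flags, threshold = flag_anomalies(points, centroids, percentile=99.9)
```

The rule always flags a fixed share of the data, so it only detects intrusions when they are also the most distant points, which is the premise the experiments put to the test.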

The performance metrics resulting from these two configurations are presented in Table 5.

The experimental results demonstrated the ineffectiveness of this approach for the cybersecurity scenario under study. Regardless of the granularity applied, the model recorded zero Precision, Recall, and F1-score values, failing to identify any of the 48 real threats present in the test set. The increase in the number of clusters only resulted in a substantial increase in false positives, without any gain in the detection rate.

Visualization using Principal Component Analysis (PCA), illustrated in Figure 4, corroborated the failure of the algorithm, demonstrating that anomalies (representing real attack vectors) do not manifest as isolated points, but are scattered non-linearly among legitimate traffic data. The low performance of PR-AUC (0.0001) and a ROC-AUC close to a random classifier suggest that the centroid distance premise is insufficient to capture the complexity of institutional intrusions, where malicious patterns statistically mimic the structure of normal traffic.

5.2.3. DBSCAN

The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm was explored for its theoretical ability to identify clusters of arbitrary geometry and isolate low-density points, categorizing them as noise (representing potential anomalies). The effectiveness of this approach critically depends on the definition of the neighborhood radius (‘eps’) and the minimum number of points required for forming dense clusters (‘min_samples’).

Given the high dimensionality of the dataset (297 features), a two-step evaluation was performed to determine the feasibility of the model:

Restrictive Configuration: ‘min_samples = 594’ was defined, based on the heuristic of $2 \times \text{number of features}$, with ‘eps = 0.5’. This parameterization proved to be overly conservative, resulting in the total fragmentation of the data, where almost all observations were labeled as noise.
Flexible Configuration: To promote the convergence of legitimate clusters, ‘min_samples’ was reduced to 30. The objective was to assess whether a less demanding density definition would allow normal traffic to be isolated, reducing the volume of false positives.
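The effect of ‘min_samples’ on noise labeling can be illustrated with a brute-force sketch of the DBSCAN noise rule. It assumes the usual definition (a point is noise when it is neither a core point nor within ‘eps’ of a core point, with each point counted in its own neighborhood); the function name `dbscan_noise_mask` and the toy data are hypothetical, and the pairwise distance matrix makes it suitable only for small samples.

```python
import numpy as np

def dbscan_noise_mask(X, eps, min_samples):
    """Return True for points that DBSCAN would label as noise (brute force)."""
    # Pairwise Euclidean distances between all observations
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    neighbours = dist <= eps                          # each point counts itself
    is_core = neighbours.sum(axis=1) >= min_samples
    # A point escapes the noise label if it is a core point or lies within eps of one
    near_core = (neighbours & is_core[None, :]).any(axis=1)
    return ~near_core

# Toy data (hypothetical): 50 evenly spaced points along a line in 3 dimensions
X = np.column_stack([np.linspace(0.0, 1.0, 50)] * 3)

relaxed = dbscan_noise_mask(X, eps=0.5, min_samples=6)    # 2 x 3 features
strict = dbscan_noise_mask(X, eps=0.5, min_samples=40)    # more than any neighbourhood holds
```

In this toy case the relaxed setting labels no point as noise, while the strict setting labels every point as noise because no neighborhood of radius 0.5 contains 40 observations. This is the same kind of fragmentation reported for the restrictive configuration.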

The results obtained, summarized in Table 6, were as follows:

The results show that DBSCAN, in both configurations, was unable to establish a useful distinction between normal traffic and anomalies. Although the model recorded a Recall of 1.00, this value is of no practical relevance, as zero Precision indicates that the algorithm classified approximately 1.78 million normal instances as anomalies in the restrictive configuration and 1.66 million in the flexible one.
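As a worked check of why Precision rounds to zero, the values in Table 6 can be inserted into the standard definition. The true-positive count of 48 is inferred here from the Recall of 1.00, the zero false negatives, and the 48 threats in the test set; it is not listed explicitly in the table.

```latex
\text{Precision}_{594} = \frac{TP}{TP + FP} = \frac{48}{48 + 1{,}783{,}931} \approx 2.7 \times 10^{-5},
\qquad
\text{Precision}_{30} = \frac{48}{48 + 1{,}662{,}820} \approx 2.9 \times 10^{-5}
```

Both values round to 0.00 at two decimal places, which is how a perfect Recall coexists with zero Precision.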

This systematic failure suggests that the structure of the network logs under study exhibits high dispersion in the feature hyperspace. The inability to form stable clusters, even with the reduction in the ‘min_samples’ parameter, indicates that the data does not exhibit the local density necessary for the effectiveness of algorithms based on spatial connectivity. Consistent with the observations in the SGD model, the unbalanced performance of DBSCAN makes its operational application infeasible, reinforcing the premise that detection in these environments requires models capable of learning more complex decision boundaries or compressed latent representations.

5.3. Supervised Deep Learning

This section addresses the practical implementation of Supervised Deep Learning models. The training, validation, and evaluation processes of the different architectures are described, ensuring methodological uniformity with the Machine Learning approaches.

Deep Learning, specifically Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), extends the supervised framework by modeling temporal dependencies through hidden states $h_{t}$. In its basic recurrent form, the transition function is defined as $h_{t} = \sigma(W x_{t} + U h_{t-1} + b)$, where $W$ and $U$ are weight matrices optimized during training and $b$ represents the bias vector. Like the Machine Learning models previously discussed, these neural networks minimize the Binary Cross-Entropy loss by leveraging Backpropagation Through Time to update parameters across sequential steps.
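The transition function above can be sketched in a few lines of NumPy. This is only the basic recurrent update; LSTM and GRU cells add gating on top of it. The layer sizes and random weights below are hypothetical placeholders rather than values from the study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recurrent_step(x_t, h_prev, W, U, b):
    """One hidden-state update: h_t = sigma(W x_t + U h_{t-1} + b)."""
    return sigmoid(W @ x_t + U @ h_prev + b)

# Hypothetical sizes: 4 input features, 3 hidden units, 5 time steps
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))     # input-to-hidden weights
U = rng.normal(size=(3, 3))     # hidden-to-hidden weights
b = np.zeros(3)                 # bias vector

h = np.zeros(3)                 # initial hidden state h_0
sequence = rng.normal(size=(5, 4))
for x_t in sequence:
    h = recurrent_step(x_t, h, W, U, b)   # h carries information across steps
```

Each step reuses the previous hidden state, which is what allows information from earlier events to influence later outputs.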

5.3.1. LSTM

The LSTM architecture was selected for its intrinsic ability to model sequential dependencies in data flows and is widely recognized in the literature for its effectiveness in network log analysis. For implementation, the data was transformed from a two-dimensional structure to a three-dimensional format (samples, time windows, features), which is an essential requirement for processing in recurrent layers.
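The two-dimensional to three-dimensional transformation can be sketched as a plain reshape. The helper name `to_recurrent_input` and the toy matrix are hypothetical, and grouping consecutive rows into non-overlapping windows is an assumption made for illustration.

```python
import numpy as np

def to_recurrent_input(X_2d, n_timesteps=1):
    """Reshape (samples, features) into (samples, n_timesteps, features)."""
    n_rows, n_features = X_2d.shape
    if n_rows % n_timesteps != 0:
        raise ValueError("number of rows must be divisible by n_timesteps")
    # Consecutive rows are grouped into non-overlapping windows
    return X_2d.reshape(n_rows // n_timesteps, n_timesteps, n_features)

X = np.arange(12, dtype=np.float32).reshape(4, 3)   # 4 events, 3 features (toy sizes)
X_seq = to_recurrent_input(X)                        # shape (4, 1, 3)
```

With a window of one time step, each event becomes its own length-one sequence, so the values are unchanged and only an axis is added.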

To optimize computational efficiency and minimize processing latency, a unitary time window (‘n_timesteps = 1’) was defined. This configuration allows the model to process events atomically, while maintaining compatibility with the LSTM structure. The sequential architecture consisted of:

Recurrent Layer: 64 LSTM units for latent pattern extraction;
Regularization: A Dropout layer (0.2) to prevent overfitting through stochastic omission of neurons;
Output Layer: A densely connected neuron with a Sigmoid activation function for binary classification.

Training was optimized by applying class weights to compensate for the extreme imbalance in the dataset. The Adam optimizer and the Binary Cross-Entropy loss function were used, with GPU acceleration support. To ensure generalization, an Early Stopping mechanism was incorporated with a patience of 3 epochs, monitoring the validation loss.
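The text does not state how the class weights were derived. One common choice, shown here purely as an assumption, is the "balanced" heuristic that weights each class by `n_samples / (n_classes * class_count)`. The helper name and the toy label vector are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy label vector (hypothetical): 9,990 normal events and 10 anomalies
y_train = [0] * 9990 + [1] * 10
weights = balanced_class_weights(y_train)   # the anomaly class receives weight 500.0
```

A dictionary of this form can be passed to the `class_weight` argument of Keras `Model.fit`, so that errors on the rare class contribute proportionally more to the loss.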

After model training, the following results, illustrated in Figure 5, were obtained:

The LSTM model demonstrated high robustness, achieving a Recall of 0.96 (identification of 46 out of 48 anomalies). This performance matches the sensitivity of the XGBoost model, confirming the effectiveness of neural networks in detecting rare threats.

However, an operational trade-off was observed in Precision (0.59), which resulted in a higher volume of false positives compared to ensemble models. This compromise is characteristic of deep models driven by class weights, which tend to be more aggressive in signaling anomalies. Despite the lower Precision, the ROC-AUC (1.00) and PR-AUC (0.9473) values validate LSTM as a powerful tool for security systems that prioritize total visibility over the infrastructure, even at the cost of increased alert screening requirements.
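The reported metrics can be reproduced from the confusion-matrix counts. The sketch below uses the 46 detected and 2 missed anomalies stated above, together with the 32 false positives listed for the LSTM in Table 7; the helper name is hypothetical.

```python
def classification_metrics(tp, fp, fn):
    """Precision, Recall and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# LSTM counts: 46 true positives, 32 false positives, 2 false negatives
precision, recall, f1 = classification_metrics(tp=46, fp=32, fn=2)
print(round(precision, 2), round(recall, 2), round(f1, 2))   # prints 0.59 0.96 0.73
```

The same helper applies to any model in this section, since every table reports the false-positive and false-negative counts alongside the derived scores.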

5.3.2. BERT

The integration of the Bidirectional Encoder Representations from Transformers (BERT) model was planned to establish a performance benchmark based on large-scale pre-trained models. BERT is widely recognized for its ability to capture bidirectional semantic context, offering a theoretical advantage over traditional sequential architectures for classifying complex patterns in network logs.

However, the practical implementation of this architecture revealed critical technical constraints concerning interoperability between Deep Learning frameworks. The execution of the ‘TFBertForSequenceClassification’ class within the TensorFlow environment depends on the dynamic conversion of pre-trained weights originating from PyTorch 2.2.0. During this process, structural incompatibilities were identified between the Transformers library and recent versions of Keras (v3.x), resulting from the deprecation of low-level functions in the library’s backend.

Mitigation attempts, including reverting to legacy versions of TensorFlow and using the ‘tf-keras’ compatibility package, proved infeasible due to dependency conflicts in the operating system and hardware infrastructure used.

A definitive resolution of these anomalies would require the complete migration of the experimental pipeline to PyTorch 2.2.0. However, such restructuring was ruled out for methodological reasons. The framework transition would introduce external variables that could skew the direct comparison between models. This would prevent a reliable assessment of whether performance variations stemmed from the model architecture or the intrinsic optimizations of the execution platform.

In summary, the exclusion of BERT is grounded in preserving the methodological integrity of the study while documenting the real portability challenges encountered when integrating cutting-edge models into rapidly evolving development ecosystems. Using the TensorFlow/Keras backend for all compared architectures ensures that performance variations are attributed to the algorithmic nature of each model rather than to framework-specific optimizations, even at the cost of excluding heterogeneous Transformer-based models.

5.3.3. GRU

The GRU model was implemented as a streamlined alternative to the LSTM architecture, aiming to assess the impact of a faster recurrent structure on intrusion detection. The motivation for choosing GRU lies in its computational efficiency: by consolidating the gating mechanisms into Update and Reset gates, the architecture reduces the number of trainable parameters, accelerating convergence without compromising the ability to model long-term temporal dependencies.
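The parameter reduction can be quantified with the textbook formulas for a single recurrent layer. The sketch assumes that the layer input has the 297 features reported for the dataset and uses the 64 units of the architectures in this section; both function names are hypothetical.

```python
def lstm_param_count(input_dim, units):
    # Four blocks (input gate, forget gate, cell candidate, output gate),
    # each with input weights, recurrent weights and one bias vector
    return 4 * (units * (input_dim + units) + units)

def gru_param_count(input_dim, units):
    # Three blocks (update gate, reset gate, candidate state),
    # textbook formulation with one bias vector per block
    return 3 * (units * (input_dim + units) + units)

n_features = 297   # dataset dimensionality; assumed here to be the layer input size
lstm_params = lstm_param_count(n_features, 64)   # 92,672
gru_params = gru_param_count(n_features, 64)     # 69,504
```

Under these assumptions the GRU layer has exactly three quarters of the LSTM parameters. Implementations that keep a second bias vector per block, as the default Keras GRU variant does, would report a slightly higher GRU count.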

To ensure experimental comparability, GRU adopted a sequential structure equivalent to that of the LSTM model:

Processing Layer: 64 GRU units for sequential feature extraction;
Regularization: A dropout layer at 0.2 for overfitting mitigation;
Classifier: A dense layer with Sigmoid activation for generating binary probabilities.

Based on this process, the results obtained are presented in Figure 6:

5.3.4. Supervised Deep Learning—Analytical Comparison

As shown in Table 7, a comparative analysis between the LSTM and GRU architectures reveals distinct approaches to managing the trade-off between sensitivity and specificity. The GRU model proved to be the more balanced architecture, achieving an F1-score of 0.80 against 0.73 for the LSTM, with 19 false positives and 3 false negatives. This balance positions the GRU as the more consistent classifier for environments where alert reliability is critical. Its PR-AUC value (0.9440) corroborates its ability to maintain selectivity even under the pressure of an extremely unbalanced dataset.

On the other hand, the LSTM stood out for its slightly higher Recall, identifying 46 of the 48 real threats (96% capture effectiveness) against 45 for the GRU. However, this sensitivity comes at a cost in Precision (0.59), with 32 false alarms that may increase the cognitive load on security analysts. The LSTM's PR-AUC (0.9473) confirms that the architecture is effective at extracting minority class features, prioritizing threat visibility over classification precision.

5.3.5. GRU Optimization

The refinement of the GRU model aimed to balance the trade-off between Precision and Recall, mitigating the discrepancy observed in the baseline configuration. The optimization strategy evolved from a simple architecture to a stacked layer configuration (Stacked GRU), designed to enhance the extraction of hierarchical features in complex time series.

The new architecture was structured into two successive layers:

Primary Layer: 64 GRU units with ‘return_sequences = True’, allowing the propagation of the temporal dimension to the next level;
Secondary Layer: 32 GRU units, focused on distilling learned temporal dependencies and reducing latent dimensionality.

To prevent overfitting in this deeper structure, Dropout layers were integrated between the recurrent blocks. The training protocol remained consistent, using the Adam optimizer, the Binary Cross-Entropy loss function, and the application of class weights to compensate for severe data imbalance. The process was monitored via Early Stopping to ensure optimal convergence and computational resource efficiency.
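The Early Stopping rule can be sketched independently of any framework. The patience of 3 epochs is the value reported for the LSTM and is assumed here to apply to the stacked GRU as well; the function name and the validation-loss curve are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation-loss curve (epochs are indexed from 0)
history = [0.40, 0.31, 0.27, 0.28, 0.29, 0.30, 0.26]
stopped_at, best = early_stopping_epoch(history, patience=3)   # (5, 2)
```

Training halts at epoch 5 after three epochs without improvement, and the final value of the curve is never reached, which illustrates how the patience setting trades a possible later improvement for shorter training.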

The results, detailed in Table 8, show a significant reconfiguration of the metric balance. Precision demonstrated a remarkable improvement, rising from 0.70 to 0.91. In practical terms, this improvement translates into a drastic reduction in false alarms (from 19 to only 4 occurrences), which minimizes security analyst fatigue. The F1-score followed this trend, rising from 0.80 to 0.88.

Although Recall decreased (from 0.94 to 0.85), resulting in 7 false negatives compared to the original 3, the optimized model established a higher level of alert reliability. The stable ROC-AUC (1.00) and the still-high PR-AUC (0.9253, down from 0.9440) indicate that the stacked architecture remains suitable for production environments, where alert accuracy is critical to the effectiveness of incident response plans.

5.4. Unsupervised Deep Learning

In this section, experiments concerning unsupervised Deep Learning models are conducted. The configuration, training, and analysis steps are presented to evaluate their practical effectiveness within the context of the dataset under study.

For deep unsupervised models, the goal is to learn a compressed latent representation $z$ that captures the nominal distribution of network events. The detection criterion is formally based on the Reconstruction Error $R(x)$, calculated as the squared $L_{2}$ norm between the input $x$ and its reconstructed version $\hat{x}$:

$$R(x) = \lVert x - f_{dec}(f_{enc}(x)) \rVert^{2}$$

High values of $R(x)$ indicate that the network log entry does not conform to the learned patterns of legitimate activity, thereby flagging it as a potential intrusion.

5.4.1. VAE

The exploration of unsupervised methods based on Deep Learning focused on the Variational Autoencoder (VAE). The objective was to learn a compact, probabilistic representation of network ‘normality’, using reconstruction error as a discriminatory metric for intrusion detection.

The model was trained exclusively with instances of the normal class, as this subsample defines the nominal behavior of the system. The implemented architecture followed the classic encoder–decoder structure:

Encoder: Maps input features to the parameters of a Gaussian distribution in a lower-dimensional latent space;
Latent Space: Acts as an information bottleneck, forcing the compression of essential attributes;
Decoder: Reconstructs the original data from samples in the latent space.
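The probabilistic step between encoder and decoder can be sketched with the standard VAE formulation: the reparameterization trick for sampling and the closed-form KL term against a standard normal prior. The source does not confirm these exact details, so they are assumptions; the batch and latent sizes are hypothetical.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps drawn from N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, I), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(0)
mu = np.zeros((2, 4))        # hypothetical encoder output: 2 events, 4 latent dimensions
log_var = np.zeros((2, 4))   # a log-variance of 0 corresponds to unit variance

z = sample_latent(mu, log_var, rng)       # latent samples passed to the decoder
kl = kl_to_standard_normal(mu, log_var)   # zero when the posterior equals the prior
```

In a standard VAE the training loss adds this KL term to the reconstruction error, which keeps the latent space close to the prior while the bottleneck forces compression.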

The anomaly detection mechanism in the VAE is formally grounded in the Reconstruction Error $R(x)$. This metric quantifies the divergence between the input vector $x$ and its reconstructed version $\hat{x}$ generated by the decoder. Mathematically, it is defined as the squared $L_{2}$ norm:

$$R(x) = \lVert x - f_{dec}(f_{enc}(x)) \rVert^{2}$$

In this theoretical framework, the model assumes that anomalous network events do not conform to the learned latent distribution of ‘normality’. Consequently, an intrusion is identified whenever $R(x) > \lambda$, where $\lambda$ is a predefined statistical threshold derived from the reconstruction residuals of the training set.
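The decision rule can be sketched as follows. The function `toy_reconstruct` is a hypothetical stand-in for the trained encoder and decoder, and the data and percentile are illustrative only.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Squared L2 norm between each input row and its reconstruction."""
    return np.sum((x - x_hat) ** 2, axis=1)

def fit_threshold(train_errors, percentile=99.0):
    """Threshold lambda taken from the training-set error distribution."""
    return np.percentile(train_errors, percentile)

def toy_reconstruct(x):
    # Hypothetical stand-in for f_dec(f_enc(x)): only the first feature is preserved
    x_hat = np.zeros_like(x)
    x_hat[:, 0] = x[:, 0]
    return x_hat

x_train = np.array([[1.0, 0.1], [2.0, 0.0], [3.0, 0.2], [4.0, 0.1]])   # normal events only
x_test = np.array([[2.5, 0.1], [2.5, 3.0]])

# The 100th percentile is used only because the toy training set has four rows
lam = fit_threshold(reconstruction_error(x_train, toy_reconstruct(x_train)), percentile=100)
is_intrusion = reconstruction_error(x_test, toy_reconstruct(x_test)) > lam
```

Only the second test row is flagged, because its second feature is far outside what the stand-in model can reconstruct. The rule fails when anomalies are reconstructed as well as normal events.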

Training was conducted with the Adam optimizer for 100 epochs, using dense layers of 128 neurons to capture subtle variations in the data distribution. Detection was based on the definition of thresholds applied to the reconstruction error, calculated from the error percentiles of the training set.

After training, the model was evaluated on the test set, obtaining the following results, as shown in Figure 7:

A detailed inspection of the Reconstruction Error $R(x)$ for the VAE model revealed a near-total overlap between the distributions of legitimate and anomalous traffic. This residual convergence indicates that the features extracted from the IPV logs resulted in a latent representation where anomalies statistically mimic nominal behavior.

Consequently, the selection of any decision threshold $\lambda$ based on error percentiles (e.g., 95th or 99th percentile) resulted in a prohibitive False Positive Rate (FPR).

Specifically, to achieve a target Recall (TPR) of 0.95, the overlap of the error distributions forced an FPR close to 1.0, confirming the operational ineffectiveness of this unsupervised approach for this specific dataset.
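The link between overlapping error distributions and a high FPR can be illustrated with a short sketch. The scores below are synthetic and fully interleaved by construction; they are not the study's reconstruction errors, and the function name is hypothetical.

```python
import numpy as np

def fpr_at_target_recall(normal_scores, anomaly_scores, target_recall=0.95):
    """Choose the highest threshold that reaches the target recall and report the FPR."""
    # A record is flagged when its score is at or above the threshold
    sorted_anomalies = np.sort(anomaly_scores)[::-1]                  # descending order
    n_needed = int(np.ceil(target_recall * len(sorted_anomalies) - 1e-9))
    threshold = sorted_anomalies[n_needed - 1]
    fpr = float(np.mean(normal_scores >= threshold))
    return threshold, fpr

# Synthetic, fully interleaved scores (hypothetical)
normal_scores = np.arange(100, dtype=float)       # 0, 1, ..., 99
anomaly_scores = np.arange(0, 100, 5) + 0.5       # 0.5, 5.5, ..., 95.5

threshold, fpr = fpr_at_target_recall(normal_scores, anomaly_scores, target_recall=0.95)
```

Capturing 19 of the 20 synthetic anomalies requires lowering the threshold to 5.5, at which point 94 of the 100 normal records are flagged as well. No percentile choice avoids this when the two distributions coincide.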

5.4.2. LogBERT

The LogBERT architecture was considered a natural extension to this study, representing the state of the art in specializing Transformer-based models for the log domain. Its conceptual advantage lies in the use of Self-Attention mechanisms and the Masked Language Modeling (MLM) task, which enables the capture of long-range contextual dependencies and semantic relationships between network events that elude traditional sequential models.

Despite its theoretical potential, the practical implementation of LogBERT proved infeasible within this study, inheriting the interoperability limitations documented in the previous section regarding the BERT model. Given its direct dependence on the Transformers library, the integration of LogBERT would have required an exclusive migration to the PyTorch ecosystem.

The decision not to proceed with this implementation was based on two critical pillars:

Ecosystem Consistency: Introducing an exclusive dependency on PyTorch would create a methodological asymmetry, since all other Deep Learning architectures in this study were developed and optimized on the TensorFlow/Keras backend.
Comparative Validity: To ensure the reliability of the analysis, it is imperative that performance variations be attributed to the nature of algorithms rather than intrinsic optimizations or processing differences between frameworks.

Thus, the exclusion of LogBERT does not stem from conceptual limitations of the model, but rather from a strategic choice to ensure a uniform and controlled experimental environment. This analysis reinforces the persistent challenge of ecosystem fragmentation in AI research, where the portability of cutting-edge models across development platforms remains an obstacle to the integration of heterogeneous tools.

5.4.3. DeepLog

The DeepLog architecture was implemented to evaluate the effectiveness of Recurrent Autoencoders in modeling nominal patterns in network logs. This approach rests on the hypothesis that anomalous instances, by diverging from the learned sequential patterns, will exhibit a significantly higher reconstruction error than legitimate traffic.

The architecture was developed on a sequential structure composed of stacked GRU layers (128 and 64 units, respectively), integrating a ‘TimeDistributed’ layer in the decoding block to allow element-by-element reconstruction of the original sequences.

The core of its detection engine is the Mean Squared Error (MSE) of reconstruction, which serves as the formal anomaly score. For a given input sequence $x$, the underlying reconstruction error is formulated as:

$$R(x) = \lVert x - \hat{x} \rVert^{2}$$

where $\hat{x}$ represents the element-by-element reconstruction performed by the ‘TimeDistributed’ decoding block; the MSE is this squared norm averaged over the elements of the sequence. An intrusion is flagged when the reconstruction residue significantly exceeds the nominal patterns learned during the training phase, indicating a structural deviation in the network log behavior.

Given the extreme disproportion of the dataset, a balanced subsampling strategy was applied for training, consolidating a set of samples. The model was optimized using the Adam algorithm, with Mean Squared Error (MSE) as the loss function and an Early Stopping mechanism to ensure convergence without overfitting.

The following results, detailed in Figure 8, demonstrate the model’s performance:

The experimental results revealed that this approach was not feasible for the scenario under study. The model recorded zero values for Recall, Precision, and F1-score, failing to identify any of the 48 threats present in the test set. The PR-AUC value (0.0001) confirms that the classifier operated at a level of statistical indistinguishability.

This limitation suggests a structural incompatibility between the nature of DeepLog and the specific characteristics of the institutional logs analyzed. Contrary to what the architecture assumes, anomalies in this context do not manifest through deviations in temporal sequence that generate high reconstruction residuals. This residual convergence phenomenon indicates that the model was able to reconstruct malicious events with the same accuracy as legitimate events, making it impossible to define a detection threshold.

6. Discussion

In contrast to the success of boosting algorithms, unsupervised architectures revealed critical limitations in segregating anomalous events. This performance disparity stems fundamentally from the intrinsic nature of the dataset and the high data compression applied during preprocessing, which may have attenuated behavioral variables decisive for identifying structural deviations. In the absence of marked temporal dependencies or distinctive informational variability in the features, models based on reconstruction error converged to a representation where legitimate and malicious traffic became indistinguishable in the latent space.

Thus, methodological evidence suggests that the success of supervised approaches in this domain is derived from their ability to map complex nonlinear relationships through explicit feedback from the target variable. In contrast, the effectiveness of unsupervised methods remains strictly dependent on contextual richness that allows the model to isolate anomalies without prior supervision. For the institutional scenario analyzed, explicit labeling and severity-focused feature engineering proved to be the pillars of a robust detection system, refuting the hypothesis of detection by mere statistical divergence.

To further validate these findings and evaluate the comparative standing of the proposed optimized XGBoost model, Table 9 compares its performance on the institutional dataset with State-of-the-Art (SOTA) results reported in the literature for standard benchmarks, such as CICIDS2017 and UNSW-NB15.

While the precision in our real institutional environment is slightly lower due to extreme class imbalance and high data heterogeneity, the achieved Recall of 0.96 demonstrates that our pipeline is highly effective at identifying critical threats, performing at a level comparable to models trained on more balanced public datasets.

From a deployment perspective, the optimized XGBoost model aligns with real-time monitoring requirements, achieving an average inference latency of 0.15 ms per record. While the model generated 8 false positives, the operational cost of investigating these alerts is marginal compared to the strategic benefit of identifying 46 out of 48 critical threats.

This high visibility, combined with the demonstrated hardware efficiency of the training pipeline, validates the system for integration into professional security environments, such as the ELK Stack. By leveraging GPU acceleration to maintain low training times, the proposed framework offers a scalable and sustainable solution for institutional infrastructures that require frequent model retraining to adapt to evolving threat landscapes.

7. Conclusions

This study presents a comprehensive comparative evaluation of machine learning and deep learning architectures for intrusion detection in institutional network logs. A rigorous experimental framework was adopted, encompassing both supervised and unsupervised paradigms, in order to assess their effectiveness in handling the intrinsic challenges of cybersecurity data, namely class imbalance and high dimensionality.

The empirical results consistently indicate the superior performance of supervised approaches, in line with findings reported in the recent literature on intrusion detection systems. In particular, the XGBoost algorithm emerged as the most effective model following systematic hyperparameter optimization, achieving a Recall of 0.96 and a Precision of 0.85. This trade-off is especially relevant in operational environments, where maximizing threat detection while maintaining a manageable false positive rate is critical for reducing analyst fatigue and ensuring timely response.

Recurrent neural network architectures, specifically Stacked GRU, also demonstrated competitive performance, highlighting their ability to capture temporal dependencies and to learn robust latent representations from sequential log data. These findings are consistent with prior work emphasizing the relevance of sequential modeling in intrusion detection contexts. In contrast, unsupervised approaches based on reconstruction error, such as Variational Autoencoders (VAE) and DeepLog, exhibited notable limitations. The results suggest that, in this dataset, anomalous events share substantial statistical similarity with legitimate traffic, thereby reducing the discriminative power of reconstruction-based methods.

Additionally, the study highlights practical challenges related to interoperability across modern machine learning frameworks. In particular, the difficulty in integrating Transformer-based architectures, such as BERT and LogBERT, due to incompatibilities between TensorFlow and PyTorch ecosystems, underscores a broader limitation in current AI development workflows. This observation aligns with ongoing discussions in the literature regarding the need for more flexible and framework-agnostic solutions in applied machine learning systems.

Future work will focus on the integration of the proposed detection framework into production-level Security Information and Event Management (SIEM) systems. This will enable real-time processing of streaming network data and support the development of interactive dashboards for enhanced situational awareness and proactive incident response. Furthermore, validation on publicly available benchmark datasets and the exploration of hybrid detection strategies are expected to strengthen the generalizability and operational robustness of the proposed approach.

Author Contributions

Conceptualization, P.C.; methodology, P.C.; software, P.C.; validation, P.C., F.S. and P.L.; formal analysis, P.C.; investigation, P.C.; resources, F.S. and P.L.; data curation, P.C.; writing—original draft preparation, P.C.; writing—review and editing, F.S. and P.L.; visualization, P.C.; supervision, F.S. and P.L.; project administration, F.S.; funding acquisition, F.S. and P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is funded by National Funds through the FCT—Foundation for Science and Technology, I.P., within the scope of the project Ref.UIDB/05583/2020.

Data Availability Statement

Data sharing is not applicable to this article due to privacy and security restrictions regarding the analyzed network logs. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the Research Center in Digital Services (CISeD) and the Instituto Politécnico de Viseu for their support.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
BERT	Bidirectional Encoder Representations from Transformers
CNN	Convolutional Neural Networks
DBSCAN	Density-Based Spatial Clustering of Applications with Noise
FL	Federated Learning
GPU	Graphics Processing Unit
GRU	Gated Recurrent Unit
I/O	Input/Output
IDS	Intrusion Detection System
LIME	Local Interpretable Model-agnostic Explanations
LSTM	Long Short-Term Memory
MLM	Masked Language Modeling
MSE	Mean Squared Error
NaN	Not a Number
PCA	Principal Component Analysis
PR-AUC	Precision-Recall Area Under the Curve
ROC-AUC	Receiver Operating Characteristic Area Under the Curve
SGD	Stochastic Gradient Descent
SHAP	Shapley Additive exPlanations
SIEM	Security Information and Event Management
SMO	Sequential Minimal Optimization
SVM	Support Vector Machine
VAE	Variational Autoencoder
XAI	Explainable Artificial Intelligence

Figure 1. Evaluation metrics for the Random Forest model on (1) Classification Report, (2) Confusion Matrix, (3) Precision-Recall Curve.

Figure 2. Evaluation metrics for the XGBoost model on (1) Classification Report, (2) Confusion Matrix, (3) Precision-Recall Curve.

Figure 3. Evaluation metrics for the SGD model on (1) Classification Report, (2) Confusion Matrix, (3) Precision-Recall Curve.

Figure 4. Visualization of K-means clustering with PCA. The blue symbols represent the data points (instances), while the red points represent the anomalies (real attack vectors). The overlapping of colors demonstrates that anomalies are scattered among legitimate traffic, confirming the model’s inability to isolate threats using centroid-based distances.

Figure 5. Evaluation metrics for the LSTM model on (1) Classification Report, (2) Confusion Matrix, (3) Precision-Recall Curve.

Figure 6. Evaluation metrics for the GRU model on (1) Classification Report, (2) Confusion Matrix, (3) Precision-Recall Curve.

Figure 7. Evaluation metrics for the VAE model on (1) Classification Report, (2) Confusion Matrix, (3) Precision-Recall Curve.

Figure 8. Evaluation metrics for the DeepLog model on (1) Classification Report, (2) Confusion Matrix, (3) Precision-Recall Curve.

Table 1. A structured taxonomy of current IDS.

Category	Sub-Types	Core Mechanism
Supervised	XGBoost, Random Forest, LSTM	Mapping features to known labels (0/1)
Unsupervised	VAE, K-Means, Isolation Forest	Detecting deviations from normal statistical patterns
Hybrid	CNN-GRU, Ensemble-Rules	Combining multiple architectures for decision making
Distributed	Federated Learning	Decentralized training across multiple nodes

Table 2. Comparison of the results of the Supervised Machine Learning models.

Metrics	Random Forest	XGBoost	SGD
Precision	0.98	0.81	0.02
Recall	0.83	0.96	1.00
F1-score	0.90	0.88	0.04
False Positives	1	11	2623
False Negatives	8	2	0
ROC-AUC	1.00	1.00	0.9993
PR-AUC	0.9743	0.9705	0.2033
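
The precision, recall, and F1 values in Table 2 can be cross-checked from the reported error counts. The following is a minimal sketch, not the authors' evaluation code. It assumes the test set contains 48 attack samples, which is inferred from Tables 4 and 5 (48 false negatives at a recall of 0.00) and is not stated in this table; the names `POSITIVES`, `scores`, and `TABLE_2` are hypothetical.

```python
# Cross-check of Table 2 from its false-positive / false-negative counts.
# ASSUMPTION: the test set holds 48 attack samples (inferred from
# Tables 4 and 5, where FN = 48 at recall 0.00); not stated in Table 2.
POSITIVES = 48


def scores(fp: int, fn: int, positives: int = POSITIVES) -> tuple[float, float, float]:
    """Return (precision, recall, F1) given error counts and the positive total."""
    tp = positives - fn  # true positives are the attacks that were not missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / positives if positives else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


# (false positives, false negatives) as reported in Table 2
TABLE_2 = {"Random Forest": (1, 8), "XGBoost": (11, 2), "SGD": (2623, 0)}

for name, (fp, fn) in TABLE_2.items():
    p, r, f1 = scores(fp, fn)
    print(f"{name:<14} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Under that assumption the recomputed values round to the figures in Table 2, which also makes the SGD result easy to read: it misses no attacks, but 2623 false alarms against 48 true detections drive precision to about 0.02.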

Table 3. Comparative performance between the base and optimized XGBoost models.

Metrics	Base XGBoost	Optimized XGBoost
Precision	0.81	0.85
Recall	0.96	0.96
F1-score	0.88	0.90
False Positives	11	8
False Negatives	2	2
ROC-AUC	1.00	0.9998
PR-AUC	0.9705	0.9703

Table 4. Comparative results of the performance of the Isolation Forest model with different levels of contamination.

Metrics	Isolation Forest (0.000106)	Isolation Forest (0.01)
Precision	0.00	0.00
Recall	0.00	0.00
F1-score	0.00	0.00
False Positives	50	4358
False Negatives	48	48
ROC-AUC	0.7996	0.7996
PR-AUC	0.0004	0.0004
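
Table 4 shows the false-positive count changing with the contamination level while ROC-AUC and PR-AUC stay identical. This is expected when contamination only moves the decision threshold applied to the same anomaly scores, since both AUC metrics are computed from the scores alone. The sketch below illustrates this on synthetic scores using only the standard library. It is not the authors' pipeline: the data, the helper names `roc_auc` and `flag_top_fraction`, and the choice to threshold on the scored set itself are all assumptions made for illustration.

```python
import random


def roc_auc(scores, labels):
    """Chance that a random attack outscores a random normal sample (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flag_top_fraction(scores, contamination):
    """Flag the highest-scoring fraction of samples as anomalies (1) and the rest as normal (0)."""
    k = max(1, round(contamination * len(scores)))
    threshold = sorted(scores, reverse=True)[k - 1]
    return [int(s >= threshold) for s in scores]


random.seed(0)
labels = [1] * 5 + [0] * 995  # synthetic, heavily imbalanced labels
scores = [random.gauss(1.0 if y else 0.0, 1.0) for y in labels]  # synthetic anomaly scores

auc = roc_auc(scores, labels)  # depends on the scores only, never on the threshold
for contamination in (0.005, 0.05):  # hypothetical contamination levels
    flags = flag_top_fraction(scores, contamination)
    false_pos = sum(1 for f, y in zip(flags, labels) if f == 1 and y == 0)
    print(f"contamination={contamination}: flagged={sum(flags)} "
          f"false_positives={false_pos} roc_auc={auc:.4f}")
```

Raising the contamination value flags more samples and therefore changes the error counts, while the ranking quality measured by ROC-AUC is unchanged, mirroring the pattern in Table 4.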

Table 5. Comparative performance results of the K-means model with different numbers of clusters.

Metrics	K-Means (8)	K-Means (50)
Precision	0.00	0.00
Recall	0.00	0.00
F1-score	0.00	0.00
False Positives	901	451
False Negatives	48	48
ROC-AUC	0.5631	0.6152
PR-AUC	0.0001	0.0001

Table 6. Comparative performance results of the DBSCAN model with different ‘min_samples’ configurations.

Metrics	DBSCAN (594)	DBSCAN (30)
Precision	0.00	0.00
Recall	1.00	1.00
F1-score	0.00	0.00
False Positives	1,783,931	1,662,820
False Negatives	0	0
ROC-AUC	0.7301	0.8425
PR-AUC	0.0002	0.0003
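
The pairing of a Recall of 1.00 with an F1-score of 0.00 in Table 6 follows from the size of the false-positive count. As a worked illustration for the DBSCAN (594) column, assume that the test set contains 48 attack samples (inferred from the 48 false negatives at a recall of 0.00 in Tables 4 and 5; the true-positive count is not listed in Table 6), so that TP = 48 when FN = 0:

```latex
\[
\text{Precision} = \frac{TP}{TP + FP} = \frac{48}{48 + 1{,}783{,}931} \approx 2.7 \times 10^{-5},
\qquad
F_1 = \frac{2\,TP}{2\,TP + FP + FN} = \frac{96}{96 + 1{,}783{,}931 + 0} \approx 5.4 \times 10^{-5}
\]
```

Both values round to 0.00 at two decimal places: under this assumption every attack is flagged, but only because almost all traffic is flagged as anomalous.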

Table 7. Comparison of the results of Supervised Deep Learning models.

Metrics	LSTM	BERT	GRU
Precision	0.59	-	0.70
Recall	0.96	-	0.94
F1-score	0.73	-	0.80
False Positives	32	-	19
False Negatives	2	-	3
ROC-AUC	1.00	-	1.00
PR-AUC	0.9473	-	0.9440

Table 8. Comparative performance between the base and optimized GRU models.

Metrics	Base GRU	Optimized GRU
Precision	0.70	0.91
Recall	0.94	0.85
F1-score	0.80	0.88
False Positives	19	4
False Negatives	3	7
ROC-AUC	1.00	1.00
PR-AUC	0.9440	0.9253

Table 9. Comparative analysis with SOTA results from standard IDS datasets.

Dataset	Model	Recall	Precision	F1-Score
CICIDS2017	XGBoost	0.98	0.97	0.97
UNSW-NB15	Random Forest	0.89	0.92	0.90
IPV Logs	Optimized XGBoost	0.96	0.85	0.90
