1. Introduction
In today’s hyperconnected society, safeguarding digital infrastructures has become a cornerstone of national, economic, and social stability. As cyber threats continue to evolve in scale and complexity, organizations must adopt intelligent and proactive defense mechanisms. Intrusion detection systems (IDSs) represent a critical layer in this defense architecture, serving to monitor and identify malicious or anomalous activity in real time. Traditionally, IDSs have relied on signature-based or rule-based techniques, which match observed patterns against known attack signatures. Although effective against previously encountered threats, these approaches often fail to detect new or obfuscated attacks and typically suffer from high false positive rates [1]. To address these limitations, recent efforts have turned to machine learning (ML), deep learning (DL), and hybrid methods capable of generalizing from data and adapting to new threat landscapes.
Machine learning models (both supervised and unsupervised) enable systems to learn behavioral patterns from historical traffic data [2]. Classical classifiers, such as Naïve Bayes (NB), Logistic Regression (LR), and Linear Discriminant Analysis (LDA), apply well-established mathematical frameworks to detect anomalies in network activity [1]. While these models are often interpretable and computationally lightweight, they can struggle with non-linear relationships and high-dimensional feature spaces typical of modern cybersecurity datasets.
In contrast, deep learning (DL) architectures provide powerful tools for automated representation learning and pattern recognition. In particular, Autoencoders (AEs) can learn compact, informative latent embeddings from network traffic data [3,4,5]. These embeddings can then be fed to simple yet effective downstream classifiers such as LR, yielding hybrid pipelines (AE+LR) that combine unsupervised feature learning with interpretable decision boundaries. Such DL-based representations have demonstrated strong performance in identifying complex and evolving cyber threats [1,6]. However, practical deployment still faces challenges related to computational demands, data requirements, and explainability [7,8].
Beyond intrusion detection, deep neural architectures have also been applied to security-critical communication scenarios, such as IRS (intelligent reflecting surfaces)-assisted NOMA (nonorthogonal multiple access) systems optimized via cooperative graph neural networks (CO-GNN) [9], which emphasize rich visualization of performance metrics to complement tabular evaluations.
In the context of applied AI and cybersecurity engineering, conducting rigorous comparative studies is vital. Such evaluations clarify when interpretable statistical models are sufficient for effective detection and when representation learning, paired with lightweight classifiers, can provide measurable gains.
This study presents a deterministic comparison of three classical classifiers (NB, LR, and LDA) and a hybrid deep-learning pipeline (AE+LR) for intrusion detection [10]. To ensure robustness, we employ two widely used benchmark datasets: NSL-KDD [11] (derived from KDD’99) and CICIDS2017 [12], which include up-to-date traffic and attack types. These datasets provide a balanced basis to examine model generalization, performance across diverse attacks, and real-world feasibility.
We assess each model using standard performance metrics: Accuracy, Precision, Recall, F1-score, AUC, and FAR. In addition, we provide complete tabular reporting. The key contributions of this paper include the following:
(i) a unified, leakage-free, and deterministic evaluation protocol applied consistently across NB, LR, LDA, and a hybrid AE+LR pipeline; (ii) dual training regimes (with/without SMOTE strictly on training folds) to isolate the effect of class rebalancing; (iii) a transparent deep-representation + linear classifier baseline (AE+LR) that combines competitive AUC/F1 with auditability; (iv) complete tables (Accuracy, Precision, Recall, F1, AUC, FAR) for both NSL-KDD and CICIDS2017; and (v) a replication package (code, configs, fixed seeds) enabling reproducibility.
From an operational standpoint, we interpret transparency as the use of models whose decision functions can be inspected, reproduced, and calibrated by security analysts. For this reason, we concentrate on linear probabilistic classifiers (LR/LDA) and on a hybrid AE+LR pipeline in which the deep Autoencoder is used solely as a representation learner, while the final decision layer remains a simple, auditable logistic regression classifier.
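To make this division of labor concrete, the AE+LR pipeline can be sketched with scikit-learn as below. This is a minimal illustration on synthetic data, with a hypothetical single-hidden-layer autoencoder and an 8-unit bottleneck; it is not the paper's actual architecture or hyperparameters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for preprocessed network-flow features (not real traffic).
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Fit the scaler on the training split only (leakage-free protocol).
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Autoencoder: an MLP trained to reconstruct its own (scaled) input.
ae = MLPRegressor(hidden_layer_sizes=(8,), activation="relu",
                  max_iter=500, random_state=42)
ae.fit(X_tr_s, X_tr_s)

def encode(model, X):
    """Hidden-layer (bottleneck) activations = latent embeddings."""
    return np.maximum(0, X @ model.coefs_[0] + model.intercepts_[0])

# The final decision layer stays a plain, auditable logistic regression.
clf = LogisticRegression(max_iter=1000).fit(encode(ae, X_tr_s), y_tr)
proba = clf.predict_proba(encode(ae, X_te_s))[:, 1]
```

The key design point is that `clf` exposes ordinary linear coefficients over the latent features, so the decision layer remains inspectable even though the representation is learned.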
The remainder of this article is structured as follows.
Section 2 provides a review of related research on IDS using ML and DL.
Section 3 describes the experimental methodology.
Section 4 presents the results.
Section 5 discusses the findings, limitations, contextual relevance, recommendations and directions for future work.
4. Results
This section reports the performance of the four evaluated approaches: Naïve Bayes (NB), Logistic Regression (LR), Linear Discriminant Analysis (LDA), and the hybrid Autoencoder embeddings + Logistic Regression (AE+LR) on the NSL-KDD and CICIDS2017 datasets. We present Accuracy, Precision, Recall, F1, AUC, and False Alarm Rate (FAR), consistent with the evaluation protocol described earlier. Unless noted, models use default thresholds (0.5 for probabilistic outputs), and all preprocessing/SMOTE steps are fit on the training split only.
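For clarity, the six reported metrics can be computed as in the following illustrative helper (the toy labels and scores are made up); FAR is derived from the confusion matrix as FP / (FP + TN), the fraction of benign flows incorrectly flagged as attacks.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

def ids_metrics(y_true, y_score, threshold=0.5):
    """All six reported metrics; positive class (1) = attack."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, zero_division=0),
        "Recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),       # threshold-free
        "FAR": fp / (fp + tn),                       # false alarm rate
    }

# Toy example: two benign (0) and two attack (1) flows.
m = ids_metrics([0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9])
```

At the default 0.5 threshold this toy case yields Accuracy, Precision, Recall, F1, and FAR of 0.5 each, while the threshold-free AUC is 0.75, which illustrates why AUC and the thresholded metrics can rank models differently.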
4.1. Overall Performance on NSL-KDD
Table 4 and Table 5 summarize the binary results on NSL-KDD without and with SMOTE, respectively. Across both regimes, LR, LDA, and AE+LR clearly outperform NB, which shows very high Precision but poor Recall (i.e., conservative attack labeling that misses many positives).
For reference, after applying SMOTE only on the training split, the NSL-KDD training set becomes exactly balanced at Benign:Attack = 80,892:80,892 (1:1).
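The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration of the technique on synthetic data, not the exact implementation used in the experiments (which would typically rely on the imbalanced-learn library); each synthetic sample lies on the segment between a minority point and one of its k nearest minority neighbours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    # Neighbours are searched among minority samples only; +1 because each
    # point's nearest neighbour is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]  # column 0 = self
    gap = rng.random((n_new, 1))                           # position on segment
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

rng = np.random.default_rng(0)
X_attack = rng.normal(size=(50, 4))                 # scarce minority class
X_synth = smote_oversample(X_attack, n_new=150, k=5, seed=0)
```

Applied only to the training split, this is the mechanism by which the minority (attack) class is grown until the Benign:Attack ratio reaches 1:1.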
Table 4.
Performance metrics of all models on the NSL-KDD test set (without SMOTE).
| Model | Accuracy | Precision | Recall | F1-Score | FAR | AUC |
|---|---|---|---|---|---|---|
| Naïve Bayes | 0.5649 | 0.9800 | 0.2406 | 0.3864 | 0.0065 | 0.7987 |
| Logistic Regression (LR) | 0.7443 | 0.9142 | 0.6079 | 0.7302 | 0.0754 | 0.8276 |
| LDA | 0.7616 | 0.9249 | 0.6326 | 0.7514 | 0.0679 | 0.8484 |
| Autoencoder (AE) + LR | 0.7616 | 0.9165 | 0.6397 | 0.7534 | 0.0770 | 0.9040 |
Table 5.
Performance metrics of all models on the NSL-KDD test set (with SMOTE on training data only).
| Model | Accuracy | Precision | Recall | F1-Score | FAR | AUC |
|---|---|---|---|---|---|---|
| Naïve Bayes | 0.5650 | 0.9800 | 0.2408 | 0.3866 | 0.0065 | 0.7978 |
| Logistic Regression (LR) | 0.7458 | 0.9137 | 0.6112 | 0.7324 | 0.0763 | 0.8227 |
| LDA | 0.7632 | 0.9238 | 0.6365 | 0.7537 | 0.0694 | 0.8525 |
| Autoencoder (AE) + LR | 0.7654 | 0.9166 | 0.6467 | 0.7583 | 0.0778 | 0.9043 |
NB: Exceptionally high Precision (≈0.98) but low Recall (≈0.24) produces the lowest F1; FAR is minimal due to the conservative decision boundary.
LR vs. LDA: LDA edges LR slightly on Accuracy/F1 and AUC in both regimes; differences are modest.
AE+LR: Delivers the highest AUC (≈0.904) and the best Recall/F1 trade-off among the four, with Accuracy comparable to LDA.
Effect of SMOTE: Improvements are small but consistent where they appear (e.g., LDA F1: 0.7514→0.7537; AE+LR F1: 0.7534→0.7583), indicating mild gains in detecting attacks at an approximately unchanged FAR.
On NSL-KDD, Figure 1 complements Table 4 and Table 5 by showing that the AE+LR pipeline not only attains the highest AUC (approximately 0.904) but also yields a ROC curve that lies above those of LR and LDA for most operating points, while NB lags behind, particularly in the mid-to-high FPR region. This confirms that AE+LR achieves the best overall separability between normal and attack traffic among the evaluated models.
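The ROC comparison underlying this kind of figure can be reproduced in miniature as below; the data are synthetic stand-ins, so the resulting AUC values are illustrative only, not the NSL-KDD results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic binary "traffic" with some label noise.
X, y = make_classification(n_samples=3000, n_features=20,
                           flip_y=0.05, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=7, stratify=y)

aucs, curves = {}, {}
for name, model in [("NB", GaussianNB()),
                    ("LR", LogisticRegression(max_iter=1000))]:
    score = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    curves[name] = roc_curve(y_te, score)   # (fpr, tpr, thresholds) to plot
    aucs[name] = roc_auc_score(y_te, score)
```

Plotting each `(fpr, tpr)` pair and annotating the legend with `aucs` gives exactly the kind of per-model ROC overlay discussed above.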
4.2. Overall Performance on CICIDS2017
Table 6 and Table 7 summarize the binary results on CICIDS2017. Performance is generally high across LR, LDA, and AE+LR, with NB trailing due to a high FAR and lower Precision.
Table 6.
Performance metrics of all models on the CICIDS2017 dataset (without SMOTE).
| Model | Accuracy | Precision | Recall | F1-Score | FAR | AUC |
|---|---|---|---|---|---|---|
| Naïve Bayes | 0.8565 | 0.6364 | 0.9742 | 0.7699 | 0.1820 | 0.9711 |
| Logistic Regression (LR) | 0.9878 | 0.9722 | 0.9783 | 0.9752 | 0.0091 | 0.9961 |
| LDA | 0.9784 | 0.9426 | 0.9715 | 0.9569 | 0.0193 | 0.9923 |
| Autoencoder (AE) + LR | 0.9859 | 0.9691 | 0.9738 | 0.9715 | 0.0101 | 0.9960 |
Table 7.
Performance metrics of all models on the CICIDS2017 dataset (with SMOTE on training data only).
| Model | Accuracy | Precision | Recall | F1-Score | FAR | AUC |
|---|---|---|---|---|---|---|
| Naïve Bayes | 0.8564 | 0.6362 | 0.9742 | 0.7697 | 0.1821 | 0.9686 |
| Logistic Regression (LR) | 0.9862 | 0.9623 | 0.9824 | 0.9723 | 0.0126 | 0.9962 |
| LDA | 0.9735 | 0.9156 | 0.9830 | 0.9481 | 0.0296 | 0.9924 |
| Autoencoder (AE) + LR | 0.9806 | 0.9428 | 0.9805 | 0.9613 | 0.0194 | 0.9954 |
LR and AE+LR: Both achieve a near-ceiling AUC (≈0.996). LR attains the best overall Accuracy/F1 (0.9878/0.9752) without SMOTE; AE+LR is a close second (0.9859/0.9715).
LDA: Strong but consistently below LR/AE+LR in this dataset, with slightly higher FAR.
NB: High Recall (≈0.97) but much lower Precision and a high FAR (≈0.18), yielding the lowest F1.
Effect of SMOTE: On CICIDS2017 (already balanced at the aggregate level after unification), SMOTE slightly reduces Accuracy/F1 for LR and AE+LR and increases FAR, which is consistent with minor overcompensation when the base split is relatively well balanced.
On CICIDS2017, the impact of SMOTE is inherently constrained by the near-ceiling performance already achieved by LR, LDA, and AE+LR. As reported in Table 6 and Table 7, these models reach ROC AUC values close to 0.996 and very high F1-scores even without oversampling, and SMOTE only produces small fluctuations at the third or fourth decimal place. In this regime, the decision boundary already separates benign and attack traffic very effectively, so synthetic minority examples generated by SMOTE tend to lie in regions of the feature space that are already densely populated by correctly classified points, adding little new information. In addition, some attack subcategories appear with very few and potentially noisy samples, making them poor candidates for interpolation: oversampling around isolated, overlapping, or mislabeled points may fail to improve, or even slightly perturb, the classifier’s decision surface. These observations clarify that, for CICIDS2017, SMOTE operates under conditions where its theoretical benefits are limited, and its role is mainly to provide a consistent comparison with NSL-KDD rather than to drive substantial performance gains.
For CICIDS2017, Figure 2 confirms that LR, LDA, and AE+LR all operate in a near-ceiling regime, with ROC curves that are almost indistinguishable (AUC ≈ 0.996 for LR and AE+LR; Table 6 and Table 7). NB, in contrast, traces a noticeably lower ROC curve, reflecting its higher FAR and reduced Precision despite very high Recall. Overall, once the feature space is well separated, the additional non-linear representation from the AE does not translate into a visible ROC advantage over LR on this benchmark, even though AE+LR still matches LR in AUC.
Runtime Analysis
Table 8 shows that, on the smaller NSL-KDD benchmark, all models train in less than 26 s and complete inference on the entire test split in at most 0.084 s. On the larger CICIDS2017 dataset, training times increase by roughly two orders of magnitude due to the 1.68 million training flows, yet even the most expensive configuration (AE+LR with SMOTE) finishes training in about 421 s (approximately 7 min). Inference on CICIDS2017 remains inexpensive for the classical models (no more than 1.9 s to score the whole test split), and although AE+LR requires about 71–73 s to compute embeddings and predictions for the 419,995 test flows, this still corresponds to sub-millisecond latency per flow. Overall, all four pipelines keep the runtime at a level compatible with near real-time IDS deployment on commodity hardware. These results are shown in Appendix B, in Table A17, Table A18 and Table A19.
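Wall-clock figures of this kind can be measured with a simple timing harness, as sketched below on a small synthetic dataset; absolute times depend on hardware, so only the methodology (separate training and scoring timers, per-flow latency from total inference time) carries over.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic dataset; far smaller than CICIDS2017's 1.68M training flows.
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

t0 = time.perf_counter()
clf = LogisticRegression(max_iter=1000).fit(X, y)
train_s = time.perf_counter() - t0            # training wall-clock time

t0 = time.perf_counter()
clf.predict_proba(X)
infer_s = time.perf_counter() - t0            # scoring wall-clock time

# Per-flow latency: total inference time divided by number of flows.
per_flow_ms = 1000 * infer_s / len(X)
```

Using `time.perf_counter()` rather than `time.time()` avoids clock-adjustment artifacts when timing short runs.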
4.3. Takeaways
Across both datasets, NB provides a fast but conservative baseline with high Precision and low Recall. LDA is a strong classical competitor and slightly edges LR on NSL-KDD, whereas LR leads on CICIDS2017. The hybrid AE+LR consistently offers the best AUC and competitive F1, suggesting that unsupervised embeddings help linear classifiers capture non-linear structure without sacrificing interpretability. By using the autoencoder purely to learn latent embeddings and delegating all final decisions to a linear LR classifier, AE+LR improves separability (AUC/F1) over LR/LDA while preserving the same transparent, feature-weight–based decision layer that operators can inspect, calibrate, and document. SMOTE yields modest gains in NSL-KDD (where minority attacks are scarce) and does not help—and can slightly hurt—on the unified CICIDS2017 split. These patterns align with prior IDS evidence: classical linear/probabilistic baselines remain valuable, and hybrid deep representations can further enhance linear decision functions under tabular traffic features.
4.3.1. Comparison with Related Work
It is instructive to situate our findings alongside recent IDS evaluations on NSL-KDD and CICIDS2017. Broadly, prior works report that deep learning (DL) models frequently attain near-ceiling discrimination on these corpora, whereas well-regularized linear/probabilistic baselines remain competitive, transparent references. A compact side-by-side comparison with recent IDS studies is summarized in Table 9.
Linear/Probabilistic baselines. Ali et al. [1] report LR around ∼97% Accuracy, and NB near ∼64% on their benchmark—trends that mirror our binary CICIDS2017 results, where LR achieves 98.78% Accuracy (F1 = 0.9752), and NB substantially trails due to its low Precision and high FAR (Table 6). Our LDA is also competitive, consistent with surveys that position LR/LDA as strong, reproducible tabular baselines [23,26].
Deep learning and hybrids. Several studies show DL advantages on NSL-KDD/CICIDS2017. Umer and Abbasi [31] report an AE–LSTM (Long Short-Term Memory) two-stage system with ∼89% accuracy on NSL-KDD, prioritizing low false alarms; Chen et al. [32] obtain 86.8% on CICIDS2017 with a Deep Belief Network (DBN)+LSTM; and Roy et al. [20] reach 99.11% on CICIDS2017 using a lightweight supervised ensemble. Rather than training a full end-to-end DL classifier, our study adopts a hybrid approach (AE+LR): an autoencoder provides unsupervised embeddings and a linear LR supplies an auditable decision layer. This yields near-ceiling AUC on CICIDS2017 (AUC ≈ 0.996) and the top AUC on NSL-KDD among our four families (AUC ≈ 0.904), aligning with evidence that AE-driven representation learning can improve linear separability while preserving interpretability [5,19,23,26,27].
On metric deltas across papers. Absolute numbers in cross-paper comparisons routinely vary with dataset curation (binary vs. multi-class), train/test protocol (official splits vs. random), feature engineering and normalization, imbalance remedies, and leakage controls. For instance, our pipelines fit all preprocessing (and SMOTE when used) strictly on training data and evaluate on held-out splits, which can produce more conservative but comparable scores. Prior work also documents that SMOTE often helps minority Recall on NSL-KDD but has mixed value on CICIDS2017, depending on the split [23,25]. Within this context, our results corroborate the prevailing ranking—NB ≪ LR/LDA ≲ DL—while showing that a simple AE+LR hybrid can deliver DL-like AUC with a transparent classifier.
4.3.2. Key Takeaways
In summary, no single model is universally best for all deployment constraints. Among classical baselines, LR and LDA provide strong, transparent performance with low computational cost, as confirmed by the training and inference times in Table 8, making them attractive when interpretability and latency are priorities [23]. NB is generally not competitive on modern IDS corpora unless paired with additional design choices (e.g., feature engineering or hybridization; see Gu & Lu [14]).
Our hybrid pipeline (AE+LR) shows that using an autoencoder purely as an unsupervised representation learner can deliver near state-of-the-art AUC on CICIDS2017 while preserving an auditable linear decision layer, aligning with reports that deep representations can improve linear separability without sacrificing explainability [23,26].
Operationally, the results underscore two practical points: (i) addressing class imbalance (e.g., with SMOTE applied only on training data) improves minority-class Recall on NSL-KDD, whereas its benefit on CICIDS2017 depends on the split; and (ii) reporting FAR alongside Recall is essential to balance missed attacks against analyst workload. Overall, a layered IDS that uses fast statistical models for coarse filtering and learned representations for refined decisions can leverage complementary strengths under realistic constraints.
4.4. Statistical Robustness and Uncertainty Quantification
4.4.1. Statistical Analysis on NSL-KDD
The statistical evaluation on the NSL-KDD dataset, presented in Table A18, reveals varying levels of stability across the tested models:
High Stability in LR and NB: The Logistic Regression (LR) and Naïve Bayes (NB) models demonstrate extreme stability, particularly when run without SMOTE. Their SD values are negligible (SD < 0.0001 for most metrics), and the 95% CIs are extremely narrow (e.g., an Accuracy CI of [0.7443, 0.7443] for LR without SMOTE). This confirms that the deterministic results initially reported for these linear models are highly reliable.
Hybrid Model Consistency: The Autoencoder-based hybrid models (LR + AE) also exhibit high consistency, showing low SD values (Accuracy SD ≈ 0.0008) and tight 95% CIs. This indicates that the observed performance improvement from deep representation learning is statistically robust and consistent across different initialization states.
Highest Instability in LDA: In contrast, the Linear Discriminant Analysis (LDA) model without SMOTE displays the highest sensitivity to the random seed. Its Accuracy SD is markedly higher (0.1390), resulting in a wide Accuracy range of [0.5650, 0.7616]. This high variability suggests that while LDA can occasionally achieve high performance, its average performance and practical reliability on this dataset are inconsistent.
Overall, the statistical analysis confirms that the primary conclusions regarding the relative performance and stability of the models on NSL-KDD, especially the consistent performance of the LR + AE hybrid approach, are statistically sound and independent of the random seed, except for the volatile LDA model.
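The seed-level summaries used above (mean, SD, 95% CI, range) can be computed as follows; the normal-approximation CI and the example accuracy scores are illustrative, not values from the paper's runs.

```python
import numpy as np

def summarize_over_seeds(scores, z=1.96):
    """Mean, sample SD, normal-approximation 95% CI, and range of a metric
    measured across repeated runs with different random seeds."""
    scores = np.asarray(scores, dtype=float)
    mean, sd = scores.mean(), scores.std(ddof=1)   # ddof=1: sample SD
    half = z * sd / np.sqrt(len(scores))           # CI half-width
    return {"mean": mean, "sd": sd,
            "ci95": (mean - half, mean + half),
            "range": (scores.min(), scores.max())}

# Hypothetical Accuracy of one model across 10 seeded runs.
stats = summarize_over_seeds([0.7612, 0.7620, 0.7616, 0.7609, 0.7618,
                              0.7615, 0.7613, 0.7621, 0.7610, 0.7617])
```

Non-overlapping 95% CIs between two models, computed this way per metric, are the criterion used above to call a ranking statistically significant.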
4.4.2. Statistical Analysis on CICIDS2017
The statistical assessment on the CICIDS2017 dataset, presented in Table A19, demonstrates a remarkably high degree of stability across all tested models:
Near-Zero Variance Across All Models: Both classical and hybrid models exhibit exceptional consistency, with SD values generally below 0.0005 for key metrics like Accuracy, Precision, and F1-score. For instance, LR without SMOTE shows SD ≤ 0.0004 across all reported metrics. Even the deep learning-based LR + AE models, which typically have greater stochasticity, maintain very low SDs (Accuracy SD ≤ 0.0005).
Statistical Significance of Performance Differences: The clear and high separation in the mean performance between the top-performing models (e.g., LR without SMOTE with mean Accuracy = 0.9877) and the lower-performing classical models (e.g., LDA without SMOTE with mean Accuracy = 0.8529) is validated by the non-overlapping 95% CIs. This confirms that the ranking of models is statistically significant.
Decisive Robustness: The very low uncertainty across the board confirms that the high detection quality (e.g., AUC ≥ 0.99 for top models) is an intrinsic and robust property of the models on the CICIDS2017 dataset.
5. Conclusions and Future Work
This study presented a rigorous, reproducible comparison of three classical classifiers—Naïve Bayes (NB), Logistic Regression (LR), and Linear Discriminant Analysis (LDA)—and a hybrid deep-representation pipeline based on Autoencoder embeddings with a linear classifier (AE+LR). Using the NSL-KDD and CICIDS2017 benchmarks, we reported Accuracy, Precision, Recall, F1, AUC, and False Alarm Rate (FAR), with all preprocessing and any class rebalancing (SMOTE) fit strictly on the training split. In addition, we measured wall-clock training and inference times for all four pipelines (Table 8), showing that NB, LR, and LDA train in seconds on NSL-KDD and in at most a few hundred seconds on CICIDS2017, while inference for all models, including AE+LR, remains well below one millisecond per flow, which is compatible with near real-time IDS operation on commodity hardware.
On NSL-KDD, AE+LR and LDA offered the strongest overall trade-offs. AE+LR attained the highest AUC (≈0.904; Table 5) and the best F1 among the four models in both regimes (e.g., 0.7583 with SMOTE), while LDA slightly edged LR on Accuracy/F1 and AUC (e.g., F1 0.7537 vs. 0.7324 with SMOTE). NB achieved very high Precision (≈0.98) but very low Recall (≈0.24), yielding the weakest F1; its low FAR reflects a conservative decision boundary that misses many attacks.
On CICIDS2017, performance was strong for LR, LDA, and AE+LR. LR achieved the best Accuracy/F1 without SMOTE (0.9878/0.9752), with AE+LR a close second (0.9859/0.9715), and both near-ceiling AUC (≈0.996). NB again trailed due to a much higher FAR (≈0.18) and lower Precision, despite high Recall.
Applying SMOTE to training data produced modest gains on NSL-KDD (where rare attacks are genuinely scarce), e.g., small F1 improvements for LDA and AE+LR with essentially unchanged FAR. In the unified binary split of CICIDS2017, SMOTE did not help and in some cases slightly reduced Accuracy/F1 while increasing FAR—consistent with mild overcompensation when the base class balance is already adequate.
For operators seeking transparent and lightweight deployment, LDA and LR remain strong, reproducible baselines. Where a small computational overhead is acceptable, AE+LR provides additional separability (highest AUC) and competitive F1 by leveraging an unsupervised representation while keeping a simple, auditable linear classifier. NB remains useful as a sanity-check baseline and for resource-constrained scenarios, but its low Recall on NSL-KDD cautions against relying on it as a stand-alone detector.
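As a small illustration of what "auditable" means for the linear decision layer, the fitted weights of an LR model can be ranked directly; the feature names and data below are placeholders, not features from either benchmark.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Placeholder data and feature names standing in for flow features.
X, y = make_classification(n_samples=1000, n_features=6,
                           n_informative=3, random_state=1)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]

# Standardize so coefficient magnitudes are comparable across features.
Xs = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(Xs, y)

# Positive weights push toward the attack class, negative toward benign;
# ranking by |weight| gives an analyst-readable audit of the decision rule.
ranked = sorted(zip(feature_names, clf.coef_[0]),
                key=lambda t: abs(t[1]), reverse=True)
```

An operator can inspect, document, or even manually clamp such weights, which is the transparency property that end-to-end deep classifiers lack.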
Our primary focus was the binary setting; a complete multiclass analysis and cost-sensitive operating points were therefore not central objectives here. In addition, the AE was used strictly as a feature learner (not as a thresholded anomaly detector), so conclusions about reconstruction-threshold tuning are beyond scope.
Future Work
Multiclass and cost-aware evaluation. Extend to full multiclass NSL-KDD with macro/micro metrics and per-class PR (Precision–Recall) curves; explore cost-sensitive thresholds (e.g., penalizing FNs (false negatives) for critical attacks).
Non-linear classical baselines. Include Support Vector Machines (with kernels) and tree ensembles (Random Forest, Gradient Boosting) as non-linear yet interpretable tabular baselines.
Representation learning variants. Compare AE with denoising/variational variants and different bottleneck sizes; quantify when AE+LR’s gains over LR/LDA justify the added cost.
Imbalance strategies. Evaluate cost-sensitive learning, probability calibration, and focal-type objectives; compare SMOTE with alternatives such as the Synthetic Minority Over-sampling Technique for Nominal and Continuous features (SMOTE-NC), Adaptive Synthetic Sampling (ADASYN), and informed under-sampling.
Cross-dataset generalization. Train on one domain and test on another (e.g., NSL-KDD → CICIDS2017 or the UNSW-NB15 intrusion detection dataset) with domain adaptation/transfer learning; examine robustness to concept drift.
Operationalization and explainability. Integrate local explanations, such as Shapley Additive exPlanations (SHAP) for LR/LDA and in the AE latent space, into alerting workflows; these additions extend our focus on operational transparency by enabling analysts to see which features drive individual alerts and to validate or refine decision policies accordingly.
Reproducible pipelines. Release notebooks that regenerate all tables and reported metrics from predict_proba/decision_function; add regression tests to guard against library drift.
Cross-dataset validation. Replicate the full pipeline on UNSW-NB15 and the CIC-IDS-2018 dataset from the Canadian Institute for Cybersecurity (CIC) and run train→test cross-domain transfers to assess robustness under dataset shift.
Dataset fusion and harmonization. Explore concatenating harmonized feature-spaces from NSL-KDD and CICIDS2017 (with a domain indicator) to train unified models, and quantify trade-offs in Accuracy/FAR vs. domain shift. Evaluate whether AE-based representations ease cross-domain fusion.
Overall, the results support a pragmatic recommendation: start with LR/LDA for simplicity and transparency, and adopt AE+LR when consistent separability gains (AUC) are desired without abandoning a linear, interpretable classifier. Careful use of imbalance remedies and alarm-centric metrics (Recall and FAR) is essential for reliable IDS deployment.