A Hybrid Time Series Forecasting Model Combining ARIMA and Decision Trees to Detect Attacks in MITRE ATT&CK Labeled Zeek Log Data

Freeman, Raymond; Bagui, Sikha S.; Bagui, Subhash C.; Mink, Dustin; Cameron, Sarah; Carvalho, Germano Correa Silva De

doi:10.3390/electronics15040871

Open AccessEditor’s ChoiceArticle

A Hybrid Time Series Forecasting Model Combining ARIMA and Decision Trees to Detect Attacks in MITRE ATT&CK Labeled Zeek Log Data

by

Raymond Freeman

¹,

Sikha S. Bagui

^1,*

,

Subhash C. Bagui

²

,

Dustin Mink

³

,

Sarah Cameron

¹

and

Germano Correa Silva De Carvalho

¹

Department of Computer Science, The University of West Florida, Pensacola, FL 32514, USA

²

Department of Mathematics and Statistics, The University of West Florida, Pensacola, FL 32514, USA

³

Department of Cybersecurity, The University of West Florida, Pensacola, FL 32514, USA

^*

Author to whom correspondence should be addressed.

Electronics 2026, 15(4), 871; https://doi.org/10.3390/electronics15040871

Submission received: 30 December 2025 / Revised: 15 February 2026 / Accepted: 18 February 2026 / Published: 19 February 2026

(This article belongs to the Special Issue Recent Advances in Intrusion Detection Systems Using Machine Learning)

Download

Browse Figures

Versions Notes

Abstract

Intrusion detection systems face challenges in processing high-volume network traffic while maintaining accuracy across diverse low volume attack types. This study presents a hybrid approach combining ARIMA time series forecasting with Decision Tree classification to detect attacks in Zeek network flow data labeled with MITRE ATT&CK tactics, leveraging PySpark for scalability. ARIMA identifies temporal anomalies which Decision Trees then classify by attack type. The ARIMA model was evaluated across 13 MITRE ATT&CK tactics, though only 7 maintained sufficient class balance for valid assessment. Results are reported at three evaluation levels: Baseline (Decision Tree only), ARIMA-DT (Decision Tree tested on ARIMA-filtered anomalies), and End-to-End (pipeline performance measured against the original test population). The hybrid model demonstrated two distinct benefits: performance improvement for detectable attacks and detection enablement for previously undetectable attacks. For high-volume attacks with existing baseline detection, ARIMA preprocessing substantially improved performance, for example, Reconnaissance achieved an ARIMA-DT F1 score of 99.71% (from a baseline of 80.88%) with End-to-End metrics confirming this improvement at 97.59% F1-score. Credential Access reached a perfect 100% precision and recall on the ARIMA-filtered subset (from a baseline recall of 7.48%); however, End-to-End evaluation revealed that ARIMA filtering removed the vast majority of Credential Access attacks, resulting in a 1.28% End-to-End F1-score—worse than the baseline F1-score of 7.41%—demonstrating that the hybrid pipeline is counterproductive for attack types whose flow characteristics closely resemble legitimate traffic. More significantly, ARIMA preprocessing enabled detection where traditional Decision Trees completely failed (0% recall) for four stealthy attack types: Defense Evasion (ARIMA-DT recall of 93.22%, End-to-End 67.83%), Discovery (ARIMA-DT recall of 100%, End-to-End 63.43%), Persistence (ARIMA-DT recall of 86.92%, End-to-End 73.38%), and Privilege Escalation (ARIMA-DT recall of 89.93%, End-to-End 64.68%). These results demonstrate that ARIMA-based statistical anomaly detection is particularly effective for attacks involving subtle, low-volume activities that blend with legitimate operations, while also improving classification accuracy for high-volume reconnaissance activities.

Keywords:

ARIMA; Decision Tree classifier; Intrusion Detection Systems; MITRE ATT&CK framework; Zeek Log Data; time series forecasting; anomaly detection; hybrid machine learning model

1. Introduction

Cybersecurity threats present significant challenges to the integrity and reliability of modern network systems, particularly within large-scale and high-throughput environments. As cyberattacks grow more dynamic and complex, traditional intrusion detection systems (IDS) often fail to adapt to the evolving patterns characteristic of malicious behavior. Rule-based systems rely on static signatures that cannot adapt to evolving attack patterns unfolding over time, while conventional machine learning models treat network events as independent observations, failing to capture the sequential dependencies and autocorrelation inherent in multi-stage attacks. These systems lack the temporal modeling capability necessary to detect sophisticated threats like slow-burn lateral movement or gradual privilege escalation, where malicious behavior emerges only through analysis of time-ordered patterns rather than individual events. Addressing this gap requires intrusion detection methods that explicitly model temporal dependencies in network traffic.

Time series analysis provides a powerful framework for modeling sequential data, enabling the identification of anomalies that may correspond to cyber threats. Specifically, the AutoRegressive Integrated Moving Average (ARIMA) model [1] captures trends, seasonality, and autocorrelation in numerical features, making it well-suited for detecting deviations in traffic patterns over time. Cybersecurity research has increasingly adopted time series methods to address the dynamic and evolving nature of threats, leveraging their ability to model temporal dependencies that traditional IDS approaches overlook.

While ARIMA excels at temporal anomaly detection through its modeling of autocorrelation, trends, and seasonality, it possesses a fundamental limitation: it identifies anomalous behavior but cannot determine what type of attack is occurring. ARIMA produces forecasting errors and residual scores indicating deviations from expected patterns, yet these numerical outputs lack any classification capability. Conversely, supervised classifiers such as Decision Trees can effectively categorize attack types using labeled training data and provide the interpretability needed for operational deployment, but they lack inherent temporal awareness–treating each network event as independent and missing the sequential dependencies that characterize multi-stage attacks. A hybrid approach that combines ARIMA’s temporal anomaly detection with Decision Tree classification addresses both limitations: ARIMA serves as a temporal anomaly detection filter identifying traffic that deviates from expected patterns, while the Decision Tree provides attack categorization based on MITRE ATT&CK framework labels [2].

Among supervised classification algorithms, Decision Trees was selected for this hybrid framework for two primary reasons. First, as the study’s research objective is to evaluate whether ARIMA preprocessing enables detection of previously undetectable attacks, the classifier must serve as a controlled baseline that isolates the effect of ARIMA filtering rather than confounding results with classifier-specific optimizations. Decision Trees with default hyperparameters provide this controlled baseline—any performance improvement between baseline and ARIMA-enhanced results can be attributed to the temporal preprocessing rather than classifier sophistication. Second, Decision Trees offer practical advantages for the cybersecurity context: inherent interpretability through their rule-based structure, enabling security analysts to understand and validate classification logic—a critical requirement for operational IDS deployment where false positives must be explicable and actionable; effective handling of continuous network traffic features (duration, bytes transferred, packet counts) without requiring extensive feature scaling that SVMs demand; and computational efficiency for real-time classification in high-throughput environments. While ARIMA-derived anomaly scores could theoretically serve as additional features, the current architecture uses ARIMA exclusively for data filtering to maintain simplicity and avoid temporal aggregation misalignment between windowed forecasts and individual flow records. Maximizing classification accuracy was explicitly not the primary goal of this study. The use of ensemble methods such as Random Forests or gradient boosting, or more complex models such as ANNs, would likely improve classification performance but would also obscure whether observed improvements stem from ARIMA preprocessing or from the classifier’s greater capacity. The deliberate use of an untuned Decision Tree establishes a conservative lower bound on the hybrid model’s effectiveness—the detection enablement demonstrated for stealthy attacks (64–100% End-to-End recall from 0% baseline) is attributable to ARIMA’s statistical filtering rather than classifier power. Future work should investigate whether pairing ARIMA preprocessing with more powerful classifiers yields further improvement beyond the baseline established here.

This research applies ARIMA forecasting to high-volume network flow data from the UWF-ZeekData22 [3] and UWF-ZeekData24 [4] datasets, each containing millions of rows of packet-level and byte-level information. Additional information about these datasets is available in [5]. To support real-time detection and scalability, this study integrates PySpark for big data processing and applies partitioned windows to extract both short- and long-term dependencies. However, while ARIMA is effective at identifying temporal anomalies, it cannot label or classify the nature of these anomalies. To bridge this gap, the study introduces a hybrid approach that combines ARIMA-based anomaly detection with a supervised Decision Tree (DT) classifier trained using the MITRE ATT&CK framework labelled data [2], enabling classification of attack types.

To address these limitations, this study explores the use of time series forecasting as a foundation for network intrusion detection, emphasizing its ability to uncover subtle deviations and statistical anomalies in network traffic. Experimental results across multiple 100,000-row partitions for thirteen MITRE ATT&CK tactics demonstrate varied outcomes. Seven tactics maintained adequate class balance for valid assessment. Results are reported at three evaluation levels: Baseline (Decision Tree only), ARIMA-DT (Decision Tree tested on ARIMA-filtered anomalies), and End-to-End (pipeline performance measured against the original test population). The hybrid approach demonstrated two distinct benefits: performance improvement and detection enablement. For high-volume attacks with existing baseline detection capabilities, ARIMA preprocessing substantially improved performance, for example, Reconnaissance achieved 99.71% ARIMA-DT F1-score (improved from a baseline of 80.88%) with a 99% reduction in false positives, with End-to-End metrics confirming this improvement at 97.59% F1-score. Credential Access achieved perfect 100% precision and recall on the ARIMA-filtered subset (improved from a baseline recall of 7.48%); however, End-to-End evaluation revealed that ARIMA filtering removed most Credential Access attacks, resulting in a 1.28% End-to-End F1-score, indicating that this attack type’s flow characteristics closely resemble legitimate authentication traffic. More significantly, ARIMA preprocessing enabled detection where baseline Decision Trees completely failed (recall of 0%) for four stealthy attack types: Defense Evasion (achieving 91.83% ARIMA-DT recall, 67.83% End-to-End recall), Discovery (achieving 100% ARIMA-DT recall, 63.43% End-to-End recall), Persistence (achieving 86.92% ARIMA-DT recall, 73.38% End-to-End recall), and Privilege Escalation (achieving 89.93% ARIMA-DT recall, 64.68% End-to-End recall). However, Exfiltration remained undetectable under all three evaluation levels, indicating fundamental limitations in flow-based detection for this attack type. These findings confirm that ARIMA-based statistical anomaly detection is particularly effective for low-volume, stealthy attacks that blend with legitimate operations, while also substantially improving classification accuracy for high-volume reconnaissance activities. The study also identifies methodological challenges in maintaining balanced class representation across diverse attack types, providing insights for future hybrid model development.

While prior research has demonstrated the effectiveness of combining ARIMA with machine learning classifiers, existing hybrid approaches operate within a fundamentally different paradigm than the one proposed here. These studies, discussed in Section 2, employ a decomposition-and-recombination architecture in which time series data is separated into linear and nonlinear components—ARIMA models the stationary or linear component, while classifiers such as decision trees or neural networks capture nonlinear residuals—and the outputs are recombined to improve overall forecasting accuracy. In contrast, this study adopts a sequential filter-then-classify pipeline in which ARIMA serves as a statistical anomaly detection filter rather than a forecasting component. ARIMA identifies network traffic that deviates from expected temporal patterns, and only these flagged anomalies are passed to the Decision Tree for attack type classification using MITRE ATT&CK labels. The classifier does not model residuals from ARIMA’s forecasts; instead, it operates on a pre-filtered subset of the data, fundamentally changing the detection landscape by concentrating the classifier’s effort on statistically anomalous traffic. This architectural distinction produces a qualitatively different outcome: rather than incrementally improving forecast accuracy as in prior decomposition-based hybrids, the proposed pipeline enables detection of attack types that were previously undetectable by the classifier alone—a capability not demonstrated in existing ARIMA-classifier combinations. Furthermore, unlike prior hybrid studies that target general time series domains such as financial or energy forecasting, this work applies the ARIMA-classifier paradigm specifically to large-scale network intrusion detection, evaluating performance across 13 MITRE ATT&CK tactics and introducing a three-level evaluation framework (Baseline, ARIMA-DT, and End-to-End) to rigorously assess the pipeline’s real-world detection capability.

The main contributions of this study include:

A novel hybrid intrusion detection framework that combines ARIMA time series forecasting for temporal anomaly detection with Decision Tree classification for attack categorization using MITRE ATT&CK framework labels, addressing the complementary limitations of both approaches—ARIMA’s inability to classify attack types and Decision Trees’ lack of temporal awareness.
Empirical demonstration of detection enablement for stealthy attacks, showing that ARIMA preprocessing enables detection of four previously undetectable attack types (Defense Evasion, Discovery, Persistence, and Privilege Escalation) where baseline Decision Tree models achieved 0% recall, with the hybrid approach achieving ARIMA-DT recall rates between 86 and 100% and End-to-End recall rates between 63 and 73%.
Significant performance improvements for high-volume attack detection, particularly for Reconnaissance (97.59% End-to-End F1-score compared to a baseline of 80.88%), and identification of attack-type-dependent limitations where Credential Access achieved perfect ARIMA-DT classification but poor End-to-End performance (1.28% F1-score) due to flow-level similarity between attack and benign traffic.
Scalable implementation using PySpark for processing large-scale network flow data from UWF-ZeekData22 [3] and UWF-ZeekData24 [4] datasets containing millions of rows, with partitioned window analysis supporting real-time detection capabilities.
Comprehensive evaluation across 13 MITRE ATT&CK tactics at three evaluation levels (Baseline, ARIMA-DT, and End-to-End) demonstrating the differential effectiveness of ARIMA preprocessing, revealing that statistical anomaly detection is particularly effective for low-volume, stealthy attacks while also improving classification accuracy for high-volume reconnaissance activities.

2. Related Works

ARIMA models have been applied in cybersecurity primarily as forecasting tools for predicting vulnerability trends and attack frequency. Gencer and Basciftci [6] applied ARIMA and deep learning algorithms to forecast vulnerabilities in Android operating systems, finding that ARIMA provided competitive accuracy relative to more complex methods. Pokhrel et al. [7] compared ARIMA and Support Vector Machine (SVM) models for operating system vulnerability forecasting, observing that ARIMA captured linear trends effectively while SVM handled nonlinear patterns, suggesting that hybrid approaches may be necessary where both trend types coexist. Werner et al. [8] further demonstrated ARIMA’s forecasting utility by predicting future attack frequency using the Hackmageddon dataset, achieving improved accuracy over naive baselines. While these studies confirm ARIMA’s capacity to model temporal patterns in cybersecurity data, they employ ARIMA exclusively as a forecasting tool—predicting future vulnerability counts or attack volumes—rather than exploiting its anomaly detection potential for identifying malicious traffic within network flows. This leaves open the question of whether ARIMA’s ability to detect deviations from expected temporal patterns can be leveraged as a filtering mechanism within an intrusion detection pipeline.

Complementary research has explored other temporal and statistical methods for network anomaly detection. Brutlag [9] combined wavelet analysis with Bayesian methods to detect network attack onset by identifying intervals of high posterior probability in traffic features such as connection attempts and byte volume, demonstrating improved detection accuracy and processing efficiency compared to raw time series analysis. Such approaches confirm that temporal anomaly detection can effectively identify malicious network behavior; however, they share a fundamental limitation with ARIMA-based forecasting: they detect that an anomaly has occurred but do not classify what type of attack is present. For operational intrusion detection, anomaly identification alone is insufficient—security analysts require attack type categorization to prioritize and respond effectively.

More broadly, data-driven approaches to attack detection have advanced across both network security and adjacent cyber–physical domains. In network intrusion detection, hybrid deep learning frameworks combining architectures such as CNNs, LSTMs, and autoencoders have demonstrated strong performance by capturing spatial and temporal patterns in network traffic, while Decision Trees and ensemble methods remain widely used due to their interpretability and computational efficiency. Data-driven detection has also been applied to infrastructure security—for example, subspace identification methods have been used to construct input–output models for detecting and localizing false data injection attacks in DC microgrids [10]. While these studies collectively advance the state of data-driven attack detection, they predominantly employ end-to-end classification or deep learning architectures that treat temporal feature extraction and attack classification as a unified learning task. In contrast, the present study decouples these functions into a two-stage pipeline—statistical anomaly filtering followed by supervised classification—providing a fundamentally different architectural approach to leveraging temporal patterns for intrusion detection.

To address the respective limitations of temporal models and supervised classifiers, several studies have explored hybrid architectures combining ARIMA with machine learning. Liang and Ismail [11] integrated ARIMA with optimized decision tree models using Complete Empirical Mode Decomposition (CEEMD), decomposing time series into Intrinsic Mode Functions and applying ARIMA to stationary components while using decision trees for non-stationary components. Çavuş, Büyükşahin, and Ertekin [12] similarly integrated ARIMA with Artificial Neural Networks via Empirical Mode Decomposition, while Khandelwal, Adhikari, and Verma [13] used Discrete Wavelet Transform to separate linear components (modeled by ARIMA) from nonlinear components (modeled by ANN). These hybrid approaches share a common architectural paradigm: decomposition-and-recombination, in which ARIMA and the classifier jointly contribute to improved forecasting accuracy by modeling different signal components. Critically, however, they all operate in non-cybersecurity domains (financial forecasting, energy prediction) and target forecast accuracy metrics (MSE, MAE) rather than intrusion detection objectives. None employ ARIMA as an anomaly detection filter that pre-selects suspicious traffic for subsequent attack classification—a fundamentally different architectural role than modeling linear residuals within a decomposition pipeline.

From a practical standpoint, processing the large-scale network datasets required for realistic intrusion detection evaluation demands scalable computational frameworks. Moustafa and Slay [14] underscored this requirement in their application of Apache Spark to large-scale network traffic analysis, demonstrating that distributed processing architectures are essential when datasets contain tens of millions of records. While their work focused on clustering methods, the scalability imperative they identified extends directly to any pipeline processing packet-level and byte-level network flow data at the scale present in datasets such as UWF-ZeekData22 [3] and UWF-ZeekData24 [4].

The foregoing review reveals a convergence of capabilities that have not yet been combined in the literature. ARIMA has proven effective at temporal pattern modeling in cybersecurity contexts but has been used only for forecasting, not detection. Temporal anomaly detection methods identify malicious deviations but lack classification capability. Existing ARIMA-classifier hybrids improve forecasting accuracy through decomposition but have not been applied to intrusion detection, nor have they used ARIMA as an anomaly filter rather than a linear trend modeler. This study addresses these collective gaps by proposing a sequential filter-then-classify architecture in which ARIMA serves as a statistical anomaly detection filter on large-scale network flow data, and a Decision Tree classifier categorizes the filtered anomalies using MITRE ATT&CK framework labels—an approach that differs architecturally, operationally, and in evaluation objectives from existing ARIMA-classifier combinations.

The rest of this paper is organized as follows. Section 3 presents the materials and methods, that is, a description of the datasets used, the theoretical framework and experimental design. Section 4 presents the results; Section 5 presents the discussion and Section 6 presents the conclusions.

3. Materials and Methods

3.1. Datasets

This work utilizes two newly created MITRE ATT&CK labeled datasets, UWF-ZeekData22 [3] and UWF-ZeekData24 [4], both available at [5]. Appendix A and Appendix B provide a breakdown of how these datasets were used in this work. The tables (Table A1 and Table A2) include a breakdown of whether the sub dataset contained attack or benign data, and the quantity of each attack data type it contained.

This study evaluated the hybrid model’s performance across 13 attack types: Collection, Command and Control, Credential Access, Defense Evasion, Discovery, Execution, Exfiltration, Impact, Lateral Movement, Persistence, Privilege Escalation, Reconnaissance, and Resource Development. Each attack type was processed independently to assess performance variations across different tactical behaviors. However, post-analysis revealed that only 7 attack types maintained sufficient benign sample representation for valid classification assessment. The remaining 6 attack types (Collection, Command and Control, Execution, Impact, Lateral Movement, Resource Development) lacked attack samples in their test sets, resulting in uninformative perfect scores that could not assess false positive rates or real-world detection capability.

Though the two datasets, UWF-ZeekData22 [3] and UWF-ZeekData24 [4], have some common attack types, the proportion or number of attacks vary. For example, both datasets contained credential access data, but UWF-ZeekData22 [3] only contains 31 total entries whereas UWF-ZeekData24’s [4] has 871,188 total entries. Table 1 and Table 2 show a break down on the attacks of UWF-ZeekData22 [3] and UWF-ZeekData24 [4], respectively.

Table 3 and Table 4 show the attack types with their descriptions that were used in this study. The attack types that contained too few entries to be utilized in the evaluation were excluded from the study.

3.2. Theoretical Framework

This section briefly introduces some of the topics needed to understand this work: the ARIMA model; the decision tree model; time series analysis; and the MITRE ATT&CK Framework.

3.2.1. A Brief Explanation of the ARIMA Model

The AutoRegressive Integrated Moving Average (ARIMA) model is a widely used statistical technique for time series forecasting that combines three key components: autoregression (AR), differencing (I), and moving average (MA) [25]. The AR component models the relationship between an observation and several lagged observations, the I component makes the time series stationary by differencing the data, and the MA component captures the relationship between an observation and a residual error from a moving average model applied to lagged observations. ARIMA is particularly effective for univariate time series data with patterns such as trend and autocorrelation, and its parameters (p, d, q) correspond to the order of the AR, I, and MA parts, respectively.

3.2.2. A Brief Explanation of the Decision Tree Model

The Decision Tree model is a supervised learning algorithm used for classification and regression tasks that operates by recursively partitioning the data space into subsets based on feature values, forming a tree-like structure where each internal node represents a decision on an attribute, each branch represents an outcome of the decision, and each leaf node represents a class label or output values [26]. The model is interpretable, handles both numerical and categorical data, and can capture non-linear patterns without requiring feature scaling, making it widely used in various domains including cybersecurity [27].

3.2.3. A Brief Explanation of the Time-Series Analysis

Time series analysis is a statistical method focused on modeling data collected sequentially over time. It enables the examination of trends, temporal dependencies and cyclic behavior within a dataset [8]. It can also be used for forecasting and anomaly detection in various domains [28]. By modeling temporal dependencies within data, time-series analysis supports decision-making in fields like cybersecurity, often leveraging methods like ARIMA, exponential smoothing, and machine learning approaches to predict and understand system dynamics [29,30]. Accurate time-series analysis is essential for real-time monitoring and proactive interventions in dynamic environments [31].

3.2.4. A Brief Explanation of the MITRE ATT&CK Framework

The MITRE ATT&CK (Adversarial Tactics, Techniques, and Common Knowledge) Framework [32] is a comprehensive, globally accessible knowledge base that details the behaviors and techniques cyber adversaries use across various stages of an attack lifecycle. Developed by the MITRE Corporation, ATT&CK is organized around real-world observations and is structured into matrices representing tactics or the “why” of an attack and the techniques or the “how” of an attack. This framework helps organizations detect, prevent, and respond to cyber threats more effectively. It supports threat intelligence, security assessments, and the development of detection and mitigation strategies by mapping threats to specific adversarial actions. ATT&CK is widely adopted in both the public and private sectors for improving cybersecurity postures through standardized terminology and methodologies [33].

3.3. Hybrid Model Approach

In this study, time series analysis is applied to window segments of continuous network flow data from the UWF-ZeekData22 [3,5] and UWF-ZeekData24 [4,5] datasets. Given the large scale and streaming nature of this data, time series analysis is a natural and effective approach. The datasets include timestamp information, allowing for the creation of time interval windows for analysis. An AutoRegressive Integrated Moving Average (ARIMA) model for time series forecasting [8] is applied. ARIMA predicts future values based on past observations by combining three components: the autoregressive order (p), the degree of differencing required to achieve stationarity (d), and the moving average order (q). Before building the model, these parameters must be carefully selected to optimize predictive accuracy [8].

Given the nature of the UWF-ZeekData22 [3,5] and UWF-ZeekData24 [4,5] datasets and the predictive goals of this research, ARIMA is particularly well-suited for uncovering statistical patterns and forecasting network behavior. Building upon this foundation, the next sections outline the methodology for implementing the ARIMA model, including preprocessing steps to ensure stationarity, determining optimal (p, d, q) values, and evaluating forecasting performance. By leveraging ARIMA, the project aims to identify meaningful deviations in network traffic that may indicate anomalous events. However, while ARIMA is highly effective for detecting temporal anomalies, it cannot independently classify the nature of these anomalies. To address this limitation, the study integrates a supervised Decision Tree classifier trained on MITRE ATT&CK-labeled data. This hybrid approach enables not only the detection of deviations in network behavior but also their classification into specific cyberattack types. The combination of ARIMA and Decision Tree models forms the core of the hybrid methodology described in subsequent sections.

3.4. Experimental Design

Three experimental methodologies were used. The first focused on applying the ARIMA model to forecast network activity and detect anomalies. The second used a Decision Tree model to classify data as either attacks or benign. The third and final methodology combined both models by first establishing a baseline using only a Decision Tree, then comparing it to a hybrid approach in which anomalies detected by ARIMA were passed to the Decision Tree for classification. Each methodology includes a discussion of outcomes and adjustments made as needed. The following sections present the results, demonstrating how the hybrid approach improved prediction accuracy.

3.5. ARIMA Model Implementations

Figure 1 presents the methodology taken for the ARIMA model. First the datasets were preprocessed, followed by the actual processing, that is, using the ARIMA model, and then the evaluation of the ARIMA model. The final subsection presents a discussion of the shortcoming of just using the ARIMA model.

Both the UWF-ZeekData22 [3,5] and UWF-ZeekData24 [4,5] datasets have 23 attributes [4,34]. Given the time series nature of this study, the following attributes were selected: time-related attributes (ts, duration); byte-related attributes (orig_bytes, resp_bytes, missed_bytes); packet-related attributes (orig_pkts, resp_pkts); IP byte-related attributes (orig_ip_bytes, resp_ip_bytes); and label attributes (label_tactic). These attributes were chosen to account for time durations necessary for time series analysis, volume-based measurements over time, and labeling to differentiate between benign and attack data. A summary of these attributes and the reason for their inclusion is provided in Table 5.

Once the attributes were selected, data preparation continued with the handling of missing values. Numerical values were filled with zeros, and string values were filled with “none” to prevent skewing the data due to unaccounted-for entries.

The next step involved converting attributes into usable forms, including transforming the ts attribute into a usable timestamp for the ARIMA model and converting the label_tactic attribute into a binary value to indicate whether a data row represented benign or attack behavior. The label_tactic attribute was primarily used for performance assessment and not for ARIMA model training; however, it was later utilized as a supervised learning label for the Decision Tree model.

Window size selection is critical for ARIMA [35]. After experimentation, 1 min intervals were selected to balance sensitivity, data stability, and processing time. Larger window sizes were found to decrease the model’s ability to effectively detect anomalies.

For the train/test split, 70% of the data used for training and 30% reserved for testing ARIMA model performance. Other split ratios were applied in later methods to explore concept validation and alternative approaches.

Once the dataset was preprocessed, the ARIMA model was trained and the fitted ARIMA model was used to generate forecasts by leveraging the built-in functions of the ARIMA model library.

Model parameters included (p, d, q) values, which represent the AutoRegressive (AR) order, Differencing (I) order for stationarity, and Moving Average (MA) order, respectively. The AR order (p) is the number of lag observations (past values) included in the model to predict the current value. It represents how strongly past values influence the present. A higher (p) means more past terms are considered. Differencing order (d) is the number of times the time series is differenced to make it stationary (i.e., to remove trends or seasonality and stabilize the mean). If the series is already stationary, d = 0. Moving average (q) is the number of lagged forecast errors (residuals) included in the model to predict the current value. It captures the relationship between past forecast errors and the current observation.

Prior to parameter selection, data preprocessing and stationarity validation were performed. To stabilize variance in the network traffic features, a log transformation was applied using the log1p function (log(1 + x)) on the sum_orig_bytes feature before ARIMA model fitting. Stationarity was then assessed using the Augmented Dickey–Fuller (ADF) test on the log-transformed training data, with the null hypothesis of non-stationarity rejected when p-value < 0.05, indicating the time series was stationary and suitable for ARIMA modeling.

Seasonality detection was performed using the Autocorrelation Function (ACF) computed over 50 lags. Significant autocorrelations were identified using the statistical threshold of ±1.96/√n, where n is the sample size. The algorithm detected seasonal patterns by examining gaps between consecutive significant lags—when consistent periodic spacing was identified, a seasonal period m was determined, and seasonal ARIMA models with seasonal_order = (0, 0, 0, m) were evaluated. When no clear periodicity was detected, non-seasonal ARIMA models were fitted.

For this application, (p, d, q) values were limited to a search range of p ∈ {0, 1, 2, 3, 4}, d ∈ {0, 1}, and q ∈ {0, 1, 2, 3, 4}, yielding 50 candidate model configurations per partition. Initial exploratory analysis used Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots, which are generally qualitative by nature and would require manual review for each partition. To enable scalable and reproducible parameter selection across hundreds of data partitions, automated iterations for (p, d, q) selection were developed that tested all combinations through exhaustive grid search, evaluating each model’s Akaike Information Criterion (AIC) and selecting the configuration with the lowest AIC value. For each candidate (p, d, q) combination, the ARIMA model was fitted using maximum likelihood estimation, and the AIC was computed. The AIC balances model fit against complexity using the formula AIC = 2k − 2ln(L), where k is the number of parameters and L is the maximized likelihood, thereby penalizing overly complex models. For scale, a stepwise search algorithm could be used if higher (p, d, q) values were required for a given dataset, but limiting possible (p, d, q) selections to a range of 0 to 5 reduced computational overhead while proving sufficient for the network traffic patterns observed in this study.

Since AIC testing penalizes models with more parameters to avoid overfitting, many testing iterations resulted in models with low (p, d, q) values such as (0, 0, 0). This result is common for datasets with a high volume of random fluctuations around a constant mean; therefore, adding autoregressive (p) or moving average (q) terms did not improve AIC results enough to offset the penalty of increased complexity. Additionally, if the time series was already stationary (as confirmed by ADF testing), differencing (d) was unnecessary and remained 0. Other iterations resulted in higher (p, d, q) values such as (2, 0, 3), (0, 0, 1), or (1, 0, 3). These results provided strong support for reducing the testing range when searching for (p, d, q) values to a maximum of 5, since no datasets used for this research exceeded this maximum.

Model validation was performed using Mean Absolute Error (MAE) on the held-out test set, calculated as MAE = (1/n)Σ|forecast_i − actual_i|, where forecast_i represents the ARIMA-predicted value and actual_i represents the observed value. The MAE threshold (forecast ± MAE) serves to filter the dataset: observations falling outside this range are flagged as temporal anomalies and retained, while those within the range are excluded as normal traffic. As will be discussed in Section 3.7, the filtered dataset is then classified by the Decision Tree using the original network flow features for attack type identification.

Table 6 presents the optimal (p, d, q) parameter configurations determined through AIC-based grid search for the Reconnaissance tactic across five consecutive 100,000-row partitions of the combined UWF-ZeekData22 [3,5] and UWF-ZeekData24 [4,5] datasets. Each partition was independently optimized, with the parameter combination yielding the lowest AIC selected from 50 candidate models per partition. The resulting parameter heterogeneity—ranging from simple white noise models (0, 0, 0) to more complex specifications such as ARIMA (2, 0, 3)—demonstrates that network traffic exhibits non-stationary statistical properties that vary across data partitions. This variation validates the adaptive per-partition optimization strategy employed in this study: applying universal (p, d, q) values across all data chunks would be suboptimal given the diverse autocorrelation structures observed in different time periods and attack densities. Similar parameter variation was observed when modeling other MITRE ATT&CK tactics, confirming that optimal ARIMA configurations are dependent on both the temporal characteristics of the data partition and the specific attack type being modeled. Note that d = 0 in Table 6 across all partitions which indicates the log-transformed time series was already stationary, eliminating the need for differencing.

Figure 2 presents a sample of the forecasted results for UWF-ZeekData22 [3]. All values in the testing set are graphed against the forecasted values, along with the Mean Absolute Error (MAE) threshold calculated by the model. The MAE threshold is computed as the mean of the absolute differences between forecasted and actual values in the test set: MAE = (1/n)Σ|forecast_i − actual_i|. The anomaly detection boundaries are then defined as [forecast − MAE, forecast + MAE], creating symmetric bounds around each forecasted point. This approach differs from alternative threshold methods such as using μ ± kσ (where μ is the mean prediction error and σ is the standard deviation with k typically set to 2 or 3), or percentile-based thresholds (such as retaining only observations beyond the 95th percentile of residuals). The MAE-based threshold was selected for this study because it is less sensitive to extreme outliers compared to standard deviation-based methods and provides interpretable, consistent bounds across varying traffic volumes.

This testing set contains only attack data. If the ARIMA model were used to identify attacks solely based on whether actual values fall outside the MAE threshold boundaries, then any values within the threshold would be ignored—even though they may also represent attacks. This highlights a flaw in relying exclusively on time series analysis for attack detection.

3.6. Decision Tree Implementation

The preprocessing for the Decision Tree models involved loading multiple datasets from the Hadoop Distributed File System (HDFS) [36], which were in Parquet format, for both the UWF-ZeekData22 [3] and UWF-ZeekData24 [4] datasets. The Decision Tree classifier was implemented using Apache Spark’s (version 3.5.5) MLlib framework, which provides scalable machine learning capabilities suitable for processing the large-scale network flow data contained in the UWF-ZeekData datasets.

The Decision Tree model utilized five numerical features extracted from network flow data to classify connections as either benign or malicious for each MITRE ATT&CK tactic. These features capture both temporal and volumetric characteristics of network connections and were selected based on their relevance to identifying anomalous network behavior: (1) duration—the temporal length of the network connection in seconds, which can indicate prolonged data exfiltration or command-and-control communications; (2) orig_bytes—the number of application-layer bytes sent from the connection originator (source), representing outbound payload size; (3) resp_bytes—the number of application-layer bytes sent from the responder (destination), representing inbound payload size; (4) orig_ip_bytes—the total IP-layer bytes from the originator, including headers and payload; and (5) resp_ip_bytes—the total IP-layer bytes from the responder, including headers and payload. These features were assembled into feature vectors using PySpark’s VectorAssembler, with invalid values being filtered out to ensure data quality.

The classification task was formulated as a binary problem for each of the 13 MITRE ATT&CK tactics evaluated in this study. For a given tactic (e.g., Reconnaissance, Credential Access), the label_tactic column from the dataset was transformed into a binary label where connections matching the target tactic were assigned a label of 1 (attack), and all other connections, including benign traffic and other attack types, were assigned a label of 0 (benign). This approach enabled the construction of 13 independent binary classifiers, each specialized in detecting a specific attack tactic rather than attempting multi-class classification across all tactics simultaneously.

The Decision Tree classifier was configured with the following hyperparameters, utilizing Spark MLlib’s default settings: the splitting criterion was Gini impurity, which measures node impurity based on the probability of misclassifying a randomly chosen element; the maximum tree depth was set to 5, limiting the number of splits from root to leaf to prevent overfitting; the minimum number of instances per node was 1, allowing splits to continue until maximum depth or purity is achieved; and the minimum information gain for a split was 0.0, meaning any split providing positive information gain was permitted. No post-pruning techniques were applied, as Spark’s Decision Tree implementation relies on pre-pruning through the maximum depth constraint.

These default hyperparameters were deliberately retained without tuning to establish a controlled baseline that isolates the impact of ARIMA preprocessing. While hyperparameter optimization through grid search over max_depth, minInfoGain, and minInstancesPerNode could potentially improve classifier performance, maintaining consistent Decision Tree configuration across all experiments ensures that performance improvements can be attributed definitively to ARIMA’s temporal anomaly filtering rather than confounding effects from classifier optimization. The study’s primary research objective is demonstrating that ARIMA preprocessing enables detection of previously undetectable attacks, not maximizing classifier performance through exhaustive hyperparameter search.

Regarding overfitting concerns with imbalanced classes, the severe class imbalance (approximately 30,000 benign samples vs. 10–100 attack samples per partition, ratios of ~300:1) was primarily addressed through ARIMA filtering rather than Decision Tree hyperparameter tuning. ARIMA’s MAE threshold filter removed benign traffic conforming to expected statistical patterns while preserving attack signatures, reducing the benign class by approximately 98%. This dramatic reduction in class imbalance (from ratios of 300:1 to approximately 6:1 post-filtering) mitigates overfitting risk more substantially than Decision Tree regularization parameters would achieve. Additionally, the max_depth = 5 constraint provides adequate regularization for the five-feature input space—deeper trees would risk memorizing individual flow patterns rather than learning generalizable attack signatures. Future work should explore whether ensemble methods (Random Forests, Gradient Boosting) or optimized hyperparameters improve detection beyond the ARIMA + baseline-DT framework established here.

Table 7 summarizes the complete experimental configuration, including ARIMA hyperparameters, Decision Tree settings, data partitioning strategy, and runtime environment specifications. The random seed value of 42 was consistently applied across all random operations (data shuffling, train-test splitting, and Spark randomization) to ensure experimental reproducibility. All experiments were conducted on a Hadoop Distributed File System (HDFS) cluster with dynamic Spark executor allocation, enabling scalable processing of the multi-million-row UWF-ZeekData22 [3] and UWF-ZeekData24 [4] datasets stored in compressed Parquet format.

The training process involved splitting the dataset into training and testing sets with a 70% training and 30% testing ratio for each tactic. Attack samples were separated from benign samples, and both were independently split using a random seed of 42 to ensure reproducibility. The training and testing sets were then constructed by combining the respective attack and benign splits. However, the datasets exhibited extreme class imbalance, particularly for stealthy attack types with limited attack volumes. For tactics such as Defense Evasion, Discovery, Persistence, Privilege Escalation, and Exfiltration, baseline chunks contained 99.94–99.99% benign samples (approximately 30,000+ benign instances versus 2–18 attacks per chunk in the 100,000-row partitions). This severe class imbalance presented significant challenges for the Decision Tree classifier, as discussed in Section 4, where baseline models often defaulted to classifying all instances as benign to maximize accuracy, resulting in 0% recall for minority attack classes. In contrast, high-volume attack types like Reconnaissance (9.3 million attacks) exhibited different imbalance characteristics, though benign traffic still dominated the overall dataset composition.

Model performance was evaluated using standard classification metrics computed by Spark’s MulticlassClassificationEvaluator: accuracy (the proportion of correct predictions), weighted precision (the average precision weighted by class support), weighted recall (the average recall weighted by class support), and F1 score (the harmonic mean of precision and recall). Additionally, confusion matrices were generated for each tactic to provide detailed insight into true positives, true negatives, false positives, and false negatives, enabling analysis of both detection capability and false alarm rates. These metrics were computed for both baseline Decision Tree models (DT-baseline) trained and tested on unfiltered data, and hybrid models (DT-ARIMA) trained on baseline data but tested on ARIMA-filtered anomalies, allowing direct comparison of detection performance with and without time series preprocessing.

3.7. Hybrid Model Implementation

The two models, ARIMA and DT, were combined to make a clear conclusion regarding predictions using the two datasets, UWF-ZeekData22 and UWF-ZeekData24. Neither dataset independently contained sufficient volume across all 13 evaluated MITRE ATT&CK tactics for robust classification assessment, and because the datasets were collected during separate time periods with different attack profiles and proportions (see Table 1 and Table 2), their original timestamps could not be meaningfully interleaved into a shared chronological sequence. For pre-processing, the two datasets were combined into a single data frame, randomly distributed, and partitioned into smaller sets of 100,000 rows, as per Figure 3. The 100,000-row chunk size was selected to balance multiple considerations: enabling 5-fold cross-validation for statistical reliability, providing sufficient temporal depth for ARIMA forecasting (which requires adequate historical context to model patterns), maintaining computational efficiency within PySpark memory constraints, and ensuring adequate attack representation across diverse attack volumes ranging from 559 Exfiltration attacks to 9.3 million Reconnaissance attacks in the combined dataset. The chunk size also revealed critical class imbalance challenges inherent in the datasets. For stealthy attack types with limited attack volumes—Defense Evasion, Discovery, Persistence, Privilege Escalation, and Exfiltration—baseline chunks contained 99.94–99.99% benign samples (approximately 30,000+ benign instances versus 2–18 attacks per chunk). This extreme imbalance, as discussed in Section 4, contributed directly to baseline detection failures where Decision Trees defaulted to classifying all instances as benign to maximize accuracy. The identification and quantification of this class imbalance became a key finding in understanding both the limitations of baseline detection for rare attacks and ARIMA’s role in addressing this challenge through intelligent data reduction.

For the DT-ARIMA sets, the partitions had to be pre-processed for the ARIMA model. Once the DT-ARIMA sets were pre-processed, they could be checked for stationarity before the ideal ARIMA parameters were tested for each of the partitions. Each partition was then processed through the ARIMA model with parameters used for the ARIMA model, with forecasting being generated for each and the error calculations completed to find outliers. Post processing of the data occurred to exclude outliers found from the error calculations and were converted to the required format for Decision tree testing. Figure 4 presents this process.

For the DT-baseline, preprocessing of the partitions occurred which included reconfiguration of data format, feature selection (duration, orig_bytes, resp_bytes, orig_ip_bytes, resp_ip_bytes), and label encoding to ensure the data frames were viable for the decision tree model. See Figure 5 for this process.

With both the DT-ARIMA result partitions and DT-baseline partitions prepared, each of the 13 target attack types (Collection, Command and Control, Credential Access, Defense Evasion, Discovery, Execution, Exfiltration, Impact, Lateral Movement, Persistence, Privilege Escalation, Reconnaissance, and Resource Development) was filtered independently so that only data containing that specific attack type and benign data were present in the partitions. These partitions were then processed through the DT training set against the filtered (DT-ARIMA-Result) and baseline testing sets to determine if the accuracy, precision, and other evaluations differed between the two systems. Each set also included confusion matrices to help further understand performance differences. Figure 6 presents a flowchart of this process.

It should be noted that this filtering introduces a methodological distinction from the classification formulation described in Section 3.6. During Decision Tree training, labels are formed as a binary problem where the target tactic is labeled 1 and all other connections—including other attack types and benign traffic—are labeled 0. However, the evaluation partitions described above are filtered so that only the target tactic and benign (“none”) data are present, effectively removing other attack types from the test population. This filtered evaluation represents a simpler classification problem than the “tactic vs. all other events” formulation used during training, which is closer to operational reality where the classifier must distinguish a specific attack type from all other traffic including other attacks. As a result, the reported ARIMA-DT and baseline metrics may be inflated relative to what would be observed in an operational deployment where other attack types are present in the data stream. End-to-End metrics, which measure pipeline performance against the original test population, partially mitigate this concern by accounting for attacks lost during ARIMA filtering, but the underlying evaluation partition still excludes non-target attack types. Future work should evaluate the hybrid model under the full multi-class classification problem to assess performance degradation when other attack types are present as potential confounders.

The preprocessing of the datasets was conducted to prepare them for analysis using the ARIMA and DT models. The datasets, derived from UWF-ZeekData24 and UWF-ZeekData22 attack profiles, included multiple attack types such as Credential Access, Reconnaissance, and others. Initially, the data was loaded from multiple Parquet files into Spark DataFrames. Relevant columns, including ts, duration, orig_bytes, resp_bytes, orig_ip_bytes, resp_ip_bytes, and label_tactic, were selected, and missing values were filled with default values (e.g., 0 for numerical columns and “none” for categorical columns). The combined dataset was shuffled using a random seed to ensure randomness and split into training and testing sets with a 70/30 ratio. Additionally, a pseudo-timestamp column was created by incrementing the current timestamp with one-minute intervals for each row, ensuring a consistent temporal structure for time-series analysis. The label_tactic column was processed to create binary labels (label_tactic_binary) facilitating aggregation and analysis.

For the ARIMA model, the data underwent additional preprocessing steps. The data was passed through a one-second aggregation window based on the timestamp (ts) column; however, because the pseudo-timestamps are spaced at one-minute intervals, each window contains exactly one flow record, and the aggregation functions operate as pass-through operations on individual rows. Metrics including duration, orig_bytes, resp_bytes, orig_ip_bytes, and resp_ip_bytes, the label_tactic_binary value, and one-hot encoded labels were processed per row. The Spark DataFrames [37] were then converted to Pandas DataFrames [38], as the ARIMA model required tabular data. A log transformation was applied to the sum_orig_bytes column to reduce the impact of outliers and stabilize variance. The processed data was split into training and testing datasets, with the training dataset used to fit the ARIMA model and the testing dataset used for forecasting and evaluation. Because each window contains a single flow record, no aggregation-based signal loss occurs—individual attack flows retain their original feature values without being diluted by co-located benign traffic. The Decision Tree classifier similarly operates on original per-flow features, preserving individual flow characteristics for classification. Nevertheless, the MAE threshold filtering constitutes the primary mechanism for potential attack loss: attack flows whose feature values fall within the ARIMA-forecasted bounds are removed alongside conforming benign traffic, as quantified in the end-to-end pipeline metrics. Future work should investigate adaptive threshold strategies or multi-resolution analysis to reduce attack loss during filtering.

An additional methodological consideration that affects the interpretation of all reported metrics concerns the evaluation partition structure. As noted in Section 3.7, evaluation partitions were filtered so that only the target tactic and benign traffic were present, removing other attack types from the test population. This creates a simpler binary classification problem—distinguishing one specific attack type from benign traffic—than the operational scenario where a classifier must distinguish a target attack from all other traffic including other attack types that may share overlapping feature characteristics. For example, Privilege Escalation and Persistence share identical sample counts in UWF-ZeekData24 (Table 2) and may exhibit similar flow-level characteristics, meaning their presence as confounders could degrade classification performance for either tactic individually. The reported precision, recall, and F1-scores across all three evaluation levels—Baseline, ARIMA-DT, and End-to-End—may therefore be inflated relative to what would be observed in an operational deployment where all attack types coexist in the data stream. End-to-End metrics address a different concern (attack loss during ARIMA filtering) but do not mitigate this evaluation simplification, as they are still computed on the filtered partitions that exclude non-target attack types. The magnitude of this inflation is difficult to estimate without conducting the full multi-class evaluation, but it is likely most significant for attack types with similar flow-level profiles and least significant for Reconnaissance, whose high volume and distinctive traffic patterns make it inherently easier to separate from both benign traffic and other attack types. Future work should evaluate the hybrid model under the full multi-class classification problem, where the Decision Tree must distinguish each target tactic from all other traffic—including other attacks—to establish operationally representative performance benchmarks.

An important methodological distinction must be made regarding the role of ARIMA in the hybrid model configuration versus its theoretical capabilities. In its general form, ARIMA models autocorrelation, trends, and seasonality in chronologically ordered data to detect genuine temporal dependencies. However, because the combined dataset was shuffled to ensure representative attack distribution across partitions (as described above), the original chronological ordering of network flows is not preserved. The pseudo-timestamps provide a uniform time index required by the ARIMA model but do not represent real temporal relationships between flows. Consequently, ARIMA in this hybrid configuration functions primarily as a statistical baseline estimator rather than a temporal dependency modeler: it learns the expected distribution of aggregated flow metrics across the training sequence and uses the MAE threshold to identify individual observations that deviate substantially from this learned baseline. The effectiveness of this approach—demonstrated by 95.8% attack retention with 93.5% benign removal for Reconnaissance—is attributable to the fundamental difference in flow characteristics between attack and benign traffic rather than to temporal sequencing of events. Attack flows exhibit distinctive feature values (unusual byte counts, durations, or packet volumes) that deviate from the ARIMA-forecasted baseline regardless of their position in the sequence. This statistical filtering interpretation is consistent with the study’s primary contribution: demonstrating that ARIMA’s forecasting and error quantification framework provides an effective mechanism for intelligent data reduction prior to classification, improving Decision Tree performance by removing conforming benign traffic while preserving anomalous flows.

For the DT model, the preprocessing focused on ensuring compatibility with classification tasks. The columns ts, duration, orig_bytes, resp_bytes, orig_ip_bytes, and resp_ip_bytes were cast to appropriate data types, such as DoubleType for numeric columns and StringType for categorical columns. The training and testing datasets were filtered to include only rows where the label_tactic column matched the target tactic (e.g., “Reconnaissance”) or was labeled as “none”. Missing values in numerical columns were replaced with 0. The schema of the preprocessed DataFrames was validated to confirm that all columns were correctly formatted. Additionally, the distinct label counts in the training and testing datasets were analyzed to ensure balanced representation of the target labels. These preprocessing steps ensured that the datasets were appropriately structured and optimized for the respective models, enabling accurate and efficient analysis.

For a given DT-ARIMA partition of preprocessed data, an ARIMA model was trained using the same methods outlined in the ARIMA Model section where the best p, d, and q values were obtained and used to train the model. After ARIMA model training, a forecasted DataFrame was built using the time range of the DT-ARIMA testing set. The DT-ARIMA testing set was then compared to the forecasted set and any actual testing data that exceeded the MAE threshold from the model was anomalous data. All anomalies were stored in a resulting DataFrame (DT-ARIMA-Result) and then processed to ensure formatting was consistent with the testing set requirements for the decision tree model.

The DT-baseline training set, DT-baseline testing set, and DT-ARIMA-Result set were filtered for the target attack type, Reconnaissance, to provide more accurate results versus simply testing if any attack type was present. The decision tree model was then trained using the corresponding DT-baseline training set for the same respective partition as the DT-ARIMA training set and tested using the respective DT-baseline testing set from the partition split as a control or baseline. The decision tree model was then tested using the resulting DataFrame (DT-ARIMA-Result) from the ARIMA model to determine if the anomalies were in fact attacks or benign data. This process was performed for five partitions of 100,000 rows.

Figure 7 represents one partition of 100,000 rows forecasted using the ARIMA model and graphed with the actual values in the respective testing set, bounded by the MAE threshold.

To further illustrate the density and complexity of the data, Figure 8 is a graphical representation of 10,000 rows of forecasted data with the actual data in the testing set bounded by the MAE threshold. The effectiveness of training and testing across partition sizes was a consideration that was essential for big data, and this methodology lends itself to adapting to varying partition sizes and frameworks.

4. Results

For the combined approach utilizing both the ARIMA and DT models, evaluations are provided in Table 8 and Table 9 as the result of training the decision tree model on each partition of combined and randomly distributed UWF-ZeekData22 and UWF-ZeekData24 datasets to demonstrate the initial effectiveness of the decision tree model without any assistance from the ARIMA model. For each iteration, the performance was then evaluated by testing the same decision tree model on the anomalous data from the ARIMA model instead of the original testing set used for the baseline decision tree model. In this way, the ARIMA model behaves as a preprocessing filter with an objective of maintaining DT performance and reducing computational overhead by removing some steady state data. Comprehensive evaluations were conducted across all 13 MITRE ATT&CK tactics. Analysis of the confusion matrices revealed a critical methodological finding: only 7 of the 13 attack types maintained sufficient class balance or presence of both attack and benign samples for meaningful classification performance assessment. The following attack types with results of N/A represented attack types with insufficient class balance for reasonable confidence and were not included in the following tables: Collection, Command and Control, Execution, Impact, Lateral Movement, and Resource Development. Class imbalance in this case was due to a low volume of present attacks in the original datasets. After the 70/30 split of training and testing data, there were not enough attacks present in either split to train or test to any meaningful extent. In all tested datasets for these attack types, there were 0 attacks present.

Table 8 presents results for Credential Access and Reconnaissance across three evaluation levels: Baseline (Decision Tree only), ARIMA-DT (Decision Tree tested on ARIMA-filtered anomalies), and End-to-End (pipeline performance measured against the original test population). Reconnaissance demonstrated consistent improvement across all evaluation levels. The ARIMA-DT results showed substantial gains in accuracy (+30.37%), precision (+18.69%), recall (+18.97%), and F1-score (+18.83%) over the baseline. End-to-End metrics confirmed that these improvements hold when accounting for attacks filtered by ARIMA, with precision remaining at 99.65%, recall at 95.62%, and F1-score at 97.59%, representing a meaningful improvement over the baseline F1-score of 80.88%. Credential Access presented a more nuanced outcome. The ARIMA-DT evaluation achieved perfect 100% precision, recall, and F1-score on the filtered subset, indicating that the small number of Credential Access attacks surviving ARIMA filtering were highly distinguishable. However, because ARIMA filtering retained only a small fraction of Credential Access attacks, the End-to-End metrics reveal the cost of this filtering: precision dropped to 0.64%, recall reached 92.60%, and F1-score fell to 1.28%. While End-to-End accuracy remained at 100%, the End-to-End recall of 92.60% is misleading in isolation—the extremely low precision and F1-score indicate that the pipeline’s overall detection capability for Credential Access is worse than the baseline’s 7.41% F1-score. This outcome indicates that Credential Access attack flows exhibit characteristics that closely resemble legitimate authentication traffic, causing ARIMA to filter out most attacks alongside conforming benign traffic. This result is further analyzed in Section 4.2.

Table 9 presents results for attack types where baseline Decision Tree models completely failed to detect any attacks (0% recall). ARIMA preprocessing enabled detection across four of these five tactics, with results reported at three evaluation levels: Baseline, ARIMA-DT (filtered subset), and End-to-End (pipeline performance measured against the original test population). Post-ARIMA performance on the filtered subset resulted in improved recall for Defense Evasion (91.83%), Discovery (100%), Persistence (86.92%), and Privilege Escalation (89.93%). End-to-End metrics, which account for attacks removed during ARIMA filtering, confirmed that detection enablement persisted across all four tactics: Defense Evasion achieved 67.83% recall with 94.17% precision and 75.93% F1-score; Discovery achieved 63.43% recall with 88.89% precision and 71.10% F1-score; Persistence achieved 73.38% recall with 91.00% precision and 80.10% F1-score; and Privilege Escalation achieved 64.68% recall with 94.82% precision and 74.60% F1-score. While End-to-End recall is lower than the ARIMA-filtered results due to attack loss during the filtering stage, these metrics represent a substantial improvement over the baseline 0% recall, confirming that the hybrid pipeline enables meaningful detection of stealthy attack types that were previously completely undetectable. Exfiltration was notably unable to detect attacks before or after ARIMA preprocessing. As illustrated later in Section 4.4, ARIMA filtering eliminated false negatives by removing the only present attacks from the data passed to the Decision Tree, resulting in all true negative classifications. While this produced a misleading 100% accuracy on the filtered subset, End-to-End evaluation confirms no detection capability, as no attacks survived the filtering stage. It is included in Table 9 since it did not have any detection before ARIMA preprocessing. For Table 9, N/A represents results for testing where no attack detection occurred and where a divide by zero would result since there were no true positives (TP).

4.1. Reconnaissance

Reconnaissance was selected as a detailed representative example because it provided the most reliable and informative results and the largest available count of attacks. Table 10 represents the outcome of testing for Reconnaissance attacks.

The baseline confusion matrices in Table 11, Table 12, Table 13, Table 14 and Table 15 showed a higher number of false positives and false negatives compared to the ARIMA-DT result matrices in Table 16, Table 17, Table 18, Table 19 and Table 20. For example, in the first set with a 0–100,000 row index range presented in Table 11, the baseline resulted in 4547 false positives and 4713 false negatives, and the ARIMA-DT results in Table 16 demonstrated an improvement of 39 false positives and 62 false negatives. This demonstrates the ARIMA-DT model’s effectiveness in reducing classification errors and improving the DT’s ability to correctly identify Reconnaissance attacks.

For clear representation, a comprehensive overview of Reconnaissance is illustrated in Table 21 and Table 22. Table 22 reports both ARIMA-filtered metrics, computed on the subset of data ARIMA flagged as anomalous, and End-to-End pipeline metrics, which account for attack samples that ARIMA filtered out by treating them as false negatives. ARIMA retained approximately 95.8% of attack samples across chunks while removing approximately 93.5% of benign samples, reducing the average dataset from ~30,059 rows to ~400 rows per chunk. On the ARIMA-filtered subset, the Decision Tree achieved 99.65% average precision and 99.77% average recall. However, the End-to-End pipeline metrics, which provide the operationally relevant measure of detection performance, show 99.65% precision, 95.62% recall, 96.18% accuracy, and 97.59% F1 score. This represents a substantial improvement over the baseline Decision Tree’s 80.96% precision, 80.80% recall, 69.06% accuracy, and 80.88% F1 score, confirming that ARIMA preprocessing meaningfully improves Reconnaissance detection even when accounting for the approximately 4.2% of attacks lost during filtering.

4.2. Credential Access

Beyond Reconnaissance, Credential Access illustrates an important limitation of the ARIMA filtering approach. Table 23 presents the results before using ARIMA filtering, and it can be noted that the results were worse than Table 24. On the ARIMA-filtered subset (Table 24), perfect classification was achieved with complete elimination of false positives and false negatives, indicating that the small number of Credential Access attacks surviving ARIMA filtering were highly distinguishable. However, End-to-End pipeline metrics reveal that ARIMA filtering retained only approximately 0.6% of Credential Access attacks (averaging 14 of 2240 per chunk), treating the remaining 99.4% as conforming to expected statistical patterns. The resulting End-to-End recall of 0.64% and F1-score of 1.28% are lower than the baseline model’s 7.48% recall and 7.41% F1-score. This outcome indicates that Credential Access attack flows—which involve authentication-related traffic—exhibit duration, byte count, and packet volume characteristics that closely resemble legitimate credential operations. Unlike Reconnaissance, where attack flows produce distinctive feature deviations that exceed the MAE threshold, Credential Access attacks blend with benign authentication traffic at the flow level, making MAE-based filtering counterproductive for this attack type. This finding highlights that the hybrid model’s effectiveness is dependent on the degree of feature-level distinction between attack and benign traffic.

4.3. Enabled Detection

As previously noted in Section 4.1, Defense Evasion, Discovery, Persistence, and Privilege Escalation all exhibited complete baseline detection failure (0% recall illustrated in Table 25, Table 26, Table 27 and Table 28), with decision trees classifying all instances as benign despite the presence of actual attacks. These tactics share a common characteristic: they involve subtle, low-volume activities designed to blend with legitimate system operations such as credential theft mimicking normal authentication, internal reconnaissance resembling routine network queries, persistence mechanisms appearing as standard scheduled tasks, and privilege escalation disguised as authorized access elevation. ARIMA preprocessing transformed detection performance across all four tactics. On the ARIMA-filtered subset, recall reached 91.83%, 100%, 86.92%, and 89.93% for Defense Evasion, Discovery, Persistence, and Privilege Escalation respectively (Table 29, Table 30, Table 31 and Table 32). End-to-End pipeline metrics, which account for attacks removed during ARIMA filtering, confirmed that detection enablement persisted: Defense Evasion achieved 67.83% recall with 75.93% F1-score, Discovery achieved 63.43% recall with 71.10% F1-score, Persistence achieved 73.38% recall with 80.10% F1-score, and Privilege Escalation achieved 64.68% recall with 74.60% F1-score. While End-to-End recall is lower than the ARIMA-filtered results due to some attacks being removed during the filtering stage, these metrics represent a fundamental transformation from complete detection failure (0% baseline recall) to meaningful detection capability across all four tactics. The consistent pattern of enablement demonstrates that ARIMA’s statistical anomaly filtering is particularly effective for low-volume, stealthy attack types that evade traditional signature-based or direct traffic analysis approaches.

4.4. Persistent Detection Failure

Exfiltration represents a unique case where neither baseline nor ARIMA-enhanced approaches achieved successful detection. As illustrated in Table 33 and Table 34, the baseline model detected no true positives while misclassifying 8 attacks as benign (FN), achieving 0% recall despite 99.99% accuracy. While ARIMA preprocessing eliminated these false negatives and resulted in all TN classifications with 100% accuracy on the filtered subset as seen in Table 34, this improvement is misleading. The ARIMA preprocessing eliminated false negatives because the only present attacks were filtered out of the data that was passed to the subsequent decision tree model for classification. End-to-End evaluation confirms the complete absence of detection capability, with all metrics reporting N/A-No Detection, as no attacks survived the filtering stage to be classified. The persistent failure across all three evaluation levels—baseline, ARIMA-filtered, and End-to-End—suggests fundamental limitations in using network flow statistics alone for exfiltration detection. Unlike other attack types that generate distinctive statistical patterns or volume anomalies, data exfiltration can occur through legitimate protocols at normal bandwidth rates, often mimicking authorized data transfers or cloud synchronization traffic. The lack of benign exfiltration samples in the balanced evaluation dataset further complicated model training, preventing the classifier from learning distinguishing features. These results indicate that exfiltration detection may require additional features beyond temporal flow analysis such as content inspection, destination reputation scoring, or user behavior analytics to effectively distinguish malicious data theft from legitimate data movement.

5. Discussion

This study proposed and evaluated a hybrid time series forecasting model that combines ARIMA-based anomaly detection with supervised classification using Decision Trees to detect cyberattacks in MITRE ATT&CK-labeled Zeek network flow Big Data for 13 tactics. By leveraging the statistical baseline estimation and MAE-based filtering capabilities of ARIMA and the classification capabilities of Decision Trees, the approach addresses key limitations found in traditional static models. The integration of PySpark for Big Data processing further enabled efficient handling of datasets exceeding 18 million rows, ensuring scalability and adaptability to real-world network environments. The comprehensive evaluation revealed both the promise and limitations of this approach, with outcomes varying significantly based on attack characteristics and data composition.

Of the 13 attack types evaluated, only 7 maintained sufficient class balance for valid classification assessment. Results are reported at three evaluation levels: Baseline (Decision Tree only), ARIMA-DT (Decision Tree tested on ARIMA-filtered anomalies), and End-to-End (pipeline performance measured against the original test population). The hybrid approach demonstrated two distinct benefits across these seven tactics. First, for attacks with existing baseline detection capability, ARIMA preprocessing substantially improved performance. Reconnaissance achieved the most reliable improvement, with ARIMA-DT F1-score increasing from 80.88% to 99.71%, accompanied by a 99% reduction in false positives (from 4626 to 81 average per chunk) and a 99% reduction in false negatives (from 4673 to 54 average). End-to-End metrics confirmed this improvement, with F1-score reaching 97.59%, precision at 99.65%, and recall at 95.62%. With well-balanced class representation of approximately 24,387 benign samples and 24,358 attack samples in baseline testing, Reconnaissance provides strong evidence for ARIMA’s effectiveness in high-volume attack detection. Credential Access presented a more complex outcome. The ARIMA-DT evaluation achieved perfect 100% precision, recall, and F1-score on the filtered subset, up from 7.48% baseline recall, indicating that the small number of Credential Access attacks surviving ARIMA filtering were highly distinguishable. However, End-to-End evaluation revealed that ARIMA filtering removed most Credential Access attacks, resulting in 0.64% End-to-End recall and 1.28% End-to-End F1-score—worse than the baseline model’s 7.41% F1-score. This outcome indicates that Credential Access attack flows exhibit duration, byte count, and packet volume characteristics that closely resemble legitimate authentication traffic, causing ARIMA’s MAE-based threshold to filter out attacks alongside conforming benign traffic. Unlike Reconnaissance, where attack flows produce distinctive feature deviations, Credential Access demonstrates that the hybrid model’s effectiveness is dependent on the degree of feature-level distinction between attack and benign traffic.

More significantly, ARIMA preprocessing enabled detection where baseline Decision Trees completely failed. Four stealthy attack types—Defense Evasion, Discovery, Persistence, and Privilege Escalation—produced 0% recall with baseline models, which classified all instances as benign despite the presence of actual attacks. These tactics represent sophisticated, low-volume activities designed to blend with legitimate system operations: Defense Evasion techniques that mimic normal system behavior, Discovery operations resembling routine network queries, Persistence mechanisms appearing as standard scheduled tasks, and Privilege Escalation disguised as authorized access elevation. After ARIMA preprocessing, detection became possible with substantial ARIMA-DT recall rates: Defense Evasion (91.83%), Discovery (100%), Persistence (86.92%), and Privilege Escalation (89.93%). End-to-End metrics confirmed that detection enablement persisted across all four tactics, though with reduced recall due to attack loss during filtering: Defense Evasion (67.83% recall, 75.93% F1-score), Discovery (63.43% recall, 71.10% F1-score), Persistence (73.38% recall, 80.10% F1-score), and Privilege Escalation (64.68% recall, 74.60% F1-score). This transformation from complete detection failure to meaningful performance demonstrates that ARIMA’s statistical anomaly detection exposes subtle patterns in these stealthy attacks that remain invisible in direct traffic analysis. The baseline Decision Trees were overwhelmed by the challenge of distinguishing these sophisticated attacks from the massive volume of benign traffic (approximately 30,000+ benign samples per chunk), defaulting to classifying everything as negative. ARIMA filtering reduced the data volume by approximately 98% while preserving attack signatures in the residuals, enabling the Decision Tree to focus on genuinely anomalous behavior.

Exfiltration represents the sole attack type that remained undetectable under all three evaluation levels—baseline, ARIMA-DT, and End-to-End—achieving 0% recall across all conditions. While ARIMA preprocessing eliminated the 8 false negatives present in baseline evaluation by filtering out these attack samples entirely, this resulted in all-TN classification rather than successful detection. End-to-End evaluation confirmed the complete absence of detection capability, with all metrics reporting N/A-No Detection. This persistent failure indicates fundamental limitations in using network flow statistics alone for exfiltration detection, as data theft can occur through legitimate protocols at normal bandwidth rates, mimicking authorized transfers. The lack of benign exfiltration samples in the balanced evaluation dataset further complicated model training, preventing the classifier from learning distinguishing features. These results indicate that exfiltration detection may require additional features beyond temporal flow analysis—such as content inspection, destination reputation scoring, or user behavior analytics—to effectively distinguish malicious data theft from legitimate data movement.

The contrasting outcomes between successfully detected stealthy attacks and Exfiltration’s persistent detection failure illuminate a critical distinction in how different categories of “stealth” attacks manifest in network flow data. Defense Evasion, Discovery, Persistence, and Privilege Escalation—despite being low-volume and designed to blend with legitimate operations—generate detectable statistical anomalies because their underlying activities introduce irregular operational sequences within network flows. Defense Evasion techniques such as log manipulation or security tool disabling create atypical patterns in connection timing and flow duration that deviate from the steady-state behavior ARIMA models as baseline. Discovery operations, while resembling routine network queries individually, produce clustered bursts of reconnaissance-like probing within short temporal windows that diverge from normal query periodicity. Persistence mechanisms introduce anomalous scheduling patterns—establishing callback channels or modifying startup configurations generates flow characteristics (unusual destination frequency, periodic low-bandwidth connections) that, while individually subtle, create statistically detectable deviations from forecasted traffic baselines. Privilege Escalation similarly produces transient anomalies in authentication-related flow features during the elevation event itself. In each case, the attack’s operational footprint—however small in volume—disrupts the temporal regularity that ARIMA captures, producing elevated MAE residuals that survive the anomaly threshold. Exfiltration, by contrast, lacks this temporal signature entirely. Data theft can be executed through legitimate transfer protocols (HTTPS, cloud storage APIs) at bandwidth rates indistinguishable from authorized file transfers or routine cloud synchronization. Unlike the other stealthy attack types whose operations introduce irregular sequences detectable through statistical pattern deviation, Exfiltration mimics not only the volume but also the timing and protocol characteristics of legitimate large file movements, producing flow statistics (duration, bytes transferred, packet counts) that fall within ARIMA’s expected forecasting range. The absence of temporal anomaly in Exfiltration traffic explains why ARIMA filtering removed all Exfiltration attacks alongside conforming benign traffic—the attack generates no statistical deviation for ARIMA to detect. This mechanistic distinction suggests that the boundary of ARIMA-based anomaly detection’s effectiveness lies not in attack volume, but in whether the attack introduces temporal irregularity in flow-level statistics: attacks that disrupt operational sequences are detectable regardless of volume, while attacks that perfectly replicate legitimate transfer patterns evade statistical detection entirely.

These findings establish a core methodological boundary for the hybrid approach proposed in this study: the pipeline’s detection capability is fundamentally contingent upon attack behavior generating identifiable deviations in macroscopic network flow statistical characteristics. ARIMA-based anomaly filtering operates on aggregate flow-level statistics—duration, bytes transferred, packet counts, and connection frequency—and can only flag traffic whose temporal patterns diverge from the statistical baseline established during forecasting. Attacks that introduce operational irregularities into these flow-level features, even at low volume, will produce elevated MAE residuals and survive the anomaly threshold for subsequent classification, as demonstrated by the successful detection of Defense Evasion, Discovery, Persistence, and Privilege Escalation. Conversely, attacks whose execution inherently conforms to the statistical and volumetric characteristics of legitimate traffic—as Exfiltration demonstrates through its mimicry of authorized file transfers—will be indistinguishable from benign flows at the statistical level and will be filtered out alongside conforming traffic. This boundary is not a limitation unique to ARIMA but applies broadly to any flow-level statistical anomaly detection method: detection requires that the attack leaves a measurable statistical footprint. Attack types that operate entirely within the statistical envelope of normal network behavior require fundamentally different detection strategies—such as content inspection, behavioral profiling, or endpoint-level monitoring—that operate below the flow-level abstraction at which this study’s pipeline functions.

Beyond the attack-type-specific outcomes described above, a broader question raised by the Exfiltration and Credential Access results is whether the hybrid approach genuinely detects attacks or simply selects a narrow subset of easy-to-classify events. Evaluating ARIMA’s own detection properties—attack retention rate and anomaly composition—independently of the Decision Tree’s classification performance addresses this concern. Table 22, Table 24, Table 29, Table 30, Table 31 and Table 32 report these properties for each attack type. For Reconnaissance, ARIMA retained an average of 95.8% of attacks across chunks while removing approximately 98% of benign traffic, demonstrating high attack coverage with effective noise reduction. This high retention rate, combined with the Decision Tree’s subsequent classification accuracy, confirms that ARIMA is not merely selecting an easy subset but preserving most attacks for classification. In contrast, Credential Access exhibited an average retention rate of only 0.64%, meaning ARIMA filtered out 99.4% of attacks—confirming that the perfect ARIMA-DT classification scores reflect performance on a highly selective subset rather than genuine pipeline-wide detection, as reflected in the 1.28% End-to-End F1-score. For the four stealthy attack types, retention rates varied: Defense Evasion averaged 73.40%, Discovery averaged 64.43%, Persistence averaged 88.60%, and Privilege Escalation averaged 78.89%. These retention rates, reported alongside End-to-End metrics throughout Section 4, enable assessment of ARIMA’s filtering behavior independent of the Decision Tree’s classification and demonstrate that the hybrid approach’s detection enablement for stealthy attacks is based on substantial attack retention rather than trivial subset selection. However, ARIMA’s anomaly precision—the proportion of flagged anomalies that are actual attacks versus benign outliers—varies considerably by attack type. For Reconnaissance, the filtered subset contained approximately 97% attacks (e.g., 23,477 attacks versus 366 benign samples in Chunk 0 of Table 22), indicating high anomaly precision. For stealthy attack types, the filtered subsets contained approximately 3–4% attacks (e.g., 10 attacks versus 415 benign samples for Defense Evasion in Chunk 0 of Table 29), reflecting low anomaly precision but still sufficient for the Decision Tree to achieve meaningful classification. Future work should investigate adaptive threshold strategies that optimize the trade-off between attack retention and anomaly precision across different attack profiles.

Six attack types—Collection, Command and Control, Execution, Impact, Lateral Movement, and Resource Development—achieved perfect scores but completely lacked benign samples, with true negatives equal to zero, rendering results uninformative about real-world detection capability. These results cannot assess false positive rates or specificity, revealing a critical methodological limitation in the attack-type filtering process that must be addressed in future work.

Extreme class imbalance is a pervasive and well-documented challenge in intrusion detection research based on real-world attack data, where benign traffic overwhelmingly dominates and many attack types occur at volumes several orders of magnitude lower. This study confronts this challenge directly rather than circumventing it through artificial balancing techniques. The UWF-ZeekData22 and UWF-ZeekData24 datasets reflect authentic network environments in which Reconnaissance constitutes over 99.97% of attacks in UWF-ZeekData22 while tactics such as Defense Evasion, Persistence, and Privilege Escalation each represent fractions of a percent. When partitioned into 100,000-row chunks for evaluation, stealthy attack types produced baseline test sets containing 99.94–99.99% benign samples—approximately 30,000 benign instances versus 2–18 attack instances per chunk. This imbalance directly caused the baseline Decision Tree’s 0% recall for four attack types, as the classifier defaulted to all-negative predictions to maximize overall accuracy. Rather than applying oversampling or synthetic data generation techniques that could introduce artificial patterns unrepresentative of real network conditions, this study preserves the natural class distribution and instead evaluates whether ARIMA preprocessing can overcome this imbalance through its anomaly filtering mechanism. The results demonstrate that ARIMA’s class-agnostic MAE threshold filter incidentally rebalances the data—removing approximately 98% of temporally conforming benign traffic while retaining attack samples at substantially higher rates—enabling the Decision Tree to learn meaningful decision boundaries on the filtered subset. This study also explicitly identifies and excludes from interpretation the six attack types (Collection, Command and Control, Execution, Impact, Lateral Movement, and Resource Development) that lacked sufficient class representation to produce valid evaluation metrics, avoiding the reporting of uninformative perfect scores as genuine detection capability. The decision to honestly report these limitations, rather than presenting only favorable results, reflects the inherent constraints of working with real-world attack datasets where not all tactical categories occur at volumes sufficient for rigorous machine learning evaluation. Future work should explore stratified sampling strategies, minimum class representation thresholds, and targeted data collection campaigns to address these gaps while preserving the ecological validity that real-world datasets provide.

ARIMA forecasting is effective for network flow data because it models and predicts time-series trends, such as traffic volume, connection counts, or data transfer rates, by analyzing historical patterns. This enables the detection of anomalies, such as sudden spikes or drops in network activity, which may indicate potential security events. However, ARIMA cannot directly identify attacks, as it does not account for categorical or contextual information, such as specific attack tactics or labels like “Reconnaissance” or “Credential Access” observed in the UWF-ZeekData22 and UWF-ZeekData24 datasets. It focuses solely on numerical trends and deviations, lacking the ability to classify or attribute anomalies to specific attack types without the assistance of additional supervised learning models, such as Decision Trees.

The Decision Tree model accounts for attack types (the class variable), enabling the inclusion of attack type labels and other relevant parameters from the dataset.

The results demonstrate that ARIMA’s statistical anomaly filtering provides benefits across the attack sophistication spectrum. For high-volume attacks like Reconnaissance, ARIMA reduces noise and improves classification accuracy through effective filtering of benign traffic patterns. For low-volume, stealthy attacks like Defense Evasion and Privilege Escalation that were completely undetectable with baseline approaches, ARIMA enables detection by isolating subtle statistical patterns from overwhelming benign traffic. The approximately 98% dataset reduction (from ~30,000 to ~400 rows per chunk) provides substantial computational efficiency while preserving—and in many cases exposing—attack signatures that direct classification approaches miss entirely. This dual benefit demonstrates that ARIMA preprocessing addresses both the computational challenges of high-volume traffic analysis and the detection challenges posed by sophisticated, low-volume attacks designed to evade traditional signature-based approaches. However, the Credential Access results demonstrate that this filtering mechanism is counterproductive when attack flows lack distinctive feature-level deviations from benign traffic, as ARIMA removes attacks alongside conforming benign samples.

A natural question arising from the acknowledgment in Section 3.7 that ARIMA functions primarily as a statistical baseline estimator on shuffled data—rather than modeling genuine temporal dependencies—is whether simpler statistical methods such as z-score filtering, interquartile range (IQR) outlier detection, or fixed percentile thresholds could achieve comparable filtering performance with less computational overhead. Several properties of ARIMA’s framework provide advantages over these alternatives even in the absence of true chronological ordering. First, ARIMA’s per-partition parameter optimization through AIC-based grid search produces an adaptive baseline that adjusts to the local statistical characteristics of each 100,000-row partition, whereas z-score and IQR methods assume fixed distributional properties (normality for z-scores, symmetric spread for IQR) that may not hold across partitions with varying attack densities and traffic compositions. The parameter heterogeneity observed in Table 6—ranging from white noise models (0, 0, 0) to ARIMA (2, 0, 3)—confirms that different partitions exhibit different statistical structures, and ARIMA’s model selection framework adapts accordingly. Second, ARIMA’s MAE-based threshold is derived from the model’s own forecasting error on the training sequence, providing a principled, data-driven bound that reflects the model’s learned expectations rather than an arbitrary statistical cutoff. This contrasts with z-score thresholds (which require selecting k, typically 2 or 3, based on assumed normality) or percentile-based methods (which require selecting a fixed retention proportion regardless of data characteristics). Third, ARIMA’s autoregressive and moving average components can capture local dependencies within the shuffled sequence—while these do not represent genuine temporal relationships, they model short-range statistical regularities in the data that inform more nuanced baseline estimates than a simple global mean or median. That said, a formal comparative evaluation of ARIMA against simpler statistical filters was not conducted in this study, and whether the observed filtering performance (95.8% attack retention with 93.5% benign removal for Reconnaissance) could be approximated by less computationally expensive methods remains an open empirical question. Future work should include ablation studies comparing ARIMA-based filtering against z-score, IQR, isolation forest, and percentile-based alternatives to quantify the marginal benefit of ARIMA’s model-based approach and determine whether the additional computational cost is justified across different attack profiles and data conditions.

Beyond performance metrics, this study provides important methodological insights for hybrid intrusion detection research. Class balance verification proved essential, as nearly half of evaluated attack types produced misleading results due to inadequate benign sample representation, emphasizing the need for stratified sampling and minimum class representation thresholds. Reconnaissance emerged as the gold standard evaluation due to optimal class balance, with approximately equal benign and attack samples, and large sample sizes, providing the most reliable and generalizable performance assessment. However, it is important to distinguish between the two contributions that Reconnaissance and the stealthy attack types represent. Reconnaissance, with its optimal class balance, serves as a validation benchmark demonstrating that ARIMA preprocessing improves classification accuracy under favorable conditions—its End-to-End F1-score improvement from 80.88% to 97.59% reflects the upper bound of hybrid model performance when class imbalance is not a confounding factor. The stealthy attack types—Defense Evasion, Discovery, Persistence, and Privilege Escalation—demonstrate a fundamentally different contribution: detection enablement under extreme class imbalance. Their ARIMA_DT recall rates (86–100%) and End-to-End recall rates (63–73%) should not be directly compared to Reconnaissance, as they operate under substantially different data conditions. The generalizability of the model is therefore best understood as two-fold: reliable performance improvement for well-represented attacks, and detection enablement for low-volume attacks that evade direct classification. The findings also revealed that baseline Decision Tree models struggled significantly with the extreme class imbalance present in the full dataset (approximately 30,000 benign samples versus tens to hundreds of attack samples), often defaulting to all-negative classifications. ARIMA preprocessing mitigated this challenge not through deliberate class balancing, but as an emergent effect of its anomaly detection mechanism. The MAE threshold filter is class-agnostic, retaining observations that deviate from forecasted statistical patterns regardless of their labels. Because benign traffic is temporally regular, most benign samples fall within the MAE threshold and are removed, while attack samples—which introduce statistical deviations—are retained at higher rates. The resulting filtered dataset is incidentally more balanced, enabling the Decision Tree to learn meaningful decision boundaries rather than defaulting to majority-class classification.

Future research directions include integrating additional machine learning classifiers such as neural networks, random forests, or gradient boosting machines to better capture nuanced attack patterns in ARIMA-filtered data. Exploring ensemble approaches that combine multiple detection strategies could address the limitations observed with Exfiltration attacks and improve End-to-End recall for stealthy attack types where ARIMA filtering removes a portion of attacks during the preprocessing stage. Real-time operational validation, deploying the hybrid model in live network environments to assess practical false positive and false negative rates under realistic traffic conditions, would provide essential validation of the approach. Finally, cross-dataset generalization studies validating the approach on additional labeled datasets beyond UWF-ZeekData would assess transferability across different network environments and attack vectors. Investigation of additional features beyond temporal flow statistics—such as payload inspection, protocol-specific attributes, or behavioral profiling—may improve detection for attack types like Exfiltration and Credential Access that proved resistant to flow-based analysis.

6. Conclusions

The findings of this research affirm that time-aware and label-aware hybrid models offer a promising direction for improving intrusion detection systems across diverse attack types. By aligning statistical forecasting with attack-type classification rooted in the MITRE ATT&CK framework, the model achieves both interpretability and operational relevance. Results were evaluated at three levels, namely, Baseline, ARIMA-DT (filtered subset), and End-to-End (full pipeline), providing a comprehensive assessment of the hybrid approach. The hybrid ARIMA-Decision Tree approach demonstrated two critical capabilities: substantial performance improvement for detectable high-volume attacks, and detection enablement for sophisticated, low-volume attacks that completely evade baseline classification approaches.

For high-volume attacks like Reconnaissance, ARIMA preprocessing dramatically reduced classification errors while maintaining near-perfect detection rates, with End-to-End F1-score reaching 97.59% compared to the 80.88% baseline. Credential Access achieved perfect classification on the ARIMA-filtered subset but produced an End-to-End F1-score of 1.28%, which is worse than the baseline F1-score of 7.41%, demonstrating that the hybrid pipeline is counterproductive when attack flows lack distinctive feature-level deviations from benign traffic—ARIMA removes these attacks alongside conforming benign samples, actively degrading detection capability. More importantly, for four stealthy attack types—Defense Evasion, Discovery, Persistence, and Privilege Escalation—that produced 0% baseline recall, ARIMA preprocessing enabled detection with ARIMA-DT recall rates ranging from 86% to 100% and End-to-End recall rates ranging from 63% to 73%. This detection enablement represents a fundamental contribution, demonstrating that statistical anomaly detection can expose attack patterns invisible to direct classification approaches when attacks are designed to blend with legitimate operations.

However, the research also identified important limitations. Exfiltration attacks remained undetectable under all three evaluation levels, indicating that flow-based statistical analysis alone is insufficient for certain attack types. Credential Access demonstrated that attacks whose flow characteristics closely resemble legitimate traffic are counterproductive for MAE-based filtering, as ARIMA removes attacks alongside conforming benign samples. Additionally, six attack types lacked adequate benign sample representation for valid assessment, revealing methodological challenges in maintaining balanced evaluation datasets across diverse attack categories.

This study demonstrates that hybrid approaches combining ARIMA-based statistical filtering with supervised classification can address multiple challenges simultaneously: computational efficiency through intelligent data reduction, improved accuracy for high-volume attacks through noise filtering, and detection enablement for sophisticated attacks through statistical anomaly filtering. End-to-End evaluation provides an honest assessment of the full pipeline, confirming that while ARIMA filtering introduces some attack loss during preprocessing, the net effect remains a substantial improvement over baseline detection for Reconnaissance and a transformation from complete failure to meaningful detection for stealthy attack types. These findings suggest that effective intrusion detection systems must incorporate statistical awareness alongside traditional classification approaches, particularly for detecting advanced persistent threats that deliberately minimize their behavioral deviation from normal operations.

Future work should explore adaptive ensemble approaches that combine multiple detection strategies to address attack types resistant to flow-based analysis, including adaptive threshold strategies or multi-resolution analysis to reduce attack loss during ARIMA filtering and improve End-to-End recall. Investigation of additional features beyond temporal flow statistics may improve detection for attack types like Exfiltration and Credential Access that proved resistant to flow-based analysis. Validation of the approach in real-time operational environments and assessment of generalizability across diverse network contexts remain essential next steps. The framework developed in this research provides a foundation for more comprehensive network defense strategies that balance computational efficiency with the detection of both high-volume and sophisticated, stealthy cyber threats.

Author Contributions

Conceptualization, R.F., S.S.B. and S.C.B.; methodology, R.F., S.S.B. and S.C.B.; software, R.F., S.C. and G.C.S.D.C.; validation, R.F., S.S.B., S.C.B. and G.C.S.D.C.; formal analysis, R.F., S.S.B., S.C.B. and G.C.S.D.C.; investigation, R.F., S.S.B. and S.C.B.; resources, S.S.B., D.M. and S.C.B.; data curation, D.M., R.F. and S.C.; writing—original draft preparation, R.F. and S.C.; writing—review and editing, R.F., S.S.B. and S.C.B.; visualization, R.F. and G.C.S.D.C.; supervision, S.S.B., D.M. and S.C.B.; project administration, S.S.B., D.M. and S.C.B.; funding acquisition, S.S.B., D.M. and S.C.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets are available at https://datasets.uwf.edu/ (accessed on 20 August 2025).

Acknowledgments

This work was partially supported by the Askew Institute at The University of West Florida.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

IDS	Intrusion Detection System
ARIMA	Autoregressive Integrated Moving Average
DT	Decision Tree
SVM	Support Vector Machine
ANN	Artificial Neural Networks
EMD	Empirical Mode Decomposition
DWT	Discrete Wavelet Transform
AR	Autoregression
I	Differencing
MA	Moving Average
ATT&CK	Adversarial Tactics, Techniques, and Common Knowledge
MAE	Mean Absolute Error
ACF	Autocorrelation Function
PACF	Partial Autocorrelation Function
HDFS	Hadoop Distributed File System
AIC	Akaike Information Criterion
DLL	Dynamic Link Library
APT	Advanced Persistent Threats
TP	True Positive
FP	False Positive
FN	False Negative
TN	True Negative

Appendix A

Table A1. UWF-ZeekData22 Dataset Breakdown.

Dataset	Attack Type	Number of Each Attack
1	Benign	-
2	Reconnaissance	9,278,720
2	Discovery	2086
3	Credential Access	31
	Privilege Escalation	13
	Exfiltration	7
	Lateral Movement	4
	Resource Development	3
	Reconnaissance	2
	Persistence	1
	Initial Access	1
	Defense Evasion	1

Appendix B

Table A2. UWF-ZeekData24 Dataset Breakdown.

Dataset	Attack Type	Number of Each Attack
1	Credential Access	236,706
	Reconnaissance	12,762
	Initial Access	2172
	Privilege Escalation	1251
	Persistence	1251
	Defense Evasion	1251
	Exfiltration	50
2	Credential Access	45,491
	Reconnaissance	3339
	Initial Access	576
	Privilege Escalation	336
	Persistence	336
	Defense Evasion	336
	Exfiltration	30
3	Credential Access	150,887
	Reconnaissance	10,834
	Initial Access	2090
	Privilege Escalation	1193
	Persistence	1193
	Defense Evasion	1193
	Exfiltration	188
4	Credential Access	341,945
	Reconnaissance	24,233
	Initial Access	4602
	Privilege Escalation	2591
	Persistence	2591
	Defense Evasion	2591
	Exfiltration	268
5	Credential Access	96,159
	Reconnaissance	6927
	Initial Access	1222
	Privilege Escalation	677
	Persistence	677
	Defense Evasion	677
	Exfiltration	23
6	Benign	-
7	Benign	-

References

Gooijer, J.G.D.; Hyndman, R.J. 25 Years of Time Series Forecasting. Int. J. Forecast. 2006, 22, 443–473. [Google Scholar] [CrossRef]
Trellix. What Is the MITRE ATT&CK Framework?|Get the 101 Guide. 2025. Available online: https://www.trellix.com/en-us/security-awareness/cybersecurity/what-is-mitre-attack-framework.html (accessed on 1 May 2025).
Bagui, S.S.; Mink, D.; Bagui, S.C.; Madhyala, P.; Uppal, N.; McElroy, T.; Plenkers, R.; Elam, M.; Prayaga, S. Introducing the UWF-ZeekDataFall22 Dataset to Classify Attack Tactics from Zeek Conn Logs Using Spark’s Machine Learning in a Big Data Framework. Electronics 2023, 12, 5039. [Google Scholar] [CrossRef]
Elam, M.; Mink, D.; Bagui, S.S.; Plenkers, R.; Bagui, S.C. Introducing UWF-ZeekData24: An Enterprise MITRE ATT&CK Labeled Network Attack Traffic Dataset for Machine Learning/AI. Data 2025, 10, 59. [Google Scholar] [CrossRef]
UWF Datasets. Available online: https://datasets.uwf.edu/ (accessed on 1 May 2025).
Gencer, K.; Başçiftçi, F. Time Series Forecast Modeling of vulnerabilities in the Android operating system using Arima and deep learning methods. Sustain. Comput. Inform. Syst. 2021, 30, 100515. [Google Scholar] [CrossRef]
Pokhrel, N.R.; Rodrigo, H.; Tsokos, C.P. Cybersecurity: Time Series Predictive Modeling of Vulnerabilities of Desktop Operating System Using Linear and Non-Linear Approach. J. Inf. Secur. 2017, 8, 362–382. [Google Scholar] [CrossRef][Green Version]
Werner, G.; Yang, S.; McConky, K. Time series forecasting of cyber attack intensity. In Proceedings of the 12th Annual Conference on Cyber and Information Security Research, Oak Ridge, TN, USA, 4–6 April 2017; pp. 1–3. [Google Scholar] [CrossRef]
Brutlag, J.D. Aberrant Behavior Detection in Time Series for Network Monitoring. In Proceedings of the 14th USENIX Conference on System Administration 2000, New Orleans, LA, USA, 3–8 December 2000; pp. 139–146. [Google Scholar]
Wang, X.; Zhu, H.; Luo, X.; Guan, X. Data-Driven-Based Detection and Localization Framework Against False Data Injection Attacks in DC Microgrids. IEEE Internet Things J. 2025, 12, 36079–36093. [Google Scholar] [CrossRef]
Liang, Z.; Ismail, M.T. Advanced CEEMD hybrid model for VIX forecasting: Optimized decision trees and ARIMA integration. Evol. Intell. 2024, 18, 12. [Google Scholar] [CrossRef]
Büyükşahin, Ü.Ç.; Ertekin, Ş. Improving forecasting accuracy of time series data using a new ARIMA-ANN hybrid method and empirical mode decomposition. Neurocomputing 2019, 361, 151–163. [Google Scholar] [CrossRef]
Khandelwal, I.; Adhikari, R.; Verma, G. Time series forecasting using hybrid ARIMA and ANN models based on DWT decomposition. Procedia Comput. Sci. 2015, 48, 524–529. [Google Scholar] [CrossRef]
Moustafa, N.; Slay, J. UNSW-NB15: A Comprehensive Data Set for Network Intrusion Detection Systems. In Proceedings of the Military Communications and Information Systems Conference (MilCIS), Canberra, Australia, 10–12 November 2015; pp. 1–6. [Google Scholar]
MITRE ATT&CK. Reconnaissance, Tactic TA0043—Enterprise|MITRE ATT&CK^®. 2025. Available online: https://attack.mitre.org/tactics/TA0043/ (accessed on 1 May 2025).
Skopik, F.; Schall, D.; Dustdar, S. Cyber threat intelligence sharing: Survey and research challenges. Comput. Secur. 2019, 87, 101568. [Google Scholar] [CrossRef]
MITRE ATT&CK. Discovery, Tactic TA0007—Enterprise|MITRE ATT&CK^®. 2025. Available online: https://attack.mitre.org/tactics/TA0007/ (accessed on 1 May 2025).
MITRE ATT&CK. Credential Access, Tactic TA0006—Enterprise|MITRE ATT&CK^®. 2025. Available online: https://attack.mitre.org/tactics/TA0006/ (accessed on 1 May 2025).
Ma, J.; Zhao, S.; Liu, L.; Jin, H. A Systematic Survey of Credential Stealing Attacks and Defenses. IEEE Access 2020, 8, 32755–32776. [Google Scholar]
MITRE ATT&CK. Privilege Escalation, Tactic TA0004—Enterprise|MITRE ATT&CK^®. 2025. Available online: https://attack.mitre.org/tactics/TA0004/ (accessed on 1 May 2025).
Chatzisofroniou, S.G.; Ntantogian, C. Survey of Privilege Escalation Attacks in Mobile Operating Systems. IEEE Commun. Surv. Tutor. 2020, 22, 345–372. [Google Scholar]
Zhang, Y.; Wu, L. A Survey on Persistence Techniques in Advanced Malware. In Proceedings of the IEEE International Conference Software Quality, Reliability and Security (QRS), Washington, DC, USA, 3–5 August 2015; pp. 165–170. [Google Scholar]
Holm, H. Signature Based Intrusion Detection for Zero-Day Attacks: (Not) A Closed Chapter? In Proceedings of the 47th Hawaii International Conference System Sciences (HICSS), Waikoloa, HI, USA, 6–9 January 2014; pp. 4895–4904. [Google Scholar] [CrossRef]
Javaid, A.Y.; Sun, W.; Devabhaktuni, V.K.; Alam, M. Cyber security threat analysis and modeling of an unmanned aerial vehicle system. In Proceedings of the IEEE Conference on Technologies for Homeland Security (HST), Waltham, MA, USA, 13–15 November 2012; pp. 585–590. [Google Scholar]
Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control, 5th ed.; Wiley: Hoboken, NJ, USA, 2015. [Google Scholar]
Quinlan, J.R. C4.5: Programs for Machine Learning, Morgan Kaufmann; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1993. [Google Scholar]
Rokach, L.; Maimon, O. Top-down induction of decision trees classifiers—A survey. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2005, 35, 476–487. [Google Scholar] [CrossRef]
Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C. Time Series Analysis: Forecasting and Control, 4th ed.; Wiley: Hoboken, NJ, USA, 2008. [Google Scholar]
Hyndman, R.; Athanasopoulos, G. Forecasting: Principles and Practice, 3rd ed.; OTexts: Melbourne, Australia, 2021. [Google Scholar]
Shumway, M.; Stoffer, D. Time Series Analysis and Its Applications: With R Examples, 4th ed.; Springer: Berlin/Heidelberg, Germany, 2017. [Google Scholar]
Harvey, A.C. Forecasting, Structural Time Series Models and the Kalman Filter; Cambridge University Press: Cambridge, UK, 1990. [Google Scholar]
MITRE ATT&CK. Applications in Cybersecurity and the Way Forward. Available online: https://www.researchgate.net/publication/389090450 (accessed on 15 February 2025).
MITRE. MITRE ATT&CK^® Framework. MITRE Corporation. 2024. Available online: https://attack.mitre.org/ (accessed on 15 May 2025).
Bagui, S.; Mink, D.; Bagui, S.; Ghosh, T.; McElroy, T.; Paredes, E.; Khasnavis, N.; Plenkers, R. Detecting Reconnaissance and Discovery Tactics from the MITRE ATT&CK Framework in Zeek Conn Logs Using Spark’s Machine Learning in the Big Data Framework. Sensors 2022, 22, 7999. [Google Scholar] [CrossRef]
Inoue, A.; Jin, L.; Rossi, B. Rolling Window Selection for Out-of-Sample Forecasting with Time-Varying Parameters. J. Econom. 2016, 196, 55–67. [Google Scholar] [CrossRef]
Bagui, S.; Spratlin, S. A Review of Data Mining Algorithms on Hadoop’s MapReduce. Int. J. Data Sci. 2018, 3, 146–169. [Google Scholar] [CrossRef]
Ghazi, M.R.; Raghava, N.S. Securing cloud-enabled smart cities by detecting intrusion using spark-based stacking ensemble of machine learning algorithms. Electron. Res. Arch. 2024, 32, 1268–1307. [Google Scholar] [CrossRef]
McKinney, W. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28 June–3 July 2010; pp. 56–61. [Google Scholar]

Figure 1. Overview of the ARIMA model methodology.

Figure 2. Test Data vs. Forecast Values for UWF-ZeekData22.

Figure 3. Overview of the initial pre-processing of datasets.

Figure 4. Overview of the DT-ARIMA processes.

Figure 5. Overview of the DT-Baseline process.

Figure 6. Overview of the processing of DT-ARIMA and DT-Baseline through Decision Trees.

Figure 7. Test Data vs. Forecasted Values: 100,000 rows.

Figure 8. Test Data vs. Forecasted Values: 10,000 rows.

Table 1. Breakdown of Attacks in UWF-ZeekData22 [3].

Attack Type	Count	%
Reconnaissance	9,278,722	0.9997687
Discovery	2086	0.0002248
Credential Access	31	3.34 × 10⁻⁶
Privilege Escalation	13	1.40 × 10⁻⁶
Exfiltration	7	7.54 × 10⁻⁷
Lateral Movement	4	4.31 × 10⁻⁷
Resource Development	3	3.23 × 10⁻⁷
Initial Access	1	1.08 × 10⁻⁷
Persistence	1	1.08 × 10⁻⁷
Defense Evasion	1	1.08 × 10⁻⁷

Table 2. Breakdown of Attacks in UWF-ZeekData24 [4].

Attack Type	Count	%
Credential Access	871,188	90.88
Reconnaissance	58,095	06.06
Initial Access	10,662	01.11
Privilege Escalation	6048	00.631
Persistence	6048	00.631
Defense Evasion	6048	00.631
Exfiltration	559	5.83 × 10⁻⁴

Table 3. UWF-ZeekData22 [3].

Attack Type	Description
Reconnaissance	Gathering information about the target system [15,16]
Discovery	Gathering information about the target system after initial access [17]

Table 4. UWF-ZeekData24 [4].

Attack Type	Description
Credential Access	Attempting to steal authentication credentials such as usernames and passwords [18,19]
Reconnaissance	Gathering information about the target system [15,16]
Privilege Escalation	Gaining higher-level permissions to execute malicious actions beyond initial access rights [20,21]
Persistence	Maintaining a foothold in a system across reboots and interruptions [22]
Defense Evasion	Techniques to avoid detection and bypass security controls [23]
Exfiltration	Stealing sensitive data from a system and transferring it outside the network [24]

Table 5. UWF-ZeekData22 [3] and UWF-ZeekData24 [4] Attribute Descriptions.

Attribute	Description	Reason for Inclusion
ts	Time of first packet	Primary time-related attribute that will be used as the timestamp for time series data
duration	How long each connection lasted	Captures the length of connection that will be used for indicating anomalies or changes in traffic patterns over time
orig_bytes	Number of payload bytes originator sent	Captures the network traffic intensity and may be used to reveal unusual patterns before an attack
resp_bytes	Number of payload bytes responder sent	Captures the response traffic and may be used to monitor increases in communication that might signal abnormal activity
missed_bytes	Indicates number of bytes missed in content gaps, representative of packet loss	Captures packet loss that may indicate malicious activity, especially if it happens suddenly. May lead to anomaly detection
orig_pkts	Number of packets originator sent	Captures the number of packets over time to detect change in traffic volume which may be linked to network attacks
resp_pkts	Number of packets responder sent	Captures the amount of incoming traffic which may increase before certain types of attacks such as reconnaissance or DDoS
orig_ip_bytes	Number of IP level bytes originator sent	Captures the total number of bytes at the IP layer sent by the originator, may be used to determine the volume of data flowing from the originator before or during an attack
resp_ip_bytes	Number of IP level bytes responder sent	Captures the total number of bytes at the IP layer sent by the responder and may be used to analyze incoming traffic to identify potential attack-related activities
label_tactic	Attack type assigned by MITRE ATT&CK framework	Used for evaluation of performance and for supervised learning in the decision tree portion

Table 6. ARIMA Parameter Optimization: Reconnaissance: Combined Datasets 2022&2024.

Row Index	AutoRegressive Order (p)	Moving Average (q)
0–100,000	0	0
100,000–200,000	0	0
200,000–300,000	1	1
300,000–400,000	2	3
400,000–500,000	0	1

Table 7. Experimental Configuration and Hyperparameters.

Component	Parameter	Value	Description
ARIMA Model
	p (AR order)	0–4	Autoregressive lag order search range
	d (Differencing)	0–1	Differencing order for stationarity
	q (MA order)	0–4	Moving average lag order search range
	ACF lags	50	Autocorrelation analysis depth for seasonality detection
	Window size	1 s	Temporal aggregation window for time series
	Model selection	AIC	Grid search optimization criterion (minimum AIC)
	Stationarity test	ADF	Augmented Dickey–Fuller test (p < 0.05)
	Variance stabilization	log₁p	Log transformation applied to sum_orig_bytes
Decision Tree
	Splitting criterion	Gini	Impurity measure for node splitting
	maxDepth	5	Maximum tree depth (pre-pruning)
	maxBins	32	Maximum bins for continuous feature discretization
	minInstancesPerNode	1	Minimum samples required per leaf node
	minInfoGain	0.0	Minimum information gain required for split
Data Splitting
	Train/test ratio	70/30	Primary dataset split proportion
	Random seed	42	Reproducibility seed for all random operations
	Partition size	100,000 rows	Processing chunk size for scalability
Runtime Environment
	Python	3.12.3	Interpreter version
	Apache Spark	MLlib	Distributed machine learning framework
	Driver memory	128 GB	Spark driver memory allocation
	Executor memory	128 GB	Per-executor memory allocation
	Driver cores	2	CPU cores allocated to driver
	Executor cores	4	CPU cores per executor
	Executor instances	2–8	Dynamic executor allocation range
	Storage format	Parquet	HDFS compressed columnar format
Python Libraries version 3.1.2
	statsmodels	—	ARIMA time series modeling
	pandas	—	Data manipulation and analysis
	numpy	—	Numerical computing
	scikit-learn	—	Confusion matrix and metrics

Table 8. Results for Combined and Shuffled 2022/2024 Data with Improved Detection.

Attack Type	Accuracy	Precision	Recall	F1 Score	DF Tested
Credential Access	86.08%	7.35%	7.48%	7.41%	Baseline
	100%	100%	100%	100%	ARIMA-DT
	100.00%	0.64%	92.60%	1.28%	End-to-End
Reconnaissance	69.06%	80.96%	80.80%	80.88%	Baseline
	99.43%	99.65%	99.77%	99.71%	ARIMA-DT
	96.18%	99.65%	95.62%	97.59%	End-to-End

Table 9. Results for Combined and Shuffled 2022/2024 Data with No Initial Detection.

Attack Type	Precision	Recall	Accuracy	F1 Score	DF Tested
Defense Evasion	N/A	0%	99.95%	N/A	Baseline
	94.17%	91.83%	99.66%	92.59%	ARIMA-DT
	94.17%	67.83%	99.98%	75.93%	End-to-End
Discovery	N/A	0%	99.98%	N/A	Baseline
	88.89%	100.00%	99.81%	93.50%	ARIMA-DT
	88.89%	63.43%	99.99%	71.10%	End-to-End
Exfiltration	N/A	0%	99.99%	N/A	Baseline
	N/A	N/A	100%	N/A	ARIMA-DT
	N/A-No Detection	N/A-No Detection	N/A-No Detection	N/A-No Detection	End-to-End
Persistence	N/A	0%	99.95%	N/A	Baseline
	94.33%	86.92%	99.43%	89.18%	ARIMA-DT
	91.00%	73.38%	99.99%	80.10%	End-to-End
Privilege Escalation	N/A	0%	99.94%	N/A	Baseline
	94.82%	89.93%	99.53%	92.02%	ARIMA-DT
	94.82%	64.68%	99.98%	74.60%	End-to-End

Table 10. Reconnaissance: Decision Tree Results for Combined and Shuffled 2022/2024 Data.

Accuracy	Precision	Recall	F1 Score	Row Index	DF Tested
69.19%	81.22%	80.67%	80.95%	0–100,000	Baseline
99.58%	99.83%	99.74%	99.78%	0–100,000	ARIMA-DT
69.08%	80.89%	80.82%	80.86%	100,000–200,000	Baseline
99.46%	99.76%	99.69%	99.73%	100,000–200,000	ARIMA-DT
69.35%	81.10%	81.13%	81.11%	200,000–300,000	Baseline
99.47%	99.72%	99.75%	99.73%	200,000–300,000	ARIMA-DT
68.70%	80.73%	80.47%	80.60%	300,000–400,000	Baseline
99.48%	99.78%	99.69%	99.74%	300,000–400,000	ARIMA-DT
68.99%	80.85%	80.92%	80.88%	400,000–500,000	Baseline
99.17%	99.18%	99.98%	99.58%	400,000–500,000	ARIMA-DT

Table 11. Confusion Matrix, Reconnaissance, 2024&2022, Set 1 Baseline.

Confusion Matrix for Reconnaissance: Row Index: 0–100,000, Baseline
	Predicted Benign	Predicted Attack
Actual Benign	1128	4547
Actual Attack	4713	19,671

Table 12. Confusion Matrix, Reconnaissance, 2024&2022, Set 2 Baseline.

Confusion Matrix for Reconnaissance: Row Index: 100,000–200,000, Baseline
	Predicted Benign	Predicted Attack
Actual Benign	1130	4637
Actual Attack	4658	19,634

Table 13. Confusion Matrix, Reconnaissance, 2024&2022, Set 3 Baseline.

Confusion Matrix for Reconnaissance: Row Index:200,000–300,000, Baseline
	Predicted Benign	Predicted Attack
Actual Benign	1061	4610
Actual Attack	4603	19,785

Table 14. Confusion Matrix, Reconnaissance, 2024&2022, Set 4 Baseline.

Confusion Matrix for Reconnaissance: Row Index:300,000–400,000, Baseline
	Predicted Benign	Predicted Attack
Actual Benign	1108	4664
Actual Attack	4743	19,544

Table 15. Confusion Matrix, Reconnaissance, 2024&2022, Set 5 Baseline.

Confusion Matrix for Reconnaissance: Row Index: 400,000–500,000, Baseline
	Predicted Benign	Predicted Attack
Actual Benign	1015	4672
Actual Attack	4650	19,722

Table 16. Confusion Matrix, Reconnaissance, 2024&2022, Set 1 Result DataFrame.

Confusion Matrix for Reconnaissance: Row Index: 0–100,000, ARMIA-DT
	Predicted Benign	Predicted Attack
Actual Benign	327	39
Actual Attack	62	23,415

Table 17. Confusion Matrix, Reconnaissance, 2024&2022, Set 2 Result DataFrame.

Confusion Matrix for Reconnaissance: Row Index: 100,000–200,000, ARIMA_DT Result
	Predicted Benign	Predicted Attack
Actual Benign	349	56
Actual Attack	72	23,283

Table 18. Confusion Matrix, Reconnaissance, 2024&2022, Set 3 Result DataFrame.

Confusion Matrix for Reconnaissance: Row Index: 200,000–300,000, ARIMA-DT
	Predicted Benign	Predicted Attack
Actual Benign	317	66
Actual Attack	59	23,207

Table 19. Confusion Matrix, Reconnaissance, 2024&2022, Set 4 Result DataFrame.

Confusion Matrix for Reconnaissance: Row Index: 300,000–400,000, ARIMA-DT
	Predicted Benign	Predicted Attack
Actual Benign	371	51
Actual Attack	72	23,147

Table 20. Confusion Matrix, Reconnaissance, 2024&2022, Set 5 Result DataFrame.

Confusion Matrix for Reconnaissance: Row Index: 400,000–500,000, Arima-DT
	Predicted Benign	Predicted Attack
Actual Benign	195	192
Actual Attack	5	23,337

Table 21. Confusion Matrix & Results, Reconnaissance Decision Tree, 2024&2022.

Chunk	Benign	Attack	TP	FP	FN	TN	Prec	Recall	Acc	F1
0	5675	24,384	19,671	4547	4713	1128	81.22%	80.67%	69.19%	80.95%
1	5767	24,292	19,634	4637	4658	1130	80.89%	80.82%	69.08%	80.86%
2	5671	24,388	19,785	4610	4603	1061	81.10%	81.13%	69.35%	81.11%
3	5772	24,287	19,544	4664	4743	1108	80.73%	80.47%	68.70%	80.60%
4	5687	24,372	19,722	4672	4650	1015	80.85%	80.92%	68.99%	80.88%
Average							80.96%	80.80%	69.06%	80.88%