Article

Evaluation of Anomaly-Based Network Intrusion Detection Systems with Unclean Training Data for Low-Rate Attack Detection

by Angela Oryza Prabowo 1, Deka Julian Arrizki 1, Baskoro Adi Pratomo 1,*, Ahmad Ibnu Fajar 1, Krisna Badru Wijaya 1, Hudan Studiawan 1, Ary Mazharuddin Shiddiqi 1 and Siti Hajar Othman 2

1 Department of Informatics, Institut Teknologi Sepuluh Nopember, Surabaya 60111, East Java, Indonesia
2 Faculty of Computing, Universiti Teknologi Malaysia, Johor Bahru 81310, Malaysia
* Author to whom correspondence should be addressed.
J. Cybersecur. Priv. 2026, 6(1), 14; https://doi.org/10.3390/jcp6010014
Submission received: 6 November 2025 / Revised: 6 December 2025 / Accepted: 9 December 2025 / Published: 6 January 2026
(This article belongs to the Special Issue Intrusion/Malware Detection and Prevention in Networks—2nd Edition)

Abstract

Anomaly-based network intrusion detection systems (NIDSs) complement signature-based detection methods to identify unknown (zero-day) attacks. The integration of machine learning and deep learning has enhanced the efficiency of such NIDSs. However, since anomaly-based NIDSs heavily depend on the quality of the training data, the presence of malicious traffic in the training set can significantly degrade the model’s performance. Purging the training data of such traffic is often impractical. This study investigates the performance degradation caused by increasing amounts of malicious traffic in the training data. We introduced varying proportions of malicious traffic into the training sets of machine learning and deep learning models to determine which approach is most resilient to unclean training data. Our experiments revealed that Autoencoders, using a byte frequency feature set, achieved the highest F2 score (0.8989), with only a minor decrease of 0.0009 when trained on the most contaminated dataset. This performance drop was the smallest among all algorithms tested, which included an Isolation Forest, a Local Outlier Factor, a One-Class Support Vector Machine, and Long Short-Term Memory.

1. Introduction

The advancement of the internet has been accompanied by the emergence of new vulnerabilities that attackers exploit with increasingly sophisticated techniques. Among these, low-rate attacks pose significant challenges as they operate covertly by mimicking legitimate communication patterns, making them substantially more difficult to detect than high-rate attacks such as Distributed Denial of Service (DDoS) [1]. Unlike DDoS attacks that generate obvious traffic spikes, low-rate attacks silently infiltrate systems without triggering traditional volume-based detection mechanisms.
Network-based Intrusion Detection Systems (NIDSs) offer an effective method for detecting low-rate attacks by analyzing network traffic in real time to identify unauthorized intrusions or malicious activities. For instance, since a significant portion of HTTP traffic comprises printable ASCII characters and should not contain executable code, the presence of executable code in HTTP packets indicates a potential malware injection attack [2]. Traditional signature-based NIDSs rely on predefined patterns of known attacks, making them effective against recognized threats but vulnerable to zero-day attacks and novel attack variants [3]. In contrast, anomaly-based NIDSs detect deviations from established baselines of normal behavior, enabling them to identify previously unseen attacks.
Research on anomaly-based NIDSs has evolved significantly over the past few decades. Early systems were basic, rule-based structures that scrutinized system logs using predefined thresholds and statistical measures [4,5]. The integration of machine learning (ML) and deep learning (DL) has fundamentally enhanced the efficiency of anomaly-based NIDSs by enabling systems to learn patterns directly from data without requiring explicit rules for every possible scenario [6]. Recent studies have demonstrated the effectiveness of various ML approaches. For example, a study by Auskalnis, Paulauskas, and Baskys [7] employed Local Outlier Factor (LOF) to evaluate network events based on distance from k-nearest neighbors. Ripan et al. [8] showed that Isolation Forest improved classification accuracy through effective outlier removal. Zhang, Xu, and Gong [9] demonstrated that a One-Class Support Vector Machine (OCSVM) achieved higher detection rates compared to traditional methods on benchmark datasets. DL architectures have also shown promising results, with Autoencoders achieving up to 7% improvement in classification performance through reconstruction-based anomaly detection [10], and bidirectional Long Short-Term Memory (Bi-LSTM) networks demonstrating superior accuracy and detection rates compared to traditional ML approaches [11].
A typical approach to training ML-based intrusion detection models involves providing either clearly labeled malicious and benign data (supervised learning) or exclusively benign data (unsupervised learning). However, obtaining well-labeled, representative network traffic in real-world scenarios poses significant challenges. Several factors contribute to this impracticality. First, manual labeling is extremely resource-intensive given the enormous volume of network traffic, and because malicious traffic typically constitutes only a small fraction of the total, the resulting training datasets are severely imbalanced. Second, during the data collection process, some attacks may remain undetected and inadvertently be included in what is assumed to be benign training data.
Consequently, it becomes imperative to develop anomaly detection models that not only learn from benign data but are also robust when small amounts of malicious traffic are unintentionally included in the training set. In this research, such datasets are referred to as unclean training sets, where most traffic is benign but a small fraction of malicious samples may be mixed in. The possibility of contamination in real-world traffic collection has not been adequately addressed in previous NIDS research. Since malicious traffic typically constitutes a small portion of total network traffic and some attacks may evade detection during capture, it is crucial to understand how such contamination affects model performance. Therefore, this research systematically evaluates the resilience of various anomaly-based NIDS architectures when trained on unclean datasets with controlled levels of noise. The main contributions of this study are as follows:
  • We introduce controlled amounts of malicious traffic into otherwise benign datasets to simulate realistic contamination scenarios.
  • We assess multiple anomaly-based NIDS architectures, including LOF, Isolation Forest, One-Class SVM, Autoencoders, and LSTM models, under varying degrees of noise to analyze their robustness.
This paper is structured as follows. In Section 2, we present related work on anomaly-based NIDSs and their assumptions regarding training data. In Section 3, we outline the research methodology, including model architectures and threshold calculation. In Section 4, we describe the problem setting and scope of this work. In Section 5, we present the experimental results and an analysis of the impact of contamination. In Section 6, we provide an in-depth discussion of these results. Finally, Section 7 concludes the article by summarising our findings and offering directions for future work.

2. Related Works

This section begins with how anomaly-based NIDSs have evolved. We focus on various techniques and algorithms used in machine-learning-based NIDSs (ML-NIDSs) and highlight the problem with their implementation in real-world environments: the need for clean training data.
Anomaly-based intrusion detection systems (IDS) have been widely examined to counter low-rate, stealthy attacks. Early systems predominantly relied on statistical modeling and threshold-driven anomaly scoring. For instance, Bhange and Marhas used statistical profiling to detect deviations from normal traffic behavior [12]. Bhuyan et al. applied clustering to isolate anomalous traffic patterns [13], while Zhao and Wu leveraged subspace-based methods with entropy and clustering weights for large-scale anomaly detection [14]. However, as adversarial strategies evolved, these traditional approaches struggled with high false alarm rates and limited generalizability. To address these limitations, researchers began adopting machine learning (ML)-based anomaly detection techniques. Several surveys have reviewed the development trend in IDS research [15,16,17,18], emphasizing unsupervised models due to their capability to detect novel attacks and cope with imperfectly labeled datasets.
Machine learning-based IDSs initially focused on classical algorithms such as Support Vector Machines, k-Nearest Neighbors, and tree-based classifiers [7,8,19] to identify abnormal traffic patterns. These approaches often relied on handcrafted features and statistical assumptions, limiting their robustness under noisy or imbalanced traffic conditions. To overcome feature engineering challenges, deep learning emerged as an alternative, enabling automated representation extraction. Autoencoder-based techniques were introduced to learn latent representations of benign traffic [20,21], while recurrent models—particularly LSTM architectures—were employed to capture sequential dependencies in flow or payload records [22,23].
Developing robust ML-based IDS solutions requires discriminative and representative feature sets. Flow-based IDS approaches extract information such as packet counts, volume, flow duration, and connection states, and have proven scalable and lightweight in practical deployments [11,24,25]. However, flow-only features lack semantic visibility into packet content, making them ineffective in detecting low-rate or injection-driven attacks where malicious behavior is embedded within payload bytes. To address this constraint, content-based IDS research began incorporating raw packet inspection, including byte frequency modeling, n-gram payload profiling, entropy-based characterization, and deep content embeddings [26,27,28]. These representations allow detection systems to capture fine-grained structural anomalies that would otherwise appear benign in flow metadata.
Recent research trends in IDS development have focused on adaptability, online responsiveness, and hybrid detection mechanisms. Hybrid IDS frameworks combine signature-based screening with anomaly-based learning to improve coverage and reduce false alarms [29,30]. Meanwhile, adaptive and incremental learning mechanisms allow IDS models to update themselves against concept drift and evolving attack behaviors [31,32]. Despite these advancements, a persistent challenge remains: ML-based IDS require substantial and representative benign traffic samples for training, yet network traces in reality frequently contain malicious connections. This assumption of access to clean datasets, implicitly made in most prior works [21,33,34], reduces practical relevance. Therefore, robust IDS techniques must be able to tolerate unclean training sets containing contaminated or mislabeled samples, rather than assuming perfectly benign data availability.
The main purpose of this research is to identify a combination of robust features and algorithms that can tolerate some malicious traffic in the training set. We introduced varying quantities of malicious data into benign datasets to assess the impact of malicious data on the model’s outcomes. Then, with the prepared unclean training sets, we evaluated two types of content-based features, i.e., byte frequencies and byte subsequences, and several machine learning algorithms, namely LOF, Isolation Forest (ISOF), OCSVM, Autoencoders, and LSTM. As shown in Table 1, to the best of our knowledge, no extensive research has explored this area. The check mark in the table indicates that unclean data are considered in our experimental scenarios.

3. Proposed Methods

This section outlines the stages of the proposed methods for anomaly detection, encompassing dataset preparation, feature extraction, model development (training and detection phases) using ISOF, LOF, OCSVM, Autoencoders, and LSTM, and culminating in the computation and application of a threshold to judge whether a connection is malicious or not. Figure 1 illustrates the overall workflow of the proposed method.
The process begins with Dataset Preparation, where initial training sets (comprising primarily legitimate connections) are transformed into noisy training sets. These noisy sets are specifically designed to contain mostly legitimate connections, but with a controlled and intentional injection of malicious instances to reflect real-world scenarios more accurately. Following dataset preparation, representative features are extracted from these noisy training sets. The specific features vary for each proposed model. For LSTM, byte sequences (raw ordered sequences of bytes within a connection) are utilized. In contrast, Autoencoders and the classical machine learning models, namely ISOF, LOF, and OCSVM, leverage byte frequencies (the statistical distribution of byte values within a connection). The proposed models then proceed through the Model Development (Training Phase) to learn the patterns of their respective features from the prepared datasets.
After training, the Model Development (Detection Phase) commences. For LSTM and Autoencoder models, an anomaly score is computed for each connection based on its prediction or reconstruction error, respectively. This score is then compared against a pre-determined threshold to classify the connection. This threshold is derived from the training data’s anomaly score distribution to effectively distinguish between normal and anomalous behavior. Classical machine learning algorithms (ISOF, LOF, OCSVM), on the other hand, inherently detect anomalies using their internal algorithms, and thus do not require a separate threshold computation step. Detailed explanations of each stage and model will be provided in the subsequent subsections.

3.1. Dataset Preparation

As mentioned earlier, the dataset plays a critical role in determining the outcomes of a model, especially in the context of an anomaly detection model. The appropriate selection and processing of the dataset contribute significantly to the model’s improved performance. In this study, the UNSW-NB15 dataset was selected due to its comprehensive representation of legitimate network traffic and well-labeled attack categories [35]. To maintain a focused scope on low-rate attacks, this research considers only four out of the ten available traffic categories, namely Normal, Backdoors, Exploits, and Worms. Additionally, the analysis is restricted to HTTP, FTP, and SMTP protocols, as examining network packet content (payload) is deemed more effective in detecting low-rate attacks.
The dataset used in this study consists of PCAP files captured over two separate days. The PCAP file dated 22 January 2015 (UNSW-01) (https://unsw-my.sharepoint.com/personal/z5025758_ad_unsw_edu_au/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fz5025758%5Fad%5Funsw%5Fedu%5Fau%2FDocuments%2FUNSW%2DNB15%20dataset%2Fpcap%20files%2Fpcaps%2022%2D1%2D2015&viewid=f8d1dec5%2Dcd5f%2D42ae%2D8b06%2D2fece580c74a&ga=1, accessed on 8 September 2024) is designated as the training data, while the file from 17 February 2015 (UNSW-02) (https://unsw-my.sharepoint.com/personal/z5025758_ad_unsw_edu_au/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fz5025758%5Fad%5Funsw%5Fedu%5Fau%2FDocuments%2FUNSW%2DNB15%20dataset%2Fpcap%20files%2Fpcaps%2017%2D2%2D2015&viewid=f8d1dec5%2Dcd5f%2D42ae%2D8b06%2D2fece580c74a&ga=1, accessed on 8 September 2024) is used for testing. For the training dataset construction, PCAP files are split based on their TCP tuples (source and destination IP addresses and ports) using tcpflow. Each extracted TCP tuple is then classified as either legitimate or malicious by indexing it against the corresponding CSV files in the UNSW-NB15 dataset. Subsequently, all identified legitimate traffic is grouped based on its respective protocol, forming three distinct PCAP files representing HTTP, SMTP, and FTP. These files serve as the clean training sets and are merged using the mergecap tool while preserving the original packet timestamps.
To achieve the main objective of this research, which is to understand how the quantity of malicious traffic in the training set impacts model performance, we calculated the ratio of malicious traffic for every protocol. The statistics are summarized in Table 2. As shown, the FTP protocol has the lowest proportion of malicious traffic, with 0.39% of its connections classified as malicious. To ensure a consistent and fair comparative analysis across all three protocols, we therefore established a maximum injection limit of 0.4% for the experiments involving HTTP and SMTP.
To systematically assess the effect of malicious traffic in the training dataset, we gradually injected malicious traffic into the clean training sets. The injection was performed using fine-grained increments to capture detailed model behavior under varying contamination scenarios. This methodology was designed to simulate realistic conditions where training data cleanliness cannot be guaranteed in operational environments. For FTP, the proportion of injected malicious connections, referred to as the noise level, is set at 0.05%, 0.1%, 0.15%, 0.2%, 0.25%, 0.3%, 0.35%, and 0.39%. For SMTP and HTTP, the noise levels are configured at 0.1%, 0.2%, 0.3%, 0.4%, 0.5%, 0.6%, 0.7%, and 0.8%. After identifying the malicious TCP connections to be injected, we combined each set of malicious traffic with the legitimate traffic into a PCAP file using the mergecap tool, preserving the original packet timestamps. The resulting set is what we refer to as an unclean dataset or unclean training set. An example of a noisy training set for FTP, illustrating how malicious connections appear within the legitimate traffic, is shown in Table 3. The rows highlighted in red indicate malicious connections.
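To make the injection procedure concrete, the following minimal Python sketch shows one way the number of malicious connections for a given noise level could be computed and sampled; the function and variable names are hypothetical illustrations, and the actual merging into PCAP files is performed with mergecap as described above.

```python
import random

def select_noise_connections(benign_ids, malicious_ids, noise_level, seed=42):
    """Pick malicious connection IDs so that they constitute `noise_level`
    (e.g., 0.001 for 0.1%) of the resulting unclean training set.
    Solving k / (n_benign + k) = noise_level for k gives the count."""
    n_benign = len(benign_ids)
    k = round(noise_level * n_benign / (1.0 - noise_level))
    return random.Random(seed).sample(malicious_ids, k)

# FTP noise levels used in this study, expressed as fractions
ftp_noise_levels = [0.0005, 0.001, 0.0015, 0.002, 0.0025, 0.003, 0.0035, 0.0039]
```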

3.2. Feature Extraction

In this study, payload-based analysis is employed instead of relying solely on packet headers, as it is considered more effective in detecting low-rate attacks [36]. Since application-layer messages often exceed the maximum IP packet size, they are segmented into multiple packets, which may arrive in or out of order. To ensure a complete representation of network activity, incoming packets are temporarily stored in a queue buffer and grouped based on their respective TCP flows, identified by source and destination IP addresses and port numbers, before being reassembled according to the TCP protocol standard defined in RFC 793 [37]. This reassembly process enables the system to analyze entire reconstructed payloads, providing a more comprehensive view of network traffic.
Once the TCP connection is terminated, typically marked by a FIN packet, the reassembled payload is processed by the model. However, if a connection remains incomplete beyond a predefined timeout, it is considered disrupted and is processed accordingly. By enforcing this timeout mechanism, the system prevents stalled or abandoned connections from affecting the accuracy of anomaly detection.
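As a concrete illustration of this grouping step, the deliberately simplified sketch below (using scapy as an assumed packet-parsing dependency) collects payload bytes per TCP 4-tuple and concatenates them in sequence-number order. It omits the retransmission and overlap handling required by RFC 793 as well as the timeout mechanism described above, so it should be read as a sketch of the idea rather than a full reassembler.

```python
from collections import defaultdict
from scapy.all import IP, TCP, rdpcap  # assumed dependency

def reassemble_payloads(pcap_path):
    """Group packets by (src IP, dst IP, src port, dst port) and join
    their payload bytes in sequence-number order. Retransmissions,
    overlapping segments, and timeouts are deliberately ignored."""
    flows = defaultdict(list)
    for pkt in rdpcap(pcap_path):
        if IP in pkt and TCP in pkt:
            key = (pkt[IP].src, pkt[IP].dst, pkt[TCP].sport, pkt[TCP].dport)
            data = bytes(pkt[TCP].payload)
            if data:
                flows[key].append((pkt[TCP].seq, data))
    return {key: b"".join(d for _, d in sorted(segs)) for key, segs in flows.items()}
```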
Once reassembled, the complete application-layer message (payload) is ready for feature extraction. To be processed by our machine learning models, the raw payload must first be transformed into a meaningful numerical format. While various payload representation methods exist, including [38,39,40], this study focuses on two complementary approaches: byte frequencies and byte sequences. The fundamental difference in how these two features are constructed from the same payload is illustrated in Figure 2. The process begins by converting the raw payload into a universal sequence of integers, where each byte is mapped to its corresponding value from 0 to 255. From this numerical sequence, the two distinct feature sets are derived. The precise formulation and rationale for each of these representations are detailed in the subsequent subsections.

3.2.1. Byte Frequencies

Byte frequencies capture the statistical distribution of byte values within network payloads, representing how often each possible byte value (0–255) appears in a given message. We chose this approach based on the observation that different types of network traffic exhibit distinct byte distribution patterns. Legitimate application-layer protocols typically conform to well-defined character sets and structural patterns. For instance, HTTP traffic predominantly comprises printable ASCII characters (bytes 32–126), while DNS queries follow specific encoding schemes. In contrast, malicious payloads, such as injected executable code, shellcode, or encrypted malware, introduce anomalous byte distributions that deviate significantly from these expected patterns.
Formally, for a payload of length $N$ bytes, we construct a 256-dimensional feature vector where each dimension represents the frequency of occurrence of a specific byte value. However, payload lengths vary considerably across different traffic types and network conditions, potentially biasing the analysis toward longer payloads. To address this, we normalize the byte frequencies by dividing each byte count $x$ by the total payload length $N$, as described in Equation (1). After normalization, the byte-frequency data ($x_{\text{norm}}$) are used for training and testing the detection models (i.e., ISOF, LOF, OCSVM, and Autoencoders).

$$x_{\text{norm}} = \frac{x}{N} \tag{1}$$
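As a small illustration of Equation (1), the following NumPy sketch maps a reassembled payload to its normalized 256-dimensional byte-frequency vector; the sample payload is invented for the example.

```python
import numpy as np

def byte_frequencies(payload: bytes) -> np.ndarray:
    """Equation (1): count occurrences of each byte value 0-255 and
    divide by the payload length N to obtain x_norm."""
    counts = np.bincount(np.frombuffer(payload, dtype=np.uint8), minlength=256)
    return counts / max(len(payload), 1)

# Example: HTTP-like payload dominated by printable ASCII (bytes 32-126)
vec = byte_frequencies(b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n")
assert vec.shape == (256,) and abs(vec.sum() - 1.0) < 1e-9
```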

3.2.2. Byte Sequences

While byte frequencies effectively capture the statistical composition of payloads, they inherently discard positional information—the order in which bytes appear. However, many attack signatures are characterized not merely by the presence of specific bytes but by particular byte patterns and their sequential arrangement. For example, return-oriented programming (ROP) chains, format string exploits, and certain injection attacks rely on specific byte sequences that would be indistinguishable from benign traffic when examining only frequency distributions [41].
To preserve this sequential information, we employ byte sequences as an alternative feature representation specifically for LSTM-based models, which are architecturally designed to capture temporal dependencies in sequential data [42]. We extract byte subsequences using a sliding window approach with configurable window size, similar to n-gram extraction in natural language processing. This method generates overlapping subsequences that capture local byte patterns within the payload. Algorithm 1 illustrates the transformation process from raw packet payload to byte subsequences.
Algorithm 1 Sliding Window for Byte Sequence
 1: procedure CreateSlidingWindow(byteSequence, windowSize, stepSize)
 2:     Input: byteSequence, windowSize, stepSize
 3:     Output: list of sliding windows
 4:     windowsList ← [ ]
 5:     for i ← 0 to |byteSequence| − windowSize step stepSize do
 6:         window ← byteSequence[i : i + windowSize]
 7:         windowsList.append(window)
 8:     end for
 9:     return windowsList
10: end procedure
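A direct Python rendering of Algorithm 1 is given below, together with a hypothetical example of how the overlapping windows yield the (input, next-byte target) pairs used to train the LSTM in Section 3.3.1.

```python
def create_sliding_windows(byte_sequence, window_size=2, step_size=1):
    """Python rendering of Algorithm 1: overlapping subsequences of
    `window_size` bytes, advancing by `step_size`."""
    return [
        byte_sequence[i : i + window_size]
        for i in range(0, len(byte_sequence) - window_size + 1, step_size)
    ]

# Deriving (input, next-byte target) pairs for LSTM training:
payload = list(b"USER")            # [85, 83, 69, 82]
windows = create_sliding_windows(payload, window_size=3)
pairs = [(w[:-1], w[-1]) for w in windows]  # ([85, 83], 69), ([83, 69], 82)
```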

3.3. Anomaly Detection

Anomalies are data points that deviate significantly from normal patterns, also known as outliers or rare events. Anomaly detection works by analyzing historical data to identify these unusual instances. In domains like NIDSs, this process provides critical insights by flagging potential threats.
There are three broad approaches to detecting anomalies: supervised, unsupervised, and semi-supervised learning. This research focuses on unsupervised learning. Unlike supervised or semi-supervised learning, unsupervised techniques do not require labelled training data [43]. By assuming that normal data points occur far more often than anomalous ones, unsupervised techniques detect anomalies by flagging data points that occur infrequently. Instead of assigning labels to data points, they assign each data point a score that indicates how likely it is to be an anomaly. However, this approach usually assumes that the training data are clean and contain no malicious instances. Therefore, this study examines the impact of different noise levels in training data on anomaly detection model performance.
As previously mentioned, this research evaluates various machine learning models for anomaly-based intrusion detection, comparing their effectiveness in identifying network intrusion. The models examined include Long Short-Term Memory (LSTM), Autoencoders, and classical machine learning techniques such as Isolation Forest (ISOF), Local Outlier Factor (LOF), and One-Class Support Vector Machine (OCSVM). This subsection provides an in-depth discussion of each model’s core concepts, architecture, and their utilization in detecting intrusions through anomaly detection.

3.3.1. Long Short-Term Memory

A Long Short-Term Memory (LSTM) model is typically used for classification problems where labelled data are provided to train the model. In this research, however, LSTM is employed for anomaly detection, which requires a different approach. The development of the LSTM model is divided into two phases: training and detection.
In the training phase, the LSTM model is trained to predict the next item in a subsequence obtained from the network packet payload. As illustrated in the “Byte Sequence” portion of Figure 2, the model is given an input subsequence $x_i$ (e.g., [85, 83]) and trained to predict the immediately following byte, which serves as the target label $y_i$ (e.g., 69).
Formulating the LSTM classifier as a simple function would be an oversimplification, as multiple operations are involved. Instead, the function is expressed in more detail in Equation (2). $E(x)$ transforms the input into a vector of specific dimensions, acting as an embedding layer. The function $R$ represents the recurrent layer, which takes the embedded vectors as inputs and outputs an intermediate vector. This intermediate vector is then processed by the softmax function $SF$, which calculates the probability distribution over possible next-byte candidates. Finally, the argmax function selects the byte with the highest probability as the predicted next item in the sequence.

$$F_p(x) = \operatorname{argmax}\big(SF(R(E(x)))\big) \tag{2}$$
The primary objective of training is to enable the LSTM model to remember common byte sequences found in network packet payloads. If the LSTM has encountered a byte sequence before, its prediction error is expected to be low. Conversely, unusual traffic or unseen attack patterns are likely to yield higher prediction errors due to unfamiliar byte sequences.
In the detection phase, the trained LSTM model processes incoming byte sequences, just as in the training phase. However, in addition to predicting the next byte, the model also computes prediction errors, which are used to detect anomalies. Two methods are employed for error calculation: binary anomaly scoring and floating anomaly scoring. The binary anomaly score ($a_p^{\text{binary}}$) counts mispredictions by assigning a value $v_i$ of one whenever the predicted byte $\hat{y}_i$ does not match the actual byte $y_i$, as formally defined in Equation (3). In contrast, the floating anomaly score ($a_p^{\text{float}}$) quantifies the numerical deviation between the softmax output of the model, $\operatorname{Prob}(\hat{y}_i)$, and the expected probability, $\operatorname{Prob}(y_i)$, as defined in Equation (4). Both scoring methods are normalized by the message length $l$ to ensure comparability across sequences. A connection is flagged as malicious if its resulting anomaly score exceeds a predetermined threshold. While this threshold can be set manually, in this research it is computed statistically (see Section 3.4).

$$a_p^{\text{binary}} = \frac{\sum_{i=0}^{l-n} v_i}{l}, \qquad v_i = \begin{cases} 1, & \hat{y}_i \neq y_i \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

$$a_p^{\text{float}} = \frac{\sum_{i=0}^{l-n} \big(\operatorname{Prob}(\hat{y}_i) - \operatorname{Prob}(y_i)\big)^2}{l} \tag{4}$$
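To make the two phases concrete, the sketch below shows a minimal Keras model matching Equation (2) (an embedding layer $E$, a recurrent layer $R$, and a softmax output $SF$), together with a binary anomaly score in the spirit of Equation (3). The layer sizes are placeholders rather than the paper’s values (those appear in Table 6), and the score is normalized by the number of windows, a close proxy for the message length $l$.

```python
import numpy as np
import tensorflow as tf

# Illustrative sizes only; the paper's actual hyperparameters are in Table 6.
EMBED_DIM, LSTM_UNITS = 32, 64

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=256, output_dim=EMBED_DIM),  # E(x)
    tf.keras.layers.LSTM(LSTM_UNITS),                                # R(.)
    tf.keras.layers.Dense(256, activation="softmax"),                # SF(.)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

def binary_anomaly_score(model, inputs, targets):
    """Fraction of mispredicted next bytes, as in Equation (3); the sum of
    mispredictions is divided here by the number of windows, which closely
    tracks the message length l."""
    probs = model.predict(np.asarray(inputs), verbose=0)
    return float((probs.argmax(axis=1) != np.asarray(targets)).mean())
```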

3.3.2. Autoencoders

Unlike LSTM networks, which process byte sequences, Autoencoders operate on vectorized representations of byte frequencies, as they are not designed to handle sequential data with variable lengths. Formally, an Autoencoder model is defined as a non-linear function $G_p$, which maps an input vector $X$ to its reconstructed output $\hat{X}$, as expressed in Equation (5). The function $G_p$ consists of stacked neural network layers and is optimized through backpropagation to minimize the reconstruction error, ensuring that $\hat{X}$ closely approximates $X$.

$$\hat{X} = G_p(X) \tag{5}$$

This study adopts the Autoencoder model developed by Pratomo et al. [44], which is trained on an unclean dataset containing both normal and anomalous network traffic. The detection phase evaluates anomaly scores based on reconstruction errors, computed using the mean squared error (MSE) between input and output (Equation (6)). Specifically, if $x_i$ represents the input frequency of byte $i$ and $\hat{x}_i$ its reconstructed value, the anomaly score reflects deviations from learned patterns. Higher reconstruction errors indicate traffic patterns that were uncommon in the training data. A connection is classified as malicious if its anomaly score surpasses a precomputed threshold.

$$e = \frac{1}{256} \sum_{i=0}^{255} (x_i - \hat{x}_i)^2 \tag{6}$$
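A minimal sketch of such a model in Keras follows; the layer widths are illustrative assumptions (the experiments vary 1, 3, and 5 hidden layers, with the actual hyperparameters listed in Table 6), and the helper reproduces the per-connection reconstruction error of Equation (6).

```python
import numpy as np
import tensorflow as tf

def build_autoencoder(hidden_dims=(64,)):
    """Fully connected autoencoder over 256-dimensional byte-frequency
    vectors; `hidden_dims` sets the number and width of hidden layers."""
    inputs = tf.keras.Input(shape=(256,))
    x = inputs
    for dim in hidden_dims:
        x = tf.keras.layers.Dense(dim, activation="relu")(x)
    outputs = tf.keras.layers.Dense(256, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

def reconstruction_errors(model, X):
    """Equation (6): mean squared error between each input vector and
    its reconstruction, used as the anomaly score."""
    X_hat = model.predict(X, verbose=0)
    return np.mean((X - X_hat) ** 2, axis=1)
```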

3.3.3. Classical Machine Learning

Unlike deep learning models such as LSTM and Autoencoders, the machine learning models used in this study, OCSVM (One-Class Support Vector Machine) [45], LOF (Local Outlier Factor) [46], and ISOF (Isolation Forest) [47], are specifically designed for anomaly detection. As a result, these models can be applied without architectural modifications.
All three models process byte frequency vectors rather than byte sequences, as they are not designed for sequential data. Given that the training dataset consists primarily of legitimate connections with some malicious traffic as noise, malicious data points are more likely to fall outside OCSVM’s decision boundary, exhibit higher LOF scores, and have longer path lengths in ISOF’s tree structure.
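As a sketch of how these detectors could be applied to the byte-frequency vectors with scikit-learn defaults (the experiments use default parameters throughout, per Section 5.1.2), consider the following; the file names are hypothetical, and passing `novelty=True` to LOF, which scikit-learn requires for scoring unseen test data, is an assumption the paper does not spell out.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

X_train = np.load("train_byte_freqs.npy")  # hypothetical (n, 256) arrays
X_test = np.load("test_byte_freqs.npy")

models = {
    "ISOF": IsolationForest(),
    "LOF": LocalOutlierFactor(novelty=True),  # enables predict() on new data
    "OCSVM": OneClassSVM(),
}
for name, model in models.items():
    model.fit(X_train)  # unclean training set: mostly benign connections
    flagged = model.predict(X_test) == -1  # -1 marks outliers (malicious)
```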

3.4. Calculating Threshold

As mentioned in Section 3.3.1 and Section 3.3.2, in this research, the threshold for determining whether a connection is malicious or legitimate is statistically computed exclusively for LSTM and Autoencoders. This threshold is derived from the anomaly scores obtained during the detection phase. LSTM produces two types of anomaly scores: binary and floating (see Section 3.3.1 for details), while Autoencoders use reconstruction error as their anomaly score (see Section 3.3.2). Any anomaly score exceeding the computed threshold is classified as anomalous, whereas scores below it are considered benign.
This research employs three methods to determine the floating anomaly score threshold. The first method classifies a connection as malicious if its anomaly score falls beyond the mean ± two standard deviations. While straightforward, this approach assumes a near-normal distribution and is itself sensitive to outliers.
Therefore, we implement a second, more robust method based on the work of [48], which utilizes the median and interquartile range (IQR). The median is less influenced by extreme values, making this method suitable for non-normally distributed data. For skewed distributions, a further adjustment using the medcouple (MC) is recommended [49]. The resulting threshold, $T_{\text{IQR}}$, can be computed as described in Equation (7), with $Q_3$ representing the third quartile.

$$T_{\text{IQR}} = \begin{cases} Q_3 + e^{3 \cdot MC} \cdot 1.5 \cdot IQR, & \text{if } MC \geq 0 \\ Q_3 + e^{4 \cdot MC} \cdot 1.5 \cdot IQR, & \text{if } MC < 0 \end{cases} \tag{7}$$
The third method for defining the threshold involves using the Median Absolute Deviation (MAD) [50]. This approach utilizes both the median and median absolute deviation as robust measures of central tendency and dispersion, respectively. In this method, the threshold remains static, but cannot be directly compared with the reconstruction error. To make such a comparison, the reconstruction error must be transformed into its z-score, as outlined in Equation (8). In this context, a TCP connection is identified as malicious when its z-score exceeds the specified threshold, usually set at 3.5 [50].
$$z = \frac{0.6745 \cdot |e - \operatorname{median}(E)|}{MAD} \tag{8}$$
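The sketch below illustrates all three rules with NumPy, using the medcouple implementation from statsmodels (an assumed dependency); it operates on the array of training-set anomaly scores described above.

```python
import numpy as np
from statsmodels.stats.stattools import medcouple  # assumed dependency

def threshold_mean_std(scores):
    """Method 1: mean plus two standard deviations."""
    return np.mean(scores) + 2 * np.std(scores)

def threshold_iqr_mc(scores):
    """Method 2, Equation (7): medcouple-adjusted upper IQR fence."""
    q1, q3 = np.percentile(scores, [25, 75])
    mc = float(medcouple(np.asarray(scores)))
    factor = np.exp(3 * mc) if mc >= 0 else np.exp(4 * mc)
    return q3 + factor * 1.5 * (q3 - q1)

def mad_zscores(scores):
    """Method 3, Equation (8): modified z-scores; a connection is flagged
    when its z-score exceeds the usual cut-off of 3.5."""
    scores = np.asarray(scores)
    med = np.median(scores)
    mad = np.median(np.abs(scores - med))
    return 0.6745 * np.abs(scores - med) / mad
```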
After the training phase, one of these three threshold calculation methods is applied. Before transitioning to detection mode, the LSTM and Autoencoder models reprocess the training set to compute anomaly scores for each TCP connection. LSTM calculates both binary and floating scores, while Autoencoders compute the reconstruction error. These scores are then used to determine the final threshold, which is subsequently applied in the detection phase.

4. Problem Setting

To systematically evaluate the proposed detection approach, it is necessary to establish the adversarial context, deployment constraints, and dataset characteristics that define the scope of this work. In what follows, we characterise the adversarial capabilities, outline the operational constraints of realistic network environments, and justify our selection of UNSW-NB15 for modelling low-rate, stealthy intrusion scenarios.

4.1. Threat Model

In this work, we consider an attacker with moderate capabilities who operates under constraints that favour low-rate, stealth-oriented behaviour. Rather than overwhelming the network with high-volume traffic, the attacker sends carefully crafted, low-frequency requests designed to mimic legitimate client activity. We focus on adversaries capable of injecting malicious scripting content—including PHP, Python, Ruby, and SQL—alongside shellcode fragments or command sequences intended to gain remote access, escalate privileges, or maintain persistence on a target host. The attacker is assumed to deliver these payloads through text-based TCP protocols such as HTTP and FTP, which provide well-structured, human-readable request formats that facilitate covert manipulation without triggering volumetric anomalies.

4.2. Deployment Assumptions

The proposed detection approach assumes that packet payloads are accessible in plaintext form for inspection. Although modern networks commonly transmit traffic over TLS, this assumption remains realistic in enterprise environments where intermediate systems legitimately terminate encrypted channels. Examples include reverse proxies, TLS-terminating load balancers, API gateways, and application firewalls, all of which receive decrypted content before forwarding it to backend services.
Figure 3 illustrates this deployment assumption. Under these conditions, the anomaly detection logic may be integrated as a module within existing traffic inspection components—such as ModSecurity-enabled web servers or NGINX App Protect deployments—without violating end-to-end security guarantees. Since payload inspection occurs post-TLS termination, the method does not require intrusive key extraction, man-in-the-middle interception, or packet decryption at unauthorised points.

4.3. Dataset Representativeness and Generalisability

UNSW-NB15 was selected due to the diversity and realism of its attack samples. Compared to DARPA, NSL-KDD, and CIC-IDS2017, which provide at most 222 low-rate attack samples, UNSW-NB15 provides approximately 27,000 labelled low-rate connections. CSE-CIC-IDS2018 offers a larger quantity (approximately 162,000 samples); however, many of its low-rate attacks were generated using scripted interactions against the DVWA testbed, resulting in repetitive patterns with limited behavioural diversity. In contrast, UNSW-NB15 traffic was generated using IXIA PerfectStorm, which simulates enterprise-like traffic streams with realistic timing variations, protocol noise, and exploit behaviours, making it more suitable for modelling stealthy intrusion behaviour. While UNSW-NB15 captures realistic network interaction patterns, additional validation on contemporary datasets would help further substantiate the robustness of our approach, particularly under different traffic compositions and adversarial behaviours.

5. Experiments and Results

This research aims to determine which combination of detection model (OCSVM, LOF, ISOF, Autoencoders, and LSTM) and feature set (byte frequencies or byte sequences) yields the optimal detection performance, especially when trained on unclean datasets containing varying degrees of malicious traffic. As detailed in Section 3.1, these models are trained on such datasets and then rigorously evaluated using a dedicated testing set. Figure 4 illustrates this comprehensive evaluation process. Initially, relevant features, either byte sequences or byte frequencies, are extracted from the testing sets. Byte sequences serve as features exclusively for the LSTM model, while byte frequencies are utilized by Autoencoders and all other machine learning models (ISOF, LOF, and OCSVM).
For the deep learning models (LSTM and Autoencoders), as previously outlined in Section 3, an anomaly score is calculated based on their predictions on the testing dataset (refer to Section 3.3.1 for LSTM and Section 3.3.2 for Autoencoders). This score is then compared against a predefined threshold, established during the detection phase (Section 3.4). A connection is classified as malicious if its anomaly score surpasses this threshold. In contrast, machine learning models (ISOF, LOF, and OCSVM) operate differently. These models inherently determine anomalous data without requiring a separate manual anomaly score computation or thresholding. Instead, their respective algorithms (explained in Section 3.3.3) directly process the extracted byte frequencies from the testing set connections to classify them as either legitimate or malicious.

5.1. Experiment Setup

5.1.1. Dataset and Evaluation Metrics

The methodology for creating the noisy training sets, including the variation of noise ratios for each protocol, is detailed in Section 3.1. It is important to note that while the noise ratio for the HTTP protocol was varied up to 0.8%, for this study, we limited the maximum noise level to 0.5% due to time constraints.
For the testing phase, we utilized the UNSW-02 dataset. The construction of the testing set followed the same preparation process as the training set: the PCAP files were split based on their TCP tuples, and each connection was classified as either legitimate or malicious by indexing against the corresponding CSV files in the UNSW-NB15 dataset. All connections (both legitimate and malicious) were then grouped based on their respective protocols (HTTP, SMTP, and FTP). An excerpt from the resulting dataset is shown in Table 4, illustrating the structure, where labels ‘0’ and ‘1’ denote legitimate and malicious traffic, respectively. The final composition of the testing set, detailing the number of TCP connections per protocol, is summarized in Table 5.
The performance of the detection model was evaluated using a confusion matrix, as depicted in Figure 5. In this matrix, rows correspond to the actual class instances, where the positive condition (P) represents malicious traffic and the negative condition (N) represents benign traffic, while columns represent the predicted class instances [51,52,53,54]. The resulting two-by-two contingency table reports the counts for four key outcomes: true positives (TP), denoting correctly identified malicious connections; false negatives (FN), representing undetected malicious instances; false positives (FPs), indicating benign traffic incorrectly classified as malicious; and true negatives (TN), reflecting correctly classified benign traffic. This comprehensive breakdown facilitates a more detailed performance analysis than the proportion of correct classifications (accuracy), as accuracy can yield misleading results when the dataset is unbalanced and class observations vary significantly.
The confusion matrix results will be used to generate the evaluation metrics. In this study, we employ three types of evaluation metrics: Detection Rate, False-Positive Rate, and F2 score.
Detection Rate (DR) measures the model’s ability to identify positive instances correctly. A higher detection rate indicates the model’s enhanced capability to identify malicious cases effectively. Equation (9) shows the formula of detection rate.
$$DR = \frac{TP}{TP + FN} \tag{9}$$
False-Positive Rate (FPR) provides insights into the model’s tendency to misclassify negative instances as positive. A lower FPR signifies that the model makes fewer errors by classifying negative instances as positive. Equation (10) shows the formula of FPR.
$$FPR = \frac{FP}{TN + FP} \tag{10}$$
The F2 Score is the weighted harmonic mean of precision and recall for a given threshold. It diverges from the F1 Score by placing a greater emphasis on recall than on precision. Greater weight is attributed to recall in cases where false negatives (undetected attacks) have more negative consequences and are deemed more severe than false positives. A higher F2 Score indicates a well-balanced consideration of precision and recall, with a pronounced emphasis on recall. Equation (11) illustrates the computation of the F2 Score, highlighting that False Negatives have a greater impact than False Positives, with FN values carrying a higher weight than FP values.
$$F_2 = \frac{(1 + 2^2) \cdot TP}{(1 + 2^2) \cdot TP + 2^2 \cdot FN + FP} \tag{11}$$
By incorporating these three evaluation metrics, we aim to comprehensively evaluate the model’s performance, considering its ability to correctly identify positive instances and avoid false positives.
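For clarity, the three metrics reduce to a few lines of Python over raw confusion-matrix counts; the example figures below are invented for illustration.

```python
def detection_rate(tp, fn):
    """Equation (9): fraction of malicious connections detected."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """Equation (10): fraction of benign connections misclassified."""
    return fp / (tn + fp)

def f2_score(tp, fn, fp):
    """Equation (11): F-beta with beta = 2, weighting missed attacks (FN)
    four times as heavily as false alarms (FP)."""
    return (1 + 2**2) * tp / ((1 + 2**2) * tp + 2**2 * fn + fp)

# Example: 90 detected attacks, 10 missed, 40 false alarms
print(f2_score(tp=90, fn=10, fp=40))  # ~0.849
```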

5.1.2. Network Architecture and Hyperparameters

The overview of the proposed deep learning architectures (LSTM and Autoencoders) has been discussed previously in Section 3.3. As deep learning models contain numerous hyperparameters, we list both the LSTM and Autoencoders’ hyperparameters used in this research in Table 6 for reproducibility purposes. For machine learning models, all parameters were set to their default values.

5.2. Experiment Results

This study involves experiments using the LSTM model and classical machine learning models, including OCSVM, LOF, and ISOF, as well as the Autoencoders model for comparison. For the byte sequence feature, each sequence was consistently set to a length of 2 bytes. This length was selected as the baseline condition representing the minimum sequence length that preserves ordinal information; sequences of length 1 would eliminate sequential dependencies entirely.
In the Autoencoder experiments, the testing methodology involved varying the number of hidden layers to 1, 3, and 5. The anomaly detection thresholds for these experiments were determined using three distinct methods: mean, interquartile range (IQR), and z-score. For the LSTM experiments, testing was systematically divided based on two computational approaches for generating anomaly scores: binary and floating. Each approach was further evaluated using three corresponding threshold methods: b_mean, b_iqr, and b_zscore for the binary scores, and f_mean, f_iqr, and f_zscore for the floating-point scores. Furthermore, the LSTM’s performance was differentiated across the network protocols analyzed, specifically HTTP, FTP, SMTP, and a combined dataset encompassing all three, as well as by the ratio of noise introduced into the datasets.

5.2.1. Protocol-Based Attack Detection Performance

Our experimental evaluation systematically assessed model performance across three protocols (HTTP, FTP, and SMTP) under varying noise conditions. We conducted experiments with varying noise ratios as detailed in Section 3.1 to evaluate the robustness of each approach. The comprehensive results are presented in Table A1, Table A2, Table A3, Table A4, Table A5, Table A6, Table A7, Table A8 and Table A9, organized by model category and protocol: HTTP traffic (Table A1, Table A2 and Table A3), FTP traffic (Table A4, Table A5 and Table A6), and SMTP traffic (Table A7, Table A8 and Table A9).
For HTTP traffic, our findings reveal distinct performance characteristics across model architectures. In the clean data scenario (0% noise), the LOF model achieved the highest F2 score among classical machine learning approaches. Autoencoders demonstrated superior performance using z-score thresholding across all hidden layer configurations, while LSTM networks achieved optimal results with floating anomaly scores combined with IQR thresholding. Notably, autoencoders outperformed the other two categories by up to 13% in F2 score when evaluated on clean data. To assess noise resilience, we examined performance degradation patterns illustrated in Figure 6, which depicts model behavior under increasing noise levels, with dotted lines representing linear regression trends. The gradient of these trends serves as an indicator of noise sensitivity, where steeper negative gradients correspond to greater performance degradation. As observed, autoencoders exhibited the smallest gradient magnitudes, demonstrating robust noise resistance. Conversely, the LOF model experienced the steepest decline (gradient = −0.0943), indicating high susceptibility to training data contamination.
The experimental results for FTP traffic, detailed in Table A4, Table A5 and Table A6, reveal performance patterns consistent with those observed for HTTP. In the clean data scenario, the LOF model again achieved the highest F2 score among classical machine learning approaches. For autoencoders, optimal performance was obtained using IQR thresholding with a single hidden layer configuration. LSTM networks achieved peak performance with binary anomaly scores paired with mean thresholding. The performance gap between model categories narrowed considerably for FTP, with autoencoders leading by only 1.21% in F2 score. Figure 7 illustrates the noise sensitivity across models for FTP traffic. Consistent with HTTP results, autoencoder variants demonstrated the smallest gradient magnitudes, confirming their superior noise tolerance. The LOF model exhibited the highest sensitivity (gradient = −0.0272), though the magnitude of degradation was substantially lower than that observed in HTTP traffic.
For SMTP traffic, the experimental outcomes are presented in Table A7, Table A8 and Table A9 following the same organizational structure. Unlike HTTP and FTP, the ISOF model produced the highest F2 score among classical machine learning methods when tested on clean data. Autoencoders achieved optimal performance using IQR thresholding with five hidden layers, while LSTM networks performed best with binary anomaly scores combined with mean thresholding. Autoencoders maintained their performance advantage with an 11.5% higher F2 score relative to competing approaches. The noise sensitivity analysis for SMTP traffic, depicted in Figure 8, reinforces the patterns observed across other protocols. Autoencoder variants consistently exhibited the smallest gradient magnitudes, demonstrating robust performance under noise. The LOF model again showed the highest sensitivity (gradient = −0.0547), with a degradation magnitude falling between the values observed for HTTP and FTP.
Our experiments across all three protocols consistently demonstrate that autoencoders exhibit superior noise resilience compared to classical machine learning and LSTM approaches. While classical methods—particularly LOF—can achieve competitive performance on clean data, they suffer substantial degradation when trained on contaminated datasets. LSTM networks show intermediate sensitivity, with performance varying based on the anomaly scoring and thresholding combination employed.

5.2.2. Overall Attack Detection Performance

To evaluate the generalizability of the proposed models across diverse traffic types, we conducted an aggregate performance analysis that averaged the detection rate, F2 scores, and FPR from the same model variations across all protocols. These variations encompass the specific algorithms for classical machine learning, the threshold method and layer count for Autoencoders, and the scoring calculation and threshold method for LSTM. Due to the varying maximum noise limits in the protocol-specific datasets, this aggregate analysis utilizes consistent noise variations of 0%, 0.1%, 0.2%, and 0.3%. The aggregated results are reported in Table A10, Table A11 and Table A12, representing classical machine learning, Autoencoders, and LSTM, respectively.
As illustrated in Table A10, the results remain consistent with the protocol-specific experiments: LOF achieves the highest F2 score among classical machine learning methods in the clean data scenario. For Autoencoders (Table A11), the highest F2 score is obtained when employing z-score thresholding with a single hidden layer. Regarding LSTM (Table A12), the highest F2 score was achieved utilizing floating anomaly calculation with the IQR thresholding method. Consistent with previous findings, Autoencoders demonstrate a substantial advantage, leading by up to 13% in F2 score compared to other algorithms. The impact of noise variation, illustrated in Figure 9, further confirms that Autoencoders exhibit the most resilience to noise (smallest gradient magnitude), while LOF proves to be the least robust (steepest gradient, −0.1382).
Moreover, Table 7 presents a comprehensive comparison of the optimal performance parameters for each model category using the F2 score as the primary evaluation metric. The optimal parameters were identified by selecting the configurations that yielded the highest average F2 score across all noise levels. This metric was selected to provide a balanced representation of recall and precision, ensuring that both false negatives and false positives are appropriately weighted. The comparative analysis reveals that while LSTM models generally yield lower F2 scores than Autoencoders, they surpass classical machine learning models in performance. Notably, Autoencoders consistently achieved the highest values across all test cases, demonstrating their superiority in anomaly detection tasks under varying noise conditions.
Figure 10 provides a comparative visual analysis of the optimal F2 scores achievable by each algorithm category across the tested protocols. Across all four subplots (a–d), the Autoencoder model (represented by the green line) consistently maintains the highest performance trajectory, visually distinct from the RNN (orange) and classical machine learning (blue) baselines. Crucially, the trend lines for the Autoencoder exhibit minimal negative gradients, appearing nearly horizontal in the FTP and Overall scenarios, which underscores the model’s remarkable stability against increasing noise ratios. Conversely, classical machine learning models display the most significant performance degradation, particularly evident in the HTTP and Overall traffic plots where the downward slope is most pronounced.

5.2.3. Overall Attack Detection Performance with Best Parameter

While the analysis presented in Section 5.2.2 evaluated performance using a single parameter set averaged across all traffic types, that generalized approach provides a constrained perspective on model capabilities. The reason is that optimal hyperparameters and thresholding methods vary significantly depending on the specific network protocol under examination. To address this limitation, this subsection assesses detection capabilities by aggregating the mean F2 scores obtained using the best-performing parameters tailored to each specific traffic type, as detailed in Table 8.
The aggregated results, illustrated in Figure 11, confirm the performance hierarchy observed in the previous section. Our findings reveal that the LSTM model generally yields lower F2 scores than Autoencoders but consistently outperforms traditional machine learning models. Nevertheless, the stability metrics derived from this optimized approach are more definitive than those obtained in the general analysis. As can be observed in Figure 11, the distinction between the models lies heavily in their resilience to unclean training data. Even when tuned to their optimal parameters, classical machine learning models exhibit the highest sensitivity to noise, manifesting as a steep negative gradient of −0.0425. On the contrary, Autoencoders demonstrate remarkable robustness that is inherent to the architecture rather than a result of specific parameter tuning. Their performance trend line in Figure 11 is virtually flat, with a minimal gradient of −0.0013, indicating that they are the least sensitive to malicious traffic in the training set. The LSTM model occupies a middle ground with a gradient of −0.0128, showing moderate resilience that exceeds classical models but lacks the near-total immunity to unclean training data exhibited by the Autoencoders.

6. Discussion

Throughout all experiments, whether protocol-specific or aggregated across protocols, the relative performance rankings of the algorithms remained remarkably consistent. Autoencoders consistently achieved the highest F2 scores, followed by LSTM models in second place, with classical machine learning algorithms trailing in third. However, the performance nuances within each algorithmic family varied considerably across experimental scenarios. For instance, in the classical machine learning category, LOF typically achieved the highest F2 scores when trained on clean datasets across different experimental configurations. However, when F2 scores were averaged across all noise levels, ISOF demonstrated superior performance in most experimental settings. This shift can be attributed to LOF’s pronounced sensitivity to noise, as evidenced by the steep decline in its performance gradient.
Although the injected noise levels in our experiments were capped below 1%—a conservative representation compared to potentially higher contamination rates in operational networks—the linear degradation trends observed allow us to infer expected performance under greater noise levels. Assuming the degradation patterns remain approximately linear, the computed gradients provide a reasonable basis for extrapolation to real-world scenarios. Consider the cross-protocol scenario with optimal parameters as an example (see Section 5.2.3). Referring to Figure 11, autoencoders exhibit the linear regression equation y = −0.0128x + 0.8769. If the noise level were 1%, the F2 score of the corresponding autoencoder model under this scenario would approximate 0.8767.
Autoencoders not only dominated in F2 score performance but also exhibited the greatest resilience to noise across all experimental configurations, as demonstrated by their minimal gradient values. This robustness stems from the synergy between their architectural complexity and the byte frequency feature representation. Byte frequencies capture distributional patterns rather than sequential dependencies, making them inherently tolerant to perturbations in byte ordering. Consequently, when bytes are reordered but maintain similar frequency distributions, Autoencoders can still recognize legitimate connections. The reconstruction-based anomaly detection mechanism of Autoencoders learns a compressed representation of normal traffic patterns in the latent space, enabling them to distinguish genuine anomalies from noise-induced variations. This is reflected in their consistently high detection rates (reaching 100% in several FTP experiments) while maintaining acceptably low false-positive rates.
LSTM models occupied an intermediate position in both performance and noise resilience. The byte sequence representation leverages temporal dependencies and ordering information, providing advantages when attacks manipulate byte arrangements rather than substituting byte values entirely. The recurrent architecture enables LSTM to capture complex behavioral patterns in malicious connections, as evidenced by detection rates exceeding 99% in several scenarios. However, this sequential sensitivity introduces a critical vulnerability: LSTM models become overly rigid to learned data patterns and sensitive to noise, manifested in elevated false-positive rates (up to 14.64% in some FTP experiments) that ultimately depress F2 scores.
Classical machine learning algorithms, despite benefiting from the relatively simple byte frequency features, demonstrated the most limited performance due to their algorithmic constraints. These methods struggled to achieve a balanced trade-off between detection rate and false-positive rate. For instance, ISOF consistently achieved perfect detection rates (100%) on the FTP and SMTP protocols but suffered from elevated false-positive rates (up to 11.54% in the FTP experiments), while OCSVM maintained lower false-positive rates but at the cost of substantially reduced detection capabilities. This limitation reflects the fundamental challenge these algorithms face in learning complex decision boundaries from high-dimensional feature spaces without the representational capacity of deep neural networks.
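For reference, all three classical detectors can be instantiated in a few lines with scikit-learn. The hyperparameters below (contamination, n_neighbors, nu) are generic illustrative values, not the tuned settings behind the appendix tables.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Stand-in byte-frequency vectors; in our experiments these come from
# (slightly contaminated) benign training captures.
X_train = np.random.rand(1000, 256)
X_test = np.random.rand(100, 256)

detectors = {
    "ISOF": IsolationForest(contamination=0.01, random_state=0),
    "LOF": LocalOutlierFactor(n_neighbors=20, novelty=True),  # novelty=True enables predict()
    "OCSVM": OneClassSVM(nu=0.01, kernel="rbf"),
}
for name, det in detectors.items():
    det.fit(X_train)
    preds = det.predict(X_test)        # +1 = inlier, -1 = outlier
    print(name, (preds == -1).mean())  # fraction flagged as anomalous
```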
In real-world deployment, these findings suggest distinct operational niches for each approach. Autoencoders are well-suited for long-running enterprise NIDSs where imperfect retraining data is inevitable, as their minimal performance degradation addresses the reality that production networks contain some proportion of malicious traffic. LSTM models excel in protocol-aware inspection components such as reverse proxies or web application firewalls where sequence patterns matter. Classical machine learning algorithms remain useful as lightweight edge filters in IoT gateways or SD-WAN appliances where computational efficiency is paramount.

7. Conclusions

This work systematically evaluated the resilience of anomaly-based network intrusion detection models when trained on unclean datasets containing varying levels of malicious traffic. We assessed five approaches—Autoencoders, LSTM, Isolation Forest, Local Outlier Factor, and One-Class Support Vector Machine—across HTTP, FTP, and SMTP protocols to determine which combination of algorithm and feature representation maintains robust performance under realistic training conditions where malicious traffic inadvertently contaminates benign datasets.
Our experimental results demonstrate that Autoencoders using byte frequency features achieved superior performance across all evaluation scenarios, with an average F2 score of 0.8975, representing only a 0.001 decrease from clean training conditions. This minimal performance degradation contrasts sharply with the classical machine learning approaches (0.04 decrease) and the LSTM models (0.01 decrease). LSTM models occupied an intermediate position, effectively capturing attack signatures through sequential byte pattern processing, though their sequential sensitivity makes them vulnerable to training data contamination. Classical machine learning algorithms exhibited the highest sensitivity to noise, with LOF showing the steepest performance degradation across protocols.
These findings provide actionable guidance for practitioners in selecting intrusion detection approaches based on operational requirements. Autoencoders are well-suited for enterprise environments where perfect training data cleanliness cannot be guaranteed, LSTM models excel in protocol-aware inspection scenarios where sequence patterns are critical, and classical methods remain viable for resource-constrained edge deployments where computational efficiency is paramount.
Nevertheless, several limitations warrant acknowledgment. This study focused primarily on detection accuracy and noise resilience without comprehensive assessment of system constraints such as training time, inference speed, and memory consumption—critical factors in cases where computational overhead may offset performance advantages. Future work should incorporate systematic evaluation of these operational constraints, investigate whether LSTM performance can be enhanced through alternative feature representations that reduce noise sensitivity, and validate findings on contemporary datasets.

Author Contributions

Conceptualization, A.O.P., D.J.A., B.A.P.; methodology, A.O.P.; software, A.O.P., A.I.F., K.B.W.; validation, H.S., A.M.S., S.H.O.; formal analysis, A.O.P.; investigation, A.O.P.; resources, B.A.P.; data curation, A.O.P.; writing—original draft preparation, A.O.P., D.J.A., B.A.P.; writing—review and editing, B.A.P., H.S., A.M.S., S.H.O.; visualization, A.O.P.; supervision, B.A.P.; project administration, B.A.P.; funding acquisition, B.A.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Department of Informatics, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia, under funding number 1700/PKS/ITS/2022.

Institutional Review Board Statement

Ethical review and approval were waived for this study as it did not involve humans or animals.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed during the current study are available at: https://research.unsw.edu.au/projects/unsw-nb15-dataset (accessed on 8 September 2024). The code used in this study is publicly available: LSTM: https://github.com/bazz-066/neuralnetwork-AD/tree/master/rnn-ryza, accessed on 8 September 2024; Autoencoders: https://github.com/bazz-066/FP-UG-ITS-2021-noisy-IDS-AE, accessed on 8 September 2024; Classical ML: https://github.com/bazz-066/FP-UG-ITS-2021-noisy-IDS-classic-ML, accessed on 8 September 2024.

Acknowledgments

The authors gratefully acknowledge the financial support received from the Institut Teknologi Sepuluh Nopember for this work, under the project scheme of the Publication Writing and IPR Incentive Program (PPHKI) 2026.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DR: Detection Rate
FPR: False-Positive Rate
FTP: File Transfer Protocol
HTTP: Hypertext Transfer Protocol
ISOF: Isolation Forest
LOF: Local Outlier Factor
LSTM: Long Short-Term Memory
NIDS: Network-based Intrusion Detection System
OCSVM: One-Class Support Vector Machine
SMTP: Simple Mail Transfer Protocol

Appendix A

Table A1. Experimental Results of Classical Machine Learning Algorithm for HTTP Traffic.
Classical Machine Learning – HTTP
Malicious Traffic (%) | Algorithm | Detection Rate | F2 Score | FPR Score
0 | OCSVM | 0.3271 | 0.3324 | 0.0168
0 | LOF | 0.8107 | 0.6480 | 0.0409
0 | ISOF | 0.8200 | 0.5908 | 0.0600
0.1 | OCSVM | 0.3169 | 0.3140 | 0.0202
0.1 | LOF | 0.4892 | 0.4083 | 0.0416
0.1 | ISOF | 0.7724 | 0.5435 | 0.0647
0.2 | OCSVM | 0.3064 | 0.2965 | 0.0242
0.2 | LOF | 0.3404 | 0.2991 | 0.0380
0.2 | ISOF | 0.6576 | 0.4851 | 0.0596
0.3 | OCSVM | 0.3058 | 0.3104 | 0.0182
0.3 | LOF | 0.2666 | 0.2033 | 0.0674
0.3 | ISOF | 0.6177 | 0.4768 | 0.0547
0.4 | OCSVM | 0.2928 | 0.2974 | 0.0177
0.4 | LOF | 0.2436 | 0.2114 | 0.0426
0.4 | ISOF | 0.6359 | 0.4674 | 0.0607
0.5 | OCSVM | 0.2941 | 0.2913 | 0.0205
0.5 | LOF | 0.2163 | 0.1253 | 0.1203
0.5 | ISOF | 0.8427 | 0.5750 | 0.0677
Table A2. Experimental Results of Autoencoder Algorithm for HTTP Traffic.
Malicious Traffic (%) | Parameters | Threshold | Detection Rate | F2 Score | FPR Score
0 | 1 Hidden Layer | Mean | 0.1031 | 0.1230 | 0.0144
0 | 1 Hidden Layer | IQR | 0 | 0 | 0
0 | 1 Hidden Layer | Z Score | 0.9751 | 0.8280 | 0.1475
0 | 3 Hidden Layer | Mean | 0.0758 | 0.0922 | 0.00584
0 | 3 Hidden Layer | IQR | 0.1020 | 0.1231 | 0.0065
0 | 3 Hidden Layer | Z Score | 0.5225 | 0.5608 | 0.0219
0 | 5 Hidden Layer | Mean | 0.0746 | 0.0907 | 0.0058
0 | 5 Hidden Layer | IQR | 0.0924 | 0.1119 | 0.00626
0 | 5 Hidden Layer | Z Score | 0.3616 | 0.4030 | 0.0202
0.1 | 1 Hidden Layer | Mean | 0.1144 | 0.1360 | 0.0147
0.1 | 1 Hidden Layer | IQR | 0.0484 | 0.0594 | 0.00486
0.1 | 1 Hidden Layer | Z Score | 0.9688 | 0.8238 | 0.1471
0.1 | 3 Hidden Layer | Mean | 0.0759 | 0.0929 | 0.00585
0.1 | 3 Hidden Layer | IQR | 0.1024 | 0.1239 | 0.00654
0.1 | 3 Hidden Layer | Z Score | 0.5997 | 0.6139 | 0.0459
0.1 | 5 Hidden Layer | Mean | 0.0753 | 0.0916 | 0.00582
0.1 | 5 Hidden Layer | IQR | 0.0991 | 0.1197 | 0.0064
0.1 | 5 Hidden Layer | Z Score | 0.4735 | 0.5087 | 0.0292
0.2 | 1 Hidden Layer | Mean | 0.1086 | 0.1293 | 0.0149
0.2 | 1 Hidden Layer | IQR | 0.00137 | 0.00171 | 0.00008
0.2 | 1 Hidden Layer | Z Score | 0.9738 | 0.8270 | 0.1476
0.2 | 3 Hidden Layer | Mean | 0.0757 | 0.0920 | 0.00589
0.2 | 3 Hidden Layer | IQR | 0.0956 | 0.1156 | 0.00638
0.2 | 3 Hidden Layer | Z Score | 0.4124 | 0.4175 | 0.0851
0.2 | 5 Hidden Layer | Mean | 0.0759 | 0.0923 | 0.00585
0.2 | 5 Hidden Layer | IQR | 0.0982 | 0.1187 | 0.00638
0.2 | 5 Hidden Layer | Z Score | 0.4654 | 0.5027 | 0.0263
0.3 | 1 Hidden Layer | Mean | 0.1130 | 0.1343 | 0.01501
0.3 | 1 Hidden Layer | IQR | 0.0112 | 0.0140 | 0.00029
0.3 | 1 Hidden Layer | Z Score | 0.9898 | 0.8381 | 0.1477
0.3 | 3 Hidden Layer | Mean | 0.0746 | 0.0907 | 0.0058
0.3 | 3 Hidden Layer | IQR | 0.1005 | 0.1214 | 0.00643
0.3 | 3 Hidden Layer | Z Score | 0.5754 | 0.5991 | 0.0365
0.3 | 5 Hidden Layer | Mean | 0.0750 | 0.0912 | 0.0058
0.3 | 5 Hidden Layer | IQR | 0.1087 | 0.1310 | 0.00663
0.3 | 5 Hidden Layer | Z Score | 0.7469 | 0.7503 | 0.0372
0.4 | 1 Hidden Layer | Mean | 0.1077 | 0.1282 | 0.0148
0.4 | 1 Hidden Layer | IQR | 0 | 0 | 0
0.4 | 1 Hidden Layer | Z Score | 0.8903 | 0.7700 | 0.1439
0.4 | 3 Hidden Layer | Mean | 0.0764 | 0.0929 | 0.00589
0.4 | 3 Hidden Layer | IQR | 0.1027 | 0.1239 | 0.0065
0.4 | 3 Hidden Layer | Z Score | 0.5766 | 0.6086 | 0.0260
0.4 | 5 Hidden Layer | Mean | 0.0761 | 0.0925 | 0.00587
0.4 | 5 Hidden Layer | IQR | 0.0960 | 0.1161 | 0.00634
0.4 | 5 Hidden Layer | Z Score | 0.4329 | 0.4712 | 0.0259
0.5 | 1 Hidden Layer | Mean | 0.1313 | 0.1552 | 0.0156
0.5 | 1 Hidden Layer | IQR | 0.0739 | 0.0888 | 0.0141
0.5 | 1 Hidden Layer | Z Score | 0.9954 | 0.8419 | 0.1479
0.5 | 3 Hidden Layer | Mean | 0.0755 | 0.0912 | 0.00581
0.5 | 3 Hidden Layer | IQR | 0.1116 | 0.1344 | 0.00668
0.5 | 3 Hidden Layer | Z Score | 0.6458 | 0.6484 | 0.0539
0.5 | 5 Hidden Layer | Mean | 0.0745 | 0.0910 | 0.0058
0.5 | 5 Hidden Layer | IQR | 0.0933 | 0.1129 | 0.00633
0.5 | 5 Hidden Layer | Z Score | 0.4040 | 0.4218 | 0.0622
Table A3. Experimental Results of LSTM Algorithm for HTTP Traffic.
Malicious Traffic (%) | Threshold | Detection Rate | F2 Score | FPR Score
0 | b_mean | 0.1021 | 0.1091 | 0.0697
0 | b_iqr | 0.9355 | 0.8117 | 0.1002
0 | b_zscore | 1 | 0.7591 | 0.1923
0 | f_mean | 0.0076 | 0.0084 | 0.0632
0 | f_iqr | 0.9804 | 0.843 | 0.1011
0 | f_zscore | 0.9894 | 0.8249 | 0.1221
0.1 | b_mean | 0.0912 | 0.0985 | 0.0658
0.1 | b_iqr | 0.6660 | 0.6226 | 0.0839
0.1 | b_zscore | 0.9996 | 0.7725 | 0.1808
0.1 | f_mean | 0.0104 | 0.0115 | 0.06
0.1 | f_iqr | 0.9886 | 0.8435 | 0.1071
0.1 | f_zscore | 0.9904 | 0.8415 | 0.11
0.2 | b_mean | 0.0995 | 0.1068 | 0.0681
0.2 | b_iqr | 0.9721 | 0.8399 | 0.0993
0.2 | b_zscore | 0.9983 | 0.7705 | 0.1804
0.2 | f_mean | 0.0104 | 0.0115 | 0.0624
0.2 | f_iqr | 0.7794 | 0.6585 | 0.1388
0.2 | f_zscore | 0.9228 | 0.7439 | 0.1560
0.3 | b_mean | 0.0504 | 0.0554 | 0.0615
0.3 | b_iqr | 0.1036 | 0.1121 | 0.0633
0.3 | b_zscore | 0.9971 | 0.7706 | 0.1806
0.3 | f_mean | 0.0065 | 0.0073 | 0.0592
0.3 | f_iqr | 0.4820 | 0.4554 | 0.0993
0.3 | f_zscore | 0.9881 | 0.7773 | 0.1677
0.4 | b_mean | 0.1021 | 0.1094 | 0.0686
0.4 | b_iqr | 0.9726 | 0.8401 | 0.0993
0.4 | b_zscore | 1 | 0.7859 | 0.1658
0.4 | f_mean | 0.0074 | 0.0082 | 0.0616
0.4 | f_iqr | 0.9871 | 0.8486 | 0.1009
0.4 | f_zscore | 0.9925 | 0.8365 | 0.1144
0.5 | b_mean | 0.1198 | 0.1288 | 0.0656
0.5 | b_iqr | 0.9985 | 0.7827 | 0.1707
0.5 | b_zscore | 0.9996 | 0.7834 | 0.1708
0.5 | f_mean | 0.0073 | 0.0081 | 0.0584
0.5 | f_iqr | 0.9926 | 0.8122 | 0.1383
0.5 | f_zscore | 0.9919 | 0.8069 | 0.1428
Table A4. Experimental Results of Classical Machine Learning Algorithm for FTP Traffic.
Malicious Traffic (%) | Algorithm | Detection Rate | F2 Score | FPR Score
0 | OCSVM | 0.8798 | 0.7560 | 0.0180
0 | LOF | 0.9728 | 0.8663 | 0.0123
0 | ISOF | 1.0000 | 0.5953 | 0.0651
0.05 | OCSVM | 0.6957 | 0.5774 | 0.0248
0.05 | LOF | 0.6689 | 0.6337 | 0.0113
0.05 | ISOF | 1.0000 | 0.4466 | 0.1154
0.1 | OCSVM | 0.6846 | 0.5717 | 0.0246
0.1 | LOF | 0.6980 | 0.6633 | 0.0106
0.1 | ISOF | 1.0000 | 0.5308 | 0.0835
0.15 | OCSVM | 0.6659 | 0.5452 | 0.0269
0.15 | LOF | 0.6904 | 0.6485 | 0.0118
0.15 | ISOF | 1.0000 | 0.4619 | 0.1089
0.2 | OCSVM | 0.6467 | 0.5728 | 0.0189
0.2 | LOF | 0.6467 | 0.6242 | 0.0101
0.2 | ISOF | 1.0000 | 0.5292 | 0.0845
0.25 | OCSVM | 0.6585 | 0.5600 | 0.0229
0.25 | LOF | 0.5580 | 0.5388 | 0.0116
0.25 | ISOF | 1.0000 | 0.4936 | 0.0961
0.3 | OCSVM | 0.6311 | 0.5420 | 0.0224
0.3 | LOF | 0.5956 | 0.5761 | 0.0108
0.3 | ISOF | 1.0000 | 0.4966 | 0.0953
0.35 | OCSVM | 0.6336 | 0.5222 | 0.0273
0.35 | LOF | 0.5982 | 0.5823 | 0.0103
0.35 | ISOF | 1.0000 | 0.5168 | 0.0891
0.39 | OCSVM | 0.6414 | 0.5612 | 0.0203
0.39 | LOF | 0.5924 | 0.5681 | 0.0118
0.39 | ISOF | 1.0000 | 0.4911 | 0.0980
Table A5. Experimental Results of Autoencoder Algorithm for FTP Traffic.
Malicious Traffic (%) | Parameters | Threshold | Detection Rate | F2 Score | FPR Score
0 | 1 Hidden Layer | Mean | 0.9070 | 0.8285 | 0.0425
0 | 1 Hidden Layer | IQR | 0.9907 | 0.8784 | 0.0505
0 | 1 Hidden Layer | Z Score | 1 | 0.8754 | 0.0534
0 | 3 Hidden Layer | Mean | 0.6516 | 0.6162 | 0.0477
0 | 3 Hidden Layer | IQR | 1 | 0.8767 | 0.0528
0 | 3 Hidden Layer | Z Score | 1 | 0.8748 | 0.0537
0 | 5 Hidden Layer | Mean | 0.6516 | 0.6162 | 0.0477
0 | 5 Hidden Layer | IQR | 1 | 0.8767 | 0.0528
0 | 5 Hidden Layer | Z Score | 1 | 0.8748 | 0.0537
0.05 | 1 Hidden Layer | Mean | 0.9013 | 0.8277 | 0.0408
0.05 | 1 Hidden Layer | IQR | 0.9560 | 0.8489 | 0.0506
0.05 | 1 Hidden Layer | Z Score | 1 | 0.8750 | 0.0536
0.05 | 3 Hidden Layer | Mean | 0.6582 | 0.6218 | 0.0477
0.05 | 3 Hidden Layer | IQR | 1 | 0.8773 | 0.0525
0.05 | 3 Hidden Layer | Z Score | 1 | 0.8751 | 0.0536
0.05 | 5 Hidden Layer | Mean | 0.6582 | 0.6218 | 0.0477
0.05 | 5 Hidden Layer | IQR | 1 | 0.8773 | 0.0525
0.05 | 5 Hidden Layer | Z Score | 1 | 0.8751 | 0.0536
0.1 | 1 Hidden Layer | Mean | 0.9095 | 0.8323 | 0.0415
0.1 | 1 Hidden Layer | IQR | 0.9594 | 0.8522 | 0.0501
0.1 | 1 Hidden Layer | Z Score | 1 | 0.8741 | 0.0539
0.1 | 3 Hidden Layer | Mean | 0.6550 | 0.6190 | 0.0477
0.1 | 3 Hidden Layer | IQR | 0.9994 | 0.8760 | 0.0529
0.1 | 3 Hidden Layer | Z Score | 1 | 0.8745 | 0.0539
0.1 | 5 Hidden Layer | Mean | 0.6550 | 0.6190 | 0.0477
0.1 | 5 Hidden Layer | IQR | 0.9994 | 0.8760 | 0.0529
0.1 | 5 Hidden Layer | Z Score | 1 | 0.8745 | 0.0539
0.15 | 1 Hidden Layer | Mean | 0.8994 | 0.8259 | 0.0409
0.15 | 1 Hidden Layer | IQR | 0.9564 | 0.8500 | 0.0502
0.15 | 1 Hidden Layer | Z Score | 1 | 0.8756 | 0.0533
0.15 | 3 Hidden Layer | Mean | 0.6204 | 0.5910 | 0.0472
0.15 | 3 Hidden Layer | IQR | 0.9917 | 0.8709 | 0.0528
0.15 | 3 Hidden Layer | Z Score | 1 | 0.8747 | 0.0538
0.15 | 5 Hidden Layer | Mean | 0.6204 | 0.5910 | 0.0472
0.15 | 5 Hidden Layer | IQR | 0.9917 | 0.8709 | 0.0528
0.15 | 5 Hidden Layer | Z Score | 1 | 0.8747 | 0.0538
0.2 | 1 Hidden Layer | Mean | 0.7410 | 0.6975 | 0.0428
0.2 | 1 Hidden Layer | IQR | 0.9905 | 0.8750 | 0.0502
0.2 | 1 Hidden Layer | Z Score | 1 | 0.8749 | 0.0537
0.2 | 3 Hidden Layer | Mean | 0.6546 | 0.6189 | 0.0476
0.2 | 3 Hidden Layer | IQR | 1 | 0.8767 | 0.0527
0.2 | 3 Hidden Layer | Z Score | 1 | 0.8748 | 0.0537
0.2 | 5 Hidden Layer | Mean | 0.6546 | 0.6189 | 0.0476
0.2 | 5 Hidden Layer | IQR | 1 | 0.8767 | 0.0527
0.2 | 5 Hidden Layer | Z Score | 1 | 0.8748 | 0.0537
0.25 | 1 Hidden Layer | Mean | 0.9103 | 0.8308 | 0.0426
0.25 | 1 Hidden Layer | IQR | 0.9920 | 0.8766 | 0.0500
0.25 | 1 Hidden Layer | Z Score | 1 | 0.8761 | 0.0531
0.25 | 3 Hidden Layer | Mean | 0.6784 | 0.6379 | 0.0480
0.25 | 3 Hidden Layer | IQR | 1 | 0.8759 | 0.0531
0.25 | 3 Hidden Layer | Z Score | 1 | 0.8738 | 0.0542
0.25 | 5 Hidden Layer | Mean | 0.6784 | 0.6379 | 0.0480
0.25 | 5 Hidden Layer | IQR | 1 | 0.8759 | 0.0531
0.25 | 5 Hidden Layer | Z Score | 1 | 0.8738 | 0.0542
0.3 | 1 Hidden Layer | Mean | 0.8966 | 0.8218 | 0.0420
0.3 | 1 Hidden Layer | IQR | 0.9884 | 0.8724 | 0.0507
0.3 | 1 Hidden Layer | Z Score | 1 | 0.8735 | 0.0543
0.3 | 3 Hidden Layer | Mean | 0.6340 | 0.6020 | 0.0473
0.3 | 3 Hidden Layer | IQR | 1 | 0.8763 | 0.0530
0.3 | 3 Hidden Layer | Z Score | 1 | 0.8742 | 0.0540
0.3 | 5 Hidden Layer | Mean | 0.6340 | 0.6020 | 0.0473
0.3 | 5 Hidden Layer | IQR | 1 | 0.8763 | 0.0530
0.3 | 5 Hidden Layer | Z Score | 1 | 0.8742 | 0.0540
0.35 | 1 Hidden Layer | Mean | 0.9070 | 0.8298 | 0.0419
0.35 | 1 Hidden Layer | IQR | 0.9924 | 0.8737 | 0.0516
0.35 | 1 Hidden Layer | Z Score | 1 | 0.8744 | 0.0539
0.35 | 3 Hidden Layer | Mean | 0.6424 | 0.6090 | 0.0474
0.35 | 3 Hidden Layer | IQR | 0.9990 | 0.8764 | 0.0526
0.35 | 3 Hidden Layer | Z Score | 1 | 0.8751 | 0.0536
0.35 | 5 Hidden Layer | Mean | 0.6424 | 0.6090 | 0.0474
0.35 | 5 Hidden Layer | IQR | 0.9990 | 0.8764 | 0.0526
0.35 | 5 Hidden Layer | Z Score | 1 | 0.8751 | 0.0536
0.39 | 1 Hidden Layer | Mean | 0.9126 | 0.8354 | 0.0412
0.39 | 1 Hidden Layer | IQR | 0.9918 | 0.8745 | 0.0509
0.39 | 1 Hidden Layer | Z Score | 1 | 0.8760 | 0.0530
0.39 | 3 Hidden Layer | Mean | 0.6349 | 0.6026 | 0.0475
0.39 | 3 Hidden Layer | IQR | 1 | 0.8767 | 0.0528
0.39 | 3 Hidden Layer | Z Score | 1 | 0.8745 | 0.0538
0.39 | 5 Hidden Layer | Mean | 0.6349 | 0.6026 | 0.0475
0.39 | 5 Hidden Layer | IQR | 1 | 0.8767 | 0.0528
0.39 | 5 Hidden Layer | Z Score | 1 | 0.8745 | 0.0538
Table A6. Experimental Results of LSTM Algorithm for FTP Traffic.
Malicious Traffic (%) | Threshold Type | Detection Rate | F2 Score | FPR Score
0 | b_mean | 0.9854 | 0.8222 | 0.0737
0 | b_iqr | 1 | 0.7143 | 0.1464
0 | b_zscore | 1 | 0.7143 | 0.1464
0 | f_mean | 0.9203 | 0.8164 | 0.0524
0 | f_iqr | 1 | 0.7143 | 0.1464
0 | f_zscore | 1 | 0.7143 | 0.1464
0.05 | b_mean | 0.9961 | 0.8631 | 0.0565
0.05 | b_iqr | 1 | 0.7141 | 0.1462
0.05 | b_zscore | 1 | 0.7141 | 0.1462
0.05 | f_mean | 0.9645 | 0.8545 | 0.0496
0.05 | f_iqr | 1 | 0.7141 | 0.1462
0.05 | f_zscore | 1 | 0.7141 | 0.1462
0.1 | b_mean | 0.8960 | 0.8083 | 0.0475
0.1 | b_iqr | 1 | 0.7251 | 0.1393
0.1 | b_zscore | 1 | 0.7251 | 0.1393
0.1 | f_mean | 0.8571 | 0.7819 | 0.0458
0.1 | f_iqr | 1 | 0.7251 | 0.1393
0.1 | f_zscore | 1 | 0.7251 | 0.1393
0.15 | b_mean | 0.8898 | 0.7689 | 0.0659
0.15 | b_iqr | 1 | 0.7176 | 0.1447
0.15 | b_zscore | 1 | 0.7176 | 0.1447
0.15 | f_mean | 0.8288 | 0.7723 | 0.0395
0.15 | f_iqr | 1 | 0.7176 | 0.1447
0.15 | f_zscore | 1 | 0.7176 | 0.1447
0.2 | b_mean | 0.8677 | 0.7853 | 0.0479
0.2 | b_iqr | 1 | 0.7141 | 0.1461
0.2 | b_zscore | 1 | 0.7141 | 0.1461
0.2 | f_mean | 0.8350 | 0.7775 | 0.0390
0.2 | f_iqr | 1 | 0.7141 | 0.1461
0.2 | f_zscore | 1 | 0.7141 | 0.1461
0.25 | b_mean | 0.8519 | 0.7673 | 0.0511
0.25 | b_iqr | 1 | 0.7173 | 0.1440
0.25 | b_zscore | 1 | 0.7173 | 0.1440
0.25 | f_mean | 0.8041 | 0.7536 | 0.0388
0.25 | f_iqr | 1 | 0.7173 | 0.1440
0.25 | f_zscore | 1 | 0.7173 | 0.1440
0.3 | b_mean | 0.8941 | 0.8041 | 0.0486
0.3 | b_iqr | 1 | 0.7134 | 0.1467
0.3 | b_zscore | 1 | 0.7134 | 0.1467
0.3 | f_mean | 0.7748 | 0.7358 | 0.0358
0.3 | f_iqr | 1 | 0.7134 | 0.1467
0.3 | f_zscore | 1 | 0.7134 | 0.1467
0.35 | b_mean | 0.8330 | 0.7740 | 0.0403
0.35 | b_iqr | 1 | 0.7203 | 0.1428
0.35 | b_zscore | 1 | 0.7203 | 0.1428
0.35 | f_mean | 0.7472 | 0.7133 | 0.0361
0.35 | f_iqr | 1 | 0.7203 | 0.1428
0.35 | f_zscore | 1 | 0.7203 | 0.1428
0.39 | b_mean | 0.8754 | 0.7975 | 0.0445
0.39 | b_iqr | 1 | 0.7139 | 0.1455
0.39 | b_zscore | 1 | 0.7139 | 0.1455
0.39 | f_mean | 0.8533 | 0.7826 | 0.0434
0.39 | f_iqr | 1 | 0.7139 | 0.1455
0.39 | f_zscore | 1 | 0.7139 | 0.1455
Table A7. Experimental Results of Classical Machine Learning Algorithm for SMTP Traffic.
Malicious Traffic (%) | Algorithm | Detection Rate | F2 Score | FPR Score
0 | OCSVM | 0.9972 | 0.1840 | 0.3816
0 | LOF | 0.9986 | 0.7706 | 0.0256
0 | ISOF | 1.0000 | 0.8716 | 0.0127
0.1 | OCSVM | 0.9972 | 0.1823 | 0.3826
0.1 | LOF | 0.3792 | 0.3230 | 0.0255
0.1 | ISOF | 1.0000 | 0.8529 | 0.0148
0.2 | OCSVM | 0.9973 | 0.1888 | 0.3757
0.2 | LOF | 0.2551 | 0.2243 | 0.0251
0.2 | ISOF | 1.0000 | 0.8737 | 0.0127
0.3 | OCSVM | 0.9958 | 0.1876 | 0.3683
0.3 | LOF | 0.2363 | 0.2064 | 0.0254
0.3 | ISOF | 1.0000 | 0.8765 | 0.0120
0.4 | OCSVM | 0.9958 | 0.1882 | 0.3715
0.4 | LOF | 0.2111 | 0.1858 | 0.0255
0.4 | ISOF | 1.0000 | 0.8694 | 0.0130
0.5 | OCSVM | 0.9861 | 0.1901 | 0.3634
0.5 | LOF | 0.2275 | 0.1998 | 0.0254
0.5 | ISOF | 1.0000 | 0.8925 | 0.0104
0.6 | OCSVM | 0.9151 | 0.1856 | 0.3466
0.6 | LOF | 0.2370 | 0.2085 | 0.0254
0.6 | ISOF | 1.0000 | 0.8950 | 0.0103
0.7 | OCSVM | 0.8699 | 0.1739 | 0.3464
0.7 | LOF | 0.2392 | 0.2090 | 0.0255
0.7 | ISOF | 1.0000 | 0.8756 | 0.0122
Table A8. Experimental Results of Autoencoder Algorithm for SMTP Traffic.
Malicious Traffic (%) | Parameters | Threshold | Detection Rate | F2 Score | FPR Score
0 | 1 Hidden Layer | Mean | 0.9990 | 0.9847 | 0.0077
0 | 1 Hidden Layer | IQR | 0.9993 | 0.9849 | 0.0077
0 | 1 Hidden Layer | Z Score | 1 | 0.9853 | 0.0078
0 | 3 Hidden Layer | Mean | 0.9962 | 0.9873 | 0.0051
0 | 3 Hidden Layer | IQR | 0.9964 | 0.9829 | 0.0076
0 | 3 Hidden Layer | Z Score | 0.9998 | 0.9852 | 0.0077
0 | 5 Hidden Layer | Mean | 0.9922 | 0.9845 | 0.0049
0 | 5 Hidden Layer | IQR | 0.9952 | 0.9869 | 0.0049
0 | 5 Hidden Layer | Z Score | 0.9998 | 0.9862 | 0.0072
0.1 | 1 Hidden Layer | Mean | 0.9990 | 0.9846 | 0.0078
0.1 | 1 Hidden Layer | IQR | 0.9993 | 0.9846 | 0.0078
0.1 | 1 Hidden Layer | Z Score | 1 | 0.9853 | 0.0078
0.1 | 3 Hidden Layer | Mean | 0.9234 | 0.9301 | 0.0043
0.1 | 3 Hidden Layer | IQR | 0.9979 | 0.9835 | 0.0079
0.1 | 3 Hidden Layer | Z Score | 0.9998 | 0.9849 | 0.0079
0.1 | 5 Hidden Layer | Mean | 0.6826 | 0.7244 | 0.0030
0.1 | 5 Hidden Layer | IQR | 0.9926 | 0.9850 | 0.0048
0.1 | 5 Hidden Layer | Z Score | 1 | 0.9865 | 0.0072
0.2 | 1 Hidden Layer | Mean | 0.9990 | 0.9846 | 0.0078
0.2 | 1 Hidden Layer | IQR | 0.9993 | 0.9848 | 0.0078
0.2 | 1 Hidden Layer | Z Score | 1 | 0.9853 | 0.0078
0.2 | 3 Hidden Layer | Mean | 0.7862 | 0.8156 | 0.0035
0.2 | 3 Hidden Layer | IQR | 0.9979 | 0.9839 | 0.0077
0.2 | 3 Hidden Layer | Z Score | 0.9998 | 0.9852 | 0.0078
0.2 | 5 Hidden Layer | Mean | 0.5811 | 0.6310 | 0.0025
0.2 | 5 Hidden Layer | IQR | 0.9960 | 0.9876 | 0.0049
0.2 | 5 Hidden Layer | Z Score | 0.9998 | 0.9852 | 0.0077
0.3 | 1 Hidden Layer | Mean | 0.9990 | 0.9845 | 0.0078
0.3 | 1 Hidden Layer | IQR | 0.9993 | 0.9847 | 0.0078
0.3 | 1 Hidden Layer | Z Score | 1 | 0.9852 | 0.0079
0.3 | 3 Hidden Layer | Mean | 0.5426 | 0.5943 | 0.0023
0.3 | 3 Hidden Layer | IQR | 0.9964 | 0.9865 | 0.0057
0.3 | 3 Hidden Layer | Z Score | 0.9988 | 0.9845 | 0.0077
0.3 | 5 Hidden Layer | Mean | 0.5229 | 0.5753 | 0.0023
0.3 | 5 Hidden Layer | IQR | 0.9962 | 0.9876 | 0.0049
0.3 | 5 Hidden Layer | Z Score | 0.9998 | 0.9851 | 0.0078
0.4 | 1 Hidden Layer | Mean | 0.9990 | 0.9846 | 0.0078
0.4 | 1 Hidden Layer | IQR | 0.9995 | 0.9850 | 0.0078
0.4 | 1 Hidden Layer | Z Score | 1 | 0.9853 | 0.0078
0.4 | 3 Hidden Layer | Mean | 0.5207 | 0.5732 | 0.0023
0.4 | 3 Hidden Layer | IQR | 0.9964 | 0.9836 | 0.0072
0.4 | 3 Hidden Layer | Z Score | 0.9993 | 0.9847 | 0.0078
0.4 | 5 Hidden Layer | Mean | 0.4736 | 0.5269 | 0.0021
0.4 | 5 Hidden Layer | IQR | 0.9926 | 0.9849 | 0.0049
0.4 | 5 Hidden Layer | Z Score | 1 | 0.9869 | 0.0070
0.5 | 1 Hidden Layer | Mean | 0.9990 | 0.9846 | 0.0078
0.5 | 1 Hidden Layer | IQR | 0.9993 | 0.9848 | 0.0078
0.5 | 1 Hidden Layer | Z Score | 1 | 0.9735 | 0.0142
0.5 | 3 Hidden Layer | Mean | 0.5024 | 0.5353 | 0.0023
0.5 | 3 Hidden Layer | IQR | 0.9964 | 0.9828 | 0.0077
0.5 | 3 Hidden Layer | Z Score | 1 | 0.9853 | 0.0078
0.5 | 5 Hidden Layer | Mean | 0.4648 | 0.5183 | 0.0021
0.5 | 5 Hidden Layer | IQR | 0.9960 | 0.9875 | 0.0049
0.5 | 5 Hidden Layer | Z Score | 1 | 0.9853 | 0.0078
0.6 | 1 Hidden Layer | Mean | 0.9990 | 0.9848 | 0.0077
0.6 | 1 Hidden Layer | IQR | 0.9993 | 0.9850 | 0.0077
0.6 | 1 Hidden Layer | Z Score | 1 | 0.9737 | 0.0141
0.6 | 3 Hidden Layer | Mean | 0.4850 | 0.5383 | 0.0021
0.6 | 3 Hidden Layer | IQR | 0.9967 | 0.9828 | 0.0077
0.6 | 3 Hidden Layer | Z Score | 0.9998 | 0.9850 | 0.0079
0.6 | 5 Hidden Layer | Mean | 0.4495 | 0.5029 | 0.0020
0.6 | 5 Hidden Layer | IQR | 0.9962 | 0.9877 | 0.0049
0.6 | 5 Hidden Layer | Z Score | 0.9998 | 0.9737 | 14.0000
0.7 | 1 Hidden Layer | Mean | 0.9990 | 0.9845 | 0.0078
0.7 | 1 Hidden Layer | IQR | 0.9995 | 0.9849 | 0.0078
0.7 | 1 Hidden Layer | Z Score | 1 | 0.9852 | 0.0079
0.7 | 3 Hidden Layer | Mean | 0.4497 | 0.5031 | 0.0020
0.7 | 3 Hidden Layer | IQR | 0.9964 | 0.9844 | 0.0067
0.7 | 3 Hidden Layer | Z Score | 0.9990 | 0.9844 | 0.0079
0.7 | 5 Hidden Layer | Mean | 0.4304 | 0.4838 | 0.0019
0.7 | 5 Hidden Layer | IQR | 0.9848 | 0.9788 | 0.0048
0.7 | 5 Hidden Layer | Z Score | 0.9998 | 0.9870 | 0.0068
Table A9. Experimental Results of LSTM Algorithm for SMTP Traffic.
Malicious Traffic (%) | Threshold | Detection Rate | F2 Score | FPR Score
0 | b_mean | 0.9960 | 0.9634 | 0.0172
0 | b_iqr | 0.9985 | 0.9441 | 0.0288
0 | b_zscore | 0.9985 | 0.9441 | 0.0288
0 | f_mean | 0.9262 | 0.9084 | 0.0171
0 | f_iqr | 0.9597 | 0.9243 | 0.0230
0 | f_zscore | 0.9433 | 0.9202 | 0.0181
0.1 | b_mean | 0.8853 | 0.8774 | 0.0160
0.1 | b_iqr | 0.9862 | 0.9372 | 0.0276
0.1 | b_zscore | 0.9813 | 0.9335 | 0.0275
0.1 | f_mean | 0.8405 | 0.8426 | 0.0148
0.1 | f_iqr | 0.9764 | 0.9397 | 0.0219
0.1 | f_zscore | 0.9713 | 0.9361 | 0.0217
0.2 | b_mean | 0.8833 | 0.8747 | 0.0165
0.2 | b_iqr | 0.9855 | 0.9356 | 0.0280
0.2 | b_zscore | 0.9803 | 0.9316 | 0.0280
0.2 | f_mean | 0.8069 | 0.8131 | 0.0155
0.2 | f_iqr | 0.9433 | 0.9131 | 0.0222
0.2 | f_zscore | 0.9222 | 0.8973 | 0.0216
0.3 | b_mean | 0.7698 | 0.7817 | 0.0154
0.3 | b_iqr | 0.9724 | 0.8999 | 0.0429
0.3 | b_zscore | 0.9675 | 0.9230 | 0.0273
0.3 | f_mean | 0.7525 | 0.7696 | 0.0136
0.3 | f_iqr | 0.9317 | 0.9064 | 0.0207
0.3 | f_zscore | 0.9049 | 0.8934 | 0.0159
0.4 | b_mean | 0.9531 | 0.9301 | 0.0170
0.4 | b_iqr | 0.9933 | 0.9202 | 0.0403
0.4 | b_zscore | 0.9931 | 0.9413 | 0.0281
0.4 | f_mean | 0.6303 | 0.6589 | 0.0152
0.4 | f_iqr | 0.9189 | 0.8977 | 0.0198
0.4 | f_zscore | 0.8935 | 0.8817 | 0.0173
0.5 | b_mean | 0.7668 | 0.7776 | 0.0163
0.5 | b_iqr | 0.9716 | 0.9045 | 0.0398
0.5 | b_zscore | 0.9617 | 0.9172 | 0.0280
0.5 | f_mean | 0.5782 | 0.6157 | 0.0117
0.5 | f_iqr | 0.9042 | 0.8814 | 0.0224
0.5 | f_zscore | 0.8677 | 0.8604 | 0.0174
0.6 | b_mean | 0.7262 | 0.7429 | 0.0161
0.6 | b_iqr | 0.9704 | 0.9085 | 0.0369
0.6 | b_zscore | 0.9563 | 0.9068 | 0.0315
0.6 | f_mean | 0.6348 | 0.6642 | 0.0144
0.6 | f_iqr | 0.9385 | 0.8946 | 0.0306
0.6 | f_zscore | 0.9193 | 0.8906 | 0.0241
0.7 | b_mean | 0.7577 | 0.7698 | 0.0164
0.7 | b_iqr | 0.9717 | 0.8996 | 0.0430
0.7 | b_zscore | 0.9663 | 0.9144 | 0.0318
0.7 | f_mean | 0.6280 | 0.6570 | 0.0152
0.7 | f_iqr | 0.8870 | 0.8632 | 0.0251
0.7 | f_zscore | 0.8676 | 0.8486 | 0.0244
Table A10. Experimental Results of Classical Machine Learning Algorithm for All Traffic.
Malicious Traffic (%) | Algorithm | Detection Rate | F2 Score | FPR Score
0 | OCSVM | 0.7347 | 0.4241 | 0.1388
0 | LOF | 0.9274 | 0.7617 | 0.0262
0 | ISOF | 0.9400 | 0.6859 | 0.0459
0.1 | OCSVM | 0.6662 | 0.3560 | 0.1425
0.1 | LOF | 0.5221 | 0.4649 | 0.0259
0.1 | ISOF | 0.9241 | 0.6424 | 0.0543
0.2 | OCSVM | 0.6501 | 0.3527 | 0.1396
0.2 | LOF | 0.4141 | 0.3825 | 0.0244
0.2 | ISOF | 0.8859 | 0.6293 | 0.0523
0.3 | OCSVM | 0.6442 | 0.3467 | 0.1363
0.3 | LOF | 0.3662 | 0.3286 | 0.0345
0.3 | ISOF | 0.8726 | 0.6166 | 0.0540
Table A11. Experimental Results of Autoencoder Algorithm for All Traffic.
Malicious Traffic (%) | Hidden Layers | Threshold | Detection Rate | F2 Score | FPR Score
0 | 1 | Mean | 0.6697 | 0.6454 | 0.0215
0 | 1 | IQR | 0.6633 | 0.6211 | 0.0194
0 | 1 | Z Score | 0.9917 | 0.8962 | 0.0696
0 | 3 | Mean | 0.5745 | 0.5652 | 0.0195
0 | 3 | IQR | 0.6995 | 0.6609 | 0.0223
0 | 3 | Z Score | 0.8408 | 0.8069 | 0.0278
0 | 5 | Mean | 0.5728 | 0.5638 | 0.0195
0 | 5 | IQR | 0.6959 | 0.6585 | 0.0213
0 | 5 | Z Score | 0.7871 | 0.7547 | 0.0270
0.1 | 1 | Mean | 0.6743 | 0.6510 | 0.0213
0.1 | 1 | IQR | 0.6690 | 0.6321 | 0.0209
0.1 | 1 | Z Score | 0.9896 | 0.8944 | 0.0696
0.1 | 3 | Mean | 0.5514 | 0.5473 | 0.0193
0.1 | 3 | IQR | 0.6999 | 0.6611 | 0.0224
0.1 | 3 | Z Score | 0.8665 | 0.8244 | 0.0359
0.1 | 5 | Mean | 0.4710 | 0.4783 | 0.0189
0.1 | 5 | IQR | 0.6970 | 0.6602 | 0.0214
0.1 | 5 | Z Score | 0.8245 | 0.7899 | 0.0301
0.2 | 1 | Mean | 0.6162 | 0.6038 | 0.0218
0.2 | 1 | IQR | 0.6637 | 0.6205 | 0.0194
0.2 | 1 | Z Score | 0.9913 | 0.8957 | 0.0697
0.2 | 3 | Mean | 0.5055 | 0.5088 | 0.0190
0.2 | 3 | IQR | 0.6978 | 0.6587 | 0.0223
0.2 | 3 | Z Score | 0.8041 | 0.7592 | 0.0489
0.2 | 5 | Mean | 0.4372 | 0.4474 | 0.0186
0.2 | 5 | IQR | 0.6981 | 0.6610 | 0.0213
0.2 | 5 | Z Score | 0.8217 | 0.7876 | 0.0292
0.3 | 1 | Mean | 0.6695 | 0.6469 | 0.0216
0.3 | 1 | IQR | 0.6663 | 0.6237 | 0.0196
0.3 | 1 | Z Score | 0.9966 | 0.8989 | 0.0700
0.3 | 3 | Mean | 0.4171 | 0.4290 | 0.0185
0.3 | 3 | IQR | 0.6990 | 0.6614 | 0.0217
0.3 | 3 | Z Score | 0.8581 | 0.8193 | 0.0327
0.3 | 5 | Mean | 0.4106 | 0.4228 | 0.0185
0.3 | 5 | IQR | 0.7016 | 0.6650 | 0.0215
0.3 | 5 | Z Score | 0.9156 | 0.8699 | 0.0330
Table A12. Experimental Results of LSTM Algorithm for All Traffic.
Malicious Traffic (%) | Threshold | Detection Rate | F2 Score | FPR Score
0 | b_mean | 0.6945 | 0.6316 | 0.0536
0 | b_iqr | 0.978 | 0.8234 | 0.0918
0 | b_zscore | 0.9995 | 0.8058 | 0.1225
0 | f_mean | 0.618 | 0.5777 | 0.0442
0 | f_iqr | 0.98 | 0.8272 | 0.0902
0 | f_zscore | 0.9776 | 0.8198 | 0.0955
0.1 | b_mean | 0.6241 | 0.5947 | 0.0431
0.1 | b_iqr | 0.8841 | 0.7617 | 0.0836
0.1 | b_zscore | 0.9937 | 0.8104 | 0.1159
0.1 | f_mean | 0.5694 | 0.5453 | 0.0402
0.1 | f_iqr | 0.9883 | 0.8361 | 0.0894
0.1 | f_zscore | 0.9872 | 0.8342 | 0.0903
0.2 | b_mean | 0.6168 | 0.5889 | 0.0442
0.2 | b_iqr | 0.9858 | 0.8299 | 0.0911
0.2 | b_zscore | 0.9929 | 0.8054 | 0.1182
0.2 | f_mean | 0.5508 | 0.5340 | 0.0390
0.2 | f_iqr | 0.9076 | 0.7619 | 0.1024
0.2 | f_zscore | 0.9483 | 0.7851 | 0.1079
0.3 | b_mean | 0.5714 | 0.5471 | 0.0418
0.3 | b_iqr | 0.6920 | 0.5752 | 0.0843
0.3 | b_zscore | 0.9882 | 0.8023 | 0.1182
0.3 | f_mean | 0.5113 | 0.5042 | 0.0362
0.3 | f_iqr | 0.8046 | 0.6918 | 0.0889
0.3 | f_zscore | 0.9643 | 0.7947 | 0.1101

References

  1. Algaolahi, A.; Aljoby, W.; Ghaleb, M.; Harras, K.A. Detecting and Identifying the Targets of Covert DDoS Attacks. In Proceedings of the 2024 IEEE 21st International Conference on Smart Communities: Improving Quality of Life Using AI, Robotics and IoT (HONET), Doha, Qatar, 3–5 December 2024; pp. 143–148. [Google Scholar] [CrossRef]
  2. Ahmed, I.; Lhee, K.S. Classification of packet contents for malware detection. J. Comput. Virol. 2011, 7, 279–295. [Google Scholar] [CrossRef]
  3. Fauzi, N.; Yulianto, F.A.; Nuha, H.H. The Effectiveness of Anomaly-Based Intrusion Detection Systems in Handling Zero-Day Attacks Using AdaBoost, J48, and Random Forest Methods. In Proceedings of the 2023 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob), Bali, Indonesia, 10–12 October 2023; pp. 57–62. [Google Scholar] [CrossRef]
  4. Faizal, M.A.; Zaki, M.M.; Shahrin, S.; Robiah, Y.; Rahayu, S.S.; Nazrulazhar, B. Threshold Verification Technique for Network Intrusion Detection System. arXiv 2009, arXiv:0906.3843. [Google Scholar] [CrossRef]
  5. David Akande, T.; Kaur, B.; Dadkhah, S.; Ghorbani, A.A. Threshold based Technique to Detect Anomalies using Log Files. In Proceedings of the 2022 7th International Conference on Machine Learning Technologies, New York, NY, USA, 11–13 March 2022; ICMLT ’22. pp. 191–198. [Google Scholar] [CrossRef]
  6. Almuhanna, R.; Dardouri, S. A deep learning/machine learning approach for anomaly based network intrusion detection. Front. Artif. Intell. 2025, 8, 1625891. [Google Scholar] [CrossRef] [PubMed]
  7. Auskalnis, J.; Paulauskas, N.; Baskys, A. Application of local outlier factor algorithm to detect anomalies in computer network. Elektron. Ir Elektrotechnika 2018, 24, 96–99. [Google Scholar] [CrossRef]
  8. Ripan, R.C.; Sarker, I.H.; Anwar, M.M.; Furhad, M.H.; Rahat, F.; Hoque, M.M.; Sarfraz, M. An isolation forest learning based outlier detection approach for effectively classifying cyber anomalies. In Hybrid Intelligent Systems: 20th International Conference on Hybrid Intelligent Systems (HIS 2020), December 14–16, 2020; Springer: Cham, Switzerland, 2021; pp. 270–279. [Google Scholar]
  9. Zhang, M.; Xu, B.; Gong, J. An anomaly detection model based on one-class svm to detect network intrusions. In Proceedings of the 2015 11th International Conference on Mobile Ad-Hoc and Sensor Networks (MSN), Shenzhen, China, 16–18 December 2015; pp. 102–107. [Google Scholar]
  10. Nguimbous, Y.N.; Ksantini, R.; Bouhoula, A. Anomaly-based intrusion detection using auto-encoder. In Proceedings of the 2019 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), Split, Croatia, 19–21 September 2019; pp. 1–5. [Google Scholar]
  11. Imrana, Y.; Xiang, Y.; Ali, L.; Abdul-Rauf, Z. A bidirectional LSTM deep learning approach for intrusion detection. Expert Syst. Appl. 2021, 185, 115524. [Google Scholar] [CrossRef]
  12. Anup, B.; Kaur, M.M. Anomaly Detection in Network Traffic: A Statistical Approach; LAP Lambert Academic Publishing: Saarbrücken, Germany, 2012. [Google Scholar]
  13. Bhuyan, M.H.; Bhattacharyya, D.K.; Kalita, J.K. Network Traffic Anomaly Detection Techniques and Systems. In Network Traffic Anomaly Detection and Prevention: Concepts, Techniques, and Tools; Springer International Publishing: Cham, Switzerland, 2017; pp. 115–169. [Google Scholar] [CrossRef]
  14. Zhao, X.; Wu, Q. Subspace-Based Anomaly Detection for Large-Scale Campus Network Traffic. J. Appl. Math. 2023, 2023, 8489644. [Google Scholar] [CrossRef]
  15. Modi, C.; Patel, D.; Borisaniya, B.; Patel, H.; Patel, A.; Rajarajan, M. A survey of intrusion detection techniques in cloud. J. Netw. Comput. Appl. 2013, 36, 42–57. [Google Scholar] [CrossRef]
  16. Othman, S.M.; Alsohybe, N.T.; Ba-Alwi, F.M.; Zahary, A.T. Survey on intrusion detection system types. Int. J. Cyber-Secur. Digit. Forensics 2018, 7, 444–463. [Google Scholar]
  17. Liu, H.; Lang, B. Machine learning and deep learning methods for intrusion detection systems: A survey. Appl. Sci. 2019, 9, 4396. [Google Scholar] [CrossRef]
  18. Otoum, S.; Kantarci, B.; Mouftah, H. A comparative study of ai-based intrusion detection techniques in critical infrastructures. ACM Trans. Internet Technol. (TOIT) 2021, 21, 1–22. [Google Scholar] [CrossRef]
  19. Jain, M.; Kaur, G.; Saxena, V. A K-Means clustering and SVM based hybrid concept drift detection technique for network anomaly detection. Expert Syst. Appl. 2022, 193, 116510. [Google Scholar] [CrossRef]
  20. Zavrak, S.; İskefiyeli, M. Anomaly-based intrusion detection from network flow features using variational autoencoder. IEEE Access 2020, 8, 108346–108358. [Google Scholar] [CrossRef]
  21. Sadaf, K.; Sultana, J. Intrusion detection based on autoencoder and isolation forest in fog computing. IEEE Access 2020, 8, 167059–167068. [Google Scholar] [CrossRef]
  22. Aljbali, S.; Roy, K. Anomaly Detection Using Bidirectional LSTM. In Intelligent Systems and Applications; Advances in Intelligent Systems and Computing; Springer: Cham, Switzerland, 2021; Volume 1250. [Google Scholar] [CrossRef]
  23. Abdallah, M.; An Le Khac, N.; Jahromi, H.; Delia Jurcut, A. A hybrid CNN-LSTM based approach for anomaly detection systems in SDNs. In Proceedings of the 16th International Conference on Availability, Reliability and Security, Vienna, Austria, 17–20 August 2021; pp. 1–7. [Google Scholar]
  24. Farahnakian, F.; Heikkonen, J. A deep auto-encoder based approach for intrusion detection system. In Proceedings of the 2018 20th International Conference on Advanced Communication Technology (ICACT), Chuncheon, Republic of Korea, 11–14 February 2018; pp. 178–183. [Google Scholar]
  25. Paulauskas, N.; Bagdonas, A.F. Local outlier factor use for the network flow anomaly detection. Secur. Commun. Netw. 2015, 8, 4203–4212. [Google Scholar] [CrossRef]
  26. Nguyen, Q.T.; Tran, K.P.; Castagliola, P.; Huong, T.T.; Nguyen, M.K.; Lardjane, S. Nested one-class support vector machines for network intrusion detection. In Proceedings of the 2018 IEEE Seventh International Conference on Communications and Electronics (ICCE), Hue, Vietnam, 18–20 July 2018; pp. 7–12. [Google Scholar]
  27. Abolhasanzadeh, B. Nonlinear dimensionality reduction for intrusion detection using auto-encoder bottleneck features. In Proceedings of the 2015 7th Conference on Information and Knowledge Technology (IKT), Urmia, Iran, 26–28 May 2015; pp. 1–5. [Google Scholar]
  28. Zhang, B.; Yu, Y.; Li, J. Network intrusion detection based on stacked sparse autoencoder and binary tree ensemble method. In Proceedings of the 2018 IEEE International Conference on Communications Workshops (ICC Workshops), Kansas City, MO, USA, 20–24 May 2018; pp. 1–6. [Google Scholar]
  29. Chen, Z.; Simsek, M.; Kantarci, B.; Bagheri, M.; Djukic, P. Machine learning-enabled hybrid intrusion detection system with host data transformation and an advanced two-stage classifier. Comput. Netw. 2024, 250, 110576. [Google Scholar] [CrossRef]
  30. Narayana Rao, K.; Venkata Rao, K.; P.V.G.D., P.R. A hybrid Intrusion Detection System based on Sparse autoencoder and Deep Neural Network. Comput. Commun. 2021, 180, 77–88. [Google Scholar] [CrossRef]
  31. Zha, C.; Wang, Z.; Fan, Y.; Bai, B.; Zhang, Y.; Shi, S.; Zhang, R. A-NIDS: Adaptive Network Intrusion Detection System Based on Clustering and Stacked CTGAN. IEEE Trans. Inf. Forensics Secur. 2025, 20, 3204–3219. [Google Scholar] [CrossRef]
  32. Ma, Z.; Liu, L.; Meng, W.; Luo, X.; Wang, L.; Li, W. ADCL: Toward an Adaptive Network Intrusion Detection System Using Collaborative Learning in IoT Networks. IEEE Internet Things J. 2023, 10, 12521–12536. [Google Scholar] [CrossRef]
  33. Winter, P.; Hermann, E.; Zeilinger, M. Inductive intrusion detection in flow-based network data using one-class support vector machines. In Proceedings of the 2011 4th IFIP International Conference on New Technologies, Mobility and Security, Paris, France, 7–10 February 2011; pp. 1–5. [Google Scholar]
  34. Mirsky, Y.; Doitshman, T.; Elovici, Y.; Shabtai, A. Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection. arXiv 2018, arXiv:1802.09089. [Google Scholar] [CrossRef]
  35. Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, Australia, 10–12 November 2015; pp. 1–6. [Google Scholar] [CrossRef]
  36. Pratomo, B. Low-Rate Attack Detection with Intelligent Fine-Grained Network Analysis. Ph.D. Thesis, Cardiff University, Cardiff, UK, 2020. [Google Scholar]
  37. Transmission Control Protocol. RFC 793, 1981. Available online: https://www.rfc-editor.org/info/rfc0793 (accessed on 12 August 2025).
  38. Hao, Y.; Sheng, Y.; Wang, J. Variant Gated Recurrent Units with Encoders to Preprocess Packets for Payload-Aware Intrusion Detection. IEEE Access 2019, 7, 49985–49998. [Google Scholar] [CrossRef]
  39. Saeed, R.; Khaliq Qureshi, H.; Ioannou, C.; Lestas, M. A Proactive Model for Intrusion Detection Using Image Representation of Network Flows. IEEE Access 2024, 12, 160653–160666. [Google Scholar] [CrossRef]
  40. Nie, F.; Liu, W.; Liu, G.; Gao, B.; Huang, J.; Tian, W.; Yuen, C. Empowering Anomaly Detection in IoT Traffic Through Multiview Subspace Learning. IEEE Internet Things J. 2025, 12, 15911–15925. [Google Scholar] [CrossRef]
  41. Brizendine, B.; Kusuma, S.S.; Rimal, B.P. Process Injection Using Return-Oriented Programming. IEEE Access 2025, 13, 133790–133816. [Google Scholar] [CrossRef]
  42. Viviani, L.A.; Ranganathan, P. Evaluating the Suitability of LSTM Models for Edge Computing. In Proceedings of the 2024 Cyber Awareness and Research Symposium (CARS), Grand Forks, ND, USA, 28–29 October 2024; pp. 1–7. [Google Scholar] [CrossRef]
  43. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. (CSUR) 2009, 41, 1–58. [Google Scholar] [CrossRef]
  44. Pratomo, B.A.; Fajar, A.I.; Munif, A.; Ijtihadie, R.M.; Studiawan, H.; Santoso, B.J. Training Autoencoders with Noisy Training Sets for Detecting Low-rate Attacks on the Network. In Proceedings of the 2022 IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom), Malang, Indonesia, 16–18 June 2022; pp. 138–143. [Google Scholar]
  45. Bounsiar, A.; Madden, M.G. One-Class Support Vector Machines Revisited. In Proceedings of the 2014 International Conference on Information Science and Applications (ICISA), Seoul, Republic of Korea, 6–9 May 2014; pp. 1–4. [Google Scholar] [CrossRef]
  46. Breunig, M.; Kröger, P.; Ng, R.; Sander, J. LOF: Identifying Density-Based Local Outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 15–18 May 2000; Volume 29, pp. 93–104. [Google Scholar] [CrossRef]
  47. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar] [CrossRef]
  48. Diez, D.M.; Barr, C.D.; Cetinkaya-Rundel, M. OpenIntro Statistics; OpenIntro: Boston, MA, USA, 2012. [Google Scholar]
  49. Hubert, M.; Vandervieren, E. An adjusted boxplot for skewed distributions. Comput. Stat. Data Anal. 2008, 52, 5186–5201. [Google Scholar] [CrossRef]
  50. Crosby, T. How to Detect and Handle Outliers; Taylor & Francis: Oxfordshire, UK, 1994. [Google Scholar]
  51. Powers, D.M. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061. [Google Scholar] [CrossRef]
  52. Sammut, C.; Webb, G.I. Encyclopedia of Machine Learning; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  53. Shepperd, M.; Bowes, D.; Hall, T. Researcher bias: The use of machine learning in software defect prediction. IEEE Trans. Softw. Eng. 2014, 40, 603–616. [Google Scholar] [CrossRef]
  54. Deng, X.; Liu, Q.; Deng, Y.; Mahadevan, S. An improved method to construct basic probability assignment based on the confusion matrix for classification problem. Inf. Sci. 2016, 340, 250–261. [Google Scholar] [CrossRef]
Figure 1. The overall workflow of the proposed methodology. Differences in color do not convey any additional meaning.
Figure 2. Comparison between byte frequencies and byte sequences.
Figure 3. Deployment assumption for the proposed payload-based NIDSs.
Figure 4. Workflow for the evaluation phase.
Figure 5. Confusion matrix illustration.
Figure 6. F2 scores of the Classical ML, LSTM, and Autoencoder-based results on HTTP traffic.
Figure 7. F2 scores of the Classical ML, LSTM, and Autoencoder-based results on FTP traffic.
Figure 8. F2 scores of the Classical ML, LSTM, and Autoencoder-based results on SMTP traffic.
Figure 9. F2 scores of the Classical ML, LSTM, and Autoencoder-based results on all traffic.
Figure 10. Best F2 scores: (a) HTTP traffic; (b) FTP traffic; (c) SMTP traffic; (d) all traffic.
Figure 11. Best F2 scores with the best parameters.
Table 1. Summary of Related Research Works.
Year | Research Title | Algorithm | Analysis Type | Unclean Data
2022 | A K-Means clustering and SVM-based hybrid concept drift detection technique for network anomaly detection | K-Means, SVM | content-based
2021 | An Isolation Forest Learning Based Outlier Detection Approach for Effectively Classifying Cyber Anomalies | ISOF | flow-based
2021 | Anomaly Detection Using Bidirectional LSTM | LSTM | flow-based
2021 | A Hybrid CNN-LSTM Based Approach for Anomaly Detection Systems in SDNs | CNN-LSTM | flow-based
2021 | A bidirectional LSTM deep learning approach for intrusion detection | LSTM | flow-based
2020 | Anomaly-based intrusion detection from network flow features using variational autoencoder | VAE | flow-based
2020 | Intrusion detection based on Autoencoder and Isolation Forest in fog computing | AE-ISOF | flow-based
2019 | Anomaly-based intrusion detection using auto-encoder | AE | content-based
2018 | Application of Local Outlier Factor to Detect Anomalies in Computer Networks | LOF | flow-based
2018 | A deep Autoencoder-based approach for intrusion detection system | Deep AE | flow-based
2018 | Network intrusion detection using stacked sparse autoencoder and binary tree ensemble | SSAE-XGB | content-based
2018 | Web attack detection using stacked Auto-Encoder | SAE-ISOF | content-based
2018 | Nested One-Class Support Vector Machines for network intrusion detection | OCSVM | content-based
2015 | Local outlier factor usage for network flow anomaly detection | LOF | flow-based
2015 | Nonlinear dimensionality reduction for intrusion detection using autoencoder bottlenecks | AE | content-based
2015 | One-class SVM anomaly detection model | OCSVM | content-based
2011 | Inductive intrusion detection in flow-based network data using one-class SVM | OCSVM | flow-based
Now | The proposed article | LOF, ISOF, OCSVM, AE, LSTM | content-based
Table 2. Baseline Malicious Traffic Statistics in Original Training Data.
Protocol | Legitimate Conn. | Malicious Conn. | Malicious Ratio
FTP | 45,525 | 180 | 0.39%
SMTP | 83,433 | 624 | 0.70%
HTTP | 200,060 | 1714 | 0.80%
Table 3. An excerpt from a noisy FTP training set.
No. | Source | Destination | Protocol | Info
247243 | 59.166.0.7 | 149.171.126.9 | FTP | STOR README.txt
247255 | 59.166.0.7 | 149.171.126.9 | FTP | QUIT
248804 | 175.45.176.0 | 149.171.126.15 | FTP | USER anonymous
248916 | 175.45.176.0 | 149.171.126.15 | FTP | [TCP Previous segment not captured] TYPE I
248920 | 175.45.176.0 | 149.171.126.15 | FTP | PASV
248934 | 59.166.0.7 | 149.171.126.5 | FTP | USER anonymous
248942 | 59.166.0.7 | 149.171.126.5 | FTP | PASS jobs@server.com
248950 | 59.166.0.7 | 149.171.126.5 | FTP | EPSV
249012 | 175.45.176.0 | 149.171.126.15 | FTP | SIZE ../../../../../../x2CxsSUW/lwgclmRGLvZu
249018 | 175.45.176.0 | 149.171.126.15 | FTP | RETR ../../../../../../x2CxsSUW/lwgclmRGLvZu
249024 | 59.166.0.7 | 149.171.126.5 | FTP | QUIT
249650 | 59.166.0.7 | 149.171.126.8 | FTP | USER anonymous
Table 4. Sample entries from the FTP testing dataset.

TCP Tuple: 149.171.126.17-21-175.45.176.2-42810-tcp
Payload:
213 2549
150 Data connection accepted from 175.45.176.2:49220; transfer starting for exploit8.NWF (12558 bytes)
226 Transfer completed.
Label: 0

TCP Tuple: 175.45.176.2-4108-149.171.126.11-21-tcp
Payload:
USER test
PASS foobar
CWD /op/apache-1.3.31/htdocs/test
PORT 10,2,1,90,17,159
STOR poc.shtml
QUIT
Label: 0

TCP Tuple: 175.45.176.1-11178-149.171.126.13-21-tcp
Payload:
USER lWthZryPx
PASS b2Ulm2K
PORT 175,45,176,1,194,72
RETR /../../../..//4AfjB1/yDellcx.gOa
Label: 1

TCP Tuple: 59.166.0.3-7585-149.171.126.2-21-tcp
Payload:
USER anonymous
PASS jobs@server.com
EPSV
LIST
CWD pub
EPSV
RETR README.txt
EPSV
STOR README.txt
QUIT
Label: 0

TCP Tuple: 175.45.176.3-42152-149.171.126.13-21-tcp
Payload:
USER anonymous
PASS IEUser@
TYPE I
PASV
SIZE /../../../nM63/AwrIGL.aqd
RETR /../../../nM63/AwrIGL.aqd
Label: 1
Table 5. Number of TCP Connections in Testing Dataset.
Protocol | Benign | Malicious | Total
HTTP | 132,344 | 16,116 | 148,460
FTP | 24,465 | 1777 | 26,242
SMTP | 40,704 | 4031 | 44,735
Table 6. Hyperparameter Configurations for LSTM and Autoencoder Models.
Hyperparameters | LSTM Values | Autoencoder Values
Number of Hidden Layer(s) | 2 | 1; 3; 5
Activation Functions in Hidden Layer(s) | Tanh | ReLU
Activation Functions in Output Layer | Softmax | Sigmoid
Dropout | 0.2 | 0.2
Optimizer | Adam | Adadelta
Loss Function | Categorical Crossentropy | Binary Crossentropy
Number of Epochs | 10 | 10
Table 7. Optimal F2 Score Parameter Results for Each Algorithm on Each Protocol.
Best Parameter Results
Protocol | Algorithm | F2 Score | Parameters
Overall | MAX ML | 0.6436 | ISOF
Overall | MAX LSTM | 0.8085 | f_zscore
Overall | MAX AUTOENCODERS | 0.8963 | 1 hidden layer, z-score
FTP | MAX ML | 0.6335 | LOF
FTP | MAX LSTM | 0.7990 | b_mean
FTP | MAX AUTOENCODERS | 0.8759 | 3 hidden layers, IQR
SMTP | MAX ML | 0.8759 | ISOF
SMTP | MAX LSTM | 0.9265 | b_zscore
SMTP | MAX AUTOENCODERS | 0.9858 | 5 hidden layers, IQR
HTTP | MAX ML | 0.5231 | ISOF
HTTP | MAX LSTM | 0.8052 | f_zscore
HTTP | MAX AUTOENCODERS | 0.8215 | 1 hidden layer, z-score
Table 8. Optimal F2 Score Parameter Results for Each Algorithm on Each Protocol with Best Threshold.
Algorithm | Best Threshold Parameters | Noise (%) | F2 Score
MAX ML | FTP: LOF; SMTP: ISOF; HTTP: ISOF | 0 | 0.7762
MAX ML | | 0.1 | 0.6866
MAX ML | | 0.2 | 0.6610
MAX ML | | 0.3 | 0.6431
MAX ML | | Gradient | −0.0425
MAX ML | | Average | 0.6917
MAX LSTM | FTP: b_mean; SMTP: b_zscore; HTTP: f_zscore | 0 | 0.8637
MAX LSTM | | 0.1 | 0.8611
MAX LSTM | | 0.2 | 0.8203
MAX LSTM | | 0.3 | 0.8348
MAX LSTM | | Gradient | −0.0128
MAX LSTM | | Average | 0.8450
MAX AE | FTP: 3 layer-iqr; SMTP: 5 layer-iqr; HTTP: 1 layer-zscore | 0 | 0.8972
MAX AE | | 0.1 | 0.8949
MAX AE | | 0.2 | 0.8971
MAX AE | | 0.3 | 0.9007
MAX AE | | Gradient | 0.0013
MAX AE | | Average | 0.8975
