1. Introduction
The advancement of the internet has been accompanied by the emergence of new vulnerabilities that attackers exploit with increasingly sophisticated techniques. Among these, low-rate attacks pose significant challenges as they operate covertly by mimicking legitimate communication patterns, making them substantially more difficult to detect than high-rate attacks such as Distributed Denial of Service (DDoS) [
1]. Unlike DDoS attacks that generate obvious traffic spikes, low-rate attacks silently infiltrate systems without triggering traditional volume-based detection mechanisms.
Network-based Intrusion Detection Systems (NIDSs) offer an effective method for detecting low-rate attacks by analyzing network traffic in real time to identify unauthorized intrusions or malicious activities. For instance, since a significant portion of HTTP traffic comprises printable ASCII characters and should not contain executable code, the presence of executable code in HTTP packets indicates a potential malware injection attack [
2]. Traditional signature-based NIDSs rely on predefined patterns of known attacks, making them effective against recognized threats but vulnerable to zero-day attacks and novel attack variants [
3]. In contrast, anomaly-based NIDSs detect deviations from established baselines of normal behavior, enabling them to identify previously unseen attacks.
Research on anomaly-based NIDSs has evolved significantly over the past few decades. Early systems were basic, rule-based structures that scrutinized system logs using predefined thresholds and statistical measures [
4,
5]. The integration of machine learning (ML) and deep learning (DL) has fundamentally enhanced the efficiency of anomaly-based NIDSs by enabling systems to learn patterns directly from data without requiring explicit rules for every possible scenario [
6]. Recent studies have demonstrated the effectiveness of various ML approaches. For example, a study by Auskalnis, Paulauskas, and Baskys [
7] employed Local Outlier Factor (LOF) to evaluate network events based on distance from k-nearest neighbors. Ripan et al. [
8] showed that Isolation Forest improved classification accuracy through effective outlier removal. Zhang, Xu, and Gong [
9] demonstrated that a One-Class Support Vector Machine (OCSVM) achieved higher detection rates compared to traditional methods on benchmark datasets. DL architectures have also shown promising results, with Autoencoders achieving up to 7% improvement in classification performance through reconstruction-based anomaly detection [
10], and bidirectional Long Short-Term Memory (Bi-LSTM) networks demonstrating superior accuracy and detection rates compared to traditional ML approaches [
11].
A typical approach to training ML-based intrusion detection models involves providing either clearly labeled malicious and benign data (supervised learning) or exclusively benign data (unsupervised learning). However, obtaining well-labeled and representative network traffic in real-world scenarios poses significant challenges. First, manual labeling is extremely resource-intensive given the enormous volume of network traffic, and because malicious traffic typically constitutes only a small fraction of the total, the resulting training datasets are severely imbalanced. Second, during data collection, some attacks may remain undetected and inadvertently be included in what is assumed to be benign training data.
Consequently, it becomes imperative to develop anomaly detection models that not only learn from benign data but are also robust when small amounts of malicious traffic are unintentionally included in the training set. In this research, such datasets are referred to as unclean training sets, where most traffic is benign but a small fraction of malicious samples may be mixed in. The possibility of contamination in real-world traffic collection has not been adequately addressed in previous NIDS research. Since malicious traffic typically constitutes a small portion of total network traffic and some attacks may evade detection during capture, it is crucial to understand how such contamination affects model performance. Therefore, this research systematically evaluates the resilience of various anomaly-based NIDS architectures when trained on unclean datasets with controlled levels of noise. The main contributions of this study are as follows:
We introduce controlled amounts of malicious traffic into otherwise benign datasets to simulate realistic contamination scenarios.
We assess multiple anomaly-based NIDS architectures, including LOF, Isolation Forest, One-Class SVM, Autoencoders, and LSTM models, under varying degrees of noise to analyze their robustness.
This paper is structured as follows. In
Section 2, we present the related work on anomaly-based NIDSs and their assumptions regarding training data. In
Section 3, we outline the research methodology, including model architectures and threshold calculation. Next, in
Section 4, we discuss the scope and environmental setting of this work. In
Section 5, we present the experimental results and an analysis of the impact of contamination. In
Section 6, we provide further discussion and in-depth analysis of the acquired results. Finally, the article concludes by summarising our findings and offering directions for future work in
Section 7.
2. Related Works
This section begins with a review of how anomaly-based NIDSs have evolved. We then focus on the various techniques and algorithms used in machine-learning-based NIDSs (ML-NIDSs) and highlight a key obstacle to their deployment in real-world environments: the need for clean training data.
Anomaly-based intrusion detection systems (IDS) have been widely examined to counter low-rate, stealthy attacks. Early systems predominantly relied on statistical modeling and threshold-driven anomaly scoring. For instance, Bhange and Marhas used statistical profiling to detect deviations from normal traffic behavior [
12]. Bhuyan et al. applied clustering to isolate anomalous traffic patterns [
13], while Zhao and Wu leveraged subspace-based methods with entropy and clustering weights for large-scale anomaly detection [
14]. However, as adversarial strategies evolved, these traditional approaches struggled with high false alarm rates and limited generalizability. To address these limitations, researchers began adopting machine learning (ML)-based anomaly detection techniques. Several surveys have reviewed the development trend in IDS research [
15,
16,
17,
18], emphasizing unsupervised models due to their capability to detect novel attacks and cope with imperfectly labeled datasets.
Machine learning-based IDSs initially focused on classical algorithms such as Support Vector Machines, k-Nearest Neighbors, and tree-based classifiers [
7,
8,
19] to identify abnormal traffic patterns. These approaches often relied on handcrafted features and statistical assumptions, limiting their robustness under noisy or imbalanced traffic conditions. To overcome feature engineering challenges, deep learning emerged as an alternative, enabling automated representation extraction. Autoencoder-based techniques were introduced to learn latent representations of benign traffic [
20,
21], while recurrent models—particularly LSTM architectures—were employed to capture sequential dependencies in flow or payload records [
22,
23].
Developing robust ML-based IDS solutions requires discriminative and representative feature sets. Flow-based IDS approaches extract information such as packet counts, volume, flow duration, and connection states, and have proven scalable and lightweight in practical deployments [
11,
24,
25]. However, flow-only features lack semantic visibility into packet content, making them ineffective in detecting low-rate or injection-driven attacks where malicious behavior is embedded within payload bytes. To address this constraint, content-based IDS research began incorporating raw packet inspection, including byte frequency modeling, n-gram payload profiling, entropy-based characterization, and deep content embeddings [
26,
27,
28]. These representations allow detection systems to capture fine-grained structural anomalies that would otherwise appear benign in flow metadata.
Recent research trends in IDS development have focused on adaptability, online responsiveness, and hybrid detection mechanisms. Hybrid IDS frameworks combine signature-based screening with anomaly-based learning to improve coverage and reduce false alarms [
29,
30]. Meanwhile, adaptive and incremental learning mechanisms allow IDS models to update themselves against concept drift and evolving attack behaviors [
31,
32]. Despite these advancements, a persistent challenge remains: ML-based IDS require substantial and representative benign traffic samples for training, yet network traces in reality frequently contain malicious connections. This assumption of access to clean datasets, implicitly made in most prior works [
21,
33,
34], reduces practical relevance. Therefore, robust IDS techniques must be able to tolerate unclean training sets containing contaminated or mislabeled samples, rather than assuming perfectly benign data availability.
The main purpose of this research is to identify a combination of robust features and algorithms that can tolerate some malicious traffic in the training set. We introduced varying quantities of malicious data into benign datasets to assess their impact on model outcomes. Then, with the prepared unclean training sets, we evaluated two types of content-based features, namely byte frequencies and byte subsequences, together with several machine learning algorithms: LOF, ISOF, OCSVM, Autoencoders, and LSTM. As shown in
Table 1, to the best of our knowledge, no prior work has explored this area extensively. The check marks in the table indicate which experimental scenarios consider unclean data.
3. Proposed Methods
This section outlines the stages of the proposed methods for anomaly detection, encompassing dataset preparation, feature extraction, and model development (training and detection phases) using ISOF, LOF, OCSVM, Autoencoders, and LSTM, culminating in the computation and application of a threshold to determine whether a connection is malicious.
Figure 1 illustrates the overall workflow of the proposed method.
The process begins with Dataset Preparation, where initial training sets (comprising primarily legitimate connections) are transformed into noisy training sets. These noisy sets are specifically designed to contain mostly legitimate connections, but with a controlled and intentional injection of malicious instances to reflect real-world scenarios more accurately. Following dataset preparation, representative features are extracted from these noisy training sets. The specific feature type varies for each proposed model. For LSTM, byte sequences (raw ordered sequences of bytes within a connection) are utilized. In contrast, Autoencoders and the classical machine learning models, namely ISOF, LOF, and OCSVM, leverage byte frequencies (the statistical distribution of byte values within a connection). The proposed models then proceed through the Model Development (Training Phase) to learn the patterns of their respective features from the prepared datasets.
After training, the Model Development (Detection Phase) commences. For LSTM and Autoencoder models, an anomaly score is computed for each connection based on its prediction or reconstruction error, respectively. This score is then compared against a pre-determined threshold to classify the connection. This threshold is derived from the training data’s anomaly score distribution to effectively distinguish between normal and anomalous behavior. Classical machine learning algorithms (ISOF, LOF, OCSVM), on the other hand, inherently detect anomalies using their internal algorithms, and thus do not require a separate threshold computation step. Detailed explanations of each stage and model will be provided in the subsequent subsections.
3.1. Dataset Preparation
As mentioned earlier, the dataset plays a critical role in determining the outcomes of a model, especially in the context of an anomaly detection model. The appropriate selection and processing of the dataset contribute significantly to the model’s improved performance. In this study, the UNSW-NB15 dataset was selected due to its comprehensive representation of legitimate network traffic and well-labeled attack categories [
35]. To maintain a focused scope on low-rate attacks, this research considers only four out of the ten available traffic categories, namely Normal, Backdoors, Exploits, and Worms. Additionally, the analysis is restricted to HTTP, FTP, and SMTP protocols, as examining network packet content (payload) is deemed more effective in detecting low-rate attacks.
To achieve the main objective of this research, which is to comprehend how the quantity of malicious traffic in the training set impacts the model performance, we calculated the ratio of malicious traffic for every protocol. The statistics are summarized in
Table 2. As shown, the FTP protocol has the lowest proportion of malicious traffic, with only 0.39% of its connections classified as malicious. To ensure a consistent and fair comparative analysis across all three protocols, we therefore established a maximum injection limit of 0.4% for the experiments involving HTTP and SMTP.
To systematically assess the effect of malicious traffic in the training dataset, we gradually injected malicious traffic into the clean training sets. The injection was performed using fine-grained increments to capture detailed model behavior under varying contamination scenarios. This injection methodology was designed to simulate realistic conditions where training data cleanliness cannot be guaranteed in operational environments. For FTP, the proportions of injected malicious connections, referred to as noise levels, are set at 0.05%, 0.1%, 0.15%, 0.2%, 0.25%, 0.3%, 0.35%, and 0.39%. For SMTP and HTTP, the noise levels are configured at 0.1%, 0.2%, 0.3%, 0.4%, 0.5%, 0.6%, 0.7%, and 0.8%. After identifying the malicious TCP connections to be injected, we combined each set of malicious traffic with legitimate traffic into a PCAP file using the mergecap library to preserve the original packet timestamps. This resulting set is what we refer to as an
unclean dataset or
unclean training set. An example of a noisy training set for FTP, illustrating how malicious connections appear within the legitimate traffic, is shown in
Table 3. The rows highlighted in red indicate malicious connections.
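For illustration, a minimal sketch of this injection step, assuming malicious connections have already been exported as per-connection PCAP files. The helper function and file layout are hypothetical; `mergecap -w` is the standard Wireshark invocation, which interleaves packets by timestamp and thereby preserves the original timing of both captures.

```python
import random
import subprocess

def build_unclean_set(benign_pcap, malicious_pcaps, noise_level, n_benign, out_pcap):
    """Merge a sampled set of malicious connections into benign traffic.

    noise_level is the desired fraction of malicious connections
    (e.g., 0.001 for 0.1%); n_benign is the number of benign
    connections contained in benign_pcap.
    """
    n_inject = round(n_benign * noise_level)
    sampled = random.sample(malicious_pcaps, n_inject)
    # mergecap orders packets chronologically across all inputs,
    # so the injected connections keep their original timestamps.
    subprocess.run(["mergecap", "-w", out_pcap, benign_pcap, *sampled], check=True)
```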
3.2. Feature Extraction
In this study, payload-based analysis is employed instead of relying solely on packet headers, as it is considered more effective in detecting low-rate attacks [
36]. Since application-layer messages often exceed the maximum IP packet size, they are segmented into multiple packets, which may arrive in or out of order. To ensure a complete representation of network activity, incoming packets are temporarily stored in a queue buffer and grouped based on their respective TCP flows, identified by source and destination IP addresses and port numbers, before being reassembled according to the TCP protocol standard defined in RFC 793 [
37]. This reassembly process enables the system to analyze entire reconstructed payloads, providing a more comprehensive view of network traffic.
Once the TCP connection is terminated, typically marked by a FIN packet, the reassembled payload is processed by the model. However, if a connection remains incomplete beyond a predefined timeout, it is considered disrupted and is processed accordingly. By enforcing this timeout mechanism, the system prevents stalled or abandoned connections from affecting the accuracy of anomaly detection.
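A minimal sketch of this buffering logic, assuming each segment arrives as a flow key (source/destination IP and port), a TCP sequence number, its payload, and a FIN flag. This is a simplified illustration rather than a full RFC 793 reassembler (retransmissions and overlapping segments are not handled):

```python
import time

class FlowReassembler:
    """Buffer TCP segments per flow and emit the ordered payload."""

    def __init__(self, timeout=60.0):
        self.flows = {}      # flow key -> {sequence number: payload}
        self.last_seen = {}  # flow key -> time of last segment
        self.timeout = timeout

    def add_segment(self, key, seq, payload, fin=False):
        self.flows.setdefault(key, {})[seq] = payload
        self.last_seen[key] = time.monotonic()
        # A FIN marks connection termination: emit the reassembled payload.
        return self._finish(key) if fin else None

    def _finish(self, key):
        segments = self.flows.pop(key, {})
        self.last_seen.pop(key, None)
        # Reassemble in sequence-number order, per the TCP standard.
        return b"".join(segments[s] for s in sorted(segments))

    def expire_stalled(self):
        # Flush flows idle beyond the timeout so disrupted connections
        # are still processed, as described above.
        now = time.monotonic()
        stale = [k for k, t in self.last_seen.items() if now - t > self.timeout]
        return {k: self._finish(k) for k in stale}
```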
Once reassembled, the complete application-layer message (payload) is ready for feature extraction. To be processed by our machine learning models, the raw payload must first be transformed into a meaningful numerical format. While various payload representation methods exist, including [
38,
39,
40], this study focuses on two complementary approaches: byte frequencies and byte sequences. The fundamental difference in how these two features are constructed from the same payload is illustrated in
Figure 2. The process begins by converting the raw payload into a universal sequence of integers, where each byte is mapped to its corresponding value from 0 to 255. From this numerical sequence, the two distinct feature sets are derived. The precise formulation and rationale for each of these representations are detailed in the subsequent subsections.
3.2.1. Byte Frequencies
Byte frequencies capture the statistical distribution of byte values within network payloads, representing how often each possible byte value (0–255) appears in a given message. We chose this approach based on the observation that different types of network traffic exhibit distinct byte distribution patterns. Legitimate application-layer protocols typically conform to well-defined character sets and structural patterns. For instance, HTTP traffic predominantly comprises printable ASCII characters (bytes 32–126), while DNS queries follow specific encoding schemes. In contrast, malicious payloads, such as injected executable code, shellcode, or encrypted malware, introduce anomalous byte distributions that deviate significantly from these expected patterns.
Formally, for a payload of length $N$ bytes, we construct a 256-dimensional feature vector where each dimension represents the frequency of occurrence of a specific byte value. However, payload lengths vary considerably across different traffic types and network conditions, potentially biasing the analysis toward longer payloads. To address this, we normalize the byte frequencies by dividing each byte count $x_i$ by the total payload length $N$, as described in Equation (1):
$$f_i = \frac{x_i}{N}, \quad i = 0, 1, \ldots, 255 \qquad (1)$$
After normalization, the resulting vector $(f_0, \ldots, f_{255})$ is used for training and testing the detection models (i.e., ISOF, LOF, OCSVM, and Autoencoders).
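A minimal sketch of this feature extraction in Python, assuming the payload has already been reassembled into a single byte string; the function name is illustrative:

```python
import numpy as np

def byte_frequencies(payload: bytes) -> np.ndarray:
    """Return the normalized 256-dimensional byte-frequency vector.

    Dimension i holds the count of byte value i divided by the payload
    length N, as in Equation (1).
    """
    counts = np.bincount(np.frombuffer(payload, dtype=np.uint8), minlength=256)
    n = max(len(payload), 1)  # guard against empty payloads
    return counts / n

# Example: an HTTP-like payload dominated by printable ASCII (bytes 32-126).
vec = byte_frequencies(b"GET /index.html HTTP/1.1\r\n")
assert abs(vec.sum() - 1.0) < 1e-9
```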
3.2.2. Byte Sequences
While byte frequencies effectively capture the statistical composition of payloads, they inherently discard positional information—the order in which bytes appear. However, many attack signatures are characterized not merely by the presence of specific bytes but by particular byte patterns and their sequential arrangement. For example, return-oriented programming (ROP) chains, format string exploits, and certain injection attacks rely on specific byte sequences that would be indistinguishable from benign traffic when examining only frequency distributions [
41].
To preserve this sequential information, we employ byte sequences as an alternative feature representation specifically for LSTM-based models, which are architecturally designed to capture temporal dependencies in sequential data [
42]. We extract byte subsequences using a sliding window approach with configurable window size, similar to n-gram extraction in natural language processing. This method generates overlapping subsequences that capture local byte patterns within the payload. Algorithm 1 illustrates the transformation process from raw packet payload to byte subsequences.
Algorithm 1 Sliding Window for Byte Sequence
1: procedure CreateSlidingWindow(byteSequence, windowSize, stepSize)
2:   Input: byteSequence, windowSize, stepSize
3:   Output: list of sliding windows
4:   windowsList ← []
5:   for i ← 0 to length(byteSequence) − windowSize step stepSize do
6:     window ← byteSequence[i : i + windowSize]
7:     windowsList.append(window)
8:   end for
9:   return windowsList
10: end procedure
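A direct Python transcription of Algorithm 1, followed by the input/target split used for next-byte prediction (the pairing step reflects the training setup described in Section 3.3.1; variable names are illustrative):

```python
def create_sliding_windows(byte_sequence, window_size, step_size=1):
    """Return overlapping windows of window_size bytes (Algorithm 1)."""
    return [byte_sequence[i:i + window_size]
            for i in range(0, len(byte_sequence) - window_size + 1, step_size)]

# A window of length k+1 yields a model input of length k and a one-byte
# target, e.g., [85, 83] -> 69 as in Figure 2.
payload = [85, 83, 69, 82]                  # byte values of "USER"
windows = create_sliding_windows(payload, window_size=3)
pairs = [(w[:-1], w[-1]) for w in windows]  # ([85, 83], 69), ([83, 69], 82)
```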
3.3. Anomaly Detection
Anomalies are data points that deviate significantly from normal patterns, also known as outliers or rare events. Anomaly detection works by analyzing historical data to identify these unusual instances. In domains like NIDSs, this process provides critical insights by flagging potential threats.
There are three approaches to detecting anomalies: supervised, unsupervised, and semi-supervised learning. This research focuses on unsupervised learning. Unlike supervised or semi-supervised learning, unsupervised techniques do not require labelled training data [43]. By assuming that normal data points occur far more often than anomalous ones, unsupervised techniques flag infrequently occurring data points as anomalies. Instead of assigning labels, they assign each data point a score indicating how likely it is to be an anomaly. However, this approach usually assumes that the training data are clean and contain no malicious instances. Therefore, this study examines the impact of different noise levels in the training data on anomaly detection performance.
As previously mentioned, this research evaluates various machine learning models for anomaly-based intrusion detection, comparing their effectiveness in identifying network intrusion. The models examined include Long Short-Term Memory (LSTM), Autoencoders, and classical machine learning techniques such as Isolation Forest (ISOF), Local Outlier Factor (LOF), and One-Class Support Vector Machine (OCSVM). This subsection provides an in-depth discussion of each model’s core concepts, architecture, and their utilization in detecting intrusions through anomaly detection.
3.3.1. Long Short-Term Memory
A Long Short-Term Memory (LSTM) model is typically used for classification problems where labelled data are provided to train the model. However, in this research, LSTM is employed for anomaly detection, which requires a different approach. The development of the LSTM model is divided into two phases: training and detection.
In the training phase, the LSTM model is trained to predict the next byte in a subsequence obtained from the network packet payload. As illustrated in the “Byte Sequence” portion of Figure 2, the model is given an input subsequence $x$ (e.g., [85, 83]) and trained to predict the immediately following byte, which serves as the target label $y$ (e.g., 69).
Formulating the LSTM classifier as a simple function would be an oversimplification, as multiple operations are involved. Instead, the function is expressed in more detail in Equation (2):
$$\hat{y} = \arg\max\big(\sigma(R(E(x)))\big) \qquad (2)$$
Here, $E$ transforms the input into a vector of specific dimensions, acting as an embedding layer. The function $R$ represents the recurrent layer, which takes the embedded vectors as inputs and outputs an intermediate vector. This intermediate vector is then processed by the softmax function $\sigma$, which calculates the probability distribution over possible next-byte candidates. Finally, $\arg\max$ selects the byte with the highest probability as the predicted next item in the sequence.
The primary objective of training is to enable the LSTM model to remember common byte sequences found in network packet payloads. If the LSTM has encountered a byte sequence before, its prediction error is expected to be low. Conversely, unusual traffic or unseen attack patterns are likely to yield higher prediction errors due to unfamiliar byte sequences.
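For illustration, a minimal Keras sketch of such a next-byte predictor, mirroring the $E$, $R$, and softmax components of Equation (2). The embedding and hidden sizes here are assumptions for illustration; the hyperparameters actually used are listed in Table 6.

```python
from tensorflow.keras import layers, models

SEQ_LEN = 2    # input subsequence length used in the experiments (Section 5.2)
VOCAB = 256    # one class per possible byte value

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(input_dim=VOCAB, output_dim=32),  # E: embedding layer
    layers.LSTM(64),                                   # R: recurrent layer
    layers.Dense(VOCAB, activation="softmax"),         # probability over next bytes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Training pairs come from the sliding-window step: inputs of SEQ_LEN bytes,
# targets are the immediately following byte.
```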
In the detection phase, the trained LSTM model processes incoming byte sequences, similar to the training phase. However, in addition to predicting the next byte, the model also computes prediction errors, which are used to detect anomalies. Two methods are employed for error calculation: binary anomaly scoring and floating anomaly scoring. The binary anomaly score $s_b$ assesses mispredictions by assigning an error value of one whenever the predicted byte $\hat{y}_i$ does not match the actual byte $y_i$, as formally defined in Equation (3):
$$s_b = \frac{1}{l}\sum_{i=1}^{l} \mathbb{1}\left[\hat{y}_i \neq y_i\right] \qquad (3)$$
In contrast, the floating anomaly score $s_f$ quantifies the numerical deviation between the softmax output of the model $\sigma_i$ and the expected probability distribution $p_i$, as defined in Equation (4):
$$s_f = \frac{1}{l}\sum_{i=1}^{l} \left\lVert \sigma_i - p_i \right\rVert \qquad (4)$$
Both scoring methods are normalized by the message length $l$ to ensure comparability across sequences. A connection is flagged as malicious if its resulting anomaly score exceeds a predetermined threshold. While this threshold can be set manually, in this research it is computed statistically (see Section 3.4).
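A sketch of the two scoring rules, assuming `probs` holds the model's softmax output for each position of a connection and `targets` the actual bytes. The L2 distance in the floating score is an illustrative choice of deviation measure:

```python
import numpy as np

def binary_score(probs: np.ndarray, targets: np.ndarray) -> float:
    """Equation (3): fraction of mispredicted bytes in the connection."""
    predicted = probs.argmax(axis=1)
    return float((predicted != targets).sum()) / len(targets)

def floating_score(probs: np.ndarray, targets: np.ndarray) -> float:
    """Equation (4): deviation between the softmax output and the one-hot
    expected distribution, normalized by message length."""
    expected = np.eye(probs.shape[1])[targets]
    return float(np.linalg.norm(probs - expected, axis=1).sum()) / len(targets)

# A connection is flagged as malicious when its score exceeds the
# statistically computed threshold (Section 3.4).
```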
3.3.2. Autoencoders
Unlike LSTM networks, which process byte sequences, Autoencoders operate on vectorized representations of byte frequencies, as they are not designed to handle sequential data of variable length. Formally, an Autoencoder model is defined as a non-linear function $f$, which maps an input vector $X$ to its reconstructed output $\hat{X}$, as expressed in Equation (5):
$$\hat{X} = f(X) \qquad (5)$$
The function $f$ consists of stacked neural network layers and is optimized through backpropagation to minimize the reconstruction error, ensuring that $\hat{X}$ closely approximates $X$.
This study adopts the Autoencoder model developed by Pratomo et al. [44], which is trained on an unclean dataset containing both normal and anomalous network traffic. The detection phase evaluates anomaly scores based on reconstruction errors, computed using the mean squared error (MSE) between input and output, as expressed in Equation (6):
$$\mathrm{MSE} = \frac{1}{256}\sum_{i=0}^{255}\left(x_i - \hat{x}_i\right)^2 \qquad (6)$$
where $x_i$ represents the input frequency of byte $i$ and $\hat{x}_i$ its reconstructed value. The anomaly score thus reflects deviations from learned patterns: higher reconstruction errors indicate traffic patterns that were uncommon in the training data. A connection is classified as malicious if its anomaly score surpasses a precomputed threshold.
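A compact sketch of this reconstruction pipeline with illustrative layer widths; the architecture adopted in this study follows Pratomo et al. [44] and the hyperparameters in Table 6.

```python
import numpy as np
from tensorflow.keras import layers, models

INPUT_DIM = 256  # one dimension per byte value

autoencoder = models.Sequential([
    layers.Input(shape=(INPUT_DIM,)),
    layers.Dense(64, activation="relu"),    # encoder
    layers.Dense(16, activation="relu"),    # latent representation
    layers.Dense(64, activation="relu"),    # decoder
    layers.Dense(INPUT_DIM, activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="mse")

def anomaly_scores(model, x: np.ndarray) -> np.ndarray:
    """Equation (6): mean squared reconstruction error per connection."""
    x_hat = model.predict(x, verbose=0)
    return np.mean((x - x_hat) ** 2, axis=1)
```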
3.3.3. Classical Machine Learning
Unlike deep learning models such as LSTM and Autoencoders, the machine learning models used in this study, OCSVM (One-Class Support Vector Machine) [
45], LOF (Local Outlier Factor) [
46], and ISOF (Isolation Forest) [
47], are specifically designed for anomaly detection. As a result, these models can be applied without architectural modifications.
All three models process byte frequency vectors rather than byte sequences, as they are not designed for sequential data. Given that the training dataset consists primarily of legitimate connections with some malicious traffic as noise, malicious data points are more likely to fall outside OCSVM’s decision boundary, exhibit higher LOF scores, and have longer path lengths in ISOF’s tree structure.
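Because these detectors are used without architectural modifications and with default parameters (see Section 5.1.2), a minimal scikit-learn sketch suffices; the random arrays below merely stand in for byte-frequency vectors:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

X_train = np.random.rand(1000, 256)  # placeholder byte-frequency vectors
X_test = np.random.rand(100, 256)

detectors = {
    "OCSVM": OneClassSVM(),
    "LOF": LocalOutlierFactor(novelty=True),  # enables predict() on unseen data
    "ISOF": IsolationForest(),
}
for name, det in detectors.items():
    det.fit(X_train)
    pred = det.predict(X_test)  # +1 = inlier (benign), -1 = outlier (malicious)
    print(name, "flagged", int((pred == -1).sum()), "connections as malicious")
```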
3.4. Calculating Threshold
As mentioned in
Section 3.3.1 and
Section 3.3.2, in this research, the threshold for determining whether a connection is malicious or legitimate is statistically computed exclusively for LSTM and Autoencoders. This threshold is derived from the anomaly scores obtained during the detection phase. LSTM produces two types of anomaly scores: binary and floating (see
Section 3.3.1 for details), while Autoencoders use reconstruction error as their anomaly score (see
Section 3.3.2). Any anomaly score exceeding the computed threshold is classified as anomalous, whereas scores below it are considered benign.
This research employs three methods to determine the floating anomaly score threshold. The first method classifies a connection as malicious if its anomaly score falls beyond the mean ± two standard deviations. While straightforward, this approach assumes a near-normal distribution and is itself sensitive to outliers.
Therefore, we implement a second, more robust method based on the work of [48], which utilizes the median and interquartile range (IQR). The median is less influenced by extreme values, making this method suitable for non-normally distributed data. For skewed distributions, a further adjustment using the medcouple (MC) is recommended [49]. The resulting threshold $T$ can be computed as described in Equation (7), with $Q_3$ representing the third quartile and $\mathrm{IQR} = Q_3 - Q_1$:
$$T = Q_3 + 1.5\, e^{3\,\mathrm{MC}}\, \mathrm{IQR} \qquad (7)$$
The third method for defining the threshold involves the Median Absolute Deviation (MAD) [50]. This approach utilizes the median and the MAD as robust measures of central tendency and dispersion, respectively. In this method, the threshold itself remains static but cannot be compared directly with the reconstruction error; the reconstruction error must first be transformed into its modified z-score, as outlined in Equation (8):
$$z_i = \frac{0.6745\,(x_i - \tilde{x})}{\mathrm{MAD}} \qquad (8)$$
where $x_i$ is the anomaly score of connection $i$ and $\tilde{x}$ the median score. A TCP connection is identified as malicious when its z-score exceeds the specified threshold, usually set at 3.5 [50].
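The three rules can be sketched as follows, with `scores` denoting the training-set anomaly scores. The medcouple is taken from statsmodels (an assumed dependency), and 0.6745 is the usual scaling constant for MAD-based modified z-scores:

```python
import numpy as np
from statsmodels.stats.stattools import medcouple  # assumed dependency

def threshold_mean_std(scores):
    """Method 1: mean plus two standard deviations."""
    return scores.mean() + 2 * scores.std()

def threshold_adjusted_iqr(scores):
    """Method 2: upper fence of the medcouple-adjusted boxplot [48,49]."""
    q1, q3 = np.percentile(scores, [25, 75])
    mc = medcouple(scores)
    factor = np.exp(3 * mc) if mc >= 0 else np.exp(4 * mc)
    return q3 + 1.5 * factor * (q3 - q1)

def modified_zscores(scores):
    """Method 3: MAD-based modified z-scores (Equation (8)); a connection
    is flagged when its z-score exceeds 3.5 [50]."""
    med = np.median(scores)
    mad = np.median(np.abs(scores - med))
    return 0.6745 * (scores - med) / mad
```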
After the training phase, one of these three threshold calculation methods is applied. Before transitioning to detection mode, the LSTM and Autoencoder models reprocess the training set to compute anomaly scores for each TCP connection. LSTM calculates both binary and floating scores, while Autoencoders compute the reconstruction error. These scores are then used to determine the final threshold, which is subsequently applied in the detection phase.
4. Problem Setting
To systematically evaluate the proposed detection approach, it is necessary to establish the adversarial context, deployment constraints, and dataset characteristics that define the scope of this work. In what follows, we characterise the adversarial capabilities, outline the operational constraints of realistic network environments, and justify our selection of UNSW-NB15 for modelling low-rate, stealthy intrusion scenarios.
4.1. Threat Model
In this work, we consider an attacker with moderate capabilities who operates under constraints that favour low-rate, stealth-oriented behaviour. Rather than overwhelming the network with high-volume traffic, the attacker sends carefully crafted, low-frequency requests designed to mimic legitimate client activity. We focus on adversaries capable of injecting malicious scripting content—including PHP, Python, Ruby, and SQL—alongside shellcode fragments or command sequences intended to gain remote access, escalate privileges, or maintain persistence on a target host. The attacker is assumed to deliver these payloads through text-based TCP protocols such as HTTP and FTP, which provide well-structured, human-readable request formats that facilitate covert manipulation without triggering volumetric anomalies.
4.2. Deployment Assumptions
The proposed detection approach assumes that packet payloads are accessible in plaintext form for inspection. Although modern networks commonly transmit traffic over TLS, this assumption remains realistic in enterprise environments where intermediate systems legitimately terminate encrypted channels. Examples include reverse proxies, TLS-terminating load balancers, API gateways, and application firewalls, all of which receive decrypted content before forwarding it to backend services.
Figure 3 illustrates this deployment assumption. Under these conditions, the anomaly detection logic may be integrated as a module within existing traffic inspection components—such as ModSecurity-enabled web servers or NGINX App Protect deployments—without violating end-to-end security guarantees. Since payload inspection occurs post-TLS termination, the method does not require intrusive key extraction, man-in-the-middle interception, or packet decryption at unauthorised points.
4.3. Dataset Representativeness and Generalisability
UNSW-NB15 was selected due to the diversity and realism of its attack samples. Compared to DARPA, NSL-KDD, and CIC-IDS2017, which provide at most 222 low-rate attack samples, UNSW-NB15 offers approximately 27,000 labelled low-rate connections. CSE-CIC-IDS2018 offers a larger quantity (approximately 162,000 samples); however, many of its low-rate attacks were generated using scripted interactions against the DVWA testbed, resulting in repetitive patterns with limited behavioural diversity. In contrast, UNSW-NB15 traffic was generated using IXIA PerfectStorm, which simulates enterprise-like traffic streams with realistic timing variations, protocol noise, and exploit behaviours, making it more suitable for modelling stealthy intrusion behaviour. While UNSW-NB15 captures realistic network interaction patterns, additional validation on contemporary datasets would help further substantiate the robustness of our approach, particularly under different traffic compositions and adversarial behaviours.
5. Experiments and Results
This research aims to determine which combination of detection model (OCSVM, LOF, ISOF, Autoencoders, and LSTM) and feature set (byte frequencies or byte sequences) yields the optimal detection performance, especially when trained on unclean datasets containing varying degrees of malicious traffic. As detailed in
Section 3.1, these models are trained on such datasets and then rigorously evaluated using a dedicated testing set.
Figure 4 illustrates this comprehensive evaluation process. Initially, relevant features, either byte sequences or byte frequencies, are extracted from the testing sets. Byte sequences serve as features exclusively for the LSTM model, while byte frequencies are utilized by Autoencoders and all other machine learning models (ISOF, LOF, and OCSVM).
For the deep learning models (LSTM and Autoencoders), as previously outlined in
Section 3, an anomaly score is calculated based on their predictions on the testing dataset (refer to
Section 3.3.1 for LSTM and
Section 3.3.2 for Autoencoders). This score is then compared against a predefined threshold, established during the detection phase (
Section 3.4). A connection is classified as malicious if its anomaly score surpasses this threshold. In contrast, machine learning models (ISOF, LOF, and OCSVM) operate differently. These models inherently determine anomalous data without requiring a separate manual anomaly score computation or thresholding. Instead, their respective algorithms (explained in
Section 3.3.3) directly process the extracted byte frequencies from the testing set connections to classify them as either legitimate or malicious.
5.1. Experiment Setup
5.1.1. Dataset and Evaluation Metrics
The methodology for creating the noisy training sets, including the variation of noise ratios for each protocol, is detailed in
Section 3.1. It is important to note that while the noise ratio for the HTTP protocol was varied up to 0.8%, for this study, we limited the maximum noise level to 0.5% due to time constraints.
For the testing phase, we utilized the UNSW-02 dataset. The construction of the testing set followed a preparation process similar to that of the training set: connections were split based on their TCP tuples and classified as either legitimate or malicious by indexing against the corresponding CSV files in the UNSW-NB15 dataset. All connections (both legitimate and malicious) were then grouped based on their respective protocols (HTTP, SMTP, and FTP). An excerpt from the resulting dataset is shown in
Table 4, illustrating the structure where labels ‘0’ and ‘1’ denote legitimate and malicious traffic, respectively. The final composition of the testing set, detailing the number of TCP connections per protocol, is summarized in
Table 5.
The performance of the detection model was evaluated using a confusion matrix, as depicted in
Figure 5. In this matrix, rows correspond to the actual class instances, where the positive condition (P) represents malicious traffic and the negative condition (N) represents benign traffic, while columns represent the predicted class instances [
51,
52,
53,
The resulting two-by-two contingency table reports the counts for four key outcomes: true positives (TP), denoting correctly identified malicious connections; false negatives (FN), representing undetected malicious instances; false positives (FP), indicating benign traffic incorrectly classified as malicious; and true negatives (TN), reflecting correctly classified benign traffic. This breakdown facilitates a more detailed performance analysis than the proportion of correct classifications (accuracy), as accuracy can be misleading when the dataset is imbalanced and class counts vary significantly.
The confusion matrix results will be used to generate the evaluation metrics. In this study, we employ three types of evaluation metrics: Detection Rate, False-Positive Rate, and F2 score.
Detection Rate (DR) measures the model's ability to identify positive instances correctly; a higher detection rate indicates an enhanced capability to identify malicious cases. Equation (9) shows the formula:
$$\mathrm{DR} = \frac{TP}{TP + FN} \qquad (9)$$
False-Positive Rate (FPR) captures the model's tendency to misclassify negative instances as positive; a lower FPR signifies fewer benign connections flagged as malicious. Equation (10) shows the formula:
$$\mathrm{FPR} = \frac{FP}{FP + TN} \qquad (10)$$
The F2 Score is the weighted harmonic mean of precision and recall for a given threshold. It diverges from the F1 Score by placing greater emphasis on recall, which is appropriate when false negatives (undetected attacks) are more costly than false positives. A higher F2 Score indicates a well-balanced consideration of precision and recall, with a pronounced emphasis on recall. Equation (11) illustrates the computation, in which false negatives carry a higher weight than false positives:
$$F_2 = \frac{5 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{4 \cdot \mathrm{Precision} + \mathrm{Recall}} \qquad (11)$$
By incorporating these three evaluation metrics, we aim to comprehensively evaluate the model's performance, considering its ability both to identify positive instances correctly and to avoid false positives.
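For concreteness, the three metrics rendered directly over the confusion-matrix counts (Equations (9)–(11)):

```python
def detection_rate(tp, fn):
    return tp / (tp + fn)   # Equation (9): recall over malicious traffic

def false_positive_rate(fp, tn):
    return fp / (fp + tn)   # Equation (10)

def f2_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Equation (11): F-beta with beta = 2 weights recall (missed attacks)
    # four times as heavily as precision.
    return 5 * precision * recall / (4 * precision + recall)
```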
5.1.2. Network Architecture and Hyperparameters
The overview of the proposed deep learning architectures (LSTM and Autoencoders) has been discussed previously in
Section 3.3. As deep learning models contain numerous hyperparameters, we list both the LSTM and Autoencoders’ hyperparameters used in this research in
Table 6 for reproducibility purposes. For machine learning models, all parameters were set to their default values.
5.2. Experiment Results
This study involves experiments using the LSTM model and classical machine learning models, including OCSVM, LOF, and ISOF, as well as the Autoencoders model for comparison. For the byte sequence feature, each sequence was consistently set to a length of 2 bytes. This length was selected as the baseline condition, representing the minimum sequence length that preserves ordinal information; sequences of length 1 would eliminate sequential dependencies entirely.
In the Autoencoder experiments, the testing methodology involved varying the number of hidden layers to 1, 3, and 5. The anomaly detection thresholds for these experiments were determined using three distinct methods: mean, interquartile range (IQR), and z-score. For the LSTM experiments, testing was systematically divided based on two computational approaches for generating anomaly scores: binary and floating. Each approach was further evaluated using three corresponding threshold methods: b_mean, b_iqr, and b_zscore for the binary scores, and f_mean, f_iqr, and f_zscore for the floating-point scores. Furthermore, the LSTM's performance was differentiated across the network protocols analyzed, specifically HTTP, FTP, SMTP, and a combined dataset encompassing all of them, as well as by the ratio of noise introduced into the datasets.
5.2.1. Protocol-Based Attack Detection Performance
Our experimental evaluation systematically assessed model performance across three protocols (HTTP, FTP, and SMTP) under varying noise conditions. We conducted experiments with varying noise ratios as detailed in
Section 3.1 to evaluate the robustness of each approach. The comprehensive results are presented in
Table A1,
Table A2,
Table A3,
Table A4,
Table A5,
Table A6,
Table A7,
Table A8 and
Table A9, organized by model category and protocol: HTTP traffic (
Table A1,
Table A2 and
Table A3), FTP traffic (
Table A4,
Table A5 and
Table A6), and SMTP traffic (
Table A7,
Table A8 and
Table A9).
For HTTP traffic, our findings reveal distinct performance characteristics across model architectures. In the clean data scenario (0% noise), the LOF model achieved the highest F2 score among classical machine learning approaches. Autoencoders demonstrated superior performance using z-score thresholding across all hidden layer configurations, while LSTM networks achieved optimal results with floating anomaly scores combined with IQR thresholding. Notably, autoencoders outperformed the other two categories by up to 13% in F2 score when evaluated on clean data. To assess noise resilience, we examined performance degradation patterns illustrated in
Figure 6, which depicts model behavior under increasing noise levels, with dotted lines representing linear regression trends. The gradient of these trends serves as an indicator of noise sensitivity, where steeper negative gradients correspond to greater performance degradation. As observed, autoencoders exhibited the smallest gradient magnitudes, demonstrating robust noise resistance. Conversely, the LOF model experienced the steepest decline, indicating high susceptibility to training data contamination.
The experimental results for FTP traffic, detailed in
Table A4,
Table A5 and
Table A6, reveal performance patterns consistent with those observed for HTTP. In the clean data scenario, the LOF model again achieved the highest F2 score among classical machine learning approaches. For autoencoders, optimal performance was obtained using IQR thresholding with a single hidden layer configuration. LSTM networks achieved peak performance with binary anomaly scores paired with mean thresholding. The performance gap between model categories narrowed considerably for FTP, with autoencoders leading by only 1.21% in F2 score.
Figure 7 illustrates the noise sensitivity across models for FTP traffic. Consistent with the HTTP results, autoencoder variants demonstrated the lowest gradient values, confirming their superior noise tolerance. The LOF model exhibited the highest sensitivity, though the magnitude of degradation was substantially lower than that observed in HTTP traffic.
For SMTP traffic, the experimental outcomes are presented in
Table A7,
Table A8 and
Table A9 following the same organizational structure. Unlike HTTP and FTP, the ISOF model produced the highest F2 score among classical machine learning methods when tested on clean data. Autoencoders achieved optimal performance using IQR thresholding with five hidden layers, while LSTM networks performed best with binary anomaly scores combined with mean thresholding. Autoencoders maintained their performance advantage with an 11.5% higher F2 score relative to competing approaches. The noise sensitivity analysis for SMTP traffic, depicted in
Figure 8, reinforces the patterns observed across other protocols. Autoencoder variants consistently exhibited the lowest gradient values, demonstrating robust performance under noise. The LOF model again showed the highest sensitivity, with a degradation magnitude falling between the values observed for the HTTP and FTP protocols.
Our experiments across all three protocols consistently demonstrate that autoencoders exhibit superior noise resilience compared to classical machine learning and LSTM approaches. While classical methods—particularly LOF—can achieve competitive performance on clean data, they suffer substantial degradation when trained on contaminated datasets. LSTM networks show intermediate sensitivity, with performance varying based on the anomaly scoring and thresholding combination employed.
5.2.2. Overall Attack Detection Performance
To evaluate the generalizability of the proposed models across diverse traffic types, we conducted an aggregate performance analysis that averaged the detection rate, F2 scores, and FPR from the same model variations across all protocols. These variations encompass the specific algorithms for classical machine learning, the threshold method and layer count for Autoencoders, and the scoring calculation and threshold method for LSTM. Due to the varying maximum noise limits in the protocol-specific datasets, this aggregate analysis utilizes consistent noise variations of 0%, 0.1%, 0.2%, and 0.3%. The aggregated results are reported in
Table A10,
Table A11 and
Table A12, representing classical machine learning, Autoencoders, and LSTM, respectively.
As illustrated in
Table A10, the results remain consistent with the protocol-specific experiments: LOF achieves the highest F2 score among classical machine learning methods in the clean data scenario. For Autoencoders (
Table A11), the highest F2 score is obtained when employing z-score thresholding with a single hidden layer. Regarding LSTM (
Table A12), the highest F2 score was achieved utilizing floating anomaly calculation with the IQR thresholding method. Consistent with previous findings, Autoencoders demonstrate a substantial advantage, leading by up to 13% in F2 score compared to other algorithms. The impact of noise variation, illustrated in
Figure 9, further confirms that Autoencoders exhibit the most resilience to noise (lowest gradient magnitude), while LOF proves to be the least robust (highest gradient magnitude).
Moreover,
Table 7 presents a comprehensive comparison of the optimal performance parameters for each model category using the F2 score as the primary evaluation metric. The optimal parameters were identified by selecting the configurations that yielded the highest average F2 score across all noise levels. This metric was selected to provide a balanced representation of recall and precision, ensuring that both false negatives and false positives are appropriately weighted. The comparative analysis reveals that while LSTM models generally yield lower F2 scores than Autoencoders, they surpass classical machine learning models in performance. Notably, Autoencoders consistently achieved the highest values across all test cases, demonstrating their superiority in anomaly detection tasks under varying noise conditions.
Figure 10 provides a comparative visual analysis of the optimal F2 scores achievable by each algorithm category across the tested protocols. Across all four subplots (a–d), the Autoencoder model (represented by the green line) consistently maintains the highest performance trajectory, visually distinct from the RNN (orange) and classical machine learning (blue) baselines. Crucially, the trend lines for the Autoencoder exhibit minimal negative gradients, appearing nearly horizontal in the FTP and Overall scenarios, which underscores the model’s remarkable stability against increasing noise ratios. Conversely, classical machine learning models display the most significant performance degradation, particularly evident in the HTTP and Overall traffic plots where the downward slope is most pronounced.
5.2.3. Overall Attack Detection Performance with Best Parameter
While the analysis presented in
Section 5.2.2 evaluated performance using a single parameter set averaged across all traffic types, that generalized approach provides a constrained perspective on model capabilities. The reason is that optimal hyperparameters and thresholding methods vary significantly depending on the specific network protocol under examination. To address this limitation, this subsection assesses detection capabilities by aggregating the mean F2 scores obtained using the best-performing parameters tailored to each specific traffic type, as detailed in
Table 8.
The aggregated results, illustrated in
Figure 11, confirm the performance hierarchy observed in the previous section. Our findings reveal that the LSTM model generally yields lower F2 scores than Autoencoders but consistently outperforms traditional machine learning models. Nevertheless, the stability metrics derived from this optimized approach are more definitive than those obtained in the general analysis. As can be observed in Figure 11, the distinction between the models lies heavily in their resilience to unclean training data. Even when tuned to their optimal parameters, classical machine learning models exhibit the highest sensitivity to noise, manifesting as the steepest negative gradient. On the contrary, Autoencoders demonstrate remarkable robustness that is inherent to the architecture rather than a result of specific parameter tuning. Their performance trend line in Figure 11 is virtually flat, with the smallest gradient magnitude, indicating that they are the least sensitive to malicious traffic in the training set. The LSTM model occupies a middle ground, showing moderate resilience that exceeds classical models but lacks the near-total immunity to unclean training data exhibited by the Autoencoders.
6. Discussion
Throughout all experiments, whether protocol-specific or aggregated across protocols, the relative performance rankings of the algorithms remained remarkably consistent. Autoencoders consistently achieved the highest F2 scores, followed by LSTM models in second place, with classical machine learning algorithms trailing in third. However, the performance nuances within each algorithmic family varied considerably across experimental scenarios. For instance, in the classical machine learning category, LOF typically achieved the highest F2 scores when trained on clean datasets across different experimental configurations. However, when F2 scores were averaged across all noise levels, ISOF demonstrated superior performance in most experimental settings. This shift can be attributed to LOF's pronounced sensitivity to noise, as evidenced by the steep negative gradient of its performance trend.
Although the injected noise levels in our experiments were capped below 1%—a conservative representation compared to potentially higher contamination rates in operational networks—the linear degradation trends observed allow us to infer expected performance under greater noise levels. Assuming the degradation patterns remain approximately linear, the computed gradients provide a reasonable basis for extrapolation to real-world scenarios. Consider the cross-protocol scenario with optimal parameters as an example (see
Section 5.2.3). Referring to
Figure 11, autoencoders exhibit a nearly flat linear regression trend; extrapolating this trend to a noise level of 1% yields an F2 score of approximately 0.8767 for the corresponding autoencoder model under this scenario.
Autoencoders not only dominated in F2 score performance but also exhibited the greatest resilience to noise across all experimental configurations, as demonstrated by their minimal gradient values. This robustness stems from the synergy between their architectural complexity and the byte frequency feature representation. Byte frequencies capture distributional patterns rather than sequential dependencies, making them inherently tolerant to perturbations in byte ordering. Consequently, when bytes are reordered but maintain similar frequency distributions, Autoencoders can still recognize legitimate connections. The reconstruction-based anomaly detection mechanism of Autoencoders learns a compressed representation of normal traffic patterns in the latent space, enabling them to distinguish genuine anomalies from noise-induced variations. This is reflected in their consistently high detection rates (reaching 100% in several FTP experiments) while maintaining acceptably low false-positive rates.
LSTM models occupied an intermediate position in both performance and noise resilience. The byte sequence representation leverages temporal dependencies and ordering information, providing advantages when attacks manipulate byte arrangements rather than substituting byte values entirely. The recurrent architecture enables LSTM to capture complex behavioral patterns in malicious connections, as evidenced by detection rates exceeding 99% in several scenarios. However, this sequential sensitivity introduces a critical vulnerability: LSTM models become overly rigid to learned data patterns and sensitive to noise, manifested in elevated false-positive rates (up to 14.64% in some FTP experiments) that ultimately depress F2 scores.
Classical machine learning algorithms, despite benefiting from the relatively simple byte frequency features, demonstrated the most limited performance due to their algorithmic constraints. These methods struggled to achieve balanced trade-offs between detection rate and false-positive rate. For instance, ISOF consistently achieved perfect detection rates (100%) in FTP and SMTP protocols but suffered from elevated false-positive rates (up to 14.64%), while OCSVM maintained lower false-positive rates but at the cost of substantially reduced detection capabilities. This limitation reflects the fundamental challenge these algorithms face in learning complex decision boundaries from high-dimensional feature spaces without the representational capacity of deep neural networks.
In real-world deployment, these findings suggest distinct operational niches for each approach. Autoencoders are well-suited for long-running enterprise NIDSs where imperfect retraining data is inevitable, as their minimal performance degradation addresses the reality that production networks contain some proportion of malicious traffic. LSTM models excel in protocol-aware inspection components such as reverse proxies or web application firewalls where sequence patterns matter. Classical machine learning algorithms remain useful as lightweight edge filters in IoT gateways or SD-WAN appliances where computational efficiency is paramount.
7. Conclusions
This work systematically evaluated the resilience of anomaly-based network intrusion detection models when trained on unclean datasets containing varying levels of malicious traffic. We assessed five approaches—Autoencoders, LSTM, Isolation Forest, Local Outlier Factor, and One-Class Support Vector Machine—across HTTP, FTP, and SMTP protocols to determine which combination of algorithm and feature representation maintains robust performance under realistic training conditions where malicious traffic inadvertently contaminates benign datasets.
Our experimental results demonstrate that Autoencoders using byte frequency features achieved superior performance across all evaluation scenarios, with an average F2 score of 0.8975 representing only a 0.001 decrease from clean training conditions. This minimal performance degradation contrasts sharply with classical machine learning approaches (0.04 decrease) and LSTM models (0.01 decrease). LSTM models occupied an intermediate position, effectively capturing attack signatures through sequential byte pattern processing, though this sensitivity introduces vulnerability to training data contamination. Classical machine learning algorithms exhibited the highest sensitivity to noise, with LOF showing the steepest performance degradation across protocols.
These findings provide actionable guidance for practitioners in selecting intrusion detection approaches based on operational requirements. Autoencoders are well-suited for enterprise environments where perfect training data cleanliness cannot be guaranteed, LSTM models excel in protocol-aware inspection scenarios where sequence patterns are critical, and classical methods remain viable for resource-constrained edge deployments where computational efficiency is paramount.
Nevertheless, several limitations warrant acknowledgment. This study focused primarily on detection accuracy and noise resilience without comprehensive assessment of system constraints such as training time, inference speed, and memory consumption—critical factors in cases where computational overhead may offset performance advantages. Future work should incorporate systematic evaluation of these operational constraints, investigate whether LSTM performance can be enhanced through alternative feature representations that reduce noise sensitivity, and validate findings on contemporary datasets.
Author Contributions
Conceptualization, A.O.P., D.J.A., B.A.P.; methodology, A.O.P.; software, A.O.P., A.I.F., K.B.W.; validation, H.S., A.M.S., S.H.O.; formal analysis, A.O.P.; investigation, A.O.P.; resources, B.A.P.; data curation, A.O.P.; writing—original draft preparation, A.O.P., D.J.A., B.A.P.; writing—review and editing, B.A.P., H.S., A.M.S., S.H.O.; visualization, A.O.P.; supervision, B.A.P.; project administration, B.A.P.; funding acquisition, B.A.P. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Department of Informatics, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia, under funding number 1700/PKS/ITS/2022.
Institutional Review Board Statement
Ethical review and approval were waived for this study as it did not involve humans or animals.
Informed Consent Statement
Not applicable.
Data Availability Statement
Acknowledgments
The authors gratefully acknowledge the financial support from the Institut Teknologi Sepuluh Nopember for this work under the Publication Writing and IPR Incentive Program (PPHKI) 2026 scheme.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| Abbreviation | Meaning |
|---|---|
| DR | Detection Rate |
| FPR | False-Positive Rate |
| FTP | File Transfer Protocol |
| HTTP | Hypertext Transfer Protocol |
| ISOF | Isolation Forest |
| LOF | Local Outlier Factor |
| LSTM | Long Short-Term Memory |
| NIDS | Network-based Intrusion Detection System |
| OCSVM | One-Class Support Vector Machine |
| SMTP | Simple Mail Transfer Protocol |
Appendix A
Table A1.
Experimental Results of Classical Machine Learning Algorithm for HTTP Traffic.
| Malicious Traffic (%) | Algorithm | Detection Rate | F2 Score | FPR Score |
|---|---|---|---|---|
| 0 | OCSVM | 0.3271 | 0.3324 | 0.0168 |
| 0 | LOF | 0.8107 | 0.6480 | 0.0409 |
| 0 | ISOF | 0.8200 | 0.5908 | 0.0600 |
| 0.1 | OCSVM | 0.3169 | 0.3140 | 0.0202 |
| 0.1 | LOF | 0.4892 | 0.4083 | 0.0416 |
| 0.1 | ISOF | 0.7724 | 0.5435 | 0.0647 |
| 0.2 | OCSVM | 0.3064 | 0.2965 | 0.0242 |
| 0.2 | LOF | 0.3404 | 0.2991 | 0.0380 |
| 0.2 | ISOF | 0.6576 | 0.4851 | 0.0596 |
| 0.3 | OCSVM | 0.3058 | 0.3104 | 0.0182 |
| 0.3 | LOF | 0.2666 | 0.2033 | 0.0674 |
| 0.3 | ISOF | 0.6177 | 0.4768 | 0.0547 |
| 0.4 | OCSVM | 0.2928 | 0.2974 | 0.0177 |
| 0.4 | LOF | 0.2436 | 0.2114 | 0.0426 |
| 0.4 | ISOF | 0.6359 | 0.4674 | 0.0607 |
| 0.5 | OCSVM | 0.2941 | 0.2913 | 0.0205 |
| 0.5 | LOF | 0.2163 | 0.1253 | 0.1203 |
| 0.5 | ISOF | 0.8427 | 0.5750 | 0.0677 |
Table A2.
Experimental Results of Autoencoder Algorithm for HTTP Traffic.
| Malicious Traffic (%) | Parameters | Threshold | Detection Rate | F2 Score | FPR Score |
|---|---|---|---|---|---|
| 0 | 1 Hidden Layer | Mean | 0.1031 | 0.1230 | 0.0144 |
| 0 | 1 Hidden Layer | IQR | 0 | 0 | 0 |
| 0 | 1 Hidden Layer | Z Score | 0.9751 | 0.8280 | 0.1475 |
| 0 | 3 Hidden Layer | Mean | 0.0758 | 0.0922 | 0.00584 |
| 0 | 3 Hidden Layer | IQR | 0.1020 | 0.1231 | 0.0065 |
| 0 | 3 Hidden Layer | Z Score | 0.5225 | 0.5608 | 0.0219 |
| 0 | 5 Hidden Layer | Mean | 0.0746 | 0.0907 | 0.0058 |
| 0 | 5 Hidden Layer | IQR | 0.0924 | 0.1119 | 0.00626 |
| 0 | 5 Hidden Layer | Z Score | 0.3616 | 0.4030 | 0.0202 |
| 0.1 | 1 Hidden Layer | Mean | 0.1144 | 0.1360 | 0.0147 |
| 0.1 | 1 Hidden Layer | IQR | 0.0484 | 0.0594 | 0.00486 |
| 0.1 | 1 Hidden Layer | Z Score | 0.9688 | 0.8238 | 0.1471 |
| 0.1 | 3 Hidden Layer | Mean | 0.0759 | 0.0929 | 0.00585 |
| 0.1 | 3 Hidden Layer | IQR | 0.1024 | 0.1239 | 0.00654 |
| 0.1 | 3 Hidden Layer | Z Score | 0.5997 | 0.6139 | 0.0459 |
| 0.1 | 5 Hidden Layer | Mean | 0.0753 | 0.0916 | 0.00582 |
| 0.1 | 5 Hidden Layer | IQR | 0.0991 | 0.1197 | 0.0064 |
| 0.1 | 5 Hidden Layer | Z Score | 0.4735 | 0.5087 | 0.0292 |
| 0.2 | 1 Hidden Layer | Mean | 0.1086 | 0.1293 | 0.0149 |
| 0.2 | 1 Hidden Layer | IQR | 0.00137 | 0.00171 | 0.00008 |
| 0.2 | 1 Hidden Layer | Z Score | 0.9738 | 0.8270 | 0.1476 |
| 0.2 | 3 Hidden Layer | Mean | 0.0757 | 0.0920 | 0.00589 |
| 0.2 | 3 Hidden Layer | IQR | 0.0956 | 0.1156 | 0.00638 |
| 0.2 | 3 Hidden Layer | Z Score | 0.4124 | 0.4175 | 0.0851 |
| 0.2 | 5 Hidden Layer | Mean | 0.0759 | 0.0923 | 0.00585 |
| 0.2 | 5 Hidden Layer | IQR | 0.0982 | 0.1187 | 0.00638 |
| 0.2 | 5 Hidden Layer | Z Score | 0.4654 | 0.5027 | 0.0263 |
| 0.3 | 1 Hidden Layer | Mean | 0.1130 | 0.1343 | 0.01501 |
| 0.3 | 1 Hidden Layer | IQR | 0.0112 | 0.0140 | 0.00029 |
| 0.3 | 1 Hidden Layer | Z Score | 0.9898 | 0.8381 | 0.1477 |
| 0.3 | 3 Hidden Layer | Mean | 0.0746 | 0.0907 | 0.0058 |
| 0.3 | 3 Hidden Layer | IQR | 0.1005 | 0.1214 | 0.00643 |
| 0.3 | 3 Hidden Layer | Z Score | 0.5754 | 0.5991 | 0.0365 |
| 0.3 | 5 Hidden Layer | Mean | 0.0750 | 0.0912 | 0.0058 |
| 0.3 | 5 Hidden Layer | IQR | 0.1087 | 0.1310 | 0.00663 |
| 0.3 | 5 Hidden Layer | Z Score | 0.7469 | 0.7503 | 0.0372 |
| 0.4 | 1 Hidden Layer | Mean | 0.1077 | 0.1282 | 0.0148 |
| 0.4 | 1 Hidden Layer | IQR | 0 | 0 | 0 |
| 0.4 | 1 Hidden Layer | Z Score | 0.8903 | 0.7700 | 0.1439 |
| 0.4 | 3 Hidden Layer | Mean | 0.0764 | 0.0929 | 0.00589 |
| 0.4 | 3 Hidden Layer | IQR | 0.1027 | 0.1239 | 0.0065 |
| 0.4 | 3 Hidden Layer | Z Score | 0.5766 | 0.6086 | 0.0260 |
| 0.4 | 5 Hidden Layer | Mean | 0.0761 | 0.0925 | 0.00587 |
| 0.4 | 5 Hidden Layer | IQR | 0.0960 | 0.1161 | 0.00634 |
| 0.4 | 5 Hidden Layer | Z Score | 0.4329 | 0.4712 | 0.0259 |
| 0.5 | 1 Hidden Layer | Mean | 0.1313 | 0.1552 | 0.0156 |
| 0.5 | 1 Hidden Layer | IQR | 0.0739 | 0.0888 | 0.0141 |
| 0.5 | 1 Hidden Layer | Z Score | 0.9954 | 0.8419 | 0.1479 |
| 0.5 | 3 Hidden Layer | Mean | 0.0755 | 0.0912 | 0.00581 |
| 0.5 | 3 Hidden Layer | IQR | 0.1116 | 0.1344 | 0.00668 |
| 0.5 | 3 Hidden Layer | Z Score | 0.6458 | 0.6484 | 0.0539 |
| 0.5 | 5 Hidden Layer | Mean | 0.0745 | 0.0910 | 0.0058 |
| 0.5 | 5 Hidden Layer | IQR | 0.0933 | 0.1129 | 0.00633 |
| 0.5 | 5 Hidden Layer | Z Score | 0.4040 | 0.4218 | 0.0622 |
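The Mean, IQR, and Z Score entries in Tables A2, A5, A8, and A11 denote rules for deriving an anomaly cutoff from the distribution of reconstruction errors on the (possibly noisy) training set. The sketch below shows one common instantiation of each rule; the specific constants (the 1.5 IQR fence and the 3-sigma cutoff) are conventional illustrative choices, not necessarily the exact values used in our experiments.

```python
import numpy as np

def candidate_thresholds(train_errors: np.ndarray) -> dict:
    """Derive the three candidate anomaly thresholds from training reconstruction errors."""
    mu, sigma = train_errors.mean(), train_errors.std()
    q1, q3 = np.percentile(train_errors, [25, 75])
    return {
        "mean": mu,                    # Mean: flag errors above the training mean
        "iqr": q3 + 1.5 * (q3 - q1),   # IQR: Tukey upper fence (1.5 is an assumed constant)
        "zscore": mu + 3.0 * sigma,    # Z Score: k standard deviations above the mean (k = 3 assumed)
    }

# A connection is flagged as malicious when its reconstruction error exceeds the chosen threshold.
```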
Table A3.
Experimental Results of LSTM Algorithm for HTTP Traffic.
| Malicious Traffic (%) | Threshold | Detection Rate | F2 Score | FPR Score |
|---|---|---|---|---|
| 0 | b_mean | 0.1021 | 0.1091 | 0.0697 |
| 0 | b_iqr | 0.9355 | 0.8117 | 0.1002 |
| 0 | b_zscore | 1 | 0.7591 | 0.1923 |
| 0 | f_mean | 0.0076 | 0.0084 | 0.0632 |
| 0 | f_iqr | 0.9804 | 0.843 | 0.1011 |
| 0 | f_zscore | 0.9894 | 0.8249 | 0.1221 |
| 0.1 | b_mean | 0.0912 | 0.0985 | 0.0658 |
| 0.1 | b_iqr | 0.6660 | 0.6226 | 0.0839 |
| 0.1 | b_zscore | 0.9996 | 0.7725 | 0.1808 |
| 0.1 | f_mean | 0.0104 | 0.0115 | 0.06 |
| 0.1 | f_iqr | 0.9886 | 0.8435 | 0.1071 |
| 0.1 | f_zscore | 0.9904 | 0.8415 | 0.11 |
| 0.2 | b_mean | 0.0995 | 0.1068 | 0.0681 |
| 0.2 | b_iqr | 0.9721 | 0.8399 | 0.0993 |
| 0.2 | b_zscore | 0.9983 | 0.7705 | 0.1804 |
| 0.2 | f_mean | 0.0104 | 0.0115 | 0.0624 |
| 0.2 | f_iqr | 0.7794 | 0.6585 | 0.1388 |
| 0.2 | f_zscore | 0.9228 | 0.7439 | 0.1560 |
| 0.3 | b_mean | 0.0504 | 0.0554 | 0.0615 |
| 0.3 | b_iqr | 0.1036 | 0.1121 | 0.0633 |
| 0.3 | b_zscore | 0.9971 | 0.7706 | 0.1806 |
| 0.3 | f_mean | 0.0065 | 0.0073 | 0.0592 |
| 0.3 | f_iqr | 0.4820 | 0.4554 | 0.0993 |
| 0.3 | f_zscore | 0.9881 | 0.7773 | 0.1677 |
| 0.4 | b_mean | 0.1021 | 0.1094 | 0.0686 |
| 0.4 | b_iqr | 0.9726 | 0.8401 | 0.0993 |
| 0.4 | b_zscore | 1 | 0.7859 | 0.1658 |
| 0.4 | f_mean | 0.0074 | 0.0082 | 0.0616 |
| 0.4 | f_iqr | 0.9871 | 0.8486 | 0.1009 |
| 0.4 | f_zscore | 0.9925 | 0.8365 | 0.1144 |
| 0.5 | b_mean | 0.1198 | 0.1288 | 0.0656 |
| 0.5 | b_iqr | 0.9985 | 0.7827 | 0.1707 |
| 0.5 | b_zscore | 0.9996 | 0.7834 | 0.1708 |
| 0.5 | f_mean | 0.0073 | 0.0081 | 0.0584 |
| 0.5 | f_iqr | 0.9926 | 0.8122 | 0.1383 |
| 0.5 | f_zscore | 0.9919 | 0.8069 | 0.1428 |
Table A4.
Experimental Results of Classical Machine Learning Algorithm for FTP Traffic.
| Malicious Traffic (%) | Algorithm | Detection Rate | F2 Score | FPR Score |
|---|---|---|---|---|
| 0 | OCSVM | 0.8798 | 0.7560 | 0.0180 |
| 0 | LOF | 0.9728 | 0.8663 | 0.0123 |
| 0 | ISOF | 1.0000 | 0.5953 | 0.0651 |
| 0.05 | OCSVM | 0.6957 | 0.5774 | 0.0248 |
| 0.05 | LOF | 0.6689 | 0.6337 | 0.0113 |
| 0.05 | ISOF | 1.0000 | 0.4466 | 0.1154 |
| 0.1 | OCSVM | 0.6846 | 0.5717 | 0.0246 |
| 0.1 | LOF | 0.6980 | 0.6633 | 0.0106 |
| 0.1 | ISOF | 1.0000 | 0.5308 | 0.0835 |
| 0.15 | OCSVM | 0.6659 | 0.5452 | 0.0269 |
| 0.15 | LOF | 0.6904 | 0.6485 | 0.0118 |
| 0.15 | ISOF | 1.0000 | 0.4619 | 0.1089 |
| 0.2 | OCSVM | 0.6467 | 0.5728 | 0.0189 |
| 0.2 | LOF | 0.6467 | 0.6242 | 0.0101 |
| 0.2 | ISOF | 1.0000 | 0.5292 | 0.0845 |
| 0.25 | OCSVM | 0.6585 | 0.5600 | 0.0229 |
| 0.25 | LOF | 0.5580 | 0.5388 | 0.0116 |
| 0.25 | ISOF | 1.0000 | 0.4936 | 0.0961 |
| 0.3 | OCSVM | 0.6311 | 0.5420 | 0.0224 |
| 0.3 | LOF | 0.5956 | 0.5761 | 0.0108 |
| 0.3 | ISOF | 1.0000 | 0.4966 | 0.0953 |
| 0.35 | OCSVM | 0.6336 | 0.5222 | 0.0273 |
| 0.35 | LOF | 0.5982 | 0.5823 | 0.0103 |
| 0.35 | ISOF | 1.0000 | 0.5168 | 0.0891 |
| 0.39 | OCSVM | 0.6414 | 0.5612 | 0.0203 |
| 0.39 | LOF | 0.5924 | 0.5681 | 0.0118 |
| 0.39 | ISOF | 1.0000 | 0.4911 | 0.0980 |
Table A5.
Experimental Results of Autoencoder Algorithm for FTP Traffic.
| Malicious Traffic (%) | Parameters | Threshold | Detection Rate | F2 Score | FPR Score |
|---|---|---|---|---|---|
| 0 | 1 Hidden Layer | Mean | 0.9070 | 0.8285 | 0.0425 |
| 0 | 1 Hidden Layer | IQR | 0.9907 | 0.8784 | 0.0505 |
| 0 | 1 Hidden Layer | Z Score | 1 | 0.8754 | 0.0534 |
| 0 | 3 Hidden Layer | Mean | 0.6516 | 0.6162 | 0.0477 |
| 0 | 3 Hidden Layer | IQR | 1 | 0.8767 | 0.0528 |
| 0 | 3 Hidden Layer | Z Score | 1 | 0.8748 | 0.0537 |
| 0 | 5 Hidden Layer | Mean | 0.6516 | 0.6162 | 0.0477 |
| 0 | 5 Hidden Layer | IQR | 1 | 0.8767 | 0.0528 |
| 0 | 5 Hidden Layer | Z Score | 1 | 0.8748 | 0.0537 |
| 0.05 | 1 Hidden Layer | Mean | 0.9013 | 0.8277 | 0.0408 |
| 0.05 | 1 Hidden Layer | IQR | 0.9560 | 0.8489 | 0.0506 |
| 0.05 | 1 Hidden Layer | Z Score | 1 | 0.8750 | 0.0536 |
| 0.05 | 3 Hidden Layer | Mean | 0.6582 | 0.6218 | 0.0477 |
| 0.05 | 3 Hidden Layer | IQR | 1 | 0.8773 | 0.0525 |
| 0.05 | 3 Hidden Layer | Z Score | 1 | 0.8751 | 0.0536 |
| 0.05 | 5 Hidden Layer | Mean | 0.6582 | 0.6218 | 0.0477 |
| 0.05 | 5 Hidden Layer | IQR | 1 | 0.8773 | 0.0525 |
| 0.05 | 5 Hidden Layer | Z Score | 1 | 0.8751 | 0.0536 |
| 0.1 | 1 Hidden Layer | Mean | 0.9095 | 0.8323 | 0.0415 |
| 0.1 | 1 Hidden Layer | IQR | 0.9594 | 0.8522 | 0.0501 |
| 0.1 | 1 Hidden Layer | Z Score | 1 | 0.8741 | 0.0539 |
| 0.1 | 3 Hidden Layer | Mean | 0.6550 | 0.6190 | 0.0477 |
| 0.1 | 3 Hidden Layer | IQR | 0.9994 | 0.8760 | 0.0529 |
| 0.1 | 3 Hidden Layer | Z Score | 1 | 0.8745 | 0.0539 |
| 0.1 | 5 Hidden Layer | Mean | 0.6550 | 0.6190 | 0.0477 |
| 0.1 | 5 Hidden Layer | IQR | 0.9994 | 0.8760 | 0.0529 |
| 0.1 | 5 Hidden Layer | Z Score | 1 | 0.8745 | 0.0539 |
| 0.15 | 1 Hidden Layer | Mean | 0.8994 | 0.8259 | 0.0409 |
| 0.15 | 1 Hidden Layer | IQR | 0.9564 | 0.8500 | 0.0502 |
| 0.15 | 1 Hidden Layer | Z Score | 1 | 0.8756 | 0.0533 |
| 0.15 | 3 Hidden Layer | Mean | 0.6204 | 0.5910 | 0.0472 |
| 0.15 | 3 Hidden Layer | IQR | 0.9917 | 0.8709 | 0.0528 |
| 0.15 | 3 Hidden Layer | Z Score | 1 | 0.8747 | 0.0538 |
| 0.15 | 5 Hidden Layer | Mean | 0.6204 | 0.5910 | 0.0472 |
| 0.15 | 5 Hidden Layer | IQR | 0.9917 | 0.8709 | 0.0528 |
| 0.15 | 5 Hidden Layer | Z Score | 1 | 0.8747 | 0.0538 |
| 0.2 | 1 Hidden Layer | Mean | 0.7410 | 0.6975 | 0.0428 |
| 0.2 | 1 Hidden Layer | IQR | 0.9905 | 0.8750 | 0.0502 |
| 0.2 | 1 Hidden Layer | Z Score | 1 | 0.8749 | 0.0537 |
| 0.2 | 3 Hidden Layer | Mean | 0.6546 | 0.6189 | 0.0476 |
| 0.2 | 3 Hidden Layer | IQR | 1 | 0.8767 | 0.0527 |
| 0.2 | 3 Hidden Layer | Z Score | 1 | 0.8748 | 0.0537 |
| 0.2 | 5 Hidden Layer | Mean | 0.6546 | 0.6189 | 0.0476 |
| 0.2 | 5 Hidden Layer | IQR | 1 | 0.8767 | 0.0527 |
| 0.2 | 5 Hidden Layer | Z Score | 1 | 0.8748 | 0.0537 |
| 0.25 | 1 Hidden Layer | Mean | 0.9103 | 0.8308 | 0.0426 |
| 0.25 | 1 Hidden Layer | IQR | 0.9920 | 0.8766 | 0.0500 |
| 0.25 | 1 Hidden Layer | Z Score | 1 | 0.8761 | 0.0531 |
| 0.25 | 3 Hidden Layer | Mean | 0.6784 | 0.6379 | 0.0480 |
| 0.25 | 3 Hidden Layer | IQR | 1 | 0.8759 | 0.0531 |
| 0.25 | 3 Hidden Layer | Z Score | 1 | 0.8738 | 0.0542 |
| 0.25 | 5 Hidden Layer | Mean | 0.6784 | 0.6379 | 0.0480 |
| 0.25 | 5 Hidden Layer | IQR | 1 | 0.8759 | 0.0531 |
| 0.25 | 5 Hidden Layer | Z Score | 1 | 0.8738 | 0.0542 |
| 0.3 | 1 Hidden Layer | Mean | 0.8966 | 0.8218 | 0.0420 |
| 0.3 | 1 Hidden Layer | IQR | 0.9884 | 0.8724 | 0.0507 |
| 0.3 | 1 Hidden Layer | Z Score | 1 | 0.8735 | 0.0543 |
| 0.3 | 3 Hidden Layer | Mean | 0.6340 | 0.6020 | 0.0473 |
| 0.3 | 3 Hidden Layer | IQR | 1 | 0.8763 | 0.0530 |
| 0.3 | 3 Hidden Layer | Z Score | 1 | 0.8742 | 0.0540 |
| 0.3 | 5 Hidden Layer | Mean | 0.6340 | 0.6020 | 0.0473 |
| 0.3 | 5 Hidden Layer | IQR | 1 | 0.8763 | 0.0530 |
| 0.3 | 5 Hidden Layer | Z Score | 1 | 0.8742 | 0.0540 |
| 0.35 | 1 Hidden Layer | Mean | 0.9070 | 0.8298 | 0.0419 |
| 0.35 | 1 Hidden Layer | IQR | 0.9924 | 0.8737 | 0.0516 |
| 0.35 | 1 Hidden Layer | Z Score | 1 | 0.8744 | 0.0539 |
| 0.35 | 3 Hidden Layer | Mean | 0.6424 | 0.6090 | 0.0474 |
| 0.35 | 3 Hidden Layer | IQR | 0.9990 | 0.8764 | 0.0526 |
| 0.35 | 3 Hidden Layer | Z Score | 1 | 0.8751 | 0.0536 |
| 0.35 | 5 Hidden Layer | Mean | 0.6424 | 0.6090 | 0.0474 |
| 0.35 | 5 Hidden Layer | IQR | 0.9990 | 0.8764 | 0.0526 |
| 0.35 | 5 Hidden Layer | Z Score | 1 | 0.8751 | 0.0536 |
| 0.39 | 1 Hidden Layer | Mean | 0.9126 | 0.8354 | 0.0412 |
| 0.39 | 1 Hidden Layer | IQR | 0.9918 | 0.8745 | 0.0509 |
| 0.39 | 1 Hidden Layer | Z Score | 1 | 0.8760 | 0.0530 |
| 0.39 | 3 Hidden Layer | Mean | 0.6349 | 0.6026 | 0.0475 |
| 0.39 | 3 Hidden Layer | IQR | 1 | 0.8767 | 0.0528 |
| 0.39 | 3 Hidden Layer | Z Score | 1 | 0.8745 | 0.0538 |
| 0.39 | 5 Hidden Layer | Mean | 0.6349 | 0.6026 | 0.0475 |
| 0.39 | 5 Hidden Layer | IQR | 1 | 0.8767 | 0.0528 |
| 0.39 | 5 Hidden Layer | Z Score | 1 | 0.8745 | 0.0538 |
Table A6.
Experimental Results of LSTM Algorithm for FTP Traffic.
| Malicious Traffic (%) | Threshold Type | Detection Rate | F2 Score | FPR Score |
|---|---|---|---|---|
| 0 | b_mean | 0.9854 | 0.8222 | 0.0737 |
| 0 | b_iqr | 1 | 0.7143 | 0.1464 |
| 0 | b_zscore | 1 | 0.7143 | 0.1464 |
| 0 | f_mean | 0.9203 | 0.8164 | 0.0524 |
| 0 | f_iqr | 1 | 0.7143 | 0.1464 |
| 0 | f_zscore | 1 | 0.7143 | 0.1464 |
| 0.05 | b_mean | 0.9961 | 0.8631 | 0.0565 |
| 0.05 | b_iqr | 1 | 0.7141 | 0.1462 |
| 0.05 | b_zscore | 1 | 0.7141 | 0.1462 |
| 0.05 | f_mean | 0.9645 | 0.8545 | 0.0496 |
| 0.05 | f_iqr | 1 | 0.7141 | 0.1462 |
| 0.05 | f_zscore | 1 | 0.7141 | 0.1462 |
| 0.1 | b_mean | 0.8960 | 0.8083 | 0.0475 |
| 0.1 | b_iqr | 1 | 0.7251 | 0.1393 |
| 0.1 | b_zscore | 1 | 0.7251 | 0.1393 |
| 0.1 | f_mean | 0.8571 | 0.7819 | 0.0458 |
| 0.1 | f_iqr | 1 | 0.7251 | 0.1393 |
| 0.1 | f_zscore | 1 | 0.7251 | 0.1393 |
| 0.15 | b_mean | 0.8898 | 0.7689 | 0.0659 |
| 0.15 | b_iqr | 1 | 0.7176 | 0.1447 |
| 0.15 | b_zscore | 1 | 0.7176 | 0.1447 |
| 0.15 | f_mean | 0.8288 | 0.7723 | 0.0395 |
| 0.15 | f_iqr | 1 | 0.7176 | 0.1447 |
| 0.15 | f_zscore | 1 | 0.7176 | 0.1447 |
| 0.2 | b_mean | 0.8677 | 0.7853 | 0.0479 |
| 0.2 | b_iqr | 1 | 0.7141 | 0.1461 |
| 0.2 | b_zscore | 1 | 0.7141 | 0.1461 |
| 0.2 | f_mean | 0.8350 | 0.7775 | 0.0390 |
| 0.2 | f_iqr | 1 | 0.7141 | 0.1461 |
| 0.2 | f_zscore | 1 | 0.7141 | 0.1461 |
| 0.25 | b_mean | 0.8519 | 0.7673 | 0.0511 |
| 0.25 | b_iqr | 1 | 0.7173 | 0.1440 |
| 0.25 | b_zscore | 1 | 0.7173 | 0.1440 |
| 0.25 | f_mean | 0.8041 | 0.7536 | 0.0388 |
| 0.25 | f_iqr | 1 | 0.7173 | 0.1440 |
| 0.25 | f_zscore | 1 | 0.7173 | 0.1440 |
| 0.3 | b_mean | 0.8941 | 0.8041 | 0.0486 |
| 0.3 | b_iqr | 1 | 0.7134 | 0.1467 |
| 0.3 | b_zscore | 1 | 0.7134 | 0.1467 |
| 0.3 | f_mean | 0.7748 | 0.7358 | 0.0358 |
| 0.3 | f_iqr | 1 | 0.7134 | 0.1467 |
| 0.3 | f_zscore | 1 | 0.7134 | 0.1467 |
| 0.35 | b_mean | 0.8330 | 0.7740 | 0.0403 |
| 0.35 | b_iqr | 1 | 0.7203 | 0.1428 |
| 0.35 | b_zscore | 1 | 0.7203 | 0.1428 |
| 0.35 | f_mean | 0.7472 | 0.7133 | 0.0361 |
| 0.35 | f_iqr | 1 | 0.7203 | 0.1428 |
| 0.35 | f_zscore | 1 | 0.7203 | 0.1428 |
| 0.39 | b_mean | 0.8754 | 0.7975 | 0.0445 |
| 0.39 | b_iqr | 1 | 0.7139 | 0.1455 |
| 0.39 | b_zscore | 1 | 0.7139 | 0.1455 |
| 0.39 | f_mean | 0.8533 | 0.7826 | 0.0434 |
| 0.39 | f_iqr | 1 | 0.7139 | 0.1455 |
| 0.39 | f_zscore | 1 | 0.7139 | 0.1455 |
Table A7.
Experimental Results of Classical Machine Learning Algorithm for SMTP Traffic.
| Malicious Traffic (%) | Algorithm | Detection Rate | F2 Score | FPR Score |
|---|---|---|---|---|
| 0 | OCSVM | 0.9972 | 0.1840 | 0.3816 |
| 0 | LOF | 0.9986 | 0.7706 | 0.0256 |
| 0 | ISOF | 1.0000 | 0.8716 | 0.0127 |
| 0.1 | OCSVM | 0.9972 | 0.1823 | 0.3826 |
| 0.1 | LOF | 0.3792 | 0.3230 | 0.0255 |
| 0.1 | ISOF | 1.0000 | 0.8529 | 0.0148 |
| 0.2 | OCSVM | 0.9973 | 0.1888 | 0.3757 |
| 0.2 | LOF | 0.2551 | 0.2243 | 0.0251 |
| 0.2 | ISOF | 1.0000 | 0.8737 | 0.0127 |
| 0.3 | OCSVM | 0.9958 | 0.1876 | 0.3683 |
| 0.3 | LOF | 0.2363 | 0.2064 | 0.0254 |
| 0.3 | ISOF | 1.0000 | 0.8765 | 0.0120 |
| 0.4 | OCSVM | 0.9958 | 0.1882 | 0.3715 |
| 0.4 | LOF | 0.2111 | 0.1858 | 0.0255 |
| 0.4 | ISOF | 1.0000 | 0.8694 | 0.0130 |
| 0.5 | OCSVM | 0.9861 | 0.1901 | 0.3634 |
| 0.5 | LOF | 0.2275 | 0.1998 | 0.0254 |
| 0.5 | ISOF | 1.0000 | 0.8925 | 0.0104 |
| 0.6 | OCSVM | 0.9151 | 0.1856 | 0.3466 |
| 0.6 | LOF | 0.2370 | 0.2085 | 0.0254 |
| 0.6 | ISOF | 1.0000 | 0.8950 | 0.0103 |
| 0.7 | OCSVM | 0.8699 | 0.1739 | 0.3464 |
| 0.7 | LOF | 0.2392 | 0.2090 | 0.0255 |
| 0.7 | ISOF | 1.0000 | 0.8756 | 0.0122 |
Table A8.
Experimental Results of Autoencoder Algorithm for SMTP Traffic.
| Malicious Traffic (%) | Parameters | Threshold | Detection Rate | F2 Score | FPR Score |
|---|---|---|---|---|---|
| 0 | 1 Hidden Layer | Mean | 0.9990 | 0.9847 | 0.0077 |
| 0 | 1 Hidden Layer | IQR | 0.9993 | 0.9849 | 0.0077 |
| 0 | 1 Hidden Layer | Z Score | 1 | 0.9853 | 0.0078 |
| 0 | 3 Hidden Layer | Mean | 0.9962 | 0.9873 | 0.0051 |
| 0 | 3 Hidden Layer | IQR | 0.9964 | 0.9829 | 0.0076 |
| 0 | 3 Hidden Layer | Z Score | 0.9998 | 0.9852 | 0.0077 |
| 0 | 5 Hidden Layer | Mean | 0.9922 | 0.9845 | 0.0049 |
| 0 | 5 Hidden Layer | IQR | 0.9952 | 0.9869 | 0.0049 |
| 0 | 5 Hidden Layer | Z Score | 0.9998 | 0.9862 | 0.0072 |
| 0.1 | 1 Hidden Layer | Mean | 0.9990 | 0.9846 | 0.0078 |
| 0.1 | 1 Hidden Layer | IQR | 0.9993 | 0.9846 | 0.0078 |
| 0.1 | 1 Hidden Layer | Z Score | 1 | 0.9853 | 0.0078 |
| 0.1 | 3 Hidden Layer | Mean | 0.9234 | 0.9301 | 0.0043 |
| 0.1 | 3 Hidden Layer | IQR | 0.9979 | 0.9835 | 0.0079 |
| 0.1 | 3 Hidden Layer | Z Score | 0.9998 | 0.9849 | 0.0079 |
| 0.1 | 5 Hidden Layer | Mean | 0.6826 | 0.7244 | 0.0030 |
| 0.1 | 5 Hidden Layer | IQR | 0.9926 | 0.9850 | 0.0048 |
| 0.1 | 5 Hidden Layer | Z Score | 1 | 0.9865 | 0.0072 |
| 0.2 | 1 Hidden Layer | Mean | 0.9990 | 0.9846 | 0.0078 |
| 0.2 | 1 Hidden Layer | IQR | 0.9993 | 0.9848 | 0.0078 |
| 0.2 | 1 Hidden Layer | Z Score | 1 | 0.9853 | 0.0078 |
| 0.2 | 3 Hidden Layer | Mean | 0.7862 | 0.8156 | 0.0035 |
| 0.2 | 3 Hidden Layer | IQR | 0.9979 | 0.9839 | 0.0077 |
| 0.2 | 3 Hidden Layer | Z Score | 0.9998 | 0.9852 | 0.0078 |
| 0.2 | 5 Hidden Layer | Mean | 0.5811 | 0.6310 | 0.0025 |
| 0.2 | 5 Hidden Layer | IQR | 0.9960 | 0.9876 | 0.0049 |
| 0.2 | 5 Hidden Layer | Z Score | 0.9998 | 0.9852 | 0.0077 |
| 0.3 | 1 Hidden Layer | Mean | 0.9990 | 0.9845 | 0.0078 |
| 0.3 | 1 Hidden Layer | IQR | 0.9993 | 0.9847 | 0.0078 |
| 0.3 | 1 Hidden Layer | Z Score | 1 | 0.9852 | 0.0079 |
| 0.3 | 3 Hidden Layer | Mean | 0.5426 | 0.5943 | 0.0023 |
| 0.3 | 3 Hidden Layer | IQR | 0.9964 | 0.9865 | 0.0057 |
| 0.3 | 3 Hidden Layer | Z Score | 0.9988 | 0.9845 | 0.0077 |
| 0.3 | 5 Hidden Layer | Mean | 0.5229 | 0.5753 | 0.0023 |
| 0.3 | 5 Hidden Layer | IQR | 0.9962 | 0.9876 | 0.0049 |
| 0.3 | 5 Hidden Layer | Z Score | 0.9998 | 0.9851 | 0.0078 |
| 0.4 | 1 Hidden Layer | Mean | 0.9990 | 0.9846 | 0.0078 |
| 0.4 | 1 Hidden Layer | IQR | 0.9995 | 0.9850 | 0.0078 |
| 0.4 | 1 Hidden Layer | Z Score | 1 | 0.9853 | 0.0078 |
| 0.4 | 3 Hidden Layer | Mean | 0.5207 | 0.5732 | 0.0023 |
| 0.4 | 3 Hidden Layer | IQR | 0.9964 | 0.9836 | 0.0072 |
| 0.4 | 3 Hidden Layer | Z Score | 0.9993 | 0.9847 | 0.0078 |
| 0.4 | 5 Hidden Layer | Mean | 0.4736 | 0.5269 | 0.0021 |
| 0.4 | 5 Hidden Layer | IQR | 0.9926 | 0.9849 | 0.0049 |
| 0.4 | 5 Hidden Layer | Z Score | 1 | 0.9869 | 0.0070 |
| 0.5 | 1 Hidden Layer | Mean | 0.9990 | 0.9846 | 0.0078 |
| 0.5 | 1 Hidden Layer | IQR | 0.9993 | 0.9848 | 0.0078 |
| 0.5 | 1 Hidden Layer | Z Score | 1 | 0.9735 | 0.0142 |
| 0.5 | 3 Hidden Layer | Mean | 0.5024 | 0.5353 | 0.0023 |
| 0.5 | 3 Hidden Layer | IQR | 0.9964 | 0.9828 | 0.0077 |
| 0.5 | 3 Hidden Layer | Z Score | 1 | 0.9853 | 0.0078 |
| 0.5 | 5 Hidden Layer | Mean | 0.4648 | 0.5183 | 0.0021 |
| 0.5 | 5 Hidden Layer | IQR | 0.9960 | 0.9875 | 0.0049 |
| 0.5 | 5 Hidden Layer | Z Score | 1 | 0.9853 | 0.0078 |
| 0.6 | 1 Hidden Layer | Mean | 0.9990 | 0.9848 | 0.0077 |
| 0.6 | 1 Hidden Layer | IQR | 0.9993 | 0.9850 | 0.0077 |
| 0.6 | 1 Hidden Layer | Z Score | 1 | 0.9737 | 0.0141 |
| 0.6 | 3 Hidden Layer | Mean | 0.4850 | 0.5383 | 0.0021 |
| 0.6 | 3 Hidden Layer | IQR | 0.9967 | 0.9828 | 0.0077 |
| 0.6 | 3 Hidden Layer | Z Score | 0.9998 | 0.9850 | 0.0079 |
| 0.6 | 5 Hidden Layer | Mean | 0.4495 | 0.5029 | 0.0020 |
| 0.6 | 5 Hidden Layer | IQR | 0.9962 | 0.9877 | 0.0049 |
| 0.6 | 5 Hidden Layer | Z Score | 0.9998 | 0.9737 | 0.0140 |
| 0.7 | 1 Hidden Layer | Mean | 0.9990 | 0.9845 | 0.0078 |
| 0.7 | 1 Hidden Layer | IQR | 0.9995 | 0.9849 | 0.0078 |
| 0.7 | 1 Hidden Layer | Z Score | 1 | 0.9852 | 0.0079 |
| 0.7 | 3 Hidden Layer | Mean | 0.4497 | 0.5031 | 0.0020 |
| 0.7 | 3 Hidden Layer | IQR | 0.9964 | 0.9844 | 0.0067 |
| 0.7 | 3 Hidden Layer | Z Score | 0.9990 | 0.9844 | 0.0079 |
| 0.7 | 5 Hidden Layer | Mean | 0.4304 | 0.4838 | 0.0019 |
| 0.7 | 5 Hidden Layer | IQR | 0.9848 | 0.9788 | 0.0048 |
| 0.7 | 5 Hidden Layer | Z Score | 0.9998 | 0.9870 | 0.0068 |
Table A9.
Experimental Results of LSTM Algorithm for SMTP Traffic.
| Malicious Traffic (%) | Threshold | Detection Rate | F2 Score | FPR Score |
|---|---|---|---|---|
| 0 | b_mean | 0.9960 | 0.9634 | 0.0172 |
| 0 | b_iqr | 0.9985 | 0.9441 | 0.0288 |
| 0 | b_zscore | 0.9985 | 0.9441 | 0.0288 |
| 0 | f_mean | 0.9262 | 0.9084 | 0.0171 |
| 0 | f_iqr | 0.9597 | 0.9243 | 0.0230 |
| 0 | f_zscore | 0.9433 | 0.9202 | 0.0181 |
| 0.1 | b_mean | 0.8853 | 0.8774 | 0.0160 |
| 0.1 | b_iqr | 0.9862 | 0.9372 | 0.0276 |
| 0.1 | b_zscore | 0.9813 | 0.9335 | 0.0275 |
| 0.1 | f_mean | 0.8405 | 0.8426 | 0.0148 |
| 0.1 | f_iqr | 0.9764 | 0.9397 | 0.0219 |
| 0.1 | f_zscore | 0.9713 | 0.9361 | 0.0217 |
| 0.2 | b_mean | 0.8833 | 0.8747 | 0.0165 |
| 0.2 | b_iqr | 0.9855 | 0.9356 | 0.0280 |
| 0.2 | b_zscore | 0.9803 | 0.9316 | 0.0280 |
| 0.2 | f_mean | 0.8069 | 0.8131 | 0.0155 |
| 0.2 | f_iqr | 0.9433 | 0.9131 | 0.0222 |
| 0.2 | f_zscore | 0.9222 | 0.8973 | 0.0216 |
| 0.3 | b_mean | 0.7698 | 0.7817 | 0.0154 |
| 0.3 | b_iqr | 0.9724 | 0.8999 | 0.0429 |
| 0.3 | b_zscore | 0.9675 | 0.9230 | 0.0273 |
| 0.3 | f_mean | 0.7525 | 0.7696 | 0.0136 |
| 0.3 | f_iqr | 0.9317 | 0.9064 | 0.0207 |
| 0.3 | f_zscore | 0.9049 | 0.8934 | 0.0159 |
| 0.4 | b_mean | 0.9531 | 0.9301 | 0.0170 |
| 0.4 | b_iqr | 0.9933 | 0.9202 | 0.0403 |
| 0.4 | b_zscore | 0.9931 | 0.9413 | 0.0281 |
| 0.4 | f_mean | 0.6303 | 0.6589 | 0.0152 |
| 0.4 | f_iqr | 0.9189 | 0.8977 | 0.0198 |
| 0.4 | f_zscore | 0.8935 | 0.8817 | 0.0173 |
| 0.5 | b_mean | 0.7668 | 0.7776 | 0.0163 |
| 0.5 | b_iqr | 0.9716 | 0.9045 | 0.0398 |
| 0.5 | b_zscore | 0.9617 | 0.9172 | 0.0280 |
| 0.5 | f_mean | 0.5782 | 0.6157 | 0.0117 |
| 0.5 | f_iqr | 0.9042 | 0.8814 | 0.0224 |
| 0.5 | f_zscore | 0.8677 | 0.8604 | 0.0174 |
| 0.6 | b_mean | 0.7262 | 0.7429 | 0.0161 |
| 0.6 | b_iqr | 0.9704 | 0.9085 | 0.0369 |
| 0.6 | b_zscore | 0.9563 | 0.9068 | 0.0315 |
| 0.6 | f_mean | 0.6348 | 0.6642 | 0.0144 |
| 0.6 | f_iqr | 0.9385 | 0.8946 | 0.0306 |
| 0.6 | f_zscore | 0.9193 | 0.8906 | 0.0241 |
| 0.7 | b_mean | 0.7577 | 0.7698 | 0.0164 |
| 0.7 | b_iqr | 0.9717 | 0.8996 | 0.0430 |
| 0.7 | b_zscore | 0.9663 | 0.9144 | 0.0318 |
| 0.7 | f_mean | 0.6280 | 0.6570 | 0.0152 |
| 0.7 | f_iqr | 0.8870 | 0.8632 | 0.0251 |
| 0.7 | f_zscore | 0.8676 | 0.8486 | 0.0244 |
Table A10.
Experimental Results of Classical Machine Learning Algorithm for All Traffic.
| Malicious Traffic (%) | Algorithm | Detection Rate | F2 Score | FPR Score |
|---|---|---|---|---|
| 0 | OCSVM | 0.7347 | 0.4241 | 0.1388 |
| 0 | LOF | 0.9274 | 0.7617 | 0.0262 |
| 0 | ISOF | 0.9400 | 0.6859 | 0.0459 |
| 0.1 | OCSVM | 0.6662 | 0.3560 | 0.1425 |
| 0.1 | LOF | 0.5221 | 0.4649 | 0.0259 |
| 0.1 | ISOF | 0.9241 | 0.6424 | 0.0543 |
| 0.2 | OCSVM | 0.6501 | 0.3527 | 0.1396 |
| 0.2 | LOF | 0.4141 | 0.3825 | 0.0244 |
| 0.2 | ISOF | 0.8859 | 0.6293 | 0.0523 |
| 0.3 | OCSVM | 0.6442 | 0.3467 | 0.1363 |
| 0.3 | LOF | 0.3662 | 0.3286 | 0.0345 |
| 0.3 | ISOF | 0.8726 | 0.6166 | 0.0540 |
Table A11.
Experimental Results of Autoencoder Algorithm for All Traffic.
| Malicious Traffic (%) | Hidden Layers | Threshold | Detection Rate | F2 Score | FPR Score |
|---|---|---|---|---|---|
| 0 | 1 | Mean | 0.6697 | 0.6454 | 0.0215 |
| 0 | 1 | IQR | 0.6633 | 0.6211 | 0.0194 |
| 0 | 1 | Z Score | 0.9917 | 0.8962 | 0.0696 |
| 0 | 3 | Mean | 0.5745 | 0.5652 | 0.0195 |
| 0 | 3 | IQR | 0.6995 | 0.6609 | 0.0223 |
| 0 | 3 | Z Score | 0.8408 | 0.8069 | 0.0278 |
| 0 | 5 | Mean | 0.5728 | 0.5638 | 0.0195 |
| 0 | 5 | IQR | 0.6959 | 0.6585 | 0.0213 |
| 0 | 5 | Z Score | 0.7871 | 0.7547 | 0.0270 |
| 0.1 | 1 | Mean | 0.6743 | 0.6510 | 0.0213 |
| 0.1 | 1 | IQR | 0.6690 | 0.6321 | 0.0209 |
| 0.1 | 1 | Z Score | 0.9896 | 0.8944 | 0.0696 |
| 0.1 | 3 | Mean | 0.5514 | 0.5473 | 0.0193 |
| 0.1 | 3 | IQR | 0.6999 | 0.6611 | 0.0224 |
| 0.1 | 3 | Z Score | 0.8665 | 0.8244 | 0.0359 |
| 0.1 | 5 | Mean | 0.4710 | 0.4783 | 0.0189 |
| 0.1 | 5 | IQR | 0.6970 | 0.6602 | 0.0214 |
| 0.1 | 5 | Z Score | 0.8245 | 0.7899 | 0.0301 |
| 0.2 | 1 | Mean | 0.6162 | 0.6038 | 0.0218 |
| 0.2 | 1 | IQR | 0.6637 | 0.6205 | 0.0194 |
| 0.2 | 1 | Z Score | 0.9913 | 0.8957 | 0.0697 |
| 0.2 | 3 | Mean | 0.5055 | 0.5088 | 0.0190 |
| 0.2 | 3 | IQR | 0.6978 | 0.6587 | 0.0223 |
| 0.2 | 3 | Z Score | 0.8041 | 0.7592 | 0.0489 |
| 0.2 | 5 | Mean | 0.4372 | 0.4474 | 0.0186 |
| 0.2 | 5 | IQR | 0.6981 | 0.6610 | 0.0213 |
| 0.2 | 5 | Z Score | 0.8217 | 0.7876 | 0.0292 |
| 0.3 | 1 | Mean | 0.6695 | 0.6469 | 0.0216 |
| 0.3 | 1 | IQR | 0.6663 | 0.6237 | 0.0196 |
| 0.3 | 1 | Z Score | 0.9966 | 0.8989 | 0.0700 |
| 0.3 | 3 | Mean | 0.4171 | 0.4290 | 0.0185 |
| 0.3 | 3 | IQR | 0.6990 | 0.6614 | 0.0217 |
| 0.3 | 3 | Z Score | 0.8581 | 0.8193 | 0.0327 |
| 0.3 | 5 | Mean | 0.4106 | 0.4228 | 0.0185 |
| 0.3 | 5 | IQR | 0.7016 | 0.6650 | 0.0215 |
| 0.3 | 5 | Z Score | 0.9156 | 0.8699 | 0.0330 |
Table A12.
Experimental Results of LSTM Algorithm for All Traffic.
| Malicious Traffic (%) | Threshold | Detection Rate | F2 Score | FPR Score |
|---|---|---|---|---|
| 0 | b_mean | 0.6945 | 0.6316 | 0.0536 |
| 0 | b_iqr | 0.978 | 0.8234 | 0.0918 |
| 0 | b_zscore | 0.9995 | 0.8058 | 0.1225 |
| 0 | f_mean | 0.618 | 0.5777 | 0.0442 |
| 0 | f_iqr | 0.98 | 0.8272 | 0.0902 |
| 0 | f_zscore | 0.9776 | 0.8198 | 0.0955 |
| 0.1 | b_mean | 0.6241 | 0.5947 | 0.0431 |
| 0.1 | b_iqr | 0.8841 | 0.7617 | 0.0836 |
| 0.1 | b_zscore | 0.9937 | 0.8104 | 0.1159 |
| 0.1 | f_mean | 0.5694 | 0.5453 | 0.0402 |
| 0.1 | f_iqr | 0.9883 | 0.8361 | 0.0894 |
| 0.1 | f_zscore | 0.9872 | 0.8342 | 0.0903 |
| 0.2 | b_mean | 0.6168 | 0.5889 | 0.0442 |
| 0.2 | b_iqr | 0.9858 | 0.8299 | 0.0911 |
| 0.2 | b_zscore | 0.9929 | 0.8054 | 0.1182 |
| 0.2 | f_mean | 0.5508 | 0.5340 | 0.0390 |
| 0.2 | f_iqr | 0.9076 | 0.7619 | 0.1024 |
| 0.2 | f_zscore | 0.9483 | 0.7851 | 0.1079 |
| 0.3 | b_mean | 0.5714 | 0.5471 | 0.0418 |
| 0.3 | b_iqr | 0.6920 | 0.5752 | 0.0843 |
| 0.3 | b_zscore | 0.9882 | 0.8023 | 0.1182 |
| 0.3 | f_mean | 0.5113 | 0.5042 | 0.0362 |
| 0.3 | f_iqr | 0.8046 | 0.6918 | 0.0889 |
| 0.3 | f_zscore | 0.9643 | 0.7947 | 0.1101 |
References
- Algaolahi, A.; Aljoby, W.; Ghaleb, M.; Harras, K.A. Detecting and Identifying the Targets of Covert DDoS Attacks. In Proceedings of the 2024 IEEE 21st International Conference on Smart Communities: Improving Quality of Life Using AI, Robotics and IoT (HONET), Doha, Qatar, 3–5 December 2024; pp. 143–148. [Google Scholar] [CrossRef]
- Ahmed, I.; Lhee, K.S. Classification of packet contents for malware detection. J. Comput. Virol. 2011, 7, 279–295. [Google Scholar] [CrossRef]
- Fauzi, N.; Yulianto, F.A.; Nuha, H.H. The Effectiveness of Anomaly-Based Intrusion Detection Systems in Handling Zero-Day Attacks Using AdaBoost, J48, and Random Forest Methods. In Proceedings of the 2023 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob), Bali, Indonesia, 10–12 October 2023; pp. 57–62. [Google Scholar] [CrossRef]
- Faizal, M.A.; Zaki, M.M.; Shahrin, S.; Robiah, Y.; Rahayu, S.S.; Nazrulazhar, B. Threshold Verification Technique for Network Intrusion Detection System. arXiv 2009, arXiv:0906.3843. [Google Scholar] [CrossRef]
- David Akande, T.; Kaur, B.; Dadkhah, S.; Ghorbani, A.A. Threshold based Technique to Detect Anomalies using Log Files. In Proceedings of the 2022 7th International Conference on Machine Learning Technologies, New York, NY, USA, 11–13 March 2022; ICMLT ’22. pp. 191–198. [Google Scholar] [CrossRef]
- Almuhanna, R.; Dardouri, S. A deep learning/machine learning approach for anomaly based network intrusion detection. Front. Artif. Intell. 2025, 8, 1625891. [Google Scholar] [CrossRef] [PubMed]
- Auskalnis, J.; Paulauskas, N.; Baskys, A. Application of local outlier factor algorithm to detect anomalies in computer network. Elektron. Ir Elektrotechnika 2018, 24, 96–99. [Google Scholar] [CrossRef]
- Ripan, R.C.; Sarker, I.H.; Anwar, M.M.; Furhad, M.H.; Rahat, F.; Hoque, M.M.; Sarfraz, M. An isolation forest learning based outlier detection approach for effectively classifying cyber anomalies. In Hybrid Intelligent Systems: 20th International Conference on Hybrid Intelligent Systems (HIS 2020), December 14–16, 2020; Springer: Cham, Switzerland, 2021; pp. 270–279. [Google Scholar]
- Zhang, M.; Xu, B.; Gong, J. An anomaly detection model based on one-class svm to detect network intrusions. In Proceedings of the 2015 11th International Conference on Mobile Ad-Hoc and Sensor Networks (MSN), Shenzhen, China, 16–18 December 2015; pp. 102–107. [Google Scholar]
- Nguimbous, Y.N.; Ksantini, R.; Bouhoula, A. Anomaly-based intrusion detection using auto-encoder. In Proceedings of the 2019 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), Split, Croatia, 19–21 September 2019; pp. 1–5. [Google Scholar]
- Imrana, Y.; Xiang, Y.; Ali, L.; Abdul-Rauf, Z. A bidirectional LSTM deep learning approach for intrusion detection. Expert Syst. Appl. 2021, 185, 115524. [Google Scholar] [CrossRef]
- Anup, B.; Kaur, M.M. Anomaly Detection in Network Traffic: A Statistical Approach; LAP Lambert Academic Publishing: Saarbrücken, Germany, 2012. [Google Scholar]
- Bhuyan, M.H.; Bhattacharyya, D.K.; Kalita, J.K. Network Traffic Anomaly Detection Techniques and Systems. In Network Traffic Anomaly Detection and Prevention: Concepts, Techniques, and Tools; Springer International Publishing: Cham, Switzerland, 2017; pp. 115–169. [Google Scholar] [CrossRef]
- Zhao, X.; Wu, Q. Subspace-Based Anomaly Detection for Large-Scale Campus Network Traffic. J. Appl. Math. 2023, 2023, 8489644. [Google Scholar] [CrossRef]
- Modi, C.; Patel, D.; Borisaniya, B.; Patel, H.; Patel, A.; Rajarajan, M. A survey of intrusion detection techniques in cloud. J. Netw. Comput. Appl. 2013, 36, 42–57. [Google Scholar] [CrossRef]
- Othman, S.M.; Alsohybe, N.T.; Ba-Alwi, F.M.; Zahary, A.T. Survey on intrusion detection system types. Int. J. Cyber-Secur. Digit. Forensics 2018, 7, 444–463. [Google Scholar]
- Liu, H.; Lang, B. Machine learning and deep learning methods for intrusion detection systems: A survey. Appl. Sci. 2019, 9, 4396. [Google Scholar] [CrossRef]
- Otoum, S.; Kantarci, B.; Mouftah, H. A comparative study of ai-based intrusion detection techniques in critical infrastructures. ACM Trans. Internet Technol. (TOIT) 2021, 21, 1–22. [Google Scholar] [CrossRef]
- Jain, M.; Kaur, G.; Saxena, V. A K-Means clustering and SVM based hybrid concept drift detection technique for network anomaly detection. Expert Syst. Appl. 2022, 193, 116510. [Google Scholar] [CrossRef]
- Zavrak, S.; İskefiyeli, M. Anomaly-based intrusion detection from network flow features using variational autoencoder. IEEE Access 2020, 8, 108346–108358. [Google Scholar] [CrossRef]
- Sadaf, K.; Sultana, J. Intrusion detection based on autoencoder and isolation forest in fog computing. IEEE Access 2020, 8, 167059–167068. [Google Scholar] [CrossRef]
- Aljbali, S.; Roy, K. Anomaly Detection Using Bidirectional LSTM. In Intelligent Systems and Applications; Advances in Intelligent Systems and Computing; Springer: Cham, Switzerland, 2021; Volume 1250. [Google Scholar] [CrossRef]
- Abdallah, M.; An Le Khac, N.; Jahromi, H.; Delia Jurcut, A. A hybrid CNN-LSTM based approach for anomaly detection systems in SDNs. In Proceedings of the 16th International Conference on Availability, Reliability and Security, Vienna, Austria, 17–20 August 2021; pp. 1–7. [Google Scholar]
- Farahnakian, F.; Heikkonen, J. A deep auto-encoder based approach for intrusion detection system. In Proceedings of the 2018 20th International Conference on Advanced Communication Technology (ICACT), Chuncheon, Republic of Korea, 11–14 February 2018; pp. 178–183. [Google Scholar]
- Paulauskas, N.; Bagdonas, A.F. Local outlier factor use for the network flow anomaly detection. Secur. Commun. Netw. 2015, 8, 4203–4212. [Google Scholar] [CrossRef]
- Nguyen, Q.T.; Tran, K.P.; Castagliola, P.; Huong, T.T.; Nguyen, M.K.; Lardjane, S. Nested one-class support vector machines for network intrusion detection. In Proceedings of the 2018 IEEE Seventh International Conference on Communications and Electronics (ICCE), Hue, Vietnam, 18–20 July 2018; pp. 7–12. [Google Scholar]
- Abolhasanzadeh, B. Nonlinear dimensionality reduction for intrusion detection using auto-encoder bottleneck features. In Proceedings of the 2015 7th Conference on Information and Knowledge Technology (IKT), Urmia, Iran, 26–28 May 2015; pp. 1–5. [Google Scholar]
- Zhang, B.; Yu, Y.; Li, J. Network intrusion detection based on stacked sparse autoencoder and binary tree ensemble method. In Proceedings of the 2018 IEEE International Conference on Communications Workshops (ICC Workshops), Kansas City, MO, USA, 20–24 May 2018; pp. 1–6. [Google Scholar]
- Chen, Z.; Simsek, M.; Kantarci, B.; Bagheri, M.; Djukic, P. Machine learning-enabled hybrid intrusion detection system with host data transformation and an advanced two-stage classifier. Comput. Netw. 2024, 250, 110576. [Google Scholar] [CrossRef]
- Narayana Rao, K.; Venkata Rao, K.; P.V.G.D., P.R. A hybrid Intrusion Detection System based on Sparse autoencoder and Deep Neural Network. Comput. Commun. 2021, 180, 77–88. [Google Scholar] [CrossRef]
- Zha, C.; Wang, Z.; Fan, Y.; Bai, B.; Zhang, Y.; Shi, S.; Zhang, R. A-NIDS: Adaptive Network Intrusion Detection System Based on Clustering and Stacked CTGAN. IEEE Trans. Inf. Forensics Secur. 2025, 20, 3204–3219. [Google Scholar] [CrossRef]
- Ma, Z.; Liu, L.; Meng, W.; Luo, X.; Wang, L.; Li, W. ADCL: Toward an Adaptive Network Intrusion Detection System Using Collaborative Learning in IoT Networks. IEEE Internet Things J. 2023, 10, 12521–12536. [Google Scholar] [CrossRef]
- Winter, P.; Hermann, E.; Zeilinger, M. Inductive intrusion detection in flow-based network data using one-class support vector machines. In Proceedings of the 2011 4th IFIP International Conference on New Technologies, Mobility and Security, Paris, France, 7–10 February 2011; pp. 1–5. [Google Scholar]
- Mirsky, Y.; Doitshman, T.; Elovici, Y.; Shabtai, A. Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection. arXiv 2018, arXiv:1802.09089. [Google Scholar] [CrossRef]
- Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, Australia, 10–12 November 2015; pp. 1–6. [Google Scholar] [CrossRef]
- Pratomo, B. Low-Rate Attack Detection with Intelligent Fine-Grained Network Analysis. Ph.D. Thesis, Cardiff University, Cardiff, UK, 2020. [Google Scholar]
- Transmission Control Protocol. RFC 793, 1981. Available online: https://www.rfc-editor.org/info/rfc0793 (accessed on 12 August 2025).
- Hao, Y.; Sheng, Y.; Wang, J. Variant Gated Recurrent Units with Encoders to Preprocess Packets for Payload-Aware Intrusion Detection. IEEE Access 2019, 7, 49985–49998. [Google Scholar] [CrossRef]
- Saeed, R.; Khaliq Qureshi, H.; Ioannou, C.; Lestas, M. A Proactive Model for Intrusion Detection Using Image Representation of Network Flows. IEEE Access 2024, 12, 160653–160666. [Google Scholar] [CrossRef]
- Nie, F.; Liu, W.; Liu, G.; Gao, B.; Huang, J.; Tian, W.; Yuen, C. Empowering Anomaly Detection in IoT Traffic Through Multiview Subspace Learning. IEEE Internet Things J. 2025, 12, 15911–15925. [Google Scholar] [CrossRef]
- Brizendine, B.; Kusuma, S.S.; Rimal, B.P. Process Injection Using Return-Oriented Programming. IEEE Access 2025, 13, 133790–133816. [Google Scholar] [CrossRef]
- Viviani, L.A.; Ranganathan, P. Evaluating the Suitability of LSTM Models for Edge Computing. In Proceedings of the 2024 Cyber Awareness and Research Symposium (CARS), Grand Forks, ND, USA, 28–29 October 2024; pp. 1–7. [Google Scholar] [CrossRef]
- Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. (CSUR) 2009, 41, 1–58. [Google Scholar] [CrossRef]
- Pratomo, B.A.; Fajar, A.I.; Munif, A.; Ijtihadie, R.M.; Studiawan, H.; Santoso, B.J. Training Autoencoders with Noisy Training Sets for Detecting Low-rate Attacks on the Network. In Proceedings of the 2022 IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom), Malang, Indonesia, 16–18 June 2022; pp. 138–143. [Google Scholar]
- Bounsiar, A.; Madden, M.G. One-Class Support Vector Machines Revisited. In Proceedings of the 2014 International Conference on Information Science and Applications (ICISA), Seoul, Republic of Korea, 6–9 May 2014; pp. 1–4. [Google Scholar] [CrossRef]
- Breunig, M.; Kröger, P.; Ng, R.; Sander, J. LOF: Identifying Density-Based Local Outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 15–18 May 2000; Volume 29, pp. 93–104. [Google Scholar] [CrossRef]
- Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar] [CrossRef]
- Diez, D.M.; Barr, C.D.; Cetinkaya-Rundel, M. OpenIntro Statistics; OpenIntro: Boston, MA, USA, 2012. [Google Scholar]
- Hubert, M.; Vandervieren, E. An adjusted boxplot for skewed distributions. Comput. Stat. Data Anal. 2008, 52, 5186–5201. [Google Scholar] [CrossRef]
- Crosby, T. How to Detect and Handle Outliers; Taylor & Francis: Oxfordshire, UK, 1994. [Google Scholar]
- Powers, D.M. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061. [Google Scholar] [CrossRef]
- Sammut, C.; Webb, G.I. Encyclopedia of Machine Learning; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
- Shepperd, M.; Bowes, D.; Hall, T. Researcher bias: The use of machine learning in software defect prediction. IEEE Trans. Softw. Eng. 2014, 40, 603–616. [Google Scholar] [CrossRef]
- Deng, X.; Liu, Q.; Deng, Y.; Mahadevan, S. An improved method to construct basic probability assignment based on the confusion matrix for classification problem. Inf. Sci. 2016, 340, 250–261. [Google Scholar] [CrossRef]
Figure 1.
The overall workflow of the proposed methodology. Differences in color do not convey any additional meaning.
Figure 2.
Comparison between byte frequencies and byte sequences.
Figure 3.
Deployment assumption for the proposed payload-based NIDSs.
Figure 4.
Workflow for the evaluation phase.
Figure 5.
Confusion matrix illustration.
Figure 6.
F2 scores of the Classical ML, LSTM, and Autoencoder-based models on HTTP traffic.
Figure 7.
F2 scores of the Classical ML, LSTM, and Autoencoder-based models on FTP traffic.
Figure 8.
F2 scores of the Classical ML, LSTM, and Autoencoder-based models on SMTP traffic.
Figure 9.
F2 scores of the Classical ML, LSTM, and Autoencoder-based models on all traffic.
Figure 10.
Best F2 scores: (a) HTTP traffic; (b) FTP traffic; (c) SMTP traffic; (d) all traffic.
Figure 11.
Best F2 scores with the best parameters.
Table 1.
Summary of Related Research Works.
| Year | Research Title | Algorithm | Analysis Type | Unclean Data |
|---|---|---|---|---|
| 2022 | A K-Means clustering and SVM-based hybrid concept drift detection technique for network anomaly detection | K-Means, SVM | content-based | – |
| 2021 | An Isolation Forest Learning Based Outlier Detection Approach for Effectively Classifying Cyber Anomalies | ISOF | flow-based | – |
| 2021 | Anomaly Detection Using Bidirectional LSTM | LSTM | flow-based | – |
| 2021 | A Hybrid CNN-LSTM Based Approach for Anomaly Detection Systems in SDNs | CNN-LSTM | flow-based | – |
| 2021 | A bidirectional LSTM deep learning approach for intrusion detection | LSTM | flow-based | – |
| 2020 | Anomaly-based intrusion detection from network flow features using variational autoencoder | VAE | flow-based | – |
| 2020 | Intrusion detection based on Autoencoder and Isolation Forest in fog computing | AE-ISOF | flow-based | – |
| 2019 | Anomaly-based intrusion detection using auto-encoder | AE | content-based | – |
| 2018 | Application of Local Outlier Factor to Detect Anomalies in Computer Networks | LOF | flow-based | – |
| 2018 | A deep Autoencoder-based approach for intrusion detection system | Deep AE | flow-based | – |
| 2018 | Network intrusion detection using stacked sparse autoencoder and binary tree ensemble | SSAE-XGB | content-based | – |
| 2018 | Web attack detection using stacked Auto-Encoder | SAE-ISOF | content-based | – |
| 2018 | Nested One-Class Support Vector Machines for network intrusion detection | OCSVM | content-based | – |
| 2015 | Local outlier factor usage for network flow anomaly detection | LOF | flow-based | – |
| 2015 | Nonlinear dimensionality reduction for intrusion detection using autoencoder bottlenecks | AE | content-based | – |
| 2015 | One-class SVM anomaly detection model | OCSVM | content-based | – |
| 2011 | Inductive intrusion detection in flow-based network data using one-class SVM | OCSVM | flow-based | – |
| — | This work | LOF, ISOF, OCSVM, AE, LSTM | content-based | ✓ |
Table 2.
Baseline Malicious Traffic Statistics in Original Training Data.
| Protocol | Legitimate Conn. | Malicious Conn. | Malicious Ratio |
|---|---|---|---|
| FTP | 45,525 | 180 | 0.39% |
| SMTP | 83,433 | 624 | 0.70% |
| HTTP | 200,060 | 1714 | 0.80% |
Table 3.
An excerpt from a noisy FTP training set.
| No. | Source | Destination | Protocol | Info |
|---|---|---|---|---|
| 247243 | 59.166.0.7 | 149.171.126.9 | FTP | STOR README.txt |
| 247255 | 59.166.0.7 | 149.171.126.9 | FTP | QUIT |
| 248804 | 175.45.176.0 | 149.171.126.15 | FTP | USER anonymous |
| 248916 | 175.45.176.0 | 149.171.126.15 | FTP | [TCP Previous segment not captured] TYPE I |
| 248920 | 175.45.176.0 | 149.171.126.15 | FTP | PASV |
| 248934 | 59.166.0.7 | 149.171.126.5 | FTP | USER anonymous |
| 248942 | 59.166.0.7 | 149.171.126.5 | FTP | PASS jobs@server.com |
| 248950 | 59.166.0.7 | 149.171.126.5 | FTP | EPSV |
| 249012 | 175.45.176.0 | 149.171.126.15 | FTP | SIZE ../../../../../../x2CxsSUW/lwgclmRGLvZu |
| 249018 | 175.45.176.0 | 149.171.126.15 | FTP | RETR ../../../../../../x2CxsSUW/lwgclmRGLvZu |
| 249024 | 59.166.0.7 | 149.171.126.5 | FTP | QUIT |
| 249650 | 59.166.0.7 | 149.171.126.8 | FTP | USER anonymous |
Table 4.
Sample entries from the FTP testing dataset.
| TCP Tuple | Payload | Label |
|---|---|---|
| 149.171.126.17-21-175.45.176.2-42810-tcp | 213 2549 150 Data connection accepted from 175.45.176.2:49220; transfer starting for exploit8.NWF(12558)bytes) 226 Transfer completed. | 0 |
| 175.45.176.2-4108-149.171.126.11-21-tcp | USER test PASS foobar CWD /op/apache-1.3.31/htdocs/test PORT 10,2,1,90,17,159 STOR poc.shtml QUIT | 0 |
| 175.45.176.1-11178-149.171.126.13-21-tcp | USER lWthZryPx PASS b2Ulm2K PORT 175,45,176,1,194,72 RETR /../../../..//4AfjB1/yDellcx.gOa | 1 |
| 59.166.0.3-7585-149.171.126.2-21-tcp | USER anonymous PASS jobs@server.com EPSV LIST CWD pub EPSV RETR README.txt EPSV STOR README.txt QUIT | 0 |
| 175.45.176.3-42152-149.171.126.13-21-tcp | USER anonymous PASS IEUser@ TYPE I PASV SIZE /../../../nM63/AwrIGL.aqd RETR /../../../nM63/AwrIGL.aqd | 1 |
Table 5.
Number of TCP Connections in Testing Dataset.
| Protocol | Benign | Malicious | Total |
|---|---|---|---|
| HTTP | 132,344 | 16,116 | 148,460 |
| FTP | 24,465 | 1777 | 26,242 |
| SMTP | 40,704 | 4031 | 44,735 |
Table 6.
Hyperparameter Configurations for LSTM and Autoencoder Models.
| Hyperparameters | LSTM Values | Autoencoder Values |
|---|---|---|
| Number of Hidden Layer(s) | 2 | 1; 3; 5 |
| Activation Functions in Hidden Layer(s) | Tanh | ReLU |
| Activation Functions in Output Layer | Softmax | Sigmoid |
| Dropout | 0.2 | 0.2 |
| Optimizer | Adam | Adadelta |
| Loss Function | Categorical Crossentropy | Binary Crossentropy |
| Number of Epochs | 10 | 10 |
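As an illustration of how the Table 6 configuration translates into a model, the sketch below builds the three-hidden-layer Autoencoder variant in Keras; the 128–64–128 layer widths, the batch size, and the 256-dimensional byte-frequency input are illustrative assumptions, since Table 6 does not specify them.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(input_dim: int = 256) -> keras.Model:
    """Three-hidden-layer Autoencoder per Table 6: ReLU hidden activations,
    sigmoid output, dropout 0.2, Adadelta optimizer, binary cross-entropy loss."""
    inputs = keras.Input(shape=(input_dim,))
    x = layers.Dense(128, activation="relu")(inputs)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(64, activation="relu")(x)      # bottleneck
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(input_dim, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adadelta", loss="binary_crossentropy")
    return model

# Training reconstructs the (possibly noisy) benign inputs for 10 epochs:
# model.fit(x_train, x_train, epochs=10, batch_size=128)
```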
Table 7.
Optimal F2 Score Parameter Results for Each Algorithm on Each Protocol.
| Protocol | Algorithm | F2 Score | Parameters |
|---|---|---|---|
| Overall | MAX ML | 0.6436 | ISOF |
| Overall | MAX LSTM | 0.8085 | f_zscore |
| Overall | MAX AUTOENCODERS | 0.8963 | 1 hidden layer, z-score |
| FTP | MAX ML | 0.6335 | LOF |
| FTP | MAX LSTM | 0.7990 | b_mean |
| FTP | MAX AUTOENCODERS | 0.8759 | 3 hidden layers, IQR |
| SMTP | MAX ML | 0.8759 | ISOF |
| SMTP | MAX LSTM | 0.9265 | b_zscore |
| SMTP | MAX AUTOENCODERS | 0.9858 | 5 hidden layers, IQR |
| HTTP | MAX ML | 0.5231 | ISOF |
| HTTP | MAX LSTM | 0.8052 | f_zscore |
| HTTP | MAX AUTOENCODERS | 0.8215 | 1 hidden layer, z-score |
Table 8.
Optimal F2 Score Parameter Results for Each Algorithm on Each Protocol with Best Threshold.
| Algorithm | Best Threshold Parameters | Noise % | F2 Score |
|---|---|---|---|
| MAX ML | FTP: LOF; SMTP: ISOF; HTTP: ISOF | 0 | 0.7762 |
| MAX ML | | 0.1 | 0.6866 |
| MAX ML | | 0.2 | 0.6610 |
| MAX ML | | 0.3 | 0.6431 |
| MAX ML | | Gradient | −0.0425 |
| MAX ML | | Average | 0.6917 |
| MAX LSTM | FTP: b_mean; SMTP: b_zscore; HTTP: f_zscore | 0 | 0.8637 |
| MAX LSTM | | 0.1 | 0.8611 |
| MAX LSTM | | 0.2 | 0.8203 |
| MAX LSTM | | 0.3 | 0.8348 |
| MAX LSTM | | Gradient | −0.0128 |
| MAX LSTM | | Average | 0.8450 |
| MAX AE | FTP: 3 layers, IQR; SMTP: 5 layers, IQR; HTTP: 1 layer, z-score | 0 | 0.8972 |
| MAX AE | | 0.1 | 0.8949 |
| MAX AE | | 0.2 | 0.8971 |
| MAX AE | | 0.3 | 0.9007 |
| MAX AE | | Gradient | 0.0013 |
| MAX AE | | Average | 0.8975 |