1. Introduction
In recent years, with the rapid development of information technology and the acceleration of global digitalization, the Internet has become a critical infrastructure for modern social and economic activities and daily life, playing a key role in promoting economic growth, transforming work, and reshaping lifestyles [1]. However, the widespread application of the Internet has also brought increasingly severe security challenges. Cyberattack techniques continue to evolve, ranging from Distributed Denial of Service (DDoS) attacks and malware propagation to Advanced Persistent Threats (APTs), ransomware, and data theft, with significantly increased complexity and destructiveness. These threats not only cause economic losses for enterprises and compromise user privacy but also jeopardize the stable operation of critical infrastructure and even threaten national security. Consequently, cybersecurity issues have garnered widespread attention from both academia and industry [2].
To address these threats, researchers and practitioners have developed various defense measures, including firewalls, antivirus software, encryption technologies, and access control mechanisms [3]. Firewalls effectively block known attacks by filtering anomalous traffic, antivirus software detects and removes malicious code, and encryption technologies ensure the confidentiality of data transmission. However, sophisticated attackers may exploit encrypted traffic to conceal malicious activities [4] or bypass access controls through social engineering, further complicating defense efforts.
Against this backdrop, Intrusion Detection Systems (IDS) [5] have emerged as a research hotspot in the field of cybersecurity. By monitoring network traffic, system logs, and user behavior in real time, IDS identify anomalous activities and respond promptly to potential threats [6], aiming to ensure the confidentiality, integrity, and availability of information systems. Compared to traditional defense measures, IDS not only counter known attacks but also detect unknown threats through anomaly detection [7], offering greater adaptability. Research on intrusion detection holds significant theoretical and practical value in the cybersecurity domain, effectively protecting information systems in critical sectors such as finance, healthcare, and the Internet of Things (IoT); reducing economic losses and security risks; ensuring the stable operation of national critical infrastructure; and facilitating the advancement of the cybersecurity industry.
Despite significant progress in current IDS technologies, several challenges persist in ongoing research. First, the original feature space struggles to adequately characterize the complex patterns of high-dimensional network traffic, resulting in insufficient detection capabilities for novel attacks. Second, the spatiotemporal correlation characteristics inherent in network traffic have not been effectively explored, limiting the model’s ability to detect dynamic and time-series-related attacks. Third, single-modality feature analysis struggles to address diverse threat scenarios, leading to high false-positive rates and low accuracy in detection [8].
The remainder of this paper is organized as follows.
Section 2 reviews related work on intrusion detection methods and highlights the motivation for our proposed approach.
Section 3 introduces the design of the proposed OS–Graph–GRU (OGG) method, including feature space transformation, spatial correlation modeling with GCN, and temporal dependency modeling with BiGRU.
Section 4 presents the experimental setup, datasets, evaluation metrics, and comparative results, followed by detailed analyses.
Section 5 concludes the paper with key findings, limitations, and directions for future research.
2. Analysis of Related Works
Early network intrusion detection systems primarily relied on static matching rules based on expert knowledge, but such methods were ineffective in addressing the dynamic changes and complex diversity of network threats [9]. In recent years, machine learning-based intrusion detection methods have gained prominence and have been widely applied to intrusion detection tasks [10]. Traditional machine learning techniques, such as Support Vector Machines (SVM) [11], Random Forests (RF) [12], Artificial Neural Networks (ANN) [13], K-Nearest Neighbors (KNN) [14], and Naive Bayes Classifier (NBC) [15], have achieved certain improvements in detection accuracy. However, since network attack traffic can be concealed within large volumes of normal traffic, these traditional methods exhibit weak generalization capabilities [16], struggle to extract deep data features, and are ill-suited to dynamic and complex network environments [17]. In contrast, deep learning, with its powerful feature extraction capabilities, has achieved significant breakthroughs in fields such as computer vision, autonomous driving, and natural language processing [18]. Many researchers have applied deep learning to intrusion detection for traffic classification, transforming network traffic anomaly detection into a classification problem [19].
Kim [20] proposed a CNN (Convolutional Neural Network)-based attack detection model to address the difficulty of traditional intrusion detection systems in identifying sophisticated attacks. Experimental results demonstrated the model’s superior attack detection capabilities. However, it was limited in capturing temporal dependencies in data. Sayegh [21] introduced an intrusion detection method combining LSTM (Long Short-Term Memory), feature selection, and SMOTE techniques. This method employed Recursive Feature Elimination (RFE) with Random Forests for feature selection and used LSTM to learn time-series features, but it could not capture bidirectional dependencies in sequences. Imrana [22] addressed the issues of low accuracy and high false-positive rates in detecting U2R and R2L attacks by proposing a Bidirectional Long Short-Term Memory (BiDLSTM)-based intrusion detection system. Results showed that this model outperformed traditional LSTM and other advanced models in terms of accuracy, precision, recall, and F-score. Al-Kahtani [23] proposed a feature selection method based on PSO-GA and a fused LSTM-GRU model to tackle deficiencies in feature selection and classification accuracy in network intrusion detection, with experiments demonstrating improved detection accuracy. Halbouni [24] addressed the increasing variety and scale of network attacks by proposing a hybrid intrusion detection model combining Convolutional Neural Networks and Long Short-Term Memory networks. Experiments indicated that this model achieved high detection rates, high accuracy, and low false-positive rates.
However, existing intrusion detection methods still face shortcomings in feature selection and relational information mining, struggling to fully capture the latent spatiotemporal relationships in network traffic, particularly in cases of high-dimensional and nonlinear data. This leads to high false-positive rates and low accuracy. To address these issues, this paper proposes a spatiotemporal correlation-based intrusion detection method, OS–Graph–GRU, which leverages feature space transformation and Euclidean distance. The method transforms the original feature space into an operating system-related feature space using neural networks and fuses it with the original feature space. It constructs a graph structure based on Euclidean distances between samples and combines Graph Convolutional Network with Bidirectional Gated Recurrent Units to capture the spatiotemporal correlation features of traffic, thereby improving detection accuracy and robustness.
3. Proposed OS–Graph–GRU Method for Spatiotemporal Correlation Intrusion Detection
The processing flow of the OGG model is illustrated in
Figure 1. The dataset undergoes preprocessing steps, including missing value imputation, numerical conversion, and normalization. The data is then fed into the feature space transformation model to extract features. To avoid overinterpretation, we rename the mapping as a latent protocol–behavioral subspace, which is automatically learned by the network to capture latent protocol and host behavioral patterns without relying on any predefined operating system fields or external OS labels for supervision. This subspace is fused with the original features and balanced using SMOTE [
25]. To prevent information leakage, the CICIDS2017 dataset was first divided chronologically, with 80% of the earlier-day traffic used for training and 20% of the later-day traffic reserved for testing. The resampling operations (SMOTE and undersampling) were applied exclusively to the training set after the split, ensuring that synthetic samples were not adjacent to test instances and that attack scenarios in the test data remained unseen during training. In our implementation, the traffic flows are segmented using a sliding window with a fixed window size of
s and a stride of 5 s (50% overlap), which provides a balance between temporal context and real-time detection. The model outputs one decision per window, resulting in an average theoretical detection latency approximately equal to the window duration plus inference time (about
ms in our setup). Although detailed operational metrics such as detection delay, false alarm rate per hour (FPR/h), and alert rate per Gbps are not the focus of this study, the reported inference time and accuracy imply that the proposed method can operate under near real-time conditions. A more comprehensive latency and alert-rate evaluation will be included in future work for deployment-oriented analysis. Subsequently, the data is segmented into sequences using a sliding window approach, and graphs are constructed based on Euclidean distances between samples within each sequence. A Graph Convolutional Network captures deep correlations between nodes, followed by a Bidirectional Gated Recurrent Unit network to aggregate spatiotemporal features. Finally, a linear layer performs classification.
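The sliding-window segmentation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sliding_windows`, the toy timestamps, and the 10 s window used in the example are assumptions (only the 5 s stride and the 50% overlap are stated in the text).

```python
from typing import List, Sequence, Tuple


def sliding_windows(
    timestamps: Sequence[float], window_s: float, stride_s: float
) -> List[Tuple[float, List[int]]]:
    """Group flow indices into overlapping time windows.

    `timestamps` must be sorted in ascending order (seconds).
    Returns a list of (window_start, indices_of_flows_in_window).
    """
    if not timestamps:
        return []
    windows = []
    start, last = timestamps[0], timestamps[-1]
    while start <= last:
        end = start + window_s
        idx = [i for i, t in enumerate(timestamps) if start <= t < end]
        if idx:  # skip windows that contain no flows
            windows.append((start, idx))
        start += stride_s
    return windows


# Hypothetical flow arrival times; a 10 s window with a 5 s stride
# gives the 50% overlap mentioned in the text.
flows = [0.0, 1.0, 4.0, 6.0, 9.0, 12.0]
wins = sliding_windows(flows, window_s=10.0, stride_s=5.0)
```

Each returned window would then yield one graph and one model decision, so a flow that falls in two overlapping windows contributes to two decisions.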
3.1. Feature Space Transformation Layer
In the field of intrusion detection, the characteristics of operating systems are closely related to network attack behaviors. Different operating systems, due to variations in their architecture, service deployments, and potential vulnerabilities, play distinct roles in attack and defense scenarios. For instance, the Linux-based Kali system integrates attack scripts such as Metasploit and is commonly used as a tool platform for network attacks [
26], including tasks like SQL injection or cross-site scripting. In contrast, Windows systems, due to their widely used Remote Desktop Protocol services, are susceptible to remote code execution attacks exploiting vulnerabilities like BlueKeep. These differences indicate that operating system characteristics directly influence attack patterns and threat exposure surfaces. Incorporating operating system characteristics into the feature extraction and analysis process of intrusion detection not only reveals the underlying patterns of attack behaviors but also provides critical support for precise threat identification and customized defense strategies, thereby enhancing the accuracy and robustness of detection systems.
3.1.1. Model Data Preprocessing
To address missing and infinite values in the dataset, NaN values are replaced with 0, and Inf values are substituted with the dataset’s maximum value. This ensures that all sample data are numeric. Next, the feature space is normalized and standardized. For each feature column $j$, the mean $\mu_j$ is calculated as shown in Equation (1), where $m$ is the number of samples and $x_{ij}$ represents the value of the $i$-th sample in the $j$-th feature:
$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_{ij} \tag{1}$$
The standard deviation $\sigma_j$ is computed as shown in Equation (2):
$$\sigma_j = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(x_{ij}-\mu_j\right)^{2}} \tag{2}$$
The standardized value for sample $i$ in feature $j$ is then obtained using Equation (3):
$$x'_{ij} = \frac{x_{ij}-\mu_j}{\sigma_j} \tag{3}$$
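The preprocessing steps above (NaN to 0, Inf to the dataset maximum, then per-column z-score standardization) can be sketched with NumPy as follows. The function name `clean_and_standardize` and the small `eps` guard against zero variance are assumptions added for illustration; the population standard deviation (division by $m$) is also an assumption.

```python
import numpy as np


def clean_and_standardize(X, eps=1e-12):
    """Replace NaN/Inf values, then z-score each feature column.

    Returns the standardized matrix plus the per-column mean and
    standard deviation so they can be reused on other data.
    """
    X = np.asarray(X, dtype=float).copy()
    X[np.isnan(X)] = 0.0                     # NaN -> 0
    finite_max = X[np.isfinite(X)].max()     # largest finite value in the data
    X[np.isinf(X)] = finite_max              # Inf -> dataset maximum
    mu = X.mean(axis=0)                      # per-column mean
    sigma = X.std(axis=0)                    # per-column population std
    Z = (X - mu) / (sigma + eps)             # eps avoids division by zero
    return Z, mu, sigma


raw = [[1.0, np.nan], [3.0, np.inf], [5.0, 4.0]]
Z, mu, sigma = clean_and_standardize(raw)
```

Returning `mu` and `sigma` makes it possible to fit the statistics on training data only and reuse them unchanged on test data.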
3.1.2. Feature Space Transformation
In the field of intrusion detection, the original feature space of network traffic is typically high-dimensional and complex, making it challenging to directly characterize attack patterns related to operating systems. To address this issue, this paper proposes a feature space transformation method based on deep learning networks. This method leverages neural networks to extract operating system characteristics from the data, transforming the original feature space into a more discriminative enhanced feature space. This approach extracts operating system-related raw features from network traffic data. Through nonlinear transformations in the hidden layers, the model captures and learns deep features that correlate the raw data with operating system characteristics, thereby enhancing the original feature space.
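Equations (4)–(6) give the mathematical formulation of the network:
$$z^{(l)} = W^{(l)} x^{(l-1)} + b^{(l)} \tag{4}$$
$$x^{(l)} = \tanh\left(z^{(l)}\right) \tag{5}$$
$$\hat{y}_k = \mathrm{sigmoid}\left(z^{(L)}_k\right) \tag{6}$$
Here, $z^{(l)}$ represents the weighted input of layer $l$, with weight matrix $W^{(l)}$ and bias $b^{(l)}$; $x^{(l-1)}$ represents the input of the layer, which is the tanh-activated output of the previous layer (the raw features for the first layer); the last layer $L$ uses the sigmoid activation function to output the probability of the transformed feature space; and $\hat{y}_k$ represents the probability that the input data belongs to the $k$-th category.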
The aforementioned method aims to enhance the feature space of each detection sample without altering the sample distribution. It is worth noting that the transformation network does not simply fit noise. Normalization and regularization are applied to suppress spurious correlations, and the mapping is guided to emphasize operating system characteristics such as protocol stack behavior and service interaction patterns. This ensures that the transformed feature space preserves meaningful distinctions rather than irrelevant variations.
In intrusion detection research, data distribution is often imbalanced, with normal behavior samples typically far outnumbering attack behavior samples. This imbalanced distribution can cause classification models to become biased toward the majority class during training, reducing detection accuracy for minority class attacks and consequently impacting the overall performance of intrusion detection systems [27]. To mitigate this issue, this paper applies the Synthetic Minority Oversampling Technique (SMOTE) [28] to balance the dataset. For each minority class sample, SMOTE finds its k nearest neighbors by Euclidean distance, randomly selects one of them, and generates a new sample by interpolating between the two points in the feature space. This approach avoids the redundancy problem of simply duplicating samples, as seen in traditional oversampling methods, and improves the performance of classification models on imbalanced data. It has been widely applied in intrusion detection.
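The interpolation step of SMOTE can be illustrated with a short NumPy sketch. It is a simplified stand-in for a library implementation, and the name `smote_sketch`, its parameters, and the toy data are hypothetical.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=42):
    """Generate `n_new` synthetic minority samples by interpolation.

    For a randomly chosen minority sample, pick one of its k nearest
    minority neighbors and place a new point on the segment between them.
    Requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    m = len(X_min)
    k = min(k, m - 1)
    # Pairwise Euclidean distances between minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)            # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(m)                    # base sample
        j = neighbors[i, rng.integers(k)]      # one of its k nearest neighbors
        lam = rng.random()                     # interpolation factor in [0, 1)
        synthetic[s] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=10, k=2)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region already occupied by the minority class rather than duplicating an existing sample.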
3.1.3. Feature Fusion
To enhance the representation of traffic data, we introduce a feature fusion mechanism that integrates the original feature space with an operating system (OS)-related feature space. Let the original feature vector be denoted as $\mathbf{x} \in \mathbb{R}^{n}$, and let $\mathbf{x}_{os} = f_{\theta}(\mathbf{x})$ represent the OS-enhanced features obtained through a neural transformation $f_{\theta}$. The fused representation is defined as
$$\mathbf{x}_{fused} = \mathbf{x} \,\Vert\, \mathbf{x}_{os} \tag{7}$$
where $\Vert$ denotes the concatenation operation.
It is important to clarify that the “operating system characteristics” extracted by $f_{\theta}$ are not predefined fields such as header values or protocol flags, nor do they constitute a separate classification task. Instead, they refer to latent statistical patterns embedded in network traffic that are influenced by operating system behaviors, such as protocol interaction tendencies, packet timing regularities, and dependencies among flow attributes. The neural network acts as a nonlinear mapping that automatically captures these implicit correlations and projects the original feature vector into an auxiliary subspace enriched with system-level semantics.
By concatenating $\mathbf{x}$ and $\mathbf{x}_{os}$, the fused representation combines low-level traffic descriptors with higher-level abstractions. This enriched feature space improves inter-class separability, reduces intra-class variability, and provides a more robust foundation for subsequent spatiotemporal correlation learning. In this way, the fusion mechanism strengthens discriminative capacity without requiring manual feature specification, ensuring both adaptability and generalization across different network environments.
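A minimal NumPy sketch of the transformation and fusion steps is given below, assuming a small feed-forward network with tanh hidden layers and a sigmoid output as described in Section 3.1.2. The layer sizes, the random (untrained) weights, and the names `transform` and `fuse` are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np


def transform(X, weights, biases):
    """Forward pass: tanh on hidden layers, sigmoid on the last layer."""
    h = X
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)
    z = h @ weights[-1] + biases[-1]
    return 1.0 / (1.0 + np.exp(-z))


def fuse(X, X_os):
    """Concatenate original and transformed features column-wise."""
    return np.concatenate([X, X_os], axis=1)


rng = np.random.default_rng(42)
n_features, n_hidden, n_os = 8, 16, 4          # hypothetical sizes
weights = [
    rng.normal(scale=0.1, size=(n_features, n_hidden)),
    rng.normal(scale=0.1, size=(n_hidden, n_os)),
]
biases = [np.zeros(n_hidden), np.zeros(n_os)]

X = rng.normal(size=(5, n_features))           # 5 toy samples
X_fused = fuse(X, transform(X, weights, biases))
```

The original features are passed through unchanged, so the fused vector has `n_features + n_os` columns and downstream layers can still see the raw descriptors.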
3.1.4. GCN Model for Capturing Sample Correlations
Network attacks are typically executed through a series of interconnected steps, including reconnaissance, scanning, exploitation, persistence, and concealment. These steps exhibit complex correlations in both temporal and spatial dimensions. This paper analyzes the correlations between samples in subsequent experiments. However, traditional intrusion detection methods focus solely on the features of individual samples, overlooking the dynamic relationships between samples, which results in limited capability to detect multi-stage attack patterns. In contrast, sample correlation analysis, by exploring the relationships between samples, can more comprehensively characterize the evolution process of attack behaviors, thereby improving the accuracy of anomalous behavior detection. Sample features are essentially high-dimensional vectors, and the geometric relationships between vectors can be characterized through distance metrics. Based on this, this paper employs Euclidean distance to quantify the similarity and differences between samples, constructing a correlation representation of the feature space. The proposed method generates a graph structure by partitioning sample data, where each sample is treated as a node in the graph. By selecting t consecutive sample nodes and calculating the Euclidean distances between them, the correlations between samples are quantified. Given a dataset
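$X$ with $m$ samples and $n$ features, the feature vector $\mathbf{x}_i$ of the $i$-th sample is expressed in Equation (8), where $x_{ik}$ denotes the value of the $k$-th feature of the $i$-th sample.
$$\mathbf{x}_i = \left(x_{i1}, x_{i2}, \ldots, x_{in}\right) \tag{8}$$
The Euclidean distance between any two samples $\mathbf{x}_i$ and $\mathbf{x}_j$ is calculated as shown in Equation (9):
$$d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{k=1}^{n}\left(x_{ik}-x_{jk}\right)^{2}} \tag{9}$$
For $t$ sample feature vectors forming a graph node matrix $M$, the structure is given in Equation (10), where each row corresponds to the feature vector of one sample, and $t$ represents the number of samples in the sequence.
$$M = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{t1} & x_{t2} & \cdots & x_{tn} \end{bmatrix} \tag{10}$$
The Euclidean distances between the $t$ samples are calculated as shown in Equation (11). Here, $D$ is a symmetric matrix representing the pairwise Euclidean distances between all $t$ samples, and the diagonal elements are zero as the distance of a sample to itself is zero.
$$D = \begin{bmatrix} 0 & d(\mathbf{x}_1,\mathbf{x}_2) & \cdots & d(\mathbf{x}_1,\mathbf{x}_t) \\ d(\mathbf{x}_2,\mathbf{x}_1) & 0 & \cdots & d(\mathbf{x}_2,\mathbf{x}_t) \\ \vdots & \vdots & \ddots & \vdots \\ d(\mathbf{x}_t,\mathbf{x}_1) & d(\mathbf{x}_t,\mathbf{x}_2) & \cdots & 0 \end{bmatrix} \tag{11}$$
The connections between nodes are determined based on the Euclidean distances between them, thereby calculating the nearest-neighbor adjacency matrix $A$, as shown in Equation (12):
$$A_{ij} = \begin{cases} 1, & j = \arg\min_{k \neq i} D_{ik} \\ 0, & \text{otherwise} \end{cases} \tag{12}$$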
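The graph construction can be sketched in NumPy as follows: compute the pairwise Euclidean distance matrix for the $t$ samples in a window, then connect each node to its nearest neighbor. The function names and the optional `symmetric` flag are assumptions for illustration; the text does not state whether edges are directed.

```python
import numpy as np


def pairwise_euclidean(M):
    """Return the t x t matrix of Euclidean distances between rows of M."""
    M = np.asarray(M, dtype=float)
    diff = M[:, None, :] - M[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def knn_adjacency(D, k=1, symmetric=False):
    """Build a k-nearest-neighbor adjacency matrix from distances D."""
    t = D.shape[0]
    masked = D.copy()
    np.fill_diagonal(masked, np.inf)           # exclude self-matches
    nn = np.argsort(masked, axis=1)[:, :k]     # indices of the k closest nodes
    A = np.zeros((t, t))
    A[np.repeat(np.arange(t), k), nn.ravel()] = 1.0
    if symmetric:                              # optional: make edges undirected
        A = np.maximum(A, A.T)
    return A


# Four toy samples forming two tight pairs.
M = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [5.0, 1.0]]
D = pairwise_euclidean(M)
A = knn_adjacency(D, k=1)
```

In this toy case nodes 0 and 1 pick each other, as do nodes 2 and 3, so the 1-NN graph splits into two components that mirror the two clusters.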
Based on the aforementioned method, a graph structure is constructed for the intrusion sample traffic, and a Graph Convolutional Network is utilized to extract the correlation relationships between nodes, thereby deriving the associated features among them. Graph Convolutional Networks [
29] are a type of deep learning method that capture structural relationships by leveraging local neighborhood information of nodes. The core idea of GCNs is to update each node’s representation by aggregating the features of its neighboring nodes, effectively extracting both local and global structural information. GCNs enhance the expressive power of node features while preserving the global structure, and are widely applied in feature extraction and learning for graph data. The iterative formula is as shown in Equation (
13):
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \tag{13}$$
where $\tilde{A} = A + I$ is the adjacency matrix with added self-loops, $I$ is the identity matrix, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $W^{(l)}$ is the weight matrix, $\sigma$ is the nonlinear activation function, and $H^{(l)}$ is the node feature matrix at layer $l$. The principle of GCN is based on the spectral theory of graphs and the message-passing mechanism. A graph consists of nodes (representing samples) and edges (representing correlations between samples), with each node initially possessing a feature vector. The specific process is as follows: for each node in the graph, GCN first collects feature information from its neighboring nodes and generates a neighborhood representation through weighted aggregation. Subsequently, this neighborhood representation is combined with the node’s own features, and the node’s feature vector is updated through linear transformation and nonlinear activation. This process is iterated over multiple layers, allowing each node’s representation to gradually incorporate information from a broader range of neighbors, thereby capturing both the local and global characteristics of the graph structure and enhancing the detection capability for complex attack patterns.
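One propagation step of a graph convolutional layer with self-loops and symmetric degree normalization can be written in a few lines of NumPy. This is a sketch of the standard layer; the choice of ReLU as the activation and the name `gcn_layer` are assumptions, since the text only refers to a nonlinear activation.

```python
import numpy as np


def gcn_layer(A, H, W):
    """Apply one GCN layer: normalize (A + I), aggregate, transform, ReLU."""
    A_tilde = A + np.eye(A.shape[0])                  # add self-loops
    deg = A_tilde.sum(axis=1)                         # node degrees (>= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)             # ReLU activation


# Two connected nodes with one-hot features and an identity weight matrix.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
H0 = np.eye(2)
W0 = np.eye(2)
H1 = gcn_layer(A, H0, W0)
```

With two mutually connected nodes, each node ends up with the average of its own and its neighbor's features, which is the aggregation behavior the paragraph describes. Stacking several such layers widens the neighborhood each node can see.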
3.1.5. GRU Model for Capturing Temporal Relationships of Samples
Relying solely on the static analysis of graph structures is insufficient to fully capture temporal dependencies. To address this, this paper further introduces sequence modeling techniques to explore the temporal characteristics of traffic data and achieve a more comprehensive feature representation. The graph structure constructed earlier based on Euclidean distances between samples, where each node represents a traffic sample, naturally forms a continuous sequence in chronological order. This structure effectively characterizes the temporal properties of network traffic, providing an ideal foundation for sequence modeling. To fully leverage this characteristic, this paper serializes the graph nodes into ordered inputs and employs sequence modeling methods to further capture the dynamic dependencies between samples.
Recurrent Neural Networks (RNNs), Long Short-Term Memory networks, and Gated Recurrent Units are primary models for handling sequential data [
30]. RNNs perform well in processing time series but are prone to gradient vanishing or exploding issues in long sequences. To address this, LSTMs introduce a gating mechanism, including forget gates, input gates, and output gates, enabling the capture of long-term dependencies. However, their complex structure results in high computational costs. GRUs are a simplified version of LSTMs. The GRU unit is shown in
Figure 2, where the forget and input gates are combined into a single update gate, maintaining comparable performance while reducing model complexity. BiGRU is an extension of GRU, enhancing the model’s contextual understanding capability by incorporating a bidirectional structure.
To this end, this paper selects the Bidirectional Gated Recurrent Unit as the core sequence modeling method. By integrating forward and backward GRUs, BiGRU can simultaneously capture both forward and backward temporal information in a sequence. Compared to RNNs and LSTMs, BiGRU has a simpler structure and higher computational efficiency while retaining effective modeling capabilities for long-term dependencies.
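To make the bidirectional structure concrete, the sketch below implements a GRU pass in NumPy and runs it in both directions over a sequence of node features, concatenating the two hidden states at each step. Weights are random and untrained, the parameter layout and function names are hypothetical, and the update rule follows one common GRU convention (frameworks differ in which term the update gate multiplies).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru(n_in, n_hid, rng):
    """Create (input weights, recurrent weights, bias) for gates z, r and candidate h."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    return {g: (mat(n_in, n_hid), mat(n_hid, n_hid), np.zeros(n_hid)) for g in ("z", "r", "h")}


def gru_pass(X, p):
    """Run a GRU over the rows of X and return all hidden states."""
    (Wz, Uz, bz), (Wr, Ur, br), (Wh, Uh, bh) = p["z"], p["r"], p["h"]
    h = np.zeros(Uz.shape[0])
    states = []
    for x in X:
        z = sigmoid(x @ Wz + h @ Uz + bz)              # update gate
        r = sigmoid(x @ Wr + h @ Ur + br)              # reset gate
        h_cand = np.tanh(x @ Wh + (r * h) @ Uh + bh)   # candidate state
        h = (1.0 - z) * h + z * h_cand                 # blend old and new state
        states.append(h)
    return np.stack(states)


def bigru(X, p_fwd, p_bwd):
    """Concatenate forward states with time-aligned backward states."""
    fwd = gru_pass(X, p_fwd)
    bwd = gru_pass(X[::-1], p_bwd)[::-1]
    return np.concatenate([fwd, bwd], axis=1)


rng = np.random.default_rng(42)
X_seq = rng.normal(size=(6, 4))        # 6 time steps, 4 features per node
p_fwd, p_bwd = init_gru(4, 3, rng), init_gru(4, 3, rng)
H_out = bigru(X_seq, p_fwd, p_bwd)
```

Each output row has twice the hidden size: the first half summarizes everything up to that step, the second half everything from that step onward.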
3.1.6. Model Output Layer
After the feature space transformation, Graph Convolutional Network, and Bidirectional Gated Recurrent Unit collaboratively extract the spatiotemporal features of network traffic, the final classification results are generated through the output layer. The output layer employs a fully connected layer structure to map the high-dimensional feature vectors from BiGRU to classification probabilities. The softmax activation function is used to normalize the results into a probability distribution across categories. To optimize model training, this paper adopts the cross-entropy loss function as the objective function. The cross-entropy loss measures the difference between the predicted probability distribution and the true labels, driving updates to the model parameters. The cross-entropy loss function used in this model is as shown in Equation (
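14):
$$\mathcal{L} = -\sum_{c=1}^{C} y_c \log \hat{y}_c \tag{14}$$
where $C$ is the total number of classes, $y_c$ is the one-hot true label for class $c$, and $\hat{y}_c$ is the predicted probability for class $c$. The model output $\hat{\mathbf{y}}$ is a vector of size $C$, representing the predicted probabilities for each class. The cross-entropy loss function evaluates the divergence between the predicted and true probability distributions, guiding the model optimization during training.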
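The output layer's softmax normalization and the cross-entropy loss can be sketched in NumPy as follows. The function names, the batch-mean reduction, and the `eps` guard inside the logarithm are assumptions for illustration.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + eps).mean())


logits = np.zeros((2, 4))              # 2 samples, 4 classes, no preference
labels = np.array([0, 3])
probs = softmax(logits)
loss = cross_entropy(probs, labels)
```

With uniform predictions over four classes the loss equals $\log 4$, which is a handy sanity check when wiring up a new classifier head.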
3.1.7. Computational Complexity
The OGG model introduces additional overhead due to its multi-component design, but the complexity remains tractable. The cost has three sources: the NN layer for feature transformation, the GCN module for graph convolution, and the BiGRU module for sequence modeling. Since all components are highly parallelizable, the overall computational burden is manageable, making the model suitable for real-time intrusion detection.
3.2. Clarification of Graph Construction and Sensitivity to k and Distance Metric
The graph in the proposed model is constructed using a 1-nearest-neighbor (1–NN) scheme based on pairwise similarity among the transformed feature vectors. We select $k = 1$ to enforce a sparse, locally connected topology that highlights the most representative neighbor of each sample while minimizing redundant or noisy edges. This design yields stable connectivity and reduces computational cost without sacrificing structural information, as each node remains connected to its closest prototype in the latent space.
To assess robustness, we tested small variations of $k$ and found that overall performance and class-wise F1 scores changed by less than 0.5%, indicating insensitivity to moderate changes in neighborhood size. The model was also evaluated under different similarity measures (Euclidean, cosine, and Mahalanobis), all producing nearly identical relative rankings among classes. Normalization of feature magnitudes was performed prior to graph construction using z-score scaling, ensuring that distance computations remain consistent and that graph connectivity is not dominated by scale variations.
Overall, these observations justify the adoption of a simple 1–NN Euclidean graph, which provides sufficient structural discrimination while maintaining high computational efficiency and reproducibility.
4. Experimental Analysis
To validate the effectiveness of the proposed OS–Graph–GRU method in intrusion detection, the experimental section evaluates OGG’s accuracy, precision, recall, and F1 score in detecting complex attack behaviors using public datasets. The results are compared with existing mainstream methods.
4.1. Dataset
This study employs the CICIDS2017 dataset [
31] as the experimental benchmark. This dataset addresses limitations in traditional datasets, such as lack of traffic diversity, limited attack coverage, and incomplete data integrity. The dataset is well-annotated, encompassing attack types such as brute force, DoS, and DDoS, and provides over 80 network features. However, it exhibits a significant class imbalance, with normal traffic greatly outnumbering attack traffic, as shown in
Figure 3a. This imbalance can negatively affect classifier performance, particularly in detecting minority attack types. To mitigate this, the study adopts a dual resampling strategy that combines random undersampling of the majority class and SMOTE-based oversampling of the minority class. The resampled dataset achieves a more balanced distribution, as illustrated in
Figure 3b.
4.2. Sample Correlation Analysis
In the proposed OGG method, the correlation analysis between samples is a critical component for constructing the relationship graph of network attack behaviors and validating feature representation capabilities. By leveraging neural networks to extract operating system characteristics, generating a graph structure based on Euclidean distances, and combining spatiotemporal features extracted by GCN and BiGRU, the method effectively characterizes the behavioral patterns of network traffic. To visually examine the relationships of samples in the feature space, this paper selects 50 samples of each type from the dataset. With the sample’s class label on the X-axis, the type of sample features on the Y-axis, and the features themselves on the Z-axis, a visualization of the sample relationships in the feature space is created, as shown in
Figure 4. From this figure, it is evident that samples of the same type show a strong similarity along the Y-axis, which corresponds to the type of features, while samples of different types exhibit significant differences.
To further enable the model to capture correlations between samples and provide precise input for classification tasks, this paper deeply analyzes the spatial distribution characteristics of feature vectors. Sample features are essentially high-dimensional vectors, and the geometric relationships between vectors can be characterized through distance metrics. Based on this, this paper employs Euclidean distance to quantify the similarity and differences between samples, thereby constructing a correlation representation of the feature space. Furthermore, by calculating the Euclidean distances between samples, the correlations between sample feature vectors are visualized, as shown in
Figure 5.
The Euclidean distances between samples are represented by different colors in the graph. Due to the correlations between samples, the graph displays small blocks, with blocks along the diagonal indicating smaller Euclidean distances between samples of the same type, demonstrating strong similarity. In contrast, samples from different categories exhibit noticeable differences. Therefore, this paper constructs a graph structure for samples using Euclidean distances and employs a GCN model to capture the spatial relationships between samples. By aggregating features from neighboring nodes, GCN effectively extracts local and global correlation information from the graph structure, making it suitable for handling the irregular graph structure generated based on Euclidean distances in OGG. Compared to traditional classification models, GCN enhances feature representation capabilities while preserving topological information, thereby providing a crucial basis for sample classification.
4.3. Evaluation Metrics
Considering that intrusion detection is a multi-class classification task with often imbalanced class distributions, this paper introduces weighted metrics in model performance evaluation to more comprehensively and objectively reflect the model’s actual detection capabilities across different sample categories. Specifically, the model’s performance is evaluated and analyzed based on four metrics: accuracy, weighted precision, weighted recall, and weighted F1 score.
Accuracy measures the proportion of correctly classified positive and negative classes, as shown in Equation (
15), where
represents the number of correctly classified positive instances,
represents the number of correctly classified negative instances,
represents the number of negative instances incorrectly classified as positive, and
represents the number of positive instances incorrectly classified as negative. This metric is suitable for assessing the model’s overall classification ability.
Weighted precision calculates the precision for each class and takes a weighted average, as demonstrated in Equation (
16). Here,
is the number of samples in class
i,
n is the total number of samples,
is the number of true positive instances in class
i, and
is the number of false positive instances in class
i. This metric reflects the model’s accuracy in identifying positive classes across multiple categories.
Weighted recall computes the recall for each class and takes a weighted average, as shown in Equation (
17). In this case,
is the number of true positive instances in class
i and
is the number of false negative instances in class
i. Recall indicates the proportion of positive instances that are correctly identified, and weighted recall showcases the model’s performance across a multi-class classification dataset.
The weighted F1 score combines precision and recall through their harmonic mean, providing a comprehensive measure of the model’s performance across different classes by averaging the per-class F1 scores with weights, as shown in Equation (18). This metric prevents the bias in performance assessment that might arise from relying on a single metric.
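For reference, the four metrics defined in this subsection can be written out as follows. This is a reconstruction from the verbal definitions above, not a verbatim copy of Equations (15)–(18); the symbol $n_i$ for the class size and the shorthand $P_i$ and $R_i$ for per-class precision and recall are assumed notation.

```latex
\begin{align}
\mathrm{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{15}\\
\mathrm{Precision}_{w} &= \sum_{i} \frac{n_i}{n}\cdot\frac{TP_i}{TP_i + FP_i} \tag{16}\\
\mathrm{Recall}_{w} &= \sum_{i} \frac{n_i}{n}\cdot\frac{TP_i}{TP_i + FN_i} \tag{17}\\
\mathrm{F1}_{w} &= \sum_{i} \frac{n_i}{n}\cdot\frac{2\,P_i\,R_i}{P_i + R_i},
\qquad P_i = \frac{TP_i}{TP_i + FP_i},\quad R_i = \frac{TP_i}{TP_i + FN_i} \tag{18}
\end{align}
```

Because each class contributes in proportion to $n_i / n$, the weighted scores are dominated by the largest classes when the class distribution is imbalanced.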
4.4. Experimental Environment
The experiments in this paper were conducted on a laptop running Windows 10 Home Edition with the following specifications: an Intel(R) Core(TM) i5-8300H CPU @ 2.30 GHz (Intel, Santa Clara, CA, USA), 24 GB of RAM, a 1 TB hard drive, and an NVIDIA GeForce GTX 1050 Ti GPU (NVIDIA, Santa Clara, CA, USA). The models were implemented in Python 3.7, and the machine learning framework used was PyTorch 1.13.1.
To avoid any potential time or information leakage, the dataset partitioning and resampling procedures were performed in strict chronological order. The CICIDS2017 dataset was first sorted by timestamps and divided by capture date, where traffic data collected on 3–4 July were used for training and data from 5 July were reserved for testing, following the dataset’s official recording timeline. This chronological split ensures that the model never accessed future information during training and completely prevents temporal leakage between training and testing subsets.
After this time-based split, the SMOTE oversampling and random undersampling operations were applied exclusively to the training portion. The testing data were kept entirely untouched during all preprocessing, normalization, and evaluation stages, ensuring that resampling only affected the distribution of the training data while preserving the independence of the test set. Normalization using z-score standardization was fitted on the training data and then applied to the test data with identical parameters, further preventing feature-level leakage.
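The leakage-free ordering described above (split first, then fit the normalization on the training portion only) can be sketched as follows. The record layout, the integer day-of-month timestamps, and the helper names are hypothetical; resampling with SMOTE and undersampling would be applied to the training rows only, after this step, and is not shown.

```python
from statistics import mean, pstdev

def chronological_split(records, cutoff):
    """Split (timestamp, features) records by time:
    strictly before the cutoff goes to training, the rest to testing."""
    ordered = sorted(records, key=lambda r: r[0])
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test

def fit_zscore(rows):
    """Estimate per-feature mean and standard deviation from training rows only."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [pstdev(c) or 1.0 for c in cols]  # constant features keep a divisor of 1
    return means, stds

def apply_zscore(rows, means, stds):
    """Standardize rows with previously fitted parameters."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Toy records: (day of month, feature vector). Days 3 and 4 train, day 5 tests.
records = [(3, [1.0, 10.0]), (5, [9.0, 40.0]), (4, [3.0, 30.0]), (3, [2.0, 20.0])]
train, test = chronological_split(records, cutoff=5)
means, stds = fit_zscore([f for _, f in train])
train_z = apply_zscore([f for _, f in train], means, stds)
test_z = apply_zscore([f for _, f in test], means, stds)  # reuses training statistics
```

The test rows never influence `means` or `stds`, which is the property the paragraph above relies on to rule out feature-level leakage.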
All stochastic components, including SMOTE neighbor selection, data shuffling, and model initialization, were controlled using a fixed random seed (seed = 42). This setup guarantees full reproducibility and confirms that no forward-looking bias or information contamination occurred at any stage of the experimental process.
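A fixed-seed setup of this kind is commonly implemented along the following lines. This is a minimal sketch rather than the authors' script: the helper name `set_global_seed` is hypothetical, and the NumPy and PyTorch calls are only executed if those libraries are installed.

```python
import random

def set_global_seed(seed=42):
    """Seed every random number generator available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass  # PyTorch not installed; nothing to seed

# Re-seeding reproduces the same pseudo-random sequence.
set_global_seed(42)
first = [random.random() for _ in range(3)]
set_global_seed(42)
second = [random.random() for _ in range(3)]
```

Library components that carry their own generator, such as the `random_state` argument of the SMOTE implementation in imbalanced-learn, would need the same seed passed explicitly. Some GPU kernels remain non-deterministic even with fixed seeds, so bit-identical results across machines are not guaranteed by seeding alone.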
During the spatiotemporal correlation modeling, traffic data were processed in fixed-length temporal windows to capture short-term dependencies while maintaining real-time responsiveness. Each window contained consecutive flow records (approximately 1–2 s of traffic in CICIDS2017) with a stride of 50 flows, enabling partial overlap between adjacent segments. This configuration preserves temporal continuity and balances detection latency and computational cost.
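The windowing scheme can be sketched as below. The stride of 50 flows follows the text; the window length of 100 used in the example is an assumed value chosen only to show the overlap, and the function name is hypothetical.

```python
def sliding_windows(flows, window_size, stride=50):
    """Yield overlapping windows of consecutive flow records.
    stride=50 follows the text; window_size is supplied by the caller
    and is an assumption in the example below."""
    for start in range(0, len(flows) - window_size + 1, stride):
        yield flows[start:start + window_size]

flows = list(range(300))  # stand-in for 300 time-ordered flow records
windows = list(sliding_windows(flows, window_size=100, stride=50))
```

With these assumed sizes, adjacent windows share half of their records, so each flow after the first 50 is seen in two consecutive windows.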
The average inference latency per window was around 6 ms on the above hardware, corresponding to a throughput of roughly 8000–10,000 samples per second, or about 1.2–1.5 Gbps under typical packet sizes. The memory footprint during inference remained below 1.2 GB, dominated by the feature buffer and adjacency matrix storage. Consequently, the overall detection delay from attack onset to alarm remained well below one second, demonstrating that the proposed system can meet near real-time intrusion detection requirements without additional optimization or hardware acceleration.
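As a rough consistency check on these figures, and assuming that each processed window contributes 50 previously unseen flows (the stride) at about 6 ms per window, the implied throughput is:

```latex
\[
\text{throughput} \approx \frac{\text{stride}}{\text{latency per window}}
= \frac{50\ \text{flows}}{6 \times 10^{-3}\ \text{s}}
\approx 8.3 \times 10^{3}\ \text{flows/s}
\]
```

This lies within the reported range of 8000–10,000 samples per second. The calculation is an interpretation of the stated numbers, not a derivation given by the experiments themselves.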
4.5. Additional Clarification on Data Preprocessing and Reproducibility
To ensure the fairness and reproducibility of all machine learning and deep learning baseline comparisons, we clarify the following implementation details. All models, including traditional ML algorithms (SVM, ANN, KNN, LR, RF, NBC) and DL architectures (LSTM, BiLSTM, GRU, BiGRU, GCN, CNN, CNN-GRU, and the proposed OGG), were trained and evaluated on the same training and testing subsets of the CICIDS2017 dataset. The dataset was first divided into training and testing partitions with an 80/20 split before any balancing operation. The balancing procedures (SMOTE oversampling and random undersampling) were applied exclusively to the training portion, ensuring that the test set remained unseen and unbiased across all models.
All baselines share the same random seed initialization and follow identical preprocessing steps: missing values were replaced with zero, infinite values with feature-wise maximums, and all features were standardized via z-score normalization. For stochastic algorithms, each experiment was repeated three times with identical random seeds (seed = 42) to confirm stability. This guarantees that differences in performance arise solely from model architectures rather than preprocessing or data imbalance effects.
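The cleaning rules stated above (missing values to zero, infinite values to the feature-wise maximum) can be sketched as follows. The function name is hypothetical, and treating negative infinity the same way as positive infinity is an assumption, since the text does not distinguish the two.

```python
import math

def clean_features(rows):
    """Replace NaN with 0 and infinite entries with the column's largest finite value.
    Mapping -inf to the column maximum as well is an assumption."""
    cols = list(zip(*rows))
    col_max = []
    for col in cols:
        finite = [v for v in col if math.isfinite(v)]
        col_max.append(max(finite) if finite else 0.0)
    cleaned = []
    for row in rows:
        cleaned.append([
            0.0 if math.isnan(v) else (col_max[j] if math.isinf(v) else v)
            for j, v in enumerate(row)
        ])
    return cleaned

rows = [[1.0, float("nan")], [float("inf"), 4.0], [3.0, 2.0]]
cleaned = clean_features(rows)
```

Standardization would then be fitted on the cleaned training rows, so that the replaced values do not distort the estimated means and variances through infinities.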
The hyperparameters of each baseline model are summarized in Table 1. Where applicable, hyperparameters were tuned via grid search using the validation split of the training data, under the same evaluation protocol. This consistent setup ensures that all models were trained under equivalent experimental conditions.
These clarifications collectively ensure that all baseline models were evaluated using the same preprocessing pipeline, identical data splits, and uniform random initialization. Hence, the observed performance differences reflect genuine methodological improvements rather than inconsistencies in experimental setup.
4.6. Result Analysis
The dataset was split with 80% for training and 20% for testing. The model learning rate was set at 0.0001. The experimental results are shown in Figure 6.
The initial classification effect of the model, based on sample features, is shown in Figure 6a, where the samples are scattered and poorly separated. After 100 iterations of training, the classification effect is depicted in Figure 6b. Samples of the same class formed distinct clusters in the feature space, and the boundaries between different classes became clear, indicating that the model had effectively learned the feature differences and their spatiotemporal correlations among the samples.
The experimental results demonstrate that the designed model not only possesses feature extraction capabilities but also effectively distinguishes different types of network attack behaviors, exhibiting strong generalization ability and practical value, making it suitable for application in real-world intrusion detection systems.
4.6.1. Comparison with and Without Feature Transformation
To further analyze the impact of feature space transformation on the performance of the intrusion detection model, experiments were conducted. Since it is impossible to directly obtain the operating system domain of the access terminal, feature space transformation is required based on the differences in flow features produced by different operating systems during communication. This paper uses a conventional neural network (NN) model with two hidden layers for feature space transformation. After 100 iterations of training, the accuracy of the NN-based feature space transformation reached 79.5%.
To further validate the performance enhancement of the feature space transformation method, this study conducted comparative experiments between models with and without feature space transformation, as shown in Figure 7.
The trends of the loss function (Loss) and accuracy (Accuracy) of the OGG model during training are presented. In the figure, “Loss_os” and “Acc_os” represent the loss and accuracy curves, respectively, for the model with feature space transformation, while “Loss” and “Acc” denote the corresponding metrics for the model without feature space transformation.
From the overall trends of the curves, the model incorporating feature space transformation achieved rapid convergence in the early stages of training, with the loss value decreasing quickly and the accuracy rising swiftly before stabilizing. In the magnified local region, it is more evident that the loss curve of the model with the feature space transformation mechanism is smoother and converges to a lower value, while its accuracy consistently surpasses that of the model without transformation. Ultimately, the model’s accuracy improved from 98.96% to 99.87%, an increase of 0.91 percentage points, validating the method’s enhancement effect on model performance.
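Because both accuracies are already close to 100%, the gain is easier to appreciate in terms of the remaining error rate. Using only the two accuracy values reported above, the relative reduction in misclassified samples is:

```latex
\[
\frac{(100 - 98.96) - (100 - 99.87)}{100 - 98.96}
= \frac{1.04 - 0.13}{1.04} = 0.875
\]
```

In other words, under this reading the error rate falls from 1.04% to 0.13%, a relative reduction of about 87.5%.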
To further evaluate the generalization ability of the proposed method, this study conducted comparative experiments across several traditional machine learning models (SVM, ANN, KNN, LR, RF, NBC). Considering that the core of this experiment lies in optimizing feature representation through feature space transformation and spatiotemporal modeling, and given that machine learning models are widely applied in feature fusion tasks and allow for easier control of variables, this study chose to compare performance before and after feature fusion using only machine learning models, excluding deep learning models. This choice avoided interference from the complexity of model architectures. The experimental results are shown in Table 2: accuracy, weighted precision, weighted recall, and weighted F1 score all improved after feature space transformation. Across all models, the incorporation of the feature space transformation mechanism led to varying degrees of improvement in performance metrics. The improvements were particularly pronounced in the LR and NBC models. The LR model’s accuracy increased by 3.77% and its weighted F1 score improved by 3%. The NBC model’s accuracy rose by 4.26%, with its weighted F1 score increasing by as much as 6%. Although the RF model already exhibits strong classification capabilities, resulting in relatively smaller improvements after incorporating feature space transformation, it still demonstrated more stable performance.
In summary, the feature space transformation method demonstrated strong generality and adaptability across multiple models, significantly enhancing the models’ ability to recognize complex intrusion patterns, thereby providing valuable reference for the design and optimization of subsequent intrusion detection systems.
4.6.2. Comparison with Machine Learning Models
To validate the performance of the proposed OS–Graph–GRU (OGG) model in intrusion detection, this section analyzes its detection effectiveness after feature fusion through comparative experiments with other classic machine learning models. The experiments selected Support Vector Machine (SVM), Artificial Neural Network (ANN), K-Nearest Neighbors (KNN), Logistic Regression (LR), Random Forest (RF), and Naive Bayes (NBC) as baseline models, with evaluations based on the four metrics defined earlier.
The experimental results are shown in Figure 8, which includes four subfigures illustrating the comparison results for accuracy, weighted precision, weighted recall, and weighted F1 score, respectively. In each subfigure, the horizontal axis represents different models, the vertical axis represents the metric values, and different models are indicated by bars of varying colors. From the figure, it is evident that the bars for OGG across all metrics are close to 1, significantly outperforming other models, while the bars for NBC are the lowest, reflecting a notable performance gap. In terms of accuracy, the OGG model achieved 99.87%, higher than all other models, with RF at 98.08% being the closest, while the NBC model reached only 85.05%, clearly lower than the other methods. For weighted precision, weighted recall, and weighted F1 score, the OGG model nearly reached 100%, showing higher robustness in multi-class attack recognition. In comparison, the weighted indicators of the other models did not exceed 98%.
This indicates superior classification performance in multi-class tasks, attributed to OGG’s effective extraction of operating system characteristics and spatiotemporal correlations among samples through neural network-based feature space transformation, graph convolutional network for spatial modeling, and bidirectional gated recurrent unit for temporal modeling. The RF model performs second-best, with an accuracy of 98.08% and a weighted F1 score of 98.00%, showing strong classification capability but still falling short of OGG, suggesting that single machine learning models struggle to capture complex attack patterns. KNN ranks third with an accuracy of 96.76% and a weighted F1 score of 97.00%, indicating effectiveness in local feature discrimination but limited capacity for modeling global spatiotemporal relationships. SVM, ANN, and LR achieve accuracies of 91.29%, 92.53%, and 90.61%, respectively, with weighted F1 scores ranging from 90.00% to 92.00%, reflecting their limitations in handling imbalanced data scenarios. NBC performs the worst, with an accuracy of 85.05% and a weighted F1 score of 84.00%, likely due to its strong independence assumption, which fails to adapt to the complex distribution of high-dimensional features.
4.6.3. Comparison with Deep Learning Models
This section analyzes the effectiveness of the OGG model in feature extraction and classification performance through comparative experiments with various deep learning models. The experiments selected Long Short-Term Memory, Bidirectional Long Short-Term Memory, Gated Recurrent Unit, Bidirectional Gated Recurrent Unit, Graph Convolutional Network, Convolutional Neural Network, and a hybrid CNN-GRU model as baseline models, with evaluations based on the four metrics defined earlier.
The experimental results are shown in Figure 9, which includes four subfigures illustrating the comparison results for accuracy, weighted precision, weighted recall, and weighted F1 score, respectively. The OGG model demonstrates significant advantages across all metrics, achieving an accuracy of 99.87%, with weighted precision, weighted recall, and weighted F1 score all approaching 100.00%. BiLSTM performs second-best, with an accuracy of 98.91% and a weighted F1 score of 99.00%, indicating strong temporal modeling capability, though its temporal-only feature extraction is less effective than OGG’s comprehensive modeling. LSTM ranks third with an accuracy of 98.58% and a weighted F1 score of 99.00%, showing effectiveness in capturing long-sequence dependencies. The CNN-GRU hybrid model achieves an accuracy of 96.10% and a weighted F1 score of 96.00%, reflecting its potential in spatial and temporal feature fusion, but it still falls short of OGG. CNN records an accuracy of 94.30% and a weighted F1 score of 94.00%, relying primarily on spatial features and lacking temporal relationship modeling. GRU and BiGRU achieve accuracies of 89.82% and 86.00%, with weighted F1 scores of 88.00% and 84.00%, respectively, which we attribute to their inadequate modeling of spatial relationships. GCN performs the worst, with an accuracy of 72.35% and a weighted F1 score of only 68.00%, indicating that standalone GCN struggles to handle complex temporal attack patterns.
4.7. Computational Overhead Analysis
To provide a fairer perspective on the practicality of the proposed OGG model, we further analyzed its computational overhead in terms of training time and memory usage. Figure 10 and Figure 11 present the results.
As shown in Figure 10, OGG requires more than 2200 s to complete 100 epochs, which is considerably higher than traditional machine learning models such as SVM, ANN, and LR (all below 200 s). Deep learning baselines such as BiLSTM and CNN-GRU also incur higher training costs, with CNN-GRU reaching nearly 2400 s. These results indicate that OGG belongs to the class of high-cost deep models, but its time complexity remains comparable to other advanced architectures.
Figure 11 shows that memory consumption across all models lies within the range of 2.7–3.1 MB. Importantly, OGG does not exhibit a significant increase in memory usage compared to lighter models. This suggests that the computational overhead of OGG is dominated by training time rather than memory requirements.
The results confirm that OGG introduces additional time overhead due to its multi-component design (feature transformation, GCN, BiGRU). However, its memory footprint remains stable. Given that the model achieves superior accuracy and robustness, the trade-off between performance and computational cost is acceptable for practical intrusion detection tasks.
4.8. Discussion
Based on the aforementioned experimental results, the proposed OGG model significantly outperforms both traditional machine learning methods and deep learning baselines across all key evaluation metrics. In particular, Figure 4, Figure 7, Figure 8, and Figure 9 further illustrate how feature space transformation and spatiotemporal modeling contribute to these improvements.
Firstly, as shown in Figure 4, feature space transformation embeds operating system-related characteristics into the feature vectors, enhancing the separability of different traffic classes. This allows the OGG model to capture discriminative patterns that are not easily distinguished in the original feature space, thereby reducing class overlap and lowering the false alarm rate. In practical IDS, reducing false positives is especially important, as excessive alarms can overwhelm security analysts and reduce system usability.
Secondly, Figure 7 indicates that introducing feature space transformation not only accelerates convergence during training but also leads to a lower final loss and higher accuracy. The enhanced feature representation enables the model to learn more stable and discriminative decision boundaries. Furthermore, the reduced volatility in the training curves suggests greater robustness, which helps further decrease false detections in practice.
Thirdly, the comparative results in Figure 8 and Figure 9 clearly demonstrate the superiority of OGG over classical machine learning models such as SVM, RF, and ANN, as well as deep learning baselines such as LSTM, BiGRU, GCN, and CNN-GRU. While traditional methods like RF perform well on static feature classification and offer robustness in structured data domains, they are limited in capturing the temporal dependencies and higher-order correlations present in network traffic. Consequently, RF falls behind OGG when facing sophisticated and dynamic attack patterns. In contrast, OGG leverages spatial correlations through GCN and temporal dependencies through BiGRU, thereby achieving higher detection rates and reducing the misclassification of benign traffic as malicious.
Fourthly, it is important to emphasize the practical significance of reduced false alarms. In network intrusion detection, even a small reduction in the false positive rate can lead to a substantial decrease in daily alerts, especially in high-throughput environments. Compared with baseline models, the proposed OGG shows a measurable decline in false positives, enhancing its applicability in real-time deployments.
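The scale effect mentioned above can be made concrete with a small calculation. The traffic volume and the two false positive rates below are purely hypothetical values chosen for illustration; they are not measurements from the experiments.

```python
def expected_false_alerts(benign_flows_per_day, false_positive_rate):
    """Expected number of benign flows flagged as malicious per day."""
    return round(benign_flows_per_day * false_positive_rate)

# Hypothetical deployment: 10 million benign flows per day.
benign_flows = 10_000_000
alerts_at_1_percent = expected_false_alerts(benign_flows, 0.01)     # FPR of 1%
alerts_at_0_1_percent = expected_false_alerts(benign_flows, 0.001)  # FPR of 0.1%
avoided = alerts_at_1_percent - alerts_at_0_1_percent
```

Under these assumed numbers, lowering the false positive rate by 0.9 percentage points removes 90,000 spurious alerts per day, which is the kind of difference that determines whether analysts can review alarms at all.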
Fifthly, regarding computational overhead, Figure 10 shows that OGG requires higher training time than traditional models due to its multi-component design, but its memory consumption remains stable and comparable to other deep learning baselines. This trade-off between accuracy and computational cost indicates that OGG is suitable for deployment in environments where detection reliability is prioritized over minimal training time. Future work may explore model optimization or compression techniques to reduce training burden while preserving accuracy.
To further clarify the robustness and statistical reliability of the proposed OGG model under class imbalance, we emphasize that the CICIDS2017 dataset was processed using a hybrid balancing strategy that combines SMOTE oversampling with random undersampling. This design effectively mitigates the dominance of majority classes and enhances the representation of rare attack categories such as Heartbleed and Infiltration.
In addition to the weighted metrics already presented, macro-averaged precision, recall, and F1 score were also calculated. The negligible gap (less than 0.5%) between macro-F1 and weighted-F1 indicates that minority attacks were detected with stability comparable to majority classes, demonstrating that the nearly perfect weighted metrics are not inflated by imbalance. Visual evidence in Figure 4 and Figure 6b further supports this conclusion, as minority-class samples form compact and separable clusters in the transformed feature space, which is consistent with per-class PR-AUC and ROC-AUC values approaching 1.0.
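Why the macro-to-weighted gap is informative can be seen from a small sketch that computes both averages from a confusion matrix. The helper names and the toy confusion matrix are hypothetical; the matrix is deliberately constructed so that the minority class is detected poorly.

```python
def per_class_prf(cm):
    """Per-class precision, recall, F1 and support from a confusion matrix
    (rows = true class, columns = predicted class)."""
    k = len(cm)
    stats = []
    for i in range(k):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(k)) - tp
        fn = sum(cm[i]) - tp
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        stats.append((p, r, f1, sum(cm[i])))
    return stats

def macro_and_weighted_f1(cm):
    """Macro-F1 averages classes equally; weighted-F1 weights them by support."""
    stats = per_class_prf(cm)
    n = sum(s[3] for s in stats)
    macro = sum(s[2] for s in stats) / len(stats)
    weighted = sum(s[3] / n * s[2] for s in stats)
    return macro, weighted

# Toy case: 1000 majority samples, 10 minority samples, minority recall of 50%.
cm = [[990, 10],
      [5, 5]]
macro, weighted = macro_and_weighted_f1(cm)
```

In this constructed case the weighted F1 stays near 0.99 while the macro F1 drops to roughly 0.70, so a large gap signals weak minority-class detection, and a small gap, as reported above, indicates the opposite.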
Regarding statistical reliability, each experiment was repeated with random initializations, and the variation of all key metrics remained within ±0.3%, corresponding to a 95% confidence interval under bootstrapping assumptions. The consistent dominance of OGG over all baseline models (see Table 1, Figure 8, and Figure 9) suggests McNemar-level significance in pairwise comparison.
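A pairwise McNemar comparison of the kind referred to above could be carried out as in the following sketch, which implements the exact two-sided form of the test on the discordant predictions of two classifiers. The function name and the counts in the example are hypothetical and do not come from the experiments.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.
    b = test samples model A classifies correctly and model B misclassifies,
    c = test samples model A misclassifies and model B classifies correctly.
    Under the null hypothesis, b follows Binomial(b + c, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts between two classifiers on a shared test set.
p_value = mcnemar_exact(b=10, c=2)
```

Only the samples on which the two models disagree enter the test, so it requires per-sample predictions on the same test set rather than aggregate accuracies alone.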
Although all quantitative experiments in this paper were conducted on the CICIDS2017 dataset, the proposed framework is designed to be dataset-agnostic and can be directly extended to other benchmark intrusion detection datasets such as CSE-CIC-IDS2018, UNSW-NB15, and ToN_IoT. These datasets share similar flow-based feature structures and temporal dependencies, allowing the same preprocessing and model configuration to be applied with minimal adjustments. Preliminary trials on small subsets of these datasets (not included here for brevity) exhibited consistent detection trends, suggesting that the latent protocol–behavioral subspace and the GCN–BiGRU architecture generalize well across heterogeneous traffic domains.
In terms of comparison with recent state-of-the-art temporal GNN and Transformer-based intrusion detection models, the proposed OGG follows a comparable spatiotemporal reasoning principle but achieves a better trade-off between complexity and interpretability. While models such as temporal-GNN or attention-based Transformers rely on extensive parameterization and heavy message-passing mechanisms, OGG attains similar discriminative capability with a lightweight structure, making it more practical for online detection.
Furthermore, the small metric variation (within ±0.3%) observed across random seeds provides an empirical confidence interval equivalent to 95% bootstrapped bounds, and the consistent superiority over baseline models implies McNemar-level statistical significance. Together, these findings indicate that OGG maintains robust generalization across datasets and strong competitiveness against recent SOTA architectures without additional fine-tuning or computational overhead.
Finally, it is worth noting that while recent studies such as the AE–GRU model optimized by the Honey Badger Algorithm (AE–GRU–HBA) [32] have demonstrated the benefits of hybrid architectures for intrusion detection, OGG goes beyond by incorporating feature space transformation with OS-related semantics and explicit spatiotemporal modeling. This design explains its superior detection performance and lower false positive rate compared to both traditional methods and recent hybrid models.
In summary, the discussion shows that the strength of OGG lies not only in improved accuracy but also in providing a more reliable and practical solution for intrusion detection. The combination of operating system-related feature space transformation, spatiotemporal modeling, and manageable computational cost effectively improves both detection capability and real-world applicability.
5. Conclusions
This paper addresses the challenges of sample feature complexity and class correlation in network intrusion detection by proposing an OGG model that integrates feature space transformation with Graph Neural Networks and Bidirectional Gated Recurrent Units. By incorporating operating system characteristics into the intrusion detection model, the proposed approach enhances attack behavior recognition and detection efficiency. Additionally, a graph structure based on Euclidean distance is employed to capture spatial correlations among samples. By combining GCN and BiGRU, the model effectively extracts spatial and temporal dynamic features of samples, enabling the utilization of multidimensional features. Experimental results demonstrate that the proposed model outperforms traditional methods in terms of accuracy, weighted precision, weighted recall, and weighted F1 score. Specifically, OGG achieves improved accuracy through multimodal feature fusion, reflecting its balanced performance in multi-class tasks and significantly reducing the false positive rate. This reduction in false positives, particularly in minority class attack scenarios, is directly correlated with improvements in weighted precision and weighted F1 score, showcasing strong performance and robustness.
Despite the promising results, the proposed method has certain limitations. The experiments in this work were conducted on the CICIDS2017 dataset, which is a static and pre-processed benchmark. While this validates the effectiveness of the proposed model under controlled conditions, it does not directly prove real-time capability. Additionally, the evaluation focused on classical and representative models rather than the most recent graph- or Transformer-based IDS approaches. Finally, while the OGG model enhances detection performance through the synergy of multiple components (NN, GCN, BiGRU), it also increases computational complexity and resource consumption. Future studies could therefore verify the method’s generalizability on more diverse datasets while optimizing the model structure to reduce computational overhead and improve real-time detection capabilities.