5.2. Comparative Study
To evaluate our proposed framework (PseudoMetapathNet), we conducted extensive comparisons against diverse baseline models across four network intrusion detection datasets. The baselines include traditional machine learning methods (RandomForest and SVM), classic GNN architectures (GraphSAGE and GIN), an advanced attention-based GNN (SuperGAT), and a state-of-the-art spectral GNN (BWGNN). Results are summarized in Table 1, Table 2, Table 3, and Table 4.
A consistent observation across all datasets is that graph-based methods significantly outperform traditional machine learning approaches. Traditional methods rely solely on node features, ignoring network topology and traffic relationships. In contrast, GNNs leverage structural information through message passing to identify complex intrusion patterns. On CICIDS2017 (Table 1), most traditional models yield F1-scores below 56%, while leading GNNs exceed 80%.
Our PseudoMetapathNet demonstrates consistently superior performance across all datasets. On CICIDS2017, it achieves a 93.46% F1-score and 99.56% Precision, outperforming SuperGAT (F1: 87.73%) and BWGNN (F1: 91.85%). On NSL-KDD, it attains the best results on all metrics, with an F1-score of 97.80%. Similarly, it achieves state-of-the-art results on KDD CUP 1999 and UNSW-NB15 (F1-scores of 90.19% and 98.55%, respectively).
The performance improvement over BWGNN, which serves as our framework’s spectral backbone, directly validates our core contributions: dynamic graph heterogenization and Pseudo-Metapath propagation. Beyond the gains on CICIDS2017, our model boosts the F1-score on UNSW-NB15 from 95.38% to 98.55%, and on NSL-KDD, from 97.43% to 97.80%. This confirms that relying solely on frequency-aware node features is insufficient; learning to route supervision signals along semantically relevant paths effectively combats the “under-reaching” problem caused by sparse labels.
To deconstruct our architecture’s advantages, we compare it specifically with two advanced GNNs: SuperGAT and BWGNN. Each possesses distinct strengths but also limitations that our framework overcomes:
BWGNN leverages Beta-Wavelet filters to capture the high-frequency signals characteristic of anomalies. While it excels as a feature extractor, it remains constrained by standard message passing, which limits the propagation of supervision signals to distant nodes under sparse labeling.
SuperGAT addresses long-range dependencies by aggregating information from different neighborhood ranges. However, its multi-hop pathways are structurally fixed and semantically agnostic, unable to follow specific semantic patterns crucial for intrusion detection.
Our PseudoMetapathNet synthesizes these strengths while overcoming their limitations. It begins with a strong spectral foundation for robust initial representations, then transcends the propagation bottleneck with our dynamic Pseudo-Metapath mechanism. This allows the model to route supervision signals along semantically meaningful paths discovered on-the-fly—a capability both baselines lack.
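To make this mechanism more concrete, the following is a minimal, illustrative sketch of a single propagation step as described in prose above: nodes are pseudo-typed by classifier confidence against a cutoff, each edge is typed by its endpoint pair, and messages are weighted by a per-type importance. The function name, type vocabulary, fixed weights (learned by attention in the actual model), and the exact update rule are simplifying assumptions for illustration, not the full implementation.

```python
import numpy as np

def pseudo_metapath_step(H, edges, probs, cutoff=0.6, type_weights=None):
    """One illustrative pseudo-metapath propagation step (simplified sketch).

    H:            (n, d) node embeddings
    edges:        list of undirected (src, dst) pairs
    probs:        (n,) anomaly probabilities from a base classifier
    cutoff:       confidence threshold for pseudo-typing nodes
    type_weights: per-edge-type importance (fixed here; learned in practice)
    """
    # 1. Dynamic heterogenization: pseudo-type each node by its confidence.
    #    'A' = anomaly, 'N' = normal, 'U' = unlabeled (below cutoff).
    def node_type(p):
        if p >= cutoff:
            return 'A'
        if p <= 1.0 - cutoff:
            return 'N'
        return 'U'

    types = [node_type(p) for p in probs]
    if type_weights is None:
        # Illustrative values only; an A-A path gets the largest weight.
        type_weights = {'AA': 1.0, 'AN': 0.3, 'NN': 0.6,
                        'AU': 0.5, 'NU': 0.5, 'UU': 0.2}

    # 2. Type each edge by its endpoint pair and weight messages accordingly,
    #    routing information preferentially along high-weight pseudo-metapaths.
    out = np.zeros_like(H)
    deg = np.zeros(len(H))
    for s, t in edges:
        key = ''.join(sorted(types[s] + types[t]))
        w = type_weights.get(key, 0.0)
        out[t] += w * H[s]
        out[s] += w * H[t]
        deg[t] += w
        deg[s] += w
    return out / np.maximum(deg, 1e-12)[:, None]
```

In this toy setting, a node with two anomalous neighbors receives their messages at full weight, while messages crossing an anomaly-normal boundary are attenuated, mimicking the semantic routing the full model learns end-to-end.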
The empirical results strongly validate this design:
PseudoMetapathNet vs. BWGNN: Our consistent performance improvement over BWGNN (e.g., F1-score of 93.46% vs. 91.85% on CICIDS2017) demonstrates the benefit of our dynamic propagation mechanism.
PseudoMetapathNet vs. SuperGAT: The larger gap between our model and SuperGAT (93.46% vs. 87.73% F1-score on CICIDS2017) highlights the superiority of adaptive, semantic propagation over fixed multi-scale aggregation.
As illustrated by the loss curves in Figure 3, PseudoMetapathNet exhibits a smooth and consistent convergence trend across all three datasets, comparable to the established baseline models. This demonstrates that, despite the inclusion of the auxiliary loss, our model maintains excellent training stability and feasibility.
In conclusion, our framework’s remarkable performance stems from synergistically combining a powerful spectral feature extractor with a novel, semantically aware propagation mechanism that directly addresses the limitations of existing GNNs.
We acknowledge that the performance gap on the original intrusion detection datasets could be further substantiated. To more rigorously test the robustness and generalizability of our framework, we carry out a broader evaluation on three widely recognized benchmark datasets from the related domain of Graph Anomaly Detection (GAD): Amazon, T-Finance, and Questions. These datasets are known for their challenging characteristics, such as severe class imbalance and heterophily, making them an ideal testbed for validating the effectiveness of our Pseudo-Metapath mechanism beyond its initial application domain. On these datasets, we follow the setting from [65], using three metrics: the Area Under the Receiver Operating Characteristic Curve (AUROC), the Area Under the Precision–Recall Curve (AUPRC), computed as average precision, and the Recall within the top-K predictions (Rec@K), where K is set to the number of anomalies in the test set. For all metrics, anomalies are treated as the positive class, with higher scores indicating better model performance.
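For concreteness, a minimal sketch of the Rec@K metric as defined above; the function name is ours, and AUROC/AUPRC can be computed with standard routines such as scikit-learn's `roc_auc_score` and `average_precision_score`.

```python
import numpy as np

def recall_at_k(scores, labels):
    """Rec@K with K = number of anomalies in the test set.

    scores: predicted anomaly scores; labels: 1 = anomaly (positive class).
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    k = int(np.sum(labels))                    # K = number of true anomalies
    topk = np.argsort(scores)[::-1][:k]        # indices of the K highest scores
    return float(np.sum(labels[topk])) / k     # fraction of anomalies recovered
```

A perfect scorer ranks every anomaly above every normal node and attains Rec@K = 1.0 regardless of class imbalance.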
As shown in Table 5 and Table 6, our framework demonstrates a consistently strong, and often superior, performance against a comprehensive suite of GNN baselines. Specifically, on the Amazon and T-Finance datasets, PseudoMetapathNet achieves state-of-the-art results by securing the top performance across all three evaluation metrics (AUROC, AUPRC, and Rec@K). For instance, on T-Finance, it obtains the highest AUROC of 96.40%, AUPRC of 86.62%, and Rec@K of 81.55%, decisively outperforming all competitors. On the challenging Questions dataset, where performance is highly competitive, our model achieves the highest AUPRC of 18.09%. This result is particularly significant, as AUPRC is a more informative metric than AUROC for evaluating models on severely imbalanced datasets, a key feature of this benchmark. These comprehensive results strongly corroborate our central claim: the dynamic Pseudo-Metapath mechanism is a powerful and generalizable strategy for enhancing node anomaly detection in complex graph structures.
5.3. Ablation Study
To verify the individual contributions of the key components within our proposed framework, we conducted a series of ablation experiments on all four datasets. We investigated the impact of two core modules: (1) the Beta-Wavelet spectral filter module, responsible for learning frequency-aware node representations, and (2) our novel Pseudo-Metapath propagation module, which dynamically routes supervision signals. We evaluated two variants of our model: one without the spectral module and another without the metapath module. The performance degradation relative to the full model is presented in Figure 4.
The Pseudo-Metapath module, as the results illustrate, is unequivocally the most critical component of our framework. Removing this module resulted in a substantial and consistent performance drop across all datasets and nearly all metrics. The impact was particularly dramatic on the UNSW-NB15 dataset, where its removal led to a catastrophic decrease in Recall by 8.92% and in AUC by 11.75%. Similarly, on CICIDS2017, the F1-score plummeted by 7.12%. These significant degradations strongly validate our central hypothesis: standard message passing is insufficient for this task. The dynamic, semantically aware propagation routes learned by the Pseudo-Metapath module are essential for effectively combating the “under-reaching” problem and ensuring that supervision signals reach relevant nodes throughout the network.
The Beta-Wavelet spectral filter also proved to be a vital component for achieving optimal performance. Disabling this module consistently led to a noticeable drop in performance, confirming its role in generating robust initial node embeddings. For instance, on the CICIDS2017 dataset, removing the spectral filter caused a 4.92% drop in Precision. On the NSL-KDD dataset, Accuracy and Precision decreased by 2.53% and 2.09%, respectively. This demonstrates that effectively capturing the high-frequency signals characteristic of network anomalies provides a strong and necessary foundation for the subsequent propagation and classification steps. The synergy between a powerful feature extractor and an intelligent propagation mechanism is therefore key to the model’s success.
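To make the filter's role concrete, the following sketch evaluates the spectral response of a Beta-wavelet kernel in the standard form popularized by BWGNN: eigenvalues of the normalized Laplacian (in [0, 2]) are rescaled to [0, 1] and passed through a Beta density, so that q = 0 yields the high-pass responses that emphasize high-frequency anomaly signals. The function name and parameter choices are our illustration; a full model applies a polynomial of the Laplacian rather than evaluating each eigenvalue directly.

```python
import math
import numpy as np

def beta_wavelet_response(lam, p, q):
    """Spectral response of a Beta-wavelet kernel at Laplacian eigenvalue lam.

    p = 0 gives a low-pass filter, q = 0 a high-pass filter, and
    intermediate (p, q) band-pass filters over the spectrum [0, 2].
    """
    w = lam / 2.0                                   # rescale [0, 2] -> [0, 1]
    norm = math.gamma(p + q + 2) / (math.gamma(p + 1) * math.gamma(q + 1))
    return norm * (w ** p) * ((1.0 - w) ** q)       # normalized Beta density

lams = np.linspace(0.0, 2.0, 201)
low = beta_wavelet_response(lams, 0, 2)    # emphasizes smooth, low-frequency signal
high = beta_wavelet_response(lams, 2, 0)   # emphasizes high-frequency anomaly signal
```

The high-pass response vanishes at the low-frequency end of the spectrum and peaks at the high end, which is exactly the behavior that preserves anomaly signatures ordinary low-pass GNN aggregation smooths away.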
In summary, our ablation studies confirm that both the spectral filter and the Pseudo-Metapath module are integral and synergistic components. The spectral filter provides a robust feature basis by capturing anomaly signatures, while the Pseudo-Metapath module provides an indispensable mechanism for effective, long-range information propagation, with the latter being the primary driver of our model’s superior performance.
5.4. Hyper-Parameter Study
In this section, we conduct a comprehensive study to investigate the impact of two critical hyper-parameters on the performance of our proposed model: the number of Dynamic Metapath Learning Layers (hereafter, Num Layers) and the pseudo-labeling cutoff threshold (hereafter, Cutoff).
The number of layers determines the model’s complexity and receptive field, while the cutoff threshold controls the confidence required for assigning pseudo-labels during training. The model’s performance is measured using Accuracy and Recall, with the results visualized as 3D surface plots to illustrate the interplay between these two parameters. As illustrated in Figure 5 and Figure 6, we can draw several key insights from the experimental results.
First, a striking observation is the high degree of consistency between the optimal hyper-parameter regions for maximizing accuracy and recall. Across all datasets, the parameter combinations that yield the highest accuracy also tend to produce the highest recall. This indicates that our model does not require a significant trade-off between these two crucial metrics, simplifying the tuning process.
Second, the Cutoff threshold emerges as the most dominant factor influencing performance. For all four datasets, setting the threshold to a high value (e.g., greater than 0.8) invariably leads to a sharp decline in both accuracy and recall. This is intuitive: a stricter confidence criterion assigns far fewer pseudo-labels and causes the model to miss more potential threats, thereby increasing the number of false negatives and degrading overall performance. The results suggest that a lower-to-mid-range cutoff (approximately 0.5 to 0.7) is optimal.
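The starvation effect of a high cutoff can be illustrated with a toy calculation on synthetic confidence scores; the one-sided labeling rule, the score distribution, and the function name are assumptions for illustration only.

```python
import numpy as np

def anomaly_pseudo_label_rate(scores, cutoff):
    """Fraction of nodes confidently pseudo-labeled as anomalies (toy rule)."""
    return float(np.mean(np.asarray(scores) >= cutoff))

# Synthetic anomaly confidences skewed toward low values, as in imbalanced traffic.
rng = np.random.default_rng(0)
scores = rng.beta(2, 5, size=1000)

# Raising the cutoff monotonically shrinks the pool of pseudo-labeled anomalies,
# leaving fewer supervision signals to propagate.
rates = {c: anomaly_pseudo_label_rate(scores, c) for c in (0.5, 0.7, 0.9)}
```

Because the rate can only decrease as the cutoff rises, an aggressive threshold leaves the propagation mechanism with almost no anomaly pseudo-labels to route, which matches the sharp performance decline observed above 0.8.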
Third, the ideal model complexity, dictated by Num Layers, varies with the characteristics of the dataset. For KDD CUP 1999 and its subset NSL-KDD, a simpler model with two to three layers achieves the best results. Increasing the model’s depth beyond this point leads to a performance drop, likely due to over-smoothing or overfitting. Conversely, for the CICIDS2017 dataset, the model’s performance is largely insensitive to the number of layers, provided that the cutoff threshold is set appropriately. This suggests that the features in this dataset are robust enough to be effectively captured by models of varying depths. The UNSW-NB15 dataset shows a more complex relationship, but a model with two layers still provides a reliable and high-performing baseline.
In summary, this study underscores the importance of careful hyper-parameter tuning. The primary guideline for our model is to maintain the Cutoff threshold within a moderate range of [0.5, 0.7]. Within this range, the PseudoMetapathNet with 2 to 3 layers offers a robust and effective configuration for achieving high accuracy and recall across diverse network intrusion detection environments.
5.5. Case Study
To provide a qualitative and intuitive understanding of how PseudoMetapathNet overcomes the limitations of standard GNNs, we conduct a detailed case study on a representative subgraph extracted from the dataset. As illustrated in Figure 7, we selected a homophilous cluster consisting of four interconnected anomaly nodes. This scenario is particularly insightful because it tests a model’s ability to recognize and amplify signals within a group of coordinated malicious entities, a situation where simpler models can surprisingly fail.
The results of the case study clearly demonstrate the superiority of our proposed framework. The leftmost panel of Figure 7 shows the ground truth, a cluster where all four nodes are anomalies. The second panel reveals the surprising failure of the baseline Graph Convolutional Network (GCN) model [67]. Despite the absence of any normal nodes to cause confusion, the GCN misclassifies every single anomaly as normal, yielding a low confidence score of 0.45 for each. This suggests that the standard message-passing mechanism is insufficient for creating a signal reinforcement loop among anomalous peers and may be biased by the globally prevalent normal class.
In stark contrast, the third panel shows the decisive success of our PseudoMetapathNet. It correctly identifies all four nodes as anomalies with the highest possible confidence score of 1.00. The key to this success is revealed in the rightmost panel, which visualizes the learned attention weights. Our model has learned to assign the highest importance to the A-A (Anomaly-Anomaly) metapath, with a weight significantly greater than other path types.
This learned knowledge is critical. When processing this subgraph, PseudoMetapathNet’s dynamic typing and attention mechanism explicitly amplify the information flow along these high-weight A-A paths. This creates a powerful positive feedback loop where each anomaly node mutually reinforces its neighbors’ anomalous status, rapidly driving the prediction confidence to its maximum. This case provides strong qualitative evidence that the Pseudo-Metapath mechanism is crucial for identifying not only isolated threats in heterophilous environments but also coordinated patterns of malicious activity by learning and exploiting the underlying semantic graph structure.
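The feedback loop can be caricatured with toy dynamics; the update rule, step size, and weights below are illustrative assumptions, not the model's actual computation. With a large A-A weight, mutual reinforcement drives all four confidences from the GCN's 0.45 to 1.0, while a zero weight leaves them unchanged.

```python
import numpy as np

def reinforce(w_aa, steps=10, alpha=0.5):
    """Toy mutual-reinforcement dynamics on a clique of four anomaly nodes.

    w_aa:  assumed A-A metapath weight (0 disables reinforcement)
    alpha: step size of the confidence update
    """
    conf = np.full(4, 0.45)                 # initial per-node confidence
    adj = np.ones((4, 4)) - np.eye(4)       # fully connected anomaly clique
    for _ in range(steps):
        neigh = adj @ conf / adj.sum(axis=1)            # mean neighbor confidence
        conf = np.clip(conf + alpha * w_aa * neigh, 0.0, 1.0)
    return conf
```

Under these toy dynamics, `reinforce(1.0)` saturates every node's confidence at 1.0 within a few steps, while `reinforce(0.0)` leaves all nodes stuck at 0.45, mirroring the GCN's failure in the second panel.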