Next Article in Journal
Data Organisation for Efficient Pattern Retrieval: Indexing, Storage, and Access Structures
Next Article in Special Issue
Source Robust Non-Parametric Reconstruction of Epidemic-like Event-Based Network Diffusion Processes Under Online Data
Previous Article in Journal
Robust Clinical Querying with Local LLMs: Lexical Challenges in NL2SQL and Retrieval-Augmented QA on EHRs
Previous Article in Special Issue
A Complex Network Science Perspective on Urban Parcel Locker Placement
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Tester-Guided Graph Learning with End-to-End Detection Certificates for Triangle-Based Anomalies

by
Manuel J. C. S. Reis
Engineering Department and Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Trás-os-Montes e Alto Douro, Quinta de Prados, 5000-801 Vila Real, Portugal
Big Data Cogn. Comput. 2025, 9(10), 257; https://doi.org/10.3390/bdcc9100257
Submission received: 10 September 2025 / Revised: 2 October 2025 / Accepted: 9 October 2025 / Published: 12 October 2025

Abstract

We investigate anomaly detection in complex networks through a property-testing-guided graph neural model (PT-GNN) that provides an end-to-end miss-probability certificate ( δ + α ) . The method combines (i) a wedge-sampling tester that estimates triangle-closure frequency and derives a concentration bound ( δ ) via Bernstein’s inequality, with (ii) a lightweight classifier over structural features whose validation error contributes ( α ) . The overall certificate is given by the sum ( δ + α ) , quantifying the probability of missed anomalies under bounded sampling. On synthetic communication graphs with n = 1000, edge probability p = 0.01, and anomalous subgraph size k = 120, PT-GNN achieves perfect detection performance (AUC = 1.0, F1 = 1.0) across all tested regimes. Moreover, the miss-probability certificate tightens systematically as the tester budget m increases (e.g., for ε = 0.06, enlarging m from 2000 to 8000 reduces ( δ + α ) from ≈0.87 to ≈0.49). These results demonstrate that PT-GNN effectively couples graph learning with property testing, offering both strong empirical detection and formally verifiable guarantees in anomaly detection tasks.

1. Introduction

Anomaly detection on complex networks underpins applications in communication security, finance, and cyber-physical systems. Recent progress in graph representation learning has produced powerful detectors, yet most methods emphasize empirical accuracy while offering limited certification of reliability under sampling constraints [1,2]. This gap is acute in high-throughput settings where full-graph access is costly and decision risk must be controlled.
In our earlier work on scalable intrusion detection via property testing and federated edge AI [3] and on property-based testing for cybersecurity [4], we demonstrated how lightweight testing and distributed learning can enhance anomaly detection in resource-constrained settings. However, those approaches primarily established empirical effectiveness and automation benefits, without providing a formal certificate on the probability of missed anomalies. The present study advances this line of research by introducing a tester-guided graph learner that incorporates provable sampling guarantees, moving from heuristic detection toward verifiable reliability.
Beyond methodological novelty, the ability to report certified detection risk has direct impact in practice. In domains such as financial fraud, network security, and critical infrastructure monitoring, stakeholders require not only accurate predictions but also quantifiable guarantees under bounded sampling budgets. Traditional empirical detectors cannot offer such assurances, leaving decision-makers to rely on heuristic thresholds. By contrast, our certificate ( δ + α ) provides an explicit, finite-sample upper bound on missed detections, enabling risk-aware deployment in high-stakes settings.
We address this need with a tester-guided learner, PT-GNN, that couples sublinear graph property testing with lightweight learning to obtain an end-to-end miss-probability certificate. Property testing offers principled, query-efficient procedures to assess global graph structure from local probes [5,6], and triangle-based motifs (closure/cluster structure) are canonical indicators for community coherence and irregular interaction patterns. We instantiate the tester via wedge sampling for triangle statistics, a technique with provable accuracy guarantees and practical efficiency on large graphs [7,8].
Unlike existing anomaly detection methods that either rely solely on empirical accuracy (e.g., GNN-based detectors) or offer generic certified guarantees without graph-specific considerations (e.g., randomized smoothing, conformal prediction), our approach uniquely integrates property testing with graph learning. Specifically, PT-GNN is the first framework to couple wedge-sampling with lightweight graph classification in order to produce an end-to-end miss-probability certificate ( δ + α ) . This moves beyond heuristic scoring or robustness guarantees by providing task-specific, finite-sample reliability bounds directly tied to triangle-based anomalies. Triangles are a canonical indicator of collusion and dense substructures—patterns observed in fraud rings, insider trading, and coordinated botnets. By targeting anomalies characterized by abnormal closure frequency, PT-GNN bridges a critical gap: it provides both strong empirical detection and formally verifiable guarantees under sampling constraints, ensuring that decisions can be deployed with quantifiable risk.
Our certificate decomposes the probability of missing an anomaly into two additive terms: ( i ) a tester uncertainty term δ obtained from a Bernstein-type concentration bound on the wedge-sampling estimator, and ( i i ) a classifier validation error α from supervised evaluation. The overall miss-probability is thus upper-bounded by ( δ + α ) , directly linking sampling budget to decision risk via nonasymptotic concentration results [9].
Our contributions are as follows:
  • We introduce PT-GNN, a tester-guided graph learner that integrates wedge-sampling triangle closure into training and reporting.
  • We derive an end-to-end miss-probability certificate ( δ + α ) that combines a Bernstein-style bound for the tester with empirical validation error for the learner.
  • On synthetic communication-graph benchmarks, we show perfect detection (AUC/F1) alongside systematically tightening certificates as the tester budget increases, demonstrating practical verifiability under bounded sampling.
  • We extend our prior work [3,4] by providing, for the first time, formally certified detection guarantees in addition to strong empirical performance.

2. Related Work

Research on anomaly detection in graphs intersects multiple lines of investigation. We briefly review relevant contributions on motif-based detection, sublinear property testing, graph representation learning, and certified machine learning.
Recent surveys highlight the diversity of approaches to graph anomaly detection. Song [10] provides a concise 2024 overview, while Pazho et al. [11] and Ekle & Eberle [12] review deep graph anomaly detection and dynamic-graph methods, respectively. These works frame the broader landscape to which PT-GNN contributes.

2.1. Motif-Based Anomaly Detection

Network motifs such as wedges and triangles are central in characterizing community structure and irregularities. Early work showed that deviations in triangle closure often signal spurious or anomalous links [13]. Methods exploiting subgraph patterns and clustering coefficients have been applied in social and communication networks [1,14]. Efficient wedge-sampling estimators enable scalable motif statistics on large graphs [7,8]. Our work adopts wedge-based testing as a backbone, but departs from heuristic anomaly scores by coupling motif statistics with an explicit concentration bound that contributes to a miss-probability certificate.
Complementary efforts include Ai et al. [15], who study group-level anomalies via topology patterns, broadening anomaly definitions beyond wedges and triangles.

2.2. Sublinear Graph Property Testing

Graph property testing provides query-efficient algorithms to approximate global properties from local samples [5,6]. Applications range from connectivity and bipartiteness testing to motif frequency estimation. Concentration inequalities have been leveraged to provide probabilistic guarantees on tester outputs [9]. In security, property testing has been recently explored for protocol validation and intrusion detection in distributed settings [3,4]. PT-GNN extends these ideas by embedding a tester directly into a learning pipeline, enabling certificates that span both sampling and classification.

2.3. Graph Representation Learning for Anomalies

Graph neural networks (GNNs) and representation models have been widely studied for anomaly detection [2]. Approaches range from autoencoder-based detection [16] to community-preserving embeddings and spectral methods. While effective, most methods remain empirical and do not quantify detection risk. Prior work on federated and edge-based anomaly detection [3] highlights the need for lightweight and distributed solutions, yet still lacks verifiable guarantees. PT-GNN contributes to this line by combining efficient motif-level queries with certifiable bounds, offering both scalability and formal reliability.
Recent methods continue to diversify. Roy et al. [17] reconstruct local neighborhoods for anomaly scores, while Tian et al. [18] introduce semi-supervised detection on dynamic graphs. Yang et al. [19] propose GRAM, an interpretable gradient-attention approach that emphasizes model transparency in anomaly detection.

2.4. Certified Learning and Detection Guarantees

Certified machine learning has recently sought to quantify model reliability beyond empirical metrics. Randomized smoothing offers provable robustness against adversarial perturbations [20], while conformal prediction provides finite-sample confidence sets [21]. In anomaly detection, guarantees remain scarce, with most work focusing on distribution-free confidence intervals or heuristic thresholds. By introducing a ( δ + α ) certificate that combines tester variance with validation error, PT-GNN situates itself within this emerging paradigm of verifiable detection.
Knowledge-enabled frameworks such as KnowGraph [22] further show how domain knowledge can be integrated with graph anomaly detection, offering an orthogonal perspective to certified reliability.
To summarize the positioning of our approach relative to prior work, Table 1 contrasts motif-based, property-testing, GNN-based, and certified ML methods. As the table highlights, while existing approaches offer either scalability or strong empirical performance, none provide end-to-end formal guarantees for anomaly detection in complex networks. PT-GNN uniquely integrates sublinear property testing with representation learning to deliver both scalability and certified reliability.
Taken together, these comparisons highlight the position of PT-GNN relative to the field. Unlike motif-based anomaly detectors, PT-GNN moves beyond heuristic subgraph statistics by embedding a wedge-based tester into a certified learning pipeline. Compared with property-testing approaches, it extends guarantees from the tester alone to the entire end-to-end detection process. Relative to GNN-based methods, it preserves scalability while offering formal reliability that empirical models cannot provide. Finally, in contrast to certified ML frameworks such as randomized smoothing or conformal prediction, PT-GNN is graph-specific and tailored to triangle-driven irregularities. At the same time, we acknowledge that PT-GNN shares certain limitations with these families, such as sensitivity to anomaly definitions and dependence on validation data for calibration. This synthesis underscores the contribution of PT-GNN: it is, to our knowledge, the first approach to provide scalable, triangle-specific, and formally certified anomaly detection in complex networks.

3. Method

We detail PT-GNN’s components and the derivation of the end-to-end miss-probability certificate ( δ + α ) -from the problem formulation to the wedge tester, classifier, and computational cost.

3.1. Problem Setting

Let G = ( V , E ) be a simple undirected graph and let a wedge be an unordered length-2 path ( u , v , w ) with edges { u , v } , { v , w } E . A wedge is closed if { u , w } E , otherwise it is open. For a graph G, define the triangle-closure probability as follows:
p ( G ) : = Pr { a uniformly sampled wedge in G is closed } [ 0 , 1 ] .
Intuitively, p ( G ) quantifies the tendency of the network to form closed triads. In many real-world settings, unusually high closure rates can reveal coordinated or collusive behavior. For example, fraud rings and insider-trading groups often create tightly interconnected communities, while botnet controllers may generate bursts of dense triadic communication. Conversely, benign background traffic or random interactions tend to produce wedges that remain open. Thus, differences in p ( G ) between normal and anomalous graphs capture structural irregularities that are difficult to detect from degree information alone.
We consider i.i.d. labeled samples ( G , y ) G × { 0 , 1 } , with y = 1 indicating an anomaly class characterized by overexpressed triangle closure. Let p 0 : = E [ p ( G ) y = 0 ] and p 1 : = E [ p ( G ) y = 1 ] , and assume a separation margin Δ : = p 1 p 0 > 0 .
We report certificates at a user-selected resolution ε 0 Δ , interpreted as the minimal practically relevant closure gap for positives.

3.2. Wedge-Sampling Tester and Concentration Bound

Given a query budget m, we draw m wedges independently and uniformly at random. (Uniform wedge sampling can be implemented in O ( 1 ) per query by sampling a center v with probability proportional to d ( v ) 2 and then two distinct neighbors of v uniformly at random; see [7,8].) For each sampled wedge i { 1 , , m } , define X i { 0 , 1 } as the indicator that the wedge is closed. The unbiased estimator of p ( G ) is as follows:
p ^ : = 1 m i = 1 m X i , E [ p ^ ] = p ( G ) .
Since X i [ 0 , 1 ] are bounded, a Bernstein-type inequality controls the tail of p ^ around p ( G ) [9]:
Pr | p ^ p ( G ) | > ε β ( ε , m ; σ 2 ) : = 2 exp m ε 2 2 σ 2 + 2 3 ε , σ 2 : = Var ( X i ) 1 4 .
In practice, one may plug in the conservative bound σ 2 = 1 4 or use an empirical variance estimate.

3.3. Reporting δ

We expose the tester’s uncertainty through a single scalar δ via either of two equivalent reporting modes:
(1)
Fixed confidence: given a user risk level ρ ( 0 , 1 ) , define the following:
δ tol ( m , ρ ) : = inf ε > 0 : β ε , m ; 1 4 ρ ,
the smallest tolerance for which (1) guarantees confidence 1 ρ .
(2)
Fixed effect size: given a task margin ε 0 ( 0 , 1 ) (e.g., the minimal practically relevant triangle-closure gap), report the following:
δ risk ( m , ε 0 ) : = β ε 0 , m ; 1 4 ,
the tester’s deviation probability at resolution ε 0 .
Both are monotone in m and interchangeable for presentation; our code supports either policy and computes δ by direct inversion of (1).

3.4. Classifier and End-to-End Certificate

PT-GNN consumes structural features (motif counts/ratios, degree statistics, and summary statistics from the tester) and outputs a score s ( G ) R . A threshold τ is selected on a validation set to optimize a target metric (e.g., F1). Let the following be true:
α : = Pr { y ^ y validation }
denote the observed validation error (e.g., 1 F 1 at the tuned threshold). We now upper-bound the missed-detection probability for positives with separation at least ε 0 .

3.5. Certificate

Let A be the event that the tester’s deviation exceeds the resolution: A = { | p ^ p ( G ) | > ε 0 } , so Pr ( A ) δ by (1). Let C be the event that the classifier errs given features consistent with the tester (estimated by α on validation). For any G with y = 1 and p ( G ) p 0 ε 0 , a miss can only occur if either the tester deviates ( A ) or the classifier mispredicts ( C ). By the union bound as follows:
Pr ( miss ) Pr ( A ) + Pr ( C ) δ + α .
which is the end-to-end miss-probability certificate. Here δ is reported via either fixed-confidence or fixed-effect-size policy above. In experiments we report ( δ + α ) alongside standard metrics (AUC/F1).

3.6. Computational Complexity

Wedge sampling costs O ( m ) time after an initial degree pass, and O ( 1 ) memory beyond reservoir state. Feature extraction over neighborhoods is near-linear in | E | . The overall pipeline supports sublinear sampling whenever m | E | , with end-to-end cost dominated by O ( m ) tester queries plus the (lightweight) classifier training.

4. Theoretical Notes

This section formalizes the reliability of PT-GNN by deriving a finite-sample, end-to-end miss-probability certificate that combines a Bernstein bound for the wedge-sampling tester with the classifier’s validation error [9,23,24].
We proceed step by step. First, we state the assumptions under which the analysis holds. Then we quantify the probability that the wedge tester deviates from the true closure rate. Finally, we combine this deviation with the classifier error using a simple union bound.
The following assumptions are made:
A1
We observe i.i.d. labeled graphs ( G , y ) G × { 0 , 1 } ; positives (y = 1) satisfy an effect-size (margin) condition p ( G ) p 0 ε 0 > 0 , where p ( G ) is the triangle-closure probability and p 0 : = E [ p ( G ) y = 0].
A2
On each graph, the tester draws m i.i.d. wedges uniformly at random and reports the sample mean p ^ of the “closed-wedge” indicator.
A3
The classifier threshold is tuned on a validation set disjoint from training; α denotes its true (population) missed-detection rate when fed the same feature pipeline (including tester summaries).
Assumptions (A1)–(A3) formalize the setting needed for a clean miss-probability certificate, but several practical nuances are worth noting:
  • On (A1): i.i.d. graphs and margin condition. In deployment, graphs may arrive from a drifting process (e.g., seasonal patterns or evolving user behavior). While our certificate targets the distribution seen at validation time, distribution shift can inflate α . In practice, we monitor calibration on a sliding window and re-estimate α ^ periodically (cf. Section 5.4); our finite-sample bound on α (below) remains applicable. The effect-size condition p ( G ) p 0 ε 0 encodes the task resolution; if smaller effects are relevant, one can either increase m or lower ε 0 and accept a looser δ .
  • On (A2): uniform i.i.d. wedge sampling. Heavy-tailed degree distributions can create effective dependencies if wedges are sampled without replacement or via hub-heavy neighborhoods. We enforce with-replacement sampling to maintain independence and may apply a finite-population correction when system constraints require without-replacement sampling. Variance can be reduced without bias by degree-stratified or importance-weighted wedge sampling, together with unbiased reweighting; the same certificate form holds with σ 2 replaced by the stratified variance, and empirical-Bernstein variants can tighten δ by exploiting observed variance.
  • On (A3): validation-derived α . The classifier error is estimated under the same feature pipeline used at test time (including tester summaries), and we provide an explicit finite-sample correction (Hoeffding). Class imbalance and threshold drift are handled via periodic threshold re-tuning on a validation buffer and, when needed, calibration (e.g., isotonic/Platt). Selective prediction (abstaining under uncertainty) is compatible with our reporting by accounting for abstention as a separate operating point.
  • Union bound conservativeness. Our guarantee does not assume independence between tester deviation and classifier error; the union bound is deliberately conservative and may overestimate risk when the events overlap. This is a safety margin, not a weakness: the true miss rate is typically lower than δ + α .
These considerations do not alter the structure of the certificate; they primarily affect constants in δ (via variance and sampling policy) and the empirical estimate of α (via validation design). We provide operational guidance in the Discussion and report sensitivity to sampling budget and graph regime in the Experiments.
With these assumptions in place, we can state the main result: the probability of missing an anomaly is at most the sum of two terms, one from tester deviation and one from classifier error.
Theorem 1
(Miss-Probability Certificate). Under (A1)–(A3), for any anomalous G with p ( G ) p 0 ε 0 , the missed-detection probability satisfies the following:
Pr ( miss ) δ ( m , ε 0 ) + α ,
where δ ( m , ε 0 ) bounds the tester’s deviation event via a Bernstein-type inequality and α is the classifier’s true validation error.
Proof. 
The proof follows a simple structure. First we bound the probability that the tester’s estimate of triangle closure deviates by more than ε 0 (event A ). Then we consider the probability that the classifier mispredicts (event C ). A miss can occur only if either of these events happens, so we apply the union bound.
Let A : = { | p ^ p ( G ) | > ε 0 } . By Bernstein’s inequality for bounded variables (here Bernoulli), for some nonincreasing function β in m, the following is true:
Pr ( A ) δ ( m , ε 0 ) : = β ( ε 0 , m ) = 2 exp m ε 0 2 2 σ 2 + 2 3 ε 0 ,
with σ 2 = Var ( X i ) 1 / 4 . Let C denote the event that the classifier mislabels G when the tester features are within tolerance (the same pipeline used for validation). A miss can occur only if either A or C happens; hence, by the union bound, Pr ( miss ) Pr ( A ) + Pr ( C ) δ ( m , ε 0 ) + α .    □

4.1. Operational Reporting and Two Parameterizations

The tester uncertainty term δ can be reported in two equivalent ways, depending on whether the user fixes the desired risk level or the effect size of interest.
  • Fixed-confidence: given risk level ρ , report the smallest tolerance δ tol ( m , ρ ) s.t. β ( δ tol , m ) ρ (this scales as O ( log ( 1 / ρ ) / m ) );
  • Fixed-effect-size: given ε 0 (e.g., task margin), report δ risk ( m , ε 0 ) = β ( ε 0 , m ) (this decays exponentially in m).
We use the fixed-confidence parameterization in all experiments (see Section 5.2).

4.2. Finite-Sample ( α ^ ) Version

In practice we do not know the true α but estimate it from a finite validation set. This introduces additional sampling error, which we control using Hoeffding’s inequality.
In practice, α is estimated on n val held-out graphs as α ^ . By Hoeffding’s inequality for Bernoulli errors, with probability at least 1 η (over the draw of the validation set), the following is true:
α α ^ + 1 2 n val ln 2 η .
Consequently, with the same confidence, the following is true:
Pr ( miss ) δ ( m , ε 0 ) + α ^ + 1 2 n val ln 2 η .

4.3. Sample-Complexity for a Target Risk

Finally, we can invert the bound to ask: how many wedge samples m are needed to achieve a desired risk level ρ ? The following expression provides a sufficient budget.
Given a user budget ρ ( 0 , 1 ) for the tester term and effect size ε 0 , it suffices to choose the following:
m 2 σ 2 ε 0 2 + 2 3 ε 0 ln 2 ρ 1 / 2 ε 0 2 + 2 3 ε 0 ln 2 ρ ,
using σ 2 1 / 4 . This makes δ ( m , ε 0 ) ρ , yielding Pr ( miss ) ρ + α (or ρ + α ^ + val . term in the finite-sample form).
Remark 1
(Scaling intuition). Under the fixed-confidence view, the tolerance δ tol ( m , ρ ) behaves as O ( m 1 / 2 ) up to log ( 1 / ρ ) factors, matching classical concentration. Under the fixed-effect-size view, the risk δ risk ( m , ε 0 ) decreases exponentially in m.
Remark 2
(On coupling and conservativeness). The tester’s summary (e.g., p ^ and auxiliary statistics) is part of the feature pipeline used during validation, so α ^ empirically captures errors conditional on these features. The union bound remains valid without independence assumptions between tester noise and classifier decisions; the result is conservative if the events overlap.

5. Experimental Setup

We now describe the evaluation setup, covering data generation, tester budgets, features and classifier, protocol, baselines, and compute assumptions.

5.1. Data

We generate synthetic communication graphs using the G ( n , p ) model [25], with n = 1000 nodes and edge probability p = 0.01. For the anomalous class, we plant a community S V of size k = 120 and boost triangle closure inside S so that the wedge-closure probability increases by a target amount ε { 0.05 , 0.06 , 0.10 } relative to the benign baseline (in G ( n , p ) , closure equals p by independence). Operationally, the generator samples a benign G and then, for y = 1, iteratively closes a fraction of open wedges within S to achieve p 1 p 0 + ε (clipped to [ 0 , 1 ] ); for y = 0 no modification is applied. Each configuration ( ε , m ) yields a balanced dataset with | D | = 60 graphs (30 benign, 30 anomalous).
To complement the synthetic benchmarks, we also include illustrative experiments on publicly available network datasets. Specifically, we evaluate PT-GNN on (i) the Enron email network and (ii) a citation graph (Cora), where anomalies are defined by injected high-closure communities. While these datasets are smaller and noisier than our generator, they demonstrate that our method can be deployed on real graph data without modification of the pipeline.
In addition to Enron and Cora, we evaluate on a third real-world benchmark: a Reddit discussion interaction network, where anomalies are defined by injected high-closure communities within otherwise sparse conversational threads. This dataset is larger and more heterogeneous than Enron or Cora and stresses scalability. Across all real-data settings, we emphasize that anomalies are injected for controlled ground truth, which may simplify reality; potential biases arise from this injection protocol and from class balancing (see also Section 5.4).

5.2. Tester Configuration

Given query budget m { 2000 , 4000 , 8000 } , the wedge tester samples wedges uniformly (center chosen with probability proportional to d ( v ) 2 , then two neighbors uniformly) and reports the closed-wedge frequency p ^ [7,8]. The deviation bound δ is computed by inverting a Bernstein-type inequality for bounded variables (conservative variance σ 2 1 / 4 ) [9,26].
Unless otherwise stated, we adopt the fixed-confidence parameterization with a common risk level ρ across all conditions. Consequently, δ = δ tol ( m , ρ ) depends only on the tester budget m (and ρ ), not on ε . In the main grid, the validation error is negligible ( α 0 ), so the reported certificate ( δ + α ) is constant across ε at fixed m.

5.3. Features and Classifier

From each graph we extract lightweight structural features: wedge/triangle counts and ratios (including global clustering proxies), degree statistics (mean, variance, max), and tester summaries (e.g., p ^ ). A logistic classifier (L2-regularized) is trained on the training split; the decision threshold is tuned on validation to maximize F1. All features are standardized using statistics computed on the training set only.
In addition to detection accuracy, we record the average runtime per graph (tester queries + feature extraction + classifier inference) as a measure of computational efficiency, allowing a fair comparison with baseline methods.

5.4. Protocol

We use a 70/30 train/validation split per configuration. All experiments are seeded for reproducibility; data generation, tester sampling, and model initialization share the same seed per run. We report area under the ROC curve (AUC), F1 (on validation at the tuned threshold), and the end-to-end certificate ( δ + α ) , where α is the observed validation error ( 1 F 1 ). For completeness, we aggregate metrics across runs using the mean and standard deviation.
We compute precision–recall curves with scikit-learn using precision_recall_curve and average_precision_score with pos_label = 1. Train/validation splits are stratified to preserve the class ratio, so the precision at recall  = 1 equals the validation positive prior π + .
Alongside AUC and F1, we report precision, recall, and average precision (AP), as well as the average runtime per graph. Validation splits are stratified to preserve class ratios, but since datasets are balanced, AP baselines equal π + = 0.50; this makes AP improvements interpretable. We note that validation is performed on injected anomalies and may not capture the full variability of real-world irregularities.
Unless otherwise stated, we report metrics as mean ± standard deviation over R = 10 independent runs with fixed seeds per configuration. For selected results we also provide 95% confidence intervals (CIs) computed as x ¯ ± t 0.975 , R 1 s / R . Average runtime per graph (tester queries + feature extraction + inference) is reported to compare efficiency across methods.

5.5. Baselines

We report two lightweight baselines for context:
  • Triangle z-score: using the same wedge budget m, estimate the closure p ^ ( G ) and compare it to a benign reference p ref computed from the training benign graphs (via the same wedge sampler). Define the following:
    z ( G ) = p ^ ( G ) p ref p ref ( 1 p ref ) / m ,
    and classify by z ( G ) τ z with τ z tuned on validation to maximize F1. AUC is obtained by sweeping τ z .
  • Degree-only logistic: features are degree summaries x deg ( G ) = [ d ¯ , sd ( d ) , d max , q 0.9 ( d ) ] computed on G; we train an L2-regularized logistic classifier on the training split and tune the decision threshold on validation for F1.
  • Graph autoencoder (GAE): We include a two-layer GCN autoencoder baseline [16]. The encoder uses hidden dimension 64 with ReLU activations and an inner-product decoder; training runs for 200 epochs with Adam ( 10 3 ) on the training split only. Anomaly scores are given by reconstruction error, with the decision threshold tuned on validation to maximize F1. This baseline situates PT-GNN alongside a representative deep graph detector, and is trained/evaluated with the same splits and random seeds as PT-GNN for fairness.
  • Local Outlier Factor (LOF): We compute LOF scores on degree-based features using k = 20 neighbors and classify with a threshold tuned on validation. This situates PT-GNN against a classical density-ratio baseline.
  • v
    DeepWalk + Logistic: We generate 128-dimensional DeepWalk embeddings and train an L2-regularized logistic classifier on them. This baseline represents embedding-based anomaly detection without motif statistics.
    Taken together, these baselines situate PT-GNN alongside both lightweight structural methods and neural architectures. For fairness, all baselines are trained and evaluated using the same train/validation splits and random seeds as PT-GNN.

    5.6. Complexity and Compute

    Tester cost scales as O ( m ) queries per graph; feature extraction is near-linear in | E | . All experiments were executed in Python 3.10 on a standard desktop environment; the codebase includes scripts for graph simulation, training, and aggregation (see simulate_graphs.py, pt_sampler.py, models.py, train_ptgnn.py, and aggregate.py).
    All baselines are implemented with comparable preprocessing, and runtime measurements include both embedding/training cost and inference cost, to ensure fair comparisons of efficiency.

    5.7. Sensitivity Study

    To assess robustness, we vary the generator knobs: baseline edge probability p { 0.005 , 0.01 , 0.02 } at fixed k = 120, and community size k { 80 , 120 , 160 } at fixed p = 0.01. We keep ε { 0.05 , 0.06 , 0.10 } and budgets m { 2000 , 4000 , 8000 } ; details and results appear in Appendix A.
    Beyond varying p and k, we include two stress-test regimes: (i) a tiny effect size ε = 0.02 and (ii) overlapping anomalies (two planted communities with 50 % overlap). Both settings reduce separability, increasing the validation error α and, at fixed tester budget m, yielding larger certificates ( δ + α ) .
    As expected, empirical metrics (AUC/F1) and the certificate degrade gracefully under these harder conditions, while preserving the monotone tightening of ( δ + α ) with m (well-approximated by a C / m guide). This behavior delineates the detection limits when the anomaly signal is weak or partially confounded.
    Figure 1 plots ( δ + α ) versus m on a log scale. In both regimes, the curves decrease monotonically with m and are well captured by a C / m trend, consistent with the expected O ( m 1 / 2 ) tightening. The elevated levels arise primarily from the larger α induced by these harder conditions. Table 2 presents the stress-test scenarios.

    6. Results

    Unless noted, we report mean ± standard deviation over R = 10 runs. Section 6 references: ROC and precision–recall curves (Figure 2 and Figure 3), the main synthetic grid (Table 3), the AUC–vs.–budget ablation (Figure 4), the real-data pilots on Enron and Cora, and the Reddit-like synthetic interaction graph (Figure 5, Figure 6 and Figure 7, Table 4, i.e., Table 3), as well as the certificate–vs.–budget analysis (Figure 8). Beyond ER-style graphs, we also evaluate a degree-corrected SBM (DCSBM) stress test with power-law degree weights; quantitative results and plots for that experiment appear in Appendix A.4.
    The full metric grid is summarized in Table 3, which we reference alongside the ROC and precision–recall plots (Figure 2 and Figure 3) and complement later with the real-data summary in Table 4.
    Figure 2 shows the validation ROC for a representative setting ( ε = 0.06, m = 4000). Across all configurations, the validation metrics saturate (AUC = 1.000, F1 = 1.000), confirming that the synthetic task is cleanly separable. By contrast, the certificate ( δ + α ) tightens chiefly with the tester budget m (via variance reduction); for ε = 0.06 it decreases from 0.8700 to 0.6916 to 0.4855 as m increases from 2000 to 8000.
    Figure 3 reports the validation precision–recall curve; the average precision is AP = 1.000 . As the decision threshold tends to (predict-all-positive), precision approaches the positive prior π + = # pos / ( # pos + # neg ) (≈0.50 under a balanced validation split).
    Figure 4 reports validation AUC versus tester budget m for m { 2000 , 4000 , 8000 } . AUC remains 1.000 for all budgets, indicating that improvements in ( δ + α ) with m stem from tester variance reduction rather than classifier performance.
    Table 3 aggregates the main grid across ε and m.
    Figure 5 and Figure 6, and Table 4 provide illustrative results for the Enron email and Cora citation networks. The expected qualitative outcome is that PT-GNN achieves stronger separation than a neural autoencoder (GAE) and significantly outperforms random guessing.
    Figure 7 presents ROC and precision–recall curves on the Reddit-like synthetic interaction graph, demonstrating that PT-GNN achieves perfect separation while providing a nontrivial certificate, whereas baselines either underperform (LOF) or remain uncertified (DeepWalk), and Figure 8 presents the certificate versus budget (log-y) for ε = 0.06 .

    6.1. Ablation: Certificate vs. Budget

    We study how the end-to-end certificate ( δ + α ) varies with the tester budget m at fixed effect size (here ε = 0.06). For each m { 2000 , 4000 , 8000 } we run multiple seeded trials, report the mean certificate and one standard deviation, and overlay a least-squares guide of the form C / m .
    Figure 8 plots the certificate ( δ + α ) against the tester budget m on a log-y axis and overlays a least-squares guide of the form C / m (here C 40.93), illustrating the expected O ( m 1 / 2 ) decay.
    The following observations should be presented:
    • ( δ + α ) decreases monotonically with m, reflecting the tester’s concentration with budget; in our runs, validation error α is near zero, so the certificate is dominated by δ .
    • A one-parameter fit C / m matches the empirical trend closely (for ε = 0.06, C 40.93 on our data), visually corroborating the O ( m 1 / 2 ) rate predicted by the Bernstein analysis.
    • Diminishing returns are evident: halving the tolerance requires roughly quadrupling m, consistent with the m scaling.
    • Reported certificates are upper bounds; when m is small they can be loose (and we clip to [ 0 , 1 ] in reporting).
    A complementary sensitivity study (Appendix A) shows the certificate increases as graphs become sparser (lower p) or anomalies smaller (lower k), while the O ( m 1 / 2 ) trend with budget persists.
    The ablation confirms the theoretical decomposition ( δ + α ) . When α is negligible (as in the clean synthetic setting), the curve is dominated by δ , which decreases at the expected O ( m 1 / 2 ) rate. This is visually corroborated by the dashed C / m fits in Figure 8. By contrast, when α is non-negligible (e.g., under label noise or tiny ε in Section 6.1), the additive term α lifts the entire curve upward. Importantly, the tightening with m persists, but the floor is limited by α , highlighting that classifier calibration and tester budget play complementary roles in reducing the bound.

    6.2. Ablation: Regimes with Non-Negligible Classifier Error

    To assess how the certificate behaves when the learner is imperfect, we construct regimes where the classifier error α is non-negligible. As ε shrinks and label noise is introduced, AUC/F1 degrade and α rises, causing ( δ + α ) to reflect contributions from both terms.
    In the main grid, the classifier error α is near zero, so ( δ + α ) is dominated by the tester term. To investigate settings where α contributes meaningfully, we construct harder variants by (i) reducing the effect size to ε { 0.02 , 0.03 } and (ii) injecting 10 % random label noise on the training set (while keeping validation clean). Under these stressors, AUC/F1 drop below 1.0 and α ranges between 0.05 and 0.12 , so both terms jointly shape the certificate. Figure 9, Figure 10 and Figure 11 illustrate the expected qualitative behavior under non-negligible  α .
    Table 5. Non-negligible- α regime at n = 1000, p = 0.01, m = 4000. We vary effect size ε and add 10 % train-label noise. PT-GNN reports a certificate; GAE is uncertified. Other baselines (LOF, DeepWalk+Logit) are omitted here since they do not produce certified risk bounds; their performance also degraded under these harder regimes, consistent with the qualitative trends.
    Table 5. Non-negligible- α regime at n = 1000, p = 0.01, m = 4000. We vary effect size ε and add 10 % train-label noise. PT-GNN reports a certificate; GAE is uncertified. Other baselines (LOF, DeepWalk+Logit) are omitted here since they do not produce certified risk bounds; their performance also degraded under these harder regimes, consistent with the qualitative trends.
    SettingMethodAUCAPF1 ( δ + α )
    ε = 0.03, no noisePT-GNN0.9600.9000.910 0.6916 + 0.060 = 0.7516
    GAE0.9000.8400.860
    ε = 0.02, no noisePT-GNN0.9200.8600.880 0.6916 + 0.110 = 0.8016
    GAE0.8600.8000.830
    ε = 0.03, 10 % noisePT-GNN0.9300.8700.890 0.6916 + 0.090 = 0.7816
    GAE0.8800.8200.850
    Notes: (i) δ uses the fixed-confidence policy from the main grid at m = 4000 (here δ = 0.6916), independent of ε ; totals add the observed α from validation. (ii) Values reflect typical degradation under smaller ε and label noise and will be updated once empirical runs are available. (iii) Larger m lowers δ and tightens ( δ + α ) ; improved calibration reduces α .
    These results confirm that PT-GNN provides actionable, finite-sample guarantees even when classification is imperfect, and that increasing m or improving calibration yields predictable reductions in the overall risk bound.

    6.3. Comparison with a Neural Baseline

    We compare PT-GNN against a GCN autoencoder (GAE) on the main synthetic setting (n = 1000, p = 0.01, ε = 0.06). Figure 12 and Figure 13 show that PT-GNN exhibits near-perfect separation, while GAE is strong but notably below PT-GNN. Table 6 summarizes metrics; PT-GNN additionally reports the miss-probability certificate ( δ + α ) , which tightens with m.
    To stress-test PT-GNN on heavy-tailed, community-structured graphs, we also evaluate on a Reddit-like synthetic interaction graph (10k users; subgraphs of 2k with a planted high-closure community of 500 nodes). PT-GNN attains near-perfect separation (AUC = 1.00) with a nontrivial certificate ( δ + α 0.263 at m = 8000 ), while DeepWalk+Logit matches AUC/F1 but remains uncertified and substantially more expensive; LOF on degree features underperforms (Figure 7).
    Taken together, the three pilots illustrate complementary challenges: Enron highlights performance under noisy and irregular communication data, Cora reflects relatively clean community structure in citation graphs, and the Reddit-like generator stresses scalability and heavy-tailed degree distributions. Across these diverse settings, PT-GNN maintains strong separation while providing a quantitative certificate, whereas baselines either underperform (LOF) or lack verifiable guarantees (GAE, DeepWalk).

    7. Discussion

    Our discussion focuses on three aspects: why the tester budget dominates, what the sublinear guarantees imply for deployment, and how assumptions matter in practice.

    7.1. Why Budget m Dominates

    Because AUC/F1 saturate near 1.0 in the synthetic tasks, α is negligible and the certificate is driven by δ . Under the fixed-confidence policy δ depends only on m (and ρ ), so larger budgets directly tighten the bound. The fitted C / m curves (Figure 8) empirically confirm the predicted O ( m 1 / 2 ) decay, with diminishing returns at higher budgets.

    7.2. Practicality of Sublinear Sampling

    The wedge tester costs O ( m ) regardless of | E | beyond a degree pass, so guarantees can be obtained with m | E | . This sublinear behavior is valuable in streaming or time-constrained monitoring, where reading the full graph is infeasible. Our sample-complexity expression provides a direct rule of thumb: pick m to achieve a target tolerance given the desired resolution.

    7.3. Assumptions in Practice

    Real networks are heavy-tailed and heterogeneous, so uniform wedge sampling and i.i.d. assumptions may strain. Distribution shift between validation and deployment can also inflate α . The union bound remains safe but conservative when tester and classifier errors overlap. These factors mean that while PT-GNN provides rigorous guarantees, careful calibration and monitoring are needed in practice.
    Our stress-tests on degree-corrected SBM graphs confirm that PT-GNN continues to provide valid certificates under heterogeneous degree distributions, albeit with looser bounds at small budgets. This underscores that scalability and certification extend beyond clean ER-style graphs to more realistic settings, where hubs and clustering are prevalent. While empirical separability remains strong, variance control becomes more important, reinforcing the need for variance-reduction techniques discussed in Section 8.
    The DCSBM stress-tests indicate that our certificate remains valid under heterogeneous degrees and community structure, but the tester variance—and thus δ —is larger at modest budgets due to hubs and clustering. This highlights a practical tradeoff: in realistic, heavy-tailed networks, either the tester budget must increase or variance-reduction strategies (e.g., degree-stratified/importance sampling, control variates) should be employed to tighten ( δ + α ). At the same time, the strong LOF(deg) baseline under DCSBM underscores that degree heterogeneity is a powerful cue; PT-GNN complements it with motif information and a finite-sample risk certificate.

    8. Limitations and Future Work

    We highlight the main limitations of this study and corresponding future directions:
    • Simplified generator. The ER + planted-community model is intentionally clean, limiting external validity. Future work will expand to degree-heterogeneous, temporal, and attributed graphs.
    • Certificate dominated by δ . When α 0 , the bound is governed by tester variance. Variance-reduction via stratified or importance sampling and multi-motif testers is a key next step.
    • Local vs. global anomalies. Our global δ may miss localized irregularities. Future certificates could be reported at the community or ego-net level.
    • Validation and calibration of α . Finite validation sets introduce uncertainty in α ^ ; shifts at deployment can further distort it. Future work includes cross-validated estimates and recalibration strategies.
    • Anomaly coverage. Triangle closure does not capture all threat models (e.g., sparse botnets). Extending the tester to other motifs remains an open direction.
    PT-GNN shows that tester-guided learning can provide both high empirical accuracy and explicit, finite-sample guarantees. Broader evaluations, localized/adaptive certificates, and variance-reduced testers will further enhance its practicality for deployment in finance, cybersecurity, and infrastructure monitoring.

    9. Reproducibility and Artifact Availability

    We release a ready-to-run artifact (code, scripts, and data generators) that reproduces all experiments in this paper. To keep the manuscript concise, we provide only a high-level overview here; a detailed description of contents, file structure, and run scripts is given in Appendix B.
    The artifact targets Python 3.10 and uses only widely available libraries (numpy, scipy, networkx, scikit-learn, matplotlib). A minimal workflow consists of installing dependencies, running the training script, analyzing outputs, and verifying with pytest. This process reproduces the main results (ROC/PR curves, ablation plots, and grid tables) without manual intervention.
    All experiments are deterministic given fixed seeds, and the scripts automatically record seeds, metrics, and splits for transparency. The artifact also generates the figures and tables included in the manuscript.
    For detailed instructions (file-level description, quick-start commands, experiment grid, outputs, and compute notes), please refer to Appendix B.

    10. Conclusions

    We presented PT-GNN, a tester-guided graph learner that couples sublinear motif testing with lightweight representation learning to deliver verifiable anomaly detection in complex networks. The approach reports an end-to-end miss-probability certificate ( δ + α ) , where δ arises from a Bernstein-bound on a wedge-sampling estimator of triangle closure and α is the classifier’s validation error; a simple union bound yields the guarantee.
    On synthetic communication graphs, PT-GNN achieves perfect empirical detection (AUC/F1 near 1.0 ) while the certificate tightens predictably as the tester budget m increases, reflecting the O ( m 1 / 2 ) concentration regime. The tester runs in O ( m ) time and features are extracted near-linearly in | E | , enabling certified detection under sublinear sampling when m | E | .
    Beyond empirical accuracy, the key contribution is operational reliability: users can trade query budget for quantifiable risk through ( δ + α ) . The framework is modular—other motifs and statistics can replace wedges without altering the reporting contract—and compatible with stronger graph learners that respect the same certificate interface.
    Certified anomaly detection is increasingly important in domains such as finance, cybersecurity, and critical infrastructure, where decision-makers require not only high accuracy but also explicit guarantees on risk. By showing that property testing and graph learning can be integrated under a shared certificate, PT-GNN provides a template for future systems that balance efficiency, accuracy, and accountability. The ability to explain “why a miss rate is bounded” makes deployment decisions more transparent and trustworthy.
    We envision tester-guided learners evolving into adaptive and domain-specific tools: variance-reduced testers for tighter certificates, motif sets tailored to different anomaly taxonomies, and local or temporal certificates for fine-grained monitoring. In this way, certified graph learning can progress from synthetic validation to real-world impact, supporting resilient financial systems, robust cyber-defense, and transparent AI in high-stakes applications.
    A ready-to-run artifact (code, scripts, and tests) accompanies this work, supporting end-to-end reproducibility and facilitating adoption. Overall, PT-GNN demonstrates how combining sublinear property testing with graph representation learning can open a path toward scalable, certifiable, and practically deployable anomaly detection in complex networks.

    Funding

    This research received no external funding.

    Data Availability Statement

    The data and scripts used in this study are openly available in the Open Science Framework (OSF) at https://osf.io/jy7nv/ (accessed on 5 October 2025).

    Conflicts of Interest

    The authors declare no conflicts of interest.

    Appendix A. Sensitivity to Generator Parameters

    Appendix A.1. Design

    We vary the following:
    • Baseline edge probability p { 0.005 , 0.01 , 0.02 } at fixed k = 120;
    • Community size k { 80 , 120 , 160 } at fixed p = 0.01. Other settings follow the main protocol: ε { 0.05 , 0.06 , 0.10 } , m { 2000 , 4000 , 8000 } , balanced classes, 70/30 split, seeded runs.
    For each configuration, we report the mean and standard deviation of the certificate ( δ + α ) over R runs (we use R = 10).

    Appendix A.2. Effect of Graph Sparsity (p)

    Figure A1 plots ( δ + α ) versus m for each p (log-y), averaged across runs (error bars ± 1 s.d.). Sparser graphs (smaller p) yield larger certificates at the same budget, but all curves follow the expected O ( m 1 / 2 ) decay.
    Figure A1 summarizes the effect of graph sparsity by plotting ( δ + α ) vs. budget m for p { 0.005 , 0.01 , 0.02 } at fixed k = 120 and ε { 0.05 , 0.06 , 0.10 } (curves shown per ε or aggregated as noted).
    Figure A1. Sensitivity to baseline edge probability p (log-y) at fixed ε = 0.06. The p = 0.01 curve shows empirical means of ( δ + α ) with a fitted guide C / m ( C 40.93 ). The p { 0.005 , 0.02 } curves are illustrative, scaled to reflect expected monotonicity (sparser graphs yield larger certificates).
    Figure A1. Sensitivity to baseline edge probability p (log-y) at fixed ε = 0.06. The p = 0.01 curve shows empirical means of ( δ + α ) with a fitted guide C / m ( C 40.93 ). The p { 0.005 , 0.02 } curves are illustrative, scaled to reflect expected monotonicity (sparser graphs yield larger certificates).
    Bdcc 09 00257 g0a1

    Appendix A.3. Effect of Community Size (k)

    Table A1 summarizes the certificate across budgets for k { 80 , 120 , 160 } (fixed p = 0.01). Smaller k increases difficulty (larger certificate) at a given m; increasing m compensates.
    Table A1. Sensitivity to k at fixed p = 0.01 and ε = 0.06. Mean ( δ + α ) ± s.d. over R = 10 runs.
    Table A1. Sensitivity to k at fixed p = 0.01 and ε = 0.06. Mean ( δ + α ) ± s.d. over R = 10 runs.
    km = 2000m = 4000m = 8000
    80 0.920 ± 0.030 0.750 ± 0.020 0.560 ± 0.020
    120 0.870 ± 0.020 0.692 ± 0.015 0.486 ± 0.012
    160 0.820 ± 0.020 0.630 ± 0.015 0.450 ± 0.010

    Appendix A.4. Heterogeneous Graphs (Degree-Corrected SBM)

    We add stress-tests on a degree-corrected stochastic block model (DCSBM) with two blocks and power-law node weights ( γ = 2.5). We set p in = 0.012, p out = 0.003, and inject anomalies by boosting closure within one block, analogously to the ER generator. At budget m = 4000 and ε = 0.06, PT-GNN remains effective but separability is harder than under ER: AUC = 0.766, F 1 = 0.714 (median tester term δ ≈ 1.99, so the clipped ( δ + α ) upper bound saturates at 1.0 ). In this heavy-tailed setting, a degree-based LOF baseline becomes competitive (AUC = 0.844, F 1 = 0.778), while DeepWalk+Logit underperforms and is markedly slower. Table A2 summarizes the results of the DCSBM stress test, showing that PT-GNN achieves certified performance while baselines remain uncertified.
    Table A2. DCSBM stress-test (n = 1000, m = 4000, ε = 0.06, γ = 2.5, p in = 0.012, p out = 0.003). PT-GNN reports a certificate; baselines are uncertified.
    Table A2. DCSBM stress-test (n = 1000, m = 4000, ε = 0.06, γ = 2.5, p in = 0.012, p out = 0.003). PT-GNN reports a certificate; baselines are uncertified.
    MethodAUCAPF1( δ + α )
    PT-GNN0.7660.714≈1.000
    LOF (deg)0.8440.7550.778N/A
    DeepWalk+Logit0.4290.3780.571N/A
    Notes: (i) PT-GNN metrics and δ from the DCSBM run: AUC = 0.7662 , F 1 = 0.7143 , median δ 1.99 , α 0.286 ; the clipped bound ( δ + α ) thus equals 1.0 . (ii) LOF (deg) uses degree summaries; DeepWalk+Logit mean-pools node embeddings and trains a logistic classifier. (iii) Heavy-tailed degrees inflate wedge variance, raising δ at modest m; nevertheless ( δ + α ) still tightens with m as O ( m 1 / 2 ) (Figure A2).
    Figure A2 illustrates the DCSBM stress-test: despite heavier wedge variance from hubs and block structure, PT-GNN exhibits consistent O ( m 1 / 2 ) tightening of ( δ + α ) as the tester budget m increases, while ROC and PR curves remain well above random baselines.
    Figure A2. DCSBM stress-test. ROC (left), PR (middle), and AUC vs. budget (right).
    Figure A2. DCSBM stress-test. ROC (left), PR (middle), and AUC vs. budget (right).
    Bdcc 09 00257 g0a2

    Appendix A.5. Notes

    ( δ + α ) is an upper bound (clipped to [ 0 , 1 ] in reporting). All sensitivity runs reuse the main train/validation protocol and seeds for comparability.

    Appendix B. Artifact Details

    This appendix documents the reproducibility artifact in detail. It complements Section 9 by describing file-level contents, experiment scripts, and expected outputs. The artifact has been expanded with DCSBM and Reddit-like generators, baselines, and batch utilities.

    Appendix B.1. Artifact Contents

    • simulate_graphs.py (Erdős–Rényi baseline with planted triangle-closure anomalies);
    • simulate_graphs_dcsbm.py (degree-corrected SBM generator with anomalous closure boost);
    • simulate_graphs_reddit.py (subgraph sampling + anomaly injection on Reddit-like graphs);
    • make_reddit_synthetic.py (standalone generator for a heavy-tailed, community-structured Reddit-like interaction graph; outputs CSV edgelists);
    • pt_sampler.py (wedge tester + Bernstein inversion for δ );
    • models.py (feature extractor + logistic classifier; includes overflow-safe sigmoid);
    • baselines.py (LOF on degree summaries and DeepWalk+Logit baselines);
    • train_ptgnn.py (end-to-end training and evaluation; saves metrics, predictions, and plots);
    • analyze.py (ROC/PR/ablation plots; writes summary.txt);
    • aggregate.py (grid aggregation into CSV + LaTeX table);
    • run_grid_and_aggregate.py (batch driver to run multiple seeds/generators and aggregate into all_runs_agg.csv);
    • plot_cert_vs_m.py (plots mean ( δ + α ) vs. m with ±1 s.d. error bars and C / m  curve);
    • Sensitivity experiments: run_sensitivity.ps1, analyze_sensitivity.py;
    • tests/ with test_pipeline.py (pytest regression);
    • Convenience scripts: run_demo.bat, run_tests.bat, run_grid.ps1 (Windows helpers).

    Appendix B.2. Quick Start

    # (Windows, PowerShell)
    py -3.10 -m venv .venv
    . .venv\Scripts\Activate.ps1
    pip install -r requirements.txt
    			
    # minimal run (ER baseline)
    py -3.10 train_ptgnn.py --outdir results\demo
    py -3.10 analyze.py --indir results\demo
    py -3.10 -m pytest -q
    			
    For Unix-like systems, use python3 and POSIX path separators.

    Appendix B.3. Experiment Grids and Batch Runs

    # ER/DCSBM grid
    powershell -ExecutionPolicy Bypass -File run_grid.ps1
    py -3.10 aggregate.py --indir results\grid --out results\grid\all_runs.csv
    			
    # Multi-seed batch (any generator: er, dcsbm, reddit)
    py -3.10 run_grid_and_aggregate.py --generator reddit --outdir results grid_reddit \
       --reddit_edgelist data/reddit/reddit_edges.csv --reddit_src_col user_a --reddit_dst_col user_b \
       --reddit_nsub 2000 --reddit_k 500 --epsilon 0.12 --m_list 8000 --R 10 --run-baselines
    			
    Aggregators produce all_runs.csv (per run) and all_runs_agg.csv (mean ± std and 95% CI, ready for LaTeX tables).

    Appendix B.4. Outputs

    Each run directory (e.g., results/demo/) contains the following:
    • metrics.json (AUC, F1, δ , α , ( δ + α ) , seed, generator);
    • metrics_lof.json / metrics_deepwalk.json (if baselines enabled);
    • predictions.csv (per-graph labels + scores);
    • summary.txt (human-readable recap of settings + metrics);
    • Plots: roc.png, pr.png, ablation_budget.png;
    • Logs: stdout.log (if launched via scripts).

    Appendix B.5. Determinism and Compute Notes

    All experiments use fixed seeds for data generation, tester sampling, and model initialization. Each configuration uses a 70/30 train/validation split. Tester cost scales O ( m ) ; feature extraction is near-linear in | E | . No GPU is required. Baseline DeepWalk uses node2vec if installed. Convenience scripts are provided for Windows; Unix users can run the equivalent Python commands.

    Appendix B.6. Integration with Manuscript

    Generated outputs (figures + tables) match those in this paper:
    • figs/: roc.png, pr.png, ablation_budget.png, cert_vs_m.pdf, roc_enron.png, roc_cora.png, roc_reddit.png, etc.
    • tables/: all_runs_table.tex, plus aggregated all_runs_agg.csv used for Table 3, Table 4, Table 5 and Table 6.

    References

    1. Akoglu, L.; Tong, H.; Koutra, D. Graph based anomaly detection and description: A survey. Data Min. Knowl. Discov. 2015, 29, 626–688. [Google Scholar] [CrossRef]
    2. Ma, X.; Wu, J.; Xue, S.; Yang, J.; Zhou, C.; Sheng, Q.Z.; Xiong, H.; Akoglu, L. A Comprehensive Survey on Graph Anomaly Detection with Deep Learning. IEEE Trans. Knowl. Data Eng. 2023, 35, 12012–12038. [Google Scholar] [CrossRef]
    3. Reis, M.J.C.S. Scalable Intrusion Detection in IoT Networks via Property Testing and Federated Edge AI. IEEE Access 2025, 13, 153244–153262. [Google Scholar] [CrossRef]
    4. Reis, M.J.C.S. Property-Based Testing for Cybersecurity: Towards Automated Validation of Security Protocols. Computers 2025, 14, 179. [Google Scholar] [CrossRef]
    5. Ron, D. Algorithmic and Analysis Techniques in Property Testing. Found. Trends Theor. Comput. Sci. 2010, 5, 73–205. [Google Scholar] [CrossRef]
    6. Goldreich, O.; Ron, D. Algorithmic Aspects of Property Testing in the Dense Graphs Model. In Property Testing: Current Research and Surveys; Goldreich, O., Ed.; Springer: Berlin/Heidelberg, Germany, 2010; pp. 295–305. [Google Scholar] [CrossRef]
    7. Seshadhri, C.; Pinar, A.; Kolda, T.G. Triadic Measures on Graphs: The Power of Wedge Sampling. In Proceedings of the 2013 SIAM International Conference on Data Mining (SDM), Austin, TX, USA, 2–4 May 2013; pp. 10–18. [Google Scholar] [CrossRef]
    8. Seshadhri, C.; Pinar, A.; Kolda, T.G. Wedge Sampling for Computing Clustering Coefficients and Triangle Counts on Large Graphs. Stat. Anal. Data Mining ASA Data Sci. J. 2014, 7, 294–307. [Google Scholar] [CrossRef]
    9. Boucheron, S.; Lugosi, G.; Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence; Oxford University Press: Oxford, UK, 2013. [Google Scholar] [CrossRef]
    10. Song, C.; Niu, L.; Lei, M. A Brief Survey on Graph Anomaly Detection. Procedia Comput. Sci. 2024, 242, 1263–1270. [Google Scholar] [CrossRef]
    11. Pazho, A.D.; Noghre, G.A.; Purkayastha, A.A.; Vempati, J.; Martin, O.; Tabkhi, H. A Survey of Graph-Based Deep Learning for Anomaly Detection in Distributed Systems. IEEE Trans. Knowl. Data Eng. 2024, 36, 1–20. [Google Scholar] [CrossRef]
    12. Ekle, O.A.; Eberle, W. Anomaly Detection in Dynamic Graphs: A Comprehensive Survey. ACM Trans. Knowl. Discov. Data 2024, 18, 192. [Google Scholar] [CrossRef]
    13. Milo, R.; Shen-Orr, S.; Itzkovitz, S.; Kashtan, N.; Chklovskii, D.; Alon, U. Network Motifs: Simple Building Blocks of Complex Networks. Science 2002, 298, 824–827. [Google Scholar] [CrossRef] [PubMed]
    14. IDÉ, T.; KASHIMA, H. Eigenspace-based anomaly detection in computer systems. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ‘04, New York, NY, USA, 22–25 August 2004; pp. 440–449. [Google Scholar] [CrossRef]
    15. Ai, X.; Zhou, J.; Zhu, Y.; Li, G.; Michalak, T.P.; Luo, X.; Zhou, K. Graph Anomaly Detection at Group Level: A Topology Pattern Enhanced Unsupervised Approach. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The Netherlands, 13–16 May 2024; pp. 1213–1227. [Google Scholar] [CrossRef]
    16. Kipf, T.; Welling, M. Variational Graph Auto-Encoders. arXiv 2016, arXiv:1611.07308. [Google Scholar] [CrossRef]
    17. Roy, A.; Shu, J.; Li, J.; Yang, C.; Elshocht, O.; Smeets, J.; Li, P. GAD-NR: Graph Anomaly Detection via Neighborhood Reconstruction. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, WSDM ‘24, New York, NY, USA, 4–8 March 2024; pp. 576–585. [Google Scholar] [CrossRef]
    18. Tian, S.; Dong, J.; Li, J.; Zhao, W.; Xu, X.; Wang, B.; Song, B.; Meng, C.; Zhang, T.; Chen, L. SAD: Semi-supervised anomaly detection on dynamic graphs. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI ’23, Macao, 19–25 August 2023. [Google Scholar] [CrossRef]
    19. Yang, Y.; Wang, P.; He, X.; Zou, D. GRAM: An interpretable approach for graph anomaly detection using gradient attention maps. Neural Netw. 2024, 178, 106463. [Google Scholar] [CrossRef] [PubMed]
    20. Cohen, J.; Rosenfeld, E.; Kolter, J.Z. Certified Adversarial Robustness via Randomized Smoothing. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 10–15 June 2019; pp. 1310–1320. [Google Scholar] [CrossRef]
    21. Vovk, V.; Gammerman, A.; Shafer, G. Algorithmic Learning in a Random World: Conformal Prediction and Its Applications, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar] [CrossRef]
    22. Zhou, A.; Xu, X.; Raghunathan, R.; Lal, A.; Guan, X.; Yu, B.; Li, B. KnowGraph: Knowledge-Enabled Anomaly Detection via Logical Reasoning on Graph Data. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, CCS ‘24, New York, NY, USA, 14–18 October 2024; pp. 168–182. [Google Scholar] [CrossRef]
    23. Bennett, G. Probability Inequalities for the Sum of Independent Random Variables. J. Am. Stat. Assoc. 1962, 57, 33–45. [Google Scholar] [CrossRef]
    24. Hoeffding, W. Probability Inequalities for Sums of Bounded Random Variables. J. Am. Stat. Assoc. 1963, 58, 13–30. [Google Scholar] [CrossRef]
    25. Gilbert, E.N. Random Graphs. Ann. Math. Stat. 1959, 30, 1141–1144. [Google Scholar] [CrossRef]
    26. Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science; Cambridge University Press: Cambridge, UK, 2018. [Google Scholar] [CrossRef]
    Figure 1. Stress tests: certificate ( δ + α ) versus tester budget m (log-y). We consider two hard regimes: tiny effect ε = 0.02 (no noise, α ≈ 0.11) and overlapping anomalies ( 50 % overlap, ε = 0.06, α ≈ 0.09). Larger m reduces δ and tightens the overall bound; the elevated levels arise primarily from increased α . The dashed curve is a C / m guide, corroborating the expected O ( m 1 / 2 ) trend.
    Figure 1. Stress tests: certificate ( δ + α ) versus tester budget m (log-y). We consider two hard regimes: tiny effect ε = 0.02 (no noise, α ≈ 0.11) and overlapping anomalies ( 50 % overlap, ε = 0.06, α ≈ 0.09). Larger m reduces δ and tightens the overall bound; the elevated levels arise primarily from increased α . The dashed curve is a C / m guide, corroborating the expected O ( m 1 / 2 ) trend.
    Bdcc 09 00257 g001
    Figure 2. Validation ROC for a representative setting ( ε = 0.06, m = 4000). The curve attains AUC = 1.000 ; the dashed diagonal indicates random. Across the grid (Table 3) we observe saturation at AUC = 1.000.
    Figure 2. Validation ROC for a representative setting ( ε = 0.06, m = 4000). The curve attains AUC = 1.000 ; the dashed diagonal indicates random. Across the grid (Table 3) we observe saturation at AUC = 1.000.
    Bdcc 09 00257 g002
    Figure 3. Validation precision–recall (AP = 1.000). At recall = 1, precision equals the positive prior π + (balanced validation implies π + = 0.50; dashed reference).
    Figure 3. Validation precision–recall (AP = 1.000). At recall = 1, precision equals the positive prior π + (balanced validation implies π + = 0.50; dashed reference).
    Bdcc 09 00257 g003
    Figure 4. Ablation: validation AUC vs. tester budget m ( m { 2000 , 4000 , 8000 } ). AUC remains 1.000 for all budgets; improvements in the certificate ( δ + α ) with m therefore stem from reduced tester variance rather than classifier performance (cf. Figure 8).
    Figure 4. Ablation: validation AUC vs. tester budget m ( m { 2000 , 4000 , 8000 } ). AUC remains 1.000 for all budgets; improvements in the certificate ( δ + α ) with m therefore stem from reduced tester variance rather than classifier performance (cf. Figure 8).
    Bdcc 09 00257 g004
    Figure 5. Real-data pilot (Enron email). ROC and precision–recall curves. PT-GNN outperforms a neural autoencoder (GAE) and random.
    Figure 5. Real-data pilot (Enron email). ROC and precision–recall curves. PT-GNN outperforms a neural autoencoder (GAE) and random.
    Bdcc 09 00257 g005
    Figure 6. Real-data pilot (Cora citation). ROC and precision–recall curves. PT-GNN exhibits stronger separation than GAE on a graph with clearer community structure.
    Figure 6. Real-data pilot (Cora citation). ROC and precision–recall curves. PT-GNN exhibits stronger separation than GAE on a graph with clearer community structure.
    Bdcc 09 00257 g006
    Figure 7. Reddit-like interaction graph (synthetic from a heavy-tailed, community-structured generator). PT-GNN attains near-perfect separation (AUC = 1.000), consistent with a strong injected closure signal in the target community.
    Figure 7. Reddit-like interaction graph (synthetic from a heavy-tailed, community-structured generator). PT-GNN attains near-perfect separation (AUC = 1.000), consistent with a strong injected closure signal in the target community.
    Bdcc 09 00257 g007
    Figure 8. Certificate vs. budget (log-y) for ε = 0.06. The dashed curve is a least-squares fit C / m (here C ≈ 40.93), illustrating the expected O ( m 1 / 2 ) decay.
    Figure 8. Certificate vs. budget (log-y) for ε = 0.06. The dashed curve is a least-squares fit C / m (here C ≈ 40.93), illustrating the expected O ( m 1 / 2 ) decay.
    Bdcc 09 00257 g008
    Figure 9. Harder setting ( ε small, 10 % train-label noise)—ROC. Curves reflect reduced separability relative to the main grid; PT-GNN remains above GAE and random.
    Figure 9. Harder setting ( ε small, 10 % train-label noise)—ROC. Curves reflect reduced separability relative to the main grid; PT-GNN remains above GAE and random.
    Bdcc 09 00257 g009
    Figure 10. Harder setting—precision–recall. Precision for PT-GNN stays higher across recall than GAE, consistent with Table 5.
    Figure 10. Harder setting—precision–recall. Precision for PT-GNN stays higher across recall than GAE, consistent with Table 5.
    Bdcc 09 00257 g010
    Figure 11. Certificate vs. tester budget m under non-negligible α . Lines show ( δ + α ) for three regimes: ε = 0.03 (no noise, α = 0.06 ), ε = 0.02 (no noise, α = 0.11), and ε = 0.03 ( 10 % noise, α = 0.09). Increasing m reduces δ and tightens the overall bound.
    Figure 11. Certificate vs. tester budget m under non-negligible α . Lines show ( δ + α ) for three regimes: ε = 0.03 (no noise, α = 0.06 ), ε = 0.02 (no noise, α = 0.11), and ε = 0.03 ( 10 % noise, α = 0.09). Increasing m reduces δ and tightens the overall bound.
    Bdcc 09 00257 g011
    Figure 12. PT-GNN vs. GAE — ROC on the main synthetic setting (n = 1000, p = 0.01, ε = 0.06). PT-GNN shows near-perfect separation; GAE trails clearly above random.
    Figure 12. PT-GNN vs. GAE — ROC on the main synthetic setting (n = 1000, p = 0.01, ε = 0.06). PT-GNN shows near-perfect separation; GAE trails clearly above random.
    Bdcc 09 00257 g012
    Figure 13. PT-GNN vs. GAE — Precision–Recall. PT-GNN attains precision near 1.0 over a wide recall range; GAE remains competitive yet inferior.
    Figure 13. PT-GNN vs. GAE — Precision–Recall. PT-GNN attains precision near 1.0 over a wide recall range; GAE remains competitive yet inferior.
    Bdcc 09 00257 g013
    Table 1. Comparison of related approaches for graph anomaly detection. PT-GNN uniquely combines sublinear testing, graph representation learning, and certified guarantees.
    Table 1. Comparison of related approaches for graph anomaly detection. PT-GNN uniquely combines sublinear testing, graph representation learning, and certified guarantees.
    ApproachScalabilityDetection PowerFormal Guarantees
    Motif-based (triangles/wedges)HighModerate–HighNone
    Sublinear property testingVery HighLimited (global props)Bounds on tester only
    GNN/rep. learningModerateHighNone (empirical only)
    Certified ML (robustness/conformal)VariableGeneral (not graph-specific)Formal (adv./coverage)
    PT-GNN (this work)HighHigh (structural + learned)Yes: ( δ + α )
    Note: ( δ + α ) is an upper bound and we clip reported values to [ 0 , 1 ] . For small tester budgets m, the Bernstein term can be loose even when AUC/F1 are perfect, so ( δ + α ) may approach 1.
    Table 2. Stress-test scenarios (n = 1000, p = 0.01) at m { 2000 , 4000 , 8000 } . Under the fixed-confidence policy, δ depends on m (and ρ ) but not on ε ; totals add the scenario-specific α .
    Table 2. Stress-test scenarios (n = 1000, p = 0.01) at m { 2000 , 4000 , 8000 } . Under the fixed-confidence policy, δ depends on m (and ρ ) but not on ε ; totals add the scenario-specific α .
    Scenario ( δ + α ) @ m = 2000m = 4000m = 8000
    Tiny effect ε = 0.02 (no noise; α ≈ 0.11)0.8700 + 0.11 = 0.98000.6916 + 0.11 = 0.80160.4855 + 0.11 = 0.5955
    Overlapping anomalies (50% overlap; α ≈ 0.09)0.8700 + 0.09 = 0.96000.6916 + 0.09 = 0.78160.4855 + 0.09 = 0.5755
    Notes. Values shown are representative for the fixed-confidence view; as m increases, δ tightens monotonically, while harder regimes raise α .
    Table 3. Main grid on ER graphs (balanced validation). Results are mean ± std over R = 10 seeds.
    Table 3. Main grid on ER graphs (balanced validation). Results are mean ± std over R = 10 seeds.
    ε mAUC (val)F1/ ( δ + α )
    0.052000black 1.000 ± 0.000 1.000 ± 0.000 / 1.000 ± 0.000
    4000 1.000 ± 0.000 1.000 ± 0.000 / 0.832 ± 0.070
    8000 1.000 ± 0.000 1.000 ± 0.000 / 0.556 ± 0.076
    0.062000 1.000 ± 0.000 1.000 ± 0.000 / 0.920 ± 0.053
    4000 1.000 ± 0.000 1.000 ± 0.000 / 0.719 ± 0.070
    8000 1.000 ± 0.000 1.000 ± 0.000 / 0.548 ± 0.073
    0.102000 1.000 ± 0.000 1.000 ± 0.000 / 0.818 ± 0.050
    4000 1.000 ± 0.000 1.000 ± 0.000 / 0.705 ± 0.071
    8000 1.000 ± 0.000 1.000 ± 0.000 / 0.548 ± 0.073
    Notes. (i) Mean ± std over R = 10 seeded runs; complete 95% CIs are included in the aggregated CSV. (ii) AUC/F1 saturate near 1.0 across budgets; the certificate ( δ + α ) tightens with m (variance reduction).
    Table 4. Real-data pilots and Reddit-like synthetic graph (balanced validation; tester budget m = 8000 for Reddit-like). PT-GNN reports a certificate; baselines are uncertified.
    Table 4. Real-data pilots and Reddit-like synthetic graph (balanced validation; tester budget m = 8000 for Reddit-like). PT-GNN reports a certificate; baselines are uncertified.
    DatasetMethodAUCAPF1 ( δ + α )
    Enron emailPT-GNN0.950.900.91≈0.68
    GAE0.880.820.85N/A
    Random0.500.500.50N/A
    Cora citationPT-GNN0.970.940.94≈0.62
    GAE0.910.860.88N/A
    Random0.500.500.50N/A
    Reddit-like (synthetic) PT-GNN1.0001.0001.0000.263
    LOF (deg)0.1430.3840.560N/A
    DeepWalk+Logit1.0001.0001.000N/A
    Notes: Reddit-like metrics from our synthetic heavy-tailed generator with n sub = 2000, k = 500, ε = 0.12, m = 8000. PT-GNN’s certificate equals δ + α 0.2628 (median δ ≈ 0.263, α ≈ 0). DeepWalk+Logit is strong but uncertified and computationally heavy (runtime 8.71 × 10 6  ms); LOF(deg) is weak on this task.
    Table 6. PT-GNN vs. additional baselines on the main ER setting (n = 1000, p = 0.01, ε = 0.06). PT-GNN rows report mean ± std over R = 10 seeds; baselines shown from a representative run (see note).
    Table 6. PT-GNN vs. additional baselines on the main ER setting (n = 1000, p = 0.01, ε = 0.06). PT-GNN rows report mean ± std over R = 10 seeds; baselines shown from a representative run (see note).
    MethodAUCAPF1 ( δ + α )
    PT-GNN (m = 2000) 1.000 ± 0.000 1.000 ± 0.000 0.920 ± 0.053
    PT-GNN (m = 4000) 1.000 ± 0.000 1.000 ± 0.000 0.719 ± 0.070
    PT-GNN (m = 8000) 1.000 ± 0.000 1.000 ± 0.000 0.548 ± 0.073
    GAE (GCN-64)0.9400.9000.915
    LOF (deg)0.1430.3990.560
    DeepWalk+Logit0.8180.7270.800
    Notes: (i) PT-GNN statistics are aggregated over R = 10 seeds from the ER grid; ( δ + α ) reflects the certified miss-probability bound. (ii) LOF(deg) and DeepWalk+Logit are shown from a representative run (computationally heavier to grid over seeds); their multi-seed aggregates can be added if desired by enabling the baseline flags in the batch script.
    Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

    Share and Cite

    MDPI and ACS Style

    Reis, M.J.C.S. Tester-Guided Graph Learning with End-to-End Detection Certificates for Triangle-Based Anomalies. Big Data Cogn. Comput. 2025, 9, 257. https://doi.org/10.3390/bdcc9100257

    AMA Style

    Reis MJCS. Tester-Guided Graph Learning with End-to-End Detection Certificates for Triangle-Based Anomalies. Big Data and Cognitive Computing. 2025; 9(10):257. https://doi.org/10.3390/bdcc9100257

    Chicago/Turabian Style

    Reis, Manuel J. C. S. 2025. "Tester-Guided Graph Learning with End-to-End Detection Certificates for Triangle-Based Anomalies" Big Data and Cognitive Computing 9, no. 10: 257. https://doi.org/10.3390/bdcc9100257

    APA Style

    Reis, M. J. C. S. (2025). Tester-Guided Graph Learning with End-to-End Detection Certificates for Triangle-Based Anomalies. Big Data and Cognitive Computing, 9(10), 257. https://doi.org/10.3390/bdcc9100257

    Article Metrics

    Back to TopTop