1. Introduction
Anomaly detection on complex networks underpins applications in communication security, finance, and cyber-physical systems. Recent progress in graph representation learning has produced powerful detectors, yet most methods emphasize empirical accuracy while offering limited certification of reliability under sampling constraints [1,2]. This gap is acute in high-throughput settings where full-graph access is costly and decision risk must be controlled.
In our earlier work on scalable intrusion detection via property testing and federated edge AI [3] and on property-based testing for cybersecurity [4], we demonstrated how lightweight testing and distributed learning can enhance anomaly detection in resource-constrained settings. However, those approaches primarily established empirical effectiveness and automation benefits, without providing a formal certificate on the probability of missed anomalies. The present study advances this line of research by introducing a tester-guided graph learner that incorporates provable sampling guarantees, moving from heuristic detection toward verifiable reliability.
Beyond methodological novelty, the ability to report certified detection risk has direct impact in practice. In domains such as financial fraud, network security, and critical infrastructure monitoring, stakeholders require not only accurate predictions but also quantifiable guarantees under bounded sampling budgets. Traditional empirical detectors cannot offer such assurances, leaving decision-makers to rely on heuristic thresholds. By contrast, our certificate provides an explicit, finite-sample upper bound on missed detections, enabling risk-aware deployment in high-stakes settings.
We address this need with a tester-guided learner, PT-GNN, that couples sublinear graph property testing with lightweight learning to obtain an end-to-end miss-probability certificate. Property testing offers principled, query-efficient procedures to assess global graph structure from local probes [5,6], and triangle-based motifs (closure/cluster structure) are canonical indicators of community coherence and irregular interaction patterns. We instantiate the tester via wedge sampling for triangle statistics, a technique with provable accuracy guarantees and practical efficiency on large graphs [7,8].
Unlike existing anomaly detection methods that either rely solely on empirical accuracy (e.g., GNN-based detectors) or offer generic certified guarantees without graph-specific considerations (e.g., randomized smoothing, conformal prediction), our approach uniquely integrates property testing with graph learning. Specifically, PT-GNN is the first framework to couple wedge sampling with lightweight graph classification to produce an end-to-end miss-probability certificate. This moves beyond heuristic scoring or robustness guarantees by providing task-specific, finite-sample reliability bounds directly tied to triangle-based anomalies. Triangles are a canonical indicator of collusion and dense substructures: patterns observed in fraud rings, insider trading, and coordinated botnets. By targeting anomalies characterized by abnormal closure frequency, PT-GNN bridges a critical gap: it provides both strong empirical detection and formally verifiable guarantees under sampling constraints, ensuring that decisions can be deployed with quantifiable risk.
Our certificate decomposes the probability of missing an anomaly into two additive terms: a tester uncertainty term $\delta_{\mathrm{test}}$, obtained from a Bernstein-type concentration bound on the wedge-sampling estimator, and a classifier validation error $\hat{\alpha}$ from supervised evaluation. The overall miss-probability is thus upper-bounded by $\delta_{\mathrm{test}} + \hat{\alpha}$, directly linking sampling budget to decision risk via nonasymptotic concentration results [9].
Our contributions are as follows:
- We introduce PT-GNN, a tester-guided graph learner that integrates wedge-sampling triangle closure into training and reporting.
- We derive an end-to-end miss-probability certificate that combines a Bernstein-style bound for the tester with empirical validation error for the learner.
- On synthetic communication-graph benchmarks, we show perfect detection (AUC/F1) alongside systematically tightening certificates as the tester budget increases, demonstrating practical verifiability under bounded sampling.
- We extend our prior work [3,4] by providing, for the first time, formally certified detection guarantees in addition to strong empirical performance.
2. Related Work
Research on anomaly detection in graphs intersects multiple lines of investigation. We briefly review relevant contributions on motif-based detection, sublinear property testing, graph representation learning, and certified machine learning.
Recent surveys highlight the diversity of approaches to graph anomaly detection. Song [10] provides a concise 2024 overview, while Pazho et al. [11] and Ekle and Eberle [12] review deep graph anomaly detection and dynamic-graph methods, respectively. These works frame the broader landscape to which PT-GNN contributes.
2.1. Motif-Based Anomaly Detection
Network motifs such as wedges and triangles are central in characterizing community structure and irregularities. Early work showed that deviations in triangle closure often signal spurious or anomalous links [13]. Methods exploiting subgraph patterns and clustering coefficients have been applied in social and communication networks [1,14]. Efficient wedge-sampling estimators enable scalable motif statistics on large graphs [7,8]. Our work adopts wedge-based testing as a backbone, but departs from heuristic anomaly scores by coupling motif statistics with an explicit concentration bound that contributes to a miss-probability certificate.
Complementary efforts include Ai et al. [15], who study group-level anomalies via topology patterns, broadening anomaly definitions beyond wedges and triangles.
2.2. Sublinear Graph Property Testing
Graph property testing provides query-efficient algorithms to approximate global properties from local samples [5,6]. Applications range from connectivity and bipartiteness testing to motif frequency estimation. Concentration inequalities have been leveraged to provide probabilistic guarantees on tester outputs [9]. In security, property testing has recently been explored for protocol validation and intrusion detection in distributed settings [3,4]. PT-GNN extends these ideas by embedding a tester directly into a learning pipeline, enabling certificates that span both sampling and classification.
2.3. Graph Representation Learning for Anomalies
Graph neural networks (GNNs) and representation models have been widely studied for anomaly detection [2]. Approaches range from autoencoder-based detection [16] to community-preserving embeddings and spectral methods. While effective, most methods remain empirical and do not quantify detection risk. Prior work on federated and edge-based anomaly detection [3] highlights the need for lightweight and distributed solutions, yet still lacks verifiable guarantees. PT-GNN contributes to this line by combining efficient motif-level queries with certifiable bounds, offering both scalability and formal reliability.
Recent methods continue to diversify. Roy et al. [17] reconstruct local neighborhoods for anomaly scores, while Tian et al. [18] introduce semi-supervised detection on dynamic graphs. Yang et al. [19] propose GRAM, an interpretable gradient-attention approach that emphasizes model transparency in anomaly detection.
2.4. Certified Learning and Detection Guarantees
Certified machine learning has recently sought to quantify model reliability beyond empirical metrics. Randomized smoothing offers provable robustness against adversarial perturbations [20], while conformal prediction provides finite-sample confidence sets [21]. In anomaly detection, guarantees remain scarce, with most work focusing on distribution-free confidence intervals or heuristic thresholds. By introducing a $\delta_{\mathrm{test}} + \hat{\alpha}$ certificate that combines tester variance with validation error, PT-GNN situates itself within this emerging paradigm of verifiable detection.
Knowledge-enabled frameworks such as KnowGraph [22] further show how domain knowledge can be integrated with graph anomaly detection, offering an orthogonal perspective to certified reliability.
To summarize the positioning of our approach relative to prior work, Table 1 contrasts motif-based, property-testing, GNN-based, and certified ML methods. As the table highlights, while existing approaches offer either scalability or strong empirical performance, none provide end-to-end formal guarantees for anomaly detection in complex networks. PT-GNN uniquely integrates sublinear property testing with representation learning to deliver both scalability and certified reliability.
Taken together, these comparisons highlight the position of PT-GNN relative to the field. Unlike motif-based anomaly detectors, PT-GNN moves beyond heuristic subgraph statistics by embedding a wedge-based tester into a certified learning pipeline. Compared with property-testing approaches, it extends guarantees from the tester alone to the entire end-to-end detection process. Relative to GNN-based methods, it preserves scalability while offering formal reliability that empirical models cannot provide. Finally, in contrast to certified ML frameworks such as randomized smoothing or conformal prediction, PT-GNN is graph-specific and tailored to triangle-driven irregularities. At the same time, we acknowledge that PT-GNN shares certain limitations with these families, such as sensitivity to anomaly definitions and dependence on validation data for calibration. This synthesis underscores the contribution of PT-GNN: it is, to our knowledge, the first approach to provide scalable, triangle-specific, and formally certified anomaly detection in complex networks.
3. Method
We detail PT-GNN’s components and the derivation of the end-to-end miss-probability certificate, from the problem formulation to the wedge tester, classifier, and computational cost.
3.1. Problem Setting
Let $G=(V,E)$ be a simple undirected graph and let a wedge be an unordered length-2 path $(u,v,w)$ with edges $\{u,v\},\{v,w\}\in E$. A wedge is closed if $\{u,w\}\in E$, otherwise it is open. For a graph $G$, define the triangle-closure probability as follows:
$$\tau(G) \;=\; \Pr_{(u,v,w)\,\sim\,\mathrm{Unif}(\mathrm{wedges}(G))}\big[\{u,w\}\in E\big].$$
Intuitively, $\tau(G)$ quantifies the tendency of the network to form closed triads. In many real-world settings, unusually high closure rates can reveal coordinated or collusive behavior. For example, fraud rings and insider-trading groups often create tightly interconnected communities, while botnet controllers may generate bursts of dense triadic communication. Conversely, benign background traffic or random interactions tend to produce wedges that remain open. Thus, differences in $\tau(G)$ between normal and anomalous graphs capture structural irregularities that are difficult to detect from degree information alone.
We consider i.i.d. labeled samples $(G_i, y_i)$, with $y_i=1$ indicating an anomaly class characterized by overexpressed triangle closure. Let $\tau_0=\mathbb{E}[\tau(G)\mid y=0]$ and $\tau_1=\mathbb{E}[\tau(G)\mid y=1]$, and assume a separation margin $\Delta=\tau_1-\tau_0>0$.
We report certificates at a user-selected resolution $\varepsilon_0$, interpreted as the minimal practically relevant closure gap for positives.
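For intuition, $\tau(G)$ can be computed exactly as the closed-wedge fraction: with $T$ triangles and $W=\sum_v \binom{d_v}{2}$ wedges, $\tau(G)=3T/W$. The following is a minimal sketch with networkx; the helper name closure_probability is ours, not part of the released artifact.

```python
# Exact triangle-closure probability tau(G) = 3 * (#triangles) / (#wedges).
# Illustrative helper (not from the paper's artifact).
import networkx as nx

def closure_probability(G: nx.Graph) -> float:
    """Fraction of wedges (length-2 paths) that are closed into triangles."""
    wedges = sum(d * (d - 1) // 2 for _, d in G.degree())
    if wedges == 0:
        return 0.0
    triangles = sum(nx.triangles(G).values()) // 3  # each triangle is counted at 3 nodes
    return 3 * triangles / wedges

G = nx.erdos_renyi_graph(1000, 0.01, seed=0)
print(closure_probability(G))  # close to p = 0.01 for G(n, p), as noted in Section 5.1
```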
3.2. Wedge-Sampling Tester and Concentration Bound
Given a query budget $m$, we draw $m$ wedges independently and uniformly at random. (Uniform wedge sampling can be implemented in $O(1)$ per query by sampling a center $v$ with probability proportional to $\binom{d_v}{2}$ and then two distinct neighbors of $v$ uniformly at random; see [7,8].) For each sampled wedge $i$, define $X_i\in\{0,1\}$ as the indicator that the wedge is closed. The unbiased estimator of $\tau(G)$ is as follows:
$$\hat{\tau}_m \;=\; \frac{1}{m}\sum_{i=1}^{m} X_i.$$
Since the $X_i$ are bounded, a Bernstein-type inequality controls the tail of $\hat{\tau}_m$ around $\tau(G)$ [9]:
$$\Pr\!\big[\,|\hat{\tau}_m-\tau(G)|\ge\varepsilon\,\big]\;\le\;2\exp\!\left(-\frac{m\varepsilon^{2}}{2\sigma^{2}+2\varepsilon/3}\right),\qquad \sigma^{2}=\mathrm{Var}(X_i).\tag{1}$$
In practice, one may plug in the conservative bound $\sigma^{2}\le 1/4$ or use an empirical variance estimate.
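A hedged sketch of the sampler and estimator described above; wedge_estimate is an illustrative name (the artifact's pt_sampler.py may differ), and for brevity the sampling weights are recomputed per draw rather than amortized.

```python
# Uniform wedge sampling: pick a center v with probability proportional to
# C(deg(v), 2), then two distinct neighbors of v uniformly; X_i indicates closure.
import random
import networkx as nx

def wedge_estimate(G: nx.Graph, m: int, rng: random.Random) -> float:
    """Unbiased estimate of tau(G) from m i.i.d. uniform wedges."""
    nodes = list(G.nodes())
    weights = [G.degree(v) * (G.degree(v) - 1) / 2 for v in nodes]
    closed = 0
    for _ in range(m):
        v = rng.choices(nodes, weights=weights, k=1)[0]  # O(n) here; a precomputed
        u, w = rng.sample(list(G.neighbors(v)), 2)       # cumulative table is faster
        closed += G.has_edge(u, w)                       # X_i for the sampled wedge
    return closed / m

rng = random.Random(0)
G = nx.erdos_renyi_graph(1000, 0.01, seed=0)
print(wedge_estimate(G, 4000, rng))  # concentrates around tau(G), here ~0.01
```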
3.3. Reporting
We expose the tester’s uncertainty through a single scalar via either of two equivalent reporting modes:
- (1) Fixed confidence: given a user risk level $\delta$, define $\varepsilon(m,\delta)$ as the smallest tolerance for which (1) guarantees confidence $1-\delta$.
- (2) Fixed effect size: given a task margin $\varepsilon_0$ (e.g., the minimal practically relevant triangle-closure gap), report $\delta_{\mathrm{test}}(m,\varepsilon_0)$, the tester’s deviation probability at resolution $\varepsilon_0$.
Both are monotone in $m$ and interchangeable for presentation; our code supports either policy and computes $\varepsilon(m,\delta)$ by direct inversion of (1).
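A minimal sketch of both reporting modes, assuming the Bernstein tail in (1) with the conservative $\sigma^2\le 1/4$; the closed form for $\varepsilon(m,\delta)$ solves the quadratic obtained by setting the tail equal to $\delta$ (function names are illustrative).

```python
import math

def delta_test(m: int, eps0: float, sigma2: float = 0.25) -> float:
    """Fixed effect size: deviation probability at resolution eps0, clipped to 1."""
    return min(1.0, 2.0 * math.exp(-m * eps0**2 / (2.0 * sigma2 + 2.0 * eps0 / 3.0)))

def eps_tol(m: int, delta: float, sigma2: float = 0.25) -> float:
    """Fixed confidence: smallest eps with delta_test(m, eps) <= delta."""
    L = math.log(2.0 / delta)
    return (L / 3.0 + math.sqrt((L / 3.0) ** 2 + 2.0 * m * sigma2 * L)) / m

print(delta_test(4000, 0.06))  # exponentially small in m at fixed eps0
print(eps_tol(4000, 0.05))     # ~ sqrt(log(1/delta)/m) up to constants
```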
3.4. Classifier and End-to-End Certificate
PT-GNN consumes structural features (motif counts/ratios, degree statistics, and summary statistics from the tester) and outputs a score $s(G)\in[0,1]$. A threshold $t$ is selected on a validation set to optimize a target metric (e.g., F1). Let $\hat{\alpha}$ denote the observed validation error (e.g., the false-negative rate at the tuned threshold). We now upper-bound the missed-detection probability for positives with separation at least $\varepsilon_0$.
3.5. Certificate
Let $\mathcal{E}_{\mathrm{test}}$ be the event that the tester’s deviation exceeds the resolution: $\mathcal{E}_{\mathrm{test}}=\{|\hat{\tau}_m-\tau(G)|>\varepsilon_0\}$, so $\Pr[\mathcal{E}_{\mathrm{test}}]\le\delta_{\mathrm{test}}(m,\varepsilon_0)$ by (1). Let $\mathcal{E}_{\mathrm{clf}}$ be the event that the classifier errs given features consistent with the tester (estimated by $\hat{\alpha}$ on validation). For any $G$ with $y=1$ and $\tau(G)\ge\tau_0+\varepsilon_0$, a miss can only occur if either the tester deviates ($\mathcal{E}_{\mathrm{test}}$) or the classifier mispredicts ($\mathcal{E}_{\mathrm{clf}}$). By the union bound,
$$\Pr[\mathrm{miss}] \;\le\; \delta_{\mathrm{test}}(m,\varepsilon_0) + \hat{\alpha},$$
which is the end-to-end miss-probability certificate. Here $\delta_{\mathrm{test}}$ is reported via either the fixed-confidence or fixed-effect-size policy above. In experiments we report the certificate alongside standard metrics (AUC/F1).
3.6. Computational Complexity
Wedge sampling costs $O(m)$ time after an initial degree pass, and $O(1)$ memory beyond reservoir state. Feature extraction over neighborhoods is near-linear in $|E|$. The overall pipeline supports sublinear sampling whenever $m\ll|E|$, with end-to-end cost dominated by tester queries plus the (lightweight) classifier training.
4. Theoretical Notes
This section formalizes the reliability of PT-GNN by deriving a finite-sample, end-to-end miss-probability certificate that combines a Bernstein bound for the wedge-sampling tester with the classifier’s validation error [9,23,24].
We proceed step by step. First, we state the assumptions under which the analysis holds. Then we quantify the probability that the wedge tester deviates from the true closure rate. Finally, we combine this deviation with the classifier error using a simple union bound.
The following assumptions are made:
- A1: We observe i.i.d. labeled graphs $(G_i, y_i)$; positives ($y=1$) satisfy an effect-size (margin) condition $\tau(G)\ge\tau_0+\varepsilon_0$, where $\tau(G)$ is the triangle-closure probability and $\tau_0=\mathbb{E}[\tau(G)\mid y=0]$.
- A2: On each graph, the tester draws $m$ i.i.d. wedges uniformly at random and reports the sample mean of the “closed-wedge” indicator.
- A3: The classifier threshold is tuned on a validation set disjoint from training; $\alpha$ denotes its true (population) missed-detection rate when fed the same feature pipeline (including tester summaries).
Assumptions (A1)–(A3) formalize the setting needed for a clean miss-probability certificate, but several practical nuances are worth noting:
On (A1): i.i.d. graphs and margin condition. In deployment, graphs may arrive from a drifting process (e.g., seasonal patterns or evolving user behavior). While our certificate targets the distribution seen at validation time, distribution shift can inflate $\alpha$. In practice, we monitor calibration on a sliding window and re-estimate $\hat{\alpha}$ periodically (cf. Section 5.4); our finite-sample bound on $\alpha$ (below) remains applicable. The effect-size condition $\tau(G)\ge\tau_0+\varepsilon_0$ encodes the task resolution; if smaller effects are relevant, one can either increase $m$ or lower $\varepsilon_0$ and accept a looser $\delta_{\mathrm{test}}$.
On (A2): uniform i.i.d. wedge sampling. Heavy-tailed degree distributions can create effective dependencies if wedges are sampled without replacement or via hub-heavy neighborhoods. We enforce with-replacement sampling to maintain independence and may apply a finite-population correction when system constraints require without-replacement sampling. Variance can be reduced without bias by degree-stratified or importance-weighted wedge sampling, together with unbiased reweighting; the same certificate form holds with $\sigma^2$ replaced by the stratified variance, and empirical-Bernstein variants can tighten (1) by exploiting observed variance.
On (A3): validation-derived $\hat{\alpha}$. The classifier error is estimated under the same feature pipeline used at test time (including tester summaries), and we provide an explicit finite-sample correction (Hoeffding). Class imbalance and threshold drift are handled via periodic threshold re-tuning on a validation buffer and, when needed, calibration (e.g., isotonic/Platt). Selective prediction (abstaining under uncertainty) is compatible with our reporting by accounting for abstention as a separate operating point.
Union bound conservativeness. Our guarantee does not assume independence between tester deviation and classifier error; the union bound is deliberately conservative and may overestimate risk when the events overlap. This is a safety margin, not a weakness: the true miss rate is typically lower than $\delta_{\mathrm{test}}+\hat{\alpha}$.
These considerations do not alter the structure of the certificate; they primarily affect constants in $\delta_{\mathrm{test}}$ (via variance and sampling policy) and the empirical estimate of $\hat{\alpha}$ (via validation design). We provide operational guidance in the Discussion and report sensitivity to sampling budget and graph regime in the Experiments.
With these assumptions in place, we can state the main result: the probability of missing an anomaly is at most the sum of two terms, one from tester deviation and one from classifier error.
Theorem 1 (Miss-Probability Certificate).
Under (A1)–(A3), for any anomalous $G$ with $\tau(G)\ge\tau_0+\varepsilon_0$, the missed-detection probability satisfies the following:
$$\Pr[\mathrm{miss}] \;\le\; \delta_{\mathrm{test}}(m,\varepsilon_0) + \alpha,$$
where $\delta_{\mathrm{test}}(m,\varepsilon_0)$ bounds the tester’s deviation event via a Bernstein-type inequality and $\alpha$ is the classifier’s true validation error.
Proof. The proof follows a simple structure. First we bound the probability that the tester’s estimate of triangle closure deviates by more than $\varepsilon_0$ (event $\mathcal{E}_{\mathrm{test}}$). Then we consider the probability that the classifier mispredicts (event $\mathcal{E}_{\mathrm{clf}}$). A miss can occur only if either of these events happens, so we apply the union bound.
Let $\mathcal{E}_{\mathrm{test}}=\{|\hat{\tau}_m-\tau(G)|>\varepsilon_0\}$. By Bernstein’s inequality for bounded variables (here Bernoulli), for some function $\delta_{\mathrm{test}}(m,\varepsilon_0)$ nonincreasing in $m$, the following is true:
$$\Pr[\mathcal{E}_{\mathrm{test}}] \;\le\; \delta_{\mathrm{test}}(m,\varepsilon_0) \;=\; 2\exp\!\left(-\frac{m\varepsilon_0^{2}}{2\sigma^{2}+2\varepsilon_0/3}\right),$$
with $\sigma^{2}=\mathrm{Var}(X_i)\le 1/4$. Let $\mathcal{E}_{\mathrm{clf}}$ denote the event that the classifier mislabels $G$ when the tester features are within tolerance (the same pipeline used for validation). A miss can occur only if either $\mathcal{E}_{\mathrm{test}}$ or $\mathcal{E}_{\mathrm{clf}}$ happens; hence, by the union bound, $\Pr[\mathrm{miss}]\le\Pr[\mathcal{E}_{\mathrm{test}}]+\Pr[\mathcal{E}_{\mathrm{clf}}]\le\delta_{\mathrm{test}}(m,\varepsilon_0)+\alpha$. □
4.1. Operational Reporting and Two Parameterizations
The tester uncertainty term can be reported in two equivalent ways, depending on whether the user fixes the desired risk level or the effect size of interest.
Fixed-confidence: given risk level $\delta$, report the smallest tolerance $\varepsilon(m,\delta)$ s.t. $\delta_{\mathrm{test}}(m,\varepsilon)\le\delta$ (this scales as $O(\sqrt{\log(1/\delta)/m})$);
Fixed-effect-size: given $\varepsilon_0$ (e.g., task margin), report $\delta_{\mathrm{test}}(m,\varepsilon_0)$ (this decays exponentially in $m$).
We use the fixed-confidence parameterization in all experiments (see Section 5.2).
4.2. Finite-Sample ($\hat{\alpha}$) Version
In practice we do not know the true $\alpha$ but estimate it from a finite validation set. This introduces additional sampling error, which we control using Hoeffding’s inequality.
In practice, $\alpha$ is estimated on $n_{\mathrm{val}}$ held-out graphs as $\hat{\alpha}$. By Hoeffding’s inequality for Bernoulli errors, with probability at least $1-\delta'$ (over the draw of the validation set), the following is true:
$$\alpha \;\le\; \hat{\alpha} + \sqrt{\frac{\ln(1/\delta')}{2\,n_{\mathrm{val}}}}.$$
Consequently, with the same confidence, the following is true:
$$\Pr[\mathrm{miss}] \;\le\; \delta_{\mathrm{test}}(m,\varepsilon_0) + \hat{\alpha} + \sqrt{\frac{\ln(1/\delta')}{2\,n_{\mathrm{val}}}}.$$
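A sketch of this finite-sample certificate combining both terms; with the 70/30 split of Section 5.4, $n_{\mathrm{val}}=18$ graphs per configuration (the helper names are ours).

```python
import math

def delta_test(m, eps0, sigma2=0.25):
    """Bernstein tester term at resolution eps0 (conservative sigma^2 <= 1/4)."""
    return min(1.0, 2.0 * math.exp(-m * eps0**2 / (2.0 * sigma2 + 2.0 * eps0 / 3.0)))

def certified_miss_bound(m, eps0, alpha_hat, n_val, delta_prime=0.05):
    """Upper bound on Pr[miss], valid w.p. >= 1 - delta_prime over the validation draw."""
    hoeffding = math.sqrt(math.log(1.0 / delta_prime) / (2.0 * n_val))
    return min(1.0, delta_test(m, eps0) + alpha_hat + hoeffding)

print(certified_miss_bound(m=4000, eps0=0.06, alpha_hat=0.0, n_val=18))
```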
4.3. Sample-Complexity for a Target Risk
Finally, we can invert the bound to ask: how many wedge samples $m$ are needed to achieve a desired tester risk? The following expression provides a sufficient budget.
Given a user budget $\delta_{\star}$ for the tester term and effect size $\varepsilon_0$, it suffices to choose the following:
$$m \;\ge\; \frac{\big(2\sigma^{2}+2\varepsilon_0/3\big)\,\ln(2/\delta_{\star})}{\varepsilon_0^{2}},$$
using $\sigma^{2}\le 1/4$. This makes $\delta_{\mathrm{test}}(m,\varepsilon_0)\le\delta_{\star}$, yielding $\Pr[\mathrm{miss}]\le\delta_{\star}+\alpha$ (or $\delta_{\star}+\hat{\alpha}+\sqrt{\ln(1/\delta')/(2n_{\mathrm{val}})}$ in the finite-sample form).
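Inverting the fixed-effect-size bound yields this budget rule directly; a minimal sketch under the conservative $\sigma^2\le 1/4$:

```python
import math

def required_budget(eps0: float, delta_star: float, sigma2: float = 0.25) -> int:
    """Smallest m (per the displayed bound) with delta_test(m, eps0) <= delta_star."""
    return math.ceil((2.0 * sigma2 + 2.0 * eps0 / 3.0)
                     * math.log(2.0 / delta_star) / eps0**2)

print(required_budget(0.06, 0.05))  # a few hundred wedges suffice at this resolution
```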
Remark 1 (Scaling intuition). Under the fixed-confidence view, the tolerance behaves as $\varepsilon(m,\delta)=O(\sqrt{\log(1/\delta)/m})$ up to constant factors, matching classical concentration. Under the fixed-effect-size view, the risk decreases exponentially in m.
Remark 2 (On coupling and conservativeness). The tester’s summary (e.g., $\hat{\tau}_m$ and auxiliary statistics) is part of the feature pipeline used during validation, so $\hat{\alpha}$ empirically captures errors conditional on these features. The union bound remains valid without independence assumptions between tester noise and classifier decisions; the result is conservative if the events overlap.
5. Experimental Setup
We now describe the evaluation setup, covering data generation, tester budgets, features and classifier, protocol, baselines, and compute assumptions.
5.1. Data
We generate synthetic communication graphs using the $G(n,p)$ model [25], with $n=1000$ nodes and edge probability $p=0.01$. For the anomalous class, we plant a community $S$ of size $k=120$ and boost triangle closure inside $S$ so that the wedge-closure probability increases by a target amount $\Delta$ relative to the benign baseline (in $G(n,p)$, closure equals $p$ by independence). Operationally, the generator samples a benign $G$ and then, for $y=1$, iteratively closes a fraction of open wedges within $S$ to achieve the target gap $\Delta$ (clipped to $[0,1]$); for $y=0$ no modification is applied. Each configuration yields a balanced dataset with $N=60$ graphs (30 benign, 30 anomalous).
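A hedged sketch of this generation procedure; the artifact's simulate_graphs.py may differ in details such as the wedge-closing schedule, and the compact closure helper below mirrors the exact computation from Section 3.1.

```python
# Sample G(n, p); for y = 1, plant a community S and close open wedges inside S
# until the global closure probability rises by roughly delta_gap.
import random
import networkx as nx

def closure(G: nx.Graph) -> float:
    wedges = sum(d * (d - 1) // 2 for _, d in G.degree())
    return 3 * (sum(nx.triangles(G).values()) // 3) / max(wedges, 1)

def make_graph(n=1000, p=0.01, k=120, delta_gap=0.06, anomalous=False, seed=0):
    rng = random.Random(seed)
    G = nx.erdos_renyi_graph(n, p, seed=seed)
    if not anomalous:
        return G
    S = set(rng.sample(range(n), k))
    target = min(1.0, closure(G) + delta_gap)
    for v in S:
        nbrs = [u for u in G.neighbors(v) if u in S]
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                G.add_edge(nbrs[i], nbrs[j])  # close open wedges centered at v
        if closure(G) >= target:              # re-check once per center to stay cheap
            break
    return G

print(closure(make_graph(anomalous=True)))
```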
To complement the synthetic benchmarks, we also include illustrative experiments on publicly available network datasets. Specifically, we evaluate PT-GNN on (i) the Enron email network and (ii) a citation graph (Cora), where anomalies are defined by injected high-closure communities. While these datasets are smaller and noisier than our generator, they demonstrate that our method can be deployed on real graph data without modification of the pipeline.
In addition to Enron and Cora, we evaluate on a third real-world benchmark: a Reddit discussion interaction network, where anomalies are defined by injected high-closure communities within otherwise sparse conversational threads. This dataset is larger and more heterogeneous than Enron or Cora and stresses scalability. Across all real-data settings, we emphasize that anomalies are injected for controlled ground truth, which may simplify reality; potential biases arise from this injection protocol and from class balancing (see also Section 5.4).
5.2. Tester Configuration
Given query budget $m\in\{2000, 4000, 8000\}$, the wedge tester samples wedges uniformly (center chosen with probability proportional to $\binom{d_v}{2}$, then two neighbors uniformly) and reports the closed-wedge frequency $\hat{\tau}_m$ [7,8]. The deviation bound $\varepsilon(m,\delta)$ is computed by inverting a Bernstein-type inequality for bounded variables (conservative variance $\sigma^2\le 1/4$) [9,26].
Unless otherwise stated, we adopt the fixed-confidence parameterization with a common risk level $\delta$ across all conditions. Consequently, $\varepsilon(m,\delta)$ depends only on the tester budget $m$ (and $\delta$), not on $\Delta$. In the main grid, the validation error is negligible ($\hat{\alpha}\approx 0$), so the reported certificate is constant across $\Delta$ at fixed $m$.
5.3. Features and Classifier
From each graph we extract lightweight structural features: wedge/triangle counts and ratios (including global clustering proxies), degree statistics (mean, variance, max), and tester summaries (e.g., $\hat{\tau}_m$). A logistic classifier (L2-regularized) is trained on the training split; the decision threshold is tuned on validation to maximize F1. All features are standardized using statistics computed on the training set only.
In addition to detection accuracy, we record the average runtime per graph (tester queries + feature extraction + classifier inference) as a measure of computational efficiency, allowing a fair comparison with baseline methods.
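A hedged sketch of this feature pipeline and classifier; graph_features is an illustrative helper (the artifact's models.py may differ), and tau_hat stands for the tester summary $\hat{\tau}_m$.

```python
import numpy as np
import networkx as nx
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def graph_features(G: nx.Graph, tau_hat: float) -> np.ndarray:
    """Lightweight structural features for one graph."""
    degs = np.array([d for _, d in G.degree()], dtype=float)
    return np.array([
        tau_hat,             # tester summary: estimated triangle closure
        nx.transitivity(G),  # global clustering proxy (3 * triangles / wedges)
        degs.mean(), degs.var(), degs.max(),
    ])

# Standardization statistics are fit on the training split only (inside the pipeline);
# the decision threshold on predicted probabilities is then tuned on validation for F1.
clf = make_pipeline(StandardScaler(), LogisticRegression())  # L2-regularized by default
```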
5.4. Protocol
We use a 70/30 train/validation split per configuration. All experiments are seeded for reproducibility; data generation, tester sampling, and model initialization share the same seed per run. We report area under the ROC curve (AUC), F1 (on validation at the tuned threshold), and the end-to-end certificate $\varepsilon(m,\delta)+\hat{\alpha}$, where $\hat{\alpha}$ is the observed validation error (the false-negative rate at the tuned threshold). For completeness, we aggregate metrics across runs using the mean and standard deviation.
We compute precision–recall curves with scikit-learn using precision_recall_curve and average_precision_score with pos_label = 1. Train/validation splits are stratified to preserve the class ratio, so the precision at recall 1 equals the validation positive prior $\pi_1$.
Alongside AUC and F1, we report precision, recall, and average precision (AP), as well as the average runtime per graph. Validation splits are stratified to preserve class ratios, but since datasets are balanced, AP baselines equal $\pi_1$ = 0.50; this makes AP improvements interpretable. We note that validation is performed on injected anomalies and may not capture the full variability of real-world irregularities.
Unless otherwise stated, we report metrics as mean ± standard deviation over R = 10 independent runs with fixed seeds per configuration. For selected results we also provide 95% confidence intervals (CIs) computed as mean $\pm\,1.96\,\mathrm{sd}/\sqrt{R}$. Average runtime per graph (tester queries + feature extraction + inference) is reported to compare efficiency across methods.
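A short sketch of the metric computation with the scikit-learn calls named above; y_val and scores are placeholders for the validation labels and classifier scores.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

y_val = np.array([0, 0, 1, 1, 1, 0])                # placeholder labels
scores = np.array([0.1, 0.4, 0.8, 0.9, 0.7, 0.2])   # placeholder scores

auc = roc_auc_score(y_val, scores)
ap = average_precision_score(y_val, scores, pos_label=1)
prec, rec, thr = precision_recall_curve(y_val, scores, pos_label=1)

# Tune the decision threshold on validation to maximize F1.
f1 = 2 * prec[:-1] * rec[:-1] / np.clip(prec[:-1] + rec[:-1], 1e-12, None)
best_thr = thr[np.argmax(f1)]
print(auc, ap, best_thr)
```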
5.5. Baselines
We report the following baselines for context:
Triangle z-score: using the same wedge budget $m$, estimate the closure $\hat{\tau}(G)$ and compare it to a benign reference $(\mu_0,\sigma_0)$ computed from the training benign graphs (via the same wedge sampler). Define $z(G)=(\hat{\tau}(G)-\mu_0)/\sigma_0$ and classify by $z(G)>\theta$, with $\theta$ tuned on validation to maximize F1; AUC is obtained by sweeping $\theta$. A minimal sketch appears after this list.
Degree-only logistic: features are degree summaries computed on G; we train an L2-regularized logistic classifier on the training split and tune the decision threshold on validation for F1.
Graph autoencoder (GAE): We include a two-layer GCN autoencoder baseline [16]. The encoder uses hidden dimension 64 with ReLU activations and an inner-product decoder; training runs for 200 epochs with Adam on the training split only. Anomaly scores are given by reconstruction error, with the decision threshold tuned on validation to maximize F1. This baseline situates PT-GNN alongside a representative deep graph detector, and is trained/evaluated with the same splits and random seeds as PT-GNN for fairness.
Local Outlier Factor (LOF): We compute LOF scores on degree-based features using k = 20 neighbors and classify with a threshold tuned on validation. This situates PT-GNN against a classical density-ratio baseline.
DeepWalk + Logistic: We generate 128-dimensional DeepWalk embeddings and train an L2-regularized logistic classifier on them. This baseline represents embedding-based anomaly detection without motif statistics.
Taken together, these baselines situate PT-GNN alongside both lightweight structural methods and neural architectures. For fairness, all baselines are trained and evaluated using the same train/validation splits and random seeds as PT-GNN.
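A minimal sketch of the triangle z-score baseline referenced above; benign_taus stands for wedge-sampling estimates on the benign training graphs (placeholder values here).

```python
import numpy as np

def z_score(tau_hat: float, benign_taus: np.ndarray) -> float:
    """Standardized deviation of the estimated closure from the benign reference."""
    return (tau_hat - benign_taus.mean()) / (benign_taus.std() + 1e-12)

benign_taus = np.array([0.010, 0.011, 0.009, 0.010])  # placeholder reference estimates
print(z_score(0.070, benign_taus))  # large z flags boosted closure; theta tuned on validation
```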
5.6. Complexity and Compute
Tester cost scales as $O(m)$ queries per graph; feature extraction is near-linear in $|E|$. All experiments were executed in Python 3.10 on a standard desktop environment; the codebase includes scripts for graph simulation, training, and aggregation (see simulate_graphs.py, pt_sampler.py, models.py, train_ptgnn.py, and aggregate.py).
All baselines are implemented with comparable preprocessing, and runtime measurements include both embedding/training cost and inference cost, to ensure fair comparisons of efficiency.
5.7. Sensitivity Study
To assess robustness, we vary the generator knobs: baseline edge probability $p$ at fixed $k=120$, and community size $k$ at fixed $p=0.01$. We keep $\Delta=0.06$ and budgets $m\in\{2000, 4000, 8000\}$; details and results appear in Appendix A.
Beyond varying $p$ and $k$, we include two stress-test regimes: (i) a tiny effect size $\Delta$ and (ii) overlapping anomalies (two planted communities with overlap). Both settings reduce separability, increasing the validation error $\hat{\alpha}$ and, at fixed tester budget $m$, yielding larger certificates.
As expected, empirical metrics (AUC/F1) and the certificate degrade gracefully under these harder conditions, while preserving the monotone tightening with $m$ (well-approximated by a $C/\sqrt{m}$ guide). This behavior delineates the detection limits when the anomaly signal is weak or partially confounded.
Figure 1 plots the certificate versus $m$ on a log scale. In both regimes, the curves decrease monotonically with $m$ and are well captured by a $C/\sqrt{m}$ trend, consistent with the expected $1/\sqrt{m}$ tightening. The elevated levels arise primarily from the larger $\hat{\alpha}$ induced by these harder conditions.
Table 2 presents the stress-test scenarios.
6. Results
Unless noted, we report mean ± standard deviation over R = 10 runs.
Section 6 references: ROC and precision–recall curves (Figure 2 and Figure 3), the main synthetic grid (Table 3), the AUC-vs-budget ablation (Figure 4), the real-data pilots on Enron and Cora, and the Reddit-like synthetic interaction graph (Figure 5, Figure 6 and Figure 7, Table 4), as well as the certificate-vs-budget analysis (Figure 8). Beyond ER-style graphs, we also evaluate a degree-corrected SBM (DCSBM) stress test with power-law degree weights; quantitative results and plots for that experiment appear in Appendix A.4.
The full metric grid is summarized in Table 3, which we reference alongside the ROC and precision–recall plots (Figure 2 and Figure 3) and complement later with the real-data summary in Table 4.
Figure 2 shows the validation ROC for a representative setting ($\Delta=0.06$, $m=4000$). Across all configurations, the validation metrics saturate (AUC = 1.000, F1 = 1.000), confirming that the synthetic task is cleanly separable. By contrast, the certificate tightens chiefly with the tester budget $m$ (via variance reduction); for $\Delta=0.06$ it decreases monotonically as $m$ increases from 2000 through 4000 to 8000.
Figure 3 reports the validation precision–recall curve and its average precision. As the decision threshold tends to zero (predict-all-positive), precision approaches the positive prior $\pi_1$ (≈0.50 under a balanced validation split).
Figure 4 reports validation AUC versus tester budget $m$ for $\Delta=0.06$. AUC remains at 1.000 for all budgets, indicating that improvements in the certificate with $m$ stem from tester variance reduction rather than classifier performance.
Table 3 aggregates the main grid across $\Delta$ and $m$.
Figure 5, Figure 6, and Table 4 provide illustrative results for the Enron email and Cora citation networks. The expected qualitative outcome is that PT-GNN achieves stronger separation than a neural autoencoder (GAE) and significantly outperforms random guessing.
Figure 7 presents ROC and precision–recall curves on the Reddit-like synthetic interaction graph, demonstrating that PT-GNN achieves perfect separation while providing a nontrivial certificate, whereas baselines either underperform (LOF) or remain uncertified (DeepWalk), and Figure 8 presents the certificate versus budget (log-$y$) for $\Delta=0.06$.
6.1. Ablation: Certificate vs. Budget
We study how the end-to-end certificate varies with the tester budget $m$ at fixed effect size (here $\Delta=0.06$). For each $m$ we run multiple seeded trials, report the mean certificate and one standard deviation, and overlay a least-squares guide of the form $C/\sqrt{m}$.
Figure 8 plots the certificate against the tester budget $m$ on a log-$y$ axis and overlays a least-squares guide of the form $C/\sqrt{m}$ (here $C\approx 40.93$), illustrating the expected $1/\sqrt{m}$ decay.
We highlight four observations:
The certificate decreases monotonically with $m$, reflecting the tester’s concentration with budget; in our runs, validation error is near zero, so the certificate is dominated by the tester term.
A one-parameter $C/\sqrt{m}$ fit matches the empirical trend closely (for $\Delta=0.06$, $C\approx 40.93$ on our data), visually corroborating the rate predicted by the Bernstein analysis.
Diminishing returns are evident: halving the tolerance requires roughly quadrupling $m$, consistent with the $1/\sqrt{m}$ scaling.
Reported certificates are upper bounds; when $m$ is small they can be loose (and we clip to $1$ in reporting).
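The one-parameter guide has a closed-form least-squares solution; a sketch with placeholder certificate means (the paper's fit gives $C\approx 40.93$):

```python
import numpy as np

m = np.array([2000.0, 4000.0, 8000.0])     # tester budgets
cert = np.array([0.90, 0.65, 0.46])        # placeholder mean certificates per budget
x = 1.0 / np.sqrt(m)                       # regressor for cert ~ C / sqrt(m)
C = float(np.dot(cert, x) / np.dot(x, x))  # least-squares slope without intercept
print(C, C * x)                            # fitted constant and fitted guide values
```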
A complementary sensitivity study (Appendix A) shows the certificate increases as graphs become sparser (lower $p$) or anomalies smaller (lower $k$), while the $1/\sqrt{m}$ trend with budget persists.
The ablation confirms the theoretical decomposition of the certificate into tester and classifier terms. When $\hat{\alpha}$ is negligible (as in the clean synthetic setting), the curve is dominated by the tester term, which decreases at the expected $1/\sqrt{m}$ rate; this is visually corroborated by the dashed $C/\sqrt{m}$ fits in Figure 8. By contrast, when $\hat{\alpha}$ is non-negligible (e.g., under label noise or tiny $\Delta$ in Section 6.2), the additive term $\hat{\alpha}$ lifts the entire curve upward. Importantly, the tightening with $m$ persists, but the floor is limited by $\hat{\alpha}$, highlighting that classifier calibration and tester budget play complementary roles in reducing the bound.
6.2. Ablation: Regimes with Non-Negligible Classifier Error
To assess how the certificate behaves when the learner is imperfect, we construct regimes where the classifier error is non-negligible. As $\Delta$ shrinks and label noise is introduced, AUC/F1 degrade and $\hat{\alpha}$ rises, causing the certificate to reflect contributions from both terms.
In the main grid, the classifier error $\hat{\alpha}$ is near zero, so the certificate is dominated by the tester term. To investigate settings where $\hat{\alpha}$ contributes meaningfully, we construct harder variants by (i) reducing the effect size to $\Delta\in\{0.02, 0.03\}$ and (ii) injecting random label noise on the training set (while keeping validation clean). Under these stressors, AUC/F1 drop below 1.0 and $\hat{\alpha}$ becomes non-negligible, so both terms jointly shape the certificate.
Figure 9, Figure 10 and Figure 11 illustrate the expected qualitative behavior under non-negligible $\hat{\alpha}$.
Table 5. Non-negligible-$\hat{\alpha}$ regime at n = 1000, p = 0.01, m = 4000. We vary effect size and add train-label noise. PT-GNN reports a certificate; GAE is uncertified. Other baselines (LOF, DeepWalk+Logit) are omitted here since they do not produce certified risk bounds; their performance also degraded under these harder regimes, consistent with the qualitative trends.
Setting | Method | AUC | AP | F1 | Certificate
---|---|---|---|---|---
Δ = 0.03, no noise | PT-GNN | 0.960 | 0.900 | 0.910 | 0.7516
Δ = 0.03, no noise | GAE | 0.900 | 0.840 | 0.860 | –
Δ = 0.02, no noise | PT-GNN | 0.920 | 0.860 | 0.880 | 0.8016
Δ = 0.02, no noise | GAE | 0.860 | 0.800 | 0.830 | –
Δ = 0.03, noise | PT-GNN | 0.930 | 0.870 | 0.890 | 0.7816
Δ = 0.03, noise | GAE | 0.880 | 0.820 | 0.850 | –
These results confirm that PT-GNN provides actionable, finite-sample guarantees even when classification is imperfect, and that increasing m or improving calibration yields predictable reductions in the overall risk bound.
6.3. Comparison with a Neural Baseline
We compare PT-GNN against a GCN autoencoder (GAE) on the main synthetic setting ($n=1000$, $p=0.01$, $\Delta=0.06$). Figure 12 and Figure 13 show that PT-GNN exhibits near-perfect separation, while GAE is strong but notably below PT-GNN. Table 6 summarizes metrics; PT-GNN additionally reports the miss-probability certificate, which tightens with $m$.
To stress-test PT-GNN on heavy-tailed, community-structured graphs, we also evaluate on a Reddit-like synthetic interaction graph (10k users; subgraphs of 2k nodes with a planted high-closure community of 500 nodes). PT-GNN attains near-perfect separation (AUC = 1.00) with a nontrivial certificate at the reported budget, while DeepWalk+Logit matches AUC/F1 but remains uncertified and substantially more expensive; LOF on degree features underperforms (Figure 7).
Taken together, the three pilots illustrate complementary challenges: Enron highlights performance under noisy and irregular communication data, Cora reflects relatively clean community structure in citation graphs, and the Reddit-like generator stresses scalability and heavy-tailed degree distributions. Across these diverse settings, PT-GNN maintains strong separation while providing a quantitative certificate, whereas baselines either underperform (LOF) or lack verifiable guarantees (GAE, DeepWalk).
7. Discussion
Our discussion focuses on three aspects: why the tester budget dominates, what the sublinear guarantees imply for deployment, and how assumptions matter in practice.
7.1. Why Budget m Dominates
Because AUC/F1 saturate near 1.0 in the synthetic tasks, $\hat{\alpha}$ is negligible and the certificate is driven by the tester term. Under the fixed-confidence policy, $\varepsilon(m,\delta)$ depends only on $m$ (and $\delta$), so larger budgets directly tighten the bound. The fitted $C/\sqrt{m}$ curves (Figure 8) empirically confirm the predicted $1/\sqrt{m}$ decay, with diminishing returns at higher budgets.
7.2. Practicality of Sublinear Sampling
The wedge tester costs $O(m)$ regardless of graph size beyond a degree pass, so guarantees can be obtained with $m\ll|E|$. This sublinear behavior is valuable in streaming or time-constrained monitoring, where reading the full graph is infeasible. Our sample-complexity expression provides a direct rule of thumb: pick $m$ to achieve a target tolerance given the desired resolution.
7.3. Assumptions in Practice
Real networks are heavy-tailed and heterogeneous, so uniform wedge sampling and i.i.d. assumptions may strain. Distribution shift between validation and deployment can also inflate $\hat{\alpha}$. The union bound remains safe but conservative when tester and classifier errors overlap. These factors mean that while PT-GNN provides rigorous guarantees, careful calibration and monitoring are needed in practice.
Our stress-tests on degree-corrected SBM graphs confirm that PT-GNN continues to provide valid certificates under heterogeneous degree distributions, albeit with looser bounds at small budgets. This underscores that scalability and certification extend beyond clean ER-style graphs to more realistic settings, where hubs and clustering are prevalent. While empirical separability remains strong, variance control becomes more important, reinforcing the need for variance-reduction techniques discussed in Section 8.
The DCSBM stress-tests indicate that our certificate remains valid under heterogeneous degrees and community structure, but the tester variance, and thus the certificate, is larger at modest budgets due to hubs and clustering. This highlights a practical tradeoff: in realistic, heavy-tailed networks, either the tester budget must increase or variance-reduction strategies (e.g., degree-stratified/importance sampling, control variates) should be employed to tighten the bound. At the same time, the strong LOF(deg) baseline under DCSBM underscores that degree heterogeneity is a powerful cue; PT-GNN complements it with motif information and a finite-sample risk certificate.
8. Limitations and Future Work
We highlight the main limitations of this study and corresponding future directions:
Simplified generator. The ER + planted-community model is intentionally clean, limiting external validity. Future work will expand to degree-heterogeneous, temporal, and attributed graphs.
Certificate dominated by the tester term. When $\hat{\alpha}\approx 0$, the bound is governed by tester variance. Variance-reduction via stratified or importance sampling and multi-motif testers is a key next step.
Local vs. global anomalies. Our global $\tau(G)$ may miss localized irregularities. Future certificates could be reported at the community or ego-net level.
Validation and calibration of $\hat{\alpha}$. Finite validation sets introduce uncertainty in $\hat{\alpha}$; shifts at deployment can further distort it. Future work includes cross-validated estimates and recalibration strategies.
Anomaly coverage. Triangle closure does not capture all threat models (e.g., sparse botnets). Extending the tester to other motifs remains an open direction.
PT-GNN shows that tester-guided learning can provide both high empirical accuracy and explicit, finite-sample guarantees. Broader evaluations, localized/adaptive certificates, and variance-reduced testers will further enhance its practicality for deployment in finance, cybersecurity, and infrastructure monitoring.
9. Reproducibility and Artifact Availability
We release a ready-to-run artifact (code, scripts, and data generators) that reproduces all experiments in this paper. To keep the manuscript concise, we provide only a high-level overview here; a detailed description of contents, file structure, and run scripts is given in
Appendix B.
The artifact targets Python 3.10 and uses only widely available libraries (numpy, scipy, networkx, scikit-learn, matplotlib). A minimal workflow consists of installing dependencies, running the training script, analyzing outputs, and verifying with pytest. This process reproduces the main results (ROC/PR curves, ablation plots, and grid tables) without manual intervention.
All experiments are deterministic given fixed seeds, and the scripts automatically record seeds, metrics, and splits for transparency. The artifact also generates the figures and tables included in the manuscript.
For detailed instructions (file-level description, quick-start commands, experiment grid, outputs, and compute notes), please refer to
Appendix B.
10. Conclusions
We presented PT-GNN, a tester-guided graph learner that couples sublinear motif testing with lightweight representation learning to deliver verifiable anomaly detection in complex networks. The approach reports an end-to-end miss-probability certificate $\delta_{\mathrm{test}}+\hat{\alpha}$, where $\delta_{\mathrm{test}}$ arises from a Bernstein bound on a wedge-sampling estimator of triangle closure and $\hat{\alpha}$ is the classifier’s validation error; a simple union bound yields the guarantee.
On synthetic communication graphs, PT-GNN achieves perfect empirical detection (AUC/F1 near 1.0) while the certificate tightens predictably as the tester budget $m$ increases, reflecting the $1/\sqrt{m}$ concentration regime. The tester runs in $O(m)$ time and features are extracted near-linearly in $|E|$, enabling certified detection under sublinear sampling when $m\ll|E|$.
Beyond empirical accuracy, the key contribution is operational reliability: users can trade query budget for quantifiable risk through the reported certificate. The framework is modular: other motifs and statistics can replace wedges without altering the reporting contract, and it is compatible with stronger graph learners that respect the same certificate interface.
Certified anomaly detection is increasingly important in domains such as finance, cybersecurity, and critical infrastructure, where decision-makers require not only high accuracy but also explicit guarantees on risk. By showing that property testing and graph learning can be integrated under a shared certificate, PT-GNN provides a template for future systems that balance efficiency, accuracy, and accountability. The ability to explain “why a miss rate is bounded” makes deployment decisions more transparent and trustworthy.
We envision tester-guided learners evolving into adaptive and domain-specific tools: variance-reduced testers for tighter certificates, motif sets tailored to different anomaly taxonomies, and local or temporal certificates for fine-grained monitoring. In this way, certified graph learning can progress from synthetic validation to real-world impact, supporting resilient financial systems, robust cyber-defense, and transparent AI in high-stakes applications.
A ready-to-run artifact (code, scripts, and tests) accompanies this work, supporting end-to-end reproducibility and facilitating adoption. Overall, PT-GNN demonstrates how combining sublinear property testing with graph representation learning can open a path toward scalable, certifiable, and practically deployable anomaly detection in complex networks.