Operationally Constrained Zero-Day Intrusion Detection with Target-FPR Calibration and Similarity Graph Construction

Ha, Yuseong; Kim, Keecheon

doi:10.3390/app16052284

by Yuseong Ha ¹ and Keecheon Kim ²,*

¹ Department of IT Convergence and Information Security, Konkuk University, Seoul 05029, Republic of Korea

² Department of Computer Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(5), 2284; https://doi.org/10.3390/app16052284

Submission received: 4 February 2026 / Revised: 23 February 2026 / Accepted: 25 February 2026 / Published: 26 February 2026

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Intrusion detectors are often evaluated using average metrics at unconstrained thresholds, yet deployments require explicit control over false alarms. We investigate zero-day (out-of-distribution, OOD) intrusion detection under a target-FPR calibrated protocol, where a threshold is set on benign validation traffic to satisfy a target false positive rate α and transferred, unchanged, to a seen-test split and an OOD-test split. Using CICIDS2017-derived host-session nodes aggregated in 1 min and 5 min windows, we compare tabular baselines, message-passing GNNs on a rule-based graph, and a proposed method that builds a k-nearest-neighbor similarity graph with lightweight feature pre-smoothing. Robustness is measured using the OOD violation ratio, percentile tail risk, and feasibility under explicit false-alarm budgets. Base-graph GNNs exhibit heavy-tailed false-alarm amplification under OOD shifts: at α = 0.001, the p95 violation ratio reaches 68.50 (1 m) and 67.95 (5 m). In contrast, the proposed method reduces p95 to 3.41 (1 m) and 1.15 (5 m) and improves budget feasibility. We further verify robustness beyond a single held-out family by evaluating additional unseen-family splits (e.g., DDoS and DDoS+DoS) under the same calibrated operating point. We also quantify deployment-oriented cost via edge-list size and practical parsing/loading time. These findings suggest that similarity-based graphs with light pre-smoothing improve deployability under distribution shifts.

Keywords: zero-day intrusion detection; out-of-distribution (OOD); target-FPR calibration; operational robustness; false-alarm budget; tail risk; graph neural networks; KNN similarity graph; feature pre-smoothing

1. Introduction

Intrusion detection systems are commonly evaluated under fixed datasets and unconstrained thresholds, where performance is summarized using average metrics such as accuracy or AUC [1,2]. However, deployment environments rarely remain stationary: traffic composition, benign behavior, and attack implementations drift over time [3]. In such settings, the most damaging failure mode is often not a small decrease in the average detection quality but an operational breakdown caused by false-alarm inflation [4,5].

This work studies zero-day intrusion detection from an operational perspective. We treat “zero-day” as an out-of-distribution (OOD) condition in which attack behaviors encountered at test time are absent from training, while benign traffic is still present [6,7]. Importantly, we focus on the practical constraint that operators must control false positives. In real deployments, the question is not only whether a method can detect attacks, but whether it can do so while respecting a specified false-alarm budget. Recent studies have emphasized interpretability and application-oriented deployment considerations for intrusion detection, particularly in industrial IoT settings [8]. More broadly, reliability-oriented assessment frameworks have also been discussed in cyber defense contexts, providing motivation to report operational stability through metrics other than average performance [9].

The key technical novelty of our study is the coupling of target-FPR calibration (fixed operating point transfer) with similarity graph construction and operational tail risk/feasibility reporting, enabling robustness assessments under explicit false alarm constraints rather than unconstrained thresholds. To achieve this, we adopt a target-FPR calibrated evaluation protocol. For a given target false positive rate α, we calibrate the decision threshold on a benign subset of validation data so that the validation FPR satisfies the target constraint [10,11]. The calibrated threshold is then transferred, unchanged, to both (i) a seen-test split that contains benign traffic and attack behaviors observed during training, and (ii) an OOD/zero-day test split that contains benign traffic and unseen attack behaviors. Under this protocol, any false-alarm inflation observed on OOD data directly indicates that an operating point that appears feasible during development can become infeasible after deployment-time shift.

Within this protocol, we compare three method families under the same host-session node representation and identical train/validation/test splits. First, we consider tabular baselines that operate only on node features and do not exploit relational information. Second, we consider standard message-passing GNN detectors trained on a rule-based base graph derived from domain/session relations. Third, we propose a graph construction and feature pre-processing strategy that targets operational robustness: we replace the base graph with a k-nearest-neighbor (KNN) similarity graph built in the node feature space and apply lightweight feature pre-smoothing prior to GNN training [12,13]. We hypothesize that, under distribution shift, brittle or noisy edges can amplify spurious message passing and destabilize score distributions, which in turn causes calibrated operating points to fail. A similarity graph combined with mild pre-smoothing encourages locally coherent neighborhoods and attenuates noise, stabilizing the transfer of a fixed calibrated threshold [14,15].

We instantiate this study on CICIDS2017-derived host-session representations under two temporal aggregation windows (1 min and 5 min) and multiple target FPRs (α ∈ {0.001, 0.01, 0.05}) [16]. Beyond reporting the OOD FPR achieved at the calibrated operating point, we quantify robustness using worst-case-oriented operational metrics: percentile-based tail risk of the OOD violation ratio and feasibility rates under explicit false-alarm budgets. These metrics directly answer whether a detector remains deployable when the operating point is fixed by calibration and the test distribution shifts.

A key practical implication of this protocol is that it separates “model quality” from “operational feasibility.” When the threshold may be tuned on the test distribution, post hoc adjustment can mask instability. Because τ(α) is fixed using benign validation traffic and transferred unchanged, feasibility violations on OOD data directly represent deployment risk under a realistic operational constraint.

The main contributions of this research are as follows:

(1): We formulate and evaluate zero-day intrusion detection under an explicit target-FPR calibrated protocol that fixes the operating point on benign validation data and transfers it, unchanged, to seen- and OOD-tests, aligning evaluation with operational constraints.
(2): We propose a robustness-oriented, graph-based approach that combines a KNN similarity graph and lightweight feature pre-smoothing to stabilize neighborhood aggregation and score distributions under OOD shifts.
(3): We introduce and report operational robustness metrics—the tail risk of false-alarm inflation and budget feasibility rates—that capture worst-case behavior and deployability beyond the average detection performance.
(4): Through experiments across two time windows, various target FPR regimes, and various model backbones, we show that the proposed approach substantially reduces heavy-tailed false alarm amplification and improves feasibility under strict alert budgets while preserving meaningful detection quality at a fixed calibrated operating point.

2. Related Work

In this section, we review prior research most relevant to our study: intrusion detection under distribution shift (zero-day/OOD), graph-based intrusion detection and graph neural networks [17], and calibration or constraint-aware evaluation focused on false alarms. We also highlight why robustness should be characterized by tail behavior rather than averages when operating points are fixed.

2.1. Zero-Day and Distribution Shift in Intrusion Detection

Traditional intrusion detection research has emphasized supervised learning on labeled datasets, often evaluated under i.i.d. assumptions, where training and test data share similar distributions [18,19]. In practice, however, deployment-time conditions differ due to evolving benign behavior, changes in network configuration, and the emergence of novel attack strategies [3]. This mismatch suggests the need for evaluation settings that explicitly model zero-day or OOD conditions, typically by withholding certain attack behaviors during training and assessing generalization to unseen attacks at test time. A key challenge in this regime is that detection thresholds and score distributions can shift, so methods that perform well under standard evaluation can become operationally unstable once deployed.

A recurring theme in zero-day IDS research is the reliance on anomaly detection or representation learning to capture normal behavior and flag deviations [20,21]. While such approaches can improve coverage against unknown attacks, they often suffer from high false positive rates when benign behavior changes [22]. This leads to an operational gap: success in detecting “unknown” activity does not automatically translate to deployability if false alarms cannot be controlled.

2.2. Graph-Based Intrusion Detection

Graph neural networks have been widely explored in security analytics because many cyber systems exhibit a relational structure (hosts, sessions, communications, and temporal co-occurrence). By performing message passing, GNNs can integrate neighborhood context and capture higher-order dependencies beyond what is available in independent tabular features. Prior work has constructed graphs using communication edges, shared attributes, temporal proximity, or domain-driven rules, and has applied GNN backbones such as GCN [23], GraphSAGE [24], GAT [25], and GIN [26] to detection tasks.

Despite their promise, graph-based detectors can be brittle under distribution shifts. When the graph contains noisy, incomplete, or spurious edges, message passing may propagate misleading signals and distort node representations [27]. This distortion can be tolerated under unconstrained thresholds—where one can re-tune the operating point on the test distribution—but becomes problematic when a detector must operate at a fixed threshold that is chosen during development. Under such constraints, small shifts in score distributions can cause disproportionate increases in false alarms.

Graph construction is therefore a critical design choice. Rule-based graphs embed domain knowledge but may not capture behaviorally coherent neighborhoods when deployment conditions differ from the environment in which the rules were crafted. In contrast, similarity graphs built in a feature space offer an alternative: they connect nodes that are close under the observed representation and can remain meaningful even when explicit domain relations are noisy or missing. However, similarity graphs can also introduce their own sensitivities (e.g., choice of similarity metric, neighbor count k), making it important to evaluate whether they actually improve operational robustness under OOD conditions.

2.3. Calibration and Constraint-Aware Evaluation

In deployment, security operators often require explicit control over the false positive rate or the number of alerts within a given time horizon. This provides the impetus to develop calibration procedures that select a decision threshold to satisfy a target constraint on a validation set [10]. A common approach is to set the threshold as a high quantile of benign validation scores so that the expected false positive rate meets a desired level. Such calibration aligns the evaluation with real constraints, but it also exposes a core difficulty: calibration performed on one distribution may not transfer under distribution shifts. If the score distribution shifts on OOD data, the realized false positive rate can inflate substantially even though the detector was properly calibrated during development.

Constraint-aware evaluation thus emphasizes not only detection performance but also the stability of the calibrated operating point under shift. This is especially important in strict regimes (with a very small target FPR), where minor changes in score tails can translate into large relative violations. Consequently, robustness assessment should include both the realized OOD FPR at the calibrated threshold and statistics that characterize the distribution of constraint violations across repeated runs.

2.4. Robustness and Tail Behavior Under Fixed Operating Points

Many studies have summarized model behavior using averages. However, when an operating point is fixed and false alarms are costly, the worst-case outcomes—rare but severe alert explosions—can dominate operational risk. This highlights the need for tail-oriented robustness metrics, such as percentile statistics of false-alarm inflation or feasibility probabilities under explicit budgets. Such metrics are complementary to standard detection measures because they directly quantify deployability: the probability that a calibrated detector stays within an acceptable false-alarm budget after deployment-time shifts.

Our work connects these threads by (i) enforcing a target-FPR calibrated operating point that remains fixed across seen- and OOD-tests; (ii) evaluating graph-based methods where graph construction and neighborhood aggregation critically influence score stability; and (iii) reporting tail risk and budget feasibility measures that capture operational robustness beyond average detection performance.

3. Materials and Methods

In this section, we describe an operationally constrained evaluation protocol for zero-day (out-of-distribution, OOD) intrusion detection under an explicit target false positive rate (target FPR) calibration, together with host-session node construction, two graph construction strategies (base vs. KNN similarity), model families, and operational robustness metrics. The core principle of this study is to calibrate a decision threshold τ(α) on benign validation data to satisfy a target false alarm rate α, and then transfer the calibrated operating point, unchanged, to both a seen-test and OOD/zero-day test. This design separates operational instability caused by distribution shifts from any effect of post hoc threshold tuning.

3.1. Task Definition and Zero-Day (OOD) Evaluation Setting

Intrusion detection is formulated as a binary classification problem at the host-session level. Each sample corresponds to a host aggregated over a fixed temporal window (e.g., 1 min or 5 min) and is labeled as benign or malicious. The zero-day (OOD) setting captures a practical deployment condition where attack behaviors observed after deployment differ from those available during development.

The OOD notion used here primarily corresponds to unseen attack shifts (i.e., attacks from families absent in training). A known limitation of this is that real deployments may also exhibit benign drift (non-attack-related shifts), such as changes in user activity, applications, background services, or network configuration. Such benign shifts can also affect the stability of a fixed calibrated operating point. This limitation is explicitly acknowledged, and additional analyses beyond a single held-out family are provided to strengthen generalization claims.

For each time window, a host-session node table is constructed and partitioned into three disjoint splits: Train, Validation, and Test. Train contains benign samples and only a subset of attack behaviors treated as seen (SEEN). Validation is used exclusively to determine an explicit operating point under a false alarm constraint: the decision threshold is calibrated on the benign subset of Validation to satisfy a target false positive rate (target FPR) α, and the calibrated threshold is then kept fixed. Test is used for two evaluations under the same fixed threshold: (i) seen-test, which contains benign samples and seen attacks (generalization within the SEEN pool), and (ii) OOD/zero-day test, which contains benign samples and attack behaviors that are entirely absent from Train (UNSEEN). This design ensures that any false-alarm inflation or feasibility violations under OOD reflect the instability of a fixed operating point under distribution shifts rather than post hoc re-tuning. The overall target-FPR calibrated OOD evaluation pipeline is illustrated in Figure 1.

In the CICIDS2017-derived setup, the main OOD/zero-day condition is instantiated by holding out an attack family from training. In the main experiment, the Botnet family is held out: all flow records whose raw labels match Botnet (including variants such as “Botnet” and “Botnet—Attempted”) are assigned to the held-out UNSEEN category, excluded from Train/Validation, and appear only in the OOD/zero-day test subset, while the remaining attack behaviors constitute the SEEN pool. To avoid conclusions that depend on a single held-out family, additional OOD splits are constructed by holding out other major attack families using the same procedure. This design provides a controlled form of distributional variation (across unseen families) while keeping the calibration protocol and operating point unchanged. These additional OOD-family hold-out results are summarized in Section 4.6. Across all splits, the models are trained on Train, the decision threshold is calibrated on benign Validation samples to meet the target FPR α, and the same calibrated operating point is transferred, unchanged, to seen-test and OOD/zero-day test.
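The family hold-out rule described above can be sketched as a small labeling function. This is only an illustration under stated assumptions: the helper names `assign_pool` and `allowed_in_split`, the benign label string `"BENIGN"`, the hyphenated spelling of the attempted variant, the example family names, and the prefix-matching rule are all hypothetical and may differ from the released data.

```python
def assign_pool(raw_label, held_out_family="Botnet", benign_label="BENIGN"):
    """Map a raw flow label to BENIGN, SEEN, or UNSEEN (hypothetical helper)."""
    label = raw_label.strip()
    if label == benign_label:
        return "BENIGN"
    # Assumption: every variant of the held-out family starts with its name.
    if label.startswith(held_out_family):
        return "UNSEEN"
    return "SEEN"


def allowed_in_split(pool, split):
    """UNSEEN attacks may appear only in Test (the OOD/zero-day subset)."""
    return pool != "UNSEEN" or split == "Test"


# Illustrative label strings; the exact spellings in the data are an assumption.
labels = ["BENIGN", "Botnet", "Botnet - Attempted", "DDoS", "PortScan"]
pools = [assign_pool(label) for label in labels]
```

Changing `held_out_family` would reproduce the additional hold-out splits in the same way, with the calibration protocol left untouched.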

3.2. Host-Session Node Construction and Dataset Statistics

Each node represents one host-session aggregated from CICIDS2017-derived traffic within a fixed time window. Two temporal granularities are considered: 1 min (1 m) and 5 min (5 m). Each host-session is represented by a fixed-length numeric feature vector computed from aggregated traffic statistics. After excluding split/label-like meta-data fields, the effective feature dimension used for learning is 142.

The node table includes an explicit split field (Train/Validation/Test). These splits are used consistently across all methods to prevent data leakage and to ensure that observed differences are attributable to modeling choices rather than data overlap. Table 1 summarizes the dataset scale per window.

Feature normalization. Numeric features are standardized prior to learning to ensure comparable scaling across dimensions and to stabilize similarity computations. To avoid information leakage across splits, normalization parameters (mean and standard deviation) are computed on the Train split and then applied consistently to Validation and Test. The same standardized features are used when computing cosine similarity for KNN graph construction, ensuring that similarity is not dominated by a small subset of high-variance features.
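Train-only standardization can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the synthetic data, and the guard for constant features are assumptions introduced here.

```python
import numpy as np


def fit_standardizer(X_train, eps=1e-12):
    """Compute per-feature mean and standard deviation on the Train split only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Assumption: constant features get unit scale so the division stays finite.
    std = np.where(std < eps, 1.0, std)
    return mean, std


def apply_standardizer(X, mean, std):
    """Apply Train-derived statistics to any split (Train, Validation, or Test)."""
    return (X - mean) / std


# Synthetic stand-in data; the real feature table has 142 columns.
rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(100, 4))
X_val = rng.normal(5.0, 2.0, size=(20, 4))

mean, std = fit_standardizer(X_train)
Z_train = apply_standardizer(X_train, mean, std)
Z_val = apply_standardizer(X_val, mean, std)  # no Validation statistics are used
```

Because Validation and Test never contribute to `mean` or `std`, their standardized values need not have exactly zero mean or unit variance.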

3.3. Graph Construction: Base Graph vs. KNN Similarity Graph

To enable message passing with GNN backbones, we construct graphs over host-session nodes while keeping node features fixed. Two edge-generation strategies are compared.

Base graph. The base graph is derived from observed host-to-host communications within each aggregation window using the same flow records that generate host-session nodes. An undirected edge (i, j) is created if at least one flow record indicates communication between the corresponding hosts within the window; multiple records for the same host pair are aggregated into a single edge. Each edge stores a nonnegative domain weight w_ij that equals the aggregated interaction count between the two host-session nodes within the time window (as provided in the released edge list), along with the edge type (COMM or TEMP).
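The edge aggregation for the base graph can be illustrated with a short sketch. The record layout `(window_id, src_host, dst_host)`, the function name `build_base_edges`, and the choice to skip self-communication are assumptions made for this example; edge types (COMM or TEMP) are omitted because their assignment rule is not restated here.

```python
from collections import Counter


def build_base_edges(flows):
    """Aggregate flow records into undirected, weighted edges per window.

    `flows` is an iterable of (window_id, src_host, dst_host) tuples; this
    record layout is a hypothetical simplification of the real flow schema.
    """
    weights = Counter()
    for window_id, src, dst in flows:
        if src == dst:
            continue                      # skip self-communication (assumption)
        a, b = sorted((src, dst))         # undirected: (a, b) equals (b, a)
        weights[(window_id, a, b)] += 1   # w_ij = interaction count in the window
    return dict(weights)


flows = [
    (0, "10.0.0.1", "10.0.0.2"),
    (0, "10.0.0.2", "10.0.0.1"),  # reverse direction merges into the same edge
    (0, "10.0.0.1", "10.0.0.3"),
    (1, "10.0.0.1", "10.0.0.2"),  # same host pair, different window
]
edges = build_base_edges(flows)
```

Keying edges by window keeps host-session nodes from different windows separate even when the same host pair communicates repeatedly.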

KNN similarity graph (ours). Our proposed graph is constructed in the node-feature space using a k-nearest-neighbor rule with cosine similarity (k = 3). We form the final undirected edge set by removing self-loops and merging duplicate pairs produced by neighbor selection; therefore, the resulting edge count (and average degree) can vary by window. Edge weights correspond to cosine similarity values. We use k = 3 as the main configuration and later examine the k-sensitivity (k ∈ {3, 5, 10}) to confirm that the robustness trend is not tied to a single neighborhood size.
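The construction can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function name `knn_cosine_edges` is hypothetical, the dense N × N similarity matrix is only practical for small inputs, and the tie-breaking order of `argsort` is incidental rather than part of the method.

```python
import numpy as np


def knn_cosine_edges(X, k=3):
    """Return {(i, j): cosine similarity} with i < j for an undirected kNN graph."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)    # unit-normalize rows
    S = Xn @ Xn.T                        # dense cosine similarity matrix
    np.fill_diagonal(S, -np.inf)         # a node is never its own neighbor
    edges = {}
    for i in range(X.shape[0]):
        neighbors = np.argsort(-S[i])[:k]          # k most similar nodes to i
        for j in neighbors:
            a, b = (i, int(j)) if i < j else (int(j), i)
            edges[(a, b)] = float(S[a, b])         # i->j and j->i merge here
    return edges


# Toy features (hypothetical); real inputs would be the standardized node table.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]])
edges = knn_cosine_edges(X, k=2)
```

Since mutual neighbor pairs collapse into one undirected edge, the final edge count lies between N·k/2 and N·k, which is why the average degree can differ between windows.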

Although the base graph reflects explicit domain relations, its connectivity can be incomplete or noisy in practice, especially when the environment changes. Under such shifts, message passing may propagate misleading neighborhood information and distort score distributions even when the classifier itself is unchanged. This is one reason why a threshold calibrated on benign validation traffic can fail to transfer to OOD-tests: small representation distortions can accumulate through aggregation and inflate false alarms in the score tail. In contrast, a similarity graph builds neighborhoods directly in feature space, which can remain behaviorally coherent even when explicit domain edges become brittle. Graph edge statistics are summarized in Table 2.

3.4. Models and Proposed Pre-Smoothing Mechanism

Three method families are compared under the same host-session node representation and the same Train/Validation/Test splits. Lightweight pre-smoothing is introduced to reduce high-variance components in node features before neighborhood aggregation. Under distribution shifts, similarity computation and message passing can become sensitive to small perturbations or unstable feature dimensions, which may distort score distributions and undermine calibration transfer. By mixing each node with a small portion of its neighbors, pre-smoothing encourages locally coherent representations and mitigates extreme fluctuations that are known to affect tail behavior under strict operating points. We also include γ = 0 (no pre-smoothing) in the ablation to isolate how much of the robustness gain stems from smoothing versus graph construction.

Tabular baselines operate on node features only without any graph structure and serve as non-graph reference baselines under the same target-FPR calibrated protocol. The tabular family includes standard learners such as Logistic Regression, Random Forest, and an MLP. These baselines help distinguish robustness gains attributable to graph-based modeling from gains that can be achieved by feature-only classifiers.

GNN-base uses standard message-passing GNN backbones (GCN, GIN, GraphSAGE, and GAT) trained on the rule-based base graph defined in Section 3.3. Each backbone performs neighborhood aggregation according to the base graph’s edges and produces a score s(x) for each node.

GNN-ours uses the same set of GNN backbones but replaces the rule-based base graph with the KNN similarity graph described in Section 3.3. In addition, a lightweight feature pre-processing step (pre-smoothing) is applied prior to message passing to stabilize local representations.

Let X ∈ R^{N×F} be the node feature matrix, A be the (possibly weighted) adjacency matrix of the chosen graph, and D be the degree matrix. Using the random-walk normalization D^{-1}A, the pre-smoothed features are defined as:

X' = (1 - \gamma) X + \gamma D^{-1} A X    (1)

where γ ∈ [0, 1] controls the smoothing strength. The main configuration uses γ = 0 for GNN-base (no pre-smoothing) and γ = 0.3 for GNN-ours.
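Equation (1) can be written directly in NumPy. The sketch below assumes a dense adjacency matrix and a hypothetical function name `pre_smooth`; the handling of isolated nodes (degree 0), which the equation leaves undefined, is an assumption made here so that the code stays well defined.

```python
import numpy as np


def pre_smooth(X, A, gamma=0.3):
    """Return (1 - gamma) * X + gamma * D^{-1} A X for a dense adjacency A."""
    X = np.asarray(X, dtype=float)
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1, keepdims=True)             # diagonal of D as a column
    safe_deg = np.where(deg > 0, deg, 1.0)
    neighbor_avg = (A @ X) / safe_deg              # D^{-1} A X
    # Assumption: isolated nodes (degree 0) keep their own features.
    neighbor_avg = np.where(deg > 0, neighbor_avg, X)
    return (1.0 - gamma) * X + gamma * neighbor_avg


# Two connected nodes with one feature each.
X = np.array([[0.0], [10.0]])
A = np.array([[0.0, 1.0], [1.0, 0.0]])
X_smooth = pre_smooth(X, A, gamma=0.3)   # node 0 moves toward 10, node 1 toward 0
```

With γ = 0 the function returns X unchanged, which matches the no-smoothing setting used for GNN-base.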

Because both the neighborhood size k (which controls KNN graph density) and smoothing coefficient γ can influence message-passing behavior and tail stability under OOD, sensitivity analyses over k and an ablation over γ (including γ = 0) are performed and summarized in Section 4.5.

3.5. Target-FPR Calibration and Operational Metrics

Let s(x) denote a model score for sample x (e.g., predicted malicious probability), and let τ be the decision threshold. For each target false positive rate α ∈ {0.001, 0.01, 0.05}, τ is calibrated on the benign subset of the validation split so that the false positive rate satisfies the target constraint. The validation FPR at threshold τ is

\mathrm{FPR}_{\mathrm{val}}(\tau) = \frac{1}{|V_{\mathrm{ben}}|} \sum_{x_i \in V_{\mathrm{ben}}} \mathbf{1}[s(x_i) \geq \tau]    (2)

where V_ben denotes the benign subset of the validation split, |V_ben| is its cardinality, and 1[·] is the indicator function. A conservative calibrated threshold is selected as

\tau(\alpha) = \min \{ \tau \mid \mathrm{FPR}_{\mathrm{val}}(\tau) \leq \alpha \}    (3)

In practice, τ(α) can be obtained as the (1 − α)-quantile of benign validation scores {s(x) | x ∈ V_ben}, with ties resolved conservatively to ensure FPR_val(τ) ≤ α. The same τ(α) is applied without modification to the seen-test and OOD-test, ensuring that the reported OOD behaviors reflect a fixed operating point rather than post hoc thresholding on the test distribution.
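One way to implement the conservative calibration rule is sketched below. The function name `calibrate_threshold` and the synthetic scores are hypothetical. The sketch also makes one assumption explicit: because samples with s(x) ≥ τ are flagged, the threshold is placed one floating-point step above the (m + 1)-th largest benign score, where m = ⌊α·|V_ben|⌋ is the number of validation false positives the target allows.

```python
import numpy as np


def calibrate_threshold(benign_val_scores, alpha):
    """Pick tau so that the benign-validation FPR, mean(s >= tau), is <= alpha."""
    s = np.sort(np.asarray(benign_val_scores, dtype=float))[::-1]  # descending
    n = s.size
    m = int(np.floor(alpha * n))      # false positives allowed on validation
    if m >= n:
        return -np.inf                # every benign sample may be flagged
    # Scores >= tau are flagged, so tau sits just above the (m+1)-th largest
    # score; ties at that score then fall below tau (the conservative choice).
    return float(np.nextafter(s[m], np.inf))


def fpr_at(scores, tau):
    """Fraction of benign scores flagged at threshold tau."""
    return float(np.mean(np.asarray(scores) >= tau))


rng = np.random.default_rng(0)
val_scores = rng.random(1000)         # synthetic benign validation scores
taus = {a: calibrate_threshold(val_scores, a) for a in (0.001, 0.01, 0.05)}
```

The resulting thresholds would then be reused as-is on the seen-test and OOD-test scores, without recomputing any quantile on test data.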

To quantify how well the false alarm constraint transfers under OOD shifts, let FPR_ood(τ(α)) denote the realized false positive rate measured on benign samples in the OOD evaluation at the calibrated threshold. The violation ratio is defined as

p(\alpha) = \frac{\mathrm{FPR}_{\mathrm{ood}}(\tau(\alpha))}{\alpha}    (4)

Using the normalized ratio p(α) makes constraint violations comparable across different target FPR regimes, because it expresses OOD false-alarm inflation as a scale-free multiple of the intended operating constraint. A ratio near 1 indicates that the target-FPR constraint transfers well to OOD; larger values indicate operational risk due to alert explosion. Operational robustness is summarized using both tail risk and budget feasibility. For q ∈ {90, 95} and a budget B ∈ {3, 10}, define

\mathrm{TailRisk}_q(\alpha) = \mathrm{Percentile}_q(p(\alpha)), \quad \mathrm{BudgetOK}_B(\alpha) = \Pr(p(\alpha) \leq B)    (5)

In addition, strict feasibility is reported as StrictOK(α) = Pr[p(α) ≤ 1], i.e., the probability that the realized OOD FPR does not exceed the target α. Each run corresponds to one random seed and one model (tabular learner or GNN backbone) under a fixed (window, α) setting. Additional run-level details for per-backbone variability and detection-quality metrics are provided in Supplementary Tables S1 and S2.
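The metrics in Equations (4) and (5) can be computed from per-run violation ratios as in the sketch below. The function names and the example ratios are hypothetical, and the percentile uses NumPy's default linear interpolation, which is an assumption because the interpolation rule is not specified in the text.

```python
import numpy as np


def violation_ratio(benign_ood_scores, tau, alpha):
    """p(alpha) = FPR_ood(tau(alpha)) / alpha at a fixed, pre-calibrated tau."""
    fpr_ood = float(np.mean(np.asarray(benign_ood_scores) >= tau))
    return fpr_ood / alpha


def summarize_runs(ratios, qs=(90, 95), budgets=(3, 10)):
    """Tail risk and feasibility rates over a collection of per-run ratios."""
    r = np.asarray(ratios, dtype=float)
    out = {f"p{q}": float(np.percentile(r, q)) for q in qs}  # linear interpolation
    out["StrictOK"] = float(np.mean(r <= 1.0))
    for b in budgets:
        out[f"BudgetOK_{b}"] = float(np.mean(r <= b))
    return out


# Hypothetical violation ratios from five runs (not results from the paper).
example_ratios = [0.5, 1.0, 2.0, 4.0, 20.0]
summary = summarize_runs(example_ratios)
```

In this toy example, two of five runs satisfy p ≤ 1, three satisfy p ≤ 3, and four satisfy p ≤ 10, while the single run at 20 dominates the upper percentiles. That is the behavior the tail-risk metric is meant to expose.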

4. Experiments and Results

In this section, we report our empirical results under the target-FPR calibrated zero-day (OOD) evaluation protocol described in Section 3. The goal is to assess not only detection performance but also operational robustness, i.e., whether a calibrated operating point remains feasible under distribution shift without triggering excessive false alarms.

4.1. Experimental Protocol and Implementation

We evaluated host-session intrusion detection under two temporal aggregation windows (1 m and 5 m) and three target operating points, α ∈ {0.001, 0.01, 0.05}. For each run, the decision threshold τ(α) is calibrated once using only the benign subset of the validation split to satisfy the target false positive rate α (Section 3.5), and the same calibrated threshold is transferred unchanged to both the seen-test and OOD/zero-day test splits. The results are aggregated over five random seeds {0, 1, 2, 3, 4}. Uncertainty is visualized using empirical percentile bands over repeated runs under the same setting (five seeds × models). The specific band definition is stated in the corresponding figure captions, and the sensitivity results are reported with a 95% interval. Note that τ(α) is derived from model scores on benign validation data and can therefore vary across random seeds even under the same split. Our repeated-run summaries (five seeds × models) capture this combined variability under the benign-validation-only protocol, while τ(α) is never tuned on OOD data.

The aggregation protocol used for the results reported below is as follows. For tail risk statistics of the OOD violation ratio p(α) (reported as p90/p95), we first compute the percentile over seeds for each model (tabular: LR/RF/MLP; GNN: each backbone), and then report the average of these percentile values across models within the same family. In contrast, the feasibility rates (StrictOK and BudgetOK) are reported as empirical probabilities averaged over all runs (equivalently, averaging per-model rates over seeds). This two-stage reporting separates (i) the worst-case sensitivity within each model from (ii) the overall likelihood of meeting the operational constraints, and it prevents a single extreme backbone from dominating the family summary, as could happen if all runs were pooled directly.
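The two-stage aggregation can be sketched as follows. The function names, the backbone keys, and the ratios are hypothetical, and NumPy's default linear-interpolation percentile is assumed.

```python
import numpy as np


def family_tail_risk(ratios_by_model, q=95):
    """Percentile over seeds within each model, then the mean across models."""
    per_model = [np.percentile(np.asarray(r, dtype=float), q)
                 for r in ratios_by_model.values()]
    return float(np.mean(per_model))


def family_feasibility(ratios_by_model, budget):
    """Fraction of all runs (every model and seed pooled) with ratio <= budget."""
    pooled = np.concatenate([np.asarray(r, dtype=float)
                             for r in ratios_by_model.values()])
    return float(np.mean(pooled <= budget))


# Hypothetical per-seed violation ratios for two backbones (five seeds each).
runs = {
    "GCN": [1.0, 1.0, 1.0, 1.0, 1.0],
    "GAT": [1.0, 1.0, 1.0, 1.0, 101.0],
}
tail_p95 = family_tail_risk(runs, q=95)
budget3_ok = family_feasibility(runs, budget=3)
```

When every model has the same number of seeds, the pooled feasibility rate equals the average of the per-model rates, so the two descriptions of that statistic coincide.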

4.2. Method Comparison

We compare three method families under the same target-FPR calibration protocol: (i) tabular baselines (Logistic Regression, Random Forest, and MLP) trained on node features only; (ii) GNN-base, using standard message-passing backbones (GCN, GIN, GraphSAGE, GAT) trained on the base graph; and (iii) GNN-ours, using the same backbones trained on the KNN similarity graph with pre-smoothing (γ) prior to GNN training.

All of these methods share the same host-session node representation, and the two GNN variants additionally share the same backbone set and training hyperparameters (Table 3). GNN-base and GNN-ours differ only in the underlying graph (rule-based base vs. KNN similarity) and the use of lightweight pre-smoothing, which isolates the effect of graph construction on OOD tail behavior and feasibility.

The base graph encodes domain/session relations derived from the host-session construction pipeline. In contrast, the similarity graph connects each node to its k-nearest neighbors in the feature space and is used to define neighborhoods that are consistent with observed behavioral similarity. Before training on the similarity graph, we apply a lightweight feature pre-smoothing step with strength γ to reduce sensitivity to noisy or brittle connections and to stabilize score distributions under OOD shifts. Since the operating threshold is fixed by calibration, this stabilization is particularly important for preventing false-alarm inflation in the strict target FPR regime. No information from the OOD-test split is used to select τ(α) or tune any hyperparameters; all operating points are fixed using only benign validation traffic. Therefore, differences between GNN-base and GNN-ours at the same α reflect the stability of the score distributions under shifts rather than post hoc threshold adjustments.

4.3. Main Operational Results Under OOD Constraints

We evaluated the operational robustness under OOD shifts by focusing on constraint satisfaction at a fixed calibrated operating point. Let p(α) = FPR_ood(τ(α))/α denote the violation ratio, where values above one indicate false-alarm inflation relative to the target. For example, at α = 0.001, p(α) = 10 corresponds to a realized OOD FPR of 1% at the same fixed operating threshold. We report both tail behavior (p90/p95 of p(α)) and feasibility probabilities, because rare but severe violations can dominate operational risk even when the average behavior appears acceptable. The corresponding trends are discussed below and complement the aggregated summaries in Table 4.

Table 4 reports the tail risk (p90/p95 of p(α)), strict feasibility Pr[p(α) ≤ 1], budget feasibility Pr[p(α) ≤ B] for B ∈ {3, 10}, and the realized mean OOD FPR at the calibrated operating point. Together, these summaries distinguish how severe violations can become (tail risk) from how frequently operational constraints are satisfied (StrictOK/BudgetOK), which is essential when the operating point is fixed by calibration.

At the strictest operating point (α = 0.001), GNN-base exhibits severe tail behavior under OOD shifts. In the 1 m window, the p95 violation ratio reaches 68.50, meaning that in the worst 5% of runs, the realized OOD false-alarm rate inflates to roughly 68.5 × α or more despite calibration on benign validation traffic. In contrast, GNN-ours substantially compresses this tail risk (p95 = 3.41 in 1 m), indicating a markedly more stable transfer of the calibrated threshold to OOD. A similar pattern appears in the 5 m window, where GNN-base shows p95 = 67.95 while GNN-ours remains close to the target (p95 = 1.15). The tabular baselines remain near the target in the strict regime (p95 of 1.46 in 1 m and 1.05 in 5 m), but they do not exploit the relational structure.

Figure 2 visualizes the same trend across target FPRs. The separation is most pronounced at α = 0.001, where operational failures are dominated by tail events rather than average behavior. As α becomes less stringent (0.01 and 0.05), GNN-ours consistently maintains lower tail amplification than GNN-base in both windows, while the tabular baselines remain comparatively stable but without graph-based modeling capacity. In each panel, the solid line summarizes the central tendency across runs, and the shaded band denotes an inter-quantile range over repeated runs (five seeds × models) at the same target FPR α. The horizontal blue dashed reference line indicates p(α) = 1 (i.e., FPR_OOD = α), and values above 1 represent violations (FPR_OOD > α). The light gray dashed lines in the background are grid lines for readability on the log-scaled y-axis and do not indicate additional thresholds.

Budget feasibility further highlights the operational robustness. Under the tight budget B = 3 at α = 0.001, feasibility increases from 0.20 (GNN-base) to 0.55 (GNN-ours) in the 1 m window, and from 0.45 to 1.00 in the 5 m window. Under B = 10, GNN-ours achieves a feasibility of 1.00 in both windows at α = 0.001, whereas GNN-base remains unstable (0.50 in 1 m and 0.55 in 5 m). Even at less strict targets, GNN-ours preserves its high feasibility: for α = 0.01, B = 3 feasibility is 1.00 (1 m) and 1.00 (5 m), while that of GNN-base stays low (0.30 in both windows). These results indicate that the proposed KNN graph with pre-smoothing improves the probability of meeting operational false-alarm budgets under OOD shifts at a fixed calibrated operating point.

As α becomes less strict, differences can appear smaller because the operating point is less demanding. This does not contradict the strict-regime conclusion, since feasibility in deployment is primarily determined by rare but costly failures under tight false-alarm constraints. Budget feasibility trends under OOD shifts are summarized in Figure 3.

In Figure 3, a higher BudgetOK rate is better because it means that more runs stay within the allowed false-alarm budget under OOD. GNN-base reaches BudgetOK = 1 at B = 10 and α = 0.05, which shows that when the operating point is already lenient, even a brittle model can satisfy a loose budget. The practical risk, however, is driven by the strict regime (α = 0.001), where alert explosions are most damaging and where GNN-base fails frequently.

4.4. Architecture Consistency and Supplementary Details

To avoid masking backbone-specific failure cases, we report the per-backbone variability at the strict operating point (α = 0.001) using the min/median/max of the per-model p95 ratios within each family, together with the mean BudgetOK rates (B = 3 and B = 10). This backbone-level summary, reported in Table 5, complements the family-level aggregation and highlights whether a small subset of backbones dominates tail behavior. The full per-model results are provided in Supplementary Table S1.
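The backbone-level aggregation can be sketched as follows: compute a p95 ratio per model over its seeds, then take the min/median/max across the models of a family. The per-seed ratios below are hypothetical placeholders and the interpolated p95 is an assumed convention; only the backbone names come from the experimental setup.

```python
from statistics import median

def p95(values):
    """95th percentile with linear interpolation (assumed convention)."""
    s = sorted(values)
    pos = (len(s) - 1) * 0.95
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

# Hypothetical violation ratios per backbone over five seeds (NOT the paper's data).
runs = {
    "GCN": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GIN": [0.5, 0.5, 1.0, 1.5, 2.0],
    "GraphSAGE": [2.0, 2.0, 2.0, 2.0, 2.0],
    "GAT": [1.0, 1.0, 10.0, 20.0, 100.0],
}

# One p95 per model, then the spread across models within the family.
per_model = {name: p95(r) for name, r in runs.items()}
p95_min = min(per_model.values())
p95_median = median(per_model.values())
p95_max = max(per_model.values())
```

In this toy family a single backbone (GAT) drives the maximum far above the median, which is the kind of backbone-specific failure that a family-level average would hide.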

Backbone-level consistency alone does not guarantee robustness to design parameters that directly control neighborhood aggregation. In particular, both k (graph density) and γ (feature mixing strength) can shift the extent of message propagation, which may affect tail behavior under strict operating points. We therefore examine parameter sensitivity next, to verify that the operational robustness trend persists beyond a single choice of neighborhood size k and pre-smoothing strength γ.

4.5. Sensitivity to Neighborhood Size k and Pre-Smoothing Strength γ

To reflect that operational failures are dominated by strict operating points, we report sensitivity results at α = 0.001 using the same tail-oriented robustness metric p(α) defined in Section 3.5. We summarize the tail behavior of p(α) over repeated runs and visualize uncertainty as a 95% interval. When varying k or γ, all remaining settings (representation, splits, calibration, backbones, and seeds) are kept fixed so that only neighborhood density (k) or feature mixing strength (γ) changes. Note that the realized undirected edge set follows the construction described in Section 3.3 (self-loop removal and duplicate-pair merging), and thus the final edge count can differ by window. Since a larger k generally increases graph density and deployment cost, Section 4.7 reports the overhead for the main configuration (k = 3). We then assess whether the robustness trend persists beyond a single choice.

Neighborhood size (k). We vary k ∈ {3, 5, 10}. The top row of the sensitivity results shows that the qualitative separation between GNN-base and GNN-ours remains consistent across k values in both windows, indicating that the main configuration (k = 3) is representative rather than arbitrary.

Pre-smoothing strength (γ). We ablate γ, including γ = 0 (no pre-smoothing). The bottom row of the sensitivity results indicates that the robustness pattern is not tied to a single hard-coded γ; the relative advantage of similarity-graph construction remains qualitatively stable across different pre-smoothing strengths. These sensitivity results are summarized in Figure 4.

4.6. OOD Splits Beyond a Single Held-Out Family

Beyond the primary Botnet hold-out, we additionally evaluate DDoS and DDoS+DoS as unseen-family splits using the same protocol (the unseen family is removed from Train/Validation and evaluated only at test time). This reduces the dependence of our conclusions on a single held-out family and better reflects diverse operational distribution shifts. We report a standard detection metric at the fixed calibrated operating point (OOD-F1) together with the realized false-alarm behavior on OOD traffic at the same operating threshold (realized OOD FPR). For completeness, Table 6 includes uncertainty intervals to reflect run-to-run variability under the identical calibrated operating point.

4.7. Computational Overhead and Deployment-Oriented Cost

Operational deployment requires that robustness improvements do not incur prohibitive computational overhead. We therefore summarize the additional cost introduced by (i) KNN graph construction relative to the base graph and (ii) practical parsing/loading time for the edge list under the same dataset scale and window settings. The deployment-oriented overhead summary is reported in Table 7.

5. Discussion

In this study, we evaluate zero-day (out-of-distribution, OOD) intrusion detection from an operational perspective in which false alarms must be explicitly controlled. Instead of comparing models at unconstrained thresholds, we adopt a target-FPR calibrated operating point: the decision threshold is selected based on benign validation traffic to satisfy a target false positive rate α, and the same calibrated threshold is transferred, unchanged, to both a seen-test and OOD/zero-day test. This protocol isolates a deployment-relevant question: can an operating point that appears acceptable during development remain feasible after a distribution shift, without relying on post hoc threshold tuning on the test distribution?

A key observation is that robustness under OOD shifts is not adequately characterized by averages alone. Under strict operating points (e.g., α = 0.001), GNN-base exhibits severe heavy-tailed amplification of false alarms on OOD data, where the worst-case behavior dominates operational risk. A plausible mechanism for this is a message-passing failure mode induced by brittle relational connectivity. When the rule-based base graph contains noisy, incomplete, or environment-specific domain/session links, neighborhood aggregation can propagate localized mismatches into many benign nodes, shifting the upper tail of the score distribution upward under OOD. Because the operating point is fixed by τ(α) on benign validation traffic, even a modest upward shift in the OOD tail can translate into a large multiplicative violation ratio p(α) in strict regimes. In contrast, the KNN similarity graph defines neighborhoods directly by feature proximity and, together with light pre-smoothing, reduces sensitivity to brittle edges and dampens high-variance components before aggregation, which stabilizes score tails and improves the transfer of τ(α) to OOD.
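The amplification argument can be illustrated with a deliberately simplified model. The sketch below assumes that benign scores are Gaussian and that an OOD shift moves their mean upward by a fixed amount; neither assumption comes from the paper, and real score distributions are unlikely to be Gaussian. It only shows why the same shift produces a larger multiplicative violation ratio at stricter α.

```python
from statistics import NormalDist

def ratio_after_shift(alpha, shift, sigma=1.0):
    """Violation ratio when benign scores follow N(0, sigma^2) at calibration
    time and N(shift, sigma^2) under OOD, with tau fixed at the
    (1 - alpha) quantile of the calibration distribution."""
    calib = NormalDist(0.0, sigma)
    tau = calib.inv_cdf(1.0 - alpha)                   # calibrated once
    fpr_ood = 1.0 - NormalDist(shift, sigma).cdf(tau)  # tail mass above tau after the shift
    return fpr_ood / alpha

# Same modest shift (half a standard deviation), three target FPRs.
ratios = {alpha: ratio_after_shift(alpha, shift=0.5)
          for alpha in (0.001, 0.01, 0.05)}
```

Under this toy model the ratio exceeds one at every α and grows as α decreases, consistent with the observation that strict operating points are the most sensitive to shifts in the upper tail of the score distribution.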

The proposed approach (GNN-ours) addresses this instability by modifying the relational structure and smoothing sensitivity in a lightweight manner. Replacing the rule-based base graph with a KNN similarity graph encourages behaviorally coherent local neighborhoods in feature space, which can be more stable when domain/session edges are noisy, incomplete, or brittle under shift. In addition, feature pre-smoothing attenuates high-variance components that can otherwise propagate through message passing and distort score distributions under OOD. Empirically, these choices substantially compress the tail risk: at α = 0.001, the p95 ratio decreases to 3.41 in the 1 m window and to 1.15 in the 5 m window, indicating markedly improved transfer of the calibrated operating point to OOD conditions.

In practice, budget feasibility can be interpreted as the probability that an analyst team can sustain alert handling without overload under zero-day conditions. Therefore, improvements in BudgetOK at a strict α are operationally meaningful even when the average FPR or unconstrained metrics are not emphasized. This perspective is aligned with deployment settings where the cost of excessive alerts is nonlinear and dominated by peak periods.

An important additional finding is that the robustness gains are not attributable to a single backbone architecture. The per-backbone variability under GNN-base is extreme (e.g., the p95 ratio ranges up to 111.00 in 1 m and 100.29 in 5 m), whereas GNN-ours yields a much smaller range across backbones (e.g., 2.87/3.22/4.34 in 1 m and 0.73/1.05/1.76 in 5 m for min/median/max p95 ratios). This consistency supports the interpretation that a KNN similarity graph with light pre-smoothing improves the operational stability in a backbone-agnostic manner, reducing the sensitivity to architectural idiosyncrasies.

Finally, we evaluate the detection quality at the same target FPR operating point to ensure that robustness gains are not driven by degenerate behavior (e.g., always-negative predictions). Since the operating point is fixed by α on benign validation traffic, improvements in operational robustness should be accompanied by meaningful detection capability. Across the target FPRs, GNN-ours tends to make more conservative decisions than GNN-base, typically improving OOD precision at the expense of OOD recall. For example, at α = 0.001, the OOD precision/recall changes from 0.3266/0.4938 (GNN-base, 1 m) to 0.5756/0.4003 (GNN-ours, 1 m), and from 0.2790/0.3868 (GNN-base, 5 m) to 0.5106/0.1953 (GNN-ours, 5 m); OOD F1 also increases from 0.3428 to 0.4641 (1 m) and from 0.2241 to 0.2753 (5 m). The full detection-quality results (precision/recall/F1 and realized FPR on Seen/OOD-tests) under target-FPR calibration are reported in Supplementary Table S2.

This study has limitations. First, the evaluation focuses on CICIDS2017-derived host-session representations under a fixed set of temporal windows and operating points, and the zero-day split is instantiated by holding out unseen attack behaviors during training. Broader validation on additional datasets (e.g., CICIDS2018 or other public intrusion datasets with compatible labeling) and longer time periods would strengthen the evaluation’s generality. Second, the current KNN construction uses a fixed neighbor count (k = 3) and cosine similarity; alternative similarity measures, adaptive neighborhood sizes, or learned graph construction may further improve robustness. Third, calibration is treated as a fixed operating-point selection using benign validation traffic; incorporating uncertainty-aware calibration, drift detection, or online recalibration mechanisms could improve reliability in non-stationary deployments and reduce sensitivity to evolving benign traffic patterns.

6. Conclusions

This work examined zero-day (OOD) intrusion detection under explicit false alarm constraints and highlighted that operational risk is often driven by worst-case false alarm amplification rather than average performance. By adopting a target-FPR calibrated evaluation protocol and transferring the calibrated threshold, unchanged, to OOD-test data, we directly measured whether a detector remains deployable after distribution shifts without relying on post hoc threshold tuning.

Under this operational protocol, standard message-passing GNNs trained on a rule-based base graph can be brittle under OOD shifts, exhibiting heavy-tailed false alarm amplification under strict regimes. In contrast, the proposed approach replaces the base graph with a KNN similarity graph and applies lightweight feature pre-smoothing prior to GNN training. Across two temporal windows (1 m and 5 m) and multiple target operating points (α ∈ {0.001, 0.01, 0.05}), the proposed method substantially reduces tail risk and improves feasibility under explicit alert budgets. At α = 0.001, it compresses the p95 violation ratio from 68.50 to 3.41 in 1 m and from 67.95 to 1.15 in 5 m, and it improves budget feasibility under B = 3 from 0.20 to 0.55 in 1 m and from 0.45 to 1.00 in 5 m. The gains are consistent across multiple GNN backbones, suggesting a backbone-agnostic improvement in operational stability.

In addition to robustness, the proposed method preserves meaningful detection capability at the calibrated operating point, improving OOD precision and OOD F1 under strict regimes at some cost in OOD recall. Overall, the results show that enforcing explicit false alarm constraints exposes failure modes that are invisible under unconstrained evaluation, and that combining a similarity-based graph with lightweight pre-smoothing can significantly reduce worst-case false alarm risk while maintaining detection quality. Future work will include extending the same pipeline to additional datasets (e.g., CICIDS2018) and longer time periods under the same target FPR protocol, conducting edge-level overlap/stability analyses between the rule-based base graph and the KNN graph (e.g., Jaccard overlap under split/window perturbations) to better quantify graph noise and its link to tail risk inflation, exploring adaptive or learned graph construction, and integrating uncertainty- or drift-aware calibration for non-stationary deployment environments.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/app16052284/s1, Table S1: Per-backbone results at α = 0.001 (CSV); Table S2: Detection quality results under target-FPR calibration (CSV).

Author Contributions

Conceptualization, Y.H. and K.K.; methodology, Y.H. and K.K.; validation, Y.H. and K.K.; formal analysis, Y.H.; investigation, Y.H.; data curation, Y.H.; writing—original draft preparation, Y.H.; writing—review and editing, Y.H. and K.K.; visualization, Y.H.; supervision, K.K.; project administration, K.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is available at https://www.kaggle.com/datasets/ernie55ernie/improved-cicids2017-and-csecicids2018 (accessed on 26 November 2025). The data generated during this study and the code used for the experiments are available from the authors upon request.

Acknowledgments

During the preparation of this manuscript, the authors used OpenAI’s ChatGPT-5 for the purposes of language formatting. The authors reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AUC	Area Under the Curve
AUPRC	Area Under the Precision–Recall Curve
AUROC	Area Under the Receiver Operating Characteristic Curve
CICIDS2017	Canadian Institute for Cybersecurity Intrusion Detection Dataset 2017
FPR	False Positive Rate
GAT	Graph Attention Network
GCN	Graph Convolutional Network
GIN	Graph Isomorphism Network
GNN	Graph Neural Network
GraphSAGE	Graph Sample and Aggregate
IDS	Intrusion Detection System
KNN	K-Nearest Neighbor
LR	Logistic Regression
MLP	Multilayer Perceptron
OOD	Out-of-Distribution
RF	Random Forest

References

Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
Denning, D.E. An intrusion-detection model. IEEE Trans. Softw. Eng. 1987, 2, 222–232.
Moreno-Torres, J.G.; Raeder, T.; Alaiz-Rodríguez, R.; Chawla, N.V.; Herrera, F. A unifying view on dataset shift in classification. Pattern Recognit. 2012, 45, 521–530.
Axelsson, S. The base-rate fallacy and the difficulty of intrusion detection. ACM Trans. Inf. Syst. Secur. 2000, 3, 186–205.
Bejtlich, R. The Practice of Network Security Monitoring: Understanding Incident Detection and Response; No Starch Press: San Francisco, CA, USA, 2013.
Hendrycks, D.; Gimpel, K. A Baseline for Detecting Misclassified and Out-Of-Distribution Examples in Neural Networks. arXiv 2016, arXiv:1610.02136.
Liang, S.; Li, Y.; Srikant, R. Enhancing the Reliability of Out-Of-Distribution Image Detection in Neural Networks. arXiv 2017, arXiv:1706.02690.
Ahmad, J.; Latif, S.; Khan, I.U.; Alshehri, M.S.; Khan, M.S.; Alasbali, N.; Jiang, W. An interpretable deep learning framework for intrusion detection in industrial Internet of Things. Internet Things 2025, 33, 101681.
Hong, S.; Zeng, Y. A health assessment framework of lithium-ion batteries for cyber defense. Appl. Soft Comput. 2021, 101, 107067.
Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1321–1330.
Shafer, G.; Vovk, V. A tutorial on conformal prediction. J. Mach. Learn. Res. 2008, 9, 371–421.
Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; Weinberger, K. Simplifying Graph Convolutional Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6861–6871.
Zhou, D.; Bousquet, O.; Lal, T.N.; Weston, J.; Schölkopf, B. Learning with local and global consistency. Adv. Neural Inf. Process. Syst. 2003, 16, 321–328.
Zhu, X.; Ghahramani, Z.; Lafferty, J. Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), Washington, DC, USA, 21–24 August 2003; pp. 912–919.
Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. In Proceedings of the International Conference on Information Systems Security and Privacy, Funchal, Portugal, 22–24 January 2018; pp. 108–116.
Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4–24.
Hastie, T. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2009.
Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A. A Detailed Analysis of the KDD CUP 99 Data Set. In Proceedings of the 2nd IEEE Symposium on Computational Intelligence for Security and Defence Applications, Ottawa, ON, Canada, 8–10 July 2009; pp. 1–6.
Ruff, L.; Vandermeulen, R.; Goernitz, N.; Deecke, L.; Siddiqui, S.A.; Binder, A.; Müller, E.; Kloft, M. Deep One-Class Classification. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4393–4402.
Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A Survey. ACM Comput. Surv. (CSUR) 2009, 41, 1–58.
Liu, F.T.; Ting, K.M.; Zhou, Z.-H. Isolation Forest. In Proceedings of the 2008 8th IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422.
Kipf, T.N. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907.
Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 2017, 30, 1024–1034.
Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. arXiv 2017, arXiv:1710.10903.
Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How Powerful are Graph Neural Networks? arXiv 2018, arXiv:1810.00826.
Rong, Y.; Huang, W.; Xu, T.; Huang, J. Dropedge: Towards Deep Graph Convolutional Networks on Node Classification. arXiv 2019, arXiv:1907.10903.

Figure 1. Target-FPR calibrated OOD evaluation pipeline.

Figure 2. Tail risk under OOD shifts (band of violation ratio p(α)) vs. target FPR α for 1 m and 5 m windows.

Figure 3. Budget feasibility under OOD shifts (BudgetOK rate with B = 3 and B = 10) vs. target FPR α for 1 m and 5 m windows.

Figure 4. Sensitivity summary combining (k sweep, 1 m/5 m) and (γ ablation, 1 m/5 m).

Table 1. Host-session node statistics (per window).

Window	Nodes	Feature Dim	Split (Train/Val/Test)
1 m	280,535	142	148,661/41,041/90,833
5 m	182,834	142	96,739/26,237/59,858

Table 2. Graph edge statistics (per window and graph type).

Window	Graph	Edges	Avg. Degree	Edge Weight
1 m	base	560,990	4.00	interaction count (domain weight)
5 m	base	392,264	4.29	interaction count (domain weight)
1 m	KNN (k = 3)	841,529	6.00	cosine similarity
5 m	KNN (k = 3)	401,135	4.39	cosine similarity

Table 3. Hyperparameter settings used in all experiments.

Item	Value
Windows	{1 m, 5 m}
Target FPR (α)	{0.001, 0.01, 0.05}
Seeds	{0, 1, 2, 3, 4}
Tabular models	Logistic Regression, Random Forest, MLP
GNN backbones	GCN, GIN, GraphSAGE, GAT
Training epochs	80
Hidden dimension	128
Layers	2
Dropout	0.2
Learning rate	1 × 10⁻³
GAT heads	4
Base graph	Domain/session rule-based edges (weighted)
Ours pre-smoothing γ	0.3

Table 4. Main results aggregated over five seeds (1 m/5 m and α ∈ {0.001, 0.01, 0.05}).

Window	Target FPR (α)	Group	p95 Ratio	p90 Ratio	StrictOK	BudgetOK (B = 3)	BudgetOK (B = 10)	Mean OOD FPR
1 m	0.001	tabular	1.456561	1.44509	0.467	1	1	0.001335
1 m	0.001	gnn_base	68.4983	66.1681	0.15	0.2	0.5	0.030437
1 m	0.001	gnn_ours	3.41142	3.26494	0.2	0.55	1	0.002408
1 m	0.01	tabular	1.89132	1.84151	0.333	0.867	1	0.013238
1 m	0.01	gnn_base	11.1428	10.6874	0.15	0.3	0.75	0.05843
1 m	0.01	gnn_ours	2.41893	2.36704	0.25	1	1	0.017773
1 m	0.05	tabular	7.58255	7.56965	0.333	0.667	0.667	0.376107
1 m	0.05	gnn_base	4.2246	4.15362	0.05	0.25	1	0.165096
1 m	0.05	gnn_ours	2.06226	2.00977	0.2	1	1	0.08316
5 m	0.001	tabular	1.05002	0.997125	0.667	1	1	0.000787
5 m	0.001	gnn_base	67.9525	63.3278	0.3	0.45	0.55	0.027274
5 m	0.001	gnn_ours	1.14624	1.10235	0.8	1	1	0.000778
5 m	0.01	tabular	1.45607	1.42659	0.667	1	1	0.012823
5 m	0.01	gnn_base	14.73	14.0114	0.15	0.3	0.7	0.075885
5 m	0.01	gnn_ours	1.38782	1.36688	0.3	1	1	0.011865
5 m	0.05	tabular	7.40931	7.39893	0.333	0.667	0.667	0.367715
5 m	0.05	gnn_base	5.35191	5.19265	0.1	0.25	1	0.190744
5 m	0.05	gnn_ours	1.69499	1.65972	0.2	1	1	0.068934

Table 5. Per-backbone variability at α = 0.001 (range across models within each family).

Window	Group	p95 Ratio (Min/Median/Max)	Mean BudgetOK (B = 3)	Mean BudgetOK (B = 10)
1 m	tabular	0.992/1.167/2.238	1.00	1.00
1 m	gnn_base	9.884/76.553/111.002	0.20	0.50
1 m	gnn_ours	2.867/3.220/4.338	0.55	1.00
5 m	tabular	0.506/0.949/1.695	1.00	1.00
5 m	gnn_base	0.939/85.290/100.292	0.45	0.55
5 m	gnn_ours	0.726/1.050/1.759	1.00	1.00

Table 6. Extended OOD split summary at α = 0.001 (OOD-F1 and OOD-FPR with uncertainty).

Unseen Family	Window	Tabular OOD-F1	Tabular OOD-FPR	gnn_base OOD-F1	gnn_base OOD-FPR	gnn_ours OOD-F1	gnn_ours OOD-FPR
ddos	1 m	0.633 [0.572, 0.693]	0.000964 [0.000765, 0.0012]	0.095 [0.026, 0.163]	0.0269 [0, 0.0567]	0.306 [0.126, 0.485]	0.0024 [0.000914, 0.0040]
ddos	5 m	0.396 [0.302, 0.490]	0.0010 [0.000827, 0.0012]	0.070 [0.045, 0.095]	0.0248 [0.0020, 0.0477]	0.219 [0.156, 0.282]	0.0012 [0.000761, 0.0016]
ddos+dos	1 m	0.727 [0.705, 0.749]	0.0013 [0.0011, 0.0016]	0.088 [0.000, 0.200]	0.0818 [0, 0.1664]	0.159 [0.074, 0.245]	0.0025 [0.000604, 0.0044]
ddos+dos	5 m	0.559 [0.497, 0.620]	0.0012 [0.000857, 0.0015]	0.026 [0.000, 0.078]	0.2003 [0.0981, 0.3026]	0.334 [0.222, 0.447]	0.0011 [0.000746, 0.0015]

Table 7. Overhead summary (graph statistics + parse/load time, base vs. KNN, 1 m/5 m).

Graph	Window	Edge List (MB)	CSV Parse Time (s)	Edges (XBase)	Size (XBase)	Time (XBase)
Base	1 m	11.90	0.086	1	1	1
Base	5 m	8.17	0.0628	1	1	1
KNN (k = 3)	1 m	28.93	0.1643	1.50	2.43	1.85
KNN (k = 3)	5 m	9.53	0.0661	1.02	1.17	1.05

XBase denotes the multiplicative factor relative to the base graph (e.g., edges, file size, or parsing time).
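As a worked check of the XBase columns, the edge and file-size factors can be recomputed from the edge counts in Table 2 and the edge-list sizes in Table 7. The dictionary layout and the `x_base` helper are illustrative only. Parse-time factors are left out of this sketch because they depend on timings that are reported here only in rounded form.

```python
# Values copied from Table 2 (edges) and Table 7 (edge-list size in MB).
base = {"1 m": {"edges": 560_990, "size_mb": 11.90},
        "5 m": {"edges": 392_264, "size_mb": 8.17}}
knn = {"1 m": {"edges": 841_529, "size_mb": 28.93},
       "5 m": {"edges": 401_135, "size_mb": 9.53}}

def x_base(knn_row, base_row):
    """Multiplicative factor of the KNN graph relative to the base graph."""
    return {key: round(knn_row[key] / base_row[key], 2) for key in base_row}

factors = {window: x_base(knn[window], base[window]) for window in base}
# factors["1 m"] -> {"edges": 1.5, "size_mb": 2.43}
# factors["5 m"] -> {"edges": 1.02, "size_mb": 1.17}
```

The recomputed factors agree with the Edges (XBase) and Size (XBase) columns, and they show that the overhead of the KNN graph is concentrated in the 1 m window.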

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
