A Framework for Budget-Constrained Zero-Day Cyber Threat Mitigation: A Knowledge-Guided Reinforcement Learning Approach

Mainak Basak; Geon-Yun Shin

doi:10.3390/s26010021

and

School of Computer Engineering & Applied Mathematics, Hankyong National University, Anseong-si 17579, Republic of Korea

^*

Author to whom correspondence should be addressed.

Sensors2026, 26(1), 21;https://doi.org/10.3390/s26010021

This article belongs to the Special Issue Cyber Security and AI—2nd Edition

Version Notes

Order Reprints

Abstract

Conventional machine-learning-based defenses are unable to generalize well to novel chains of ATT&CK actions. Being inefficient with low telemetry budgets, they are also unable to provide causal explainability and auditing. We propose a knowledge-based cyber-defense framework that integrates ATT&CK constrained model generation, budget-constrained reinforcement learning, and graph-based causal explanation into a single auditable pipeline. The framework formalizes the synthesis of zero-day chains of attacks using a grammar-formalized ATT&CK database and compiles them into the Zeek-aligned witness telemetry. This allows for efficient training of detection using the generated data within limited sensor budgets. The Cyber-Threat Knowledge Graph (CTKG) stores dynamically updated inter-relational semantics between tactics, techniques, hosts, and vulnerabilities. This enhances the decision state using causal relations. The sensor budget policy selects the sensoring and containment decisions within explicit bounds of costs and latency. The inherent defense-provenance features enable a traceable explanation of each generated alarm. Extensive evaluations of the framework using the TTP holdouts of the zero-day instances show remarkable improvements over conventional techniques in terms of low-FPR accuracy, TTD, and calibration.

Keywords:

Cyber-Threat Knowledge Graph; MITRE ATT&CK; generative cyber range; reinforcement learning for cyber defense; zero-day TTP

1. Introduction

Modern enterprise’s cyber-defense frameworks encounter three persistent deficiencies. First, it is challenging for detection models to address unseen attack chains that follow the ATT&CK ontology but differ in order, preconditions and tooling [1]. Second, telemetry is not free of charge. The teams operate under tight budgets for log volume, CPU, storage, and latency [2,3]. Third, the explanations are brittle under shifts and are generally not tied to the causal structures of tactics and techniques. Many systems cannot demonstrate which model, which knowledge state, or which configuration produced a given alert [4]. Public datasets and cyber simulation labs help. However, they leave these gaps open. The packet and flow corpora are static in nature. Red team replays are scripted. Most gym environments simulate fixed attack playbooks [2,5]. Logging is usually all-on or fixed per scenario. Causal semantics over ATT&CK are not exposed to the defender or to the evaluation harness. Artifacts such as alerts and explanations are saved as files without verifiable provenance. As a result, we cannot study cost-aware defense, generalization to zero-day chains, or explanation stability in a controlled and reproducible way [6,7,8].

We address these needs with Sim-CTKG, an ATT&CK-aware generative cyber range for reinforcement learning and defense [9]. Nodes represent tactics, techniques, software, hosts, and CVEs [10,11]. Edges encode preconditions and effects that link actions and entities. Sim-CTKG samples unseen technique chains with a formal grammar that respects ATT&CK constraints, then compiles each chain into witness telemetry for network and host logs. The defender observes features and a two-hop CTKG slice each step and learns a sensor-budget policy that chooses both containment actions and which log sources to activate under cost and latency limits [12]. A causal CTKG enables counterfactual probes that quantify preventability and explanation stability on held-out chains [13]. Every alert and explanation ships with a Content Credentials (C2PA) manifest so artifacts are verifiable.

The adversary follows tactic and technique semantics and may vary parameters, tools, and order within those constraints. The defender controls containment actions and logging configuration but cannot modify the attacker [14]. Telemetry is synthesized by the witness compiler and can be calibrated with a small set of real replays when available [15]. The range is not developed as a full digital twin. This is a research-grade environment that isolates the effects of costs, causality, and provenance on learning and detection.

A key design goal of Sim-CTKG is to isolate the effects of three factors that strongly influence real-world cyber-defense systems: (i) cost-aware sensing, (ii) causal structure and counterfactual reasoning, and (iii) verifiable provenance. The environment exposes configuration switches that independently enable or disable budget constraints, CTKG-based causal reasoning, and provenance logging, allowing us to characterize the individual and combined contributions of these components to detection accuracy, generalization, and robustness.

The key contributions of this research are as follows:

We develop a grammar-constrained generator for unseen ATT&CK technique chains and a corresponding compiler (Sim-CTKG) that produces structured, Zeek-aligned network and host telemetry consistent with each stage of the simulated intrusion.
We introduce a knowledge-guided, budget-aware reinforcement learning framework that treats sensing as a controllable action and leverages CTKG context over ATT&CK-valid zero-day simulations to achieve higher accuracy in low-FPR regions while maintaining strict cost and latency budgets.
We extended the defender’s action space to include dynamic log-source activation, explicitly modeling cost, bandwidth, and latency within the reward structure to enable efficient, cost-aware detection policies.
We encoded prerequisite and effect relations within the CTKG and employ two evaluative metrics Preventability and Explanation Stability to measure causal relevance and robustness under zero-day TTP holdouts.
We attached C2PA manifestations to alerts, explanations, and CTKG slices to enable verifiable auditing across the entire detection pipeline.
We evaluate extensively on zero-day motif splits, cost-aware metrics, baselines, ablations, and scenario cards.

The remainder of this paper is organized as follows. Section 2 presents the background and the threat models. Section 3 presents the related literature. Section 4 provides an overview of the system and the network architecture. Section 5 details evaluation results and qualitative visualizations. Section 6 explores the extended analysis. Section 7 expands on the ablation of the methodical evaluation on the proposed architecture. Section 8 concludes with limitations and future works.

2. Background and Threat Model

This section formalizes the concepts used throughout the study and fixes the attacker-defender setting in which the range instantiates. Model calibration is particularly important in security settings, where over- or under-confident scores can lead to misallocation of defensive resources. Recent work on calibration in real-world ML systems [16] emphasizes the need for trustworthy predictive probabilities. We therefore evaluate Sim-CTKG not only in terms of AUROC and time to detect, but also using Expected Calibration Error (ECE) to assess how well the predicted threat scores align with the observed frequencies. We define the Cyber-Threat Knowledge Graph (CTKG), which encodes the ATT&CK semantics, the reinforcement learning interface, the telemetry and cost model, and the provenance primitives. Furthermore, we state the threat model and the scope of claims in the remainder of this section.

2.1. ATT&CK Semantics and Entities

Let

T

denote ATT&CK tactics and

K

denote techniques. Let

S

be software or tooling,

H

hosts,

A

accounts, and

V

CVEs or vulnerabilities. Each technique

k \in K

has a set of preconditions

pre (k)

and effects

eff (k)

over entities and system state. A valid attack chain is a sequence

(k_{1}, \dots, k_{L})

such that

pre (k_{i + 1})

is satisfied after

eff (k_{i})

is applied. This simple contract is sufficient to encode tactic order, privilege changes, credential materialization, lateral reachability, and exfiltration readiness. We use these semantics both to generate new chains and to evaluate preventability.

The prerequisite and effect rules were constructed from MITRE ATT&CK, 11 publicly available APT reports with ATT&CK annotations, and entity-level dependencies extracted from replay logs. Conflicting edges were resolved via majority agreement and manual analyst review. The resulting rules cover 122 ATT&CK techniques relevant to our telemetry sources. Forward-chaining validation confirmed that no sampled sequence violates semantic prerequisites.

2.2. Cyber-Threat Knowledge Graph (CTKG)

The CTKG is a typed, weighted, time-stamped graph

G = (V, E)

. Nodes V are entities drawn from

T \cup K \cup S \cup H \cup A \cup V

. Edges

E \subseteq V \times R \times V

carry relationship types

r \in R

such as has_precondition, achieves, runs_on, affects, and communicates_with. Each node and edge has attributes: a trust weight

w \in [0, 1]

, a time interval

[t_{start}, t_{end}]

, and optional provenance tags. At step t, the environment provides a two-hop subgraph

G_{t}

centered on entities that are relevant to the current chain prefix and observed telemetry. This slice bounds observation size while preserving local causal structure. CTKG edges encode necessary but not exhaustive semantic dependencies. Because real-world CTI is often incomplete or noisy, we treat these edges as soft causal priors rather than hard constraints. The RL agent therefore uses the CTKG slice as a structured feature space that biases attention toward plausible successor techniques, while still learning statistical regularities from the telemetry itself.

2.3. MDP Interface for Cost-Aware Defense

We model defense as a finite-horizon Markov Decision Process with partial structure exposure. Time is discretized using observation windows. At step t the environment emits the following:

Feature vector $x_{t} \in R^{d}$ derived from network and host logs available under the current logging configuration,
CTKG slice $G_{t}$ as in Section 2.2,
Optional side signals such as queueing delay or buffer occupancy.

The agent chooses a joint action $a_{t} = (a_{t}^{def}, a_{t}^{\log})$ where the following are true:

$a_{t}^{def}$ values are containment actions permitted by policy, for example isolate host, block domain, or suspend process group,
$a_{t}^{\log} \in {0, 1}^{m}$ toggles m log sources such as Zeek conn, dns, http and host audit channels, all subject to budget.

The environment advances the hidden chain according to ATT&CK preconditions and effects. The reward is

r_{t} = r_{t}^{\det} - λ_{c} C (a_{t}^{\log}) - λ_{ℓ} L (a_{t}^{\log}) - λ_{a} A (a_{t}^{def}),

(1)

where

r_{t}^{\det}

rewards early and correct detection and penalizes false alarms,

C (\cdot)

is the logging cost,

L (\cdot)

is added latency from chosen sources, and

A (\cdot)

captures potential disruption from containment. Coefficients

λ_{c}, λ_{ℓ}, λ_{a} \geq 0

set trade-offs. The objective is to maximize expected return while satisfying an average budget constraint on C and L.

2.4. Telemetry Model and Budgets

Let the set of candidate sources be

{s_{i}}_{i = 1}^{m}

. Each source has a cost tuple

({cpu}_{i}, {bytes}_{i}, {delay}_{i})

and an information profile over techniques. The instantaneous cost under configuration

a_{t}^{\log}

is

C (a_{t}^{\log}) = \sum_{i = 1}^{m} a_{t, i}^{\log} (α {cpu}_{i} + β {bytes}_{i}), L (a_{t}^{\log}) = max_{i : a_{t, i}^{\log} = 1} {delay}_{i},

(2)

with nonnegative weights

α, β

chosen by the operator. The observation featurizer produces

x_{t}

only from active sources at step t. This design allows the agent to trade information for cost and delay in a principled way [16].

2.5. Causal Structure and Counterfactuals

A simple structural model was attached to the CTKG system. For each technique k, we define a binary structural variable

Z_{k}

that indicates whether k occurs within the window. Structural equations link

Z_{k}

to its parents using learned parameters and exogenous noise. Counterfactual queries intervene on variables by setting

Z_{k} \leftarrow 0

for candidate techniques and recomputing risk on the remaining chain. We report two metrics subsequently. Counterfactual Preventability measures the reduction in expected loss when removing a technique or edge before execution. Explanation Stability measures the overlap of important subgraphs across resampled conditions and zero-day holds [17,18].

2.6. Provenance and Verifiable Artifacts

Every alert or explanation produced by the range is paired with a Content Credentials (C2PA) manifest. Let A be the alert payload, E the explanation artifact,

θ

the model snapshot, and

H (G_{t})

a digest of the CTKG slice. We compute a content digest

D = H (A ‖ E ‖ θ ‖ H (G_{t}) ‖ config)

and sign D with a short-lived key managed by the range. The manifest binds payload, explanation, model version, graph context, and configuration [19,20,21,22]. A verifier recomputes the digest and checks the signature. We measure manifest size and verification time in Section 5.10.

2.7. Threat Model and Scope

The adversary chooses technique sequences that satisfy ATT&CK preconditions. Parameters, tools, and order within a tactic are free to vary as long as prerequisites hold. The adversary cannot break cryptography and cannot tamper with signed artifacts [23]. The adversary may attempt to reduce the signal by staying under logging thresholds that the defender sets [24]. The defender observes features from currently active log sources and the CTKG slice

G_{t}

. The defender selects containment and logging actions within budget. It cannot alter the attacker directly or read raw memory disks beyond the modeled log interfaces [25,26,27].

Telemetry is synthesized by a witness compiler that maps technique steps to structured network and host logs [28,29,30]. Calibration with limited real replays is supported but not required for the correctness of the algorithms. The contentions in this paper concern cost-aware detection, generalization to zero-day chains under ATT&CK semantics [31,32,33], causal preventability, as defined above, explanation stability under distribution shift, and auditability of artifacts through signed manifests [34,35].

4. Materials and Methods

This section describes the technical design of Sim-CTKG in full detail. We begin with an ATT&CK-constrained generator that samples previously unseen technique chains. We then map each chain to witness telemetry for network and host logs. We formulate cost-aware defense as a constrained Markov decision problem with joint containment and logging control. We define a typed Cyber-Threat Knowledge Graph (CTKG) and a causal engine that supports counterfactual queries. We close with the policy architecture and the content provenance pipeline (Algorithm 1).

Algorithm 1 Sim-CTKG Training Pipeline

Input:: $C$ with seed s, $T_{allow}$ , $M$ , $L_{max}$ , $τ$ , $G = (N, Σ, P, S)$ , ${τ (k, θ)}$ , priors $p (θ ∣ k)$ , ${B_{s}}$ for sources $s = 1, \dots, m$ , $({cpu}_{i}, {bytes}_{i}, {delay}_{i})$ for each source $s_{i}$ , $(B_{avg}, B_{lat})$ , $π_{θ} (a^{def}, a^{\log} ∣ x, G, u)$ , critic $V_{ψ}$ , $(λ_{c}, λ_{ℓ}, λ_{a})$ , duals $(η_{c}, η_{ℓ}) \leftarrow (0, 0)$
Output:: $π_{θ^{★}}$ , $V_{ψ^{★}}$ , $(η_{c}^{★}, η_{ℓ}^{★})$
1:: Set PRNG seed s; initialize CTKG manager, causal engine, and environment core with $C$
2:: while not converged do
Chain Sampling under Constraints
3:: Initialize state $s_{0}$ from topology and credentials; set chain $C \leftarrow []$
4:: for $i = 0$ to $L_{max} - 1$ do ▹ Grammar-constrained masked sampling
5:: $A \leftarrow {k \in Σ : tactic (k) \in T_{allow}, Valid (k ∣ s_{i}), \neg ViolatesHoldout (C \oplus k, M)}$
6:: if $A = \emptyset$ then break
7:: end if
8:: Sample $k_{i + 1} \sim Softmax (log π_{ϕ} (k ∣ s_{i}) / τ)$ over $k \in A$ ; $C \leftarrow C \oplus k_{i + 1}$ ; $s_{i + 1} \leftarrow ApplyEffects (s_{i}, eff (k_{i + 1}))$
9:: end for
Witness Compilation
10:: For each $k \in C$ : draw $θ \sim p (θ ∣ k)$ ; instantiate events from $τ (k, θ)$ with globally consistent entity IDs
11:: For each source s: merge attack events with background $E_{s}^{bg} \sim B_{s}$ ; enforce precedence and latency bounds; sort by time
Rollouts with Joint Control
12:: Reset budgets; initialize two-hop CTKG slice $G_{0}$ and window features from active sources
13:: for $t = 1$ to horizon T do
14:: Form observation $(x_{t}, G_{t}, u_{t})$ from currently active sources; get action $a_{t} = (a_{t}^{def}, a_{t}^{\log}) \leftarrow π_{θ} (x_{t}, G_{t}, u_{t})$
15:: Apply $a_{t}^{def}$ in env; activate sources per $a_{t}^{\log}$ ; compute costs $C_{t} = \sum_{i} a_{t, i}^{\log} (α {cpu}_{i} + β {bytes}_{i})$ , $L_{t} = {max}_{i : a_{t, i}^{\log} = 1} {delay}_{i}$
16:: Advance hidden chain if preconditions hold; update CTKG and structural variables; emit alert/explanation if triggered
17:: Compute detection reward $r_{t}^{\det}$ and full reward $r_{t} = r_{t}^{\det} - λ_{c} C_{t} - λ_{ℓ} L_{t} - λ_{a} A (a_{t}^{def})$
18:: Form effective reward with duals ${\tilde{r}}_{t} = r_{t}^{\det} - (λ_{c} + η_{c}) C_{t} - (λ_{ℓ} + η_{ℓ}) L_{t} - λ_{a} A (a_{t}^{def})$
19:: If an artifact $(A, E)$ is produced: compute digest $D = H (A ‖ E ‖ θ ‖ H (G_{t}) ‖ config)$ ; sign D; store C2PA manifest
20:: end for
Policy and Dual Updates
21:: Compute advantages ${\hat{A}}_{t}$ with GAE on ${\tilde{r}}_{t}$ and $V_{ψ}$ ; update $θ$ with clipped actor loss
22:: Use straight-through estimator for Bernoulli logging head
23:: Update critic by MSE on returns
24:: Update duals by projected subgradient: $η_{c} \leftarrow {[η_{c} + ρ (\frac{1}{T} \sum_{t} C_{t} - B_{avg})]}_{+}$ , $η_{ℓ} \leftarrow {[η_{ℓ} + ρ (p 95 ({L_{t}}) - B_{lat})]}_{+}$
25:: end while
26:: return $π_{θ^{★}}$ , $V_{ψ^{★}}$ , $(η_{c}^{★}, η_{ℓ}^{★})$

4.1. ATT&CK-Constrained Scenario Generation

Let

T

be the set of tactics and

K

the set of techniques. Each technique

k \in K

carries a precondition set

pre (k)

and an effect set

eff (k)

over entities such as privileges, credentials, processes, files, services, and network relations. A chain

C = (k_{1}, \dots, k_{L})

is valid when for each

i < L

, the post-state of

k_{i}

satisfies

pre (k_{i + 1})

. We encode this constraint through a typed grammar with attributes:

G = (N, Σ, P, S),

(3)

where nonterminals

N

capture tactic phases, terminals

Σ

are techniques, S is the start symbol, and productions in P include attribute checks on

pre (\cdot)

and

eff (\cdot)

. A production

X \to Y Z

is permitted only if the attribute evaluator confirms feasibility under the current state. This yields a sequence model with hard validity.

To prevent trivial reuse, we sampled under motif holdouts. Let

M

be a set of technique motifs such as T1059, T1105, T1021 that define zero-day families. In training, chains whose ordered subsequences intersect

M

are suppressed. During evaluation, we sample those motifs exclusively. This split forces generalization to new compositions rather than single unseen techniques (see Figure 1).

Figure 1. Expanded Causal Engine in CTKG (Module C): Structural Modeling and Counterfactuals.

We parameterize the generator with a distribution

π_{ϕ} (k_{i + 1} ∣ {state}_{i})

that respects grammar constraints. If the state

s_{i}

tracks entities and partial order, the chain likelihood is

p_{ϕ} (C) = \prod_{i = 0}^{L - 1} π_{ϕ} (k_{i + 1} ∣ s_{i}) \cdot I {valid (k_{i + 1} ∣ s_{i})} .

(4)

The indicator enforces validity. Sampling proceeds by masked transition, where invalid techniques have zero probability. We expose a temperature parameter to control diversity and a per-tactic cap to prevent degenerate loops.

4.2. Witness Telemetry Compiler

Each technique instance in C is compiled into a structured network and hosts events that we call witness telemetry. Let

τ (k, θ)

denote a template for technique k with parameters

θ

such as process name, command line, server domain, port, file path, hash, and user context. Given a schedule

t = 1, \dots, T

, the compiler produces for each active log source s a set of events

E_{s} = {e_{s, t}}

with coherent timing and identifiers. Network witnesses include Zeek conn, dns, http, ssl, files, and notice. Host witnesses include process creation, image loads, registry or service changes, scheduled tasks, and network socket events. Entity identifiers are consistent across sources so that joints reconstruct causal paths.

Let

B_{s}

be a background process for the source s that samples benign events from a stationary mixture of daily patterns. The background is injected independently of the chain. Technique witnesses are injected on top with parameter draws from priors

θ \sim p (θ ∣ k)

. Collision examinations prevent infeasible overlaps such as reusing a file handle before creation. Timing represents the precedence and network latency bounds. The featurizer that yields

x_{t}

sees only events from the sources that are active in step t.

Calibration is supported when real replays are present. A small set of replay logs fits priors for

θ

and marginal rates for

B_{s}

through simple moment matching. This improves realism while maintaining the generation controlled by the seeds. The full event schema and template library are part of the released artifact.

4.3. Cost-Aware Reinforcement Learning

We model the defender as a learning agent with joint control over containment and logging. At step t, the observation is

(x_{t}, G_{t}, u_{t})

where

x_{t} \in R^{d}

are features computed from the currently active sources,

G_{t}

is a two-hop CTKG slice, and

u_{t}

holds auxiliary signals such as queue delay. The action is

a_{t} = (a_{t}^{def}, a_{t}^{\log})

with

a_{t}^{def} \in A_{def}

and

a_{t}^{\log} \in {0, 1}^{m}

for m candidate log sources.

In this domain, logging incurs cost and latency. Let each source

s_{i}

have a cost tuple

({cpu}_{i}, {bytes}_{i}, {delay}_{i})

measured by profiling. Costs at time t are

C (a_{t}^{\log}) = \sum_{i = 1}^{m} a_{t, i}^{\log} (α {cpu}_{i} + β {bytes}_{i}), L (a_{t}^{\log}) = max_{i : a_{t, i}^{\log} = 1} {delay}_{i},

(5)

with nonnegative weights

α, β

chosen per scenario. The reward uses a detection term

r_{t}^{\det}

and penalties for cost, latency, and disruptive containment:

r_{t} = r_{t}^{\det} - λ_{c} C (a_{t}^{\log}) - λ_{ℓ} L (a_{t}^{\log}) - λ_{a} A (a_{t}^{def}) .

(6)

We enforce average budgets

B_{avg}

and latency budgets

B_{lat}

through a Lagrangian relaxation. Dual variables

(η_{c}, η_{ℓ})

update online so that the effective reward becomes

{\tilde{r}}_{t} = r_{t}^{\det} - (λ_{c} + η_{c}) C (a_{t}^{\log}) - (λ_{ℓ} + η_{ℓ}) L (a_{t}^{\log}) - λ_{a} A (a_{t}^{def}),

(7)

with dual updates

η_{c} \leftarrow {[η_{c} + ρ (C (a_{t}^{\log}) - B_{avg})]}_{+}

and a similar rule for

η_{ℓ}

. We train an actor–critic with generalized advantage estimates while treating the logging head as a binary policy with a straight-through gradient. Action masking enforces policy-level constraints such as forbidden containment. Our use of dual variables to enforce average-cost and latency constraints is conceptually aligned with prior constrained RL formulations studied in safety-critical robotics [58,59], although our setting differs in that the constraints apply to sensing actions and telemetry budgets rather than physical actuation.

4.4. CTKG Construction and Causal Engine

The CTKG is a typed multigraph

G = (V, E)

with relationship labels in

R

. The nodes include tactics, techniques, software, hosts, accounts, files, processes, domains and CVEs. The edges capture the relationships such as requires, achieves, runs_on, spawns, connects_to, and resolves_to. Each node and edge has a trust weight in

[0, 1]

and a time interval. At time t, the environment returns a two-hop slice

G_{t}

centered on the active technique footprint and the entities mentioned in the current window of telemetry. This slice preserves the local causal structure while bounding observation size.

We attach a simple structural model to techniques. Let

Z_{k} \in {0, 1}

indicate whether technique k occurs within the step. We model

Z_{k} = I \{f_{k} (Z_{pa (k)}, ξ_{k}) \geq 0\},

(8)

where

pa (k)

are parents in the CTKG and

ξ_{k}

is exogenous noise. Functions

f_{k}

are linear or shallow neural units whose parameters fit to traces produced by the generator and to observed detections. This captures prerequisite and effect patterns without overfitting to a single chain.

The counterfactual queries intervene on the structural model. For a candidate technique

k^{★}

, we replace

Z_{k^{★}} \leftarrow 0

and recompute the expected detection loss under the learned model. Thereby, partial observations are realized. The Counterfactual Preventability for a set of S techniques is

CP (S) = E [ℓ (policy, env) - ℓ (policy, do (Z_{S} = 0), env)],

(9)

where ℓ is a per-episode loss, such as time, to detect a missed detection indicator. We estimate CP by Monte Carlo over seeds and by importance sampling when interventions change only local factors.

Explanation Stability measures robustness of graph attributions under shift. Let

A (G_{t})

be a set of important nodes and edges obtained from the policy’s graph encoder by gradient-based or perturbation scores at a matched operating point. For two runs r and

r^{'}

under resampling or held-out motifs, we define

XS = E [\frac{| A_{r} \cap A_{r^{'}} |}{| A_{r} \cup A_{r^{'}} |}],

(10)

with confidence intervals from block bootstrap over episodes. This penalizes explanations that drift when the causal structure is unchanged (Figure 1).

The counterfactual analysis is based on a structural causal model (SCM) defined over the CTKG slice. For each technique variable

Z_{k}

m, we introduced a structural equation (Equation (8)), where the structural functions

f_{k}

are learned from observational simulation traces and capture approximate causal influence patterns. The exogenous variables

ξ_{k}

are assumed to be mutually independent, and CTKG edges provide the graph structure of potential causal dependencies, but not exact numerical parameters; hence, the CTKG is treated as a probabilistic causal prior rather than a perfect oracle. Because the SCM parameters are estimated from simulated observation distributions, causal quantities such as counterfactual preventability are identifiable only relative to the assumed generative model. The resulting scores should therefore be interpreted as model-based leverage estimates, not definitive statements about real-world causation.

4.5. Policy Architecture

The policy consumes

(x_{t}, G_{t}, u_{t})

and emits

a_{t}^{def}

and

a_{t}^{\log}

. We encode

x_{t}

with a residual multilayer perceptron that includes feature-wise linear modulation from

u_{t}

. We encode

G_{t}

with a graph attention network over relation-specific projections. The encoders produce embeddings

h_{x}

and

h_{g}

which are fused by cross-attention where

h_{x}

queries

h_{g}

. The joint representation feeds two heads. The containment head outputs a categorical distribution over

A_{def}

. The logging head outputs m Bernoulli logits for sources. We share lower layers and separate the final projections.

Training uses an actor–critic objective with clipped policy updates to stabilize learning under binary logging choices. We include an attribution-consistency penalty that encourages stable graph rationales across resampled windows. Weights for penalties are set per scenario card and validated by grid search on training splits.

4.6. Provenance and Verifiable Artifacts

Every alert and explanation receives a content manifest that binds the payload to the model and the CTKG context. Let A be the alert payload and E be the explanation artifact such as a vector or a highlighted subgraph. Let

θ

be a model snapshot and let

H (G_{t})

be a digest of the CTKG slice. We compute a content digest

D = H (A ∥ E ∥ θ ∥ H (G_{t}) ∥ config)

(11)

and sign D with a short-lived key under a content–credentials profile. The verifier recomputes the digest and checks the signature. The manifest stores public metadata, including model identifier, scenario card, and hash algorithms. Signing and verification latencies are recorded during evaluation to quantify overhead (see Figure 2).

Figure 2. Overview of Sim-CTKG network architecture.

Analytic utility and scope. The provenance pipeline does not modify the learning dynamics or inference behavior of Sim-CTKG and can be disabled without affecting detection accuracy or cost efficiency. Instead, it serves as a verification layer that strengthens the credibility of the reported results. Because our evaluation involves motif-holdout generalization and budget-constrained sensing, reproducible analysis requires reconstructing the exact sequence of sensing actions, CTKG slices, and scenario–card configurations. The provenance module records (i) activated telemetry sources and their timestamps, (ii) the CTKG slice used at each step, (iii) hashes of the model parameters, and (iv) the metadata of the simulated scenario. This enables independent auditors to confirm that no leakage occurred from held-out motifs and that all reported results respect the declared sensing budgets.

4.7. Feature Featurization and Windowing

Let events within window

[t - W, t)

from active sources

S_{t} = {s_{i} : a_{t, i}^{\log} = 1}

be

E_{s_{i}, t}

. Each source s has a feature map

F_{s} : E_{s, t} \to R^{d_{s}}

composed of counts, rates, and sketch statistics:

F_{s} (E_{s, t}) = [cnt, rate, uniq, top - k, cmsketch, tf \cdot log \frac{T}{df}] .

(12)

We compute per-destination and per-host aggregates and concatenate

x_{t} = Norm (⨁_{s \in S_{t}} F_{s} (E_{s, t})), Norm (z) = \frac{z - μ}{σ + ε} .

(13)

Incremental updates use prefix-sums and rolling sketches; the update cost is

O (| E_{t} |)

per step with

| E_{t} | = \sum_{s \in S_{t}} | E_{s, t} |

.

4.8. Relation-Aware Graph Encoder

Each CTKG slice

G_{t} = (V_{t}, E_{t})

has node features

h_{v}^{(0)} = [type onehot; tfidf of IOC / soft

;

risk prior]

. We use

L_{g}

layers of relation-aware attention (R-GAT) with the residual and layer norm:

e_{u v r}^{(l)} = LeakyReLU (a_{r}^{⊤} [W_{r}^{(l)} h_{u}^{(l)} ∥ W_{r}^{(l)} h_{v}^{(l)}]), α_{u v r}^{(l)} = {softmax}_{u \in N_{r} (v)} e_{u v r}^{(l)},

(14)

h_{v}^{(l + 1)} = LN (h_{v}^{(l)} + σ (\sum_{r \in R} \sum_{u \in N_{r} (v)} α_{u v r}^{(l)} W_{r}^{(l)} h_{u}^{(l)})) .

(15)

We set

(L_{g}, d_{hid}, H) = (3, 128, 4)

heads unless stated. The complexity per layer is

O (\sum_{r} | E_{t, r} | H d_{hid})

.

4.9. Containment Semantics and Safety Mask

Actions

a_{t}^{def} \in {ISOLATE (h), BLOCK_FQDN (d), KILL (p), SINKHOLE (c 2)}

. A safety mask

M_{t}

forbids actions that violate policy or prerequisites; we apply masked sampling:

π_{θ} (a_{t}^{def} ∣ x_{t}, G_{t}) \propto exp z_{θ} (a) \cdot I {a \in M_{t}} .

(16)

Operational cost penalizes harmful interventions:

A (a_{t}^{def}) = κ_{1} I {wrong host isolate} + κ_{2} I {critical process kill} + κ_{3} I {excessive actions} .

(17)

4.10. Cost Profiling and Source Calibration

For each source

s_{i}

, we measure tuples

({cpu}_{i}, {bytes}_{i}, {delay}_{i})

by replaying synthetic bursts at rate r and fitting

{cpu}_{i} (r) = α_{i}^{cpu} + β_{i}^{cpu} r, {bytes}_{i} (r) = α_{i}^{io} + β_{i}^{io} r, {delay}_{i} (r) = α_{i}^{Δ} + β_{i}^{Δ} r .

(18)

During training we plug the realized per-window rate

{\hat{r}}_{i, t} = | E_{s_{i}, t} | / W

.

4.11. Background and Noise Model

Each source uses a seasonal inhomogeneous Poisson process with lognormal marks for sizes:

λ_{s} (τ) = λ_{0, s} (1 + \sum_{k = 1}^{K} a_{k, s} sin \frac{2 π k τ}{24} + b_{k, s} cos \frac{2 π k τ}{24}), S \sim LogNormal (μ_{s}, σ_{s}^{2}) .

(19)

Cross-source correlation is induced by a Gaussian copula with correlation matrix

Σ

estimated from benign calibration traces. Collision resolution shifts events by

δ \sim Exp (η)

subject to precedence constraints.

4.12. Policy Heads and Loss

Encoders yield

h_{x}

and

h_{g}

; fusion uses cross-attention

h^{★} = Attn (Q = h_{x}, K, V = h_{g})

. Heads:

π_{θ}^{def} = softmax (W_{d} h^{★}), π_{θ}^{\log} = Bernoulli (σ (W_{ℓ} h^{★})) .

(20)

Actor loss is the clipped surrogate on joint policy

π_{θ} = π_{θ}^{def} \cdot π_{θ}^{\log}

; critic loss is MSE on

V_{ψ}

. We add attribution-consistency

L_{cons} = λ_{cons} \cdot JSD (A_{t}, A_{t^{'}})

for matched points.

4.13. Provenance Keying and Verification

We fix hash

H = SHA - 256

and signatures

Sig = Ed 25519

. Manifests include payload_hash, explanation_hash, model_id, ctkg_hash, scenario_card, and time. Keys rotate every R hours; manifests carry the key ID. Verification checks

D = H (A ∥ E ∥ θ ∥ H (G_{t}) ∥ config)

and

Verify (D, σ, key_id)

.

4.14. Scalability and Complexity

Let chain length be L, average feasible fan-out

\bar{d}

, and window events

E_{t}

. Grammar sampling is

O (L \bar{d})

with attribute checks. Compilation is

O (\sum_{t} | E_{t} |)

. CTKG updates are

O (| E_{t} | + \sum_{r} | E_{t, r} |)

. Graph encoding is

O (L_{g} H d_{hid} | E_{t} |)

per step. Overall single-episode complexity is linear in emitted events and slice edges.

4.15. Leakage Controls

Zero-day motif holdout is enforced at generation time (training suppresses evaluation targets). Structural parameters for the causal engine are fit only on training runs and do not read evaluation motifs. Hyperparameters are selected on a validation set that excludes

M

. All random seeds are recorded in scenario cards.

5. Evaluation and Results

This section presents a full evaluation of the proposed system under zero-day motif holds, explicit sensor and latency budgets, and causal accountability. All scenario data, seeds, and budgets are fixed before training. We report per-episode metrics with

95 %

confidence intervals from block bootstrap over episodes. Operating points satisfy both the average cost and p95 latency constraints.

5.1. Dataset

This section documents all datasets used in the study. Each card follows a consistent template that covers motivation, composition, collection, preprocessing, labeling, splits, statistics, cost profiles, known limits, and access.

5.1.1. Sim-CTKG Zeek Telemetry (Primary)

This is a training and evaluation set for budgeted detection policies that fuse telemetry with a cyber threat knowledge graph (CTKG). The dataset aligns simulator events with Zeek log schemas and exposes held-out attack motifs for zero-day testing, such as, multi-source network and host telemetry as Zeek-style tabular records from

K = 11

sources: conn, dns, http, ssl, files, notice, proc, image, svc/reg, task, socket. Each record is time-stamped and keyed by host and flow identifiers. Features are compacted into

x_{t} \in R^{d_{x}}

by a streaming compiler. A two-hop CTKG slice

G_{t}

is bound at each step.

Events are produced by an ATT&CK-constrained simulator with parameterized scenario cards. The generator draws benign background and attack process trees, network motifs, and timing parameters from card priors. Each step emits both witness telemetry and an alignment tuple that binds events to ATT&CK technique labels and CTKG entities.

We normalize continuous features per source, bucket counts with log transforms, and encode categorical fields with learned embeddings. Sliding windows build

x_{t}

with window size

w = 5

steps and stride 1. We drop fields that leak labels by construction.

Labels exist at three levels: (i) per-step technique indicator

y_{t}

and tactic group, (ii) chain-level success, and (iii) first correct alert time for TTD. Only labels from held-out motifs are used at test time.

We use 12 scenario cards grouped by stage: Execution, Command and Control, Lateral Movement, and Exfiltration (three per stage). For each card we generate episodes with motif holds: no instance of a held-out chain is present in training. Per card, we use 200 train episodes, 60 validation episodes, and 200 test episodes. This yields

12 \times (200 + 60 + 200) = 5520

episodes total (Table 4).

Table 4. Sim-CTKG v1.0 split summary and basic episode statistics.

Episode length: median 520 steps (IQR 360 to 640). Attack-labeled steps:

8 %

to

15 %

per card. Benign-only episodes:

25 %

of validation,

25 %

of test. Budgets used:

B_{avg} \in {0.8, 1.3, 2.0}

; latency budget

B_{lat}

: p95

\leq 3.0

s (Table 5).

Table 5. Logging sources and normalized cost/latency used by the budget controller.

We list the 12 cards used for generation and evaluation. Each card holds at least one motif at test time (Table 6). The generator models common enterprise topologies and timings. Industrial protocols and very long-range chains are beyond the scope of this version of the simulator. Source costs reflect our lab pipeline and may differ for other deployments.

Table 6. Overview of scenario cards. Dominant techniques are ATT&CK IDs used to define chains.

5.1.2. CTKG Snapshot (Knowledge Graph for Decision Context)

An operational snapshot of a cyber threat knowledge graph is used as part of the agent state. The graph encodes tactics, techniques, software, CVE identifiers, CAPEC patterns, and their typed relationships. A multi-relational directed graph with entities and relationships is listed below (Table 7). The counts reflect the subset used and export time.

Table 7. CTKG composition. Counts refer to the exported subset used in this study.

In step, t we consider a two-hop slice around the seeds inferred from telemetry. The slice includes tactics, techniques, software, and CVE nodes within radius

r = 2

with relation types prerequisite, effect, implements, exploits, and belongs_to. The slice is capped by the node budget and used by the graph encoder. We validate schema consistency, remove dangling IDs, and enforce acyclicity on causal edges that represent precondition → effect links. We log version hashes for each export and release verification scripts. The snapshot is a focused subset of the scenarios evaluated here. It does not aim to be complete. Edges that encode causality are supported by public CTI and curated rules. However, certain long-range effects are not included.

5.1.3. DARPA TCAD-Derived Alignment Set (Auxiliary)

An auxiliary set for sanity evaluations is conducted for simulation-to-real alignment. Though we do not train on TCAD, we use public program artifacts to derive distributions of inter-event timings, process tree shapes, and host roles that inform our scenario priors. Thereafter, we validate CTKG mappings on a small number of hand-mapped traces. Derived statistics and mapping are used as support features only. This set is used to verify that the simulator outputs produce Zeek-style records with similar field distributions for key sources such as conn, dns, http, and files. It also informs the causal edges used for preventability analysis. The coverage reflects a subset of enterprise roles and time windows (Table 8).

Table 8. Summary view across dataset cards.

5.2. Experimental Configuration

All reinforcement learning experiments use PPO with the hyperparameters summarized in Appendix A. Unless otherwise stated, we train for 1.2 M environment steps per seed with a batch size of 4096 transitions, a PPO clip ratio of 0.20, a discount factor

γ = 0.99

, and a GAE parameter

λ = 0.95

. Budget constraints are enforced through dual variables

η_{c}

and

η_{ℓ}

, which are updated via projected subgradient steps. At each time step, the reward is modified as

r_{t} \leftarrow r_{\det} (y_{t}, {\hat{y}}_{t}) - η_{c} C (δ_{t}) - η_{ℓ} I [lat (δ_{t}) > B_{lat}],

where

C (δ_{t})

denotes the sensing cost of the chosen telemetry subset and

B_{lat}

is the latency budget. We evaluate three main metrics: (i) AUROC@B (AUROC over episodes that satisfy both cost and latency budgets), (ii) time to detect (TTD), measured as the number of steps between the first adversarial action and the first alert, and (iii) Expected Calibration Error (ECE), computed using a 10-bin calibration histogram. Budget adherence is reported as the percentage of evaluation episodes that satisfy both constraints.

We use three scenario cards with different tactic structures and holdout motifs. Card-A: Execution → Persistence → Lateral → Exfiltration with

M_{A} = {(T 1059, T 1105, T 1021)}

. Card-B: Discovery → CredentialAccess → Lateral with

M_{B} = {(T 1087, T 1003), (T 1049, T 1021)}

. Card-C: Command&Control → Collection → Exfiltration with

M_{C} = {(T 1105, T 1041)}

. Training suppresses any chain that contains a held motif. Evaluation samples from the held motifs only.

Active sources include Zeek (conn, dns, http, ssl, files, notice), and host telemetry for process, image load, service or registry, scheduled task, and socket. Each source

s_{i}

has profiled tuples

({cpu}_{i}, {bytes}_{i}, {delay}_{i})

that are affine in instantaneous rate

{\hat{r}}_{i, t} = | E_{s_{i}, t} | / W

.

We evaluate three average cost budgets

B_{avg} \in {0.8, 1.3, 2.0}

(relative units) and a latency budget

B_{lat} = p 95 (L_{t}) \leq 3.0 s

. All reported operating points satisfy both constraints.

Our policy uses cross-attention fusion of feature and CTKG encoders with joint heads for containment and logging. Baselines: Flat-RL (PPO on

x_{t}

only), KG-noCausal (graph encoder without prerequisite or effect edges), Static-Full (all sources on), Static-Min (fixed minimal pack) (Table 9).

Table 9. Detection under cost and latency budgets.

Each card trains for

3 \times 10^{6}

environment steps with early stopping on AUROC@budget on a validation split that excludes all motifs in

M

. Evaluation uses

R = 200

held-out episodes per card. We report per-card and macro averages.

5.3. Primary Detection Under Budgets

The substantial reduction in Time to Detect (TTD) from 31.3 steps (Flat-RL) to 18.2 steps (Sim-CTKG) at

B_{a v g} = 1.3

indicates that the agent is not merely detecting more but detecting earlier. By leveraging the CTKG structure, the policy identifies causal precursors (a specific process spawn) that predict future harm, allowing it to alert before the high-volume exfiltration phase begins. Crucially, the ‘Static-Full’ baseline fails to generate a valid score at lower budgets because it rigidly activates all sensors, violating the cost constraints immediately. This validates the necessity of the learning-based sensor selection.

Our policy gains

+ 7.7

to

+ 14.6

AUROC points over Flat-RL across budgets and halves TTD at moderate budgets. Static-Full meets budgets only at

2.0

. These gains are consistent across cards (Table 9).

5.4. Operating Characteristics at $B_{avg}$ = 1.3

Beyond raw accuracy, the significant improvement in Expected Calibration Error (ECE) (0.028 vs. 0.071 for Flat-RL) suggests that the CTKG provides necessary semantic grounding. The Flat-RL baseline, lacking this grounding, tends to be overconfident on out-of-distribution inputs. Lower ECE implies that the Sim-CTKG agent’s confidence scores are more trustworthy proxies for actual threat probability, a critical requirement for automated response systems (Table 10).

Table 10. Operating characteristics at

B_{avg} = 1.3

,

B_{lat} = 3.0

s. TPR at fixed FPR, precision at fixed recall, ECE.

Calibration improves with the CTKG and causal attachment, which reduces overconfidence at tight budgets (Figure 3).

Figure 3. Budget policy analysis. (a) Source Activation Rate Heatmap. (b) Data Volume vs. Budget analysis. (c) Latency Budget Compliance. (d) Source Importance Ranking.

5.5. Zero-Day Motif Generalization

The high performance (

A U R O C > 0.94

) on held-out motifs confirms that the agent is learning abstract causal dependencies rather than memorizing signature sequences. For example, in Card-B, the agent successfully detects the ‘Discovery → Credential Access’ chain even though it was trained without the specific (T1087, T1003) dyad (Table 11). This suggests the Graph Encoder successfully aggregates risk across the ‘Prerequisite’ edges, allowing the policy to infer the intent of the novel chain based on its structural similarity to known attacks (Figure 4).

Table 11. Zero-day evaluation on held motifs at

B_{avg} = 1.3

. AUROC and TTD per card with

R = 200

episodes each.

Figure 4. Zero-day motif generalization performance. (a) Zero-Day AUROC Performance. (b) Zero-Day TTD Performance.

Permutation tests stratified by card give

p < 0.01

for Sim-CTKG vs. both baselines on AUROC and TTD. Cohen’s d ranges from

0.86

to

1.21

.

5.6. Logging Policy and Resource Use

The policy exhibits distinct economic behaviors across budget regimes. At the tight budget (

B_{a v g} = 0.8

), the agent relies heavily on low-cost, low-latency sources like ‘notice’ and ‘conn’ (0.62 activation). As the budget relaxes to 2.0, it does not simply turn everything on; instead, it selectively increases the usage of expensive, high-fidelity sources like ‘proc’ (Process Creation), effectively learning to ‘buy’ deeper introspection only when the initial low-cost signals warrant investigation (Table 12).

Table 12. Logging behavior by budget. Activation rate per source (fraction of steps where

a_{t, i}^{\log} = 1

), average bytes per episode, and p95 latency.

This policy prefers low-latency sources in tight budgets and activates high-latency channels near the pivotal windows. This selective stance maintains the p95 latency constraint (Figure 5).

Figure 5. Zero-day generalization: rewards vs. episodes.

5.7. Ablations on Novelty Pillars

The ablation of the ‘Sensor-budget head’ results in the steepest decline in TTD (worsening to 26.2 steps). This isolates the value of active sensing: the ability to dynamically toggle log sources is not just a cost-saving mechanism but a detection enhancer, as it allows the agent to reduce the signal-to-noise ratio by focusing on relevant telemetry channels (Table 13).

Table 13. Ablations at

B_{avg} = 1.3

. Removing any pillar degrades detection or earliness.

5.8. Robustness to Telemetry Shift

Notably, the system demonstrates resilience to CTI errors. When 25% of the prerequisite edges are randomly removed from the knowledge graph, performance degrades gracefully (<4% drops) (Table 14). This indicates that the R-GAT encoder learns to function as a ‘soft’ reasoner, utilizing the statistical correlations in the telemetry (

x_{t}

) to bridge gaps where the explicit knowledge graph (

G_{t}

) is incomplete (Figure 6).

Table 14. Robustness evaluation at

B_{avg} = 1.3

.

Figure 6. Causal analysis. (a) Causal leverage analysis (b) Explanation robustness under shift.

Performance degrades under realistic shifts and remains within budget in all tested conditions. The exception is the forced latency spike, which approaches the latency limit as anticipated.

Because real-world CTI is often incomplete or noisy, we conducted a perturbation study on the CTKG rule set. We randomly removed 15%, 25%, and 35% of the prerequisite edges and added 10% spurious edges. For each perturbed graph, we retrained the detector under identical budgets and report AUROC@B, TTD, and counterfactual preventability. Results show that Sim-CTKG maintains robust performance for moderate noise levels: AUROC decreases by only 1.7% (15% edge removal) and 3.4% (25% removal), while TTD increases by 1.3–2.1 steps. Importantly, the ranking of the top five preventability techniques remained unchanged in 84% of test episodes. This indicates that our cross-attention fusion treats CTKG structure as a soft prior rather than a rigid rule set, enabling graceful degradation when CTI is incomplete.

5.9. Causal Accountability

The Counterfactual Preventability (CP) scores align with operational intuition (Table 15). The high CP for ‘C2 Handoff’ (

0.159

) identifies it as a critical choke point in the kill chain. Furthermore, the high Explanation Stability (XS =

0.71

) (Table 16) to Flat-RL (

0.32

) confirms that the Sim-CTKG agent consistently attributes alerts to the same root causes (nodes), even when the attack instantiation varies.

Table 15. Counterfactual Preventability

\hat{CP} (S)

at

B_{avg} = 1.3

for common mid-chain levers. Higher is better.

Table 16. Explanation Stability

XS

(Jaccard overlap of important CTKG subgraphs) at matched operating points.

The causal engine assigns the highest preventability to C2 establishment and the Execution to C2 handoff, which aligns with known leverage points.

5.10. Throughput and Overhead

The provenance signing overhead (

0.3

ms) is two orders of magnitude lower than the detection latency, confirming that cryptographic accountability can be enforced in real time without compromising throughput. We log each module artifact concerning latency and overhead (Table 17). Through the provenance manifest logging, we ascertain that the overhead of the proposed framework is small relative to telemetry and inference (Table 18).

Table 17. Per-step runtime breakdown (mean ± std) at

B_{avg} = 1.3

over all cards.

Table 18. Provenance manifest size and verification latency.

5.11. Summary

The defender achieves strong detection under sensor and latency budgets, generalizes to eliminate the motifs, and yields stable, causal explanations. Selective logging is essential in tight budgets, and the provenance overhead is small. Ablations reveal that the generator constraints, sensor-budget control, and causal CTKG are all necessary for the observed gains. An adversarial setting test was conducted with the full defender model. It performed significantly above the benchmark value. The observed operating characteristics match the design of the methodology and validate each component in the pipeline.

6. Extended Analyses

6.1. Pareto Fronts Under Budget Constraints

We visualized the trade-off between detection and resource use using two fronts: TPR at 1% FPR vs. average cost, and Time to Detect vs. average cost. Points correspond to the three budgets evaluated (

B_{avg} \in {0.8, 1.3, 2.0}

) with the p95 latency constraint satisfied. Our method is positioned on or above the frontier relative to the baselines (Figure 7 and Figure 8).

Figure 7. TPR vs. average cost. Higher is better.

Figure 8. Pareto front: TTD vs. average cost. Lower is better.

6.2. Per-Technique and Stage-Wise Efficacy

We compute the median TTD by stage with bootstrap confidence intervals at

B_{avg} = 1.3

,

B_{lat} = 3.0

s. The earlier detection at the execution and C2 stages verifies that the CTKG and causal attachment help identify pivotal transitions (Table 19).

Table 19. Stage-wise median TTD (steps) at

B_{avg} = 1.3

. Lower is better.

6.3. Training Stability and Convergence

To characterize the convergence properties of our constrained RL formulation, we trained the detector with 10 different random seeds and report the mean ± 95% confidence intervals of cumulative return. Figure 9 shows that, despite the presence of discrete sensing actions and dual-variable updates, training exhibits smooth and monotonic convergence with no mode collapse or oscillatory instability.

Figure 9. Training stability plot.

We further monitored the dual variables

η_{c}

and

η_{ℓ}

, which enforce the average-cost and latency constraints. As shown in Figure 10, both variables quickly stabilize around feasible values and oscillate within a narrow bounded region after approximately 100 k environment steps. This behavior is consistent with the convergence properties of standard primal–dual optimization methods, indicating that the cost constraints are neither overly loose nor excessively active. Variance across seeds is also low: AUROC@B varies by

\pm 1.2 %

, TTD varies by

1.9

steps, and budget adherence varies by

2.3 %

. These results confirm that the learning dynamics are stable and reproducible across different initializations (Table 20).

Figure 10. Cost/Latency constraint convergence analysis.

Table 20. Performance comparison across all baseline methods.

6.4. Calibration and Reliability

We report the calibration using the Expected Calibration Error (ECE), Brier score, and negative log-likelihood (NLL) for the validation split and verified similar trends on test (Table 21).

Table 21. Calibration metrics at

B_{avg} = 1.3

. Lower is better.

6.5. Seed Stability

We trained 10 seeds per card and report the macro-averages and standard deviations. The variance was low. This was consistent with the block-bootstrap CIs reported earlier. The causal conclusions drawn from preventability and explanation stability must be interpreted within the limits of the structural model. Our analysis reflects how interventions change outcomes in the learned SCM, given the CTKG-derived graph structure, rather than an exhaustive account of all possible real-world pathways. Nevertheless, we observe that high-preventability techniques remain stable under perturbations of the CTKG and across random seeds, suggesting that the SCM captures robust patterns that are useful for operational decision-making (Table 22).

Table 22. Seed stability over 10 runs.

6.6. Computational Parity and Throughput

We ensure computation impartiality by reporting parameters, approximate FLOPs per step, and wall-clock latency. Sim-CTKG requires marginally more computation than Flat-RL because of the graph encoder. However, it still maintains per-step latency under 5ms and provides a stronger detection rate at equal or lower cost (Table 23).

Table 23. Throughput parity at

B_{avg} = 1.3

. FLOPs are approximate per-step forward ops.

For completeness, the provenance pipeline adds

0.3 \pm 0.1

ms when an artifact is emitted. Verification (offline) requires

0.41

–

0.52

ms per manifest. Provenance does not contribute directly to the numerical performance metrics, but it ensures that the detection, generalization, and budget-adherence claims in Section 5 and Section 7 remain verifiable and reproducible, which is crucial in safety-critical security applications.

6.7. Pairwise Significance at the Main Operating Point

We report stratified permutation tests at

B_{avg} = 1.3

with Benjamini–Hochberg correction across endpoints. Sim-CTKG is significant compared with all the baselines at

p < 0.01

(Table 24).

Table 24. Pairwise tests at

B_{avg} = 1.3

(stratified by card).

6.8. Isolation of Cost, Causality, and Provenance Effects

To isolate the contribution of individual components, we conducted controlled experiments where cost constraints, causal reasoning, and provenance were independently disabled. For cost, we compare three regimes: (1) no sensing budgets (all sources always enabled), (2) a static budget where a fixed subset of sources is preselected, and (3) dynamic budgeted RL with dual variables. Dynamic sensing consistently reduces telemetry volume by 40–47% relative to static selection while maintaining a 5–7% AUROC advantage, showing that cost-aware policies learn to prioritize high-value sources.

For causality, we compare RL without CTKG, RL with CTKG but no SCM, and the full CTKG+SCM configuration. Removing CTKG increases mean Time to Detect and degrades AUROC; adding CTKG without SCM improves structural awareness but yields less stable preventability estimates. The full causal engine improves explanation stability and preserves high-preventability techniques across seeds. Provenance does not affect these metrics directly, but it ensures that the generalization results and budget-adherence claims can be audited and reproduced, particularly in motif-holdout evaluations. This isolation of cost, causality, and provenance is one of the key novelties of Sim-CTKG compared to existing cyber-defense simulators and RL-based detectors.

6.9. Implications for Cyber-Defense System Design

The empirical results have several implications for the design of next-generation cyber-defense systems. First, the strong performance of Sim-CTKG under strict sensing budgets suggests that future SOC pipelines can benefit from adaptive telemetry activation instead of static logging policies. The CTKG-enhanced fusion module shows that relational knowledge can guide policies toward high-value sources, reducing unnecessary overhead while preserving detection quality. Second, preventability analysis identifies attack stages where early disruption yields disproportionate reductions in attacker success, providing actionable guidance for prioritizing detection rules and hardening efforts.

6.10. Practical Deployment Considerations

Several practical factors influence how the framework can be adopted in real environments. CTKG construction depends on the availability of host and network telemetry; organizations with fragmented pipelines may need to bootstrap the graph using historical incidents or curated CTI feeds. At inference time, the sensing policy introduces modest overhead, since CTKG slice extraction operates on bounded neighborhoods. However, latency budgets must be calibrated to the specific deployment environment, and noisy latency distributions in cloud-native settings may require online budget adaptation. Provenance manifests integrate with SIEM/EDR systems by providing verifiable records of alerts, CTKG snapshots, and model versions.

6.11. Limitations

The CTKG structure is derived from curated CTI and simulation traces and does not capture all real-world adversarial behaviors. The structural causal model is an approximation learned from observational data, so preventability and explanation stability should be interpreted as model-based diagnostics, not absolute ground truth. Simulator realism, although improved compared to prior work, still abstracts away kernel-level details and intra-host lateral movements. Finally, budget-constrained RL assumes reasonably stable latency profiles.

7. Ablation Study

We perform a thorough ablation to isolate the contribution of each component, stress-test design choices, and compare against strong recent alternatives under the same zero-day motif holds and the same cost/latency budgets. Unless noted, results are for

B_{avg} = 1.3

and

B_{lat}

= p95

\leq 3.0

s with

R = 200

episodes per card and block-bootstrap

95 %

confidence intervals.

7.1. Core Components

We remove one pillar at a time from the full system. The cross-attentive fusion over

(x_{t}, G_{t})

, the causal CTKG attachment, and the sensor-budget head each contribute materially to detection and earliness, with improved calibration at the same operating point (Table 25).

Table 25. Core component ablations at

B_{avg} = 1.3

. AUROC/AUPRC are at budget.

7.2. Fusion and CTKG Scope Sensitivity

We vary the CTKG slice hop radius r, the relation-aware layers

L_{g}

, and the attention heads H. We observe that larger slices and deeper stacks improve detection, but at the same time, they increase latency. Our default (

r = 2, L_{g} = 3, H = 4

) sits on the knee of the curve (Table 26) (Figure 11).

Table 26. Sensitivity to CTKG scope and encoder depth at

B_{avg} = 1.3

. Latency is per step forward time for the graph encoder.

Figure 11. Core component and budget analysis. (a) AUROC performance. (b) Detection time performance.

7.3. Budget Sweep and Operating Slices

We report the full model and key ablations across budgets and operating slices at low FPR. Selective logging interacts strongly with the budget; without the budget head, TTD and calibration degrade even when AUROC is similar (Table 27).

Table 27. Budget sweep. Mean ± 95% CI across cards.

7.4. Comparisons to Recent Alternatives

We include strong non-RL detectors and advanced RL baselines trained under the same splits and tuned with equal hyper-parameter budgets. All numbers respect the same cost and latency constraints. Where a method cannot meet the budget, the cell is marked (Table 28).

Table 28. Non-RL detectors at

B_{avg} = 1.3

.

TCN-Detector is a temporal convolutional model on

x_{t}

; T-Transformer is a telemetry-only transformer on

x_{t}

; RelGAT-only uses the CTKG slice without telemetry; MoE-Selector uses a mixture-of-experts with a heuristic source selector (Figure 12).

Figure 12. Attacker vs. Defender dynamics across attack scenarios.

Flat-RL is PPO on

x_{t}

; KG-noCausal is PPO with graph encoder but without prerequisite/effect edges; InfoBottleneck-RL adds an information bottleneck on

x_{t}

; Heuristic-RL uses a scripted log selector with PPO containment (Table 29).

Table 29. RL baselines at

B_{avg} = 1.3

.

7.5. Robustness Under Telemetry Shift

We perturb background intensity, drop events uniformly at random, add clock skew, and induce latency spikes. We report the relative AUROC change and the TTD increase (Table 30). Our policy degrades gracefully and retains budget adherence.

Table 30. Robustness at

B_{avg} = 1.3

: relative AUROC drop (negative is worse) and

Δ

TTD in steps.

To evaluate robustness with respect to inaccuracies in curated CTI, we perturb the CTKG by randomly removing 15–25% of prerequisite edges and adding 10% spurious dges. Across these conditions, AUROC decreases by only 1.7–3.4% and the mean time to detect increases by 1–2 steps compared to the unperturbed CTKG, indicating that Sim-CTKG is not brittle with respect to moderate rule noise or incompleteness.

7.6. Compute Parity

We report parameter counts, approximate FLOPs per step, and measured per-step latency. The graph encoder adds cost, but the total latency remains under 5 ms. Compute-normalized comparisons still favor our policy under budgets (Table 31).

Table 31. Compute parity at

B_{avg} = 1.3

. FLOPs are approximate forward ops per step.

7.7. Fairness Protocol and Significance

All external baselines train on the same training split with motif suppression, use the same early stopping rule, and are tuned with the same hyper-parameter budget. We cap wall-clock and batch sizes for parity and report only operating points that satisfy both budgets. Pairwise stratified permutation tests at

B_{avg} = 1.3

remain significant at

p < 0.01

(Benjamini–Hochberg) for AUROC@B and TTD@B when comparing the full model to each baseline. Effect sizes range from

d = 0.9

to

1.3

for AUROC and from

d = 1.0

to

1.3

for TTD.

7.8. Adversarial Scenario

The competitive dynamics between attacker and defender agents across six distinct attack scenarios are presented in Figure 12. It demonstrates the superior performance of our Sim-CTKG framework. The visualizations reveal that our model consistently outperforms baseline methods (KG-noCausal, Flat-RL, Static-Full, Heuristic-RL) by maintaining a defensive advantage even under critical attack conditions. Blue lines represent the proposed model’s defender agents, while red lines depict the corresponding attacker performance. Green-shaded regions indicate defender advantage zones, with Sim-CTKG expanding these zones significantly during critical attack scenarios, whereas other models fail to maintain effective defense. The results demonstrate our proposed model’s robustness in adversarial learning environments, achieving 1.4 times performance in critical scenarios compared to 0.6–0.9 times for competing approaches. This emphasizes the key findings: our proposed model’s novelty, consistent performance across scenarios, effectiveness in critical attacks, and clear demonstration of defensive advantage.

To ensure that the performance improvements of Sim-CTKG are not artifacts of hyperparameter tuning, we conducted a controlled sensitivity analysis. We varied (i) the PPO learning rate by

\pm 2 \times

, (ii) the dual-update step size

ρ \in {0.5 \times, 1 \times, 2 \times}

, (iii) the CTKG hidden dimension by

\pm 50 %

, and (iv) the sensing penalty

λ_{c}

by

\pm 0.2

. Across all settings, Sim-CTKG retained a consistent margin over the strongest baseline. For example, under a doubled learning rate, AUROC decreased by only

0.013

, while the relative improvement over the strongest baseline remained

0.051

. Disabling cross-attention or removing the CTKG slice, however, caused large degradations (

- 0.074

AUROC,

+ 7.3

TTD), confirming that the observed gains stem from the architectural components rather than favorable tuning.

8. Conclusions and Future Work

This work introduced Sim-CTKG, a research-grade cyber-defense environment designed to study the interplay between cost-aware sensing, causal structure, and provenance in reinforcement-learning-based intrusion detection. Our results show that structured knowledge and budget constraints can significantly reduce telemetry usage while maintaining high detection performance and that causal preventability analysis provides actionable insights into high-leverage stages of the attack chain.

Several concrete research directions follow from this work. First, model-based or long-horizon RL algorithms could improve the agent’s ability to anticipate future attack stages and plan proactive mitigations rather than reacting myopically. Second, integrating online CTKG learning with live SOC telemetry would allow the causal structure to adapt to emerging techniques and organization-specific behaviors. Third, adaptive budget allocation strategies could incorporate asset criticality and uncertainty estimates, dynamically shifting sensing resources toward the most valuable or at-risk components.

Extending Sim-CTKG to different cyber-threat domains is an important direction. Cloud-native attack paths, such as privilege escalation through misconfigured IAM roles or serverless functions, require CTKG schemas that capture identity, configuration, and control-plane events. ICS/SCADA environments introduce physical process variables and strict real-time constraints, necessitating domain-specific causal models and latency budgets. Identity-centric attacks (Kerberos or OAuth abuse) and IoT/5G deployments would also require tailored telemetry models and CTKG node types.

The CTKG is derived from curated CTI and simulated traces and therefore may omit rare or novel adversarial patterns. The structural causal model is an approximation learned from observation and cannot capture all real-world causal pathways. Simulator realism, while improved over prior work, still abstracts away low-level kernel and microarchitectural details. Finally, the budget-constrained RL formulation assumes that latency distributions are reasonably stable over time.

Future work will focus on data-driven refinement of CTKG structure, incorporating real EDR/NDR logs into the SCM learning process, and performing hardware-in-the-loop or shadow deployments to close the sim-to-real gap. We also plan to explore federated or multi-tenant versions of Sim-CTKG, enabling organizations to share causal knowledge and budgeted sensing strategies without exposing raw telemetry. These extensions will help overcome current limitations and move closer to deployable, trustworthy, and adaptable causal RL systems for cyber-defense.

Author Contributions

Conceptualization, M.B. and G.-Y.S.; methodology, M.B.; software, M.B.; validation, M.B. and G.-Y.S.; formal analysis, M.B.; investigation, G.-Y.S.; resources, G.-Y.S.; data curation, M.B.; writing—original draft preparation, M.B.; writing—review and editing, G.-Y.S.; visualization, M.B.; supervision, G.-Y.S.; project administration, G.-Y.S.; funding acquisition, G.-Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. RS-2023-00248132). This research was also supported by Korea–Philippines Joint Research Program funded by the Ministry of Science and ICT through the National Research Foundation of Korea (RS-2025-25122978).

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from respective owners of the third-party datasets. Restrictions apply to the availability of these data, which were used under license for this study. The authors of this study donot have permission to distribute the datasets.

Acknowledgments

We thank the university for the resources and the editors and reviewers for the rigorous assistance. We also thank the professors for their advice and support.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AUC	Area Under Curve
AUROC	Area Under the Receiver Operating Characteristic curve
AUPRC	Area Under the Precision–Recall curve
ATT&CK	Adversarial Tactics, Techniques, and Common Knowledge (MITRE)
BH	Benjamini–Hochberg false discovery rate control
B_avg	Average sensing cost budget
B_lat	Latency budget bound (e.g., p95 latency)
CI	Confidence interval
C2	Command and Control
CAPEC	Common Attack Pattern Enumeration and Classification
CP	Counterfactual Preventability
CPE	Common Platform Enumeration
CTI	Cyber Threat Intelligence
CTKG	Cyber-Threat Knowledge Graph
CV	Cross-validation
CVE	Common Vulnerabilities and Exposures
DARPA TCAD	DARPA Transparent Computing attack traces corpus
ECE	Expected Calibration Error
FLOPs	Floating point operations
FPR	False Positive Rate
GAT	Graph Attention Network
GNN	Graph Neural Network
IDS	Intrusion Detection System
KG	Knowledge Graph
MDP	Markov Decision Process
NLL	Negative Log-Likelihood
PPO	Proximal Policy Optimization
PR	Precision–Recall
p95	95th percentile (e.g., latency)
RL	Reinforcement Learning
ROC	Receiver Operating Characteristic
SOC	Security Operations Center
TTP	Tactics, Techniques, and Procedures
TTD	Time To Detect
TTD@B	TTD measured under budget constraints
TPR	True Positive Rate

TPR@1% FPR	TPR at 1% false positive rate
XS	Explanation Stability
Zeek	Open-source network telemetry framework
AUROC@B	AUROC measured under budget constraints
AUPRC@B	AUPRC measured under budget constraints
T1059, T1105, …	MITRE ATT&CK technique IDs used in scenarios
Sim-CTKG	Simulator-aligned CTKG-guided RL detector (this work)
r	CTKG slice hop radius
$L_{g}$	Number of graph encoder layers
H	Number of attention heads
$x_{t}$	Compact telemetry feature vector at time t
$G_{t}$	Two-hop CTKG slice at time t

Notations

Symbol	Meaning	Type/Units
States, actions, budgets, costs
$s_{t} = (x_{t}, G_{t})$	Agent state at time t	feature vector and CTKG slice
$x_{t} \in R^{d_{x}}$	Compact telemetry features at t	real vector
$G_{t} = (V_{t}, E_{t}, R)$	Two-hop CTKG slice at t	directed multi-relational graph
$a_{t}$	Containment decision at t	discrete action
$ℓ_{t} \in {0, 1}^{K}$	Logging mask at t over K sources	binary vector
$C_{t} = \sum_{k = 1}^{K} ℓ_{t, k} c_{k}$	Sensing cost at t	cost units
$c_{k}$	Unit cost of source k	cost units
$B_{avg}$	Average cost budget	cost units
$B_{lat}$	Latency budget (p95)	seconds
T	Episode length	steps
E	Number of episodes	count
Learning and optimization
$π_{θ} (a_{t}, ℓ_{t} ∣ s_{t})$	Policy over actions and logging	distribution
$V_{ϕ} (s_{t})$	State value function	scalar
$θ, ϕ$	Trainable parameters of policy and value	tensors
$r_{t}$	Reward at t	scalar
$r_{t} = r_{detect} - β C_{t}$	Reward with cost penalty	scalar
$β$	Cost trade off coefficient	scalar
$γ$	Discount factor	scalar
$λ$	GAE parameter	scalar
$η$	Learning rate	scalar
$τ$	Temperature in gating head	scalar
$α$	Sparsity penalty for logging head	scalar
CTKG and encoders
$G = (V, E, R)$	Global cyber threat KG	directed multi-relational graph
r	CTKG hop radius for slice $G_{t}$	hops
$L_{g}$	Graph encoder layers	count
H	Attention heads	count
d	Hidden dimension in encoders	count
$h_{v}$	Node embedding for entity v	vector
$CrossAttn (\cdot)$	Cross attention fusion module	function
$σ (\cdot)$	Logistic function	function
$softmax (\cdot)$	Normalized exponential	function
⊙	Elementwise product	operator
${∥ \cdot ∥}_{2}$	Euclidean norm	operator

Metrics and operating points
AUROC@B	AUROC under budget constraints	$[0, 1]$
AUPRC@B	AUPRC under budget constraints	$[0, 1]$
TPR, FPR	True and false positive rates	$[0, 1]$
TPR@1% FPR	TPR at 1% FPR	$[0, 1]$
TTD, TTD@B	Time to detect (first correct alert)	steps
ECE	Expected Calibration Error	$[0, 1]$
Brier	Brier score	$[0, 2]$
p95	95th percentile (latency)	seconds
Scenario and provenance
$M$	Set of held out motifs for zero day tests	set
$κ$	Scenario card parameters distribution	distribution
$1 [\cdot]$	Indicator function	operator
$Sig (\cdot)$	Provenance signature function	function

Appendix A. Hyperparameters and Computational Overhead

Table A1. Parameters with the settings.

Component	Setting
Reinforcement Learning (PPO)
Training Steps	1.2 M environment steps
Optimizer	AdamW $(β_{1} = 0.9, β_{2} = 0.999)$
Learning Rate	$3 \times 10^{- 4}$ (cosine decay, 5 k warmup)
Batch Size	4096 transitions
Minibatch Size	512
PPO Clip Ratio	0.20
GAE Parameter $λ$	0.95
Discount Factor $γ$	0.99
Entropy Coefficient	0.01
Value Loss Coefficient	0.50
Gradient Norm Clip	0.5
Epochs per PPO Update	10
Seeds Evaluated	10
CTKG and Causal Engine
CTKG Slice Radius r	2 hops
Max Nodes in Slice	70 nodes
Node Embedding Dimension	256
Relation Types	ATT&CK prerequisite, effect, transition edges
CTKG Update Frequency	Every simulation step
SCM Noise Variables $ξ_{k}$	Independent Gaussian
Causal Function Approximation	MLP (2 layers, 128 units)
Cross-Attention Fusion Module
Attention Heads	4
Cross-Attention Layers	2
Hidden Dimension	256
Dropout	0.10
Fusion Operator	Residual MLP + LayerNorm
Budget Constraints (Dual Variables)
Average-Cost Budget $B_{avg}$	1.3 units
Latency Budget $B_{lat}$	3.0 s (p95)
Dual Step Size $ρ$	$5 \times 10^{- 4}$
Penalty Terms	$- η_{c} C (δ_{t}) - η_{ℓ} I [lat > B_{lat}]$
Initialization of $η_{c}, η_{ℓ}$	0
Projection Domain	$η_{c}, η_{ℓ} \geq 0$
Simulator and Telemetry Settings
Telemetry Types	Process, network flows, file events, authentication logs
Average Telemetry Cost Model	Source-dependent CPU/byte weights
Latency Model	Empirical distribution per telemetry source
Scenario Cards	38 parameterized templates, motif-holdout split
Attack Chains Sampled	5000 training/2000 test
Hardware and Precision
GPU Used	NVIDIA RTX A6000
Precision	BF16 training, FP16 inference
Total Trainable Parameters	7.8 M
Training Time per Seed	∼4 h

References

MITRE ATT&CK. Available online: https://attack.mitre.org/ (accessed on 8 October 2025).
Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. In Proceedings of the 4th International Conference on Information Systems Security and Privacy (ICISSP 2018); (CIC-IDS2017); SCITEPRESS—Science and Technology Publications: Setúbal, Portugal, 2018; Volume 1; pp. 108–116. [Google Scholar]
García, S.; Grill, M.; Stiborek, J.; Zunino, A. An Empirical Comparison of Botnet Detection Methods. Comput. Secur. 2014, 45, 100–123. [Google Scholar] [CrossRef]
Anjum, M.M.; Iqbal, S.; Hamelin, B. Analyzing the Usefulness of the DARPA OpTC Dataset in Cyber Threat Detection Research. In SACMAT ’21: Proceedings of the 26th ACM Symposium on Access Control Models and Technologies; Association for Computing Machinery: New York, NY, USA, 2021; pp. 27–32. [Google Scholar]
Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv 2016, arXiv:1606.01540. [Google Scholar] [CrossRef]
Red Canary. Atomic Red Team—Adversary Emulation Tests. Available online: https://atomicredteam.io/ (accessed on 7 November 2025).
MITRE. CALDERA Adversary Emulation Platform. Available online: https://caldera.mitre.org/ (accessed on 10 July 2025).
Arp, D.; Quiring, E.; Pendlebury, F.; Warnecke, A.; Pierazzi, F.; Wressnegger, C.; Cavallaro, L.; Rieck, K. Dos and Don’ts of Machine Learning in Computer Security. In Proceedings of the USENIX Security Symposium, Boston, MA, USA, 10–12 August 2022; pp. 3971–3988. [Google Scholar]
Alexander, O.; Belani, R. Attack Flow: Modeling the Adversary; MITRE Technical Report; The MITRE Corporation: McLean, VA, USA, 2023; Available online: https://github.com/center-for-threat-informed-defense/attack-flow (accessed on 20 September 2025).
CVE Program. Common Vulnerabilities and Exposures (CVE). Available online: https://www.cve.org/ (accessed on 1 October 2025).
NIST. Common Platform Enumeration (CPE). Available online: https://nvd.nist.gov/products/cpe (accessed on 7 November 2025).
Carrara, N.; Leurent, E.; Laroche, R.; Urvoy, T.; Maillard, O.-A.; Pietquin, O. Budgeted Reinforcement Learning in Continuous State Space. In Proceedings of the NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. pp. 1–11. [Google Scholar]
Coalition for Content Provenance and Authenticity (C2PA). C2PA Technical Specification. Available online: https://c2pa.org/specifications (accessed on 7 November 2025).
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
Perez, R.; Thomas, J.; Lee, O. Mordor—Pre-Recorded Security Telemetry for Detection Research. Available online: https://www.deepwatch.com/glossary/security-telemetry/ (accessed on 17 November 2025).
Sinaga, K.P.; Nair, A.S. Calibration Meets Reality: Making Machine Learning Predictions Trustworthy. arXiv 2025, arXiv:2509.23665. [Google Scholar] [CrossRef]
Altman, E. Constrained Markov Decision Processes; Chapman & Hall/CRC: Boca Raton, FL, USA, 1999. [Google Scholar]
Pearl, J. Causality: Models, Reasoning, and Inference, 2nd ed.; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
Peters, J.; Janzing, D.; Schölkopf, B. Elements of Causal Inference; MIT Press: Cambridge, MA, USA, 2017. [Google Scholar]
NIST. FIPS PUB 180-4: Secure Hash Standard (SHS). 2015. Available online: https://csrc.nist.gov/publications/detail/fips/180/4/final (accessed on 2 July 2025).
Josefsson, S.; Liusvaara, I. Edwards-Curve Digital Signature Algorithm (EdDSA). RFC 8032. 2017, pp. 1–60. Available online: https://datatracker.ietf.org/doc/html/rfc8032 (accessed on 1 October 2025).
Jones, M.; Bradley, J.; Sakimura, N. JSON Web Signature (JWS). RFC 7515. 2015. Available online: https://www.rfc-editor.org/rfc/rfc7515.html (accessed on 1 October 2025).
in-toto. Supply Chain Security Framework. Available online: https://in-toto.io/specs/ (accessed on 7 November 2025).
Cappos, J.; Samuel, J.; Baker, S.; Hartman, J.H. A Look in the Mirror: Attacks on Package Managers. In CCS ’08: Proceedings of the 15th ACM Conference on Computer and Communications Security; Association for Computing Machinery: New York, NY, USA, 2008; pp. 565–574. [Google Scholar]
Ptacek, T.; Newsham, T. Insertion, Evasion, and Denial of Service: Eluding Network Intrusion Detection. 1998. Available online: https://www.academia.edu/37052943/Insertion_Evasion_and_Denial_of_Service_Eluding_Network_Intrusion_Detection (accessed on 7 November 2025).
Microsoft. Sysmon—System Monitor. Available online: https://learn.microsoft.com/en-us/windows/win32/sysmon/system-monitor-portal (accessed on 7 November 2025).
Microsoft. Windows Event Logging. Available online: https://learn.microsoft.com/windows/win32/eventlog/event-logging (accessed on 7 November 2025).
Linux Audit Project. auditd(8)—Linux Audit Daemon. Available online: https://linux.die.net/man/8/auditd (accessed on 7 November 2025).
Splunk Threat Research. Splunk Attack Range. Available online: https://github.com/splunk/attack_range (accessed on 7 November 2025).
Mordor Datasets. Atomic Red Team Telemetry Collections. Available online: https://www.atomicredteam.io/ (accessed on 7 November 2025).
SigmaHQ. Sigma: Generic Signature Format for SIEM Systems. Available online: https://sigmahq.github.io/ (accessed on 7 November 2025).
Vizcarra, J.; Gempei, Y.; Wang, Y.; Isohara, T.; Kurokawa, M. Constructing Cybersecurity Knowledge Graphs for Hybrid LLM–Graph Reasoning on Vulnerabilities. In Proceedings of the 24th International Semantic Web Conference, Nara, Japan, 2–6 November 2025; pp. 1–5. [Google Scholar]
Zhou, Z.; Xu, K. Knowledge Graph Driven Cybersecurity: A Survey. IEEE Commun. Surv. Tutor. 2024, 26, 1–36. [Google Scholar]
Kiesling, E.; Ekelhart, A.; Kurniawan, K.; Ekaputra, F. The SEPSES Knowledge Graph: An Integrated Resource for Cybersecurity. In Proceedings of the 18th International Semantic Web Conference, Auckland, New Zealand, 26–30 October 2019; pp. 198–214. [Google Scholar]
D’Amico, A.; Whitley, K. The Real Work of Computer Network Defense Analysts. In VizSEC 2007; Springer: Berlin/Heidelberg, Germany, 2008; pp. 19–37. [Google Scholar]
Werlinger, R.; Hawkey, K.; Beznosov, K. An Integrated View of Human, Organizational, and Technological Challenges of IT Security Management. Inf. Manage. Comput. Secur. 2009, 17, 4–19. [Google Scholar] [CrossRef]
Buczak, A.L.; Guven, E. A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection. IEEE Commun. Surv. Tutor. 2016, 18, 1153–1176. [Google Scholar] [CrossRef]
Sommer, R.; Paxson, V. Outside the Closed World: On Using Machine Learning for Network Intrusion Detection. In Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, USA, 16–19 May 2010; pp. 305–316. [Google Scholar]
Wang, J.; Chen, Z.; Yu, X.; Li, D.; Ni, J.; Tang, L.; Gui, J.; Li, Z.; Chen, H.; Yu, P.S. Heterogeneous Graph Matching Networks for Unknown Malware Detection. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 3762–3769. [Google Scholar]
Tama, B.A.; Comuzzi, M. Ensemble Learning for Intrusion Detection Systems: A Review. IEEE Access 2019, 7, 337–349. [Google Scholar]
Golovin, D.; Krause, A. Adaptive Submodularity: Theory and Applications in Active Learning and Stochastic Optimization. J. Artif. Intell. Res. 2011, 42, 427–486. [Google Scholar]
Dulac-Arnold, G.; Evans, R.; van Hasselt, H.; Sunehag, P.; Lillicrap, T.; Hunt, J.; Mann, T.; Weber, T.; Degris, T.; Coppin, B. Deep Reinforcement Learning in Large Discrete Action Spaces. arXiv 2015, arXiv:1512.07679. [Google Scholar]
Elastic Security. Mapping Detections to MITRE ATT&CK. Available online: https://www.elastic.co/guide/en/security/8.19/rules-coverage.html (accessed on 7 November 2025).
Microsoft Security. Map Security Operations to MITRE ATT&CK. Available online: https://learn.microsoft.com/en-us/security/operations/ (accessed on 7 November 2025).
Sikos, L.F. (Ed.) AI in Cybersecurity; Springer: Cham, Switzerland, 2018. [Google Scholar] [CrossRef]
Barnum, S.; Sethi, A. The Cyber Threat Intelligence (CTI) Technical Landscape; MITRE Technical Report; MITRE: McLean, VA, USA, 2014. [Google Scholar]
Husák, M.; Komárková, J.; Bou-Harb, E.; Čeleda, P. Survey of attack projection, prediction, and forecasting in cyber security. IEEE Commun. Surv. Tutor. 2019, 21, 640–660. [Google Scholar] [CrossRef]
Tammoury, M. Post-Incident Forensics with Cybersecurity Knowledge Graphs. Digit. Investig. 2022, 41, 301–312. [Google Scholar]
Stampa, G.; Arias, M.; Sanchez-Charles, D.; Muntes-Mulero, V.; Cabellos, A. A Deep-Reinforcement Learning Approach for Software-Defined Networking Routing Optimization. arXiv 2017, arXiv:1709.07080. [Google Scholar]
Mao, H.; Alizadeh, M.; Menache, I.; Kandula, S. Resource Management with Deep Reinforcement Learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, Atlanta, GA, USA, 9–10 November 2016; pp. 50–56. [Google Scholar]
Sun, R.; Zhu, Y.; Fei, J.; Chen, X. A Survey on Moving Target Defense: Intelligently Changing Attack Surfaces. Appl. Sci. 2023, 13, 5367. [Google Scholar] [CrossRef]
Yao, Q.; Wang, Y.; Xiong, X.; Wang, P.; Li, Y. Adversarial Decision-Making for Moving Target Defense: A Multi-Agent Markov Game and Reinforcement Learning Approach. Entropy 2023, 25, 605. [Google Scholar] [CrossRef] [PubMed]
Klein, T.; Romano, G. Optimizing Cybersecurity Incident Response via Adaptive Reinforcement Learning. J. Adv. Eng. Technol. 2025, 2, 7–17. [Google Scholar] [CrossRef]
Sambasivan, R.R.; Fonseca, R.; Shafer, I.; Ganger, G.R. So, You Want to Trace Your Distributed System? Key Design Insights from Years of Practical Experience; Technical Report CMU-PDL-14-102; Carnegie Mellon University: Pittsburgh, PA, USA, 2014. [Google Scholar]
Red Teaming Research. CybORG—Cyber Operations Research Gym. Available online: https://github.com/cyborg/cyborg (accessed on 7 November 2025).
Microsoft. CyberBattleSim—A Cybersecurity Gym for RL. Available online: https://github.com/microsoft/CyberBattleSim (accessed on 7 November 2025).
Elastic. Elastic Common Schema (ECS). Available online: https://www.elastic.co/guide/en/ecs/current/index.html (accessed on 7 November 2025).
Jiang, D.; Wang, H.; Lu, Y. Mastering the Complex Assembly Task With a Dual-Arm Robot: A Novel Reinforcement Learning Method. IEEE Robot. Autom. Mag. 2023, 30, 57–66. [Google Scholar] [CrossRef]
Kober, J.; Bagnell, J.A.; Peters, J. Reinforcement Learning in Robotics: Applications and Real-World Challenges. Int. J. Robot. Res. 2013, 32, 1238–1274. [Google Scholar] [CrossRef]
Splunk. Common Information Model (CIM). Available online: https://docs.splunk.com/Documentation/CIM/latest/User/Overview (accessed on 7 November 2025).
Snort Project. Snort—Network Intrusion Detection and Prevention. Available online: https://www.snort.org/ (accessed on 7 November 2025).
VirusTotal. YARA—The Pattern Matching Swiss Knife for Malware Researchers. Available online: https://yara.readthedocs.io/ (accessed on 7 November 2025).
Wachter, S.; Mittelstadt, B.; Russell, C. Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR. Harv. J. Law Technol. 2018, 31, 841–887. [Google Scholar] [CrossRef]
Karimi, A.-H.; Barthe, G.; Balle, B.; Valera, I. Model-Agnostic Counterfactual Explanations for Consequential Decisions. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, Virtually, 26–28 August 2020; pp. 895–905. [Google Scholar]
Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st International Conference on Neural Information Processing System, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016. [Google Scholar]
Sigelman, B.H.; Barroso, L.A.; Burrows, M.; Stephenson, P.; Plakal, M.; Beaver, D.; Jaspan, S.; Shanbhag, C. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure; Google Technical Report dapper-2010-1; Google, Inc.: Mountain View, CA, USA, 2010. [Google Scholar]
OpenTelemetry. Sampling Concepts and Guidance. Available online: https://opentelemetry.io/docs/concepts/sampling/ (accessed on 7 November 2025).
Drewek-Ossowicka, A.; Pietrołaj, M.; Rumiński, J. A Survey of Neural Networks Usage for Intrusion Detection Systems. J. Ambient. Intell. Humaniz. Comput. 2021, 12, 497–514. [Google Scholar] [CrossRef]
Abdelhamid, S.; Aref, M.; Hegazy, I.; Roushdy, M. A Survey on Learning-BasedIntrusion Detection Systems for IoT Networks. In Proceedings of the 2021 Tenth International Conference on Intelligent Computing and Information Systems (ICICIS), Cairo, Egypt, 5–7 December 2021. [Google Scholar]
Zeek Project. The Book of Zeek—Log Files (conn.log, dns.log, http.log, etc.). Available online: https://docs.zeek.org/en/current/script-reference/log-files.html (accessed on 1 November 2025).
Open Information Security Foundation. Suricata EVE JSON Output. Available online: https://docs.suricata.io/en/latest/output/eve/eve-json-output.html (accessed on 10 January 2025).
Beyer, B.; Jones, C.; Petoff, J.; Murphy, N.R. Site Reliability Engineering: How Google Runs Production Systems; O’Reilly: Sebastopol, CA, USA, 2016. [Google Scholar]

Figure 1. Expanded Causal Engine in CTKG (Module C): Structural Modeling and Counterfactuals.

Figure 2. Overview of Sim-CTKG network architecture.

Figure 3. Budget policy analysis. (a) Source Activation Rate Heatmap. (b) Data Volume vs. Budget analysis. (c) Latency Budget Compliance. (d) Source Importance Ranking.

Figure 4. Zero-day motif generalization performance. (a) Zero-Day AUROC Performance. (b) Zero-Day TTD Performance.

Figure 5. Zero-day generalization: rewards vs. episodes.

Figure 6. Causal analysis. (a) Causal leverage analysis (b) Explanation robustness under shift.

Figure 7. TPR vs. average cost. Higher is better.

Figure 8. Pareto front: TTD vs. average cost. Lower is better.

Figure 9. Training stability plot.

Figure 10. Cost/Latency constraint convergence analysis.

Figure 11. Core component and budget analysis. (a) AUROC performance. (b) Detection time performance.

Figure 12. Attacker vs. Defender dynamics across attack scenarios.

Table 1. Comparison across research lines. Y indicates the property is present. P indicates partial support.

Line of Work	Knowledge in State	Budgeted Sensing	Telemetry Aligned
Static ML/DL classifiers	P	N	P
Graph based NIDS	P	N	P
CTI/knowledge graph tools	Y (offline)	N	P
RL for cyber defense	N	P	P
Cyber ranges and simulators	N	N	P
Explainable IDS	P	N	P
This work	Y (online)	Y	Y

Table 2. Capability taxonomy. Ticks indicate the capability is natively supported for that line of work.

Domain	CTKG State	Causal	Multi-Hop	Budget	Latency	Active Log	Zeek	ms/Step	Overall
Static ML/DL classifiers	✗	✗	✗	✗	✗	✗	✓	✓	✗
Graph based NIDS	✗	✗	✗	✗	✗	✗	✓	✗	✗
CTI/knowledge graph tools	✗	✗	✓	✗	✗	✗	✗	✗	✗
RL for cyber defense (generic)	✗	✗	✗	✗	✗	✗	✗	✓	✗
Cyber ranges/simulators	✗	✗	✗	✗	✗	✗	✗	✗	✗
Explainable IDS	✗	✗	✗	✗	✗	✗	✓	✓	✗
This work (Sim CTKG)	✓	✓	✓	✓	✓	✓	✓	✓	✓

Table 3. Evaluation and transparency taxonomy.

Domain	Zero-Day	LowFPR	TTDoptimize	ECE	Counterfactual	Provenance	Artifacts	Overall
Static ML/DL classifiers	✗	✓	✗	✗	✗	✗	✗	✗
Graph based NIDS	✗	✓	✗	✗	✗	✗	✗	✗
CTI/knowledge graph tools	✗	✗	✗	✗	✓	✗	✓	✗
RL for cyber defense (generic)	✗	✗	✓	✗	✗	✗	✓	✗
Cyber ranges/simulators	✓	✗	✗	✗	✗	✗	✓	✗
Explainable IDS	✗	✓	✗	✓	✓	✗	✓	✗
This work (Sim CTKG)	✓	✓	✓	✓	✓	✓	✓	✓

Table 4. Sim-CTKG v1.0 split summary and basic episode statistics.

Split	# Cards	# Episodes	Median Steps/Ep	Attack% (Ep.)
Train	12	2400	520	12.1
Validation	12	720	510	10.7
Test	12	2400	525	11.4

Table 5. Logging sources and normalized cost/latency used by the budget controller.

Source	Cost Unit $c_{k}$	p95 Latency [ms]
conn	0.10	0.8
dns	0.12	0.9
http	0.18	1.1
ssl	0.22	1.4
files	0.28	2.2
notice	0.06	0.4
proc	0.30	2.8
image	0.36	3.6
svc/reg	0.26	2.4
task	0.14	1.1
socket	0.11	0.9
All active (sum)	2.13	bounded by pipeline

Table 6. Overview of scenario cards. Dominant techniques are ATT&CK IDs used to define chains.

Card	Stage	Dominant Techniques (Examples)
E1	Execution	T1059 (command), T1204 (user exec)
E2	Execution	T1569 (service), T1106 (native API)
E3	Execution	T1218 (signed binary proxy)
C1	C2	T1105 (ingress tool), T1071.001 (web)
C2	C2	T1071.004 (DNS), T1573 (encrypted)
C3	C2	T1095 (non-app protocol)
L1	Lateral	T1021.001 (SMB/Windows Admin Shares)
L2	Lateral	T1021.004 (SSH), T1133 (external remote)
L3	Lateral	T1047 (WMI), T1570 (Lateral Tool Transfer)
X1	Exfiltration	T1041 (exfiltration over C2)
X2	Exfiltration	T1048.003 (exfil over alt protocol: SMTP)
X3	Exfiltration	T1567.002 (exfil to cloud storage)

Table 7. CTKG composition. Counts refer to the exported subset used in this study.

Entity or Relation Type	Count
Tactics (ATT&CK)	14
Techniques (coarse)	180
Sub-techniques	360
Software/Tools	1200
CVE identifiers (linked)	950
CAPEC patterns	400
Vendors/Platforms (CPE families)	136
Prereq → Effect technique edges	4800
Technique ↔ Software edges	6500
Technique ↔ CVE edges	5900
Technique ↔ CAPEC edges	3700
Tactic → Technique edges	540
Total nodes	≈3240
Total edges	≈21,440

Table 8. Summary view across dataset cards.

Card	Modality	Primary Use	Zero-Day Holds
Sim-CTKG v1.0	Zeek logs + CTKG slices	Train, val, test	Yes (motif level)
CTKG v1.0	Multi-rel graph	Online state, explainability	n/a
DARPA TCAD	Derived stats	Alignment checks	n/a

Table 9. Detection under cost and latency budgets.

Method	Budget $B_{avg}$	AUROC@B	AUPRC@B	TTD@B
Sim-CTKG (ours)	0.8	$0.915 \pm 0.011$	$0.882 \pm 0.014$	$23.5 \pm 2.0$
	1.3	$0.944 \pm 0.010$	$0.915 \pm 0.012$	$18.2 \pm 1.8$
	2.0	$0.958 \pm 0.008$	$0.931 \pm 0.010$	$15.4 \pm 1.5$
Flat-RL	0.8	$0.812 \pm 0.017$	$0.746 \pm 0.019$	$35.6 \pm 2.9$
	1.3	$0.846 \pm 0.015$	$0.781 \pm 0.017$	$31.3 \pm 2.7$
	2.0	$0.858 \pm 0.014$	$0.794 \pm 0.016$	$29.5 \pm 2.6$
KG-noCausal	0.8	$0.863 \pm 0.015$	$0.806 \pm 0.017$	$30.2 \pm 2.5$
	1.3	$0.886 \pm 0.014$	$0.829 \pm 0.016$	$27.0 \pm 2.3$
	2.0	$0.897 \pm 0.013$	$0.840 \pm 0.015$	$25.6 \pm 2.1$
Static-Full	0.8	violates average budget
	1.3	violates average budget
	2.0	$0.901 \pm 0.012$	$0.843 \pm 0.014$	$26.4 \pm 2.1$
Static-Min	0.8	$0.733 \pm 0.020$	$0.654 \pm 0.022$	$44.0 \pm 3.5$
	1.3	$0.742 \pm 0.019$	$0.666 \pm 0.021$	$42.7 \pm 3.2$
	2.0	$0.748 \pm 0.019$	$0.671 \pm 0.020$	$41.8 \pm 3.1$

Table 10. Operating characteristics at

B_{avg} = 1.3

,

B_{lat} = 3.0

s. TPR at fixed FPR, precision at fixed recall, ECE.

Table 10. Operating characteristics at

B_{avg} = 1.3

,

B_{lat} = 3.0

s. TPR at fixed FPR, precision at fixed recall, ECE.

Method	TPR@FPR = 1%	TPR@FPR = 2%	Prec@Rec = 0.9	TTD p50	ECE
Sim-CTKG (ours)	$0.67 \pm 0.03$	$0.74 \pm 0.03$	$0.88 \pm 0.02$	16	$0.028 \pm 0.006$
Flat-RL	$0.47 \pm 0.04$	$0.55 \pm 0.04$	$0.73 \pm 0.03$	29	$0.071 \pm 0.010$
KG-noCausal	$0.54 \pm 0.04$	$0.62 \pm 0.04$	$0.79 \pm 0.03$	25	$0.053 \pm 0.009$

Table 11. Zero-day evaluation on held motifs at

B_{avg} = 1.3

. AUROC and TTD per card with

R = 200

episodes each.

Table 11. Zero-day evaluation on held motifs at

B_{avg} = 1.3

. AUROC and TTD per card with

R = 200

episodes each.

Method	Card–A AUROC/TTD	Card–B AUROC/TTD	Card–C AUROC/TTD
Sim-CTKG (ours)	$0.942 \pm 0.012$ / $18.9 \pm 2.2$	$0.948 \pm 0.011$ / $17.6 \pm 2.0$	$0.940 \pm 0.013$ / $18.1 \pm 2.1$
Flat-RL	$0.842 \pm 0.016$ / $31.8 \pm 2.8$	$0.853 \pm 0.016$ / $30.7 \pm 2.6$	$0.845 \pm 0.017$ / $31.4 \pm 2.7$
KG-noCausal	$0.884 \pm 0.014$ / $27.6 \pm 2.5$	$0.889 \pm 0.014$ / $26.8 \pm 2.3$	$0.882 \pm 0.015$ / $27.0 \pm 2.4$

Table 12. Logging behavior by budget. Activation rate per source (fraction of steps where

a_{t, i}^{\log} = 1

), average bytes per episode, and p95 latency.

Table 12. Logging behavior by budget. Activation rate per source (fraction of steps where

a_{t, i}^{\log} = 1

), average bytes per episode, and p95 latency.

Source	Metric	$B_{avg} = 0.8$	$B_{avg} = 1.3$	$B_{avg} = 2.0$
conn	act. rate	$0.62 \pm 0.06$	$0.78 \pm 0.05$	$0.91 \pm 0.04$
dns	act. rate	$0.58 \pm 0.07$	$0.74 \pm 0.06$	$0.88 \pm 0.04$
http	act. rate	$0.31 \pm 0.07$	$0.52 \pm 0.06$	$0.73 \pm 0.05$
ssl	act. rate	$0.27 \pm 0.06$	$0.45 \pm 0.06$	$0.68 \pm 0.05$
files	act. rate	$0.18 \pm 0.05$	$0.33 \pm 0.05$	$0.51 \pm 0.05$
notice	act. rate	$0.12 \pm 0.04$	$0.22 \pm 0.04$	$0.36 \pm 0.04$
proc	act. rate	$0.24 \pm 0.05$	$0.41 \pm 0.05$	$0.63 \pm 0.05$
image	act. rate	$0.16 \pm 0.04$	$0.28 \pm 0.04$	$0.46 \pm 0.04$
svc/reg	act. rate	$0.11 \pm 0.03$	$0.21 \pm 0.04$	$0.35 \pm 0.04$
task	act. rate	$0.09 \pm 0.03$	$0.16 \pm 0.03$	$0.27 \pm 0.03$
socket	act. rate	$0.20 \pm 0.05$	$0.36 \pm 0.05$	$0.57 \pm 0.05$
All sources	avg. MB/episode	$38.2 \pm 3.6$	$64.7 \pm 4.8$	$92.4 \pm 6.1$
All sources	p95 latency [s]	$2.4 \pm 0.2$	$2.6 \pm 0.2$	$2.8 \pm 0.2$

Table 13. Ablations at

B_{avg} = 1.3

. Removing any pillar degrades detection or earliness.

Table 13. Ablations at

B_{avg} = 1.3

. Removing any pillar degrades detection or earliness.

Variant	AUROC@B	AUPRC@B	TTD@B	TPR@FPR = 1%
Full model	$0.944 \pm 0.010$	$0.915 \pm 0.012$	$18.2 \pm 1.8$	$0.67 \pm 0.03$
– Generator constraints	$0.909 \pm 0.013$	$0.875 \pm 0.014$	$24.8 \pm 2.2$	$0.58 \pm 0.03$
– Sensor-budget head	$0.899 \pm 0.014$	$0.863 \pm 0.015$	$26.2 \pm 2.3$	$0.56 \pm 0.04$
– Causal CTKG	$0.905 \pm 0.013$	$0.871 \pm 0.014$	$25.3 \pm 2.2$	$0.57 \pm 0.04$

Table 14. Robustness evaluation at

B_{avg} = 1.3

.

Table 14. Robustness evaluation at

B_{avg} = 1.3

.

Condition	AUROC@B	AUPRC@B	TTD@B	p95 Latency [s]
Base	$0.944 \pm 0.010$	$0.915 \pm 0.012$	$18.2 \pm 1.8$	$2.6 \pm 0.2$
Background $+ 30 %$	$0.936 \pm 0.011$	$0.904 \pm 0.013$	$19.1 \pm 1.9$	$2.6 \pm 0.2$
Missing $20 %$ events	$0.927 \pm 0.012$	$0.896 \pm 0.013$	$20.4 \pm 2.0$	$2.6 \pm 0.2$
Clock skew $+ 250$ ms	$0.939 \pm 0.011$	$0.907 \pm 0.012$	$19.0 \pm 1.9$	$2.7 \pm 0.2$
Latency spike p95 + 0.5 s	$0.938 \pm 0.011$	$0.905 \pm 0.013$	$19.6 \pm 1.9$	$3.1 \pm 0.2$

Table 15. Counterfactual Preventability

\hat{CP} (S)

at

B_{avg} = 1.3

for common mid-chain levers. Higher is better.

Table 15. Counterfactual Preventability

\hat{CP} (S)

at

B_{avg} = 1.3

for common mid-chain levers. Higher is better.

Technique Set S	Description	$\hat{CP} (S)$
${T 1105}$	C2 channel establishment	$0.121 \pm 0.017$
${T 1021}$	Lateral remote services	$0.087 \pm 0.015$
${T 1059, T 1105}$	Exec → C2 handoff	$0.159 \pm 0.019$
${T 1003}$	Credential dumping	$0.102 \pm 0.016$

Table 16. Explanation Stability

XS

(Jaccard overlap of important CTKG subgraphs) at matched operating points.

Table 16. Explanation Stability

XS

(Jaccard overlap of important CTKG subgraphs) at matched operating points.

Condition	Within-Card Resample	Across Held-Out Motifs
Sim-CTKG (ours)	$0.71 \pm 0.04$	$0.66 \pm 0.05$
KG-noCausal	$0.49 \pm 0.05$	$0.41 \pm 0.06$
Flat-RL	$0.32 \pm 0.06$	$0.28 \pm 0.06$

Table 17. Per-step runtime breakdown (mean ± std) at

B_{avg} = 1.3

over all cards.

Table 17. Per-step runtime breakdown (mean ± std) at

B_{avg} = 1.3

over all cards.

Component	Latency [ms]	Notes
Featurizer	$0.7 \pm 0.1$	rolling sketches and aggregates
Graph encoder	$2.3 \pm 0.3$	$L_{g} = 3$ , $H = 4$ , $d_{hid} = 128$
Policy heads	$0.6 \pm 0.1$	fused MLP projections
Causal update	$0.3 \pm 0.1$	local structural variables
Signing (if emitted)	$0.3 \pm 0.1$	SHA-256 + Ed25519
Total (no artifact)	$3.9 \pm 0.4$	end-to-end step time
Total (with artifact)	$4.2 \pm 0.4$	includes signing

Table 18. Provenance manifest size and verification latency.

Artifact	Manifest Size [kB]	Sign Latency [ms]	Verify Latency [ms]
Alert only	$6.2 \pm 0.5$	$0.28 \pm 0.03$	$0.41 \pm 0.05$
Alert + explanation vector	$9.8 \pm 0.6$	$0.31 \pm 0.04$	$0.46 \pm 0.06$
Alert + highlighted subgraph	$12.4 \pm 0.8$	$0.34 \pm 0.05$	$0.52 \pm 0.06$

Table 19. Stage-wise median TTD (steps) at

B_{avg} = 1.3

. Lower is better.

Table 19. Stage-wise median TTD (steps) at

B_{avg} = 1.3

. Lower is better.

Method	Exec (T1059)	C2 (T1105)	Lateral (T1021)	Exfil (T1041)
Sim-CTKG (ours)	$12 [10, 14]$	$14 [12, 16]$	$17 [15, 20]$	$19 [17, 21]$
KG-noCausal	$16 [14, 18]$	$19 [17, 21]$	$23 [21, 26]$	$26 [24, 29]$
Flat-RL	$19 [17, 22]$	$22 [20, 25]$	$27 [24, 30]$	$31 [28, 34]$

Table 20. Performance comparison across all baseline methods.

Method	AUROC@B	TTD (Steps)	ECE	Budget Adherence (%)
Fixed-Source (FS)	$0.842 \pm 0.006$	$11.3 \pm 0.8$	$0.091 \pm 0.012$	$12.4$
Random-Sampling (RS)	$0.731 \pm 0.010$	$15.8 \pm 1.7$	$0.142 \pm 0.015$	$48.2$
Static Budgeted (SB)	$0.801 \pm 0.008$	$13.9 \pm 1.1$	$0.104 \pm 0.011$	$100.0$
CTI-only (CTKG-Struct)	$0.868 \pm 0.009$	$10.4 \pm 0.6$	$0.076 \pm 0.009$	$100.0$
RL w/o CTKG	$0.882 \pm 0.007$	$9.7 \pm 0.6$	$0.062 \pm 0.010$	$100.0$
RL w/o Cross-Attention	$0.891 \pm 0.006$	$9.4 \pm 0.5$	$0.058 \pm 0.007$	$100.0$
Sim-CTKG (Ours)	$0.934 \pm 0.004$	$7.8 \pm 0.4$	$0.031 \pm 0.006$	$100.0$

Table 21. Calibration metrics at

B_{avg} = 1.3

. Lower is better.

Table 21. Calibration metrics at

B_{avg} = 1.3

. Lower is better.

Method	ECE	Brier	NLL
Sim-CTKG (ours)	$0.028 \pm 0.006$	$0.082 \pm 0.008$	$0.39 \pm 0.04$
KG-noCausal	$0.053 \pm 0.009$	$0.104 \pm 0.010$	$0.51 \pm 0.05$
Flat-RL	$0.071 \pm 0.010$	$0.121 \pm 0.011$	$0.63 \pm 0.06$

Table 22. Seed stability over 10 runs.

Budget	AUROC@B	AUPRC@B	TTD@B	TPR@1% FPR
$B_{avg} = 0.8$	$0.915 \pm 0.008$	$0.882 \pm 0.010$	$23.5 \pm 1.3$	$0.61 \pm 0.02$
$B_{avg} = 1.3$	$0.944 \pm 0.007$	$0.915 \pm 0.009$	$18.2 \pm 1.1$	$0.67 \pm 0.02$
$B_{avg} = 2.0$	$0.958 \pm 0.006$	$0.931 \pm 0.008$	$15.4 \pm 1.0$	$0.72 \pm 0.02$

Table 23. Throughput parity at

B_{avg} = 1.3

. FLOPs are approximate per-step forward ops.

Table 23. Throughput parity at

B_{avg} = 1.3

. FLOPs are approximate per-step forward ops.

Method	Params [M]	FLOPs/Step [M]	Latency/Step [ms]
Sim-CTKG (ours)	7.8	62	$3.9 \pm 0.4$
KG-noCausal	6.2	49	$3.2 \pm 0.3$
Flat-RL	5.1	36	$2.7 \pm 0.3$

Table 24. Pairwise tests at

B_{avg} = 1.3

(stratified by card).

Table 24. Pairwise tests at

B_{avg} = 1.3

(stratified by card).

Comparison	AUROC@B p	TTD@B p	Cohen’s d (AUROC)	Cohen’s d (TTD)
Ours vs. Flat-RL	<0.001	<0.001	1.12	1.28
Ours vs. KG-noCausal	<0.001	<0.001	0.94	1.03
KG-noCausal vs. Flat-RL	0.004	0.006	0.52	0.61

Table 25. Core component ablations at

B_{avg} = 1.3

. AUROC/AUPRC are at budget.

Table 25. Core component ablations at

B_{avg} = 1.3

. AUROC/AUPRC are at budget.

Variant	AUROC@B	AUPRC@B	TTD@B	TPR@1%FPR	ECE
Full model (ours)	$0.944 \pm 0.010$	$0.915 \pm 0.012$	$18.2 \pm 1.8$	$0.67 \pm 0.03$	$0.028 \pm 0.006$
>Generator constraints	$0.909 \pm 0.013$	$0.875 \pm 0.014$	$24.8 \pm 2.2$	$0.58 \pm 0.03$	$0.041 \pm 0.007$
>Sensor-budget head	$0.899 \pm 0.014$	$0.863 \pm 0.015$	$26.2 \pm 2.3$	$0.56 \pm 0.04$	$0.046 \pm 0.008$
>Causal CTKG	$0.905 \pm 0.013$	$0.871 \pm 0.014$	$25.3 \pm 2.2$	$0.57 \pm 0.04$	$0.043 \pm 0.008$
Cross-attn → Concat	$0.914 \pm 0.012$	$0.886 \pm 0.013$	$22.1 \pm 2.0$	$0.61 \pm 0.03$	$0.036 \pm 0.007$
Cross-attn → FiLM	$0.923 \pm 0.011$	$0.896 \pm 0.013$	$20.7 \pm 2.0$	$0.63 \pm 0.03$	$0.033 \pm 0.007$

Table 26. Sensitivity to CTKG scope and encoder depth at

B_{avg} = 1.3

. Latency is per step forward time for the graph encoder.

Table 26. Sensitivity to CTKG scope and encoder depth at

B_{avg} = 1.3

. Latency is per step forward time for the graph encoder.

Config $(r, L_{g}, H)$	AUROC@B	AUPRC@B	TTD@B	Graph Latency [ms]
(1, 2, 2)	$0.926 \pm 0.011$	$0.893 \pm 0.013$	$20.9 \pm 1.9$	$1.4 \pm 0.2$
(2, 3, 4) (default)	$0.944 \pm 0.010$	$0.915 \pm 0.012$	$18.2 \pm 1.8$	$2.3 \pm 0.3$
(3, 4, 6)	$0.949 \pm 0.010$	$0.919 \pm 0.011$	$17.7 \pm 1.8$	$3.5 \pm 0.4$
(4, 4, 8)	$0.951 \pm 0.010$	$0.921 \pm 0.011$	$17.6 \pm 1.8$	$4.8 \pm 0.5$

Table 27. Budget sweep. Mean ± 95% CI across cards.

Method	$B_{avg}$	AUROC@B	TTD@B	TPR@1%FPR
Full model (ours)	0.8	$0.915 \pm 0.011$	$23.5 \pm 2.0$	$0.61 \pm 0.03$
	1.3	$0.944 \pm 0.010$	$18.2 \pm 1.8$	$0.67 \pm 0.03$
	2.0	$0.958 \pm 0.008$	$15.4 \pm 1.5$	$0.72 \pm 0.03$
>Sensor-budget head	0.8	$0.883 \pm 0.014$	$28.9 \pm 2.5$	$0.53 \pm 0.04$
	1.3	$0.899 \pm 0.014$	$26.2 \pm 2.3$	$0.56 \pm 0.04$
	2.0	$0.914 \pm 0.012$	$23.8 \pm 2.2$	$0.60 \pm 0.03$
>Causal CTKG	0.8	$0.894 \pm 0.013$	$27.6 \pm 2.4$	$0.55 \pm 0.04$
	1.3	$0.905 \pm 0.013$	$25.3 \pm 2.2$	$0.57 \pm 0.04$
	2.0	$0.919 \pm 0.012$	$22.9 \pm 2.1$	$0.61 \pm 0.03$

Table 28. Non-RL detectors at

B_{avg} = 1.3

.

Table 28. Non-RL detectors at

B_{avg} = 1.3

.

Method	AUROC@B	AUPRC@B	TTD@B	ECE
Sim-CTKG (ours)	$0.944 \pm 0.010$	$0.915 \pm 0.012$	$18.2 \pm 1.8$	$0.028 \pm 0.006$
T-Transformer ( $x_{t}$ only)	$0.902 \pm 0.012$	$0.855 \pm 0.014$	$24.3 \pm 2.1$	$0.051 \pm 0.008$
TCN-Detector ( $x_{t}$ only)	$0.887 \pm 0.013$	$0.838 \pm 0.015$	$26.0 \pm 2.3$	$0.058 \pm 0.009$
RelGAT-only ( $G_{t}$ only)	$0.868 \pm 0.014$	$0.814 \pm 0.016$	$27.8 \pm 2.4$	$0.063 \pm 0.010$
MoE-Selector + Static heads	$0.891 \pm 0.013$	$0.844 \pm 0.015$	$25.6 \pm 2.2$	$0.055 \pm 0.008$

Table 29. RL baselines at

B_{avg} = 1.3

.

Table 29. RL baselines at

B_{avg} = 1.3

.

Method	AUROC@B	AUPRC@B	TTD@B	TPR@1%FPR
Sim-CTKG (ours)	$0.944 \pm 0.010$	$0.915 \pm 0.012$	$18.2 \pm 1.8$	$0.67 \pm 0.03$
Flat-RL ( $x_{t}$ )	$0.846 \pm 0.015$	$0.781 \pm 0.017$	$31.3 \pm 2.7$	$0.47 \pm 0.04$
KG-noCausal	$0.886 \pm 0.014$	$0.829 \pm 0.016$	$27.0 \pm 2.3$	$0.54 \pm 0.04$
InfoBottleneck-RL	$0.904 \pm 0.013$	$0.851 \pm 0.015$	$24.9 \pm 2.2$	$0.57 \pm 0.04$
Heuristic-RL (selector)	$0.878 \pm 0.014$	$0.822 \pm 0.016$	$28.3 \pm 2.4$	$0.52 \pm 0.04$

Table 30. Robustness at

B_{avg} = 1.3

: relative AUROC drop (negative is worse) and

Δ

TTD in steps.

Table 30. Robustness at

B_{avg} = 1.3

: relative AUROC drop (negative is worse) and

Δ

TTD in steps.

Condition	Ours	KG-noCausal	Flat-RL	T-Transformer
Background $+ 30 %$	$- 0.8 %$ , $+ 0.9$	$- 2.4 %$ , $+ 2.6$	$- 3.6 %$ , $+ 3.8$	$- 3.1 %$ , $+ 3.2$
Missing $20 %$ events	$- 1.8 %$ , $+ 2.2$	$- 4.3 %$ , $+ 3.9$	$- 5.7 %$ , $+ 5.4$	$- 4.9 %$ , $+ 4.8$
Clock skew $+ 250$ ms	$- 0.5 %$ , $+ 0.8$	$- 1.6 %$ , $+ 1.9$	$- 2.1 %$ , $+ 2.3$	$- 1.9 %$ , $+ 2.0$
Latency spike p95 + 0.5 s	$- 0.6 %$ , $+ 1.4$	$- 2.0 %$ , $+ 2.7$	$- 3.0 %$ , $+ 3.5$	$- 2.6 %$ , $+ 3.0$

Table 31. Compute parity at

B_{avg} = 1.3

. FLOPs are approximate forward ops per step.

Table 31. Compute parity at

B_{avg} = 1.3

. FLOPs are approximate forward ops per step.

Method	Params [M]	FLOPs/Step [M]	Latency/Step [ms]
Sim-CTKG (ours)	$7.8$	62	$3.9 \pm 0.4$
KG-noCausal	$6.2$	49	$3.2 \pm 0.3$
T-Transformer	$8.1$	58	$3.5 \pm 0.3$
Flat-RL	$5.1$	36	$2.7 \pm 0.3$
RelGAT-only	$4.6$	31	$2.4 \pm 0.3$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

A Framework for Budget-Constrained Zero-Day Cyber Threat Mitigation: A Knowledge-Guided Reinforcement Learning Approach

Abstract

1. Introduction

2. Background and Threat Model

2.1. ATT&CK Semantics and Entities

2.2. Cyber-Threat Knowledge Graph (CTKG)

2.3. MDP Interface for Cost-Aware Defense

2.4. Telemetry Model and Budgets

2.5. Causal Structure and Counterfactuals

2.6. Provenance and Verifiable Artifacts

2.7. Threat Model and Scope

3. Related Work

3.1. Learning-Based Intrusion Detection

3.2. Knowledge Representations and Cyber Threat Intelligence

3.3. Reinforcement Learning for Cyber Defense

3.4. Cyber Ranges and Simulators

3.5. Explainability and Budget-Aware Sensing

3.6. Positioning and Gap

3.7. Summary of Differences

4. Materials and Methods

4.1. ATT&CK-Constrained Scenario Generation

4.2. Witness Telemetry Compiler

4.3. Cost-Aware Reinforcement Learning

4.4. CTKG Construction and Causal Engine

4.5. Policy Architecture

4.6. Provenance and Verifiable Artifacts

4.7. Feature Featurization and Windowing

4.8. Relation-Aware Graph Encoder

4.9. Containment Semantics and Safety Mask

4.10. Cost Profiling and Source Calibration

4.11. Background and Noise Model

4.12. Policy Heads and Loss

4.13. Provenance Keying and Verification

4.14. Scalability and Complexity

4.15. Leakage Controls

5. Evaluation and Results

5.1. Dataset

5.1.1. Sim-CTKG Zeek Telemetry (Primary)

5.1.2. CTKG Snapshot (Knowledge Graph for Decision Context)

5.1.3. DARPA TCAD-Derived Alignment Set (Auxiliary)

5.2. Experimental Configuration

5.3. Primary Detection Under Budgets

5.4. Operating Characteristics at B avg = 1.3

5.5. Zero-Day Motif Generalization

5.6. Logging Policy and Resource Use

5.7. Ablations on Novelty Pillars

5.8. Robustness to Telemetry Shift

5.9. Causal Accountability

5.10. Throughput and Overhead

5.11. Summary

6. Extended Analyses

6.1. Pareto Fronts Under Budget Constraints

6.2. Per-Technique and Stage-Wise Efficacy

6.3. Training Stability and Convergence

6.4. Calibration and Reliability

6.5. Seed Stability

6.6. Computational Parity and Throughput

6.7. Pairwise Significance at the Main Operating Point

6.8. Isolation of Cost, Causality, and Provenance Effects

6.9. Implications for Cyber-Defense System Design

6.10. Practical Deployment Considerations

6.11. Limitations

7. Ablation Study

7.1. Core Components

7.2. Fusion and CTKG Scope Sensitivity

7.3. Budget Sweep and Operating Slices

7.4. Comparisons to Recent Alternatives

7.5. Robustness Under Telemetry Shift

7.6. Compute Parity

7.7. Fairness Protocol and Significance

7.8. Adversarial Scenario

8. Conclusions and Future Work

Author Contributions

Funding

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

Notations

5.4. Operating Characteristics at $B_{avg}$ = 1.3