Using Causality-Driven Graph Representation Learning for APT Attacks Path Identification

Cheng, Xiang; Kuang, Miaomiao; Yang, Hongyu

doi:10.3390/sym17091373


by Xiang Cheng ¹,*, Miaomiao Kuang ¹ and Hongyu Yang ²

¹ School of Information Engineering, Yangzhou University, Yangzhou 225127, China
² School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
* Author to whom correspondence should be addressed.

Symmetry 2025, 17(9), 1373; https://doi.org/10.3390/sym17091373

Submission received: 14 July 2025 / Revised: 16 August 2025 / Accepted: 17 August 2025 / Published: 22 August 2025

(This article belongs to the Special Issue Advanced Studies of Symmetry/Asymmetry in Cybersecurity)


Abstract

In the cybersecurity attack and defense space, the “attacker” and the “defender” form a dynamic and symmetrical adversarial pair whose strategy iterations and capability evolutions have long been locked in a symmetrical game of mutual restraint. On the defender’s side, modern Intrusion Detection Systems (IDSs) are deployed to counter the techniques used in Advanced Persistent Threat (APT) attacks. One major challenge faced by an IDS is identifying complex attack paths in a vast provenance graph. Constructing a graph that tracks attack behavior records the interactions between system entities, but the attacker’s malicious activities are often hidden among a large number of normal system operations. Although traditional methods can identify attack behaviors, they focus only on surface associations between entities and ignore deeper causal relationships, which limits the accuracy and interpretability of detection. Existing graph anomaly detection methods usually assign the same weight to all interactions; in contrast, we propose a Causal Autoencoder for Graph Explanation (CAGE) based on reinforcement learning. The method extracts feature representations from the provenance graph through a graph attention network (GAT), uses Q-learning to dynamically evaluate the causal importance of edges, and highlights key causal paths through a weight layering strategy. Experiments on three selected datasets from the DARPA TC project show that the precision of the method in the anomaly detection task remains above 97% on average, and the recall exceeds 99.5% on every dataset, indicating a very low rate of missed detections.

Keywords: causality; graph autoencoders; APT attack; data provenance

1. Introduction

Advanced Persistent Threat (APT) attacks have become one of the most serious threats in network security because of their concealment, multi-stage complexity, and long-term persistence. The interaction between APT attacks, normal network activities, and defense systems takes symmetry as its underlying logic. Here, symmetry does not mean complete equivalence between elements; rather, it emphasizes the system of correlations that different behaviors, entities, or strategies form through “benchmark reference, feature correspondence and logical adaptation”. From the perspective of offense and defense confrontation, the attack and defense systems form an interactive and mutually restrictive whole. The penetration strategies of attackers (such as social engineering and zero-day vulnerability exploitation) correspond to and interact with the protection plans of defenders (such as abnormal behavior detection and vulnerability intelligence response). Innovation in attack methods drives the upgrade of defense technologies, while improvement of defense mechanisms prompts attackers to adjust their attack paths. The two maintain a dynamic balance in the cycle of “attack strategy iteration and defense capability evolution”. APT attacks are usually carried out by strictly managed teams of skilled hackers whose purpose is espionage, obtaining important data in exchange for large sums of money, and destroying critical information infrastructure. They mainly target governments, large enterprises, and critical infrastructure [1]. There is a symmetric relationship between the protection level of these targets and the intensity of APT attacks: the more important the target’s core information, the tighter its protection system, and when an APT attack selects such a target, the attack intensity adapts symmetrically to the target’s degree of protection. 
Such targets hold sensitive and highly important information; once that information is stolen, leaked, or destroyed, the result is huge losses and turbulence. For example, the Pegasus [2] spyware attack exposed in 2021 used a zero-day vulnerability to compromise tens of thousands of devices around the world and monitor the communications data of government officials, journalists, and other groups. The Solarmarker [3] malware attacks, discovered in 2022, gradually stole sensitive information by masquerading as legitimate software and lurking in the victim’s system for months, fully demonstrating the “low frequency, long cycle” characteristics of APT attacks. In 2023, the MOVEit [4] attack exploited vulnerabilities in supply chain file transfer software and led to thousands of data leaks worldwide. Such long-term latency and the defender’s detection period form a symmetric game in the time dimension. In the SolarWinds incident, more than a year passed between the attacker breaking into the internal network of SolarWinds and the discovery of the incident, and because the software was widely used around the world by many enterprises and government agencies, the impact of the attack was extremely wide. With the characteristics of “low frequency and slow pace” [5], APT attacks are hidden in network activities. Attackers gradually achieve the purpose of stealing sensitive data and controlling critical systems through long-term latency and progressive penetration, taking advantage of the complex relationships between multi-stage attack behaviors. Therefore, one important means of preventing such attacks before they evolve into substantial damage is to explore the complex relationships in network activities and detect whether any attack activity is present in the system. The core lies in identifying the deviation of attack behaviors from normal activities under a symmetrical benchmark. 
Normal activities follow symmetrical associations, such as specific roles corresponding to specific operation permissions, while attack behaviors break this symmetry at key nodes. Because of the long-period characteristics of APT attacks, the attack process often lasts for months or even years, and there are long time intervals between the attack activities recorded in the audit log [6]. This “low frequency and long period” attack rhythm makes it difficult for traditional Intrusion Detection Systems (IDSs) [7,8,9,10] based on behavior analysis to capture the internal correlation of attack behaviors. In recent years, researchers have widely used provenance graph (PG) [11,12,13] technology to deal with this challenge and have verified its effectiveness. By abstracting system entities as nodes, and system activities and timestamps as temporally labeled edges between nodes, the provenance graph directly reflects entity interactions and data flow in network activities in the form of a dynamic graph structure, which supports temporal association analysis of attack paths. Yet provenance graph technology still faces two major challenges in practice. First, existing modeling methods mostly focus on the surface correlation between entities and lack deep mining of the causal logic behind attack behavior. Second, with the expansion of network scale and the increase in attack complexity, the number of nodes and edges in the provenance graph grows exponentially, and the data volume reaches the TB level. As storage overhead and computing power consumption increase rapidly, the symmetrical balance between data size and the analysis capability of graph analysis is broken, making it difficult to meet the needs of large-scale scenarios.

In the context of APT detection, traditional provenance graph analysis methods such as RT-APT [14] identify abnormal interactions by analyzing node degree and edge frequency statistics of large-scale provenance graphs in real time. However, these methods rely on manually set thresholds and have a high miss rate for attack types such as zero-day attacks. The method based on metric learning [15] identifies anomalies by quantifying the similarity of entity behaviors, but it focuses only on surface features and fails to model the causal dependencies between behaviors. This makes it difficult to distinguish a causal chain such as “malicious process to C2 server” from highly similar normal behaviors, and the long-term associations of multi-stage APTs are captured insufficiently. PROGRAPHER [16] converts the provenance graph into low-dimensional vectors through graph embedding and uses a one-class classifier to distinguish normal from abnormal behaviors. However, the embedding process does not consider the causal logic between entities, so the model judges attack patterns that break the causal chain poorly. This type of graph embedding method also faces the “imitation attack” risk pointed out by Goyal et al. [17]: attackers can evade detection by imitating normal topological features. The root cause is that the topological features relied upon lack causal semantic support and cannot distinguish superficial similarity from essential causal relationships. The APTSHIELD [18] detection model, designed on the ATT&CK framework, improves efficiency by pruning redundant data, but its feature engineering relies on known attacks, which limits its ability to generalize to new APTs. 
The online detection system Nodlink [19] supports fine-grained attack path tracking but does not introduce a mechanism for differentiating causal weights. As a result, it cannot distinguish between key nodes (such as vulnerability exploitation steps) and redundant nodes in the attack chain, and it is vulnerable to interference from noisy data, which limits its detection accuracy. Other methods, such as THREATRACE [20], frame host-based threat detection as an anomalous node identification problem within the provenance graph. However, these methods have limited detection accuracy because they insufficiently model the semantic relationships between nodes, which are crucial for capturing subtle attack patterns. GNNs are mature in modeling complex relationships. Their core is to dynamically capture the dependencies between nodes through a neighborhood aggregation mechanism, which enables them to handle data with irregular topology. The method based on behavioral sequences [21] emphasizes that “behavior determines identity” and detects malicious software by correlating entity behaviors. However, it does not quantify the causal strength between behaviors and is insufficient in capturing the long-term correlations of covert APTs (such as multi-stage operations separated by several months), making it vulnerable to attacks with large time spans. TREC [22] proposes a model for recognizing APT techniques and tactics based on few-shot learning over provenance subgraphs. A multi-layer GAT characterizes each subgraph of attack techniques and tactics as a feature vector, and a Siamese neural network then performs few-shot distance metric learning to carry out the classification. 
Masked self-supervised learning has also been proposed [23], which forces the model to learn neighborhood dependencies by randomly masking node attributes, but does not introduce domain causal priors, which may cause the model to misinterpret abnormal causal edges as normal variations. KAIROS [24] addresses this gap by leveraging a GNN encoder–decoder architecture to learn the temporal evolution of the provenance graph structure. KAIROS quantifies the anomaly degree of system events by modeling the dynamics of structural changes over time, which enables real-time reconstruction of attack footprints and significantly improves the temporal sensitivity of detection. MAGIC [25] uses masked graph representation learning with self-supervised mechanism to perform deep feature extraction and structural abstraction of the source graph. This approach effectively captures the implicit semantic relationships and global topological features of nodes, enabling the model to characterize complex attack patterns with higher fidelity. MEGR-APT [26] introduces an innovative approach to transform system logs into Resource Description Framework (RDF)-based provenance graphs stored in a database. Through graph query and GNN-based attack representation learning, efficient subgraph matching is achieved while maintaining detection performance, which significantly reduces memory consumption and is suitable for large-scale datasets. However, its strategy of expanding the subgraph from the IOC node is insufficient for capturing global correlations. Notably, high-level systems such as FLASH [27] have incorporated semantic encoding modules (e.g., a semantic encoder based on word2vec) to capture the semantic properties of entities and the temporal order of events in the provenance graph. 
By integrating them with GNN context encoders to generate structured node embeddings, and further incorporating lightweight classifiers and embedding recycling mechanisms, these systems have achieved significant breakthroughs in both the accuracy and real-time performance of APT detection. These approaches address the semantic gap in traditional provenance graph analysis by combining semantic modeling with graph structure learning, paving the way for more robust and intelligent threat detection frameworks.

In this paper, we propose a cause–effect-based anomaly detection system, that leverages graph neural networks and reinforcement learning to identify and reconstruct attack paths from system-level provenance data. Our approach addresses the challenges of scalable anomaly detection and attack path reconstruction through a two-phase approach: (i) causal graph augmentation using reinforcement learning to assign weights to edges in a provenance graph, and (ii) masked graph representation learning to capture structural and behavioral patterns indicative of malicious activity. Our framework formalizes attack path reconstruction as a subgraph inference problem, where suspicious nodes and edges identified via anomaly detection are integrated into a coherent attack narrative. By leveraging a relational graph autoencoder, we learn to reconstruct both node features and graph structure under a masking scheme, enabling it to detect deviations from normal behavior. The learned representations are then used to generate confidence-scored attack paths, which are validated against known attack patterns. The following are the main contributions of this paper:

A dynamic causal weight assignment method based on reinforcement learning is proposed. The causal quantification of edges is achieved through interaction with the graph environment by Q-learning, breaking through the limitations of static association analysis in traditional traceability graphs.
A causal weighting mechanism is innovatively introduced into the GAT, enabling the attention allocation to prioritize paths with high causal importance and enhancing the model’s ability to capture the causal semantics of attacks.
The anomaly confidence score is integrated with attack path reconstruction, and the key abnormal elements are filtered through the anomaly score. The interpretable attack path is generated by combining the timing and causal dependencies, and the intuitive attack evolution and risk level are presented.

2. Materials and Methods

In this research, we propose an APT detection system based on causality-driven graph representation learning, which aims to accurately identify potential attack chains from network activity data. The overall system architecture is shown in Figure 1. Specifically, we first construct a provenance graph with timestamps based on system entities and activities, and assign causal weights to edges through reinforcement learning to model the dependencies between entities. Then, the original graph data is perturbed by graph masking, and the GAT graph autoencoder is trained on clean data samples that contain no attack behaviors. The model learns the characteristics and structure of normal network activities through a dual reconstruction mechanism: node reconstruction and structure reconstruction. At the feature level, the attention mechanism dynamically aggregates neighborhood information to generate node embeddings. At the structural level, the edge connection pattern of normal interactions is learned by training with positive and negative samples. In the testing phase, the network activity data to be detected is fed into the trained model; anomalies are detected by comparing the test sample against the learned normal behavior, and a confidence is calculated for each abnormal result. Finally, the abnormal nodes and edges with high confidence are selected, and the attack path is constructed by combining their temporal order and causal dependence, which intuitively shows the hidden multi-stage evolution path and the risk level of the abnormal elements within the network activity.

2.1. Construction of the Provenance Graph

In this study, we selected three classical datasets, TRACE, THEIA and CADETS, from the DARPA TC dataset as the research basis. A specific scenario covered by the paper is explained in detail below. Figure 2 presents a provenance graph built from part of the activities in the THEIA dataset of DARPA TC. The graph contains nodes representing 12 types of system entities, such as Subject_Process and NetFlowObject. Edges represent directed interactions between entities, such as “open” and “clone”; in total, there are 24 edge types. Together, these nodes and edges outline the dynamic process of system behavior. In this particular scenario, the attacker’s behavior exposes several potential threats. Some network connections point to “NA/0”, which is extremely rare in normal network interaction and is most likely due to malicious port scanning that probes the open ports of the target host in order to find exploitable vulnerabilities. Alternatively, the attacker may be probing connectivity, that is, testing the network’s connectivity and defense mechanisms in preparation for a more targeted attack. At the same time, a process entity appears whose command line shows “/var/log/mail”. Normally, “/var/log/mail” is the path to the mail log file, not the path to an executable program. Therefore, this is most likely a malicious process in disguise, which evades detection by forging its command line. It may secretly perform malicious operations such as stealing sensitive information, establishing illegal network connections, or tampering with system configuration, which poses a serious threat to system security. Formally, the provenance graph can be expressed as G = (V, E), where V is the set of nodes representing system entities, including processes, files, and IP addresses, and E is the set of edges representing the interactions between entities. Moreover, a causal weight w(e) is assigned to each edge to measure the strength of the causal relationship, as described in the next module.
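As a minimal sketch of the formal definition G = (V, E) with a per-edge causal weight w(e), the graph could be held in a structure like the one below. The class and field names (`Node`, `Edge`, `ProvenanceGraph`, `causal_weight`), the neutral default weight, and the sample entities are illustrative assumptions; they are not taken from the DARPA TC schema or from the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    node_id: str
    node_type: str                 # e.g. "Subject_Process", "NetFlowObject"

@dataclass
class Edge:
    src: str                       # source node id
    dst: str                       # target node id
    edge_type: str                 # e.g. "open", "clone"
    timestamp: int                 # event time (unit is an assumption)
    causal_weight: float = 0.5     # w(e); hypothetical neutral value before learning

@dataclass
class ProvenanceGraph:
    nodes: dict = field(default_factory=dict)   # V: node_id -> Node
    edges: list = field(default_factory=list)   # E: list of Edge

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, edge):
        # Both endpoints must already belong to V.
        if edge.src not in self.nodes or edge.dst not in self.nodes:
            raise KeyError("edge endpoint missing from node set")
        self.edges.append(edge)

    def out_degree(self, node_id):
        return sum(1 for e in self.edges if e.src == node_id)

    def in_degree(self, node_id):
        return sum(1 for e in self.edges if e.dst == node_id)

# Illustrative toy graph: one process clones another, which opens a network flow.
graph = ProvenanceGraph()
graph.add_node(Node("p1", "Subject_Process"))
graph.add_node(Node("p2", "Subject_Process"))
graph.add_node(Node("n1", "NetFlowObject"))
graph.add_edge(Edge("p1", "p2", "clone", timestamp=1))
graph.add_edge(Edge("p2", "n1", "open", timestamp=2))
```

Keeping the weight as an edge attribute means the later stages can read w(e) without changing the graph topology.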

2.2. Reinforcement Learning and Causality

The causal reinforcement learning module is one of the innovations of this paper. It uses reinforcement learning to dynamically assign causal weights to edges in the provenance graph, aiming to overcome the limitation of traditional anomaly detection methods, which focus only on statistical correlation, and to mine the causal relationships in the graph structure. The module is based on the Markov Decision Process (MDP) framework: the weight assignment process is modeled as a sequential decision problem, and the agent learns to optimize the causal weights by interacting with the graph environment and maximizing the cumulative reward. It combines graph representation learning with reinforcement learning to capture structural and semantic causality in complex networks. The core architecture consists of four key components: an edge feature extractor, a causal weight predictor, a reward estimator, and an experience replay buffer. The edge feature extractor processes heterogeneous features from nodes and edges, transforming them into a unified latent space. The weight predictor uses this representation to output a causal weight between 0 and 1 for each edge, indicating the strength of its causal relationship. The reward estimator uses a domain-specific reward function to evaluate the quality of the assigned weights and guide the reinforcement learning agent toward an optimal policy. The experience replay buffer stores the experience collected during learning.

2.2.1. Edge Feature Extractor

The state space is composed of heterogeneous features of edges, and the “edge feature extractor” is used to fuse node pair information and edge attributes. Specifically, the edge feature extractor uses a three-layer neural network architecture to separately process the source node, the target node, and the edge features and encode them into a unified state representation. (1) The Node Encoder maps the node features to the hidden space, $f_{node} : ℝ^{d_{node}} \to ℝ^{d_{hidden}}$. (2) The Edge Encoder maps the edge features to the same hidden space, $f_{edge} : ℝ^{d_{edge}} \to ℝ^{d_{hidden}}$. (3) The Combined Encoder integrates the representations of source nodes, target nodes, and edges, $f_{combined} : ℝ^{3 d_{hidden}} \to ℝ^{d_{hidden}}$. In the forward propagation, the source node features $v_{src}$ and the target node features $v_{dst}$ go through the node encoder to generate the representations $h_{src} = f_{node}(v_{src})$ and $h_{dst} = f_{node}(v_{dst})$. The edge feature goes through the edge encoder to generate $h_{edge} = f_{edge}(e)$, and the unified state representation $h_{combined} = f_{combined}([h_{src}, h_{edge}, h_{dst}])$ is output by the combined encoder after concatenation of the three.
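The forward pass described above can be sketched in plain Python as follows. This is a sketch under stated assumptions: each encoder is reduced to a single dense layer with a ReLU activation, the weights are randomly initialized rather than trained, and the dimensions (4, 3, 8) are arbitrary; none of these details come from the paper.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def linear_relu(layer, x):
    """Compute ReLU(Wx + b) for a plain Python list x."""
    weights, bias = layer
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

class EdgeFeatureExtractor:
    def __init__(self, d_node, d_edge, d_hidden, seed=0):
        rng = random.Random(seed)
        self.f_node = make_linear(d_node, d_hidden, rng)          # shared by src and dst
        self.f_edge = make_linear(d_edge, d_hidden, rng)
        self.f_combined = make_linear(3 * d_hidden, d_hidden, rng)

    def forward(self, v_src, v_dst, e):
        h_src = linear_relu(self.f_node, v_src)
        h_dst = linear_relu(self.f_node, v_dst)
        h_edge = linear_relu(self.f_edge, e)
        # Concatenate in the order [h_src, h_edge, h_dst] before the combined encoder.
        return linear_relu(self.f_combined, h_src + h_edge + h_dst)

extractor = EdgeFeatureExtractor(d_node=4, d_edge=3, d_hidden=8)
state = extractor.forward([1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.1, 0.9], [1.0, 0.0, 0.3])
```

The same node encoder is applied to both endpoints, so source and target nodes land in one hidden space before concatenation.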

2.2.2. Causal Weight Predictor

The action space is defined as continuous-valued causal weights a ∈ [0, 1], output by the “Causal Weight Predictor”. The causal weight predictor is based on the Deep Q-Network (DQN) design, which takes the state representation of the edge as input and predicts the Q-values of two actions (low/high causal weight): $Q(s, a) = f_{Q}(h_{combined})$. The decision-making process uses an ε-greedy strategy to balance exploration and exploitation. With probability ε, an action is sampled at random in order to discover new causal patterns. With probability 1 − ε, the optimal action predicted by the current policy network is selected in order to consolidate the learned effective policy.

a_{t} = \begin{cases} \arg\max_{a} Q(s_{t}, a), & \text{with probability } 1 - ε \\ \text{random action}, & \text{with probability } ε \end{cases}

(1)

ε = ε_{end} + (ε_{start} - ε_{end}) \cdot e^{-λ \cdot \text{step}}

(2)

As the number of training rounds increases, the exploration rate ϵ gradually decays exponentially, making the agent gradually shift from random exploration to relying on the learned policy, improving the learning efficiency. Finally, the Q-value is mapped to the causal weight through the sigmoid function to ensure that the weights are limited within the range of [0, 1].

w = σ (Q (s, a = 1))

(3)

Under this strategy, the agent accumulates diverse experience through extensive exploration in the early stage of training and focuses on exploiting high-reward strategies in the later stage, which achieves efficient convergence of the learning process.
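The action selection, exploration decay, and weight mapping of Equations (1) to (3) can be illustrated with a short sketch. The numeric defaults (`eps_start=1.0`, `eps_end=0.05`, `decay=0.001`) and the example Q-values are assumptions chosen only for illustration; the paper does not report them here.

```python
import math
import random

def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay=0.001):
    """Equation (2): exponential decay of the exploration rate."""
    return eps_end + (eps_start - eps_end) * math.exp(-decay * step)

def select_action(q_values, epsilon, rng):
    """Equation (1): random action with probability epsilon, else argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def causal_weight(q_values):
    """Equation (3): squash Q(s, a=1) into the range (0, 1) with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-q_values[1]))

rng = random.Random(0)
q = [0.2, 1.5]                       # hypothetical Q-values for (low, high)
action = select_action(q, epsilon_at(step=10_000), rng)
weight = causal_weight(q)
```

Early in training `epsilon_at` is close to `eps_start`, so most actions are random; as `step` grows it approaches `eps_end` and the greedy branch dominates.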

2.2.3. Reward Estimator

The reward estimator is responsible for calculating the causal importance rewards of edges, taking into account the multi-dimensional graph structure characteristics. The total reward system consists of five main components. First, the degree-based reward measures the importance of nodes by considering their degree, reflecting their centrality and influence within the graph.

r_{src} = \frac{\text{out-degree}(src)}{\max(\text{in-degree}(src), 1)}

(4)

R_{degree} = 0.2 \cdot \frac{1}{1 + |r_{src} - 1|} + 0.2 \cdot \frac{1}{1 + |r_{dst} - 1|}

(5)

Second, the edge type-based reward quantifies the potential contribution of different edge types to causal relationships, assigning higher weights to special edge types that may carry key causal information and thus better capture attack patterns or critical event sequences. Third, the temporal order-based reward enforces the directionality of causality by capturing the temporal dependencies between events, ensuring that the reward calculation aligns with the logical flow of time in causal inference. Fourth, the anomaly score-based reward leverages node anomaly indices to quantify abnormality; when both the source and target nodes exhibit high anomaly scores, the action of assigning a high causal weight is further encouraged, highlighting the importance of abnormal nodes in attack propagation. Fifth, the degree ratio-based reward evaluates the dynamic balance of information flow by analyzing the ratio of out-degree to in-degree of nodes, reflecting the equilibrium of node interactions in the graph. In terms of weight allocation, the system adopts an equal weighting scheme (0.2 for each component), and the total reward is combined as follows:

R(s, a) = 0.2 \cdot R_{degree} + 0.2 \cdot R_{type} + 0.2 \cdot R_{temporal} + 0.2 \cdot R_{anomaly} + 0.2 \cdot R_{\text{degree-ratio}}

(6)

To avoid excessive rewards, we clip the reward to the range [−1, 1]:

R(s, a) = \max(\min(R(s, a), 1.0), -1.0)

(7)
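A small numerical sketch of Equations (4) to (7) is given below. Only the degree-based term is spelled out in the text, so the values passed for the type, temporal, anomaly, and degree-ratio components are hypothetical placeholders, and the function names are illustrative.

```python
def degree_ratio(out_degree, in_degree):
    """Equation (4): out-degree over in-degree, guarded against division by zero."""
    return out_degree / max(in_degree, 1)

def degree_reward(r_src, r_dst):
    """Equation (5): each term peaks when the node's ratio equals 1."""
    return 0.2 / (1.0 + abs(r_src - 1.0)) + 0.2 / (1.0 + abs(r_dst - 1.0))

def total_reward(r_degree, r_type, r_temporal, r_anomaly, r_degree_ratio):
    """Equations (6) and (7): equal-weight sum, clipped to [-1, 1]."""
    total = 0.2 * (r_degree + r_type + r_temporal + r_anomaly + r_degree_ratio)
    return max(min(total, 1.0), -1.0)

r_src = degree_ratio(out_degree=3, in_degree=3)    # balanced node, ratio 1
r_dst = degree_ratio(out_degree=0, in_degree=4)    # pure sink, ratio 0
r_deg = degree_reward(r_src, r_dst)
# The remaining component values are placeholders, not values from the paper.
reward = total_reward(r_deg, r_type=1.0, r_temporal=1.0,
                      r_anomaly=0.5, r_degree_ratio=0.2)
```

With these inputs the degree-based term is 0.2 + 0.1, and the clipped total stays inside the allowed range without the clipping step taking effect.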

2.2.4. Experience Replay Buffer

The experience replay buffer stores the experiences during the training process, and each experience is represented as a tuple: (edge_id, s, a, r, s′, done).

It supports adding new experiences and random batch sampling operations. The whole causal weight learning algorithm includes three stages: initialization, training cycle, and weight application. In the training loop, the system extracts the state, selects the action, computes the reward, stores the experience for each edge, and then updates the Q-network by sampling batch experience from the buffer. The target value is calculated as follows:

y_{j} = r_{j} + γ \cdot \max_{a} Q_{target}(s'_{j}, a) \cdot (1 - done_{j})

(8)

After training, the system uses the trained Q-network to predict causal weights for each edge and applies the weights to the edge attributes of the graph.

Algorithm 1 summarizes the whole causal reinforcement learning process.

Algorithm 1: Causal weight learning
	Input: G = (V, E)
	Output: G′ = (V, E, W)
1	Initialize $f_{edge}$, Q, Q_target, reward function R, buffer D, optimizer
2	for e ← 1 to E
		for each edge ($v_{src}$, $v_{dst}$) ∈ E:
		a.	extract state $s = h_{combined} = f_{combined}([h_{src}, h_{edge}, h_{dst}])$
		b.	select action a:
			choose $a_{t} = \arg\max_{a} Q(s, a)$ with probability 1 − ε
			choose a random action with probability ε
		c.	compute reward r = R(s, a)
		d.	get next state $s'$
		e.	save (edge_id, s, a, r, s′, done) to D
		f.	sample a batch (s_j, a_j, r_j, s′_j, done_j) from D
		g.	compute target value $y_{j} = r_{j} + γ \cdot \max_{a} Q_{target}(s'_{j}, a) \cdot (1 - done_{j})$
		h.	update Q by minimizing the loss $L = \frac{1}{N} \cdot \sum_{j} (y_{j} - Q(s_{j}, a_{j}))^{2}$
		i.	update Q_target = Q
3	for each edge, set $w = σ(Q(s, a = 1))$
4	return G′
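The loop of Algorithm 1 can be sketched end to end on a toy problem. This is not the authors' implementation: it swaps the DQN for a linear Q-function with one weight vector per action, takes the next edge in iteration order as the next state s′, synchronizes the target weights every `sync_every` steps, and uses a hypothetical reward function. All names and hyperparameters are assumptions.

```python
import math
import random
from collections import deque

def q_values(weights, s):
    """Linear Q-function: one weight vector per action (0 = low, 1 = high)."""
    return [sum(w * x for w, x in zip(row, s)) for row in weights]

def learn_causal_weights(states, reward_fn, epochs=20, gamma=0.9, lr=0.05,
                         batch_size=8, sync_every=10, seed=0):
    rng = random.Random(seed)
    dim = len(states[0])
    q = [[0.0] * dim for _ in range(2)]           # online Q
    q_target = [row[:] for row in q]              # target Q
    buffer = deque(maxlen=1000)                   # experience replay buffer D
    step = 0
    for _ in range(epochs):
        for edge_id, s in enumerate(states):
            eps = 0.05 + 0.95 * math.exp(-0.01 * step)
            if rng.random() < eps:                # explore
                a = rng.randrange(2)
            else:                                 # exploit
                qs = q_values(q, s)
                a = 0 if qs[0] >= qs[1] else 1
            r = reward_fn(edge_id, a)
            done = edge_id == len(states) - 1
            s_next = s if done else states[edge_id + 1]
            buffer.append((edge_id, s, a, r, s_next, done))
            batch = rng.sample(list(buffer), min(batch_size, len(buffer)))
            for _, sj, aj, rj, sj_next, done_j in batch:
                y = rj + gamma * max(q_values(q_target, sj_next)) * (1 - int(done_j))
                err = y - q_values(q, sj)[aj]
                # Semi-gradient step on the squared error (y - Q(s_j, a_j))^2.
                q[aj] = [w + lr * err * x for w, x in zip(q[aj], sj)]
            step += 1
            if step % sync_every == 0:
                q_target = [row[:] for row in q]
    # Weight application: w = sigmoid(Q(s, a=1)) for every edge.
    return [1.0 / (1.0 + math.exp(-q_values(q, s)[1])) for s in states]

def toy_reward(edge_id, action):
    """Hypothetical reward: edge 0 is causally important, edge 1 is not."""
    important = edge_id == 0
    return 1.0 if (action == 1) == important else -1.0

states = [[1.0, 0.0], [0.0, 1.0]]                 # one-hot stand-ins for h_combined
weights = learn_causal_weights(states, toy_reward)
```

On this toy input the learned weight of the rewarded edge does not fall below that of the penalized edge; real edge states would come from the edge feature extractor rather than one-hot vectors.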

2.3. Mask Graph Autoencoder

This part of the design is based on a key insight: in anomaly detection tasks, there are different levels of information in the structure and features of the graph, and key causal relationships may be hidden in complex interaction networks. Traditional graph neural networks tend to treat all edges equally, which cannot effectively capture these causal relationships. To solve this problem, we innovatively combine the masked autoencoder mechanism and causal reinforcement learning to create a framework that can distinguish the importance of edges and learn expressive graph representations.

The causal weights computed by the previous module are deeply integrated into every stage of the encoder, which allows the model to distinguish edges of different causal importance during learning. This mechanism makes the model pay attention not only to “what is the structure of the graph” but also to “what is the causal flow in the graph”, thus improving its ability to detect abnormal patterns. By randomly masking the features of some nodes and then training the model to reconstruct the masked information from the remaining visible graph structure and features, the model is forced to learn the intrinsic structure and semantics of the graph without relying on external labels. This approach is particularly suitable for anomaly detection, where anomalous samples are usually scarce and it is difficult to obtain enough labeled data.

2.3.1. Causal Weight Application and the Masking Mechanism

This step checks whether the graph carries the causal weights generated in the previous stage and applies them to strengthen their influence on the autoencoding process. For the edge features, an adaptive non-linear transformation is used, which treats high-causal and low-causal edges differently in order to increase the degree of discrimination between them.

X'_{e} = \operatorname{sign}(X_{e}) \cdot \begin{cases} |X_{e}| \cdot w_{e}^{2}, & \text{if } w_{e} > 0.5 \\ |X_{e}| \cdot \sqrt{w_{e}}, & \text{if } w_{e} \leq 0.5 \end{cases}

(9)

In addition to directly adjusting the edge features, the system also calculates the message passing weight msg_weight for each edge, which will be used in the subsequent GAT.
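The edge feature transformation of Equation (9) can be written element-wise as in the sketch below. The function name and the sample feature vector are illustrative assumptions; the computation of msg_weight is not specified in this section and is therefore left out.

```python
import math

def apply_causal_weight(x_e, w_e):
    """Equation (9): scale each edge feature by w^2 or sqrt(w), keeping its sign."""
    scale = w_e ** 2 if w_e > 0.5 else math.sqrt(w_e)
    return [math.copysign(abs(x) * scale, x) for x in x_e]

features = [2.0, -1.0]                               # hypothetical edge features
high = apply_causal_weight(features, w_e=0.9)        # uses the w^2 branch
low = apply_causal_weight(features, w_e=0.25)        # uses the sqrt(w) branch
```

Because the scale factor is non-negative, the transformation changes only the magnitude of each feature and never flips its sign.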

The masking process starts by generating a random permutation of the nodes in the graph and then selecting a portion of them as the mask targets, where the mask rate must be chosen appropriately. Too low a mask rate makes the task too easy, and the model may only need to memorize rather than understand; too high a mask rate may remove too much information, making the reconstruction task impossible. For the selected nodes, the system does not simply set the features to zero but replaces them with a learnable mask token. This approach outperforms simple zeroing because it allows the model to distinguish between “missing” and “zero values”, improving the representation power. The mask operation applies only to the node features and does not change the structure of the graph. This design enables the model to use the complete graph structure, including the connections of the masked nodes, to infer the masked features, forcing the model to learn the deep relationship between node features and graph structure.

\hat{X}_{v} = \begin{cases} \text{mask}, & \text{if } v \in M \\ X_{v}, & \text{otherwise} \end{cases}

(10)
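The masking step of Equation (10) can be sketched as follows. In the model the mask token is a learnable parameter; here it is a fixed vector, and the function name, mask rate, and toy features are assumptions made for illustration.

```python
import random

def mask_node_features(features, mask_rate, mask_token, seed=0):
    """Equation (10): replace the features of a random node subset M with a mask token."""
    rng = random.Random(seed)
    order = list(range(len(features)))
    rng.shuffle(order)                          # random permutation of node indices
    num_masked = int(mask_rate * len(features))
    masked = set(order[:num_masked])            # M, the masked node set
    corrupted = [list(mask_token) if v in masked else list(x)
                 for v, x in enumerate(features)]
    return corrupted, masked

features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
corrupted, masked = mask_node_features(features, mask_rate=0.5,
                                       mask_token=[-1.0, -1.0])
```

Only the feature rows change; any edge list kept alongside `features` is untouched, which matches the requirement that masking leaves the graph structure intact.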

2.3.2. Encoder and Decoder

The encoder is a modified GAT consisting of multiple layers of GAT. The following steps are performed for each layer:

Firstly, the encoder linearly transforms the node features to map the node features to the hidden space of the current layer. This process provides the basic feature representation for the subsequent attention mechanism. Next, the attention coefficient is calculated, which not only considers the node features, but also fuses the edge features, so that the model can more fully understand the structural and semantic information of the graph.

{\hat{h}}_{i}^{(l)} = W^{(l)} \cdot h_{i}^{(l - 1)}

(11)

e_{i j}^{(l)} = \text{LeakyReLU} \left( {a^{(l)}}^{T} \left[ W^{(l)} h_{i}^{(l - 1)} \,\|\, W^{(l)} h_{j}^{(l - 1)} \,\|\, W_{e}^{(l)} e_{i j} \right] \right)

(12)

In order to further enhance the model’s ability to perceive the causal relationship, the causal weight enhancement mechanism is introduced. Through this mechanism, the attention coefficient is adjusted according to the causal weights of the edges, so that the model pays more attention to the edges with high causal importance. This design helps to retain more causal information during message passing, thus improving the model’s awareness of causal relationships in the graph.

e_{i j}^{r (l)} = e_{i j}^{(l)} \cdot (1.0 + \text{causal\_scale}_{i j})

(13)

In order to ensure that the sum of the attention coefficients of each node is 1 and form a probability distribution, the model normalizes the attention coefficients by the softmax function. This step ensures that the attention weight of each node is reasonable and can reflect its importance in the graph when aggregating the information of neighboring nodes.

a_{i j}^{(l)} = \frac{\exp (e_{i j}^{r (l)})}{\sum_{k \in N (i)} \exp (e_{i k}^{r (l)})}

(14)
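Equations (13) and (14) can be sketched for the neighborhood of a single node as follows. The function name `causal_attention` is hypothetical, the raw logits from Equation (12) are taken as given, and subtracting the maximum before exponentiation is a standard numerical-stability step that the source does not mention.

```python
import math


def causal_attention(raw_scores, causal_scale):
    """Scale raw attention logits by causal weights, then normalize.

    raw_scores: logits e_ij for every neighbor j of one node i (Eq. (12)).
    causal_scale: the matching causal_scale_ij values.
    Returns attention coefficients that sum to 1 (Eq. (14)).
    """
    # Eq. (13): multiply each logit by (1 + causal_scale_ij).
    scaled = [e * (1.0 + c) for e, c in zip(raw_scores, causal_scale)]
    # Eq. (14): softmax over the neighborhood; subtracting the maximum
    # leaves the result unchanged and avoids overflow (added assumption).
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [x / total for x in exps]


# Two neighbors with equal raw logits; the second edge has a causal scale of 1.
attention = causal_attention([1.0, 1.0], [0.0, 1.0])
```

With equal positive logits, the edge with the larger causal scale receives the larger share of attention. Under this literal reading, a negative logit would be pushed further down by the same multiplication, so the effect depends on the sign of the logit.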

Finally, message aggregation is performed to generate a new node representation. This process updates the hidden state of the node by aggregating the information of the neighboring nodes into the representation of the current node by means of weighted summation. In order to further balance the graph structure information and causal messages, the model also introduces a dual-path message passing mechanism. In addition to the standard message passing, the system performs a message passing that specifically considers causal weights. The results of the two paths are then weighted and fused to generate the final node representation.

h_{i}^{(l)} = σ (\sum_{j \in N (i)} a_{i j}^{(l)} \cdot {\hat{h}}_{j}^{(l)})

(15)

m_{\text{causal}} (i, j) = a_{i j}^{(l)} \cdot {\hat{h}}_{j}^{(l)} \cdot \text{msg\_weight}_{i j}

(16)

h_{\text{causal}}^{(l)} = σ (\sum_{j \in N (i)} a_{i j}^{(l)} m_{\text{causal}} (i, j))

(17)

h_{i}^{(l)} = (1 - α) \cdot h_{i}^{(l)} + α \cdot h_{\text{causal}}^{(l)}

(18)
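The dual-path update of Equations (15) to (18) can be traced on one node with plain Python lists. This is a sketch under stated assumptions: the names `dual_path_update`, `weighted_sum`, and `relu` are hypothetical, ReLU stands in for the unspecified activation σ, and the equations are followed literally, including the second multiplication by the attention coefficient in Equation (17).

```python
def relu(vec):
    """Stand-in for the activation sigma (an assumption, not from the source)."""
    return [max(0.0, x) for x in vec]


def weighted_sum(weights, vectors):
    """Sum the vectors, each multiplied by its scalar weight."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def dual_path_update(attn, neighbor_feats, msg_weight, alpha, act=relu):
    """Fuse standard and causal message passing for a single node i.

    attn: attention coefficients a_ij over the neighbors of i.
    neighbor_feats: transformed neighbor features h_hat_j.
    msg_weight: per-edge message passing weights msg_weight_ij.
    alpha: fusion coefficient between the two paths.
    """
    # Eq. (15): standard attention-weighted aggregation.
    h_std = act(weighted_sum(attn, neighbor_feats))
    # Eq. (16): causal message a_ij * h_hat_j * msg_weight_ij per neighbor.
    causal_msgs = [
        [a * mw * x for x in h]
        for a, mw, h in zip(attn, msg_weight, neighbor_feats)
    ]
    # Eq. (17): aggregate the causal messages, weighted by a_ij as written.
    h_causal = act(weighted_sum(attn, causal_msgs))
    # Eq. (18): weighted fusion of the two paths.
    return [(1 - alpha) * s + alpha * c for s, c in zip(h_std, h_causal)]


fused = dual_path_update(
    attn=[0.5, 0.5],
    neighbor_feats=[[2.0], [4.0]],
    msg_weight=[1.0, 0.0],
    alpha=0.5,
)  # standard path gives 3.0, causal path gives 0.5, fused value is 1.75
```

Setting `alpha` to 0 recovers the standard GAT aggregation, which makes the coefficient a direct control over how much the causal path contributes.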

In addition, in order to further enhance the representation ability and stability of the model, a multi-head attention mechanism is used. Each attention head calculates independently, and then merges the results by feature concatenation. This design allows the model to learn the structure and feature information of the graph from different perspectives, thus improving the robustness and expressiveness of the model.

h_{i}^{(l)} = \Big\|_{k = 1}^{K} σ (\sum_{j \in N (i)} a_{i j}^{(l, k)} \cdot {\hat{h}}_{j}^{(l, k)})

(19)

The decoder maintains the same basic architecture as the encoder, including edge feature integration and causal weight application, but contains only one layer to reduce computational complexity. The output dimension of the decoder is consistent with the original node feature dimension to generate the reconstructed features of all nodes.

2.4. Anomaly Detection and Building Attack Paths

Anomaly detection and attack path construction are key to understanding potential security threats. To this end, we propose a comprehensive modular approach to implement the full flow from anomaly detection to attack path visualization through the cooperative work of anomaly detectors, attack tree reconstructors, and anomaly interpreters. This method can not only identify the abnormal behavior efficiently, but also intuitively display the attack path in the form of attack tree, which provides strong support for network security analysis.

The anomaly detector is based on the K-nearest neighbor (KNN) algorithm and graph embeddings; it evaluates anomaly scores by calculating the distance between a test sample and the training samples. First, embeddings are extracted from the trained graph autoencoder and averaged to compute the embedded representation of the whole graph. Subsequently, the Euclidean distance between the test sample and each training sample is calculated. The K nearest samples in the training set are found, and the anomaly score is the average distance between the test sample and these K nearest neighbors.

z_{G} = \frac{1}{| V |} \sum_{v \in V} z_{v}

(20)

d (z_{i}, z_{j}) = \| z_{i} - z_{j} \|_{2} = \sqrt{\sum_{k = 1}^{d} {(z_{i, k} - z_{j, k})}^{2}}

(21)

\text{score} (G) = \frac{1}{k} \sum_{i = 1}^{k} d (z_{G}, z_{G}^{i})

(22)
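Equations (20) to (22) translate almost directly into code. The sketch below uses only the standard library; the function names and the brute-force neighbor search are illustrative assumptions, and a real system would likely use an indexed nearest-neighbor search for large training sets.

```python
import math


def mean_pool(node_embeddings):
    """Eq. (20): average the node embeddings into one graph embedding."""
    n = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(z[d] for z in node_embeddings) / n for d in range(dim)]


def euclidean(a, b):
    """Eq. (21): Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_anomaly_score(test_embedding, train_embeddings, k):
    """Eq. (22): mean distance to the k nearest training embeddings."""
    distances = sorted(euclidean(test_embedding, z) for z in train_embeddings)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])            # [2.0, 3.0]
near_score = knn_anomaly_score([0.0, 0.0], train, k=2)  # (0 + 1) / 2 = 0.5
far_score = knn_anomaly_score([20.0, 20.0], train, k=2)
```

A sample that lies close to benign training embeddings receives a low score, while one far from all of them receives a high score, so a threshold on the score separates the two.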

Based on the learned causal weights and anomaly scores, the attack tree reconstructor reconstructs the propagation path of abnormal behaviors to form an interpretable attack tree.
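The text does not spell out the reconstruction procedure at this point, so the following is only one plausible reading, not the authors' algorithm. It assumes that each edge carries a causal weight and a timestamp, that a set of anomalous nodes is already known, and that a hypothetical threshold `min_weight` separates key causal edges from the rest. Starting from a chosen root, it grows a tree breadth-first along edges that are causally important, lead to anomalous nodes, and do not go back in time.

```python
from collections import deque


def build_attack_tree(edges, anomalous, root, min_weight=0.5):
    """Grow a tree of anomalous nodes reachable from root.

    edges: iterable of (src, dst, causal_weight, timestamp) tuples.
    anomalous: set of node ids flagged by the anomaly detector.
    min_weight: hypothetical cut-off on the causal weight.
    Returns a dict mapping each tree node to its list of children.
    """
    out_edges = {}
    for src, dst, weight, ts in edges:
        out_edges.setdefault(src, []).append((dst, weight, ts))

    tree = {root: []}
    arrival = {root: float("-inf")}  # time at which each node was reached
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Visit outgoing edges in chronological order.
        for dst, weight, ts in sorted(out_edges.get(node, []), key=lambda e: e[2]):
            if dst in tree or dst not in anomalous:
                continue  # already placed, or not flagged as anomalous
            if weight < min_weight or ts < arrival[node]:
                continue  # weak causal edge, or edge precedes the parent
            tree[node].append(dst)
            tree[dst] = []
            arrival[dst] = ts
            queue.append(dst)
    return tree


example_edges = [
    ("sh", "curl", 0.9, 1),
    ("curl", "payload", 0.8, 2),
    ("sh", "ls", 0.9, 1),         # ls is not flagged as anomalous
    ("payload", "cron", 0.2, 3),  # causal weight below the cut-off
    ("curl", "old", 0.9, 0),      # happens before curl was reached
]
flagged = {"sh", "curl", "payload", "cron", "old"}
attack_tree = build_attack_tree(example_edges, flagged, root="sh")
```

In this toy example, the resulting tree is the chain sh, curl, payload; the other three edges are pruned by the anomaly, weight, and time checks respectively.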

3. Evaluation and Results

In this section, we conduct experiments to verify the key performance of the CAGE framework, focusing on the following two research questions: (RQ1) How does CAGE perform in terms of detection accuracy compared with existing graph anomaly detection systems? (RQ2) How do the key parameters of causal learning affect the detection performance?

Datasets. We conduct experiments on the THEIA, TRACE, and CADETS datasets from DARPA TC (E3), as shown in Table 1. These datasets are derived from the red-team versus blue-team exercises organized by DARPA, which simulate Advanced Persistent Threat (APT) attack scenarios in an enterprise network environment. They contain complete records ranging from benign activities to multi-stage attacks and come with attack ground truth officially labeled by DARPA, which provides reliable support for anomaly detection and attack chain reconstruction. The THEIA dataset records the interaction traces of system events, subjects, objects, and other entities, including timestamps and parameters of behaviors such as process creation and file access. The TRACE dataset stores the causal dependencies between entities in the form of a directed graph, such as process tree structures and file operation chains. The CADETS dataset collects the full-lifecycle activities of multiple hosts through cross-platform tools, covering the multi-stage attack process from initial penetration to data exfiltration. The three datasets differ in complexity and attack richness: THEIA has medium complexity and medium attack richness; CADETS has the lowest complexity and the least attack richness; TRACE has the highest complexity and the most diverse attacks. These differences allow a comprehensive evaluation of CAGE in different scenarios. For data processing, we use the multi-day benign data for model training and the attack-day data for testing. Detection criteria are defined by analyzing the associations between entities and the event timing, combined with the annotation information; for example, abnormal nodes and their associated entities are regarded as true positives, which quantifies the model’s ability to track attack paths. The annotation granularity of the datasets is accurate to the attack time window, which ensures that the detection results are consistent with the real attack logic and provides a sound basis for evaluating the algorithm.

Experimental Details. Our CAGE system is developed in Python 3.8.20 and comprises about 4000 lines of code. The experiments were run on a 13th-generation Intel® Core™ i7-13700H 2.40 GHz processor with Windows 11 and 32 GB of RAM. We adopted the DGL library to implement the graph neural network model. The model employs the GAT, a graph convolutional layer that adaptively learns the importance of relationships between nodes, and applies dropout and activation functions between layers to improve generalization. Latent representations of the graph structure and node features are learned through a mask reconstruction mechanism. Both node features and edge features are considered, and the causal weight mechanism is used to make anomaly detection more sensitive to critical paths. Before training, we wrote Python scripts to build the provenance graph, filter out unnecessary information, and abstract node entities, which adapts the original graph to our graph neural network model. Table 2 presents the hyperparameter settings.

RQ1: Detection Performance

In anomaly detection, the performance of a detector is usually verified with multiple metrics on standard datasets. To evaluate the detection accuracy of the proposed CAGE system, we report the confusion-matrix counts together with four metrics (Precision, Recall, Accuracy, and F1 score) on the three DARPA TC datasets.
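For reference, these metrics follow from the confusion-matrix counts by the standard definitions. The short sketch below recomputes them from the THEIA-E3 counts reported in Table 3 (TP = 18,886, FP = 443, TN = 308,061, FN = 18); the function name `detection_metrics` is hypothetical.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute standard detection metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Counts taken from the THEIA-E3 row of Table 3.
theia = detection_metrics(tp=18_886, fp=443, tn=308_061, fn=18)
for name, value in theia.items():
    print(f"{name}: {value:.2%}")
```

The recomputed precision (about 97.71%), recall (about 99.90%), and F1 score (about 98.79%) agree with the values listed for THEIA-E3 in Table 3.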

As shown in Table 3, CAGE achieves strong detection results. The number of false negatives (FN) is low on all three datasets, indicating a very low missed-detection rate, and every metric is close to perfect: CAGE accurately distinguishes abnormal from normal samples and balances false alarms against missed detections. We also conduct comparative experiments with the existing provenance-graph-based detectors THREATRACE and FLASH.

THREATRACE relies on the statistical features of nodes and edges to identify anomalies, but it has difficulty distinguishing causal relationships from mere surface correlations between entities and is prone to misclassifying normal high-frequency interactions as attack paths. FLASH combines semantic encoding with graph neural networks, but it assigns equal weights to all edges and thus fails to highlight the key interactions in the attack chain. In contrast, CAGE is specifically optimized for this problem: it uses reinforcement learning to dynamically quantify the causal importance of edges and focus on key interactions, improves the GAT so that attention is allocated according to causal weights, and leverages a masked graph autoencoder to learn normal patterns and capture the causal logic of attack paths more accurately.

As shown in Table 4, CAGE outperforms the existing systems THREATRACE and FLASH in detection accuracy on the three datasets THEIA-E3, CADETS-E3, and TRACE-E3, with all of its indicators at 97% or higher. It is worth noting that all three systems reach 99% recall, indicating a comparable ability to capture abnormal samples with almost no missed detections. The advantage of CAGE therefore lies mainly in precision: by reducing false positives, it improves overall detection reliability while maintaining a high recall rate. These results show that, compared with the existing detection systems, CAGE achieves higher detection accuracy on all three datasets and is better at balancing the two core requirements of “not missing anomalies” and “not misjudging normal samples”, which makes it more competitive in practical applications.

RQ2: Parameter Sensitivity

In model design and training, the selection of hyperparameters has a great impact on model performance. This experiment focused on three parameters: the hidden layer dimension, the number of heads of the multi-head attention mechanism (n_heads), and the number of network layers (n_layers). To clarify how different hyperparameters affect the model on complex datasets, this experiment used the control variable method, varying one parameter at a time. We explored the performance impact of the hidden layer dimension (32/64/128), the number of attention heads (2/4/8), and the number of network layers (2/3/4) on the THEIA, CADETS, and TRACE datasets. The results are as follows:
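The control variable method can be pictured as a one-factor-at-a-time sweep: each configuration changes a single hyperparameter while the others stay at their baseline. The sketch below enumerates such configurations for the three value ranges given above. The baseline values in `DEFAULTS` are placeholders chosen for illustration, since the baseline setting is not stated here, and the key name `hidden_dim` is likewise an assumption.

```python
def one_factor_sweep(defaults, grid):
    """List configurations that vary exactly one hyperparameter at a time."""
    configs = []
    for name, values in grid.items():
        for value in values:
            cfg = dict(defaults)  # copy the baseline setting
            cfg[name] = value     # change only the parameter under study
            configs.append(cfg)
    return configs


# Placeholder baseline (hypothetical, not taken from the paper).
DEFAULTS = {"hidden_dim": 64, "n_heads": 4, "n_layers": 3}

# Value ranges explored in the experiment.
GRID = {
    "hidden_dim": [32, 64, 128],
    "n_heads": [2, 4, 8],
    "n_layers": [2, 3, 4],
}

sweep = one_factor_sweep(DEFAULTS, GRID)
```

This yields nine runs per dataset rather than the 27 of a full grid, at the cost of not observing interactions between the parameters.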

The effect of the hidden layer dimension on model performance shows significant dataset dependence. As shown in Figure 3, on the THEIA dataset, as the hidden layer dimension increases from 32 to 128, the F1 score decreases slightly from 98.79% to 98.12%, the precision decreases from 97.71% to 96.40%, and the recall stays above 99.90%. This suggests that the feature patterns of this dataset are simple and that a small hidden dimension already meets the requirements of high-precision detection; increasing the dimension brings little benefit and may even cause a slight decline due to redundant computation. In contrast, the CADETS dataset is highly sensitive to the hidden layer dimension: at a dimension of 32, the F1 score is only 75.28% and the precision is 60.41%. As the dimension increases to 128, the F1 score rises to 98.97% and the precision to 98.19%, close to the best level on the THEIA dataset, indicating that this dataset contains more complex feature associations and requires a higher model capacity to separate abnormal from normal samples. The TRACE dataset, by comparison, maintains near-perfect performance (F1 > 99.95%, precision > 99.97%) except at a dimension of 64, where the F1 score dips to 99.54%; this dip may be related to local redundancy in feature extraction under that configuration, and overall the dataset is not sensitive to dimension changes. In terms of cost, as the hidden layer dimension increased from 32 to 128, memory usage rose from 1.56 GB to 3.37 GB. Training time was longest at 128 dimensions (67,091 s) and shortest at 64 dimensions (31,784 s).

The number of attention heads also interacts closely with the characteristics of the dataset. In GAT, 2, 4, and 8 are common choices for the number of attention heads. Too few heads fail to capture different types of attention patterns, while too many heads leave each head with an excessively small feature dimension, which impairs the model’s expressive capability. The range of 2–8 therefore preserves the model’s expressive power while keeping computational complexity under control. As shown in Figure 4, on the THEIA dataset, n_heads = 4 performs best (F1 = 98.49%, precision = 97.12%), slightly higher than n_heads = 2 (F1 = 98.09%) and n_heads = 8 (F1 = 98.18%), and the recall stays at 99.91%. This indicates that moderately increasing the number of heads can improve feature discrimination through multi-subspace learning, whereas too many heads tend to introduce redundancy. The response on the CADETS dataset is more complex: the F1 score is 90.90% when n_heads = 2, falls to 85.11% when n_heads = 4, and rises to 92.60% when n_heads = 8, suggesting that the multi-scale features of this dataset need more heads to cover, while the intermediate configuration may cause performance fluctuations due to a poorly matched division of the feature subspaces. The performance on the TRACE dataset is stable (F1 > 99.96%) for all head count settings.

The influence of the number of network layers reflects the constraint on model depth. As shown in Figure 5, on the THEIA dataset, performance is best when n_layers = 2 (F1 = 98.99%), and the F1 score decreases slightly when n_layers is increased to 3 or 4, indicating that a shallow network can fully capture the characteristics of this dataset. The CADETS dataset also performs best with n_layers = 2 (F1 = 92.59%), but the F1 score drops to 84.46% when the number of layers increases to 4, possibly because the deeper network exacerbates vanishing gradients or overfitting. The TRACE dataset is not sensitive to the number of layers, and the F1 score remains above 99.96% under each configuration.

Figure 6 shows the ROC curves comparing hyperparameter settings on the THEIA dataset. Different colors correspond to different hyperparameter settings, and each curve plots the true positive rate against the false positive rate, showing intuitively how each hyperparameter influences the performance of the APT detection model. In terms of runtime on the THEIA dataset, the model loading time is 0.020 s and the model has 225,015 parameters, which favors lightweight deployment. The average inference time for a single sample is 1.001 s with no fluctuations, which can meet the requirements of real-time detection scenarios. When processing graph data containing 327,408 nodes and 597,282 edges, the average time for causal weight calculation is 725.34 s, which provides a quantitative baseline for optimizing the efficiency of the model’s core mechanism.

4. Discussion

The CAGE framework proposed in this paper aims to solve the problem of identifying complex APT attack paths in large-scale provenance graphs. To address the limitation that traditional intrusion detection systems focus only on surface correlations between entities and ignore deep causal relationships, CAGE combines reinforcement learning with graph neural networks to achieve accurate detection and interpretable analysis of APT attacks.

Specifically, CAGE consists of three parts. First, a reinforcement learning-based causal weight assignment mechanism dynamically learns the causal importance of edges in the graph through an edge feature extractor, a causal weight predictor, a reward estimator, and an experience replay buffer, and distinguishes key causal paths from surface associations. Second, a masked graph autoencoder integrates the causal weights into the message passing process of the GAT and, through node feature masking and reconstruction, forces the model to learn graph structure and semantic patterns centered on causal logic. Third, an anomaly detection and attack path reconstruction module calculates the anomaly score with the K-nearest neighbor algorithm and constructs an interpretable attack path from causal dependencies and temporal order.

5. Conclusions

Experiments on three datasets from the DARPA TC project were performed to verify the effectiveness of CAGE. The results show that the proposed framework is superior to the existing methods in key indicators such as precision and F1 score, while the recall rate is maintained at a high level of 99%, effectively balancing missed detections and false positives. Parameter sensitivity analysis further shows that CAGE adapts to datasets of different complexity and that its performance can be optimized by adjusting the hidden layer dimension, the number of attention heads, and the number of layers, which provides guidance for practical deployment. In summary, by deeply mining the causal relationships in the provenance graph, CAGE provides a solution with both high accuracy and strong interpretability for APT attack detection. This demonstrates the value of combining causal reasoning with graph neural networks in network security and lays a foundation for building more efficient threat detection systems.

CAGE also has limitations. Its validation relies solely on DARPA TC datasets, which, though representative of simulated APT scenarios, may not fully capture the diversity of real-world networks or emerging attack patterns. Additionally, the framework’s causal weight mechanism, though effective for enterprise entity interactions, may struggle to adapt to domains with distinct interaction semantics, where entity relationships and attack patterns differ significantly. These constraints point to the need for broader validation across diverse datasets and enhanced domain adaptability in future work.

Author Contributions

Conceptualization, X.C.; methodology, X.C.; software, M.K.; validation, X.C., M.K. and H.Y.; formal analysis, X.C.; investigation, M.K.; resources, H.Y.; data curation, M.K.; writing—original draft preparation, X.C.; writing—review and editing, M.K.; visualization, M.K.; supervision, X.C.; project administration, X.C.; funding acquisition, X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Natural Science Foundation of Jiangsu Province (No. BK20230558) and the Xinjiang Uygur Autonomous Region Natural Science Foundation project (No. 2024D01A40).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Wang, Y.; Liu, H.; Li, Z.; Su, Z.; Li, J. Combating advanced persistent threats: Challenges and solutions. IEEE Netw. 2024, 38, 324–333.
Kareem, K. A comprehensive analysis of pegasus spyware and its implications for digital privacy and security. arXiv 2024, arXiv:2404.19677.
Fraunhofer FKIE. SolarMarker (Win32)—Threat Summary. Malpedia. 30 May 2024. Available online: https://malpedia.caad.fkie.fraunhofer.de/details/win.solarmarker (accessed on 9 August 2025).
McAfee. CLOP Ransomware Exploits MOVEit Software. 2023. Available online: https://www.mcafee.com/blogs/other-blogs/mcafee-labs/clop-ransomware-exploits-moveit-software/ (accessed on 9 August 2025).
Akbarzadeh, A.; Erdodi, L.; Houmb, S.H.; Soltvedt, T.G. Two-stage advanced persistent threat (APT) attack on an IEC 61850 power grid substation. Int. J. Inf. Secur. 2024, 23, 2739–2758.
Mahmoud, M.; Mannan, M.; Youssef, A. APTHunter: Detecting advanced persistent threats in early stages. Digit. Threat. Res. Pract. 2023, 4, 1–31.
Cheng, W.; Yuan, Q.; Zhu, T.; Chen, T.; Ying, J.; Zheng, A.; Ma, M.; Xiong, C.; Lv, M.; Chen, Y. TAGAPT: Towards Automatic Generation of APT Samples with Provenance-level Granularity. IEEE Trans. Inf. Forensics Secur. 2025, 20, 4137–4151.
Lee, J.S.; Fan, Y.Y.; Cheng, C.H.; Chew, C.-J.; Kuo, C.-W. ML-based intrusion detection system for precise APT cyber-clustering. Comput. Secur. 2025, 149, 104209.
Xuan, C.D.; Nguyen, T.T. A novel approach for APT attack detection based on an advanced computing. Sci. Rep. 2024, 14, 22223.
Yue, H.; Li, T.; Wu, D.; Zhang, R.; Yang, Z. Detecting APT attacks using an attack intent-driven and sequence-based learning approach. Comput. Secur. 2024, 140, 103748.
Liu, H.; Wang, Y.; Su, Z.; Wang, Z.; Pan, Y.; Lit, R. TRACEGADGET: Detecting and Tracing Network Level Attack Through Federal Provenance Graph. In Proceedings of the ICC 2024—IEEE International Conference on Communications, Denver, CO, USA, 9–13 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 2713–2718.
Xu, F.; Zhao, Q.; Liu, X.; Wang, N.; Gao, M.; Wen, X.; Zhang, D. Advanced persistent threat detection via mining long-term features in provenance graphs. Front. Comput. Sci. 2025, 19, 1910809.
Li, T.; Liu, X.; Qiao, W.; Zhu, X.; Shen, Y.; Ma, J. T-trace: Constructing the APTs provenance graphs through multiple syslogs correlation. IEEE Trans. Dependable Secur. Comput. 2023, 21, 1179–1195.
Weng, Z.; Zhang, W.; Zhu, T.; Dou, Z.; Sun, H.; Ye, Z.; Tian, Y. RT-APT: A real-time APT anomaly detection method for large-scale provenance graph. J. Netw. Comput. Appl. 2025, 233, 104036.
Akbar, K.A.; Wang, Y.; Ayoade, G.; Gao, Y.; Singhal, A.; Khan, L.; Thuraisingham, B.; Jee, K. Advanced persistent threat detection using data provenance and metric learning. IEEE Trans. Dependable Secur. Comput. 2022, 20, 3957–3969.
Yang, F.; Xu, J.; Xiong, C.; Li, Z.; Zhang, K. PROGRAPHER: An anomaly detection system based on provenance graph embedding. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23), Anaheim, CA, USA, 9–11 August 2023; pp. 4355–4372.
Goyal, A.; Han, X.; Wang, G.; Bates, A. Sometimes, you aren’t what you do: Mimicry attacks against provenance graph host intrusion detection systems. In Proceedings of the 30th Network and Distributed System Security Symposium, San Diego, CA, USA, 27 February 2023.
Zhu, T.; Yu, J.; Xiong, C.; Cheng, W.; Yuan, Q.; Ying, J.; Chen, T.; Zhang, J.; Lv, M.; Chen, Y.; et al. Aptshield: A stable, efficient and real-time apt detection system for linux hosts. IEEE Trans. Dependable Secur. Comput. 2023, 20, 5247–5264.
Li, S.; Dong, F.; Xiao, X.; Wang, H.; Shao, F.; Chen, J.; Guo, Y.; Chen, X.; Li, D. Nodlink: An online system for fine-grained apt attack detection and investigation. arXiv 2023, arXiv:2311.02331.
Wang, S.; Wang, Z.; Zhou, T.; Sun, H.; Yin, X.; Han, D.; Zhang, H.; Shi, X.; Yang, J. Threatrace: Detecting and tracing host-based threats in node level through provenance graph learning. IEEE Trans. Inf. Forensics Secur. 2022, 17, 3972–3987.
Wang, Q.; Hassan, W.U.; Li, D.; Jee, K.; Yu, X.; Zou, K.; Chen, H. You are what you do: Hunting stealthy malware via data provenance analysis. In Proceedings of the NDSS, San Diego, CA, USA, 23–26 February 2020.
Lv, M.; Gao, H.Z.; Qiu, X.; Chen, T.; Zhu, T.; Chen, J.; Ji, S. TREC: APT tactic/technique recognition via few-shot provenance subgraph learning. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, Salt Lake City, UT, USA, 14–18 October 2024; pp. 139–152.
Ren, J.; Geng, R. Provenance-based APT campaigns detection via masked graph representation learning. Comput. Secur. 2025, 148, 104159.
Cheng, Z.; Lv, Q.; Liang, J.; Wang, Y.; Sun, D.; Pasquier, T.; Han, X. Kairos: Practical intrusion detection and investigation using whole-system provenance. In Proceedings of the 2024 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 3533–3551.
Jia, Z.; Xiong, Y.; Nan, Y.; Zhang, Y.; Zhao, J.; Wen, M. MAGIC: Detecting advanced persistent threats via masked graph representation learning. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA, 14–16 August 2024; pp. 5197–5214.
Aly, A.; Iqbal, S.; Youssef, A.; Mansour, E. MEGR-APT: A Memory-Efficient APT Hunting System Based on Attack Representation Learning. IEEE Trans. Inf. Forensics Secur. 2024, 19, 5257–5271.
Rehman, M.U.; Ahmadi, H.; Hassan, W.U. Flash: A comprehensive approach to intrusion detection via provenance graph representation learning. In Proceedings of the 2024 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 3552–3570.

Figure 1. The overall system architecture.

Figure 2. Provenance graph constructed from selected nodes in THEIA dataset.

Figure 3. Results for different hidden dimensions.

Figure 4. Results for different numbers of attention heads (n_heads).

Figure 5. Results for different numbers of network layers (n_layers).

Figure 6. ROC curves for the THEIA dataset.

Table 1. Overview of DARPA TC dataset.

Dataset	Nodes	Edges	Attack Nodes	Attack Edges
THEIA-E3	327,408	597,282	18,904	25,789
CADETS-E3	357,173	840,299	12,846	177,849
TRACE-E3	3,288,676	4,080,457	68,086	457,011

Table 2. Hyperparameter settings.

Parameter	Default
lr	0.001
Epochs	500
Batch_size	8
Activation	PReLU
Optimizer	Adam
Loss_fn	SCE

Table 3. Results of node level detection in DARPA TC dataset.

Dataset	TN	FN	TP	FP	AUC	F1	Precision	Recall
THEIA-E3	308,061	18	18,886	443	99.83%	98.79%	97.71%	99.90%
CADETS-E3	344,091	31	12,815	236	98.06%	98.97%	98.19%	99.76%
TRACE-E3	616,010	25	68,061	11	99.99%	99.97%	99.98%	99.96%

Table 4. Comparison between CAGE and state-of-the-art APT detection methods on different datasets.

Dataset	System	Precision	Recall	Accuracy	F1
THEIA-E3	THREATRACE	87%	99%	99%	93%
	FLASH	93%	99%	99%	96%
	CAGE	97%	99%	99%	98%
CADETS-E3	THREATRACE	90%	99%	98%	95%
	FLASH	95%	99%	99%	97%
	CAGE	98%	99%	98%	99%
TRACE-E3	THREATRACE	72%	99%	99%	83%
	FLASH	95%	99%	99%	97%
	CAGE	99%	99%	99%	99%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
