Continual Graph Learning with Knowledge-Augmented Replay: A Case for Ethereum Phishing Detection

Tian, Zonggui; Zhang, Du

doi:10.3390/electronics14173345


by Zonggui Tian and Du Zhang *

School of Computer Science and Engineering, Faculty of Innovation Engineering, Macau University of Science and Technology, Macau SAR 999078, China

* Author to whom correspondence should be addressed.

Electronics 2025, 14(17), 3345; https://doi.org/10.3390/electronics14173345

Submission received: 29 June 2025 / Revised: 8 August 2025 / Accepted: 20 August 2025 / Published: 22 August 2025

(This article belongs to the Section Artificial Intelligence)


Abstract

Humans have the ability to incrementally learn, accumulate, update, and apply knowledge from dynamic environments. This capability, known as continual learning or lifelong learning, is also a long-term goal in the development of artificial intelligence. However, neural network-based continual learning suffers from catastrophic forgetting: the acquisition of new knowledge typically disrupts previously learned knowledge, leading to partial forgetting and a decline in the model’s overall performance. Most current continual learning methods can only mitigate catastrophic forgetting and fail to incrementally improve the overall performance. In this work, we aim to incrementally improve performance in the sample-incremental setting by using inter-stage edges as a pathway for explicit knowledge transfer in continual graph learning. Building on this pathway, we propose a knowledge-augmented replay method that leverages evolving subgraphs of important nodes. This method enhances the distinction between patterns associated with different node classes and consolidates previously learned knowledge. Experiments on phishing detection in Ethereum transaction networks validate the effectiveness of the proposed method, demonstrating effective knowledge retention and augmentation while overcoming catastrophic forgetting and incrementally improving performance. The results also reveal the relationship between average accuracy and average forgetting. Lastly, we identify the key factor in incremental performance improvement, which lays a foundation for the convergence of continual graph learning.

Keywords:

continual graph learning; catastrophic forgetting; inter-stage edges; incremental performance improvement; Ethereum phishing detection

1. Introduction

Human beings can incrementally acquire, accumulate, update, and utilize knowledge from ever-changing environments throughout their lifespan. This ability is referred to as continual learning [1]. While machine learning, especially deep learning, has made significant strides in various tasks within natural language processing (NLP) and computer vision (CV), endowing machine learning systems with this ability remains a long-standing challenge. This difficulty primarily stems from catastrophic forgetting [2], where an intelligent agent typically experiences a substantial decline in performance on previously learned tasks when learning new ones. This phenomenon is rooted in the plasticity–stability dilemma [3].

To address catastrophic forgetting, many strategies have been proposed, including replay-based, regularization-based, and parameter isolation-based approaches [1]. These methods strike a good balance between the plasticity and stability of agents and mitigate the phenomenon of catastrophic forgetting. Nevertheless, existing methods for mitigating catastrophic forgetting typically only control the extent of performance decline and struggle to achieve incremental performance improvement.

Recent work extends applications of continual learning to graph data, particularly within graph neural networks (GNNs), a field referred to as continual graph learning [4,5,6,7,8]. To address the catastrophic forgetting problem in continual graph learning, several recent studies undertake preliminary explorations and report promising results [9,10,11,12,13,14,15]. Despite these advancements, addressing catastrophic forgetting in continual graph learning remains challenging, as the structural information introduces additional complexity across various aspects of continual graph learning, including scenario settings, task types, and anti-forgetting strategies. Moreover, none of these methods improves model performance incrementally.

Although many continual graph learning works explicitly consider structural information to develop topological structure-aware methods, such as TWP [10] and PDGNNs [16], they rarely handle connections across different graphs. More precisely, these connections are called inter-task edges [17] in multi-task continual learning scenarios (e.g., class-incremental and task-incremental settings) or inter-stage edges in single-task scenarios (e.g., sample-incremental and domain-incremental settings). SEA-ER [18] argues that inter-task edges induce structural shifts and catastrophic forgetting, and introduces a structure-evolution-aware replay method to mitigate these shifts. IncreGNN [19] separately samples replay nodes from those connected by inter-task edges and those without such connections, thereby incorporating inter-task edges into continual graph learning for link prediction. TACO [20] explicitly retrieves and retains inter-task edges by merging the previous reduced graph with the new task subgraph via a node-mapping table, reconnecting new-to-old edges during combination. These preserved inter-task edges are carried through the coarsening and proxy graph generation steps to maintain each node’s evolving receptive field. Nevertheless, these works do not investigate how connections between nodes in different tasks or stages affect model performance. Ref. [17] conducts experiments on node-level tasks under task-incremental settings with and without considering inter-task edges. The results show that inter-task edges introduce contradictory factors for model performance, and the conditions under which they provide benefits versus drawbacks remain uncertain.

This paper explores incremental performance improvement by investigating the influence of inter-stage edges on knowledge transfer. Specifically, we utilize inter-stage edges as an explicit pathway for knowledge transfer in continual graph learning. Through this pathway, we propose to reframe node classification tasks as graph classification tasks to achieve efficient knowledge transfer in sample-incremental scenarios of continual graph learning. Moreover, we present a knowledge-augmented replay method that can augment and supplement old knowledge with new knowledge while preserving it, thereby incrementally improving overall performance and overcoming catastrophic forgetting. We validate its effectiveness in a real-world application. Drawing on the experimental results, we further investigate the relationship between average accuracy and average forgetting and identify the key factor in incremental performance improvement. The contributions of this paper are summarized as follows:

Knowledge-augmented replay: We investigate the influence of inter-stage edges on model performance in the sample-incremental scenario of continual graph learning and propose a knowledge-augmented replay method for node classification in this scenario. It leverages inter-stage edges and evolving subgraphs of reappearing nodes as pathways for effective knowledge transfer. It preserves prior knowledge while integrating new knowledge, simultaneously reinforcing and augmenting existing knowledge to overcome catastrophic forgetting, thereby achieving incremental performance improvement and yielding results on par with full retraining but with fewer resources.
A real-world application: We demonstrate the practical effectiveness of the proposed method through validation in Ethereum phishing scam detection. To the best of our knowledge, this is the first study aimed at addressing catastrophic forgetting and improving performance incrementally in Ethereum phishing scam detection using continual graph learning.
Key factor in incremental performance improvement: We explore the relationship between the average accuracy and average forgetting based on the experimental results. Furthermore, we identify the key factor in incremental performance improvement, laying a foundation for convergence analysis in broader continual learning contexts.

The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 provides the necessary preliminaries, including the notations used, definitions, and problem formalization. The methodology is detailed in Section 4, and the experiments are presented in Section 5. Section 6 discusses the key to incremental performance improvement in continual graph learning from a knowledge perspective. Finally, Section 7 concludes the paper.

2. Related Work

2.1. Ethereum Phishing Scam Detection

Ethereum phishing scam detection hinges on obtaining highly discriminative features, typically achieved through two primary approaches. One approach leverages traditional machine learning to obtain statistical features. For example, ref. [21] extracts over 200 statistical features within the first-order and second-order node neighbors. The other approach utilizes various network embedding techniques to generate node or graph embeddings, such as node2vec [22,23,24], trans2vec [25], and graph2vec [26,27].

Recent work focuses on extracting richer information from transaction networks to augment representations, for example by using self-supervised learning, including graph contrastive learning [28,29] and self-supervised regression [30], to obtain high-quality and robust representations. Other work incorporates additional information to enhance embeddings, such as interaction intensity [31], node and edge types [32], and temporal information. An increasing number of studies consider temporal information a critical factor, as phishing accounts often follow specific temporal patterns, such as conducting a large number of transactions within a short period. Integrating temporal information with transactions can typically enhance feature representations and improve model performance; examples include [33,34,35,36,37,38,39]. In particular, ref. [37] emphasizes capturing the dynamic evolution patterns of nodes to continuously identify new transaction features.

Nevertheless, these studies do not explore Ethereum phishing detection through the lens of continual learning. Ref. [40] employs incremental learning during pre-training to obtain an optimal GNN as the encoder to generate node embeddings. However, it does not leverage the encoders obtained from different pre-training stages to generate node embeddings, thereby failing to assess the impact of incremental learning on the performance change of the downstream task.

2.2. Graph Representation Learning

Graph representation learning, also called network representation learning, aims to learn low-dimensional representations of nodes, edges, subgraphs, or entire graphs from complex networks while preserving the essential structural and relational information inherent in the networks. A typical graph representation learning process can be framed within an encoder–decoder architecture [41]. In this framework, the encoder takes the graph as input, and possibly additional attributes like node features or edge weights, and maps its nodes, edges, subgraphs, or the entire graph into a low-dimensional vector space and generates embeddings. The decoder then utilizes these embeddings to either reconstruct or predict certain properties of the original graph. GNNs are commonly used graph encoders, including GCNs [42], GATs [43], GraphSAGE [44], GINs [45], GAEs and VGAEs [46], and graph transformer networks (GTNs) [47].

Learning representations in the Ethereum transaction network for phishing scam detection presents unique challenges due to its dynamic nature, large scale, label scarcity, and complex structure. According to the statistics on Etherscan, Ethereum currently has nearly 300 million unique addresses and over 2.4 billion transactions, with both numbers increasing continuously. The vast majority of these addresses are unlabeled. Moreover, Ethereum exhibits a pronounced long-tail distribution, with a small number of nodes having very high degrees, while the majority of nodes maintain very low degrees. This imbalance hinders the model’s ability to learn stable, effective representations for low-degree nodes.

2.3. Continual Learning

Numerous strategies have been proposed to overcome catastrophic forgetting in continual learning, including replay-based, regularization-based, and parameter isolation-based approaches [48]. Replay-based methods [49,50,51] store a few historical samples in a memory buffer and replay these samples when learning new tasks to overcome catastrophic forgetting. Regularization-based methods [52,53,54] alleviate catastrophic forgetting by adding regularization terms to the loss function to consolidate previously learned knowledge when learning new tasks. Parameter isolation-based methods [55,56,57] allocate parameters to different tasks to avoid interference between tasks. Further families of approaches have since been proposed, including optimization-based and representation-based methods, as summarized in [1]. More recently, ref. [58] proposes parabolic continual learning by formulating continual learning as a partial differential equation (PDE)-constrained optimization problem, constraining the evolution of the loss function over time to follow a parabolic PDE. Nevertheless, these methods typically fail to improve performance incrementally.

2.4. Continual Graph Learning

Continual graph learning is an emerging field where the structural information between data samples is incorporated into continual learning. Refs. [4,6,7,8] summarized recent continual graph learning efforts and strategies from various perspectives comprehensively. These approaches broadly fall into the three main continual learning categories.

TWP [10] is a typical regularization-based continual graph learning method. It measures parameter importance by integrating both task-related loss and topological information, and then preserves the most critical parameters. However, regularization methods usually interfere with the model’s ability to learn new knowledge. PI-GNN [59] proposes parameter isolation and expansion to circumvent the tradeoff between learning new patterns and maintaining old ones. GCL [60] follows the same line: it learns the optimal model structure through reinforcement learning and adds or prunes neurons to ensure sufficient model capacity. These methods typically require maintaining a gradually expanding model capacity. A notable work is HPN [5], where knowledge is represented by various levels of prototypes and their combinations, and the number of model parameters is theoretically upper bounded by a given threshold. Replay-based methods occupy a dominant position in continual graph learning. They evaluate node importance using various criteria, optionally considering structural information, and then use importance-based sampling to select and store representative nodes from earlier tasks for replay. Representative works include ContinualGNN [9], ER-GNN [61], DyGRAIN [62], and PBR [63]. Some works, such as SSM [64], CaT [11], and PDGNNs [16], use sparsified or condensed subgraphs instead of individual nodes for replay to reduce memory overhead. More recently, PromptCGL [65] proposes combining prompt learning with continual graph learning, which falls outside the three main categories. Nevertheless, none of these works examines the impact of inter-stage edges on model performance.

To the best of our knowledge, we are the first to apply continual graph learning to address catastrophic forgetting in Ethereum phishing detection and improve performance incrementally.

3. Preliminaries

We present the definitions of key concepts and the problem formulation in this section. Table 1 summarizes the key notations used in this paper.

3.1. Definitions

Dynamic Network: A dynamic network is represented by $G_{t} = (V_{t}, E_{t})$, where $V_{t}$ and $E_{t}$ are the sets of vertices and edges at time $t$, respectively. Specifically, $G_{t} = G_{t - 1} \cup \Delta G_{t}$, where $\Delta G_{t} = (\Delta V_{t}, \Delta E_{t})$ is the incremental network between time $t - 1$ and $t$, with $\Delta V_{t}$ and $\Delta E_{t}$ representing the sets of vertex changes and edge changes, respectively. Consequently, $V_{t} = V_{t - 1} \cup \Delta V_{t}$ and $E_{t} = E_{t - 1} \cup \Delta E_{t}$. Additionally, $G_{t} = (A_{t}, X_{t})$ if the network is attributed, where $A_{t}$ and $X_{t}$ denote the adjacency matrix and node feature matrix of the network at time $t$, respectively. Correspondingly, $\Delta G_{t} = (\Delta A_{t}, \Delta X_{t})$, where $\Delta A_{t}$ and $\Delta X_{t}$ represent the adjacency matrix and node feature matrix of the incremental network between time $t - 1$ and $t$, respectively.

Graph Neural Networks: A graph neural network is a general neural network designed to capture the dependencies between nodes in graph-structured data. It propagates information across nodes in the graph and learns embeddings that encode graph structures and node features through message passing. The representation of a node $v$ in the $l$-th layer is defined as follows:

$$h_{v}^{l} = \mathrm{UPDATE}\left(h_{v}^{l - 1}, \mathrm{AGG}\left(h_{u}^{l - 1}, \forall u \in N_{v}\right)\right), \tag{1}$$

where $h_{v}^{l}$ denotes the representation of node $v$ at the $l$-th layer, and $N_{v}$ is the set consisting of all neighboring nodes of node $v$. $\mathrm{AGG}(\cdot)$ and $\mathrm{UPDATE}(\cdot)$ denote the aggregation function and update function, respectively. The specific form of the aggregation function varies across GNN architectures and may include operations such as sum, attention, mean, or max pooling. The aggregated representations are typically passed through an update function, which may involve linear transformations and non-linear activation functions, to produce the final representation of each node in the $l$-th layer. Representative GNN models include GCNs [42], GraphSAGE [44], GATs [43], and GINs [45].
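As a minimal illustration of Eq. (1), the sketch below runs one message-passing layer in plain Python. It assumes a mean aggregator for AGG and a weight-free "add, then ReLU" rule for UPDATE; both choices, and all function names, are illustrative assumptions rather than the architecture used in this paper.

```python
from typing import Dict, List

Vector = List[float]


def aggregate_mean(neighbor_states: List[Vector], dim: int) -> Vector:
    """AGG: element-wise mean of neighbor representations (zeros if isolated)."""
    if not neighbor_states:
        return [0.0] * dim
    return [sum(col) / len(neighbor_states) for col in zip(*neighbor_states)]


def update(h_self: Vector, h_agg: Vector) -> Vector:
    """UPDATE: add self and aggregated states, then apply ReLU (no learned weights)."""
    return [max(0.0, a + b) for a, b in zip(h_self, h_agg)]


def message_passing_layer(
    h: Dict[str, Vector], neighbors: Dict[str, List[str]]
) -> Dict[str, Vector]:
    """Compute h_v^l from h^{l-1} for every node v, following Eq. (1)."""
    dim = len(next(iter(h.values())))
    return {
        v: update(h[v], aggregate_mean([h[u] for u in neighbors.get(v, [])], dim))
        for v in h
    }


# Toy graph: node "a" is connected to "b" and "c".
h0 = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [-4.0, 2.0]}
nbrs = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
h1 = message_passing_layer(h0, nbrs)
```

Real GNN layers such as GCN or GraphSAGE replace the fixed UPDATE rule with learned linear transformations; the structure of the computation is otherwise the same.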

3.2. Problem Formalization

As shown in Figure 1, the evolving Ethereum transaction network (ETN) is denoted by $G = \{G_{1}, G_{2}, \dots, G_{n}\}$, where $G_{n}$ is the snapshot of the network at time $t_{n}$, and $G_{n} = G_{n - 1} \cup \Delta G_{n}$. The term $\Delta G_{n}$ is the incremental transaction network (ITN) between time $t_{n - 1}$ and $t_{n}$ (stage $n$). We aim to train a sequence of GNN models $(f_{1}, f_{2}, \dots, f_{n})$ for the corresponding snapshots $\{G_{1}, G_{2}, \dots, G_{n}\}$ to continuously detect phishing nodes within the sample-incremental setting of continual graph learning, incrementally improving the model performance as the network evolves and eliminating catastrophic forgetting throughout this process. In this process, each subsequent model $f_{i}$ is trained using the previous model $f_{i - 1}$ as its initial state.

Notably, we employ the term “inter-stage edges” rather than “inter-task edges” in this work, as the task is single and constant—detecting newly added phishing nodes as the network evolves. In other continual graph learning scenarios with multiple tasks, “inter-task edges” is more appropriate.

4. Methodology

Figure 2 depicts the proposed approach, which consists of three steps, namely temporal partitioning, transaction subgraph extraction, and continual training.

4.1. Temporal Partitioning

Temporal partitioning involves splitting the entire ETN $G_{n}$ into multiple ITNs $\Delta G_{i}$ to accommodate the continual learning setting. Since each transaction is associated with a specific timestamp, the most straightforward and logical partitioning method is based on the chronological order of timestamps. Additionally, the ETN can be partitioned according to equal time intervals or by an equal number of phishing nodes. Specifically, the temporal partitioning is formalized as follows.

Temporal Partitioning: Let $G_{t} = (V_{t}, E_{t})$ be the snapshot of the ETN at time $t$. Choose a strictly increasing sequence of timestamps $T = (t_{1}, t_{2}, \dots, t_{n})$. For each $i = 1, 2, \dots, n$, define the incremental transaction network $\Delta G_{i} = (\Delta V_{i}, \Delta E_{i})$ by $\Delta E_{i} = \{e = (u, v, a, \tau) \in E_{t} \mid t_{i - 1} < \tau \leq t_{i}\}$ and $\Delta V_{i} = \{u \mid \exists v, (u, v) \in \Delta E_{i} \lor (v, u) \in \Delta E_{i}\}$. Temporal partitioning splits the ETN into a sequence of incremental transaction networks ${(\Delta G_{i})}_{i = 1}^{n}$.

Since the ETN is a weighted directed multigraph, there may be multiple directed transactions between any two accounts at different times. Consequently, an account may be associated with multiple timestamps. We designate the timestamp at which an account is first involved in a transaction as the basis for chronological order. After partitioning the ETN, multiple ITNs $\Delta G_{i}$ are obtained.
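The partitioning rule can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the edge layout `(u, v, amount, timestamp)`, the function names, the treatment of the undefined $t_{0}$ as minus infinity, and the decision to drop edges later than $t_{n}$ are all assumptions made for the example.

```python
from bisect import bisect_left
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, float, int]  # (u, v, amount a, timestamp tau)


def temporal_partition(edges: List[Edge], cutoffs: List[int]):
    """Split edges into ITNs: stage i holds edges with t_{i-1} < tau <= t_i.

    `cutoffs` is the strictly increasing sequence (t_1, ..., t_n). Assumption:
    t_0 is minus infinity, and edges later than t_n are dropped.
    """
    delta_E: List[List[Edge]] = [[] for _ in cutoffs]
    for e in edges:
        i = bisect_left(cutoffs, e[3])  # first index with cutoffs[i] >= tau
        if i < len(cutoffs):
            delta_E[i].append(e)
    # Delta V_i: every endpoint of an edge in Delta E_i.
    delta_V: List[Set[str]] = [
        {x for (u, v, _, _) in part for x in (u, v)} for part in delta_E
    ]
    return delta_E, delta_V


def first_seen(edges: List[Edge]) -> Dict[str, int]:
    """Earliest timestamp at which each account takes part in a transaction."""
    seen: Dict[str, int] = {}
    for u, v, _, tau in edges:
        for x in (u, v):
            if x not in seen or tau < seen[x]:
                seen[x] = tau
    return seen


edges = [("A", "B", 1.0, 5), ("B", "C", 2.0, 10), ("A", "C", 0.5, 11), ("C", "D", 3.0, 20)]
delta_E, delta_V = temporal_partition(edges, cutoffs=[10, 20])
```

In this toy run, accounts A and C appear in both stages; they are new in the first stage and reappearing in the second, which is exactly the situation that creates inter-stage edges.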

4.2. Transaction Subgraph Extraction

Phishing scam detection in Ethereum is essentially a typical node classification task. However, in the real-world ETN, there is a significant class imbalance between phishing nodes and non-phishing nodes: the number of phishing nodes is typically orders of magnitude smaller than the number of non-phishing nodes. This imbalance makes it highly challenging to conduct effective node classification for phishing detection in the ETN. To address this challenge, we propose transforming the node classification task into a graph classification task by introducing k-order transaction subgraphs (k-TSGs).

k-order subgraph: The k-order subgraph $G_{v}^{k}$ centered at a node $v$ is defined as $G_{v}^{k} = (V, E)$, with $V = \{u \mid \forall u, |d(u, v)| \leq k\}$ and $E = \{e_{u v} \mid \forall u, |d(u, v)| \leq k\}$, where $d(u, v)$ is the graph distance (hop) between nodes $u$ and $v$, as shown in Figure 3.

Given the extremely small proportion of phishing accounts and our goal to detect them across the entire ETN, we aim for the datasets to include all known phishing accounts. Therefore, we extract the k-TSGs of all phishing accounts in each ITN as positive samples. Next, we randomly select an equal number of k-TSGs from non-phishing accounts in the corresponding ITN as negative samples. In this manner, we can obtain multiple datasets $D_{i}$ for subsequent continual learning, with each dataset being class-balanced. This paper uses the first-order transaction subgraphs (1-TSGs) of nodes as the samples for graph classification.
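The extraction and balancing steps can be sketched as follows. The sketch makes two assumptions that are not stated in the text: hops are counted on an undirected view of the graph, and the edge set consists of the edges traversed while expanding nodes at distance less than k, so that k = 1 yields the star of edges incident to the center. The helper names are hypothetical.

```python
import random
from collections import deque
from typing import Dict, List, Set, Tuple


def k_order_subgraph(adj: Dict[str, Set[str]], center: str, k: int):
    """Return (nodes, edges) within k hops of `center` in an undirected view.

    Assumption: edges are those traversed while expanding nodes at distance
    < k, so for k = 1 the result is the star of edges incident to `center`.
    """
    dist = {center: 0}
    edges: Set[Tuple[str, ...]] = set()
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond k hops
        for w in adj.get(u, ()):
            edges.add(tuple(sorted((u, w))))
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist), edges


def balanced_centers(phishing: List[str], non_phishing: List[str], seed: int = 0):
    """Keep every phishing node and draw an equal number of non-phishing nodes."""
    rng = random.Random(seed)
    negatives = rng.sample(non_phishing, min(len(phishing), len(non_phishing)))
    return [(u, 1) for u in phishing] + [(u, 0) for u in negatives]


adj = {"p": {"a", "b"}, "a": {"p", "c"}, "b": {"p"}, "c": {"a"}}
nodes1, edges1 = k_order_subgraph(adj, "p", 1)
nodes2, edges2 = k_order_subgraph(adj, "p", 2)
centers = balanced_centers(["p"], ["a", "b", "c"])
```

Each labeled center then contributes one graph sample (its subgraph) to the stage dataset, which is how the node classification problem becomes a class-balanced graph classification problem.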

4.3. Continual Learning

After obtaining a sequence of class-balanced datasets, we can continuously perform phishing scam detection in the ETN through continual learning with graph classification. During this process, various strategies for continual learning can be implemented. We introduce a knowledge-augmented replay (KAR) method for continual graph learning.

4.3.1. Knowledge-Augmented Replay

Evidence from [66] suggests that phishing nodes exhibit distinct features and behavior patterns compared to non-phishing nodes. To make this precise, we adopt the theoretical framework of distributional separability.

Let $P_{i}^{a p} = \{z_{u}^{i} : u \in \Delta V_{i}^{a p}\}$ and $P_{i}^{a n} = \{z_{v}^{i} : v \in \Delta V_{i}^{a n}\}$ be the distributions of newly added phishing and non-phishing nodes in $\Delta G_{i}$, respectively, where $z_{u}^{i}$ and $z_{v}^{i}$ are the embeddings of nodes $u$ and $v$ produced by $f_{i}$. Let $\Delta_{i}^{a} = S(P_{i}^{a p}, P_{i}^{a n})$ be any chosen separability measure (e.g., distance, divergence, margin) between the two distributions of newly added phishing and non-phishing nodes in $\Delta G_{i}$. If $\Delta_{i}^{a}$ increases with $i$, the classifier’s decision boundary can better separate the two classes.

As the network evolves, the differences between phishing and non-phishing nodes tend to become more pronounced. For example, from time $t_{j}$ to $t_{i}$, a phishing account A typically accumulates larger increases in its neighbor count, degrees, and transaction amount than a representative non-phishing node B. As a result, the separability measure between their class-conditional feature distributions grows over time, i.e., $\Delta_{i}^{a} > \Delta_{i - 1}^{a}$.

Nevertheless, an increase in the separability between the newly added phishing and non-phishing nodes in $\Delta G_{i}$ alone does not guarantee that knowledge acquired at the current stage will positively transfer back to earlier stages, because the classifier is optimized solely to maximize $\Delta_{i}^{a}$. It may relocate its decision boundary to suit only the new data, thereby misclassifying earlier samples and suffering catastrophic forgetting.

By contrast, knowledge-augmented replay enriches the distribution by taking the union with samples in the memory buffer, i.e., $\Delta V_{i}^{p} = \Delta V_{i}^{a p} \cup M_{i}^{p}$ and $\Delta V_{i}^{n} = \Delta V_{i}^{a n} \cup M_{i}^{n}$. Let $P_{i}^{p} = \{z_{u}^{i} : u \in \Delta V_{i}^{p}\}$ and $P_{i}^{n} = \{z_{v}^{i} : v \in \Delta V_{i}^{n}\}$ be the distributions of all phishing and non-phishing nodes in $\Delta G_{i}$, respectively. Define $\Delta_{i} = S(P_{i}^{p}, P_{i}^{n})$ as the separability of $P_{i}^{p}$ and $P_{i}^{n}$. These two distributions incorporate both newly added samples and replay samples from earlier stages, encouraging the model to learn a decision boundary that simultaneously maximizes separability on both new and historical distributions, i.e., $\Delta_{i} \geq \max {(\Delta_{j})}_{j = 1}^{i - 1}$.
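To make the two quantities concrete, the sketch below computes a separability value on the newly added samples alone and on the union with a memory buffer. The text leaves the measure S open, so the Euclidean distance between class centroids used here is only one possible choice, and the two-dimensional embeddings are made-up toy values.

```python
import math
from typing import List, Sequence


def centroid(points: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty collection of embedding vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def separability(pos: Sequence[Sequence[float]], neg: Sequence[Sequence[float]]) -> float:
    """One possible S (an assumption): distance between the two class centroids."""
    return math.dist(centroid(pos), centroid(neg))


# Toy embeddings of newly added nodes at stage i.
new_pos = [[2.0, 0.0], [4.0, 0.0]]
new_neg = [[0.0, 0.0], [0.0, 0.0]]
# Toy embeddings of replayed (reappearing) nodes held in the memory buffer.
mem_pos = [[6.0, 0.0]]
mem_neg = [[0.0, 0.0]]

delta_a = separability(new_pos, new_neg)                        # new samples only
delta_all = separability(new_pos + mem_pos, new_neg + mem_neg)  # new + replay
```

In this toy case the replayed phishing sample lies farther from the non-phishing cluster than the new ones, so the measure on the union is larger; whether that happens in practice depends on the learned embeddings.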

In this paper, we track the activity of all known phishing accounts across ITNs and store the reappearing phishing nodes in the memory buffer. Additionally, to preserve the class balance, we randomly sample an equal number of reappearing non-phishing nodes from earlier ITNs to store in the buffer. As the ETN evolves, the features and structures of these reappearing accounts change, motivating a shift from node classification to graph classification. We therefore introduce evolutionary transaction subgraphs (ETSGs) to capture these differences, which are crucial for continual learning.

4.3.2. Evolutionary Transaction Subgraph

Unlike representing all transactions formed by many nodes over a period of time as a network, we describe all transactions associated with a specific node over a given period as a graph. Consequently, ETSGs depict the evolution process of a node over time. Specifically, ETSGs include incremental transaction subgraphs (ITSGs) and accumulative transaction subgraphs (ATSGs), which are formalized as follows.

Incremental/Accumulative Transaction Subgraph: In a dynamic transaction network $G_{t}$, let $\Delta G_{i} = (\Delta V_{i}, \Delta E_{i})$ be the incremental transaction network between time $t_{i - 1}$ and $t_{i}$, and $G_{i} = (V_{i}, E_{i})$ be the snapshot up to $t_{i}$. Given a node $u \in \Delta V_{i}$ and a pre-defined rule $R$, the incremental transaction subgraph $\Delta H_{u}^{i}$ of $u$ is defined as $\Delta H_{u}^{i} = (\Delta V_{u}^{i}, \Delta E_{u}^{i})$, where $\Delta E_{u}^{i} = R(u, \Delta G_{i}) \subset \Delta E_{i}$, and $\Delta V_{u}^{i} = V(\Delta E_{u}^{i})$ includes all endpoints of edges in $\Delta E_{u}^{i}$. The accumulative transaction subgraph of $u \in V_{i}$ is defined as $H_{u}^{i} = (V_{u}^{i}, E_{u}^{i})$, where $E_{u}^{i} = R(u, G_{i}) \subset E_{i}$, and $V_{u}^{i} = V(E_{u}^{i})$ contains all endpoints of edges in $E_{u}^{i}$. Here, $R(u, G)$ returns the set of transactions in $G$ that involve node $u$ according to rule $R$. In this paper, we define $R$ to extract all edges incident to $u$, and we refer to the resulting subgraph as the first-order transaction subgraph.

Figure 4a,b illustrate the first-order ITSG and first-order ATSG, respectively. ITSGs ($\Delta S_{u}^{i}$) capture a node’s incremental transactions within a specific time period, whereas ATSGs ($S_{u}^{i}$) represent all transactions associated with that node up to the current time. Formally, $\Delta S_{u}^{i} = S_{u}^{i} \setminus S_{u}^{i - 1}$.
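The relationship between the two subgraph types can be checked on a toy example. The sketch below implements the first-order rule (all edges incident to the node) over plain edge sets; it assumes that the edge sets of different stages are disjoint, as is the case for timestamped transactions, and the function names are illustrative.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # directed edge (sender, receiver)


def rule_first_order(u: str, edges: Set[Edge]) -> Set[Edge]:
    """R(u, G): all edges incident to u, in either direction."""
    return {e for e in edges if u in e}


def endpoints(edges: Set[Edge]) -> Set[str]:
    """V(E): every node that is an endpoint of some edge in E."""
    return {x for e in edges for x in e}


E_prev = {("u", "a"), ("a", "b")}                # edges of snapshot G_{i-1}
delta_E = {("c", "u"), ("u", "d"), ("b", "c")}   # edges of the ITN at stage i
E_curr = E_prev | delta_E                        # edges of snapshot G_i

itsg_edges = rule_first_order("u", delta_E)      # first-order ITSG of u
atsg_prev = rule_first_order("u", E_prev)        # first-order ATSG at stage i-1
atsg_curr = rule_first_order("u", E_curr)        # first-order ATSG at stage i
atsg_nodes = endpoints(atsg_curr)
```

Under the disjointness assumption, the incremental subgraph equals the set difference of consecutive accumulative subgraphs, matching the formal relation stated above.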

As the patterns of phishing and non-phishing nodes become increasingly distinct over time, retaining the 1-ATSGs of reappearing phishing and non-phishing nodes provides more discriminative examples. By replaying these examples during the current stage, the model can learn more discriminative representations, thereby improving its performance.

4.3.3. Overall Process

Algorithm 1 delineates an overall procedure for continual graph learning using the KAR method.

Algorithm 1 Continual Graph Learning with KAR Method

Input: Incremental transaction network $\Delta G_{i}$, previous model $f_{i - 1}$
Output: Performance matrix $R_{i j}$

1: Initialization: all sets initialized to the empty set
2: for $i = 1$ to $n$ do
3:     Assemble & simplify: $G_{i} \leftarrow \text{simplify}(G_{i - 1} \cup \Delta G_{i})$
4:     Extract nodes:
           $\Delta V_{i}^{p}, \Delta V_{i}^{a p}, \Delta V_{i}^{r p} \leftarrow \text{extract\_Phishing\_Nodes}(\Delta G_{i})$,
           $\Delta V_{i}^{n}, \Delta V_{i}^{a n}, \Delta V_{i}^{r n} \leftarrow \text{extract\_Non-Phishing\_Nodes}(\Delta G_{i})$,
           $C_{i}^{K p} \leftarrow \Delta V_{i}^{p}$
5:     Balanced sampling:
6:     $C_{i}^{K a n} \leftarrow \text{sample}(\Delta V_{i}^{a n}, |\Delta V_{i}^{a p}|)$
7:     for $j = 1$ to $i - 1$ do
8:         $\Delta V_{i j}^{r p} \leftarrow \Delta V_{i}^{p} \cap \Delta V_{j}^{a p}$, $\Delta V_{i j}^{r n} \leftarrow \Delta V_{i}^{n} \cap \Delta V_{j}^{a n}$
9:         $C_{i j}^{K r n} \leftarrow \text{sample}(\Delta V_{i j}^{r n}, |\Delta V_{i j}^{r p}|)$
10:    end for
11:    $C_{i}^{K r n} \leftarrow \bigcup_{j = 1}^{i - 1} C_{i j}^{K r n}$, $C_{i}^{K n} \leftarrow C_{i}^{K a n} \cup C_{i}^{K r n}$
12:    $M_{i} \leftarrow C_{i}^{K r p} \cup C_{i}^{K r n}$
13:    $C_{i}^{K} \leftarrow C_{i}^{K p} \cup C_{i}^{K n}$
14:    Subgraph & feature extraction:
15:    for all $u \in C_{i}^{K}$ do
16:        $S_{u}^{i} \leftarrow \text{extract\_1-Order\_ATSG}(G_{i}, u)$
17:        for all $v \in S_{u}^{i}$ do
18:            $x_{v}^{i} \leftarrow \text{compute\_Node\_Features}(G_{i}, v)$
19:        end for
20:    end for
21:    Dataset & train: $D_{i} \leftarrow \{(S_{u}^{i}, x_{v}^{i}) \mid u \in C_{i}^{K}, v \in S_{u}^{i}\}$, $\{D_{i}^{tr}, D_{i}^{te}\} \leftarrow \text{split}(D_{i})$
22:    $f_{i} \leftarrow \text{train}(f_{i - 1}, D_{i}^{tr})$
23:    for $j = 1$ to $i$ do
24:        $R_{i, j} \leftarrow \text{eval}(f_{i}, D_{j}^{te})$
25:    end for
26: end for
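The outer loop of Algorithm 1 (warm-started training followed by evaluation on all stages seen so far) can be sketched generically. The sketch below is not the authors' code: `train` and `evaluate` are caller-supplied placeholders, and the toy "model" in the usage example is just a set of memorized items, chosen so the behavior is easy to follow.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
Data = TypeVar("Data")


def continual_loop(
    stages: Sequence[Tuple[Data, Data]],
    init_model: Model,
    train: Callable[[Model, Data], Model],
    evaluate: Callable[[Model, Data], float],
) -> List[List[Optional[float]]]:
    """Train f_i from f_{i-1} on stage i, then fill row i of the matrix R.

    R[i][j] is the score of f_i on the test split of stage j (j <= i);
    entries for stages not yet seen stay None.
    """
    n = len(stages)
    R: List[List[Optional[float]]] = [[None] * n for _ in range(n)]
    model = init_model
    for i, (train_i, _test_i) in enumerate(stages):
        model = train(model, train_i)  # f_i is initialized from f_{i-1}
        for j in range(i + 1):
            R[i][j] = evaluate(model, stages[j][1])
    return R


# Toy usage: the "model" memorizes training items; the score is the fraction
# of test items it has seen. Each stage is (train split, test split).
stages = [(["a", "b"], ["a", "x"]), (["x"], ["x"])]
R = continual_loop(
    stages,
    init_model=frozenset(),
    train=lambda model, data: model | set(data),
    evaluate=lambda model, data: sum(x in model for x in data) / len(data),
)
```

In the toy run, the score on the first stage rises from 0.5 to 1.0 after the second stage is learned, a small-scale picture of the kind of backward improvement a lower-triangular performance matrix is meant to expose.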

Given a timestamp $t_{i}$, we extract the ITN $\Delta G_{i}$ and assemble the cumulative snapshot $G_{i}$, which comprises all transactions up to $t_{i}$. Initially, the snapshot is a multigraph. We simplify the multigraph into a simple graph (still denoted by $G_{i}$ for convenience) by merging parallel edges and summing their transaction amounts. This simplification preserves the main node features of the ETN while avoiding the additional complexity of handling multi-edge structures during GNN message passing.
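The simplification step amounts to a group-by over edge endpoints. The following sketch assumes that edge direction is preserved, so that only transactions with the same sender and the same receiver are merged; the text does not state this explicitly, and the edge layout is the same hypothetical `(u, v, amount, timestamp)` tuple used for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

MultiEdge = Tuple[str, str, float, int]  # (sender, receiver, amount, timestamp)


def simplify(multi_edges: List[MultiEdge]) -> Dict[Tuple[str, str], float]:
    """Merge parallel directed edges (same u -> v) and sum their amounts."""
    merged: Dict[Tuple[str, str], float] = defaultdict(float)
    for u, v, amount, _tau in multi_edges:
        merged[(u, v)] += amount
    return dict(merged)


transactions = [("A", "B", 1.0, 1), ("A", "B", 2.5, 7), ("B", "A", 0.5, 9)]
simple_edges = simplify(transactions)
```

Note that timestamps are discarded by the merge, so any temporal ordering needed later (for example, for partitioning) has to be derived before this step.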

At each stage $i$, we first extract phishing and non-phishing nodes from the ITN $\Delta G_{i}$, distinguishing nodes newly added in $\Delta G_{i}$ from those reappearing from earlier ITNs. We then construct the class-balanced central node set $C_{i}^{K}$ by balanced sampling, selecting equal numbers of phishing and non-phishing nodes. The memory buffer $M_{i}$ is populated with the reappearing phishing nodes and an equal number of non-phishing nodes. Next, for every central node $u \in C_{i}^{K}$, we extract its 1-ATSG $S_{u}^{i}$ from the simplified snapshot $G_{i}$ and compute node features for each node $v \in S_{u}^{i}$. The collection $\{S_{u}^{i} \mid u \in C_{i}^{K}\}$ constitutes the dataset $D_{i}$, which we use to train the GNN model and evaluate its performance under continual learning settings. Further implementation details are given in Section 5.
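The construction of the memory buffer described above can be sketched with set operations. This is a hedged reading of the procedure, not the authors' implementation: the function and argument names are hypothetical, and the sample size is capped at the number of available reappearing non-phishing nodes (an added safeguard) so that sampling never fails.

```python
import random
from typing import Dict, List, Set


def build_memory(
    stage_pos: Set[str],
    stage_neg: Set[str],
    earlier_new_pos: List[Set[str]],
    earlier_new_neg: List[Set[str]],
    seed: int = 0,
) -> Dict[str, Set[str]]:
    """Collect reappearing phishing nodes and a matching number of negatives.

    `stage_pos` / `stage_neg` are the phishing / non-phishing nodes active in
    the current ITN; `earlier_new_*[j]` are the nodes first added at stage j.
    """
    rng = random.Random(seed)
    replay_pos: Set[str] = set()
    replay_neg: Set[str] = set()
    for new_pos_j, new_neg_j in zip(earlier_new_pos, earlier_new_neg):
        re_pos = stage_pos & new_pos_j  # phishing nodes reappearing from stage j
        re_neg = stage_neg & new_neg_j  # non-phishing nodes reappearing from stage j
        k = min(len(re_pos), len(re_neg))  # safeguard: never oversample
        replay_pos |= re_pos
        replay_neg |= set(rng.sample(sorted(re_neg), k))
    return {"phishing": replay_pos, "non_phishing": replay_neg}


memory = build_memory(
    stage_pos={"p1", "p3"},
    stage_neg={"n1", "n2", "n4"},
    earlier_new_pos=[{"p1", "p2"}],
    earlier_new_neg=[{"n1", "n2", "n3"}],
)
```

Each node in the buffer would then be represented by its current 1-ATSG rather than the subgraph stored at its first appearance, which is what distinguishes this replay from replaying frozen historical samples.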

5. Experiments

We conduct extensive experiments to evaluate the performance of the proposed method on detecting Ethereum phishing scams.

5.1. Data

We construct the ETN dataset for continual graph learning based on ETN data on XBlock (https://xblock.pro/#/dataset/13, accessed on 15 March 2025), which contains 2,973,489 addresses, 13,551,303 transactions, and 1165 labeled phishing addresses. This transaction network is composed of 13 connected components, where a connected component is a network in which all nodes are reachable from each other through (un)directed edges. We select the largest component, which contains 2,973,382 addresses, 13,551,214 transactions, and 1157 labeled phishing addresses, to construct the continual graph learning datasets. Table 2 summarizes the statistics of this dataset.

Although node features are not strictly required for graph classification tasks, they can provide valuable information that enables GNNs to more effectively capture patterns and features within the graph structure. In particular, we extract the following node features:

Node labels: Each node is labeled to indicate whether it represents a phishing or non-phishing entity. Although phishing scam detection in Ethereum is fundamentally a node classification task, these node labels can also serve as features when reformulating the problem as a graph classification task.
The number of neighbors: A neighbor of a given node is any node reachable from it. First-order neighbors are directly connected to the given node, while higher-order neighbors are reached through one or more intermediate nodes. For instance, second-order neighbors are connected via one intermediate node, and k-order neighbors are connected through $k - 1$ intermediaries. In this paper, we consider the number of first-order neighbors as a node feature.
In-degree: The number of edges directed towards a node, indicating how many edges terminate at that node. In the ETN, in-degree represents the number of incoming transactions an account receives from other accounts.
Out-degree: The number of edges originating from a node, indicating how many edges start from that node. In the ETN, out-degree reflects the number of outgoing transactions from an account to other accounts.
Total degree: The sum of in-degree and out-degree, representing the total number of edges connected to a node, regardless of direction. In the ETN, the total degree corresponds to the number of transactions associated with an account.
In-strength: In a weighted graph, in-strength is the sum of the weights of all incoming edges to a node, serving as the weighted counterpart of in-degree. It captures the total influence or flow that a node receives from its neighbors. In the ETN, it refers to the total amount of Ether received by an account.
Out-strength: In a weighted graph, out-strength is the sum of the weights of all outgoing edges from a node, representing the weighted counterpart of out-degree. It reflects the total influence or flow that a node exerts on its neighbors. In the ETN, it denotes the total amount of Ether sent by an account.
Total strength: The sum of in-strength and out-strength, representing the total weighted connections of a node. It quantifies the overall influence the node exerts and receives within the network. In the ETN, it refers to the total amount of Ether involved in an account’s transactions.
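
The structural features listed above (everything except the node label) can be computed in a single pass over a transaction list. The sketch below is illustrative: the function name `node_features`, the feature keys, and the `(sender, receiver, amount)` layout are assumptions. Degrees here count individual transactions; on a graph whose parallel edges have already been merged they would count distinct counterparties instead.

```python
from collections import defaultdict


def node_features(transactions):
    """Compute degree, strength, and neighbor-count features per account.

    `transactions` holds (sender, receiver, amount) tuples (assumed layout).
    """
    feats = defaultdict(lambda: {"in_degree": 0, "out_degree": 0,
                                 "in_strength": 0.0, "out_strength": 0.0})
    neighbors = defaultdict(set)
    for sender, receiver, amount in transactions:
        feats[sender]["out_degree"] += 1
        feats[sender]["out_strength"] += amount
        feats[receiver]["in_degree"] += 1
        feats[receiver]["in_strength"] += amount
        # First-order neighbors ignore edge direction.
        neighbors[sender].add(receiver)
        neighbors[receiver].add(sender)
    result = {}
    for node, f in feats.items():
        result[node] = dict(
            f,
            num_neighbors=len(neighbors[node]),
            total_degree=f["in_degree"] + f["out_degree"],
            total_strength=f["in_strength"] + f["out_strength"],
        )
    return result


txs = [("a", "b", 1.0), ("a", "b", 2.0), ("c", "a", 4.0)]
features = node_features(txs)
print(features["a"])
```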

5.2. ETN Partition

Following the chronological order, we propose partitioning the ETN based on the principle of an approximately equal number of phishing accounts. Specifically, we first sort all 1157 phishing accounts according to their first timestamps and then allocate them into 10 ITNs on average (i.e., 116 for the first 7 ITNs and 115 for the last 3 ITNs).

Let the first timestamp of a phishing node

v_{i}

be

t_{i}^{1} (i = 1, 2, \dots, 1157)

. The first ITN contains all transactions between the beginning timestamp of the ETN (denoted by

t_{0}

) and the timestamp

t_{117}^{1} - 1

. The second ITN contains all transactions between the timestamp

t_{117}^{1}

and

t_{233}^{1} - 1

, and so forth. The last ITN contains all transactions between the timestamp

t_{1043}^{1}

and the last timestamp (denoted as

t_{n}

). In this way, we derive 10 ITNs (

Δ G_{1}, Δ G_{2}, \dots, Δ G_{10}

) by partitioning the ETN

G_{n}

, along with the distribution of phishing nodes across these ITNs, as detailed in Table 3.
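
The stage sizes and the phishing-account indices that mark the ITN boundaries follow from integer division of 1157 by 10. The helper below is a hypothetical sketch (the name `itn_start_indices` is not from the paper); it returns, for each ITN, the 1-based rank of the first phishing account assigned to it. The first ITN actually starts at the beginning timestamp of the ETN, so its entry only marks the first phishing account.

```python
def itn_start_indices(num_phishing=1157, num_stages=10):
    """Split sorted phishing accounts into near-equal groups.

    Returns (sizes, starts), where starts[s] is the 1-based rank of the
    first phishing account that falls into stage s + 1.
    """
    base, extra = divmod(num_phishing, num_stages)
    # The first `extra` stages receive one additional account.
    sizes = [base + 1 if s < extra else base for s in range(num_stages)]
    starts, next_index = [], 1
    for size in sizes:
        starts.append(next_index)
        next_index += size
    return sizes, starts


sizes, starts = itn_start_indices()
print(sizes)   # [116, 116, 116, 116, 116, 116, 116, 115, 115, 115]
print(starts)  # [1, 117, 233, 349, 465, 581, 697, 813, 928, 1043]
```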

Note that the above partitioned ITNs include the inter-stage edges. To further investigate the influence of inter-stage edges on continual learning performance, we also perform continual learning without inter-stage edges.

5.3. Dataset Construction

5.3.1. Central Node Set Construction

After obtaining ITNs, we construct a corresponding class-balanced dataset

D_{i}

for each ITN

Δ G_{i}

. Since we transform the node classification into a graph classification task, we first determine the central node set of

D_{i}

. Under the KAR method, the central node set

C_{i}^{K}

is the union of the phishing central node set

C_{i}^{K p}

and the central non-phishing node set

C_{i}^{K n}

.

To maximize the presence of phishing examples in the dataset, we incorporate all phishing nodes into the central node set. Therefore,

C_{i}^{K p} = Δ V_{i}^{p}

. Furthermore, the phishing central node set comprises both newly added and reappearing phishing central nodes, denoted by

C_{i}^{K a p}

and

C_{i}^{K r p}

, respectively. Particularly,

C_{i}^{K a p} = Δ V_{i}^{a p}

, and

C_{i}^{K r p} = Δ V_{i}^{r p}

.

The reappearing phishing central nodes are those that were newly added in earlier ITNs and then reappear in

Δ G_{i}

. Formally, for each

j < i

, let

Δ V_{i j}^{r p} = Δ V_{i}^{p} \cap Δ V_{j}^{a p}

be the set of phishing nodes newly added in

Δ G_{j}

that reappear in

Δ G_{i}

. Then the total reappearing phishing node set is

Δ V_{i}^{r p} = ⋃_{j = 1}^{i - 1} Δ V_{i j}^{r p}

.

Analogously, the central non-phishing node set

C_{i}^{K n}

is partitioned into newly added and reappearing subsets, denoted by

C_{i}^{K a n}

and

C_{i}^{K r n}

, respectively. Since non-phishing nodes far outnumber phishing nodes, we maintain a consistent distribution of phishing and non-phishing nodes within the central node set by uniformly sampling from the corresponding pools of non-phishing candidates.

Specifically, we obtain the newly added central non-phishing node set by uniformly sampling a subset

C_{i}^{K a n} \subset Δ V_{i}^{a n}

such that

| C_{i}^{K a n} | = | C_{i}^{K a p} | = | Δ V_{i}^{a p} |

. For the reappearing central non-phishing nodes, we first randomly sample

C_{i j}^{K r n} \subset Δ V_{i j}^{r n}

with

| C_{i j}^{K r n} | = | Δ V_{i j}^{r p} |

for each

j < i

and then aggregate

C_{i}^{K r n} = ⋃_{j = 1}^{i - 1} C_{i j}^{K r n}

. In this way, we obtain the central node set

C_{i}^{K} = C_{i}^{K p} \cup C_{i}^{K n} = Δ V_{i}^{p} \cup C_{i}^{K a n} \cup C_{i}^{K r n}

.

To illustrate, we present a concrete example of central node set construction. As shown in Table 3,

Δ G_{3}

contains 158 phishing nodes in total: 116 newly added (

| Δ V_{3}^{a p} | = 116

) and 42 reappearing (

| Δ V_{3}^{r p} | = 42

), of which 8 reappear from

Δ G_{1}

(

| Δ V_{31}^{r p} | = 8

) and 34 reappear from

Δ G_{2}

(

| Δ V_{32}^{r p} | = 34

). Accordingly, the phishing central node set also contains 158 phishing nodes (

| C_{3}^{K p} | = | Δ V_{3}^{p} | = 158

).

Next, we randomly sample 116 non-phishing nodes from the newly added non-phishing node set

Δ V_{3}^{a n}

so that

| C_{3}^{K a n} | = 116

. Similarly, we randomly sample 8 non-phishing nodes from the reappearing non-phishing node pool in

Δ G_{1}

(

| C_{31}^{K r n} | = 8

) and 34 from the corresponding pool in

Δ G_{2}

(

| C_{32}^{K r n} | = 34

), thereby matching the distribution of reappearing phishing nodes. Consequently, the central node set in the third stage contains 316 central nodes in total (

| C_{3}^{K} | = | Δ V_{3}^{p} | + | C_{3}^{K n} | = 316

).
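
The stage-3 example can be reproduced with a small sampling routine. This is a sketch under stated assumptions: the function name `build_central_set`, the argument layout, the synthetic node identifiers, and the pool sizes are all hypothetical, and the non-phishing pools are assumed to be disjoint.

```python
import random


def build_central_set(new_phish, reappear_phish_by_stage,
                      new_nonphish_pool, reappear_nonphish_pools, seed=0):
    """Return (phishing, non_phishing) central node sets for one stage.

    All phishing nodes are kept; non-phishing nodes are sampled uniformly so
    that each group (newly added, reappearing from stage j) is matched in size.
    """
    rng = random.Random(seed)
    phishing = set(new_phish)
    for nodes in reappear_phish_by_stage.values():
        phishing |= set(nodes)
    # Match the number of newly added phishing nodes.
    non_phishing = set(rng.sample(sorted(new_nonphish_pool), len(new_phish)))
    # Match the number of phishing nodes reappearing from each earlier stage.
    for j, nodes in reappear_phish_by_stage.items():
        pool = sorted(reappear_nonphish_pools[j])
        non_phishing |= set(rng.sample(pool, len(nodes)))
    return phishing, non_phishing


# Synthetic identifiers with the stage-3 counts from Table 3 (116, 8, 34).
new_phish = [f"p_new_{k}" for k in range(116)]
reappear_phish = {1: [f"p_s1_{k}" for k in range(8)],
                  2: [f"p_s2_{k}" for k in range(34)]}
new_pool = [f"n_new_{k}" for k in range(1000)]
reappear_pools = {1: [f"n_s1_{k}" for k in range(200)],
                  2: [f"n_s2_{k}" for k in range(200)]}

phishing, non_phishing = build_central_set(
    new_phish, reappear_phish, new_pool, reappear_pools)
print(len(phishing), len(non_phishing), len(phishing | non_phishing))
```

With these inputs the routine yields 158 phishing and 158 non-phishing central nodes, 316 in total.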

5.3.2. Subgraph and Node Feature Extraction

We then extract subgraphs and compute node features for each node contained within each subgraph. Specifically, we extract 1-ATSG

S_{u}^{i}

from each snapshot

G_{i}

for each central node

u \in C_{i}^{K}

and compute the node features

x_{v}^{i}

of each node

v \in S_{u}^{i}

. The resulting dataset

D_{i}

comprises all such extracted 1-ATSGs.
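
As a rough illustration of subgraph extraction, the sketch below assumes that a first-order transaction subgraph is the subgraph induced by a central node and its first-order neighbors; the formal definition of the 1-ATSG is given earlier in the paper and may differ in detail. The function name `one_hop_subgraph` and the edge dictionary layout are hypothetical.

```python
def one_hop_subgraph(edges, center):
    """Return (nodes, sub_edges) induced by `center` and its direct neighbors.

    `edges` maps (sender, receiver) -> amount for a simple directed graph.
    """
    nodes = {center}
    for u, v in edges:
        if u == center:
            nodes.add(v)
        elif v == center:
            nodes.add(u)
    # Keep every edge whose two endpoints both lie inside the node set.
    sub_edges = {(u, v): w for (u, v), w in edges.items()
                 if u in nodes and v in nodes}
    return nodes, sub_edges


edges = {("a", "b"): 1.0, ("c", "a"): 2.0, ("b", "c"): 3.0, ("c", "d"): 4.0}
nodes, sub_edges = one_hop_subgraph(edges, "a")
print(sorted(nodes), sorted(sub_edges))
```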

Additionally, we include naïve and retraining strategies for evaluation. The naïve approach simply fine-tunes the model on each new ITN without any forgetting mitigation, serving as the lower bound for continual learning performance. In contrast, the retraining method retrains the model from scratch on the accumulated data at every stage, thereby establishing an upper bound. The main difference between these two methods and the KAR method lies in the construction of the central node set and the extraction of subgraphs.

Taking stage 3 as an example,

Δ G_{3}

introduces 116 newly added phishing nodes. Under the naïve strategy, the phishing central node set consists solely of these 116 nodes. By contrast, the retraining approach aggregates all newly added phishing nodes from stages 1 through 3—totaling 348—into the phishing central node set. The sampling procedure for non-phishing central nodes mirrors that of the KAR method. As for subgraph extraction, the naïve approach extracts 1-ITSGs

Δ S_{u}^{3}

from

Δ G_{3}

, whereas the retraining method extracts 1-ATSGs

S_{u}^{3}

from

G_{3}

.

For the scenario in which inter-stage edges are removed, we extract 1-ITSGs from

Δ G_{i}^{'}

for the central nodes under the naïve method, and 1-ITSGs from

G_{i}^{'}

for the central nodes under both the KAR and retraining methods.

5.4. Experimental Settings

5.4.1. Models

We conduct experiments using the following GNNs.

GCN [42]: The GCN uses convolutional operations on graph data to aggregate information from neighboring nodes through a first-order approximation of spectral convolution.
GAT [43]: The GAT employs attention mechanisms to assign different weights to the edges connecting a node to its neighbors, enabling more flexible and context-aware feature aggregation in graph neural networks.
GATv2 [67]: GATv2 improves upon the original GAT by using a dynamic attention mechanism that better captures the importance of neighboring nodes, resulting in more accurate and expressive attention weights.
GIN [45]: The GIN improves the expressiveness of graph neural networks by using sum aggregation, which closely approximates the Weisfeiler–Lehman graph isomorphism test, making it capable of distinguishing a wider variety of graph structures.

All four models share a common blueprint: two stacked graph convolutional layers (each followed by a nonlinear activation and 50% dropout); a global pooling operation that collapses node embeddings into a single graph-level vector; and a final linear classifier that produces the output logits.

In the GCN variant, both convolutional layers are implemented as GCNConv modules that project node features into a d-dimensional latent space, each followed by a ReLU activation. The GAT and GATv2 versions replace these with attention-based convolutions: the first layer employs eight parallel heads, each producing a d-dimensional output that is concatenated into a

8 d

-dimensional representation and subjected to ELU activation and dropout, before a single-head projection back to d dimensions; the only difference between them is the use of GATConv versus GATv2Conv. The GIN model instead uses two GINConv layers, each internally parameterized by an MLP that applies a linear map to d dimensions, batch normalization, a ReLU, and a second linear layer with ReLU. After each convolution, global sum pooling yields two d-dimensional graph summaries that are concatenated into a

2 d

-dimensional embedding and passed through a two-layer MLP head with a

2 d

-unit hidden layer, ReLU activation, and 50% dropout prior to the final classification.

5.4.2. Training

We train the aforementioned GNN models using the following parameters: the proportion of the training set within each dataset (

α

), the number of neurons in the hidden layers (d), the maximum number of epochs (m), the learning rate (

η

), and the batch size (b). Specifically, we set

α

to 0.7, d to 32, m to (100, 300, 500),

η

to (0.001, 0.003, 0.005), and b to 32. All datasets are randomly shuffled before being split into training and test sets.

Training is conducted on a server equipped with an NVIDIA V100 GPU with 16 GB of GPU memory. The models are implemented in PyTorch v2.1.2, with the GNNs constructed using the PyTorch Geometric library.

5.4.3. Evaluation

We assess the overall performance and degree of forgetting of the models using the average accuracy (ACC) and backward transfer (BWT), respectively, as defined in [68].

\begin{matrix} {ACC}_{n} & = \frac{1}{n} \sum_{i = 1}^{n} R_{n, i} \end{matrix}

(2)

\begin{matrix} {BWT}_{n} & = \frac{1}{n - 1} \sum_{i = 1}^{n - 1} (R_{n, i} - R_{i, i}), \end{matrix}

(3)

where n is the number of all stages, and

R_{i, j}

refers to the test performance on the

j th

stage after training on the

i th

stage.

ACC describes the overall performance of a continual learning model across all datasets after sequentially training on all n stages. BWT measures the extent to which a model forgets previously learned knowledge. Therefore, we use overall performance and average accuracy interchangeably, as well as backward transfer and average forgetting.
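
Equations (2) and (3) translate directly into code once the results are stored in a lower-triangular matrix. The snippet below is a minimal sketch; the function names and the accuracy values in `R` are made up for illustration and are not results from the paper.

```python
def average_accuracy(R, n):
    """ACC_n per Equation (2).

    R[i][j] is the accuracy on stage j + 1 after training on stage i + 1.
    """
    return sum(R[n - 1][i] for i in range(n)) / n


def backward_transfer(R, n):
    """BWT_n per Equation (3); defined for n >= 2."""
    return sum(R[n - 1][i] - R[i][i] for i in range(n - 1)) / (n - 1)


# Illustrative values only; entries above the diagonal are unused.
R = [
    [0.80, 0.00, 0.00],
    [0.85, 0.82, 0.00],
    [0.90, 0.86, 0.84],
]
acc3 = average_accuracy(R, 3)   # approximately 0.867
bwt3 = backward_transfer(R, 3)  # approximately 0.07
print(acc3, bwt3)
```

A positive `bwt3` means that accuracy on stages 1 and 2 improved after training on stage 3, relative to the accuracy measured right after each stage was first learned.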

5.5. Results

5.5.1. Overall Performance

Table 4 compares the average accuracy and average forgetting of four GNNs after training on all 10 datasets across two scenarios. It shows that retaining inter-stage edges significantly improves the model’s overall performance and backward transfer compared to removing them.

In the scenario without inter-stage edges, nearly all models exhibit catastrophic forgetting across all three methods, as indicated by the negative backward transfer values, with the exception of the retraining-o method using a GIN. However, in the scenario with inter-stage edges, both the KAR and retraining methods eliminate catastrophic forgetting for all models. Particularly, the KAR method achieves performance comparable to the retraining method.

An intriguing observation is that, in the scenario without inter-stage edges, the replay-o method unexpectedly underperforms compared to the naïve-o method. A reasonable explanation is the high proportion of recurring nodes in the ETN, resulting in a significant number of inter-stage edges between ITNs. When inter-stage edges are removed, phishing nodes lose substantial transactional information, making their 1-TSGs less distinguishable from those of non-phishing nodes. The replay-o method introduces more of these indistinguishable samples, which paradoxically leads to worse performance compared to the naïve-o method.

5.5.2. Stage-Wise Average Accuracy

We also compare the stage-wise average accuracy of four GNNs using three methods under two scenarios, as shown in Figure 5. The plots demonstrate that retaining inter-stage edges allows the model to incrementally improve overall performance or maintain stability, while removing inter-stage edges causes the model’s overall performance to decline sharply over time and eventually stabilize.

When inter-stage edges are removed, all methods experience a sharp decline in average accuracy over time, particularly during the initial stages. The naïve-o and replay-o methods eventually stabilize around 50%, while the retraining-o method performs slightly better, stabilizing at 60%. Conversely, when inter-stage edges are retained, all methods exhibit an incremental increase in average accuracy over time, except for those using the GIN model, which remain stable or show a slight decrease. Specifically, the KAR and retraining methods reach or exceed 90%, consistently delivering the best performance. The KAR method achieves results comparable to the retraining method while using less memory.

These results highlight that removing inter-stage edges hinders knowledge transfer, leading to a gradual decline in overall performance. In contrast, retaining inter-stage edges facilitates knowledge transfer, promoting improvements in overall performance over time.

5.5.3. Stage-Wise Average Forgetting

Figure 6 plots the stage-wise average forgetting of four GNNs using three methods across two scenarios. Each method generally shows stronger positive backward transfer when inter-stage edges are retained than when they are removed. Furthermore, in conjunction with stage-wise average accuracy, the results reveal that incremental performance improvement does not necessarily coincide with positive backward transfer values.

When inter-stage edges are removed, the naïve-o and replay-o methods consistently exhibit negative backward transfer values across all stages, indicating persistent catastrophic forgetting. The retraining-o method shows predominantly positive backward transfer, remaining stable or with slight fluctuations around zero in most cases.

In contrast, when inter-stage edges are retained, both the KAR and retraining methods show predominantly positive backward transfer values that are significantly above zero, with the exception of their implementations using the GIN model. This outcome suggests positive knowledge transfer in most cases, effectively overcoming catastrophic forgetting. Notably, the KAR method achieves performance comparable to the retraining method while using less memory. The naïve method displays considerable fluctuations: only a few stages show positive backward transfer values, and catastrophic forgetting remains prevalent in most stages.

5.5.4. Comparison with Existing Methods

We further compare the proposed method with several existing baselines, EWC [53], LwF [52], and TWP [10], under a unified training protocol, using a maximum of 300 epochs, a batch size of 32, and a learning rate of

0.005

. Table 5 presents the results under two scenarios.

As shown in Table 5, in the scenario of removing inter-stage edges, the replay-o method confers no clear advantage: the retraining-o method achieves the best results in most cases, while the replay-o method lags behind or at best matches other methods. Moreover, all approaches exhibit negative backward transfer values in this setting, indicating persistent forgetting of prior knowledge. In contrast, when inter-stage edges are preserved, every method attains substantially improved backward transfer, with most values turning positive, and the KAR method not only matches or slightly exceeds the retraining method in average accuracy across all four GNNs but also surpasses established baselines (EWC, LwF, TWP) on both accuracy and forgetting metrics. These results underscore the critical importance of maintaining inter-stage connectivity in the Ethereum transaction network for continuous and effective phishing detection.

5.6. Determinants of Performance Improvement

From the results of stage-wise average accuracy and stage-wise average forgetting, we observe that performance improvement does not necessarily require positive backward transfer values, which somewhat contradicts intuition. Figure 7 depicts the changes in average accuracy and forgetting values of the KAR method across different stages.

Figure 7 demonstrates that the variations in backward transfer closely mirror those in average accuracy, with their trends remaining consistent in most cases. However, certain exceptions to this alignment are observed. To further investigate this effect, we explore the mathematical relationship between them starting from their definitions.

According to Equations (2) and (3), we have

\begin{matrix} n \cdot {ACC}_{n} = \sum_{i = 1}^{n} R_{n, i} = \sum_{i = 1}^{n - 1} R_{n, i} + R_{n, n}, \end{matrix}

(4)

\begin{matrix} (n - 1) \cdot {BWT}_{n} = \sum_{i = 1}^{n - 1} R_{n, i} - \sum_{i = 1}^{n - 1} R_{i, i} . \end{matrix}

(5)

Therefore,

{ACC}_{n}

can be expressed as a function of

{BWT}_{n}

. In other words, we have

\begin{matrix} {ACC}_{n} = \frac{n - 1}{n} \cdot {BWT}_{n} + \frac{1}{n} \sum_{i = 1}^{n} R_{i, i} = \frac{n - 1}{n} \cdot {BWT}_{n} + {\bar{S}}_{n}, \end{matrix}

(6)

where

{\bar{S}}_{n} = \frac{1}{n} \sum_{i = 1}^{n} R_{i, i} = \frac{S_{n}}{n}

is the average initial performance over all known n stages.

To uncover the relationship between the changes in overall performance and backward transfer, we define

Δ {ACC}_{n} = {ACC}_{n} - {ACC}_{n - 1}

as the change in overall performance after training on the

n th

stage, compared to the performance after training on the

(n - 1) th

stage in the continual learning process. According to Equation (6), we obtain the following equation:

\begin{matrix} Δ {ACC}_{n} = (\frac{n - 1}{n} {BWT}_{n} + {\bar{S}}_{n}) - (\frac{n - 2}{n - 1} {BWT}_{n - 1} + {\bar{S}}_{n - 1}) . \end{matrix}

(7)

This equation can be further simplified to the following one:

\begin{matrix} Δ {ACC}_{n} = \frac{n - 1}{n} Δ {BWT}_{n} + \frac{1}{n (n - 1)} {BWT}_{n - 1} + \frac{1}{n} (R_{n, n} - {\bar{S}}_{n - 1}), \end{matrix}

(8)

where

Δ {BWT}_{n} = {BWT}_{n} - {BWT}_{n - 1}

represents the change in backward transfer after training on the

n th

stage, compared to the backward transfer after training on the

(n - 1) th

stage.

Equation (8) indicates that overall performance change is primarily influenced by the variation in backward transfer, alongside its value and the initial performance of each stage.
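
The simplification from Equation (7) to Equation (8) rests on two regroupings, written here with the same symbols. The first splits the backward transfer terms, and the second uses the recursion for the running average of initial performance.

```latex
\begin{aligned}
\frac{n-1}{n}\,{BWT}_{n} - \frac{n-2}{n-1}\,{BWT}_{n-1}
  &= \frac{n-1}{n}\left({BWT}_{n} - {BWT}_{n-1}\right)
     + \left(\frac{n-1}{n} - \frac{n-2}{n-1}\right){BWT}_{n-1} \\
  &= \frac{n-1}{n}\,\Delta{BWT}_{n} + \frac{1}{n(n-1)}\,{BWT}_{n-1}, \\[4pt]
{\bar{S}}_{n} - {\bar{S}}_{n-1}
  &= \frac{(n-1)\,{\bar{S}}_{n-1} + R_{n,n}}{n} - {\bar{S}}_{n-1}
   = \frac{1}{n}\left(R_{n,n} - {\bar{S}}_{n-1}\right).
\end{aligned}
```

Adding the two results gives the right-hand side of Equation (8).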

5.6.1. Backward Transfer

In the early stages of continual learning, both the value of backward transfer and the initial performance of each stage have a notable influence on overall performance changes. This is particularly evident in cases where positive changes in backward transfer do not necessarily lead to improvements in overall performance. A clear example of this can be observed in the naïve-o method, as shown in Figure 5d and Figure 6d.

When

n = 2

, the naïve-o method exhibits a large negative backward transfer value, indicating substantial forgetting. Although there is a positive change in backward transfer from stages two to three, this improvement is insufficient to offset the negative impact of the large negative backward transfer and the continuously declining initial performance, leading to a further decrease in overall performance. This highlights the critical role of both backward transfer and initial performance in the early stages of continual learning.

5.6.2. Initial Performance

Figure 5 and Figure 6 also reveal the significant influence of initial performance in early stages on overall performance changes. For highly discriminative models such as GINs, initial performance in the earliest stages tends to be high, as shown in Figure 5d. According to Equation (8), this can negatively impact overall performance, because the last term of Equation (8) becomes negative whenever the initial performance of a new stage falls below the average initial performance of the previous stages. If changes in backward transfer and its values fail to offset the gap between the initial performance of new stages and the average initial performance of previous stages, overall performance declines. For example, in the naïve method, both the backward transfer value and its change are negative, which exacerbates this decline. In contrast, less discriminative models such as GCNs, GATs, and GATv2 exhibit lower initial performance in the early stages. In the sample-incremental setting of this study, a lower starting performance proves beneficial for improving overall performance, particularly when the introduction of new samples reinforces the model’s inductive bias, as observed when inter-stage edges are retained.

5.6.3. Backward Transfer Change

As the number of stages increases, the influence of backward transfer variation on performance becomes progressively stronger. In the limit, as the number of stages approaches infinity, overall performance change is governed solely by the change in backward transfer.
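
The limiting behavior can be read off the three coefficients in Equation (8). The short sketch below tabulates them for a few values of n; the function name `eq8_coefficients` is hypothetical.

```python
def eq8_coefficients(n):
    """Weights in Equation (8) for a given stage n >= 2.

    Returns the coefficients of the change in backward transfer, of the
    previous backward transfer value, and of the initial-performance gap.
    """
    return (n - 1) / n, 1 / (n * (n - 1)), 1 / n


for n in (2, 10, 100, 1000):
    w_delta_bwt, w_bwt_prev, w_gap = eq8_coefficients(n)
    print(n, round(w_delta_bwt, 4), round(w_bwt_prev, 6), round(w_gap, 4))
```

At n = 2 all three weights equal 0.5, which is why the backward transfer value and the initial performance matter in early stages. As n grows, the first weight tends to 1 while the other two vanish.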

The relationship between the average accuracy and backward transfer indicates that predominantly negative backward transfer values do not necessarily result in a continuous performance decline. For example, in the naïve method, the backward transfer values are negative most of the time due to the absence of forgetting-resistant strategies. However, by retaining inter-stage edges, both phishing samples and non-phishing samples preserve most transaction information, maintaining the primary pattern distinction between them. As a result, the initial performance differs only slightly across stages, preventing sharp drops in overall performance and even allowing for gradual improvement.

Similarly, predominantly positive backward transfer values do not guarantee incremental performance improvement. In the retraining-o method, the backward transfer values are generally positive due to the accumulation of data from all prior stages, indicating no forgetting. However, the removal of inter-stage edges causes a significant loss of transaction information, disrupting the pattern distinction between phishing and non-phishing nodes. Consequently, subsequent stages fail to enhance the patterns from earlier ones, and the initial performance on new stages is substantially lower than the average initial performance of previous stages, thereby negatively impacting overall performance.

5.7. Convergence

5.7.1. Performance Perspective

In deep learning, convergence refers to the state where the training process stabilizes, typically characterized by the loss function, model parameters, gradient norms, or validation error reaching a steady state or approaching an optimal value. It can also be reflected in the stabilization of model performance metrics, such as accuracy, which exhibit no further improvement as training progresses. Currently, the convergence of deep learning is predominantly studied through the lens of loss function optimization via stochastic gradient descent, with limited focus on alternative approaches. In this work, we assess convergence from the perspective of average accuracy and backward transfer. Specifically, we define convergence in continual learning in terms of the changes in average accuracy and backward transfer.

5.7.2. $ϵ$-k Bounded Convergence

Considering the changes in overall performance and backward transfer as functions of n, and given their bounded ranges (

Δ {ACC}_{n} \in [- 1, 1], Δ {BWT}_{n} \in [- 2, 2]

), it is impossible that these changes exhibit a strictly monotonic increase or decrease over the long term as n grows. Instead, they can only fluctuate within their respective ranges.

To formalize this, we introduce two thresholds

ϵ

and k to define

ϵ

-k bounded convergence, where continual learning is considered to have converged when backward transfer variation or overall performance variation stays at or below a specified threshold

ϵ

over k consecutive stages, rather than requiring the variation to strictly reach zero. This allows for more flexibility in determining convergence based on task-specific requirements. Since different scenarios prioritize different metrics, we separately define convergence based on overall performance and backward transfer.

Specifically, continual learning is considered to converge on average accuracy at stage t if

| Δ {ACC}_{n} | \leq ϵ, \forall n \in [t, t + k - 1]

. Similarly, it converges on backward transfer at stage t if

| Δ {BWT}_{n} | \leq ϵ, \forall n \in [t, t + k - 1]

. As the change in overall performance with increasing stages is primarily influenced by variations in backward transfer, it is generally sufficient to use either of these two metrics to assess convergence.
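
The definition lends itself to a simple sliding-window check. The sketch below is one possible reading of it: the function name `first_converged_stage` is hypothetical, and the values in `delta_acc` are made up for illustration rather than taken from the experiments.

```python
def first_converged_stage(deltas, eps, k):
    """Return the first stage t with |delta_n| <= eps for all n in [t, t + k - 1].

    `deltas` maps a stage index n to the change in average accuracy or in
    backward transfer at that stage. Returns None if no such window exists.
    """
    for t in sorted(deltas):
        window = range(t, t + k)
        if all(n in deltas and abs(deltas[n]) <= eps for n in window):
            return t
    return None


# Illustrative stage-wise changes in average accuracy (not real results).
delta_acc = {2: 0.06, 3: 0.03, 4: 0.008, 5: -0.004,
             6: 0.012, 7: 0.005, 8: -0.003, 9: 0.002}
t = first_converged_stage(delta_acc, eps=0.01, k=3)
print(t)  # 7
```

With k = 3 the spike at stage 6 delays convergence to stage 7, whereas k = 2 would already accept stage 4. This shows how the two thresholds trade strictness against early detection.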

5.7.3. Convergence in KAR Method

In the sample incremental scenario examined in this paper, as shown in Figure 5 and Figure 6, the introduction of new samples initially exerts a pronounced impact on previous stages, leading to substantial changes in both overall performance and backward transfer. As n increases, the influence of new samples on earlier stages gradually diminishes, resulting in smaller variations in overall performance and backward transfer, eventually stabilizing with minor fluctuations around zero. This indicates that, in this sample-incremental setting, the variations in average accuracy and backward transfer will ultimately converge to a relatively stable state over time. To explain this, we further explore the key factors influencing convergence, building on their respective definitions.

According to Equation (2), we obtain

Δ {ACC}_{n} = \frac{1}{n} \sum_{i = 1}^{n} R_{n, i} - \frac{1}{n - 1} \sum_{i = 1}^{n - 1} R_{n - 1, i} .

(9)

Since

\sum_{i = 1}^{n} R_{n, i} = \sum_{i = 1}^{n - 1} R_{n, i} + R_{n, n}

, Equation (9) can be simplified to the following one:

\begin{matrix} Δ {ACC}_{n} = \frac{1}{n} \sum_{i = 1}^{n - 1} (R_{n, i} - R_{n - 1, i}) + \frac{1}{n} (R_{n, n} - {ACC}_{n - 1}) . \end{matrix}

(10)

Similarly, according to Equation (3), we have

\begin{matrix} Δ {BWT}_{n} = \frac{1}{n - 1} \sum_{i = 1}^{n - 1} (R_{n, i} - R_{i, i}) - \frac{1}{n - 2} \sum_{i = 1}^{n - 2} (R_{n - 1, i} - R_{i, i}) . \end{matrix}

(11)

Since

\sum_{i = 1}^{n - 1} (R_{n, i} - R_{i, i}) = \sum_{i = 1}^{n - 2} (R_{n, i} - R_{i, i}) + (R_{n, n - 1} - R_{n - 1, n - 1})

, Equation (11) can be simplified to

\begin{matrix} Δ {BWT}_{n} = \frac{1}{n - 1} \sum_{i = 1}^{n - 1} (R_{n, i} - R_{n - 1, i}) - \frac{1}{n - 1} {BWT}_{n - 1} . \end{matrix}

(12)

Equations (10) and (12) suggest that the convergence of the continual learning process is predominantly influenced by the term

R_{n, i} - R_{n - 1, i}

, which represents the change in performance on the

i th

stage after training on the

n th

stage relative to the

(n - 1) th

stage. This term quantifies the extent to which performance on the

i th

stage is affected by additional training on the

n th

stage, thereby reflecting the influence of newly acquired knowledge on previously learned knowledge.

Let

I_{n} = \sum_{i = 1}^{n - 1} (R_{n, i} - R_{n - 1, i})

,

{\bar{I}}_{n}^{a} = \frac{I_{n}}{n}

, and

{\bar{I}}_{n}^{b} = \frac{I_{n}}{n - 1}

. The term

I_{n}

represents the cumulative influence of learning the

n th

stage on all preceding stages. Furthermore,

{\bar{I}}_{n}^{a}

quantifies the average influence on average accuracy, while

{\bar{I}}_{n}^{b}

captures the average influence on backward transfer.
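
Equations (10) and (12) can be checked numerically by expressing both changes through the cumulative influence term. The sketch below uses a random lower-triangular matrix of synthetic accuracies; all function names are hypothetical, and the data carry no empirical meaning.

```python
import random


def acc(R, n):
    """ACC_n per Equation (2), using 1-based indices into R."""
    return sum(R[n][i] for i in range(1, n + 1)) / n


def bwt(R, n):
    """BWT_n per Equation (3), defined for n >= 2."""
    return sum(R[n][i] - R[i][i] for i in range(1, n)) / (n - 1)


def influence(R, n):
    """I_n: summed change on stages 1..n-1 caused by training on stage n."""
    return sum(R[n][i] - R[n - 1][i] for i in range(1, n))


def max_identity_error(R, N):
    """Largest deviation between direct differences and Equations (10), (12)."""
    worst = 0.0
    for n in range(3, N + 1):  # the change in BWT needs BWT_{n-1}, so n >= 3
        i_n = influence(R, n)
        d_acc = acc(R, n) - acc(R, n - 1)
        d_bwt = bwt(R, n) - bwt(R, n - 1)
        eq10 = i_n / n + (R[n][n] - acc(R, n - 1)) / n
        eq12 = i_n / (n - 1) - bwt(R, n - 1) / (n - 1)
        worst = max(worst, abs(d_acc - eq10), abs(d_bwt - eq12))
    return worst


rng = random.Random(0)
N = 6
# Row 0 and column 0 are padding so that R[n][i] matches the paper's indices.
R = [[rng.random() if 1 <= j <= i else 0.0 for j in range(N + 1)]
     for i in range(N + 1)]
error = max_identity_error(R, N)
print(error)
```

The reported error is at the level of floating-point rounding, which confirms that both changes are driven by the same influence term, scaled by 1/n and 1/(n - 1), respectively.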

In this paper, we identify two primary factors that influence convergence: uncertainties arising from random sampling and significant shifts in phishing patterns. The KAR method effectively preserves prior knowledge by retaining inter-stage edges and evolutionary subgraphs of reoccurring phishing nodes. When phishing patterns remain stable, learning at a new stage typically has a positive effect on earlier stages, with the magnitude of this effect gradually diminishing and eventually stabilizing around zero. However, this influence can fluctuate irregularly due to random sampling uncertainties. For instance, if a stage’s sampling disproportionately includes exceptional non-phishing nodes, the impact of learning this new stage on prior stages may be more pronounced than usual.

When excluding random sampling uncertainties, significant changes in phishing patterns at a later stage—if not captured by the node features—may cause substantial fluctuations in the influence on prior stages, even if the model has already converged at an earlier stage. In cases where the phishing patterns of subsequent stages differ significantly from those of earlier stages, learning the new stage may negatively affect prior stages, leading to a decline in overall performance. However, if the phishing patterns of later stages align with those of earlier stages, the model’s performance may gradually recover and eventually converge to a new stable state. Consequently, convergence in continual learning may not be a one-time event but rather a multi-stage process.

5.7.4. Convergence in Broader Contexts

In broader continual learning scenarios, the convergence of continual learning may depend on the relationships between tasks, model capacity, and anti-forgetting strategies. Specifically, similar or complementary tasks encourage positive transfer, supporting convergence, while unrelated tasks can lead to interference and hinder stability. High-capacity models (e.g., with a large number of parameters) better accommodate new information without overwriting prior knowledge, whereas limited-capacity models may struggle to represent both old and new tasks, slowing or even preventing convergence. Effective anti-forgetting strategies, such as replay methods or regularization techniques, help maintain learned representations and stabilize performance. In scenarios where these factors are misaligned or inadequately addressed, the convergence of continual learning becomes uncertain and warrants further discussion.

6. Discussion

6.1. A Knowledge Perspective

The core of the KAR method lies in the ability of knowledge learned from later stages to supplement or augment the knowledge acquired from earlier stages, thereby facilitating incremental improvement in overall performance. In deep learning, knowledge takes various forms and is represented through different carriers, including data samples, data distributions, embeddings, and model parameters, all of which form a hierarchical structure.

In graph theory [69], nodes serve as the fundamental elements. Edges, subgraphs, and entire graphs are all defined based on nodes and their relationships. As such, nodes can be considered the smallest unit of knowledge within graph data. In knowledge graphs, for example, nodes often represent specific entities, each containing attribute information. A node representing a cat, for instance, might include attributes such as breed, fur color, and other relevant details. This attribute information constitutes the knowledge associated with the node and is commonly expressed in the form of a triplet: “node–attribute–attribute value.”

However, the knowledge encapsulated within a single node is often isolated. In most cases, knowledge is organized and represented in a systematic, hierarchical, or networked manner to facilitate understanding, usage, and reasoning. Consequently, nodes can also represent abstract categories or concepts, with their relationships to entity-specific nodes expressing more complex knowledge. For example, in a knowledge graph, a node might represent the concept of “animal”, and through its relationship to the node representing “cat”, it can convey the knowledge “a cat is an animal”. In this case, the node becomes a component of a more intricate knowledge structure, with triplets of the form “node–relationship–node” serving as fundamental building blocks of such organized knowledge.

In this paper, a node represents an account, which encompasses various details such as account type, permissions, registration time, address, and public/private key pairs. However, this information alone is insufficient for identifying phishing accounts. To extract the necessary knowledge for phishing detection, we require interaction data (transactions) between accounts. The more comprehensive the transactions, the clearer the distinction between phishing and non-phishing accounts, enabling the model to learn more discriminative representations and thereby improving performance. To this end, the KAR method utilizes transaction subgraphs as a fundamental knowledge structure to capture account transaction behaviors. By retaining inter-stage edges and the 1-ATSGs of reoccurring phishing accounts, the method preserves more transactions. This approach not only addresses catastrophic forgetting but also strengthens the differentiation between phishing and non-phishing accounts. As new stages are introduced, the model can continuously refine and update its knowledge, resulting in a gradual improvement in overall performance.
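As a rough illustration of why retaining earlier transactions preserves more information, the following Python sketch contrasts an incremental view of an account (transactions from the current stage only) with an accumulative view (transactions from all stages so far). It is a simplification under stated assumptions: the edge schema `(sender, receiver, stage)`, the helper `incident_transactions`, and the toy data are hypothetical, and only transactions incident to the central account are kept, which may be narrower than the subgraph definitions used in the paper.

```python
from typing import List, Set, Tuple

# (sender, receiver, stage) is a hypothetical edge schema for this sketch.
Edge = Tuple[str, str, int]


def incident_transactions(edges: List[Edge], center: str, stages: Set[int]) -> List[Edge]:
    """Transactions that touch `center` and fall in one of the given stages."""
    return [e for e in edges if e[2] in stages and center in (e[0], e[1])]


edges: List[Edge] = [
    ("a", "p", 1),
    ("a", "p", 1),  # repeated transaction: the network is a multigraph
    ("p", "b", 1),
    ("c", "p", 2),  # inter-stage edge: new node c transacts with earlier node p
    ("c", "d", 2),  # edge between two nodes that are both new in stage 2
]

incremental = incident_transactions(edges, "p", {2})      # stage 2 only
accumulative = incident_transactions(edges, "p", {1, 2})  # stages 1 and 2
print(len(incremental), len(accumulative))
```

In this toy example the accumulative view of account `p` holds four transactions while the incremental view holds one, so a model trained on the accumulative view sees more of the account's behavior.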

6.2. Applicability of KAR

The KAR method is not confined to Ethereum phishing detection; it can be extended to a wider range of cryptocurrency anomaly detection applications, including anti-money laundering and fraud detection. KAR identifies important nodes through their reoccurrence, which relies on the Ethereum transaction network being a multigraph. In simple graphs (i.e., graphs without self-loops or multiple edges between two nodes), nodes do not reoccur, so important nodes cannot be identified in this way. However, alternative approaches can be used to assess node importance. In graph theory, metrics such as degree and centrality are commonly employed to gauge the significance of nodes. In deep learning, node importance can be evaluated based on its contribution to the learning task, which can be quantified through factors such as loss function values, gradients, and model parameters. The specific metrics and methods for evaluating node importance vary according to the task and application context.
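As one concrete instance of a graph-theoretic importance measure, the sketch below ranks the nodes of a small undirected simple graph by normalized degree centrality (degree divided by n − 1). It is only an illustration of the idea, not part of KAR; the helper names and the toy edge list are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def degree_centrality(edges: List[Tuple[str, str]]) -> Dict[str, float]:
    """Degree of each node divided by (n - 1), for an undirected simple graph."""
    degree: Counter = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    if n < 2:
        return {node: 0.0 for node in degree}
    return {node: d / (n - 1) for node, d in degree.items()}


def top_nodes(edges: List[Tuple[str, str]], m: int) -> List[str]:
    """The m highest-scoring nodes; ties are broken by node name."""
    scores = degree_centrality(edges)
    return sorted(scores, key=lambda node: (-scores[node], node))[:m]


# Hypothetical simple graph with four nodes.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
important = top_nodes(edges, 2)
print(important)
```

A replay buffer could then retain subgraphs of the top-ranked nodes in place of reoccurring ones, although whether this preserves the benefits reported for KAR is untested here.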

On the other hand, knowledge is not solely confined to being expressed through subgraphs. In continual graph learning scenarios involving multiple tasks, such as task incremental and class incremental settings, samples from a particular class can be represented by one or multiple prototypes. When a new sample belonging to an existing class appears, its similarity to the prototype of the known class can be measured using Euclidean distance or cosine similarity, allowing the prototype to be updated or augmented. When a new class emerges, a new prototype is added to the existing set of prototypes. This approach enables the model to retain knowledge from previous tasks while simultaneously learning from new ones. The concept of prototypes can also be interpreted from a graph perspective. If the prototype of a class is viewed as a central node, each sample from that class can be seen as a child node connected to the prototype by their similarity, forming a subgraph. The greater the number of representative samples, the richer the information, and the more accurately the prototype can represent the class.
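The prototype mechanism described above can be sketched in a few lines of Python. The sketch assumes one prototype per class maintained as a running mean of the samples seen so far; the class `PrototypeMemory`, its methods, and the toy vectors are hypothetical, and other update rules are equally possible.

```python
import math
from typing import Dict, List


def euclidean(x: List[float], y: List[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine(x: List[float], y: List[float]) -> float:
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


class PrototypeMemory:
    """One prototype per class, kept as the running mean of its samples."""

    def __init__(self) -> None:
        self.prototypes: Dict[str, List[float]] = {}
        self.counts: Dict[str, int] = {}

    def update(self, label: str, x: List[float]) -> None:
        if label not in self.prototypes:
            # A new class emerges: add a new prototype to the set.
            self.prototypes[label] = list(x)
            self.counts[label] = 1
            return
        # A known class: move its prototype toward the new sample.
        n = self.counts[label] + 1
        old = self.prototypes[label]
        self.prototypes[label] = [p + (xi - p) / n for p, xi in zip(old, x)]
        self.counts[label] = n

    def nearest(self, x: List[float]) -> str:
        """Label of the prototype closest to x in Euclidean distance."""
        return min(self.prototypes, key=lambda c: euclidean(self.prototypes[c], x))


memory = PrototypeMemory()
memory.update("class_a", [1.0, 0.0])
memory.update("class_a", [3.0, 0.0])  # prototype becomes the mean [2.0, 0.0]
memory.update("class_b", [0.0, 4.0])  # new class, new prototype
print(memory.nearest([2.5, 0.5]))
```

Swapping `euclidean` for `cosine` in `nearest` (and taking the maximum instead of the minimum) gives the cosine-similarity variant mentioned in the text.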

In general continual learning settings, data often lack explicit relationships. However, relationships can be discovered or constructed, such as temporal relationships in scenarios with time-series data (e.g., video analysis, action recognition, object tracking) and spatial relationships in tasks like multi-view learning or 3D reconstruction. Images can also be linked through similarity measures; for instance, ref. [70] uses the RBF kernel to compute pairwise similarities and generate random graphs that capture relational structures. How these relational structures can be leveraged effectively to incrementally enhance model performance in general continual learning remains an open question.
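To make the idea of constructing relations from similarity concrete, the following Python sketch computes the RBF kernel, k(x, y) = exp(−γ‖x − y‖²), between feature vectors and samples an undirected random graph in which each pair is linked with probability equal to its similarity. The sampling rule, the function names, and the toy points are assumptions made for illustration and are not claimed to reproduce the procedure of ref. [70].

```python
import math
import random
from typing import List, Tuple


def rbf_similarity(x: List[float], y: List[float], gamma: float) -> float:
    """RBF kernel value exp(-gamma * ||x - y||^2), which lies in [0, 1]."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sample_random_graph(
    points: List[List[float]], gamma: float, rng: random.Random
) -> List[Tuple[int, int]]:
    """Link each pair (i, j), i < j, with probability equal to its RBF similarity."""
    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if rng.random() < rbf_similarity(points[i], points[j], gamma):
                edges.append((i, j))
    return edges


# Hypothetical feature vectors: two identical points and one distant point.
points = [[0.0, 0.0], [0.0, 0.0], [100.0, 100.0]]
graph = sample_random_graph(points, gamma=1.0, rng=random.Random(0))
print(graph)
```

In this toy case the identical points have similarity 1 and are always linked, while the distant point has similarity that underflows to 0 and is never linked; intermediate similarities would yield edges that vary from sample to sample.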

7. Conclusions

In this work, we address the challenge of incremental performance improvement in sample-incremental continual graph learning for node classification tasks, with a focus on leveraging inter-stage edges as a pathway for explicit knowledge transfer. We propose a knowledge-augmented replay method, KAR, which uses these edges to reinforce learned patterns across stages, effectively mitigating catastrophic forgetting and achieving incremental performance improvement by consolidating previously acquired knowledge while integrating new information. Experimental results on Ethereum phishing scam detection validate KAR’s effectiveness: it achieves performance comparable to retraining with lower resource requirements. Additionally, we analyze the role of backward transfer variation in long-term performance changes and introduce ϵ-k bounded convergence as a practical criterion for assessing the convergence of continual learning.

Looking forward, our findings on knowledge augmentation provide a foundation for preserving and enhancing knowledge in evolving graph structures for incremental performance improvement. This approach has potential applicability beyond Ethereum phishing detection, extending to cryptocurrency anomaly detection and other continual graph learning scenarios. Further research could explore adapting knowledge augmentation to other continual learning settings. Further refinement of the ϵ-k bounded convergence criterion may also facilitate standardization in convergence assessment, supporting sustained model performance across dynamic data landscapes.

Author Contributions

Conceptualization, Z.T. and D.Z.; methodology, Z.T.; software, Z.T.; validation, Z.T.; resources, D.Z.; writing—original draft preparation, Z.T.; writing—review and editing, D.Z.; visualization, Z.T.; supervision, D.Z.; project administration, Z.T.; funding acquisition, D.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Macao Science and Technology Development Fund through the Macao Funding Scheme for Key Research and Development Projects under Grant 0025/2019/AKP.

Data Availability Statement

The dataset utilized in this study is accessible at https://xblock.pro (accessed on 15 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, L.; Zhang, X.; Su, H.; Zhu, J. A Comprehensive Survey of Continual Learning: Theory, Method and Application. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5362–5383.
2. Benavides-Prado, D.; Riddle, P. A Theory for Knowledge Transfer in Continual Learning. In Proceedings of the 1st Conference on Lifelong Learning Agents, Montreal, QC, Canada, 22–24 August 2022; Chandar, S., Pascanu, R., Precup, D., Eds.; PMLR: Cambridge, MA, USA, 2022; Volume 199, pp. 647–660.
3. Mermillod, M.; Bugaiska, A.; Bonin, P. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects. Front. Psychol. 2013, 4, 504.
4. Tian, Z.; Zhang, D.; Dai, H.N. Continual Learning on Graphs: A Survey. arXiv 2024, arXiv:2402.06330.
5. Zhang, X.; Song, D.; Tao, D. Hierarchical prototype networks for continual graph representation learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4622–4636.
6. Yuan, Q.; Guan, S.U.; Ni, P.; Luo, T.; Man, K.L.; Wong, P.; Chang, V. Continual graph learning: A survey. arXiv 2023, arXiv:2301.12230.
7. Febrinanto, F.G.; Xia, F.; Moore, K.; Thapa, C.; Aggarwal, C. Graph lifelong learning: A survey. IEEE Comput. Intell. Mag. 2023, 18, 32–51.
8. Zhang, X.; Song, D.; Tao, D. Continual Learning on Graphs: Challenges, Solutions, and Opportunities. arXiv 2024, arXiv:2402.11565.
9. Wang, J.; Song, G.; Wu, Y.; Wang, L. Streaming graph neural networks via continual learning. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual, 19–23 October 2020; pp. 1515–1524.
10. Liu, H.; Yang, Y.; Wang, X. Overcoming catastrophic forgetting in graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 8653–8661.
11. Liu, Y.; Qiu, R.; Huang, Z. Cat: Balanced continual graph learning with graph condensation. In Proceedings of the 2023 IEEE International Conference on Data Mining (ICDM), Shanghai, China, 1–4 December 2023; IEEE: New York, NY, USA, 2023; pp. 1157–1162.
12. Mondal, A.K.; Nandy, J.; Kaul, M.; Chandran, M. Stochastic Experience-Replay for Graph Continual Learning. In Proceedings of the Third Learning on Graphs Conference, Virtual, 9–12 December 2024.
13. Liu, Y.; Qiu, R.; Tang, Y.; Yin, H.; Huang, Z. PUMA: Efficient Continual Graph Learning for Node Classification With Graph Condensation. IEEE Trans. Knowl. Data Eng. 2025, 37, 449–461.
14. Hoang, T.D.; Tung, D.V.; Nguyen, D.H.; Nguyen, B.S.; Nguyen, H.H.; Le, H. Universal Graph Continual Learning. arXiv 2023, arXiv:2308.13982.
15. Song, L.; Li, J.; Si, Q.; Guan, S.; Kong, Y. Exploring Rationale Learning for Continual Graph Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 20540–20548.
16. Zhang, X.; Song, D.; Chen, Y.; Tao, D. Topology-aware embedding memory for continual learning on expanding networks. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 4326–4337.
17. Zhang, X.; Song, D.; Tao, D. CGLB: Benchmark Tasks for Continual Graph Learning. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 13006–13021.
18. Su, J.; Zou, D.; Wu, C. On the Limitation and Experience Replay for GNNs in Continual Learning. arXiv 2023, arXiv:2302.03534.
19. Wei, D.; Gu, Y.; Song, Y.; Song, Z.; Li, F.; Yu, G. IncreGNN: Incremental Graph Neural Network Learning by Considering Node and Parameter Importance. In Proceedings of the International Conference on Database Systems for Advanced Applications, Virtual, 11–14 April 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 739–746.
20. Han, X.; Feng, Z.; Ning, Y. A topology-aware graph coarsening framework for continual graph learning. Adv. Neural Inf. Process. Syst. 2024, 37, 132491–132523.
21. Chen, W.; Guo, X.; Chen, Z.; Zheng, Z.; Lu, Y. Phishing Scam Detection on Ethereum: Towards Financial Security for Blockchain Ecosystem. In Proceedings of the IJCAI, Yokohama, Japan, 11–17 July 2020; Volume 7, pp. 4456–4462.
22. Yuan, Q.; Huang, B.; Zhang, J.; Wu, J.; Zhang, H.; Zhang, X. Detecting phishing scams on ethereum based on transaction records. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Virtual, 10–21 October 2020; IEEE: New York, NY, USA, 2020; pp. 1–5.
23. Tan, R.; Tan, Q.; Zhang, Q.; Zhang, P.; Xie, Y.; Li, Z. Ethereum fraud behavior detection based on graph neural networks. Computing 2023, 105, 2143–2170.
24. Luo, J.; Qin, J.; Wang, R.; Li, L. A phishing account detection model via network embedding for Ethereum. IEEE Trans. Circuits Syst. II Express Briefs 2023, 71, 622–626.
25. Wu, J.; Yuan, Q.; Lin, D.; You, W.; Chen, W.; Chen, C.; Zheng, Z. Who are the phishers? phishing scam detection on ethereum via network embedding. IEEE Trans. Syst. Man Cybern. Syst. 2020, 52, 1156–1166.
26. Yuan, Z.; Yuan, Q.; Wu, J. Phishing detection on ethereum via learning representation of transaction subgraphs. In Proceedings of the Blockchain and Trustworthy Systems: Second International Conference, BlockSys 2020, Dali, China, 6–7 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 178–191.
27. Xia, Y.; Liu, J.; Wu, J. Phishing detection on ethereum via attributed ego-graph embedding. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 2538–2542.
28. Li, S.; Gou, G.; Liu, C.; Xiong, G.; Li, Z.; Xiao, J.; Xing, X. TGC: Transaction Graph Contrast Network for Ethereum Phishing Scam Detection. In Proceedings of the 39th Annual Computer Security Applications Conference, Austin, TX, USA, 4–8 December 2023; pp. 352–365.
29. Chen, Y.; Hou, W.; Zhang, X.; Li, R. Ethereum Phishing Scams Detection Based on Graph Contrastive Learning with Augmentations. In Proceedings of the 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD), San José, CA, USA, 9–13 November 2024; IEEE: New York, NY, USA, 2024; pp. 2053–2058.
30. Sun, H.; Liu, Z.; Wang, S.; Wang, H. Adaptive attention-based graph representation learning to detect phishing accounts on the Ethereum blockchain. IEEE Trans. Netw. Sci. Eng. 2024, 11, 2963–2975.
31. Huang, H.; Zhang, X.; Wang, J.; Gao, C.; Li, X.; Zhu, R.; Ma, Q. PEAE-GNN: Phishing Detection on Ethereum via Augmentation Ego-Graph Based on Graph Neural Network. IEEE Trans. Comput. Soc. Syst. 2024, 11, 4326–4339.
32. Wang, Y.; Liu, Z.; Xu, J.; Yan, W. Heterogeneous network representation learning approach for ethereum identity identification. IEEE Trans. Comput. Soc. Syst. 2022, 10, 890–899.
33. Li, S.; Gou, G.; Liu, C.; Hou, C.; Li, Z.; Xiong, G. TTAGN: Temporal transaction aggregation graph network for ethereum phishing scams detection. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 661–669.
34. Lin, Z.; Xiao, X.; Hu, G.; Li, Q.; Zhang, B.; Luo, X. Tracking phishing on Ethereum: Transaction network embedding approach for accounts representation learning. Comput. Secur. 2023, 135, 103479.
35. Wen, T.; Xiao, Y.; Wang, A.; Wang, H. A novel hybrid feature fusion model for detecting phishing scam on Ethereum using deep neural network. Expert Syst. Appl. 2023, 211, 118463.
36. Wang, L.; Xu, M.; Cheng, H. Phishing scams detection via temporal graph attention network in Ethereum. Inf. Process. Manag. 2023, 60, 103412.
37. Zhang, J.; Sui, H.; Sun, X.; Ge, C.; Zhou, L.; Susilo, W. GrabPhisher: Phishing Scams Detection in Ethereum via Temporally Evolving GNNs. IEEE Trans. Serv. Comput. 2024, 17, 3727–3741.
38. Tang, M.; Ye, M.; Chen, W.; Zhou, D. BiLSTM4DPS: An attention-based BiLSTM approach for detecting phishing scams in ethereum. Expert Syst. Appl. 2024, 256, 124941.
39. Xu, C.; Li, R.; Zhu, L.; Shen, X.; Sharif, K. EWDPS: A Novel Framework for Early Warning and Detection on Ethereum Phishing Scams. IEEE Internet Things J. 2024, 11, 30483–30495.
40. Li, S.; Wang, R.; Wu, H.; Zhong, S.; Xu, F. SIEGE: Self-Supervised Incremental Deep Graph Learning for Ethereum Phishing Scam Detection. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 8881–8890.
41. Hamilton, W.L. Graph representation learning. In Synthesis Lectures on Artificial Intelligence and Machine Learning; Morgan & Claypool Publishers: San Rafael, CA, USA, 2020; Volume 14, pp. 1–159.
42. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
43. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
44. Hamilton, W.; Ying, R.; Leskovec, J. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 2017, 30.
45. Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How Powerful are Graph Neural Networks? In Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
46. Kipf, T.N.; Welling, M. Variational Graph Auto-Encoders. In Proceedings of the NIPS Workshop on Bayesian Deep Learning, Barcelona, Spain, 9 December 2016.
47. Yun, S.; Jeong, M.; Kim, R.; Kang, J.; Kim, H.J. Graph transformer networks. Adv. Neural Inf. Process. Syst. 2019, 32.
48. De Lange, M.; Aljundi, R.; Masana, M.; Parisot, S.; Jia, X.; Leonardis, A.; Slabaugh, G.; Tuytelaars, T. A continual learning survey: Defying forgetting in classification tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3366–3385.
49. Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2001–2010.
50. Rolnick, D.; Ahuja, A.; Schwarz, J.; Lillicrap, T.; Wayne, G. Experience replay for continual learning. Adv. Neural Inf. Process. Syst. 2019, 32.
51. Isele, D.; Cosgun, A. Selective experience replay for lifelong learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
52. Li, Z.; Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2935–2947.
53. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
54. Zenke, F.; Poole, B.; Ganguli, S. Continual learning through synaptic intelligence. In Proceedings of the International Conference on Machine Learning. PMLR, Sydney, Australia, 6–11 August 2017; pp. 3987–3995.
55. Serra, J.; Suris, D.; Miron, M.; Karatzoglou, A. Overcoming catastrophic forgetting with hard attention to the task. In Proceedings of the International Conference on Machine Learning. PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 4548–4557.
56. Mallya, A.; Lazebnik, S. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7765–7773.
57. Aljundi, R.; Chakravarty, P.; Tuytelaars, T. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3366–3375.
58. Yang, H.; Hasan, A.; Tarokh, V. Parabolic continual learning. arXiv 2025, arXiv:2503.02117.
59. Zhang, P.; Yan, Y.; Li, C.; Wang, S.; Xie, X.; Song, G.; Kim, S. Continual learning on dynamic graphs via parameter isolation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023; pp. 601–611.
60. Rakaraddi, A.; Siew Kei, L.; Pratama, M.; De Carvalho, M. Reinforced continual learning for graphs. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–22 October 2022; pp. 1666–1674.
61. Zhou, F.; Cao, C. Overcoming catastrophic forgetting in graph neural networks with experience replay. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; Volume 35, pp. 4714–4722.
62. Kim, S.; Yun, S.; Kang, J. DyGRAIN: An Incremental Learning Framework for Dynamic Graphs. In Proceedings of the IJCAI, Vienna, Austria, 23–29 July 2022; pp. 3157–3163.
63. Perini, M.; Ramponi, G.; Carbone, P.; Kalavri, V. Learning on streaming graphs with experience replay. In Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, Virtual, 25–29 April 2022; pp. 470–478.
64. Zhang, X.; Song, D.; Tao, D. Sparsified subgraph memory for continual graph representation learning. In Proceedings of the 2022 IEEE International Conference on Data Mining (ICDM), Orlando, FL, USA, 28 November–1 December 2022; IEEE: New York, NY, USA, 2022; pp. 1335–1340.
65. Wang, Q.; Zhou, T.; Yuan, Y.; Mao, R. Prompt-Driven Continual Graph Learning. arXiv 2025, arXiv:2502.06327.
66. Chen, L.; Peng, J.; Liu, Y.; Li, J.; Xie, F.; Zheng, Z. Phishing scams detection in ethereum transaction network. ACM Trans. Internet Technol. (TOIT) 2020, 21, 1–16.
67. Brody, S.; Alon, U.; Yahav, E. How Attentive are Graph Attention Networks? In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021.
68. Lopez-Paz, D.; Ranzato, M. Gradient episodic memory for continual learning. Adv. Neural Inf. Process. Syst. 2017, 30.
69. Bondy, J.; Murty, U. Graph Theory, 1st ed.; Springer Publishing Company, Incorporated: Berlin/Heidelberg, Germany, 2008.
70. Tang, B.; Matteson, D.S. Graph-Based Continual Learning. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021.

Figure 1. The evolution of the Ethereum transaction network. Inter-stage edges refer to the connections linking the current incremental network $Δ G_{i}$ with the previous incremental network $Δ G_{j}, j < i$.

Figure 2. The overview of the proposed framework, where orange nodes indicate phishing addresses and white nodes indicate non-phishing ones. Nodes with a question mark have unknown labels, and $f_{i}$ denotes the trained model after learning on dataset $D_{i}$.

Figure 3. The k-order transaction subgraph of a node v.

Figure 4. Evolutionary transaction subgraphs.

Figure 5. Stage-wise average accuracy of four GNNs across consecutive ITNs.

Figure 6. Stage-wise average forgetting of four GNNs across consecutive ITNs.

Figure 7. The changes in average accuracy and backward transfer values of the KAR method across stages.

Table 1. Summary of key notations.

Notations	Descriptions
$D_{i}$	The dataset (a set of 1-TSGs) extracted from $Δ G_{i}$ or $G_{i}$ .
$M_{i}$	The memory buffer for stage i. $M_{i} = M_{i}^{p} \cup M_{i}^{n}$ , where $M_{i}^{p}$ and $M_{i}^{n}$ are the sets of phishing and non-phishing samples in $M_{i}$ , respectively.
$G_{i}, G_{i}^{'}$	The Ethereum transaction network (ETN, i.e., snapshot) at time $t_{i}$ with inter-stage edges and corresponding ETN without inter-stage edges, respectively. $G_{i} = \{V_{i}, E_{i}\} = G_{i - 1} \cup Δ G_{i}$ , and $G_{i}^{'} = ⋃_{j = 1}^{i} Δ G_{j}^{'} = G_{i} ∖ (⋃_{j = 1}^{i} Δ E_{j}^{r})$ .
$Δ G_{i}, Δ G_{i}^{'}$	The incremental transaction network (ITN) between time $t_{i - 1}$ and $t_{i}$ with inter-stage edges and corresponding ITN without inter-stage edges, respectively. $Δ G_{i} = \{Δ V_{i}, Δ E_{i}\}$ , and $Δ G_{i}^{'} = Δ G_{i} ∖ Δ E_{i}^{r}$ .
$E_{i}, Δ E_{i}$	The set of all edges in $G_{i}$ and $Δ G_{i}$ , respectively. $Δ E_{i} = Δ E_{i}^{a} \cup Δ E_{i}^{r}$ .
$Δ E_{i}^{a}, Δ E_{i}^{r}$	The set of edges between newly added nodes (i.e., intra-task edges) and the set of edges between newly added nodes and reappearing nodes (i.e., inter-stage edges) in $Δ G_{i}$ , respectively.
$V_{i}, Δ V_{i}$	The set of all nodes in $G_{i}$ and $Δ G_{i}$ , respectively. $Δ V_{i} = Δ V_{i}^{a} \cup Δ V_{i}^{r} = Δ V_{i}^{p} \cup Δ V_{i}^{n}$ .
$Δ V_{i}^{a}, Δ V_{i}^{r}$	The set of newly added nodes and reappearing nodes in $Δ G_{i}$ , respectively. $Δ V_{i}^{a} = Δ V_{i}^{a p} \cup Δ V_{i}^{a n}$ , and $Δ V_{i}^{r} = Δ V_{i}^{r p} \cup Δ V_{i}^{r n}$ .
$Δ V_{i}^{p}, Δ V_{i}^{n}$	The set of all phishing nodes and non-phishing nodes in $Δ G_{i}$ , respectively. $Δ V_{i}^{p} = Δ V_{i}^{a p} \cup Δ V_{i}^{r p}$ , and $Δ V_{i}^{n} = Δ V_{i}^{a n} \cup Δ V_{i}^{r n}$ .
$Δ V_{i}^{a p}, Δ V_{i}^{r p}$	The set of newly added phishing nodes and reappearing phishing nodes in $Δ G_{i}$ , respectively. $Δ V_{i}^{r p} = ⋃_{j = 1}^{i - 1} Δ V_{i j}^{r p}$ .
$Δ V_{i}^{a n}, Δ V_{i}^{r n}$	The set of newly added non-phishing nodes and reappearing non-phishing nodes in $Δ G_{i}$ , respectively. $Δ V_{i}^{r n} = ⋃_{j = 1}^{i - 1} Δ V_{i j}^{r n}$ .
$Δ V_{i j}^{r p}, Δ V_{i j}^{r n}$	The set of newly added phishing nodes and non-phishing nodes in $Δ G_{j}$ that are also reappearing in $Δ G_{i}$ , respectively. $Δ V_{i j}^{r p} = Δ V_{j}^{a p} \cap Δ V_{i}^{p}$ , and $Δ V_{i j}^{r n} = Δ V_{j}^{a n} \cap Δ V_{i}^{n}, i > j$ .
$S_{u}^{i}, Δ S_{u}^{i}$	The 1-order accumulative transaction subgraph (1-ATSG) and 1-order incremental transaction subgraph (1-ITSG) of a node u in $G_{i}$ and $Δ G_{i}$ , respectively.
$C_{i}^{N}, C_{i}^{R}, C_{i}^{K}$	The set of central nodes in the $i$th dataset under the naïve method, retraining method, and KAR method, respectively. $C_{i}^{N} = C_{i}^{N p} \cup C_{i}^{N n}$ , $C_{i}^{R} = C_{i}^{R p} \cup C_{i}^{R n}$ , and $C_{i}^{K} = C_{i}^{K p} \cup C_{i}^{K n}$ .
$C_{i}^{N p}, C_{i}^{R p}, C_{i}^{K p}$	The set of phishing central nodes in the $i$th dataset under the naïve, retraining, and KAR methods, respectively.
$C_{i}^{N n}, C_{i}^{R n}, C_{i}^{K n}$	The set of non-phishing central nodes in the $i$th dataset under the naïve, retraining, and KAR methods, respectively.

Table 2. Ethereum phishing transaction network statistics.

Items	Original Network	Max Component
Components	13	1
Nodes (addresses)	2,973,489	2,973,382
Edges (transactions)	13,551,303	13,551,214
Phishing nodes	1165	1157
Non-phishing nodes	2,972,324	2,972,225

Table 3. Distribution of phishing nodes across ITNs.

$Δ G_{i}$	$Δ G_{1}$	$Δ G_{2}$	$Δ G_{3}$	$Δ G_{4}$	$Δ G_{5}$	$Δ G_{6}$	$Δ G_{7}$	$Δ G_{8}$	$Δ G_{9}$	$Δ G_{10}$
$Δ G_{1}$	116	49	8	15	3	4	7	3	3	6
$Δ G_{2}$	0	116	34	22	11	9	10	6	4	9
$Δ G_{3}$	0	0	116	51	24	15	16	11	7	9
$Δ G_{4}$	0	0	0	116	50	17	11	9	8	9
$Δ G_{5}$	0	0	0	0	116	37	17	11	7	12
$Δ G_{6}$	0	0	0	0	0	116	59	20	12	20
$Δ G_{7}$	0	0	0	0	0	0	116	55	17	33
$Δ G_{8}$	0	0	0	0	0	0	0	115	50	28
$Δ G_{9}$	0	0	0	0	0	0	0	0	115	47
$Δ G_{10}$	0	0	0	0	0	0	0	0	0	115
Sum	116	165	158	204	204	198	236	230	223	288

Table 4. Comparison of different models on three methods across settings with and without inter-stage edges. ↑ indicates that larger values imply better performance.

Scenario	Method (values in %)	GCN ACC(↑)	GCN AF(↑)	GAT ACC(↑)	GAT AF(↑)	GATv2 ACC(↑)	GATv2 AF(↑)	GIN ACC(↑)	GIN AF(↑)
Remove inter-stage edges	Naïve-o	$51.4 \pm 1.0$	$- 2.6 \pm 2.0$	$49.8 \pm 1.6$	$- 4.8 \pm 2.8$	$51.0 \pm 1.6$	$- 4.4 \pm 2.4$	$51.5 \pm 0.8$	$- 6.5 \pm 1.9$
	Retraining-o	$59.3 \pm 4.6$	$- 0.1 \pm 2.8$	$61.4 \pm 6.3$	$- 0.5 \pm 6.9$	$61.2 \pm 8.0$	$- 1.7 \pm 8.7$	$67.6 \pm 1.8$	$0.6 \pm 1.6$
	Replay-o	$50.2 \pm 0.4$	$- 4.6 \pm 1.3$	$48.9 \pm 1.8$	$- 6.7 \pm 2.7$	$49.0 \pm 1.2$	$- 7.2 \pm 2.9$	$51.2 \pm 1.2$	$- 6.6 \pm 1.7$
Retain inter-stage edges	Naïve	$82.6 \pm 6.4$	$0.3 \pm 5.8$	$87.5 \pm 4.1$	$- 1.1 \pm 1.6$	$86.4 \pm 8.6$	$- 3.4 \pm 7.6$	$84.5 \pm 5.9$	$- 5.9 \pm 7.2$
	Retraining	$86.8 \pm 4.9$	$5.8 \pm 5.2$	$95.4 \pm 2.5$	$5.0 \pm 3.9$	$96.5 \pm 1.9$	$4.3 \pm 3.5$	$94.9 \pm 1.4$	$1.0 \pm 1.7$
	Replay (KAR)	$86.8 \pm 4.0$	$3.8 \pm 2.7$	$93.1 \pm 5.7$	$5.0 \pm 3.0$	$94.9 \pm 2.8$	$3.8 \pm 2.3$	$91.6 \pm 0.4$	$0.3 \pm 1.2$

Table 5. Comparison of four GNNs with existing methods across settings with and without inter-stage edges. Results for our method Replay (KAR), and the best results in the two scenarios are bolded.

Scenario	Method (values in %)	GCN ACC(↑)	GCN AF(↑)	GAT ACC(↑)	GAT AF(↑)	GATv2 ACC(↑)	GATv2 AF(↑)	GIN ACC(↑)	GIN AF(↑)
Remove inter-stage edges	Naïve-o	$51.50$	$- 6.06$	$49.88$	$- 3.39$	$49.37$	$- 3.94$	$51.36$	$- 7.6$
	EWC	$52.9$	$- 2.24$	$51.54$	$- 3.25$	$53.67$	$- 1.92$	$50.72$	$- 6.26$
	LwF	$52.31$	$- 0.96$	$50.91$	$- 7.1$	$51.35$	$- 3.62$	$51.83$	$- 5.6$
	TWP	$50.92$	$- 2.47$	$52.82$	$- 0.71$	$51.24$	$- 0.08$	$52.16$	$- 2.75$
	Replay-o	$50.42$	$- 2.99$	$49.62$	$- 11.09$	$49.48$	$- 9.07$	$50.41$	$- 6.5$
	Retraining-o	$56.29$	$- 1.77$	$52.35$	$- 10.15$	$68.92$	$5.34$	$64.39$	$- 2.64$
Retain inter-stage edges	Naïve	$89.38$	$5.72$	$87.52$	$- 4.47$	$92.11$	$- 1.60$	$86.08$	$- 3.99$
	EWC	$83.8$	$- 0.95$	$91.53$	$1.91$	$89.1$	$0.94$	$85.22$	$- 7.64$
	LwF	$77.76$	$4.43$	$91.97$	$- 0.97$	$92.83$	$- 1.76$	$87.80$	$- 5.26$
	TWP	$79.76$	$0.63$	$79.20$	$- 0.63$	$76.03$	0	$89.38$	$- 0.16$
	Replay (KAR)	$88.38$	$7.48$	$95.73$	$4.38$	$95.86$	$1.26$	$91.1$	$0.37$
	Retraining	$86.13$	$4.39$	$95.68$	$5.66$	$94.86$	$2.08$	$95.22$	$1.81$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

5.7.2. $ϵ$-$k$ Bounded Convergence