Article

BERTSC: A Multi-Modal Fusion Framework for Stablecoin Phishing Detection Based on Graph Convolutional Networks and Soft Prompt Encoding

1 College of Computer and Cyberspace Security, Fujian Normal University, Fuzhou 350007, China
2 Fujian Yuke Information Technology Co., Ltd., Fuzhou 350002, China
3 Department of Information Engineering, Fuzhou Polytechnic, Fuzhou 350108, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 179; https://doi.org/10.3390/electronics15010179
Submission received: 20 November 2025 / Revised: 22 December 2025 / Accepted: 26 December 2025 / Published: 30 December 2025

Abstract

As stablecoins become increasingly prevalent in financial crimes, their usage for illicit activities has reached a scale of USD 51.3 billion. Detecting phishing activities within stablecoin transactions has emerged as a critical challenge in blockchain security. Existing detection methods predominantly target mainstream cryptocurrencies like Ethereum and lack specialized models tailored to the unique transaction patterns of stablecoin networks. This paper introduces a deep learning framework, BERTSC, based on multi-modal fusion. The model integrates three core modules, namely graph convolutional networks (GCNs), BERT semantic encoders, and soft prompt encoders, to identify malicious accounts. The GCN constructs directed multi-graph representations of account interactions, incorporating multi-dimensional edge features; the BERT encoder transforms discrete transaction attributes into semantically rich continuous vector representations; the soft prompt encoder maps account interaction features into learnable prompt vectors. An innovative three-way gated dynamic fusion mechanism optimally combines the information from these sources. The fused features are then classified to predict phishing account labels, facilitating the detection of phishing scams in stablecoin transaction datasets. Experimental results on large-scale stablecoin datasets demonstrate that BERTSC outperforms baseline models, achieving improvements of 4.96%, 3.60%, and 4.23% in Precision, Recall, and F1-score, respectively. Ablation studies validate the effectiveness of each module and confirm the necessity and superiority of the three-way gating fusion mechanism. This research offers a novel technical approach for phishing detection within blockchain stablecoin ecosystems.

1. Introduction

In recent years, blockchain technology has become increasingly sophisticated, exerting profound influence across various sectors, particularly within the financial industry [1]. Nevertheless, as the scope of blockchain applications expands, there has been a concomitant rise in related fraudulent activities, posing significant societal risks [2]. While blockchain’s distributed ledger and transparent nature enhance security and operational efficiency in financial transactions, these same features have been exploited by malicious actors as channels for illegal conduct [3]. Stablecoins, a class of cryptocurrencies designed to maintain price stability, aim to address the excessive volatility observed in traditional cryptocurrencies such as Bitcoin, thereby facilitating their use as reliable mediums of exchange and stores of value for everyday transactions [4]. However, due to their low volatility, stablecoins have increasingly been leveraged by illicit entities for activities such as phishing and scams, surpassing Bitcoin as the preferred asset for criminal operations, accounting for approximately 63% of all illicit transactions by 2024. The scale of illegal activities involving stablecoins, including phishing scams, is estimated at USD 51.3 billion, with their intrinsic stability making them suitable for cross-border illegal fund transfers [5]. Consequently, despite the substantial growth potential of stablecoins, their development raises pressing social and regulatory challenges. Addressing these issues is essential to ensuring their secure and sustainable evolution within the financial ecosystem [6]. In blockchain-based systems, the primary phishing attack process, as depicted in Figure 1, involves the attacker deceiving victims by directly sending malicious addresses designed to lure them into compromised transactions.
Currently, numerous studies employ graph neural networks (GNNs) [7] and graph representation learning techniques [8] to detect cryptocurrency-based phishing attacks, such as those targeting Ethereum [9]. GNNs capture the global transaction topology through graph structures but are limited in their ability to encode interaction features between accounts, such as low-value frequent transactions within short timeframes or fixed transaction patterns. While graph representation learning provides an intuitive and straightforward approach, it often fails to model the temporal sequencing of transactions across accounts. Blockchain transactions are inherently complex, with multiple records connecting accounts, and existing methods typically aggregate directed multigraph transaction information into simplified, unidirectional representations through feature fusion to facilitate computation. However, this approach compromises critical features like transaction priority and temporal order, which are essential for effective de-anonymization of accounts.
In high-volatility cryptocurrency transaction networks such as Ethereum and Bitcoin, mainstream anomaly detection paradigms predominantly rely on traditional volatility features, including abrupt changes in transaction amounts, surges in transaction frequency, asset flow path characteristics such as onion-like mixing, and anomalies indicative of economic behaviors like mass selling. These features are directly observable on-chain through metrics such as transaction amounts, timestamps, sequencing, and the structure of transaction relationship networks. Within volatile assets such as Ethereum and Bitcoin, large single transfers, extreme high-frequency operations, multi-hop rapid accumulation or distribution pathways, and aggressive economic activities often coincide with significant market price disturbances and on-chain behavioral deviations, enabling detection via conventional “mutation-price fluctuation” models. However, the price anchoring mechanisms inherent to stablecoins substantially diminish the efficacy of these traditional volatility features within their network environments [10]. Large-scale fund transfers or short-term high-frequency interactions in stablecoins no longer trigger substantial price movements, rendering corresponding behaviors highly covert within transaction networks. Furthermore, although coin-mixing processes persist in stablecoin transactions, the absence of price-driven structural variations and significant arbitrage opportunities limits the effectiveness of traditional economic anomaly features dependent on price volatility, market panic, and arbitrage potential in this context. Therefore, direct adaptation of anomaly detection models based on Bitcoin and Ethereum to stablecoin datasets is infeasible.
Detection strategies tailored for stablecoin networks should diminish reliance on price volatility cues and sensitive behavioral signals, instead emphasizing the modeling of behavioral anomalies manifested through non-price-based transaction amount mutations, short-term high-frequency patterns, and multi-hop pathways within transaction structures and temporal sequences. This approach aims to enhance the identification of covert anomalies within stablecoin networks by capturing complex behavioral deviations beyond traditional price-dependent indicators.
To address the inherent complexities in stablecoin phishing detection, we propose BERTSC, a sophisticated tri-modal fusion framework that synergistically integrates structural, semantic, and numerical transaction data through original algorithmic designs. The architecture utilizes a GCN to extract global topological features, complemented by a pre-trained BERT model to capture long-term behavioral dependencies.
A primary innovation of this work is the design of a novel multi-dimensional adjacency weighting formula which uniquely synthesizes temporal n-gram intervals, log-transformed volumes, gas prices, and stablecoin interaction ratios. This self-designed weighting scheme enables the construction of a high-fidelity transaction graph that precisely captures latent anomalies in account relationships. Furthermore, we design a soft-prompt encoder framework to bridge the gap between numerical priors and semantic features, mapping account-level interaction characteristics into learnable prompt vectors. To effectively integrate these heterogeneous inputs, we develop a tri-gate fusion algorithm. This algorithm dynamically modulates the integration of embeddings at a granular, per-sample level, adaptively calibrating the relative influence of structural and semantic representations. This integrated strategy not only optimizes the feature space for binary classification but also effectively mitigates class imbalance and overfitting, significantly enhancing detection reliability in complex transaction environments.
  • Stablecoin anomaly phishing detection framework with multi-edge heterogeneous graph: This work introduces a directed multi-edge heterogeneous graph of global account interactions that integrates multi-dimensional edge attributes—such as transaction amounts, temporal differences, and GasPrice—directly into Graph Convolutional Network inputs, augmented by a soft prompt encoder that adaptively maps numerical interaction priors into semantically enriched learnable vectors for enhanced multi-modal fusion.
  • Hierarchical dynamic three-way gating for iterative multi-modal fusion: A novel tri-gate algorithm is proposed to synergistically integrate semantic embeddings, graph topology, and soft-prompted features. Through an iterative fusion architecture, the mechanism refines the comprehensive representation by re-integrating initial semantic and structural cues, ensuring superior detection reliability and robust feature expressiveness.
  • Extensive real-world dataset construction and empirical validation: A large-scale stablecoin transaction dataset is curated, encompassing over 2.5 million nodes, 13 million directed edges with rich attributes, and 1766 phishing instances; rigorous evaluations demonstrate the BERTSC model’s superiority, achieving a 4.96% precision uplift over state-of-the-art baselines, underscoring its robustness in detecting phishing scams within decentralized financial ecosystems.

2. Related Work

With the rapid development of blockchain technology, the detection of fraudulent activities within blockchain networks has emerged as a global challenge. In recent years, researchers have developed various detection methodologies to address these issues, ensuring the security of blockchain transactions [11]. Prior studies primarily employed graph-based representation learning techniques, which can be categorized into three types: random walk strategies based on DeepWalk [12], graph neural network approaches, and fraud detection methods leveraging time-series data.

2.1. Multi-Semantic Perception Approaches Based on DeepWalk

The core idea of DeepWalk is to serialize nodes in a graph into sequences akin to corpus data and employ the Skip-gram model from natural language processing for unsupervised representation learning. Building upon this, Trans2Vec [13] introduces a semantic-aware traversal strategy and sampling bias within the random walk framework, biasing walks toward the semantic and temporal significance of transactional relationships, thereby addressing DeepWalk’s insensitivity to heterogeneous relationships and edge attributes. Inspired by Trans2Vec, Belghith et al. [14] proposed Hui2Vec, which integrates unary and high-efficiency itemsets to learn transaction embeddings, thereby enhancing representation capabilities for utility-sensitive data. Similarly, Ahmed et al. [15] developed a federated learning approach to frequent itemset mining in a distributed privacy-preserving environment. Luo et al. [16] designed a network embedding model based on Trans2Vec to develop a phishing account detection system for the Ethereum blockchain, leveraging transaction embeddings to identify anomalous patterns and improve security accuracy. To overcome limitations in capturing higher-order global structural patterns and diffusion dynamics, Rozemberczki et al. [17] proposed Diff2Vec, which generates more comprehensive subgraph contexts around nodes and extracts sequences from diffusion subgraphs for representation learning. Additionally, Ahmed et al. [18] introduced Role2Vec, which constructs attribute-aware random walks to generate contextually similar but spatially distant nodes, enhancing cross-region generalization and interpretability. However, these methods struggle to effectively characterize the semantic attributes and temporal intensities of transactional edges and exhibit insensitivity to cross-modal numerical priors, limiting their ability to integrate semantic information of transaction sequences with account interaction features comprehensively.

2.2. Approaches Based on Graph Neural Networks

GNNs propagate information via message passing through graph convolution operations, effectively capturing both direct and indirect relationships within transaction networks. This enables the modeling of complex interaction patterns among accounts. Shen et al. [19], utilizing graph convolutional networks, inferred distinct account identities from transaction data, categorizing accounts into legitimate, phishing, and scripted accounts. Building upon this, Huang et al. [20] developed a phishing detection framework based on GNNs, which constructs enhanced interaction graphs to enrich node contextual information, thereby effectively identifying anomalous behaviors associated with fraudulent accounts. Zhou [21] proposed a hierarchical graph attention encoder that integrates global node features with subgraph information to improve phishing detection accuracy. Chen et al. [22] introduced a phishing fraud detection approach leveraging data augmentation techniques combined with a hybrid GNN architecture, designed to capture complex patterns within transaction graphs and address class imbalance issues, thereby enhancing detection precision. However, GNN-based methods currently lack the capacity to model semantic information within transaction sequences, limiting their understanding of the contextual meaning of transaction contents and overlooking the temporal frequency of transactions.

2.3. Fraud Detection Approaches Based on Time Series Analysis

Time series methodologies leverage analysis of the temporal characteristics and frequency patterns of transaction data to identify anomalous behavioral sequences indicative of phishing accounts. Blockchain transaction records inherently encompass extensive time series information, with primary features such as timestamps and transaction frequencies; processing these records through temporal analysis facilitates the detection of potential small-scale, high-frequency fraudulent activities. Farrugia [23] proposed integrating the XGBoost model with time series features, emphasizing their importance in machine learning frameworks for identifying illicit accounts by extracting salient temporal characteristics. Hu et al. [24] introduced a Long Short-Term Memory (LSTM)-based temporal sequence model that enhances the security and accuracy of smart contract risk assessment by analyzing the temporal attributes of transaction data. For computational efficiency, Li et al. [25] developed the Temporal Tripartite Aggregated Graph Neural Network (TTAGNN), which employs LSTM to fuse multiple directed edges and builds node embeddings atop Graph Attention Networks (GATs) [26]. However, time series-based approaches often neglect global graph structural information, limiting their capacity to capture complex inter-account interactions and the semantic content of transactions.
Compared to these models, the proposed BERTSC model exhibits notable advantages. It explicitly incorporates multi-dimensional edge features through a weighted adjacency matrix, significantly enhancing the contribution of edge semantics to node representations. Additionally, the model employs a gating mechanism for adaptive fusion, integrating transaction semantics, graph topology, and numerical prior information, thus outperforming existing methods in the detection of complex transaction patterns and anomalous fund flows. Consequently, BERTSC maintains sensitivity to diverse temporal behaviors such as short-term high-frequency, microtransactional, and large-scale abrupt activities while effectively combining global graph structures and stablecoin interaction features to achieve more robust and interpretable identification of phishing accounts.

3. Methods

We introduce a dynamic multi-modal fusion approach tailored for blockchain stablecoin scenarios. The method employs a stablecoin phishing node detection mechanism, effectively identifying on-chain phishing addresses. The three modules in the system are: Graph convolutional network based on a directed transaction graph structure; BERT encoder leveraging transaction semantic information; Numerical prior soft prompt encoder based on account interaction features. The features extracted from each module are integrated through a tri-channel gated dynamic fusion mechanism and subsequently utilized for final phishing address detection. The overall framework is illustrated in Figure 2.
Graph convolutional network based on a directed transaction graph structure: We construct a weighted adjacency matrix by modeling accounts as nodes and transactions as directed edges within a directed multigraph. Multi-dimensional edge features, including timestamp differences, transaction amounts, GasPrice, and interaction frequencies, are integrated to capture temporal and transactional heterogeneity. Message passing is performed over this adjacency matrix using a vocabulary-level graph convolutional network. Multiple adjacency matrices undergo weighted aggregation to model both direct and indirect transactional relationships, thereby enabling the extraction of more precise higher-order graph structural features.
BERT encoder leveraging transaction semantic information: We convert transaction sequence data, including timestamps, transaction amounts, contract addresses, GasPrice, and other relevant features, using a pretrained BERT model to capture semantic representations. This process employs joint embeddings of word, positional, and type embeddings to encode temporal dependencies within the transaction sequences. By utilizing a 12-layer Transformer encoder, we further extract deep semantic features, facilitating nuanced understanding and contextual modeling of transaction behavioral patterns. This approach establishes a robust and versatile feature foundation for downstream analytical tasks.
Numerical prior soft prompt encoder based on account interaction features: We introduce a soft prompt encoding mechanism that maps account interaction numerical features into learnable prompt vectors via an MLP encoder. An average pooling operation generates a summarized prompt vector, thus enabling an end-to-end learning process that maps numeric prior information into the semantic space. This provides a crucial numeric feature basis for multi-modal feature representation.
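The exact equations of the three-way gated fusion are not spelled out at this point; as a minimal sketch, the following assumes a single linear scoring layer followed by a softmax over the three channels. The names tri_gate_fuse, W_g, and b_g are illustrative stand-ins for parameters that the real model would learn end to end.

```python
import numpy as np

def tri_gate_fuse(e_sem, e_gcn, e_prompt, W_g, b_g):
    """Three-way gated fusion of semantic, graph, and soft-prompt embeddings.

    A linear layer scores the concatenated embeddings, and a softmax turns
    the three scores into per-sample gate values that weight the channels.
    W_g has shape (3, 3*d) and b_g has shape (3,); both are assumed learned."""
    z = np.concatenate([e_sem, e_gcn, e_prompt])
    scores = W_g @ z + b_g
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    g = exp / exp.sum()
    fused = g[0] * e_sem + g[1] * e_gcn + g[2] * e_prompt
    return fused, g
```

Because the gates are computed from the concatenated embeddings of each sample, the relative influence of structure, semantics, and numerical priors is calibrated per account rather than fixed globally.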

3.1. Data Acquisition and Preprocessing

Firstly, we extracted account addresses identified as phishing nodes involved in stablecoin trading from Dune Analytics. Subsequently, all transaction records associated with these addresses were obtained from Etherscan, spanning from the account creation to the present. During the processing of stablecoin transaction data, each record encompassed multiple transaction features, including BlockNumber, TimeStamp, Hash, FromAddress, ToAddress, Amount, and GasPrice. These features collectively detail the block in which the transaction occurred, information about both parties involved, and the gas price offered to facilitate the transaction. Through data cleaning and preprocessing, the most informative edge and node features were extracted from the complex feature set, with specific features summarized in Table 1.
After excluding isolated nodes and duplicate transactions within the trading network, we obtained a dataset of stablecoin phishing node addresses and their transaction data spanning from 2017 to 2025. Following data cleaning and preprocessing, a large-scale stablecoin transaction network was constructed, comprising 2,529,625 nodes, 13,071,630 edges, and 1766 identified phishing nodes. As detailed in Table 2, this dataset encompasses transaction records of major stablecoins such as USDT, USDC, and DAI over an eight-year period.
Based on preprocessed data, we constructed a multi-layer directed graph centered on stablecoin transactions. Selected nodes and their transactional interactions were visualized, as shown in Figure 3. In this visualization, blue nodes represent legitimate participants within the transaction network, while red nodes indicate malicious phishing accounts. The orange box highlights the large transfer patterns characteristic of phishing nodes, whereas the red box delineates the pattern of small, frequent, high-frequency interactions within a short period. Compared to legitimate nodes, phishing accounts exhibit more pronounced transaction interaction features.
In terms of data preprocessing, this paper adopts the data processing methodology proposed by Sheng et al. [27]. To construct transaction records based on account addresses, we developed a dictionary named Address, where each unique address serves as a key and associated transaction information as its value, with specific features detailed in Table 3. This dictionary aggregates all transaction records for a unique address into a single entry. Initially, we reorganized the data based on sender and receiver addresses. Since each transaction involves both a sender and a receiver, after identifying a unique address, if it appears as the sender in a transaction record, we set a variable Send = 1 and log the relevant amount, timestamp, and gas price into the dictionary. Conversely, if the address acts as the receiver, we set Send = 0 and similarly record the pertinent transaction data. This process yields the Address dictionary, enabling swift retrieval of any account’s transaction history. This step not only streamlines complex transaction data query methods but also lays the groundwork for subsequent semantic information extraction.
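The Address-dictionary construction described above can be sketched as follows. Field names mirror those in Tables 1 and 3, but the helper name build_address_dict and the record layout are illustrative.

```python
from collections import defaultdict

def build_address_dict(transactions):
    """Aggregate raw transaction records into a per-address dictionary.

    Each input record carries FromAddress, ToAddress, Amount, TimeStamp,
    and GasPrice. Send=1 marks the address as the sender of the record,
    Send=0 as the receiver, as described in Section 3.1."""
    address = defaultdict(list)
    for tx in transactions:
        address[tx["FromAddress"]].append(
            {"Send": 1, "Amount": tx["Amount"],
             "TimeStamp": tx["TimeStamp"], "GasPrice": tx["GasPrice"]})
        address[tx["ToAddress"]].append(
            {"Send": 0, "Amount": tx["Amount"],
             "TimeStamp": tx["TimeStamp"], "GasPrice": tx["GasPrice"]})
    return dict(address)
```

With this structure, any account's full transaction history is retrieved with a single dictionary lookup, which is the fast-retrieval property the section relies on.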

3.2. Graph Convolutional Network Feature Extraction

3.2.1. Graph Data Generation Based on Adjacency Matrices

We adopt the transductive learning setting for node classification tasks in the field of graph neural networks. Firstly, the core characteristic of the transductive setting is that it allows the model to utilize the structural information and feature statistical information of the full graph but strictly restricts access to the labels of the validation/test sets during training—only the labels of training set nodes are used for supervised optimization, validation set labels are solely for hyperparameter tuning, and test set labels are invisible throughout the process. Secondly, the core value of the blockchain transaction network lies in the relational connections between accounts. If the dataset is split in advance before constructing subgraphs, the original neighborhood structure of accounts is truncated, leading the model to fail to learn complete transaction behavior patterns. This contradicts the core mechanism of GNNs which “learn features through neighborhood aggregation”. This workflow has been verified by several authoritative studies in the field of blockchain phishing detection [7,20,28].
Given the inherent complexity of transaction records within trading networks, effectively extracting inter-node transaction features presents a significant challenge. To address this, we propose a method that constructs adjacency matrices between account nodes and incorporates transaction weights to capture global trading characteristics. When processing each transaction record for an account, we first sorted transactions by timestamp. This allowed us to calculate the time difference between successive transactions, reflecting the actual sequence of account activity and the flow of funds within specific timeframes. By enhancing temporal aggregation features, we can effectively identify anomalous behavior within these accounts.
Based on the preprocessed transaction data, we constructed a directed multigraph G = ( V , E ) , where V represents the set of account address nodes and E represents the set of transaction edges. Each node v i V in the graph represents a unique account address, and each directed edge e i , j = ( v i , v j ) E signifies a transaction record from account v i to account v j . To quantify the degree of frequent transactions within a short period, we introduced the concept of n-gram time difference [29]. The n-gram time difference is defined as the measure of an account’s transaction frequency by calculating the time difference between a transaction and its preceding n − 1 transactions. In this study, we computed 2-gram to 5-gram time differences, as represented by the following formula:
$$\Delta \mathrm{TransTime}_n = \mathrm{TransTime}_i - \mathrm{TransTime}_{i-(n-1)}$$
where $\mathrm{TransTime}_i$ represents the timestamp of the $i$th transaction for an account and $\mathrm{TransTime}_{i-(n-1)}$ represents the timestamp of the $(i-(n-1))$th transaction for that account. If the number of transactions for an account is limited, this time difference is set to 0.
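The 2-gram to 5-gram computation above can be sketched as follows; the helper name ngram_time_diffs is illustrative.

```python
def ngram_time_diffs(timestamps, n_values=(2, 3, 4, 5)):
    """2-gram to 5-gram time differences for one account.

    Timestamps are first sorted, matching the paper's timestamp ordering
    step. For each n, entry i is timestamps[i] - timestamps[i-(n-1)],
    or 0 when the account has too few preceding transactions."""
    ts = sorted(timestamps)
    diffs = {}
    for n in n_values:
        diffs[n] = [ts[i] - ts[i - (n - 1)] if i - (n - 1) >= 0 else 0
                    for i in range(len(ts))]
    return diffs
```

Small n-gram differences over many consecutive transactions are the signal of the short-term high-frequency behavior the weighting scheme targets.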
We constructed an n × n zero matrix A, where n is the total number of unique account addresses in the transaction network. The elements within this adjacency matrix represent the connection weights between corresponding addresses. For instance, A [ i , j ] signifies the transaction weight between account i and account j across all transactions. The initial state of the adjacency matrix is as follows:
$$A = \begin{bmatrix} 0 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 0 \end{bmatrix}_{n \times n}$$
In a directed graph, each transaction record T i includes a sender address and a receiver address. By traversing all transaction records within the directed graph, we constructed a dictionary, “Index”, to map unique addresses to their corresponding indices. The keys of this dictionary are account addresses, and their values are their respective indices. We then mapped these account addresses to the indices of an empty adjacency matrix, as shown in the following formula:
$$\mathrm{From}_{\mathrm{index}} = \mathrm{Index}(\mathrm{From}_{\mathrm{address}}), \qquad \mathrm{To}_{\mathrm{index}} = \mathrm{Index}(\mathrm{To}_{\mathrm{address}})$$
where $\mathrm{From}_{\mathrm{address}}$ denotes the sender’s address, $\mathrm{From}_{\mathrm{index}}$ denotes the sender’s index, $\mathrm{To}_{\mathrm{address}}$ denotes the recipient’s address, and $\mathrm{To}_{\mathrm{index}}$ denotes the recipient’s index.
We employed a multi-dimensional weighted calculation strategy, integrating four dimensions (timestamp differences, transaction amounts, gas prices, and stablecoin interaction characteristics) to construct the connection weights of the adjacency matrix. For each element $A[i,j]$ within the adjacency matrix, its weight is calculated using the following fusion formula:
$$\omega_{i,j} = 0.5 \times W_e + 0.2 \times W_n + 0.3 \times W_i$$
where $W_e$ denotes the weighted fusion of transaction amount and gas price within transaction records, $W_n$ represents the n-gram time difference feature, and $W_i$ signifies the stablecoin account interaction feature. $W_e$ captures characteristics of high-value transactions; $W_n$ captures characteristics of frequent transactions within short timeframes; $W_i$ captures the proportion of transaction frequency between two distinct node categories.
Regarding the transaction execution time of phishing nodes, abnormal nodes usually set a high GasPrice to accelerate fund transfer. Therefore, when constructing the matrix weights, a weighted calculation is performed on the transaction amount and GasPrice of the two node types, with the formula as follows:
$$F_a = \frac{\log(a+1)}{\log(\mathrm{Max}_a+1)}, \qquad F_g = \frac{\log(g+1)}{\log(\mathrm{Max}_g+1)}$$
$$W_e = (0.7 \times F_a + 0.3 \times F_g) \times F_{\mathrm{type}}, \qquad F_{\mathrm{type}} \in \{1, 1.5, 2\}$$
where $a$ denotes the transaction amount for a given transaction, $g$ denotes the gas price for that transaction, $F_a$ and $F_g$, respectively, represent the logarithmically normalized transaction amount and gas price, and $F_{\mathrm{type}}$ denotes the node-type weighting: $F_{\mathrm{type}} = 2$ if the transaction is sent from a phishing node to another phishing node, $F_{\mathrm{type}} = 1.5$ if it is sent from a normal node to a phishing node, and $F_{\mathrm{type}} = 1$ if it is sent from a normal node to another normal node.
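The W_e computation can be sketched as follows. The helper name edge_weight_we is illustrative, and the phishing-to-normal direction, which the text does not assign a weight, defaults here to 1.

```python
import math

def edge_weight_we(amount, gas, max_amount, max_gas, src_phish, dst_phish):
    """Log-normalised amount/gas edge weight W_e with node-type factor.

    F_a and F_g are log-normalised against the maximum observed amount
    and gas price; F_type follows the paper's 2 / 1.5 / 1 scheme."""
    f_a = math.log(amount + 1) / math.log(max_amount + 1)
    f_g = math.log(gas + 1) / math.log(max_gas + 1)
    if src_phish and dst_phish:
        f_type = 2.0     # phishing -> phishing
    elif dst_phish:
        f_type = 1.5     # normal -> phishing
    else:
        f_type = 1.0     # normal -> normal (default for other cases)
    return (0.7 * f_a + 0.3 * f_g) * f_type
```

The log normalization compresses the heavy-tailed amount distribution so that a single whale transfer does not dominate the edge weight outright.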
Some phishing nodes seek to evade suspension due to large transactions, employing frequent transactions within short timeframes to raise funds or transfer capital. In such scenarios, the n-gram time difference characteristic becomes particularly crucial. The formula for W n is as follows:
$$W_n = \frac{1}{\mathrm{Max}_{5\text{-gram}}} \sum_{i=2}^{N} w_i \times \theta_i$$
Here, $w_i$ denotes the timestamp difference of the n-gram, $\theta_i$ represents the weighting coefficient of the n-gram, and $\mathrm{Max}_{5\text{-gram}}$ signifies the maximum value among all 5-gram differences.
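A minimal sketch of the W_n computation follows. The concrete θ coefficients are not given in the text, so they are passed in as a parameter; freq_weight_wn is an illustrative name.

```python
def freq_weight_wn(gram_diffs, thetas, max_5gram):
    """Frequency weight W_n from n-gram timestamp differences.

    gram_diffs maps each n (2..5) to its timestamp difference w_i;
    thetas maps each n to its weighting coefficient theta_i. The sum
    is normalised by the maximum 5-gram difference over the dataset."""
    total = sum(gram_diffs[n] * thetas[n] for n in gram_diffs)
    return total / max_5gram if max_5gram > 0 else 0.0
```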
Within on-chain transaction networks, phishing nodes typically exhibit transaction input counts several times greater than their output counts, whereas normal nodes maintain nearly equal input and output counts due to their long-term transactional characteristics. To capture the interaction patterns of these two node types throughout their lifecycle, this paper defines a node’s input-to-output value ratio and input-to-output frequency ratio. These metrics are ultimately weighted and fused to form the account interaction feature W i , calculated as follows:
$$D_{\mathrm{ratio}}^{a} = \max(\mathrm{InOutRatio}_{\mathrm{amount}}, 0), \qquad S_{\mathrm{ratio}}^{a} = \max(\mathrm{OutInRatio}_{\mathrm{amount}}, 0)$$
$$D_{\mathrm{ratio}}^{c} = \max(\mathrm{InOutRatio}_{\mathrm{count}}, 0), \qquad S_{\mathrm{ratio}}^{c} = \max(\mathrm{OutInRatio}_{\mathrm{count}}, 0)$$
$$F_{\mathrm{amount}} = S(D_{\mathrm{ratio}}^{a}) + S(S_{\mathrm{ratio}}^{a})$$
$$F_{\mathrm{count}} = S(D_{\mathrm{ratio}}^{c}) + S(S_{\mathrm{ratio}}^{c})$$
$$W_i = 0.7 \times F_{\mathrm{amount}} + 0.3 \times F_{\mathrm{count}}$$
where $D_{\mathrm{ratio}}^{a}$ denotes the ratio of incoming to outgoing amounts for the destination node and $S_{\mathrm{ratio}}^{a}$ denotes the ratio of outgoing to incoming amounts for the source node in a transaction record; $D_{\mathrm{ratio}}^{c}$ and $S_{\mathrm{ratio}}^{c}$ denote the corresponding incoming-to-outgoing and outgoing-to-incoming transaction counts. The function $S(\cdot)$ is a non-linear compression function. The weights ultimately populated into the adjacency matrix are
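The interaction feature W_i can be sketched as follows. The text only states that S(·) is a non-linear compression function, so the logistic sigmoid is assumed here; interaction_weight_wi is an illustrative name, and eps guards against zero denominators.

```python
import math

def _sigmoid(x):
    """Assumed form of the compression function S(.)."""
    return 1.0 / (1.0 + math.exp(-x))

def interaction_weight_wi(dst_in_amt, dst_out_amt, src_in_amt, src_out_amt,
                          dst_in_cnt, dst_out_cnt, src_in_cnt, src_out_cnt):
    """Account interaction weight W_i from per-node in/out totals.

    The destination node uses in/out ratios; the source node uses out/in
    ratios, mirroring the InOutRatio / OutInRatio definitions."""
    eps = 1e-9
    d_a = max(dst_in_amt / (dst_out_amt + eps), 0.0)
    s_a = max(src_out_amt / (src_in_amt + eps), 0.0)
    d_c = max(dst_in_cnt / (dst_out_cnt + eps), 0.0)
    s_c = max(src_out_cnt / (src_in_cnt + eps), 0.0)
    f_amount = _sigmoid(d_a) + _sigmoid(s_a)
    f_count = _sigmoid(d_c) + _sigmoid(s_c)
    return 0.7 * f_amount + 0.3 * f_count
```

A node that only absorbs funds (large in/out ratio) pushes W_i toward its upper bound, which is the asymmetry the section attributes to phishing accounts.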
$$A[\mathrm{From}_{\mathrm{index}}, \mathrm{To}_{\mathrm{index}}] = \sum_{(i,j) \in \gamma(\mathrm{From}_{\mathrm{index}}, \mathrm{To}_{\mathrm{index}})} \omega_{i,j}$$
where $\gamma(\mathrm{From}_{\mathrm{index}}, \mathrm{To}_{\mathrm{index}})$ denotes the set of all transactions between accounts $\mathrm{From}_{\mathrm{index}}$ and $\mathrm{To}_{\mathrm{index}}$. Consequently, the adjacency matrix elements reflect the transaction amounts and Gas price weights between two accounts, the overall transaction frequency, and the characteristics of interactions between different account types. The resulting adjacency matrix serves as input to the graph convolutional network module, enabling the model to capture structural relationships within the directed graph.
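Putting the pieces together, the fused per-transaction weights are accumulated into a sparse adjacency structure; the helper name accumulate_edge_weights is illustrative.

```python
from collections import defaultdict

def accumulate_edge_weights(edges):
    """Sum fused weights over all transactions sharing a (from, to) pair.

    edges: iterable of (from_idx, to_idx, w_e, w_n, w_i) tuples, where the
    three weight components follow Section 3.2.1. Returns a sparse
    adjacency {(i, j): accumulated weight} using the 0.5/0.2/0.3 fusion."""
    A = defaultdict(float)
    for i, j, w_e, w_n, w_i in edges:
        A[(i, j)] += 0.5 * w_e + 0.2 * w_n + 0.3 * w_i
    return dict(A)
```

Using a dictionary rather than a dense n-by-n matrix is a practical necessity at the dataset's scale (2.5 million nodes), since the dense zero matrix of the exposition would not fit in memory.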

3.2.2. Graph-Based Representation Module

Within the GCN module, the adjacency matrix undergoes feature extraction through multiple convolutional layers, as illustrated in Figure 4. The formula for the convolutional operation in each layer is
$$Y^{(i+1)} = \sigma\!\left(\tilde{X}^{-\frac{1}{2}} \tilde{A} \tilde{X}^{-\frac{1}{2}} Y^{(i)} W^{(i)}\right)$$
where $Y^{(i)}$ denotes the feature matrix at the $i$th layer, $\tilde{A} = A + I$ represents the adjacency matrix with self-loops added, $\tilde{X}$ denotes the degree matrix of $\tilde{A}$, $W^{(i)}$ denotes the weight matrix for the $i$th layer, and $\sigma(\cdot)$ denotes a non-linear activation function. In graph convolution operations, features are fused with the original graph structural features through linear transformations whilst feature representations from multiple adjacency matrices are integrated via weighted aggregation strategies. To prevent overfitting, Dropout regularization (Dropout = 0.2) is incorporated during feature transformation. Dimension mapping is ultimately performed through a fully connected layer, yielding a final output feature as a three-dimensional tensor, as expressed by the following formula:
$$E_{\mathrm{GCN}} = \mathrm{Linear}(Y_n, \mathrm{Linear}(\mathrm{Dropout}(A_1), \mathrm{dim}))$$
where Y n denotes the transposed sum-of-adjacency matrix while A 1 represents the matrix after feature transformation. This framework encodes graph structural information into low-dimensional feature representations, providing a robust feature foundation for subsequent classification learning tasks.
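The layer-wise propagation rule can be illustrated with a small NumPy sketch. It implements only the symmetric normalization the formula describes; the multi-matrix weighted aggregation and the final Linear/Dropout mapping are omitted.

```python
import numpy as np

def gcn_layer(A, Y, W, activation=np.tanh):
    """One propagation step Y^(i+1) = sigma(X^-1/2 (A+I) X^-1/2 Y W),
    where X is the degree matrix of the self-looped adjacency."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops: A~ = A + I
    deg = A_tilde.sum(axis=1)                  # degrees of the self-looped graph
    X_inv_sqrt = np.diag(1.0 / np.sqrt(deg))   # X~^(-1/2)
    return activation(X_inv_sqrt @ A_tilde @ X_inv_sqrt @ Y @ W)
```

Stacking several such layers lets each node aggregate information from increasingly distant transaction neighbors.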

3.3. BERT Semantic Information Module

The second branch in Figure 2 comprises a BERT encoder based on transaction semantic information. This module converts discrete transaction attributes into continuous vector representations rich in semantic content. By leveraging the semantic understanding capabilities of the pre-trained BERT model, it captures dynamic behavioral patterns and long-term dependencies within transaction sequences. As the raw data within the transaction network contains non-essential information such as timestamps, sender addresses, and recipient addresses, this paper removes these fields to simplify the data structure for textual analysis. Because the BERT model can be trained on sequences in arbitrary order, this paper shuffles the transaction list of each unique account address. This operation disrupts the temporal sequence of transactions, encouraging the model to focus on transactional content features and thereby mitigating potential noise interference. Phishing nodes often conceal malicious activity by disguising multiple transactions. To ensure effective phishing detection, this paper assigns the label tag = 1 to addresses confirmed as phishing nodes, recording this tag in the first transaction entry of each address's transaction list. This annotation helps the model learn account risk features more accurately, thereby enhancing overall detection performance.
When generating input text data for the BERT model, this paper converts each account’s transaction records into single-line descriptive texts. Each transaction’s label, transaction amount, gas price, and n-gram time difference are combined and summarized into concise textual representations. The resulting concise text dataset is partitioned into training, validation, and test sets in an 8:1:1 ratio. Specific features are detailed in Table 4:
To ensure data conforms to the format required by the semantic representation model, we first read the generated training and validation set files and randomly shuffle the data order. This prevents the model from becoming dependent on specific data sequences and overlooking textual content. Similarly, we perform random shuffling on the test set. To generate the input corpus and supervision signals required for subsequent model training, we merge the shuffled training, validation, and test sets into a single data file. From this, we extract two key features: the transaction text description corpus and account labels. The transaction text description corpus captures the account’s transactional behavior, while the account labels distinguish whether the account is involved in illicit phishing activities.
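The text-generation, shuffling, and 8:1:1 partitioning steps can be sketched as follows. The field names and the single-line layout are illustrative assumptions, not the exact format used by the authors.

```python
import random

def transactions_to_text(address, txs, phishing_labels):
    """Render one account's transaction list as a single-line description.
    The paper retains label, amount, gas price and n-gram time difference;
    the concrete field layout below is hypothetical."""
    tag = 1 if address in phishing_labels else 0
    parts = []
    for i, tx in enumerate(txs):
        prefix = f"tag={tag} " if i == 0 else ""   # label only on the first entry
        parts.append(f"{prefix}amount={tx['amount']} gas={tx['gas_price']} "
                     f"dt={tx['time_diff']}")
    return " | ".join(parts)

def split_8_1_1(samples, seed=42):
    """Shuffle and partition samples into train/val/test at an 8:1:1 ratio."""
    rng = random.Random(seed)
    data = list(samples)
    rng.shuffle(data)
    n = len(data)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]
```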
During tokenization, all characters are converted to lowercase and undergo Unicode normalization. Subsequently, the WordPiece tokeniser from the BERT model is employed to convert the preprocessed text into a consistent sequence of subword tokens, ensuring token consistency and reducing lexical redundancy. To provide input for the text processing model's embeddings, the tokenised sequences are converted into token IDs for subsequent training. Finally, we align the segmented sentences with the annotated data to serve as the supervision signal for the supervised learning process.
The input to the BERT module comprises the processed transaction information text data. Following data cleansing and WordPiece segmentation, these data are converted into a token sequence, generating token IDs, position IDs, and token type IDs. These are subsequently processed through the BERT model’s word embeddings, position embeddings, and token type embeddings, respectively, as per the following formula:
$$E_B = \mathrm{Drop}(E_{\mathrm{word}} + E_{\mathrm{Position}} + E_{\mathrm{TokenType}}),$$
$$E_{\mathrm{BERT}} = \mathrm{LayerNorm}_1(E_B)$$
This module ultimately outputs a three-dimensional feature tensor. The framework simplifies stablecoin transaction information on the blockchain into a list of transaction details headed by node addresses, thereby acquiring valid local features on independent nodes.
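The embedding-sum and LayerNorm computation above can be sketched numerically. This is a simplified NumPy version: the learnable LayerNorm scale/shift parameters are omitted, and dropout is shown in its inverted, train-only form.

```python
import numpy as np

def bert_input_embedding(E_word, E_pos, E_type, drop_p=0.1, training=False, eps=1e-12):
    """E_B = Drop(E_word + E_Position + E_TokenType); E_BERT = LayerNorm(E_B).
    Dropout is inactive at inference; LayerNorm normalizes the last axis."""
    E_B = E_word + E_pos + E_type
    if training:                                  # inverted dropout, train time only
        mask = (np.random.rand(*E_B.shape) >= drop_p)
        E_B = E_B * mask / (1.0 - drop_p)
    mu = E_B.mean(axis=-1, keepdims=True)
    var = E_B.var(axis=-1, keepdims=True)
    return (E_B - mu) / np.sqrt(var + eps)        # per-token normalization
```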

3.4. Soft Prompt Encoder Based on Account Interaction Features

We employ a GCN model to capture transaction correlation information across the entire directed graph and a BERT model to process node-level semantic information. However, these two branches overlook the proportion of incoming and outgoing interactions over the lifecycle of different account nodes, as well as the role characteristics of accounts within the stablecoin trading network. To address this limitation, this paper introduces a soft-prompt encoder mechanism, which maps numerical prior information about account interactions into learnable prompt vectors via a multi-layer perceptron, achieving deep integration of numerical and semantic features. The pseudocode of the soft prompt encoder is given in Algorithm 1.
Algorithm 1 Vocab Graph Convolution (Soft Prompt Encoder)
Require: Adjacency matrices $A = \{A_1, \ldots, A_K\}$ where $A_i \in \mathbb{R}^{V \times V}$; input features $X_{dv}$; dropout rate $p$
Ensure: Encoded graph embeddings $H_{\mathrm{out}} \in \mathbb{R}^{V \times D_{\mathrm{out}}}$
1: $K \leftarrow \mathrm{len}(A)$
2: $H_{\mathrm{aggr}} \leftarrow 0$
3: for $i \leftarrow 0$ to $K-1$ do
4:   $W_i \in \mathbb{R}^{V \times D_{\mathrm{hid}}}$
5:   $Z_i \leftarrow \mathrm{MatMul}(A_i, W_i)$
6:   $H_i \leftarrow \mathrm{Linear}(Z_i)$
7:   $H_i \leftarrow \mathrm{ReLU}(H_i)$
8:   $H_{\mathrm{aggr}} \leftarrow H_{\mathrm{aggr}} + H_i$
9: end for
10: $H_{\mathrm{out}} \leftarrow \mathrm{Dropout}(H_{\mathrm{aggr}}, p)$
11: if linear mapping required then
12:   $H_{\mathrm{res}} \leftarrow \mathrm{Linear}(X_{dv})$
13:   $H_{\mathrm{out}} \leftarrow H_{\mathrm{out}} + H_{\mathrm{res}}$
14: end if
15: return $H_{\mathrm{out}}$
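A runnable NumPy sketch of Algorithm 1 follows. The Linear layers are folded into randomly initialized weight matrices for illustration, so this is a shape-level sketch rather than a trained module.

```python
import numpy as np

def vocab_graph_convolution(A_list, X_dv=None, d_hid=8, p=0.2, training=False, seed=0):
    """Sketch of Algorithm 1: each adjacency matrix is projected, passed
    through ReLU, and summed; dropout and an optional residual linear
    mapping of the raw features follow."""
    rng = np.random.default_rng(seed)
    H_aggr = 0.0
    for A in A_list:
        W = rng.standard_normal((A.shape[1], d_hid)) * 0.1   # per-graph W_i
        Z = A @ W                                            # Z_i = A_i W_i
        H = np.maximum(Z, 0.0)                               # ReLU (linear layer folded into W)
        H_aggr = H_aggr + H
    if training:                                             # dropout only at train time
        mask = rng.random(H_aggr.shape) >= p
        H_aggr = H_aggr * mask / (1.0 - p)
    if X_dv is not None:                                     # optional residual branch
        W_res = rng.standard_normal((X_dv.shape[1], d_hid)) * 0.1
        H_aggr = H_aggr + X_dv @ W_res
    return H_aggr
```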

Account Interaction Feature Extraction

Within stablecoin transaction networks, distinct interaction patterns emerge across different account nodes. Legitimate accounts, characterized by long-term stable transaction behavior, typically exhibit a deposit-to-withdrawal ratio approaching 1:1. Conversely, phishing accounts, operating to raise and transfer funds, often record deposit frequencies several times higher than withdrawals, with ratios reaching 3:1 or even higher. To capture these critical account interaction characteristics, this paper designs a 13-dimensional numerical feature vector, as detailed in Table 5.
The core concept of the soft prompt encoder is to transform numerical prior information into prompt vectors that align with the dimensions of BERT semantic embeddings. The soft prompt encoder designed in this paper adopts a two-layer fully connected neural network architecture, implemented as follows:
$$P = K_2 \cdot \mathrm{ReLU}(K_1 \cdot x + b_1) + b_2$$
where $x \in \mathbb{R}^{13}$ denotes the input 13-dimensional account interaction feature vector, $K_1 \in \mathbb{R}^{d_h' \times 13}$ and $K_2 \in \mathbb{R}^{(N_p \cdot d_h) \times d_h'}$ denote the weight matrices of the two layers, $d_h'$ is the hidden layer dimension of the first layer (typically set to $2/3$ of the BERT hidden size), $N_p$ is the number of prompt vectors, and $d_h$ is the hidden dimension of the BERT model. The first layer employs the ReLU activation function for non-linear transformation; the second layer maps the features into a concatenated form of $N_p$ prompt vectors.
The output of the soft prompt encoder is $P = [p_1; p_2; \ldots; p_{N_p}] \in \mathbb{R}^{N_p \times d_h}$, where each prompt vector $p_i$ ($i = 1, \ldots, N_p$) has the same dimension as the BERT word embeddings. These prompt vectors are aggregated via mean pooling to generate the prompt summary vector:
$$E_{\mathrm{Soft}} = \frac{1}{N_p} \sum_{i=1}^{N_p} p_i$$
where E Soft R d h represents the prompt summary vector, which is subsequently utilized in the dynamic gated fusion process. Soft prompt encoders automatically learn optimal mappings from numerical features to semantic spaces through end-to-end training. This enables the model to dynamically adjust the importance weights of different information sources based on an account’s specific interaction patterns, effectively resolving the technical challenge of deeply integrating numerical and textual features that plagues traditional methods.
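A minimal NumPy sketch of the two-layer encoder and mean pooling, under the stated 2/3-of-hidden-size assumption; the weight initialization is illustrative rather than the trained values.

```python
import numpy as np

class SoftPromptEncoder:
    """Two-layer MLP mapping the 13-d interaction vector x to N_p prompt
    vectors of BERT hidden size d_h, then mean-pooling into E_Soft."""
    def __init__(self, d_in=13, d_h=768, n_prompts=4, seed=0):
        rng = np.random.default_rng(seed)
        d_hidden = (2 * d_h) // 3                       # ~2/3 of the BERT hidden size
        self.K1 = rng.standard_normal((d_hidden, d_in)) * 0.02
        self.b1 = np.zeros(d_hidden)
        self.K2 = rng.standard_normal((n_prompts * d_h, d_hidden)) * 0.02
        self.b2 = np.zeros(n_prompts * d_h)
        self.n_prompts, self.d_h = n_prompts, d_h

    def __call__(self, x):
        h1 = np.maximum(self.K1 @ x + self.b1, 0.0)     # ReLU(K1 x + b1)
        P = (self.K2 @ h1 + self.b2).reshape(self.n_prompts, self.d_h)
        return P.mean(axis=0)                           # E_Soft: mean over prompts
```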

3.5. Three-Way Gate Control Mechanism

3.5.1. Gate-Controlled Network Design

The gated network employs a two-layer fully connected neural network architecture, with inputs comprising concatenated embeddings from three sources and outputs consisting of three weight distribution channels. The specific implementation is as follows:
$$g = W_2 \cdot \mathrm{ReLU}(W_1 \cdot [E_{\mathrm{Soft}}; E_{\mathrm{GCN}}; E_{\mathrm{BERT}}] + b_1) + b_2$$
Among these, $[E_{\mathrm{Soft}}; E_{\mathrm{GCN}}; E_{\mathrm{BERT}}] \in \mathbb{R}^{3d_h}$ represents the concatenated embedding vector, $g \in \mathbb{R}^{3}$ denotes the raw three-channel output of the gated network, $W_1$ and $W_2$ denote the weight matrices of the two fully connected layers, respectively, while $b_1$ and $b_2$ denote the corresponding bias vectors. We employ Diffsoftmax to support smooth switching between soft and hard gating, defined as follows:
$$w_{\mathrm{soft}} = \mathrm{Softmax}\!\left(\frac{g}{\tau}\right)$$
$$w_{\mathrm{hard}} = \mathrm{one\_hot}(\mathrm{argmax}(g))$$
$$w = \begin{cases} w_{\mathrm{hard}}, & \text{if } \mathrm{hard} = \text{True} \\ w_{\mathrm{soft}}, & \text{if } \mathrm{hard} = \text{False} \end{cases}$$
where $\mathrm{one\_hot}$ denotes one-hot encoding, which converts a category index into a one-hot vector, while $\mathrm{argmax}$ returns the index of the maximum value within the array. $\tau$ is the temperature parameter controlling the smoothness of the Softmax output; smaller $\tau$ values yield steeper distributions. The hard parameter determines whether hard gating is employed: hard gating retains only the information source with the highest weight, whereas soft gating preserves the weighted combination of all information sources. The pseudocode of the gating mechanism is given in Algorithm 2.
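The gate computation and soft/hard switching can be sketched as follows; the weight matrices are supplied by the caller here, standing in for the trained gate parameters.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-d array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gate_weights(e_soft, e_gcn, e_bert, W1, b1, W2, b2, tau=1.0, hard=False):
    """Three-way gate: g = W2 ReLU(W1 [E_Soft; E_GCN; E_BERT] + b1) + b2,
    followed by temperature softmax (soft) or one_hot(argmax(g)) (hard)."""
    x = np.concatenate([e_soft, e_gcn, e_bert])   # concatenated 3*d_h input
    g = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2    # raw three-channel gate logits
    if hard:
        w = np.zeros_like(g)
        w[np.argmax(g)] = 1.0                     # keep only the strongest source
        return w
    return softmax(g / tau)                       # smooth weight distribution
```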
Algorithm 2 GCN-BERT-SoftPrompt Fusion Mechanism (Gating)
Require: Input IDs $T \in \mathbb{R}^{B \times L}$; graph adjacency $A$; soft-prompt features $P \in \mathbb{R}^{B \times D_p}$; learnable weights $w_{\mathrm{GCN}}, w_{\mathrm{BERT}}, w_{\mathrm{Soft}}$
Ensure: Fused hidden states $E_{\mathrm{final}} \in \mathbb{R}^{B \times L \times H}$
1: $E_{\mathrm{GCN}} \leftarrow \mathrm{VocabGraphConvolution}(A)$
2: $E_{\mathrm{BERT}} \leftarrow \mathrm{BERT.Embeddings}(T)$
3: $E_{\mathrm{GCN}} \leftarrow \mathrm{Unsqueeze}(E_{\mathrm{GCN}}, 0)$
4: $E_{\mathrm{GCN}} \leftarrow \mathrm{Expand}(E_{\mathrm{GCN}}, B, L, H)$
5: $E_{\mathrm{Soft}} \leftarrow \mathrm{SoftPromptEncoder}(P)$
6: $E_{\mathrm{Soft}} \leftarrow \mathrm{View}(E_{\mathrm{Soft}}, B, N_p, H)$
7: $E_{\mathrm{Soft}} \leftarrow \mathrm{Mean}(E_{\mathrm{Soft}}, \mathrm{dim}=1)$
8: $E_{\mathrm{Soft}} \leftarrow \mathrm{Expand}(E_{\mathrm{Soft}}, B, L, H)$
9: $\alpha \leftarrow w_{\mathrm{GCN}}$
10: $\beta \leftarrow w_{\mathrm{BERT}}$
11: $\gamma \leftarrow w_{\mathrm{Soft}}$
12: $E_{\mathrm{fused}} \leftarrow \alpha \cdot E_{\mathrm{GCN}} + \beta \cdot E_{\mathrm{BERT}} + \gamma \cdot E_{\mathrm{Soft}}$
13: $E_{\mathrm{fused}} \leftarrow \mathrm{LayerNorm}(E_{\mathrm{fused}})$
14: $E_{\mathrm{fused}} \leftarrow \mathrm{Dropout}(E_{\mathrm{fused}})$
15: $H_{\mathrm{enc}} \leftarrow \mathrm{BERT.Encoder}(E_{\mathrm{fused}})$
16: $E_{\mathrm{final}} \leftarrow \mathrm{BERT.Pooler}(H_{\mathrm{enc}})$
17: return $E_{\mathrm{final}}$

3.5.2. Feature Fusion

Based on gated weights, the three-channel feature fusion process is as follows:
$$E_{\mathrm{weight}} = \alpha \cdot E_{\mathrm{BERT}} + \beta \cdot E_{\mathrm{GCN}} + \gamma \cdot E_{\mathrm{Soft}}$$
$$E_{\mathrm{fused}} = \omega_1 \cdot E_{\mathrm{BERT}} + \omega_2 \cdot E_{\mathrm{GCN}} + \omega_3 \cdot E_{\mathrm{weight}}$$
where α , β , γ denote learnable parameters representing the fusion ratios for the base embeddings while ω 1 , ω 2 , ω 3 denote the gating weights for the corresponding three feature streams, satisfying ω 1 + ω 2 + ω 3 = 1 .
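The two-stage fusion can be sketched as follows. Normalizing the gating weights to enforce $\omega_1 + \omega_2 + \omega_3 = 1$ is an assumption about how the constraint is realized; a softmax over raw gate logits would serve equally well.

```python
import numpy as np

def two_stage_fusion(E_bert, E_gcn, E_soft, alpha, beta, gamma, omega):
    """E_weight = a*E_BERT + b*E_GCN + g*E_Soft, then
    E_fused = w1*E_BERT + w2*E_GCN + w3*E_weight, with w summing to 1."""
    omega = np.asarray(omega, dtype=float)
    omega = omega / omega.sum()                  # enforce w1 + w2 + w3 = 1
    E_weight = alpha * E_bert + beta * E_gcn + gamma * E_soft
    return omega[0] * E_bert + omega[1] * E_gcn + omega[2] * E_weight
```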

3.6. Classification Module

The node classification module employs a classical fully connected neural network architecture comprising pooling layers, Dropout regularization layers, and a linear classifier. This module receives the output feature vector from the three-channel gate fusion, undergoes non-linear transformation and dimensionality mapping, and ultimately outputs the probability distribution for each account category. The forward propagation process of the node classification module is as follows:
$$H_{\mathrm{dropout}} = \mathrm{Dropout}(\mathrm{Tanh}(W_{\mathrm{pool}} \cdot E_{\mathrm{fused}} + b_{\mathrm{pool}}))$$
where W pool denotes the weight matrix of the pooling layer, b pool represents the bias vector of the pooling layer, Tanh serves as the activation function, and Dropout is employed for regularization. The final output is obtained through a linear classifier:
$$Z = W_{\mathrm{cls}} \cdot H_{\mathrm{dropout}} + b_{\mathrm{cls}}$$
where W cls denotes the weight matrix of the classifier whilst b cls represents the bias vector of the classifier, with a dimension of 2. During the inference phase, the node classification module outputs the raw logits vector Z. To convert this into a probability distribution, the Softmax function must be applied, with the final category prediction determined via argmax:
$$\tilde{y} = \mathrm{argmax}(\mathrm{Softmax}(Z))$$
where y ˜ denotes the predicted category label. The argmax operation locates the category index corresponding to the element with the highest numerical value within the probability distribution, which is then adopted as the final prediction result.
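The classification head's forward pass can be sketched as follows; dropout is omitted since it is inactive at inference.

```python
import numpy as np

def classify(E_fused, W_pool, b_pool, W_cls, b_cls):
    """Pooling + Tanh, linear classifier, softmax, and argmax for the
    predicted label (dropout skipped, as at inference time)."""
    h = np.tanh(W_pool @ E_fused + b_pool)       # pooled, squashed representation
    z = W_cls @ h + b_cls                        # raw 2-class logits Z
    p = np.exp(z - z.max())
    p = p / p.sum()                              # softmax probabilities
    return int(np.argmax(p)), p                  # predicted label and distribution
```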

4. Experimental Evaluation

4.1. Experimental Techniques and Evaluation Metrics

To effectively train the BERTSC model and address the class imbalance issue in stablecoin phishing detection, we employ a weighted cross-entropy loss function [30] for the classification task, formulated as follows:
$$\mathrm{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ w_1 \, y_i \log(p_i) + w_0 \, (1 - y_i) \log(1 - p_i) \right]$$
where $N$ denotes the batch size, $y_i$ represents the true label, $p_i$ signifies the predicted probability, and $w_1$ and $w_0$ are the class weights assigned to the phishing and normal classes to counteract class imbalance. Based on this, this paper employs the AdamW optimizer for optimization. This optimizer combines Adam's adaptive learning rate with L2-style regularization achieved through decoupled weight decay. The AdamW update rule is as follows:
$$x_{t+1} = x_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda x_t \right)$$
where $\hat{m}_t$ and $\hat{v}_t$ denote the bias-corrected first-order and second-order gradient moments, respectively, $\eta$ is the learning rate, $\lambda$ is the weight-decay coefficient, and $\epsilon$ is a small constant introduced to prevent division by zero. To evaluate the model's accuracy, we employ Precision, Recall, F1-score, ROC-AUC, PR-AUC, and False Positive Rate (FPR) to validate its predictive performance. The metrics are defined as follows:
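The weighted cross-entropy objective can be sketched as follows. The per-class weight parameterization (one weight for each class) is a common realization and is assumed here; with both weights at 1 it reduces to plain binary cross-entropy.

```python
import math

def weighted_bce(y_true, p_pred, w_pos=1.0, w_neg=1.0, eps=1e-12):
    """Weighted binary cross-entropy: w_pos/w_neg counter the phishing/normal
    class imbalance; reduces to standard BCE when both weights are 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)          # clamp for numerical safety
        total += w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```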
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{ROC\text{-}AUC} = \int_0^1 TPR(FPR) \, dFPR$$
$$\mathrm{PR\text{-}AUC} = \int_0^1 \mathrm{Precision}(\mathrm{Recall}) \, d\mathrm{Recall}$$
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
TP denotes the number of samples correctly predicted as phishing accounts, while TN denotes the number of samples correctly predicted as normal accounts. FP denotes the number of normal samples incorrectly predicted as phishing accounts, and FN denotes the number of phishing samples incorrectly predicted as normal accounts.
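The threshold-based metrics can be computed directly from the confusion counts; a minimal sketch:

```python
def binary_metrics(y_true, y_pred):
    """Precision, Recall, F1 and FPR from a binary confusion matrix,
    treating 1 as the phishing (positive) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, f1, fpr
```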
To further quantify the consistency of ground-truth labels under our data split, we conduct a Kruskal–Wallis test between the training and test sets, yielding H = 0.247 and p = 0.62 (Cohen’s d = −0.024). These results indicate no statistically or practically significant differences, ensuring that the model is trained and evaluated on statistically indistinguishable distributions, thereby avoiding any exploitation of data-specific biases.
Our experimental environment is based on the Ubuntu operating system, utilizing Python version 3.10.18. The development framework relies on CUDA 11.8 and PyTorch 1.12.1. The model training server is equipped with two NVIDIA GeForce RTX 4090 graphics cards, providing 48 GB of graphics memory in total. Throughout the experiments, all models employ the AdamW optimizer. The training parameters are set as follows: 40 epochs, a batch size of 16, and a dropout rate of 0.2. The specific hyperparameters for this experiment are shown in Table 6.
We conducted a comprehensive runtime and scalability analysis. Under the aforementioned hardware configuration, the end-to-end training time for BERTSC is approximately 25 min per epoch. Notably, while traditional models often require 40 or more epochs to converge, BERTSC utilizes an early stopping strategy that achieves optimal performance within 9 epochs, significantly optimizing the total computational cost. Once the model parameters are initialized, the GPU memory consumption remains stable at approximately 15.57 GB, which is well within the capacity of mainstream industrial-grade graphics cards. The runtime per epoch and cumulative runtime are shown in Figure 5.
Furthermore, the inference time for a single batch is kept at a sub-second level, supporting high-throughput transaction monitoring. To assess scalability, we analyzed the model’s performance under varying data densities through oversampling and undersampling. The results indicate that the computational complexity of the tri-modal fusion and the self-designed adjacency weighting scheme scales linearly with the number of nodes and edges. This linear scalability ensures that BERTSC can effectively handle the continuous expansion of Ethereum stablecoin transaction networks without exponential increases in resource consumption.

4.2. Comparative Analysis

In this paper, we compare BERTSC against random-walk-based graph embedding methods, graph neural network models, and mainstream baseline models from recent years.
  • Graph embedding methods based on random walks include DeepWalk [12], Trans2Vec [13], Diff2Vec [17], and Role2Vec [18]: DeepWalk generates node sequences through random walks on graphs, utilising skip-gram models to learn low-dimensional node representations, and stands as a classic approach in graph embedding; Trans2Vec makes walks more inclined to follow the semantic and temporal weights of transaction relationships. Diff2Vec extracts sequences from subgraphs for representation learning; Role2Vec enables nodes with similar functions to obtain proximate representations.
  • Graph neural network-based methods include GCN [7], GAT [26], and GSAGE [31]: GCN employs graph convolution operations for message propagation, updating node representations by aggregating neighbouring node information. GAT introduces an attention mechanism to compute weights between nodes, enabling adaptive learning of neighbouring node importance. GSAGE adopts a sampling and aggregation strategy, capable of processing large-scale graph data while supporting inductive learning.
  • Mainstream baseline models include BERT4ETH [32], ETH-GBERT [27], TGN [33], and TLMG4Eth [34]: BERT4ETH applies the BERT model to Ethereum transaction data, learning semantic representations of transaction sequences through a pre-trained language model. ETH-GBERT combines a hybrid model of graph neural networks and BERT, enhancing BERT's semantic understanding capabilities through graph structural information. TGN proposes a generic framework that represents dynamic graphs as sequences of timed events, combining memory modules with graph-based operators to capture temporal dynamics. TLMG4Eth integrates a transaction language model with graph representation learning, fusing semantic embeddings from transaction sentences with similarity and structural features for Ethereum fraud detection.
In the comparative experiments, all baseline models were evaluated according to their original configurations as specified in their respective papers, ensuring a level playing field for performance comparisons. Our evaluations focused on key metrics such as Precision, Recall, and F1-scores, with detailed performance results presented in Table 7.
The proposed BERTSC model achieves state-of-the-art performance on the stablecoin phishing detection task, attaining a Precision of 89.90%, a Recall of 89.47%, and an F1-score of 89.59%. Compared to the strongest baseline model, ETH-GBERT, our model demonstrates improvements of 4.96% (89.90% vs. 84.94%) in Precision, 3.60% (89.47% vs. 85.87%) in Recall, and 4.23% (89.59% vs. 85.36%) in F1-score. Furthermore, BERTSC excels in comprehensive ranking metrics, achieving the highest ROC-AUC (94.73%) and PR-AUC (90.43%) while maintaining the lowest False Positive Rate (FPR) of 10.16%, which underscores its superior ability to distinguish phishing accounts with minimal misclassification.
The graph embedding methods based on random walks and traditional GNN-based methods generally exhibit poor performance, with F1-scores mostly below 60%. Their ROC-AUC values hover around 43.12–71.42%, and notably, they suffer from extremely high FPRs. This indicates that while these models can capture some structural information, they struggle to filter out normal accounts, leading to a high volume of false alarms in real-world stablecoin environments.
TGN and TLMG4Eth show a significant performance leap over static graph models. TGN achieves an F1-score of 76.32% and an ROC-AUC of 80.17%, demonstrating the importance of capturing temporal dynamics in transaction events. TLMG4Eth further improves this by integrating transaction language models with graph features, reaching an F1-score of 79.43% and a high ROC-AUC of 91.31%. However, TLMG4Eth’s PR-AUC (82.26%) and FPR (14.79%) still lag behind BERTSC, suggesting that its fusion of semantic and structural features is not as optimized as our proposed architecture.
While BERT4ETH and ETH-GBERT demonstrate competitive performance with ROC-AUC scores exceeding 91.00%, their reliance on uni-modal sequential modeling or limited bi-modal fusion constrains their ability to capture the full spectrum of transaction characteristics. The architectural evolution from BERT4ETH to BERTSC represents a transition from uni-modal sequential modeling to a sophisticated tri-modal fusion paradigm. While BERT4ETH relies exclusively on a Transformer-based encoder to capture sequential semantic patterns, it remains constrained by its inability to model global topological relationships within the transaction network. ETH-GBERT advances this by integrating a graph convolutional component to fuse structural and semantic features; however, its bi-modal fusion is limited by the omission of critical numerical interaction priors. BERTSC overcomes these deficiencies by introducing an original soft-prompt encoder that maps account-level numerical features—such as gas prices and stablecoin interaction ratios—into learnable prompt vectors. Unlike its predecessors, BERTSC employs a self-designed adjacency weighting scheme and a tri-modal gating mechanism that adaptively modulates the contributions of semantics, topology, and numerical priors at a granular level. By synthesizing enhanced graph weights with multi-modal representations, BERTSC achieves a more robust decision boundary, which directly addresses the high False Positive Rates (FPRs) seen in BERT4ETH (16.42%) and ETH-GBERT (13.93%). The various metrics of the BERTSC model are shown in Figure 6.

4.3. Ablation Study

To validate the effectiveness of each component within the BERTSC model, we designed two ablation experiments to assess the contribution of different modules to the model’s performance. By progressively removing key components from the model, we were able to quantify the impact of each module on the final detection performance, thereby verifying the rationality and necessity of the proposed approach.
We evaluated the independent contributions of the three core modules within the BERTSC model. As shown in Table 8, we separately tested the combinations of BERT’s semantic module with the soft-prompt encoder (BERT&SOFT), the graph convolutional network with the soft-prompt encoder (GCN&SOFT), and the graph convolutional network with BERT’s semantic module (GCN&BERT). Experimental results indicate that the BERT&SOFT combination achieved 86.10%, 86.13%, and 86.11% in Precision, Recall, and F1-score, respectively, demonstrating the effectiveness of integrating semantic information with numerical prior knowledge. The GCN&SOFT combination exhibited relatively lower performance, achieving Precision, Recall, and F1-score of 81.34%, 72.63%, and 73.41%, respectively, indicating that relying solely on graph structural information and numerical features struggles to fully capture complex phishing behaviors. The GCN&BERT combination performed better, achieving metrics of 84.94%, 85.87%, and 85.36%, yet still fell short of the full BERTSC model’s 89.90%, 89.47%, and 89.59%. While the BERT&SOFT variant shows competitive discriminative power with an ROC-AUC of 93.52%, it is the full BERTSC model that achieves the optimal balance, particularly in minimizing the False Positive Rate to 10.16%. The significant increase in FPR observed in the GCN&SOFT variant underscores that the absence of deep semantic modeling leads to a substantial rise in misclassifications, confirming the critical role of the three-gate fusion mechanism in ensuring both high precision and detection reliability.
We further validated the efficacy of the adjacency matrix weighting strategy and interaction feature encoder. As shown in Table 9, the adjacency matrix weighting mechanism optimizes the topological relevance of transactions, significantly boosting Precision to 86.75%, ROC-AUC to 94.27%, and PR-AUC to 86.39% by refining the model’s global discriminative power. Furthermore, the soft-prompt encoding of account interaction features serves as a critical regularizer that minimizes the FPR to 10.16%, demonstrating that numerical prior information is essential for distinguishing subtle phishing patterns from legitimate behaviors.
To further mitigate the severe class imbalance inherent in stablecoin transaction data, we integrated the Synthetic Minority Over-sampling Technique (SMOTE) [35] into the soft prompt encoder. As shown in Table 10, the integration of SMOTE significantly optimizes the model’s ability to learn from minority class instances within the account interaction feature space. Specifically, applying SMOTE to the interaction feature module yielded a substantial performance leap, improving the F1-score from 39.52% to 50.02% and reducing the FPR from 73.52% to 56.91%. These results demonstrate that SMOTE effectively balances the decision boundary by oversampling phishing samples.
Under identical data partitioning and architectural settings, we evaluated the impact of three training strategies—original (no sampling), oversampling, and undersampling—on model performance. As demonstrated in Table 11, the original distribution yields slightly higher F1-score and AUC metrics. However, both oversampling and undersampling maintain competitive performance, with F1-scores remaining around 87% and fluctuations in ROC-AUC and PR-AUC restricted within a narrow range of 1–2 percentage points. This stability in global discriminative power and Precision–Recall trade-offs suggests that the proposed model is highly robust to categorical distribution perturbations. Such resilience further confirms that the architecture effectively avoids severe performance degradation or overfitting when subjected to oversampling or undersampling strategies.

5. Discussion

The superior performance of BERTSC over baseline models such as BERT4ETH [32] and ETH-GBERT [27] stems from its specialized tri-modal architecture, which addresses the fundamental limitations of single-source data modeling in stablecoin phishing detection. Our findings demonstrate that while sequential semantics capture behavioral dependencies, they are insufficient for identifying complex gangs without the global structural context provided by our multi-dimensional adjacency weighting scheme. By uniquely synthesizing temporal n-gram intervals and log-scaled transaction volumes into the graph topology, BERTSC captures latent relational anomalies that traditional binary-edge GCNs overlook. This transition from "feature-blind" to "feature-aware" graph construction represents a key methodological shift, enabling the model to distinguish between high-frequency legitimate trading and orchestrated phishing operations.
The integration of the original soft-prompt encoder further enhances the model’s representational expressiveness. Unlike conventional fusion methods that treat numerical features as auxiliary metadata, our framework maps interaction priors directly into the semantic subspace of the pre-trained BERT model. This allows numerical knowledge to guide the attention mechanism, effectively bridging the gap between behavioral semantics and financial attributes. The hierarchical tri-gate fusion mechanism ensures that this multi-source information is not merely concatenated but dynamically calibrated. By re-fusing initial structural and semantic cues with the integrated representation, BERTSC maintains a robust decision boundary even under conditions of severe class imbalance.
Regarding the practical deployment and ethical governance of BERTSC, several critical considerations must be addressed. While the model utilizes public blockchain records, the synthesis of high-dimensional behavioral features increases the risk of inadvertent de-anonymization. Future iterations should incorporate privacy-preserving primitives like zero-knowledge proofs to safeguard the financial pseudonymity of non-malicious users. Wrongful blacklisting in a stablecoin environment could lead to irreversible financial exclusion. Consequently, we propose that BERTSC be implemented as a decision-support component within a “human-in-the-loop” framework. By subjecting model-generated alerts to expert forensic verification, the system balances automated surveillance efficiency with legal accountability, thereby mitigating the risks of algorithmic bias and ensuring a secure, equitable blockchain ecosystem.

6. Conclusions

With the rapid growth of blockchain and cryptocurrency markets, stablecoins, due to their price stability, have become the preferred tool for financial criminals, facilitating USD 51.3 billion in illicit activities, particularly cross-border fund transfers [5]. Their low volatility and wide applicability significantly reduce operational risks and enable seamless illegal transactions. This paper proposes BERTSC, a tri-modal deep learning framework tailored for stablecoin phishing detection. It integrates graph convolutional networks, BERT semantic encoders, and soft-prompt encoders, with a dynamic gated fusion mechanism to adaptively combine structural, semantic, and numerical features. Experiments on large-scale stablecoin datasets show that BERTSC outperforms benchmarks, enabling robust phishing detection in stablecoin networks.

Author Contributions

Conceptualization, W.X., Q.C. and Z.C.; methodology, W.X., Q.C. and K.Z.; formal analysis, W.X.; data curation, W.X. and Q.C.; writing—original draft preparation, W.X.; writing—review and editing, W.X., Q.C., Z.C., K.Z. and C.F.; visualization, W.X.; supervision, Z.C., K.Z. and C.F.; project administration, Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Databases and source codes are available at: https://github.com/UnluckyXwX/BERTSC (accessed on 20 November 2025).

Conflicts of Interest

Author Kexin Zhu was employed by the company Fujian Yuke Information Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Pal, A.; Tiwari, C.K.; Behl, A. Blockchain technology in financial services: A comprehensive review of the literature. J. Glob. Oper. Strateg. Sourc. 2021, 14, 61–80.
  2. Bhowmik, M.; Chandana, T.S.S.; Rudra, B. Comparative study of machine learning algorithms for fraud detection in blockchain. In Proceedings of the 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 8–10 April 2021; pp. 539–541.
  3. Wenhua, Z.; Qamar, F.; Abdali, T.-A.N.; Hassan, R.; Jafri, S.T.A.; Nguyen, Q.N. Blockchain technology: Security issues, healthcare applications, challenges and future trends. Electronics 2023, 12, 546.
  4. Mita, M.; Ito, K.; Ohsawa, S.; Tanaka, H. What is stablecoin? A survey on price stabilization mechanisms for decentralized payment systems. In Proceedings of the 2019 8th International Congress on Advanced Applied Informatics (IIAI-AAI), Toyama, Japan, 7–11 July 2019; pp. 60–66.
  5. Aldasoro, I.; Frost, J.; Lim, S.H.; Perez-Cruz, F.; Shin, H.S. An Approach to Anti-Money Laundering Compliance for Cryptoassets; Bank for International Settlements: Basel, Switzerland, 2025.
  6. Givargizov, I. Unstable Financial and Economic Factors in the World and Their Influence on the Development of Blockchain Technologies. Int. Humanit. Univ. Her. Econ. Manag. 2023, 55.
  7. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
  8. Béres, F.; Seres, I.A.; Benczúr, A.A.; Quintyne-Collins, M. Blockchain is watching you: Profiling and deanonymizing ethereum users. In Proceedings of the 2021 IEEE International Conference on Decentralized Applications and Infrastructures (DAPPS), Online, 23–26 August 2021; pp. 69–78.
  9. Ancelotti, A.; Liason, C. Review of blockchain application with graph neural networks, graph convolutional networks and convolutional neural networks. arXiv 2024, arXiv:2410.00875.
  10. Mahrous, A.; Caprolu, M.; Di Pietro, R. Stablecoins: Fundamentals, Emerging Issues, and Open Challenges. arXiv 2025, arXiv:2507.13883.
  11. Osterrieder, J.; Chan, S.; Chu, J.; Zhang, Y.; Misheva, B.H.; Mare, C. Enhancing security in blockchain networks: Anomalies, frauds, and advanced detection techniques. arXiv 2024, arXiv:2402.11231.
  12. Perozzi, B.; Al-Rfou, R.; Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 701–710.
  13. Wu, J.; Yuan, Q.; Lin, D.; You, W.; Chen, W.; Chen, C.; Zheng, Z. Who are the phishers? Phishing scam detection on ethereum via network embedding. IEEE Trans. Syst. Man Cybern. Syst. 2020, 52, 1156–1166.
  14. Belghith, K.; Fournier-Viger, P.; Jawadi, J. Hui2Vec: Learning transaction embedding through high utility itemsets. In Proceedings of the International Conference on Big Data Analytics, Hyderabad, India, 19–22 December 2022; pp. 211–224.
  15. Ahmed, U.; Srivastava, G.; Lin, J.C.-W. A federated learning approach to frequent itemset mining in cyber-physical systems. J. Netw. Syst. Manag. 2021, 29, 42.
  16. Luo, J.; Qin, J.; Wang, R.; Li, L. A phishing account detection model via network embedding for Ethereum. IEEE Trans. Circuits Syst. II Express Briefs 2023, 71, 622–626.
  17. Rozemberczki, B.; Sarkar, R. Fast sequence-based embedding with diffusion graphs. In Proceedings of the International Workshop on Complex Networks, Santiago de Compostela, Spain, 11–13 December 2018; pp. 99–107.
  18. Ahmed, N.K.; Rossi, R.; Lee, J.B.; Willke, T.L.; Zhou, R.; Kong, X.; Eldardiry, H. Learning role-based graph embeddings. arXiv 2018, arXiv:1802.02896.
  19. Shen, J.; Zhou, J.; Xie, Y.; Yu, S.; Xuan, Q. Identity inference on blockchain using graph neural network. In Proceedings of the International Conference on Blockchain and Trustworthy Systems, Guangzhou, China, 8–10 December 2021; pp. 3–17.
  20. Huang, H.; Zhang, X.; Wang, J.; Gao, C.; Li, X.; Zhu, R.; Ma, Q. PEAE-GNN: Phishing detection on Ethereum via augmentation ego-graph based on graph neural network. IEEE Trans. Comput. Soc. Syst. 2024, 11, 4326–4339.
  21. Zhou, J.; Hu, C.; Chi, J.; Wu, J.; Shen, M.; Xuan, Q. Behavior-aware account de-anonymization on Ethereum interaction graph. IEEE Trans. Inf. Forensics Secur. 2022, 17, 3433–3448.
  22. Chen, Z.; Liu, S.-Z.; Huang, J.; Xiu, Y.-H.; Zhang, H.; Long, H.-X. Ethereum phishing scam detection based on data augmentation method and hybrid graph neural network model. Sensors 2024, 24, 4022.
  23. Farrugia, S.; Ellul, J.; Azzopardi, G. Detection of illicit accounts over the Ethereum blockchain. Expert Syst. Appl. 2020, 150, 113318.
  24. Hu, T.; Liu, X.; Chen, T.; Zhang, X.; Huang, X.; Niu, W.; Lu, J.; Zhou, K.; Liu, Y. Transaction-based classification and detection approach for Ethereum smart contract. Inf. Process. Manag. 2021, 58, 102462.
  25. Li, S.; Gou, G.; Liu, C.; Hou, C.; Li, Z.; Xiong, G. TTAGN: Temporal transaction aggregation graph network for ethereum phishing scams detection. In Proceedings of the ACM Web Conference 2022, Virtual Event, 25–29 April 2022; pp. 661–669.
  26. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
  27. Sheng, Z.; Song, L.; Wang, Y. Dynamic Feature Fusion: Combining Global Graph Structures and Local Semantics for Blockchain Phishing Detection. IEEE Trans. Netw. Serv. Manag. 2025; in press.
  28. Zhang, J.; Sui, H.; Sun, X.; Ge, C.; Zhou, L.; Susilo, W. GrabPhisher: Phishing scams detection in Ethereum via temporally evolving GNNs. IEEE Trans. Serv. Comput. 2024, 17, 3727–3741.
  29. Pan, B.; Stakhanova, N.; Zhu, Z. Ethershield: Time-interval analysis for detection of malicious behavior on ethereum. ACM Trans. Internet Technol. 2024, 21, 1–30.
  30. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
  31. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017.
  32. Hu, S.; Zhang, Z.; Luo, B.; Lu, B.; He, S.; Liu, L. BERT4ETH: A pre-trained transformer for ethereum fraud detection. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 2189–2197.
  33. Rossi, E.; Chamberlain, B.; Frasca, F.; Eynard, D.; Monti, F.; Bronstein, M. Temporal graph networks for deep learning on dynamic graphs. arXiv 2020, arXiv:2006.10637.
  34. Sun, J.; Jia, Y.; Wang, Y.; Tian, Y.; Zhang, S. Ethereum fraud detection via joint transaction language model and graph representation learning. Inf. Fusion 2025, 120, 103074.
  35. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
Figure 1. Typical process of a stablecoin phishing scam via direct transfer of malicious addresses.
Figure 2. Overall architecture of the proposed BERTSC framework. The model takes a heterogeneous transaction graph with multi-dimensional edge features as input, employs a soft prompt encoder to inject numerical priors, and utilizes a dynamic three-way gating mechanism to adaptively fuse graph structural, semantic, and numerical information for stablecoin phishing detection.
Figure 3. Visualization of the constructed directed transaction graph. Phishing nodes are marked in red, normal nodes in blue, and directed edges represent stablecoin transfers on Ethereum, with edge thickness proportional to transaction amount.
Figure 4. Framework diagram of graph convolutional networks.
Figure 5. The training duration line chart of BERTSC.
Figure 6. Line chart of BERTSC results.
Table 1. Node and edge features used in the stablecoin transaction graph.
Type | Features
Node | Tag (phishing label or normal label)
Edge | Timestamp difference (time interval between consecutive transactions); Amount (transferred stablecoin value in USD equivalent); GasPrice (transaction fee rate in Gwei)
Table Note: Node labels are obtained from publicly available phishing address datasets and official reports. Edge features are directly extracted from real stablecoin (USDT/USDC/DAI) transaction records on the Ethereum blockchain, reflecting temporal dynamics, economic scale, and cost characteristics of interactions.
Table 2. Statistical attributes of the constructed stablecoin phishing dataset.
Attribute | Value
Time span | 2017–2025
Number of nodes | 2,529,625
Number of directed edges | 13,071,630
Phishing nodes | 1766
Normal nodes | 2,527,859
Table Note: The dataset consists of real on-chain USDT, USDC, and DAI transactions on Ethereum collected from January 2017 to June 2025. Phishing addresses are aggregated and cross-verified from Etherscan labels, CryptoScamDB, PhishTank, and official security reports.
Table 3. Features used for address dictionary construction and transaction preprocessing.
Processing Type | Description
Address Dictionary Construction | Unique address to integer ID mapping; transaction direction encoding (send = 1, receive = 0); aggregation of historical transaction records per address
Transaction-Level Features | Timestamp difference between transactions (seconds); transaction amount (in USD equivalent); GasPrice (in Gwei); block number of the transaction
Table Note: The address dictionary maps raw Ethereum addresses to consecutive integers to enable efficient graph construction. All transaction records are preprocessed to extract directed edges and multi-dimensional features for input to the heterogeneous graph.
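The address-dictionary construction and direction encoding summarized in Table 3 can be sketched as follows; the tuple schema, field names, and the choice to measure the timestamp difference relative to the sender's previous transaction are illustrative assumptions, not the paper's exact preprocessing pipeline:

```python
def build_address_dictionary(transactions):
    """Map raw addresses to consecutive integer IDs and extract
    directed edges with basic transaction-level features (sketch).

    transactions: iterable of (sender, receiver, amount, timestamp)
    tuples. Returns (addr2id, edges). All field names are assumptions.
    """
    addr2id = {}

    def get_id(addr):
        if addr not in addr2id:
            addr2id[addr] = len(addr2id)  # next consecutive integer ID
        return addr2id[addr]

    edges = []
    prev_ts = {}  # last seen timestamp per sending address
    # Process in chronological order so time deltas are well defined
    for sender, receiver, amount, ts in sorted(transactions, key=lambda t: t[3]):
        src, dst = get_id(sender), get_id(receiver)
        # Timestamp difference w.r.t. the sender's previous transaction
        dt = ts - prev_ts.get(sender, ts)
        prev_ts[sender] = ts
        # direction = 1 from the sender's perspective (send = 1, receive = 0)
        edges.append({"src": src, "dst": dst, "amount": amount,
                      "delta_t": dt, "direction": 1})
    return addr2id, edges
```

The consecutive IDs let the adjacency structure be stored as plain integer arrays, which is what makes graph construction over millions of addresses tractable.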
Table 4. Feature description.
Parameter | Example Value
tag | 0
Amount | 1000
Send | 1
2-gram | 5.3 s
3-gram | 10.25 s
4-gram | 20.13 s
5-gram | 60 s
Table Note: Example values of the features extracted and used in the experiments.
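One plausible reading of the 2-gram through 5-gram timestamp features in Table 4 (an assumption; the main text defines them precisely) is the elapsed time spanned by each sliding window of n consecutive transactions:

```python
def ngram_time_gaps(timestamps, n):
    """Time span covered by each window of n consecutive transactions.

    timestamps: transaction times of one account (any order; sorted here).
    Returns one value per sliding window: last minus first timestamp.
    This interpretation of the paper's n-gram features is an assumption.
    """
    ts = sorted(timestamps)
    return [ts[i + n - 1] - ts[i] for i in range(len(ts) - n + 1)]
```

Under this reading, the example row "2-gram = 5.3 s" would be a gap between two consecutive transactions, and "5-gram = 60 s" the span of five consecutive ones.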
Table 5. Thirteen-dimensional account interaction features.
Feature | Description
in_out_amount_ratio, out_in_amount_ratio | Reflect the account's fund flows
in_out_count_ratio, out_in_count_ratio | Transaction frequency patterns
avg_in_gasprice, avg_out_gasprice | Reflect the transaction priority
log_in_amount, log_out_amount, log_ratio_amount | Logarithmically transformed amount features
counterpart_diversity | Measures the breadth of an account's interactions with different addresses
is_high_in_out_ratio | Marks anomalous patterns of fund flows
is_sink_node | Indicates whether the account is a sink (hub) node
is_source_node | Indicates whether the account is a source node
Table Note: These 13-dimensional statistical features are extracted from historical transaction sequences of each Ethereum address to characterize its behavioral patterns in stablecoin transfers.
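A subset of these interaction features can be computed as in the sketch below; the epsilon smoothing, the exact ratio definitions, and the sink/source criteria are assumptions rather than the paper's precise formulas:

```python
import math

def interaction_features(in_txs, out_txs):
    """Compute several of the 13 account-level interaction features.

    in_txs / out_txs: lists of (counterpart, amount, gasprice) tuples
    for one account's incoming and outgoing transfers. Feature
    definitions here are illustrative assumptions.
    """
    eps = 1e-9  # avoid division by zero for one-sided accounts
    in_amt = sum(a for _, a, _ in in_txs)
    out_amt = sum(a for _, a, _ in out_txs)
    return {
        "in_out_amount_ratio": in_amt / (out_amt + eps),
        "in_out_count_ratio": len(in_txs) / (len(out_txs) + eps),
        "avg_in_gasprice": (sum(g for *_, g in in_txs) / len(in_txs))
                           if in_txs else 0.0,
        "log_in_amount": math.log1p(in_amt),
        # Breadth of interaction: distinct counterparties on both sides
        "counterpart_diversity": len({c for c, *_ in in_txs} |
                                     {c for c, *_ in out_txs}),
        # Sink: only receives funds; source: only sends funds
        "is_sink_node": int(len(out_txs) == 0 and len(in_txs) > 0),
        "is_source_node": int(len(in_txs) == 0 and len(out_txs) > 0),
    }
```

Phishing collectors tend to look like sinks with high counterpart diversity, which is why such indicators are informative for the classifier.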
Table 6. Hyperparameter settings for the Fusion Enhanced ETH-GBert model.
Hyperparameter | Value
Optimizer | BertAdam
Learning rate | 8 × 10^−6
Batch size | 16
Epochs | 9
Weight decay (L2) | 0.001
Dropout rate | 0.2
Warmup proportion | 0.1
Max sequence length | 216
GCN embedding dimension | 16
Number of prompt tokens | 4
Activation function | ReLU
Table 7. Performance comparison of different methods on the stablecoin phishing detection dataset.
Method | Precision | Recall | F1-Score | ROC-AUC | PR-AUC | FPR
DeepWalk | 30.07 | 46.63 | 36.56 | 45.58 | 38.65 | 76.93
Trans2Vec | 49.17 | 52.36 | 49.58 | 57.53 | 50.76 | 63.60
Diff2Vec | 51.84 | 67.10 | 58.43 | 57.95 | 52.42 | 49.84
Role2Vec | 62.35 | 51.22 | 56.27 | 71.42 | 62.80 | 34.82
GCN | 43.70 | 51.72 | 46.23 | 49.75 | 44.27 | 66.39
GAT | 46.76 | 56.23 | 43.57 | 52.68 | 47.11 | 58.13
GSAGE | 35.06 | 42.39 | 38.38 | 43.12 | 37.21 | 64.82
TGN | 75.32 | 77.19 | 76.32 | 80.17 | 77.42 | 26.34
TLMG4Eth | 75.14 | 84.24 | 79.43 | 91.31 | 82.26 | 14.79
BERT4ETH | 78.58 | 75.67 | 77.04 | 91.97 | 80.79 | 16.42
ETH-GBERT | 84.94 | 85.87 | 85.36 | 93.38 | 85.63 | 13.93
BERTSC (ours) | 89.90 | 89.47 | 89.59 | 94.73 | 90.43 | 10.16
Table 8. Ablation study on key modules of BERTSC.
Method | Precision | Recall | F1-Score | ROC-AUC | PR-AUC | FPR
BERT&SOFT | 86.10 | 86.13 | 86.11 | 93.52 | 86.15 | 10.29
GCN&SOFT | 81.34 | 72.63 | 73.41 | 88.68 | 81.62 | 36.72
GCN&BERT | 84.94 | 85.87 | 85.36 | 93.84 | 89.44 | 13.28
BERTSC (ours) | 89.90 | 89.47 | 89.59 | 94.73 | 90.43 | 10.16
Table 9. Effectiveness of adjacency matrix weighting and interaction features.
Method | Precision | Recall | F1-Score | ROC-AUC | PR-AUC | FPR
Baseline | 84.94 | 85.87 | 85.36 | 93.38 | 85.63 | 13.93
Only Weight | 86.75 | 87.72 | 87.19 | 94.27 | 86.39 | 12.70
IF&W | 89.90 | 89.47 | 89.59 | 94.73 | 90.43 | 10.16
Table Note: “IF&W” means Interaction Feature and Weight, which combines the interaction characteristics of different node accounts within the transaction network with the weights of the adjacency matrix.
Table 10. Impact of SMOTE on model performance.
Method | Precision | Recall | F1-Score | ROC-AUC | PR-AUC | FPR
Without SMOTE | 32.67 | 51.15 | 39.52 | 53.69 | 37.41 | 73.52
I&W + SMOTE | 53.15 | 53.36 | 50.02 | 54.00 | 37.92 | 56.91
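Table 10 applies SMOTE [35] to the minority (phishing) class. A minimal SMOTE-style interpolation sketch, assuming Euclidean neighborhoods and uniform interpolation, not the exact experimental pipeline:

```python
import random

def smote(minority, n_new, k=3, seed=42):
    """Generate n_new synthetic minority samples by linear interpolation
    between a random minority sample and one of its k nearest minority
    neighbors (SMOTE-style sketch, not the library implementation)."""
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance between two feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x, excluding x itself
        nbrs = sorted((p for p in minority if p is not x),
                      key=lambda p: dist2(x, p))[:k]
        nb = rng.choice(nbrs)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([xi + lam * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic
```

Because every synthetic point lies on a segment between two real minority samples, oversampling densifies the phishing region of feature space without duplicating exact records.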
Table 11. Impact of different sampling strategies on BERTSC performance.
Method | Precision | Recall | F1-Score | ROC-AUC | PR-AUC | FPR
Oversampling | 88.54 | 86.84 | 87.14 | 93.89 | 89.80 | 15.62
Undersampling | 89.08 | 88.42 | 88.59 | 93.62 | 89.42 | 11.72
Original (no sampling) | 89.90 | 89.47 | 89.59 | 94.73 | 90.43 | 10.16
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xie, W.; Chen, Q.; Zhu, K.; Feng, C.; Chen, Z. BERTSC: A Multi-Modal Fusion Framework for Stablecoin Phishing Detection Based on Graph Convolutional Networks and Soft Prompt Encoding. Electronics 2026, 15, 179. https://doi.org/10.3390/electronics15010179

