Article

Multi-Modal Semantic Fusion for Smart Contract Vulnerability Detection in Cloud-Based Blockchain Analytics Platforms

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(21), 4188; https://doi.org/10.3390/electronics14214188
Submission received: 17 August 2025 / Revised: 12 October 2025 / Accepted: 21 October 2025 / Published: 27 October 2025
(This article belongs to the Special Issue New Trends in Cloud Computing for Big Data Analytics)

Abstract

With the growing demand for trusted computing in big data analytics, cloud computing platforms are reshaping trusted data infrastructure by integrating Blockchain as a Service (BaaS), which uses elastic resource scheduling and heterogeneous hardware acceleration to support petabyte-level, multi-institution secure data exchange in medical, financial, and other fields. As the core hub of data-intensive scenarios, the BaaS platform combines privacy computing with process automation. However, its deep dependence on smart contracts introduces new code-level vulnerabilities that can maliciously contaminate analysis results. Existing detection schemes are limited to a single-source data perspective, making it difficult to capture both global semantic associations and local structural details in a cloud computing environment and leading to bottlenecks in scalability and detection accuracy. To address these challenges, this paper proposes a smart contract vulnerability detection method based on multi-modal semantic fusion for cloud-based blockchain analysis platforms. First, the contract source code is parsed into an abstract syntax tree, and the key code is precisely located based on a predefined vulnerability feature set. Then, the text features and graph structure features of the key code are extracted in parallel and deeply fused. Finally, with attention enhancement, the vulnerability probability is output through a fully connected network. Experiments on Ethereum benchmark datasets show that the detection accuracy of our method for re-entrancy, timestamp, overflow/underflow, and delegatecall vulnerabilities reaches 92.2%, 96.3%, 91.4%, and 89.5%, respectively, surpassing previous methods. Our method also has the potential for practical deployment in cloud-based blockchain service environments.

1. Introduction

With the increasing demand for trusted computing in big data analytics, cloud platforms are realizing the deep collaboration of distributed trust and elastic resources through Blockchain-as-a-Service (BaaS) architecture. Mainstream cloud service providers encapsulate smart contract execution engines as scalable microservices to support the secure sharing of petabytes of data in healthcare, finance, and other fields across institutions. This “cloud-chain collaboration” mode significantly reduces the deployment energy consumption of blockchain nodes and improves the transaction throughput by relying on the cloud-native resource scheduling mechanism, which effectively meets the real-time requirements of big data analysis [1].
In data-intensive scenarios, BaaS platforms play a dual key role: privacy computing hub and process automation core [2]. On the one hand, by integrating zero-knowledge proofs and homomorphic encryption, the platform realizes joint analysis of multi-party data that is "usable but invisible". On the other hand, its smart contract analysis tools support automatic execution of rules such as data authorization and payment splitting, verifying the feasibility of providing efficient contract security services in the cloud environment. However, this deep dependence on smart contracts also introduces new security vulnerabilities: about 23% of DeFi security incidents stem from contract code vulnerabilities (e.g., re-entrancy attacks), resulting in malicious tampering of analysis results.
Existing smart contract vulnerability detection methods mainly rely on static rules, symbolic execution, or expert-crafted feature engineering, and they analyze only Solidity code while ignoring heterogeneous data such as off-chain logs and API documentation, resulting in a lack of cross-data correlation analysis and the fragmentation of multi-modal data. Some multi-modal methods, such as combining expert rules with graph neural networks, effectively improve interpretability and detection accuracy [3], and combining deep learning with graph features enhances automatic detection [4]. In addition, recent studies have systematically evaluated various detection tools, providing a comprehensive survey of smart contract security from the perspectives of data sources, detection methods, and defense mechanisms. Multi-modal detection methods significantly outperform traditional static analysis tools in terms of precision and recall.
Against this backdrop, this paper proposes a multi-modal semantic fusion smart contract vulnerability detection framework for the cloud environment; the core contributions are as follows:
  • Precise extraction of key vulnerability code: This paper parses the source code of smart contracts into an abstract syntax tree (AST) and extracts key vulnerability fragments based on a predefined set of vulnerability features. Each fragment is further analyzed through data-flow and control-flow dependency analysis, generating key vulnerability code that fully retains function, control, and exception scopes.
  • Multi-modal complementary feature extraction: Based on the key vulnerability code, this paper synchronizes it into two complementary modalities of text and image. This mechanism takes into account both the long-term dependencies of the contract code and the fine-grained structural information, breaking through the information bottleneck of a single modality.
  • Multi-modal deep fusion strategy: This paper aligns and concatenates the text and image embedding vectors at the node level, forming a unified representation that simultaneously incorporates global semantics and local structural features. This significantly enhances the model’s ability to express complex vulnerability patterns.

2. Related Work

2.1. Smart Contract Vulnerability Detection Methods

In the early stage of smart contract development, vulnerability detection mainly relied on the following three methods: static analysis, dynamic execution, and formal verification. The static analysis method examines the code without executing it, matching vulnerability patterns through predefined rules. Although this method is highly efficient, its false positive rate usually exceeds 30% [5]. The dynamic analysis method, such as symbolic execution, detects vulnerabilities like integer overflows through path constraint solving. However, its ability to handle complex control flows is limited, resulting in a recall rate lower than 70% [6]. Formal verification tools prove the correctness of the code by converting the contract into a mathematical model. Although they can provide a high level of security guarantee, they require manual specification and high computational overhead, making large-scale application difficult [7]. In recent years, for specific vulnerability types, hybrid static-dynamic methods have become a new trend. Xu et al. proposed a detection framework based on Hyperledger Fabric chain code, which combines static scanning of AST and dynamic verification through symbolic execution. When testing 15 actual projects, this framework was able to identify 13 unknown vulnerabilities, with an accuracy rate of up to 91% [8]. This study shows that the hybrid method has significant advantages in improving the efficiency and accuracy of vulnerability detection.

2.2. Machine-Learning-Driven Vulnerability Detection Technologies

Deep learning significantly enhances the generalization ability of vulnerability detection by automatically learning the semantic features of the code. According to the representation method of the code, existing research can be classified into three categories: sequence models, graph neural networks, and pre-trained models.
Sequence models treat the code as a text sequence and use recurrent neural networks (RNNs) or long short-term memory (LSTM) networks to capture context dependencies. For example, the BLSTM-ATT model proposed by Qian et al. achieved an F1-score of 88.26% in detecting re-entrancy vulnerabilities [9]. However, serialization loses code structure information, resulting in a relatively high false-negative rate in control-flow vulnerability detection.
Graph neural networks model the code structure based on the AST. For example, AST-GCN aggregates the features of syntax tree nodes through graph convolution and achieves an F1-score of 76.2% [10]. However, this method lacks modeling of multi-function dependency relationships. The gated graph neural network proposed by Cai et al. improves the recall of re-entrancy vulnerability detection by 5.7% by enhancing the propagation of cyclic dependencies [11].
Pre-trained models learn multi-language general representations through masked language tasks. For example, CodeBERT achieved an F1-score of 86.7% in the smart contract detection task [12]. However, single-modal models have difficulty balancing grammatical and semantic information. For instance, SmartConDetect uses bidirectional encoders (BERT) to extract code snippet features and achieves an F1-score of 90.9% but ignores the spatial–temporal patterns of bytecode. Jeon et al. convert the opcode sequence into grayscale images and use convolutional neural networks (CNNs) to extract spatial features, but in this process, the semantic associations of variables are lost [13].

2.3. Exploration of Multi-Modal and Graph Representation Learning Fusion

To address the limitations of a single modality, multi-modal fusion has emerged as a promising direction. Image–text dual-modal methods stand out: Tahir et al. map bytecode into RGB images and combine them with opcode-frequency text features, raising re-entrancy detection F1 to 99.07% [14], and the VNT Chain team fuses contract source code text with control flow graph images, aligning the modalities via attention and reporting similar progress. Li et al. proposed a hybrid framework that integrates deep learning detection with knowledge-enhanced large language model repair, significantly improving the efficiency of detecting and repairing re-entrancy vulnerabilities in smart contracts [15]. However, existing fusion methods mostly adopt static weighting strategies, ignoring inter-modal contribution differences and leading to modality preference bias [16]. Moreover, GNNs suffer from neighborhood noise: failure to distinguish child-node importance causes irrelevant code segments to interfere with key vulnerability features [17]. Recently, Huang et al. proposed an interpretable smart contract vulnerability detection model based on a graph isomorphism network, which improved detection accuracy and enhanced the interpretability of prediction results [18]. However, the collaborative optimization of multi-modal fusion and graph attention has not been fully studied.

2.4. Difference from Traditional Software Vulnerability Detection

Traditional software vulnerability detection in software engineering primarily targets issues such as buffer overflows, memory corruption, and pointer misuse in programs written in procedural or object-oriented languages (e.g., C/C++ and Java). These methods typically rely on static code analysis, symbolic execution, or dynamic taint tracking to identify vulnerabilities related to resource misuse or control-flow anomalies. However, smart contracts differ fundamentally from traditional software in both their execution environment and vulnerability characteristics.
First, smart contracts execute within a decentralized and deterministic blockchain environment (e.g., Ethereum Virtual Machine), where all state transitions are recorded immutably on-chain. As a result, vulnerabilities often arise from transaction ordering, re-entrancy, timestamp dependence, and gas exhaustion—issues that have no equivalent in conventional software systems. Second, smart contracts lack system-level APIs and mutable runtime states, meaning that typical memory or I/O-related exploits in traditional software are not applicable. Third, the execution semantics of smart contracts are transaction-driven and publicly visible, introducing attack surfaces related to concurrency, contract interaction, and external call invocation.
Consequently, most traditional software vulnerability detection techniques cannot be directly reused for smart-contract analysis. For instance, static analyzers designed for C/C++ do not consider the transaction-level dependencies or the deterministic execution model of blockchain platforms. In contrast, our proposed Multi-Modal Semantic Fusion (MMSF) framework is specifically designed to capture both the semantic features of contract code and the structural dependencies of its transaction flow. By jointly modeling these two modalities through Transformer and GGNN representations, MMSF effectively addresses blockchain-specific vulnerabilities that fall outside the scope of conventional software engineering approaches.

2.5. Comparative Summary and Innovation Highlights

Existing studies have mainly focused on single-modality or syntax-level vulnerability detection methods, as shown in Table 1, such as symbolic execution tools (e.g., Oyente and Mythril) and graph-based learning frameworks (e.g., TMP and VulHunter). However, these approaches either suffer from limited scalability or rely solely on syntactic or structural information, neglecting semantic dependencies across functions.
In contrast, the proposed framework introduces three key innovations:
(1) Multi-modal semantic fusion: It jointly models textual semantics and structural dependencies, enabling comprehensive vulnerability reasoning across different code views.
(2) Cross-representation attention mechanism: A lightweight attention fusion layer is designed to dynamically weigh heterogeneous features, achieving higher interpretability and robustness than static fusion models used in prior work.
(3) Unified deployment perspective: Unlike prior detection frameworks that focus purely on static analysis, our scheme is designed for scalable integration within Blockchain-as-a-Service (BaaS) platforms, supporting real-time vulnerability scanning in production environments.
These distinctions collectively highlight the novelty of the proposed framework, bridging the gap between semantic representation learning and deployable vulnerability detection systems.

3. System Model

3.1. Model Definition

The smart contract vulnerability detection framework proposed in this paper adopts multi-modal collaborative analysis and semantic enhancement mechanism to construct an end-to-end vulnerability identification system. As shown in Figure 1, this paper firstly performs deep semantic preprocessing on the source code of the smart contract and then uses heterogeneous modal separation and fusion in the feature extraction stage. Finally, the multi-stage linkage neural network architecture is used to identify the vulnerability.
Given a smart contract C as input, it is composed of several functions f_i: C = {f_1, f_2, …, f_n}, where each function f_i contains an ordered sequence of statements s_{ij}, represented as f_i = {s_{i1}, s_{i2}, …, s_{im}}. A set of vulnerability features P = {p_1, p_2, …, p_q} is defined for matching the smart contract code. The vulnerability feature set P is usually closely related to the core functionality or security of the smart contract, such as functions involving fund transfers, permission control logic, and variable operations. For example, for re-entrancy vulnerabilities, the construct call.value related to transaction amounts is often used as a vulnerability feature. For a detailed explanation of the symbols and variables used in this study, please refer to Appendix A.
In this work, cloud infrastructure is employed primarily to enhance the scalability and efficiency of the analytical process rather than merely for data storage. The multi-modal semantic fusion framework involves large-scale feature extraction from both token sequences and graph representations, which requires considerable parallel computing power. By deploying the model on a cloud-based platform, we are able to dynamically allocate GPU and memory resources according to the workload, thereby supporting high-volume batch inference and accelerating model training. The cloud environment also facilitates distributed data management and experiment orchestration, enabling multiple processing nodes to handle semantic and structural streams concurrently. This architecture ensures that our analytics remain scalable, reproducible, and adaptable to different dataset sizes or real-world auditing scenarios.

3.2. Comparison with Existing Multi-Modal Fusion Methods

Unlike previous multi-modal vulnerability detection frameworks that simply concatenate textual and structural features or apply static late fusion mechanisms, our approach employs a cross-representation attention fusion strategy. This design allows dynamic weighting between modalities based on contextual dependencies, ensuring that semantic cues from the Transformer and structural information from the gated graph neural network (GGNN) are jointly optimized during training.
Furthermore, the proposed architecture supports hierarchical fusion, where local and global dependencies are iteratively aligned. Such an adaptive attention mechanism differs fundamentally from static fusion models such as TMP and DMT, which lack fine-grained cross-modal interaction. This distinction underscores the architectural innovation and explains the observed performance gains.

4. Smart Contract Code Vulnerability Detection Method Based on Multi-Modal Semantic Fusion

4.1. Preprocessing of Smart Contract Code

Vulnerabilities in smart contracts exist within one or more specific functions; only some statements are directly related to these vulnerabilities. Therefore, in Section 3.1, the smart contract is divided into smaller ordered statement sequences s_{ij} to provide a more precise execution context, simplify the representation of complex control flows, optimize data flow tracing, and preprocess the smart contract code at a finer granularity.
Firstly, based on the ANTLR (ANother Tool for Language Recognition) tool, the contract source code C is converted into a syntax tree T_AST. T_AST is a tree structure containing N nodes, represented as T_AST = (V_node, E_edge) with |V_node| = N. Each node v_i ∈ V_node carries an attribute tuple v_i = ⟨type, value, line_num⟩, covering elements such as variables, functions, operators, and control flow statements. From the abstract syntax tree (AST), the important elements of each function in the smart contract are extracted, ensuring that the core logic and functionality of the contract are captured.
Next, it is necessary to identify key vulnerability segments from the syntax tree T_AST. For any node v_i ∈ V_node, define the set of statements that match the vulnerability feature set P as the key vulnerability segment V_vkf. This process can be formally represented as

V_vkf = { v_i ∈ V_node | match(v_i, P) = True },

match(v, P) = 1 if ∃ p ∈ P : sim(v.value, p) > τ, and 0 otherwise,

where τ is the similarity threshold.
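As a concrete illustration of the matching predicate, the following Python sketch implements match(v, P) over a small feature set. The paper leaves the similarity function sim open, so the n-gram coverage measure and the threshold value below are assumptions made purely for demonstration, not the paper's implementation.

```python
# Illustrative sketch only: flag an AST node whose value is sufficiently
# similar to any entry in the vulnerability feature set P. Here, sim is the
# fraction of the pattern's character 3-grams found in the node value (an
# assumed stand-in for the unspecified similarity function).

def sim(value: str, pattern: str) -> float:
    grams = lambda s: {s[i:i + 3] for i in range(max(1, len(s) - 2))}
    a, b = grams(value.lower()), grams(pattern.lower())
    return len(a & b) / len(b) if b else 0.0

def match(node_value: str, P: list, tau: float = 0.5) -> bool:
    """match(v, P) = 1 iff there exists p in P with sim(v.value, p) > tau."""
    return any(sim(node_value, p) > tau for p in P)

# Typical indicators named in the text, e.g. call.value for re-entrancy.
P = ["call.value", "delegatecall", "block.timestamp"]
print(match("msg.sender.call.value(amount)()", P))  # True
```

In practice, sim could be replaced by any token- or embedding-level similarity without changing the surrounding algorithm.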
Then, based on the key vulnerability segments, other statements related to them are obtained by tracking data dependencies and control dependencies to trace the propagation path and trigger conditions of potential vulnerabilities. All related statements together form the complete key vulnerability code V_svkf. This process can be formally represented as

V_svkf = V_vkf ∪ { v_j | ∃ v_i ∈ V_vkf : ψ_dep(v_i, v_j) },

where ψ_dep is the dependency judgment function, which covers two categories: data dependencies and control dependencies. Data dependencies are represented as

ψ_data(v_i, v_j) = I[Def(v_i) ∩ Use(v_j) ≠ ∅].

This means that there is a redefinition-free path from v_i to v_j in the execution of the smart contract program, i.e., no other statement assigns a new value to the variables defined in v_i during execution from v_i to v_j.
Control dependencies are represented as

ψ_ctrl(v_i, v_j) = I[v_j ∈ Dominators(v_i)].

This means that the execution result of v_i directly determines whether the basic block containing v_j is executed. In other words, v_j lies on a branch path of v_i, and v_i is the nearest decision point that controls whether v_j is executed.
As shown in Algorithm 1, the key vulnerability code captures the code fragments that may lead to a vulnerability during contract execution but ignores the scope of each statement in the smart contract code. For example, in a re-entrancy vulnerability scenario, an external call function can invoke an arbitrary function on the target address. Since this function directly sends Ether and calls functions in the target contract, if the target contract triggers malicious code immediately after receiving Ether, a re-entrancy attack can occur in which the target contract calls the transfer function again before control returns to the current contract, resulting in duplicate execution of logic or theft of funds. However, the key vulnerability code omits the key semantic structure information related to the vulnerability and does not record the function scope of the transfer function.
To address the issue of missing scope range in key vulnerability code, this paper defines three types of scope statements: function statements, control statements, and exception statements. Check each line of code or statement in the smart contract source code and apply the defined rule set to check whether the statement matches a certain scope.
When a statement with semantic structure information is identified, traverse all statements within that scope. During the traversal process, determine whether there are statements within the semantic structure scope in the key vulnerability code. If there are, add the start and end statements of the semantic structure information statement to the program slice to ensure that the key vulnerability code can fully contain the semantic structure information.
By locating the semantic structure information statements and then inserting the statements that delimit their scope, the key vulnerability code with semantic structure information V_svkf is generated, which not only reflects the dependency relationships of the code but also retains important semantic structure information.
Algorithm 1 Pseudocode of the smart contract code preprocessing algorithm
Require:
smart contract C = {f_1, f_2, …, f_n}
set of vulnerability features P = {p_1, p_2, …, p_q}
similarity threshold τ
Ensure:
key vulnerability code V_svkf
  1: // Step 1: Build an abstract syntax tree
  2: T_AST ← ANTLRParse(C)  {Parse the contract source into an AST}
  3: V_node ← getAllNodes(T_AST)  {Get all nodes}
  4: // Step 2: Extract the key vulnerability segments
  5: V_vkf ← ∅
  6: for each node v_i ∈ V_node do
  7:     if φ_match(v_i, P) = True then
  8:         V_vkf ← V_vkf ∪ {v_i}
  9:     end if
 10: end for
 11: // Step 3: Extend the key vulnerability segments through dependencies
 12: V_svkf ← V_vkf  {Initialize with the key segments}
 13: for each node v_i ∈ V_vkf do
 14:     for each node v_j ∈ V_node do
 15:         if ψ_dep(v_i, v_j) then
 16:             V_svkf ← V_svkf ∪ {v_j}
 17:         end if
 18:     end for
 19: end for
 20: return V_svkf

Helper function definitions:
φ_match(v, P):
    for each p ∈ P do
        if sim(v.value, p) > τ then return True
    end for
    return False
ψ_dep(v_i, v_j):
    return ψ_data(v_i, v_j) ∨ ψ_ctrl(v_i, v_j)
ψ_data(v_i, v_j):
    return I[Def(v_i) ∩ Use(v_j) ≠ ∅]  {no redefinition on the path from v_i to v_j}
ψ_ctrl(v_i, v_j):
    return I[v_j ∈ Dominators(v_i)]  {v_i controls whether v_j executes}
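The dependency-expansion step of Algorithm 1 (Step 3) can be sketched as follows. The toy Node records with Def/Use sets and the dominator map are simplified stand-ins for the analysis the paper performs on the ANTLR-generated AST; a real implementation would derive them from the parsed contract.

```python
# Illustrative sketch of Algorithm 1, Step 3: key segments V_vkf are expanded
# with every node that is data-dependent (Def ∩ Use ≠ ∅) or control-dependent
# on them. Node records here are assumed toy stand-ins for AST statements.

from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    line: int
    defs: frozenset = frozenset()   # variables written by this statement
    uses: frozenset = frozenset()   # variables read by this statement

def psi_data(vi: Node, vj: Node) -> bool:
    return bool(vi.defs & vj.uses)          # I[Def(v_i) ∩ Use(v_j) ≠ ∅]

def psi_ctrl(vi: Node, vj: Node, dominators: dict) -> bool:
    # dominators[vi] = set of nodes whose execution vi decides (toy encoding)
    return vj in dominators.get(vi, set())  # I[v_j ∈ Dominators(v_i)]

def expand(V_vkf, V_node, dominators):
    V_svkf = set(V_vkf)
    for vi in V_vkf:
        for vj in V_node:
            if psi_data(vi, vj) or psi_ctrl(vi, vj, dominators):
                V_svkf.add(vj)
    return V_svkf

# Toy example: line 2 defines `amount`, line 4 uses it in a transfer.
n1 = Node(1, uses=frozenset({"msg.sender"}))
n2 = Node(2, defs=frozenset({"amount"}))
n4 = Node(4, uses=frozenset({"amount"}))
result = expand({n2}, [n1, n2, n4], dominators={})
print(sorted(n.line for n in result))  # [2, 4]
```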

4.2. Feature Extraction of Key Vulnerability Code

To address the issues of large smart contract codebases and low text-scanning efficiency, this paper further converts the key vulnerability code V_svkf with semantic structure information into a graph structure G = (V, E, X) and performs text feature embedding and image feature embedding to standardize similar syntactic structures and semantic relationships. The vertex set is V = V_svkf; the edge set is E = {(v_i, v_j) | path(v_i, v_j) ≤ d_max} with d_max = 3; the vertex feature matrix is X ∈ R^{M×d}; and x_i = Embed(v_i.type) ⊕ Embed(v_i.value), where d = 128 is the embedding dimension and ⊕ denotes vector concatenation.
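For instance, the edge set E with d_max = 3 can be built by a bounded breadth-first search over the statement graph. The chain-shaped adjacency below is a toy example; the paper's actual graph is derived from V_svkf.

```python
# Illustrative construction of the edge set E = {(v_i, v_j) | path(v_i, v_j) ≤ d_max}:
# run a BFS from every vertex and keep all pairs within d_max hops.

from collections import deque

def edges_within(adj: dict, d_max: int = 3) -> set:
    E = set()
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            if dist[u] == d_max:        # stop expanding beyond the hop limit
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        E |= {(src, v) for v, d in dist.items() if 0 < d <= d_max}
    return E

# Toy statement graph: a 1-2-3-4-5 chain.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
E = edges_within(adj)
print((1, 4) in E, (1, 5) in E)  # True False
```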
Next, we input the sequence embedding vectors and graph embedding vectors into the Transformer and gated graph neural network for training to capture the feature representation of the code. Then, we fuse the feature vectors output by these two networks through a fully connected layer into a unified joint feature vector. Finally, we use the joint feature vectors of the two code segments to be detected as input for the similarity detection task, predict the similarity probability, and compare it with the preset threshold to obtain the final similarity detection result.
Let |T| denote the number of code tokens in the contract sequence and |V| the number of nodes in its control/data-flow graph. Throughout this section, i ∈ [1, |T|] indexes semantic tokens, and j ∈ [1, |V|] indexes structural nodes.

4.2.1. Text Feature Extraction

To learn the semantic and syntactic features of the code, this paper uses the Transformer to extract the text features of V s v k f . Additionally, to capture the dependencies between various elements in the sequence, this method uses a self-attention structure. Specifically, we first perform a preorder traversal of the abstract syntax tree to generate a path sequence and map each node to an embedding vector through a unified vocabulary. Then, we input these embedding vectors into a Transformer model containing only an encoder, where each encoder unit includes a bidirectional multi-head self-attention layer and a feedforward neural network layer.
The Transformer is a deep learning model based on the attention mechanism, mainly composed of encoders and decoders. Its core self-attention mechanism allows the model to focus on information at different positions when processing input sequences, thereby effectively alleviating the problem of long-distance dependencies. Self-attention consists of three parts: query, key, and value. Its calculation formula is
Attention(Q, K, V) = SoftMax(QK^T / √d_k) V,

where Q ∈ R^{l×d}, K ∈ R^{l×d}, and V ∈ R^{l×d} are the embedded path-sequence vectors, l is the length of the input sequence, d is the embedding dimension, and d_k is the dimension of each head. To capture more semantic content from the input sequence, this paper uses a multi-head mechanism to implement self-attention. The multi-head mechanism divides the queries, keys, and values into h heads. The self-attention calculation for each head is shown in the formula:
head_i = Attention(Q_i, K_i, V_i) = SoftMax(Q_i K_i^T / √d_k) V_i.
Subsequently, we concatenate all the attention vectors of the heads to obtain the final result of the multi-head self-attention layer:
MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_h) W^O.
To more effectively capture the features in the input sequence, this paper adds a feedforward neural network layer to further process the output of the multi-head attention. The feedforward neural network consists of two linear transformations and one non-linear activation function, whose calculation process is shown in the following formula:
FFN(X) = ReLU(X W_1 + b_1) W_2 + b_2.
Through the above self-attention mechanism and feedforward neural network, the internal relationships of the input sequence vectors are captured. The final representation T = [t_1^{tr}, t_2^{tr}, …, t_{|T|}^{tr}] contains the information of all nodes in the input sequence, i.e., the generated text sequence feature vectors.
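The encoder computations above can be sketched in numpy as follows. This is a minimal, untrained illustration: the random matrices stand in for the learned weights W^O, W_1, and W_2, and the token count and dimensions are arbitrary.

```python
# Minimal sketch of scaled dot-product attention, multi-head concatenation,
# and the feedforward layer, with random placeholder weights.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_self_attention(X, h, W_O):
    # Split the model dimension into h heads, self-attend, concat, project.
    heads = [attention(s, s, s) for s in np.split(X, h, axis=1)]
    return np.concatenate(heads, axis=1) @ W_O

def ffn(X, W1, b1, W2, b2):
    # FFN(X) = ReLU(X W1 + b1) W2 + b2
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
l, d, h = 6, 16, 4                      # 6 tokens, embedding dim 16, 4 heads
X = rng.standard_normal((l, d))
Y = multi_head_self_attention(X, h, rng.standard_normal((d, d)))
T = ffn(Y, rng.standard_normal((d, 4 * d)), np.zeros(4 * d),
        rng.standard_normal((4 * d, d)), np.zeros(d))
print(T.shape)  # (6, 16)
```

A trained encoder would also add residual connections and layer normalization around both sublayers; these are omitted here for brevity.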

4.2.2. Image Feature Embedding

Although the sequence-based V_svkf can learn the global structure and syntactic features of the code from the path sequence, the path sequence is essentially a flattened expression and thus inevitably discards some of the semantic structure information that the code originally has. For similar code, the surface structures of different programming languages differ, but the code logic is usually the same; for instance, the if_statement parts of V_svkf in different programming languages have similar structures. Therefore, this paper uses the GGNN to extract the local features of the code, taking the aforementioned G = (V, E, X) as input to capture the semantic features of the code. For the edge types in G, adjacency matrices are merged for processing.
GGNN is a neural network model specifically designed for graph-structured data, which can efficiently capture the dependency relationships and hierarchical structures between nodes. GGNN initializes each node as a feature vector and uses an iterative information propagation mechanism to continuously update the node state. The specific process is as follows:
Node initialization: First, each node in G = ( V , E , X ) is initialized, using the previously generated unified graph embedding representation as the initial feature vector of the node, as shown in the following formula:
h_i^{(0)} = v_i.
Neighborhood aggregation: At each time step, each node sends its current embedding vector as a message to all adjacent nodes and aggregates the messages received from its neighbors to update its own representation:
m_i^{(t)} = Σ_{j ∈ N(i)} Message(h_i^{(t−1)}, h_j^{(t−1)}).
State update: Then, we use the gating mechanism to update the node state. That is, the formula is
h_i^{(t)} = Update(h_i^{(t−1)}, m_i^{(t)}),

where the Update function combines the state h_i^{(t−1)} of node i at time step t−1 with the aggregated message m_i^{(t)} to generate the new node state h_i^{(t)}. After T rounds of iteration over all nodes in the graph, the state of each node is its final representation. The final output is G = [g_1^{tr}, g_2^{tr}, …, g_{|V|}^{tr}], where |V| is the number of nodes in the graph.
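One propagation round of the GGNN described above can be sketched in numpy as follows. The sum-aggregation and GRU-style gated update mirror the Message/Update formulas; the adjacency matrix and all weight matrices are random stand-ins for a parsed contract graph and trained parameters.

```python
# Minimal GGNN propagation sketch: m_i = sum over neighbors, then a gated
# (GRU-style) state update. Weights and adjacency are placeholder assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(H, A, Wz, Uz, Wr, Ur, Wh, Uh):
    """One round: M = A H aggregates neighbor states, gates blend old/new."""
    M = A @ H                                   # message aggregation m_i^(t)
    Z = sigmoid(M @ Wz + H @ Uz)                # update gate
    R = sigmoid(M @ Wr + H @ Ur)                # reset gate
    H_tilde = np.tanh(M @ Wh + (R * H) @ Uh)    # candidate state
    return (1 - Z) * H + Z * H_tilde            # h_i^(t)

rng = np.random.default_rng(1)
n, d = 5, 8                                     # 5 nodes, state dimension 8
A = (rng.random((n, n)) < 0.4).astype(float)    # toy merged adjacency matrix
np.fill_diagonal(A, 0)
H = rng.standard_normal((n, d))                 # h_i^(0): initial embeddings
W = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(6)]
for _ in range(3):                              # T = 3 propagation rounds
    H = ggnn_step(H, A, *W)
print(H.shape)  # (5, 8)
```

The full GGNN additionally uses separate message weights per edge type; merging the adjacency matrices, as the text describes, collapses these into one.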

4.2.3. Multi-Modal Fusion of Image–Text Features

The Transformer extracts text features to obtain the global structure and logical features of the code. With its powerful self-attention mechanism, it can capture long-distance dependency relationships and the overall logical flow of the code. At the same time, using the GGNN to extract features from the extended attribute graph captures the local structure and semantic features of the code; the details of the code graph structure and the dependency relationships between nodes are obtained through message passing and state updates between graph nodes. After obtaining the feature vectors generated by the two models, these two different features must be integrated for the downstream similarity detection task. To this end, this paper proposes a multi-modal feature fusion, aiming to comprehensively integrate the global information of sequences and the local information of structures. The specific process is as follows:
Feature representation: Through the upstream work, the text features and image features of the smart contract code are obtained:
$$T = [t_1^{tr}, t_2^{tr}, \ldots, t_{|d|}^{tr}],$$
$$G = [g_1^{tr}, g_2^{tr}, \ldots, g_{|d|}^{tr}],$$
where $T \in \mathbb{R}^{d \times m}$ represents the features learned from $V_{svkf}$, and $G \in \mathbb{R}^{d \times m}$ represents the features obtained from the graph $G = (V, E, X)$.
Feature pairing: The text feature and image feature corresponding to each node in $G = (V, E, X)$ are paired, where $P_i$ denotes the $i$-th pair of feature vectors:
$$P_i = (t_i^{tr}, g_i^{tr}).$$
Feature vector concatenation: After pairing the feature vectors of each node, the paired feature vectors are concatenated to generate a new feature vector:
$$h_i^{combined} = \mathrm{Concat}(t_i^{tr}, g_i^{tr}),$$
where $h_i^{combined}$ is the newly generated feature vector of node $i$, and $\mathrm{Concat}$ denotes the concatenation of the two vectors.
Finally, we combine the feature vectors of all nodes:
$$H^{combined} = [h_1^{combined}, h_2^{combined}, \ldots, h_d^{combined}].$$
Through the above process, the obtained $H^{combined} \in \mathbb{R}^{2 \times d \times m}$ is the joint feature vector of the code, which comprehensively includes the global structural features as well as the local structural and semantic features of the code.
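The pairing-and-concatenation procedure reduces to a few lines of NumPy. The dimensions d and m and the random features below are illustrative assumptions; note that the resulting (d, 2m) array contains exactly the entries of the $2 \times d \times m$ joint tensor described above, just laid out per node.

```python
import numpy as np

# Toy dimensions (illustrative): d = 4 nodes, m = 6 feature dimensions
d, m = 4, 6
rng = np.random.default_rng(1)
T = rng.standard_normal((d, m))   # text features t_i^tr from the Transformer
G = rng.standard_normal((d, m))   # graph features g_i^tr from the GGNN

# Pair and concatenate per node: h_i^combined = Concat(t_i^tr, g_i^tr)
H_combined = np.stack([np.concatenate([T[i], G[i]]) for i in range(d)])
```

Each row `H_combined[i]` holds the fused representation of node `i`, with the text half in the first m entries and the graph half in the last m.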

4.3. Vulnerability Detection

This paper adopts a multi-level linked neural network architecture to achieve an end-to-end detection process from feature extraction to vulnerability determination. As shown in Algorithm 2, first, through the Transformer model and the GGNN model, we obtain the text sequence feature vectors $T = [t_1^{tr}, t_2^{tr}, \ldots, t_{|d|}^{tr}]$ and the image feature vectors $G = [g_1^{tr}, g_2^{tr}, \ldots, g_{|d|}^{tr}]$ of the smart contract, representing the semantic information and structural patterns of the code, respectively. Then, multi-modal feature fusion generates a joint representation $H^{combined} = [h_1^{combined}, h_2^{combined}, \ldots, h_d^{combined}]$. This fusion mechanism effectively retains the complementarity of the text and image features, providing a unified representation space for subsequent analysis.
To introduce vulnerability prior knowledge, this paper designs an attention-enhanced feature interaction module. For a given vulnerability feature set $\mathcal{P}$, we first calculate the similarity matrix between the joint feature $H^{combined}$ and $\mathcal{P}$:
$$A = \mathrm{softmax}\!\left(\frac{H^{combined} \cdot \mathcal{P}^{\top}}{\sqrt{d}}\right),$$
where $d$ is the feature dimension, and $\sqrt{d}$ is the scaling factor that prevents gradient vanishing. Next, vulnerability features are aggregated dynamically through the similarity weights:
$$P^{ctx} = A \cdot \mathcal{P}.$$
This process achieves context-aware feature enhancement, enabling the vulnerability detection model to adaptively adjust the weights of vulnerability patterns based on the current contract features. Finally, a residual-connected feature representation is generated:
$$H^{combined*} = \mathrm{LayerNorm}(H^{combined} + W_p P^{ctx}),$$
where $W_p$ is a linear transformation matrix, and LayerNorm ensures the stability of the feature distribution.
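The attention-enhancement step can be sketched in NumPy as follows. The shapes, the random inputs, and the parameter-free LayerNorm (no learned affine terms) are illustrative assumptions under this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_enhance(H, P, W_p, eps=1e-6):
    """Scaled dot-product attention over a vulnerability feature set P,
    followed by a residual connection and LayerNorm.
    Illustrative shapes: H (n, d), P (k, d), W_p (d, d)."""
    d = H.shape[1]
    A = softmax(H @ P.T / np.sqrt(d))            # similarity weights
    P_ctx = A @ P                                # context-aware aggregation
    out = H + P_ctx @ W_p                        # residual connection
    mu = out.mean(axis=-1, keepdims=True)        # LayerNorm without learned affine
    var = out.var(axis=-1, keepdims=True)
    return (out - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(2)
H = rng.standard_normal((5, 8))                  # joint features H^combined
P = rng.standard_normal((3, 8))                  # vulnerability feature set
W_p = rng.standard_normal((8, 8)) * 0.1
H_star = attention_enhance(H, P, W_p)            # H^{combined*}
```

Each row of `H_star` is zero-mean by construction of the normalization, which is what keeps the feature distribution stable across contracts.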
Algorithm 2 Multi-Modal Vulnerability Detection Process
  • Input: Contract source $C$, token sequence $T$, contract graph $G = (V, E)$
  • Output: Predicted label $\hat{y}$
  • Notation: $|T|$ denotes the number of tokens in $T$; $|V|$ the number of nodes in $G$.
  • Indices: $i \in [1, |T|]$ indexes tokens; $j \in [1, |V|]$ indexes graph nodes.
    Step 1: Semantic Encoding
  • for $i \leftarrow 1$ to $|T|$ do
  •     $x_i \leftarrow f_T(T_i)$ // embedding of the $i$-th token
  • end for
  • $X \leftarrow \{x_i\}_{i=1}^{|T|}$
    Step 2: Structural Encoding
  • for $j \leftarrow 1$ to $|V|$ do
  •     $h_j \leftarrow f_G(v_j, \mathcal{N}(v_j))$ // hidden state of node $v_j$
  • end for
  • $H \leftarrow \{h_j\}_{j=1}^{|V|}$
    Step 3: Cross-Representation Attention and Fusion
  • for $i \leftarrow 1$ to $|T|$ do
  •     for $j \leftarrow 1$ to $|V|$ do
  •         $\alpha_{ij} \leftarrow \mathrm{Attn}(x_i, h_j)$ // pairwise attention weight
  •     end for
  • end for
  • $z \leftarrow \frac{1}{|T|} \sum_{i=1}^{|T|} \sum_{j=1}^{|V|} \alpha_{ij}\,[x_i \oplus h_j]$ // bounded double-sum fusion
    Step 4: Classification and Decision
  • $s \leftarrow Wz + b$;  $\hat{y} \leftarrow \sigma(s)$
  • return $\hat{y}$
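The bounded double-sum fusion of Step 3 can be sketched as follows, with random embeddings standing in for the Transformer and GGNN outputs; the dot-product attention used for Attn and all dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(X, H):
    """z = (1/|T|) * sum_i sum_j alpha_ij * [x_i (+) h_j],
    with alpha_ij a row-wise softmax over dot-product scores."""
    nT = X.shape[0]
    A = softmax(X @ H.T, axis=1)          # alpha_ij: attention of token i over nodes j
    z = np.zeros(X.shape[1] + H.shape[1])
    for i in range(nT):                   # bounded double sum over tokens and nodes
        for j in range(H.shape[0]):
            z += A[i, j] * np.concatenate([X[i], H[j]])
    return z / nT

rng = np.random.default_rng(4)
X = rng.standard_normal((6, 8))           # token embeddings x_i
H = rng.standard_normal((4, 8))           # node hidden states h_j
z = fuse(X, H)                            # fused representation
```

Because each row of attention weights sums to one, the double sum stays bounded regardless of contract size, which is the point of the `1/|T|` normalization.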
Finally, a fully connected network is used as the classifier to detect smart contract vulnerabilities. Each neuron in the fully connected layer is connected to all neurons in the previous layer, which can be represented as
$$logits = W_2 \cdot \sigma(W_1 H^{combined*} + b_1) + b_2,$$
where $\sigma$ is the GELU activation function, and $W_1$ and $W_2$ are weight matrices. The output layer generates the vulnerability probability through the sigmoid function:
$$p_{vuln} = \frac{1}{1 + e^{-logits}}.$$
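A minimal NumPy sketch of this classification head follows; the layer widths, random weights, and the tanh approximation of GELU are illustrative assumptions rather than the trained network.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def classify(h, W1, b1, W2, b2):
    """Two-layer head: logits = W2 . GELU(W1 h + b1) + b2, then sigmoid."""
    logits = gelu(h @ W1 + b1) @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))      # p_vuln in (0, 1)

rng = np.random.default_rng(3)
h_star = rng.standard_normal(8)               # fused contract representation
W1 = rng.standard_normal((8, 16)) * 0.1       # illustrative layer widths
b1 = np.zeros(16)
W2 = rng.standard_normal((16, 1)) * 0.1
b2 = np.zeros(1)
p = classify(h_star, W1, b1, W2, b2)          # vulnerability probability
```

The sigmoid guarantees the output is a valid probability, which can then be thresholded (e.g., at 0.5) for the final vulnerable/non-vulnerable decision.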

5. Framework Deployment in Cloud-Based BaaS Environment

To validate the feasibility of deploying the proposed framework in practice, we further analyze its integration path into a real Blockchain-as-a-Service (BaaS) platform. The deployment architecture contains three layers: contract analysis layer, data fusion layer, and cloud orchestration layer.
(1) Contract analysis layer: The smart contracts uploaded by users are processed by a sandboxed execution engine and forwarded to the analysis service via RESTful APIs.
(2) Data fusion layer: The proposed multi-modal semantic fusion module runs as a microservice in Kubernetes clusters. This layer performs vulnerability parsing, feature extraction, and multi-modal attention fusion on distributed GPU nodes, utilizing asynchronous task queues for large-scale parallel analysis.
(3) Cloud orchestration layer: The detection results are returned to the blockchain service dashboard through the message bus, and vulnerability reports are stored in distributed object storage. The framework can, therefore, operate as a plug-in detection service for cloud-native BaaS ecosystems such as AWS Managed Blockchain or Alibaba Cloud BaaS.

6. Experimental Evaluation

In this section, we empirically evaluate the proposed method on a publicly available dataset. To assess its performance, we designed the following research questions:
  • RQ1: Can our proposed method effectively detect common vulnerabilities in smart contracts, and is its vulnerability detection better than that of existing methods?
  • RQ2: Does adding multi-modal semantic fusion help improve model performance?
  • RQ3: How does the proposed multi-modal fusion mechanism (Transformer–GGNN with attention alignment) improve vulnerability detection accuracy and interpretability compared to unimodal and static fusion baselines? Can it be applied on a large scale?
The following experiments address each of these research questions.

6.1. Experimental Setup

Smartbugs Curated is an open-source project dedicated to the security of smart contracts [7]. It offers a meticulously curated dataset of Solidity smart contracts, which contain marked security vulnerabilities. The main purpose of this project is to evaluate the accuracy of automated analysis tools and provide a standard dataset for the research on the security of smart contracts. Therefore, this paper selects Smartbugs Curated on the Ethereum platform as the benchmark dataset.
In order to further examine the generalizability of the proposed method beyond curated datasets, we additionally collected a small corpus of real-world smart contracts deployed on the Ethereum mainnet. These contracts were obtained from verified sources on Etherscan between January and June 2025, covering decentralized finance (DeFi) applications, non-fungible token (NFT) marketplaces, and decentralized autonomous organization (DAO) governance modules. Unlike Smartbugs Curated, which provides manually annotated vulnerability types in a controlled setting, the real-world corpus reflects the diverse coding styles, third-party library dependencies, and complex contract inheritance structures commonly observed in production environments. Each contract was automatically labeled through a cross-validation of results produced by Slither and Mythril and a manual inspection of execution traces, resulting in approximately 1200 unique labeled contracts. Although this supplementary evaluation is limited in scale, its consistent accuracy trend with the benchmark dataset suggests that the proposed multi-modal semantic fusion model maintains strong detection capability under practical deployment conditions.
To verify the detection effectiveness of this model, this paper employed automated detection tools (Oyente [19], Mythril [20], and Slither [10]) and deep learning methods (TMP [21] and DMT [22]).
Oyente parses command-line parameters, loads the smart contract files, invokes its analysis engine for security analysis, and outputs the results. Its core functions include security checks, compliance verification, performance evaluation, and code quality analysis, and it can automatically detect security vulnerabilities in smart contracts, such as re-entrancy attacks and integer overflows.
Mythril is a static vulnerability analysis tool that uses symbolic execution and SMT solving for vulnerability detection. It supports the detection of various security vulnerabilities, including integer overflows, timestamp dependencies, and re-entrancy attacks.
Slither identifies and fixes security vulnerabilities in smart contracts through static analysis. Its rapid detection capabilities and user-friendly API make it a powerful tool for smart contract security analysis. Although it may not be as sensitive as other tools in detecting certain specific types of vulnerabilities, its comprehensive detection capabilities and ease of integration enable it to have a place in the field of smart contract security.
TMP is a novel temporal information propagation network that learns the vulnerability features in the normalized contract graph through graph convolutional learning. A contract graph is a graph structure that represents the data and control dependencies among program statements. This network can identify potential security vulnerabilities in smart contracts and provides a new perspective and method for the security analysis of smart contracts.
DMT is a novel cross-modal mutual learning framework, aiming to enhance the performance of smart contract vulnerability detection at the bytecode level. The DMT framework improves the accuracy of vulnerability detection by combining the two modalities of source code and corresponding bytecode.
Experiments were conducted on the four vulnerabilities: re-entrancy, timestamp, overflow/underflow, and delegatecall. The experimental parameters are set as shown in Table 2.

6.2. Resource Consumption Analysis

We deployed a prototype of the proposed framework in a Kubernetes-based testbed with three worker nodes (each equipped with Intel i5-14600 CPUs, 16 GB RAM, and one NVIDIA RTX 4070 GPU). On average, the detection of 1000 contracts required 2.3 min, with per-contract GPU utilization below 38% and memory consumption under 1.5 GB. This performance demonstrates that the system can efficiently handle batch detection workloads in distributed cloud environments without incurring excessive computational overhead.

6.3. Computational Cost and Energy-Efficiency Analysis

Training Time: The end-to-end training process, including Transformer pretraining and GGNN fine-tuning, requires approximately 7.8 h for convergence on 18,000 labeled contracts. Early stopping is triggered after 20 epochs with a batch size of 32, where validation accuracy stabilizes within ±0.3%.
Algorithmic Complexity: The theoretical computational complexity of the model can be expressed as
$$O(N_T \cdot d^2) + O(N_G \cdot |E|),$$
where $N_T$ denotes the sequence length in Transformer encoding, $d$ is the embedding dimension, and $N_G$ and $|E|$ are the numbers of nodes and edges in the GGNN graph, respectively. Given that $N_T \leq 200$ and $|E|/N_G \approx 3$ for smart contract code graphs, the overall time complexity is approximately linear in code size.
Energy Consumption: Power usage was profiled using the NVIDIA-SMI monitoring tool. The average GPU power draw during training was 124 W, resulting in a total energy consumption of approximately 3.5 kWh for a complete training session. Inference tasks exhibit much lower power demand (≤45 W per batch), corresponding to an energy footprint of less than 0.002 kWh per contract.
These results confirm that although the multi-modal architecture increases training complexity, the overall resource cost remains within acceptable limits for modern GPU-equipped cloud environments.

6.4. Detection Performance Comparison (Addressing RQ1)

As shown in Figures 2–5, compared with traditional vulnerability detection tools such as Oyente, Mythril, and Slither, our proposed method demonstrates excellent experimental results; traditional detection tools have not yet achieved high accuracy on these four types of vulnerabilities. For re-entrancy vulnerability detection, our method achieves an accuracy of 92.2%, surpassing the other automated detection tools by an average of 19%. For timestamp vulnerabilities, it achieves an accuracy of 96.3%, far surpassing traditional detection tools; this is attributed to our keyword extraction and the sensitivity of timestamp vulnerabilities to blockchain information fields. For delegatecall vulnerabilities, we also maintain a relatively high accuracy. Compared with deep learning methods, our approach outperforms TMP (76.5%) and DMT (89.5%) in re-entrancy vulnerability detection, demonstrating that multi-modal fusion can effectively capture the semantics of smart contract vulnerabilities. In terms of precision, our method achieves an average improvement of approximately 17% over TMP and 3.7% over DMT, indicating an advantage in reducing false positives.

6.5. Ablation Experiments (Addressing RQ2)

To validate the effectiveness of each stage, four ablation experiments were designed. As shown in Table 3, when detecting re-entrancy vulnerabilities, removing image features resulted in a significant decrease in accuracy and recall rates. This is because re-entrancy vulnerabilities are not only related to key text information features but also to the re-entrancy vulnerability attack behavior. Re-entrancy vulnerabilities require specific contract execution behavior to trigger, and image features can effectively capture the relationships between graph nodes. Timestamp vulnerabilities are primarily caused by the use of sensitive information inherent to the blockchain, making them highly sensitive to text features. Removing text features results in a significant decline in detection capability. Delegatecall and overflow/underflow vulnerabilities are sensitive to both text and image features, and their detection quality decreases as features are removed. This highlights the importance of text and graph attention in dependency modeling.

6.6. Efficiency Studies (Addressing RQ3)

As shown in Table 4, we conducted detection experiments on 200 smart contracts. Since Oyente is based on symbolic execution, it traverses all possible execution paths to perform a full path analysis of the code logic. This can lead to path explosion and unsolvable paths, resulting in lengthy execution times. Although Mythril combines static analysis and symbolic execution, it essentially cannot avoid the same issues as Oyente. Slither performs static analysis based on source-level abstract syntax trees (ASTs) without executing the code. It detects vulnerabilities through syntax and logical rules, avoiding complex path issues, so it is faster. The method that we propose averages only 2 s, with the majority of the time spent on text and image feature extraction and cross-modal feature fusion, while model prediction takes only a small amount of time. This indicates that our method can be used for large-scale detection.

6.7. Dataset Diversity Discussion

To further validate the generalization capability of the proposed multi-modal semantic fusion framework, additional simulations were performed using a real-world Ethereum dataset collected from verified contracts on Etherscan. These contracts span DeFi, NFT, and DAO application domains, introducing significant heterogeneity in coding style, inheritance depth, and external library dependency.
Figure 6 illustrates two sets of results. Subfigure (a) compares the model’s accuracy and F1-scores on both Smartbugs Curated and real-world Ethereum datasets, showing that performance degradation remains within 2%. Subfigure (b) presents the ablation analysis, where the full multi-modal model consistently outperforms single-modality variants by 8–10 percentage points across datasets. These findings demonstrate that the fusion of textual and structural features enables stable and robust detection of complex vulnerabilities even under unbalanced, production-level data distributions.

6.8. Cross-Language Adaptability and Portability Analysis

Although the experiments in this study are conducted on Solidity contracts and their EVM bytecode, the proposed framework is inherently language-agnostic due to its multi-modal semantic representation.
Language Adaptability: The text embedding stage in the Transformer encoder operates on tokenized AST paths, which can be extracted from any programming language with a defined grammar. By replacing the ANTLR grammar rules, the same preprocessing pipeline can be directly adapted to Vyper (a Python-like EVM language), Move (Aptos/Sui), or Rust-based smart contracts (Solana, NEAR). Preliminary tests on 300 Vyper contracts show that the token and structure embedding coverage exceeds 93%, indicating high cross-language compatibility.
Bytecode Ecosystem Adaptation: At the graph-modeling level, the GGNN component relies on dependency relationships (data and control flow) rather than language syntax. Therefore, by transforming bytecode instruction graphs (e.g., EVM opcode, WASM IR, and SVM instruction graphs) into unified intermediate representations, the same feature fusion mechanism can be applied without modifying the neural architecture. This modular design supports heterogeneous blockchain ecosystems.

6.9. Qualitative Case Analysis and Reproducibility Statement

To complement the quantitative evaluation, we conducted a qualitative inspection of the detection outcomes on selected smart contracts. Table 5 presents representative examples, including both successfully identified and missed vulnerabilities.
Case Study 1—Re-entrancy in Nested Calls: The model successfully detected the multi-level re-entrancy vulnerability in the EtherBank contract due to the clear call–transfer dependency captured by the GGNN attention module.
Case Study 2—Timestamp Dependency in Multi-Function Logic: The framework failed to detect the timestamp vulnerability in LotteryDAO, where the timestamp was indirectly invoked via an external library. The limitation arises from the incomplete propagation of dependency edges across imported files.
Case Study 3—Arithmetic Overflow with SafeMath Wrapper: The model correctly classified contracts using SafeMath as non-vulnerable, demonstrating contextual reasoning through the Transformer encoder.
Case Study 4—Delegatecall Misuse in Modular Contracts: The framework correctly detected the misuse of the delegatecall instruction in the MultiSend.sol contract, which was exploited to override storage variables of the calling contract.

7. Conclusions

This study proposes a multi-modal semantic fusion framework for smart contract vulnerability detection on Blockchain-as-a-Service platforms in the cloud environment. Through abstract syntax tree parsing, vulnerability feature matching, and data/control dependency tracking, the method accurately locates and retains the critical vulnerable code together with its scope information. Text sequence and graph structure features are then extracted in parallel, an attention mechanism achieves their deep fusion, and the final vulnerability probability is output. Experiments on large-scale Ethereum benchmark datasets show that the proposed method is superior to existing baselines in accuracy, recall, and F1-score, and the average detection time for a single contract is under 2 s, meeting the real-time requirements of cloud services.

Author Contributions

Conceptualization, X.Z.; methodology, X.Z., Q.W. and S.Q.; software, X.Z.; supervision, Q.W. and S.Q.; validation, S.Q.; writing—original draft, X.Z.; writing and editing, X.Z., Q.W. and S.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (No. 62272056).

Data Availability Statement

The datasets generated and analyzed during the current study are not publicly available due to privacy restrictions but are available from the corresponding author (Sujuan Qin) on reasonable request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Appendix A. Glossary of Symbols and Variables

  • Appendix Overview: The symbols and variables summarized in this glossary are grouped according to their functional role in the proposed multi-modal framework. Specifically, variables such as $x_i$, $f_T(\cdot)$, and $\alpha_{ij}$ correspond to semantic-level representations extracted from the Transformer encoder; symbols including $h_i$, $f_G(\cdot)$, and $G = (V, E)$ refer to structural-level representations learned by the GGNN; and higher-level notations such as $z$, $\hat{y}$, and $\mathcal{L}$ denote system-level outputs and training objectives. Evaluation-related variables (e.g., Acc, $F_1$, and $t_{infer}$) describe the performance metrics used in experimental validation. This structured organization facilitates quick reference and ensures consistent interpretation of all mathematical symbols throughout the manuscript.
Table A1. Glossary of symbols and variables.
Symbol/Variable | Description
$C$ | Smart contract under analysis.
$G = (V, E)$ | Graph representation of a contract, where $V$ and $E$ denote the node and edge sets, respectively.
$x_i$ | Token embedding of the $i$-th code token extracted by the Transformer encoder.
$h_i$ | Hidden state of the $i$-th node in the GGNN structural encoder.
$\alpha_{ij}$ | Attention weight between semantic feature $x_i$ and structural feature $h_j$ in the fusion layer.
$f_T(\cdot)$ | Semantic encoder based on the Transformer architecture.
$f_G(\cdot)$ | Structural encoder based on the gated graph neural network (GGNN).
$z$ | Final multi-modal fused representation combining textual and structural features.
$y$ | Ground-truth vulnerability label (0 = non-vulnerable, 1 = vulnerable).
$\hat{y}$ | Predicted vulnerability probability output by the model.
$\mathcal{L}$ | Overall loss function used for model training (e.g., cross-entropy loss).
$\eta$ | Learning rate parameter in model optimization.
$d$ | Dimensionality of feature vectors in the Transformer and GGNN encoders.
$\sigma(\cdot)$ | Activation function (e.g., ReLU or sigmoid) applied to feature transformations.
$\oplus$ | Concatenation or fusion operator between modality features.
$N$ | Number of nodes in the contract graph.
$E_T$ | Edge set capturing data or control dependencies between nodes.
TP, FP, FN, TN | True/false positive/negative counts used for evaluation metrics.
Acc, Prec, Rec, $F_1$ | Accuracy, precision, recall, and F1-score performance metrics.
$t_{train}$, $t_{infer}$ | Training time and inference time per contract, respectively.

References

  1. Ma, W.; Li, W. Blockchain technology and internal control effectiveness. Financ. Res. Lett. 2024, 64, 1544–6123. [Google Scholar] [CrossRef]
  2. Lu, Z.; Tang, Q.; Zhang, Y. BoR: Toward High-Performance Permissioned Blockchain in RDMA-Enabled Network. IEEE Trans. Serv. Comput. 2020, 13, 342–355. [Google Scholar]
  3. Li, Y.; Xu, J.; Liang, W. GraphMF: QoS Prediction for Large Scale Blockchain Service Selection. In Proceedings of the 2020 3rd International Conference on Smart BlockChain (SmartBlock), Zhengzhou, China, 23–25 October 2020; pp. 167–172. [Google Scholar]
  4. Liu, Z.; Jiang, M.; Zhang, S.; Zhang, J.; Liu, Y. A Smart Contract Vulnerability Detection Mechanism Based on Deep Learning and Expert Rules. IEEE Access 2023, 11, 77990–77999. [Google Scholar] [CrossRef]
  5. Yang, H.; Zhang, J.; Gu, X.; Cui, Z. Smart Contract Vulnerability Detection based on Abstract Syntax Tree. In Proceedings of the 2022 8th International Symposium on System Security, Safety, and Reliability (ISSSR), Chongqing, China, 27–28 October 2022; pp. 169–170. [Google Scholar]
  6. Li, X.; Liu, J.; Chen, X.; Zhang, Q. A Symbolic Execution-Based Approach for Smart Contract Vulnerability Detection. In Proceedings of the 2023 IEEE 6th International Conference on Automation, Electronics and Electrical Engineering (AUTEEE), Shenyang, China, 15–17 December 2023; pp. 468–472. [Google Scholar]
  7. Durieux, T.; Ferreira, J.F.; Abreu, R.; Cruz, P. Empirical Review of Automated Analysis Tools on 47,587 Ethereum Smart Contracts. In Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering, Seoul, Republic of Korea, 27 June–19 July 2020; pp. 530–541. [Google Scholar]
  8. Xu, X.; Hu, T.; Li, B.; Liao, L. CCDetector: Detect Chaincode Vulnerabilities Based on Knowledge Graph. In Proceedings of the 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC), Torino, Italy, 26–30 June 2023; pp. 699–704. [Google Scholar]
  9. Qian, P.; Liu, Z.; He, Q.; Zimmermann, R.; Wang, X. Towards Automated Reentrancy Detection for Smart Contracts Based on Sequential Models. IEEE Access 2020, 8, 19685–19695. [Google Scholar] [CrossRef]
  10. Feist, J.; Grieco, G.; Groce, A. Slither: A Static Analysis Framework for Smart Contracts. In Proceedings of the 2019 IEEE/ACM 2nd International Workshop on Emerging Trends in Software Engineering for Blockchain (WETSEB), Montreal, QC, Canada, 27 May 2019; pp. 8–15. [Google Scholar]
  11. Cai, J.; Li, B.; Zhang, J.; Sun, X.; Chen, B. Combine Sliced Joint Graph with Graph Neural Networks for Smart Contract Vulnerability Detection. In Proceedings of the 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Macao, China, 21–24 March 2023; pp. 851–852. [Google Scholar]
  12. Yaseen, H.; Hassan, S.I. Comparative Analysis of Smart Contract Vulnerability Detection: Traditional RegEx vs. DL CodeBert Model. In Proceedings of the 2025 Second International Conference Cognitive Robotics and Intelligent Systems (ICC-ROBINS), Coimbatore, India, 25–26 June 2025; pp. 237–242. [Google Scholar]
  13. Jeon, S.; Lee, G.; Kim, H.; Woo, S.S. Design and Evaluation of Highly Accurate Smart Contract Code Vulnerability Detection Framework. Data Min. Knowl. Discov. 2024, 38, 888–912. [Google Scholar] [CrossRef]
  14. Tahir, U.; Siyal, F.; Ianni, M.; Guzzo, A.; Fortino, G. Exploiting Bytecode Analysis for Reentrancy Vulnerability Detection in Ethereum Smart Contracts. In Proceedings of the 2023 IEEE International Conference Dependable, Autonomic and Secure Computing (DASC/PiCom/CBDCom/CyberSciTech), Abu Dhabi, United Arab Emirates, 14–17 November 2023; pp. 779–783. [Google Scholar]
  15. Li, M.; Ren, X.; Fu, H.; Li, Z.; Sun, J. Enhancing Reentrancy Vulnerability Detection and Repair with a Hybrid Model Framework. In Proceedings of the 2024 31st Asia-Pacific Software Engineering Conference (APSEC), Chongqing, China, 3–6 December 2024; pp. 161–170. [Google Scholar]
  16. Hu, S.; Huang, T.; İlhan, F.; Tekin, S.F.; Liu, L. Large Language Model-Powered Smart Contract Vulnerability Detection: New Perspectives. In Proceedings of the 2023 5th IEEE International Conference Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA), Atlanta, GA, USA, 1–3 November 2023; pp. 297–306. [Google Scholar]
  17. Cheong, Y.-Y.; Choi, L.Y.; Shin, J.; Kim, T.; Ahn, J.; Im, D.-H. GNN-based Ethereum Smart Contract Multi-Label Vulnerability Detection. In Proceedings of the 2024 International Conference on Information Networking (ICOIN), Ho Chi Minh City, Vietnam, 17–19 January 2024; pp. 57–61. [Google Scholar]
  18. Huang, Q.; He, Y.; Xing, Z.; Yu, M.; Xu, X.; Lu, Q. Enhancing Fine-Grained Smart Contract Vulnerability Detection Through Domain Features and Transparent Interpretation. IEEE Trans. Reliab. 2025, 74, 4207–4221. [Google Scholar] [CrossRef]
  19. Luu, L.; Chu, D.; Olickel, H.; Saxena, P.; Hobor, A. Making Smart Contracts Smarter. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016. [Google Scholar]
  20. Mueller, B. A Framework for Bug Hunting on the Ethereum Blockchain. 2016. Available online: https://github.com/ConsenSys/mythril (accessed on 7 October 2025).
  21. Zhuang, Y.; Liu, Z.; Qian, P.; Liu, Q.; Wang, X.; He, Q. Smart Contract Vulnerability Detection using Graph Neural Network. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, Yokohama, Japan, 11–17 July 2020; pp. 3283–3290. [Google Scholar]
  22. Qian, P.; Liu, Z.; Yin, Y.; He, Q. Cross-Modality Mutual Learning for Enhancing Smart Contract Vulnerability Detection on Bytecode. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023. [Google Scholar]
Figure 1. System model.
Figure 2. Performance comparison of smart contract vulnerability detection (re-entrancy).
Figure 3. Performance comparison of smart contract vulnerability detection (timestamp).
Figure 4. Performance comparison of smart contract vulnerability detection (overflow/underflow).
Figure 5. Performance comparison of smart contract vulnerability detection (delegatecall).
Figure 6. Combined simulation results: (a) detection performance comparison on Smartbugs Curated and real-world Ethereum datasets; (b) ablation study across datasets showing consistent improvements of the full multi-modal fusion model.
Table 1. Systematic comparison of existing smart contract vulnerability detection methods.
Category | Methods | Input Features | Strengths | Limitations
Static Analysis | Oyente, Mythril | Symbolic traces | Interpretable | High false positives
Deep Learning | VulDeeSmart, TMP | Tokenized code, graphs | Automated learning | Requires large datasets
Graph-based | GGNN, VulHunter | AST/CFG/DFG | Captures structure | Ignores semantics
Multi-modal | DMT, DeepVul-Graph | Text + Graph fusion | Robust integration | Static fusion
Proposed (Ours) | MMVul-Detection | Transformer + GGNN + attention | Adaptive fusion | Higher training cost
Table 2. Experimental setup parameters.
Dataset Configuration
Dataset | Smartbugs Curated
Supplementary Dataset | 1200 real-world Ethereum mainnet contracts
Vulnerability Types | Re-entrancy, Timestamp, Overflow/Underflow, Delegatecall
Baseline Methods
Symbolic Execution | Oyente [19]
Industrial Tool | Mythril [20]
Industrial Tool | Slither [10]
Graph-based Model | TMP [21]
Neural Network Model | DMT [22]
Evaluation Metrics
Performance Metrics | Accuracy, Precision, Recall, F1-score
Hardware/Software Environment
OS | Windows 10
CPU | Intel Core i5-14600
RAM | 16 GB
GPU | NVIDIA RTX 4070
Deep Learning Framework | PyTorch 1.12
Table 3. Ablation experiment results.
| Method | Vulnerability | Recall | Precision | Accuracy | F1-Score |
|---|---|---|---|---|---|
| w/o Text feature | Re-entrancy | 88.7 | 89.8 | 88.9 | 89.3 |
| w/o Graph feature | Re-entrancy | 79.8 | 80.8 | 79.5 | 80.6 |
| Full model | Re-entrancy | 96.2 | 97.1 | 96.5 | 95.8 |
| w/o Text feature | Timestamp | 79.2 | 80.2 | 81.4 | 80.7 |
| w/o Graph feature | Timestamp | 85.6 | 85.6 | 87.9 | 83.9 |
| Full model | Timestamp | 92.5 | 93.1 | 92.8 | 91.2 |
| w/o Text feature | Overflow/Underflow | 80.4 | 81.6 | 83.7 | 81.1 |
| w/o Graph feature | Overflow/Underflow | 88.9 | 86.9 | 89.2 | 84.5 |
| Full model | Overflow/Underflow | 92.8 | 93.8 | 93.1 | 91.7 |
| w/o Text feature | Delegatecall | 81.6 | 85.6 | 83.9 | 82.9 |
| w/o Graph feature | Delegatecall | 87.5 | 84.7 | 88.8 | 84.3 |
| Full model | Delegatecall | 93.4 | 94.4 | 93.7 | 92.1 |
Table 4. Average detection time for 200 contracts.
| Method | Oyente | Mythril | Slither | Proposed |
|---|---|---|---|---|
| Time (s) | 45 | 30 | 10 | 2 |
Table 5. Representative qualitative analysis of detection outcomes.
| Contract Name | Vulnerability Type | Detection Result |
|---|---|---|
| EtherBank.sol | Re-entrancy | Detected ✓ |
| LotteryDAO.sol | Timestamp Dependency | Missed ✗ |
| SafeToken.sol | Arithmetic Overflow | Detected ✓ |
| MultiSend.sol | Delegatecall Misuse | Detected ✓ |
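Behind outcomes like these sits the key-code localization step: a predefined vulnerability feature set is matched against the parsed contract to flag candidate code before feature extraction. The toy sketch below illustrates the idea with simple regex patterns over Solidity source; the pattern set, the `locate_key_lines` helper, and the sample contract are hypothetical simplifications of the paper's AST-based localization.

```python
import re

# Hypothetical feature set mapping each vulnerability type to source-level
# patterns; the real method operates on AST nodes, not raw lines.
FEATURE_SET = {
    "re-entrancy": [r"\.call\{?.*value", r"\.call\.value\("],
    "timestamp": [r"\bblock\.timestamp\b", r"\bnow\b"],
    "overflow/underflow": [r"[+\-*]\s*=", r"\bunchecked\b"],
    "delegatecall": [r"\.delegatecall\("],
}

def locate_key_lines(source: str) -> dict:
    """Return {vulnerability type: [1-based line numbers]} for lines
    matching any pattern of that type in the feature set."""
    hits = {vuln: [] for vuln in FEATURE_SET}
    for lineno, line in enumerate(source.splitlines(), start=1):
        for vuln, patterns in FEATURE_SET.items():
            if any(re.search(p, line) for p in patterns):
                hits[vuln].append(lineno)
    return hits

contract = """function withdraw(uint amount) public {
    require(block.timestamp > unlockTime);
    (bool ok, ) = msg.sender.call{value: amount}("");
    balances[msg.sender] -= amount;
}"""

print(locate_key_lines(contract))
# → {'re-entrancy': [3], 'timestamp': [2],
#    'overflow/underflow': [4], 'delegatecall': []}
```

Only the flagged regions are then passed to the parallel text and graph feature extractors, which keeps the fusion model focused on vulnerability-relevant code rather than the whole contract.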

Zeng, X.; Wen, Q.; Qin, S. Multi-Modal Semantic Fusion for Smart Contract Vulnerability Detection in Cloud-Based Blockchain Analytics Platforms. Electronics 2025, 14, 4188. https://doi.org/10.3390/electronics14214188
