Enhancing Smart-Contract Security through Machine Learning: A Survey of Approaches and Techniques



Introduction
The introduction of blockchain technology as the foundation for Bitcoin by Satoshi Nakamoto in 2008 [1] marked the beginning of a new era in decentralized data management frameworks with the potential to revolutionize traditional industries [2]. Over the past decade, blockchain technology has developed rapidly, not only in the financial sector but also in various other fields, such as supply chain management [3], the Internet of Things (IoT) [4,5], transportation [6], retail [7], healthcare [8,9], gaming [10], and communication [11,12]. Blockchain technology had progressively driven global economic growth by 2020, and it is projected to contribute 1.76 trillion dollars to the global economy by 2030 by increasing traceability and trust levels [13].
As blockchain technology progresses and permeates various domains, considering it merely a distributed ledger no longer captures its true essence. The tremendous potential of blockchain technology across diverse sectors has garnered significant attention and heightened expectations. Nonetheless, to realize these expectations, several challenges must be addressed. First, a consensus mechanism alone struggles to ensure high security and trustworthiness in a decentralized transaction verification network, where trust between participants cannot be assumed. Second, the scalability challenge of blockchain technology has considerably hindered its integration and growth across various industries and fields.
In this context, Ethereum, as a groundbreaking blockchain platform [14], provides a viable solution to these challenges through the implementation of smart-contract technology. Smart contracts are self-executing, blockchain-based computer protocols that facilitate highly secure and trustworthy transactions in a decentralized environment. Consequently, smart contracts have been employed in a wide range of applications, including IoT-based smart contracts [15], trusted database systems using smart contracts [16], and smart-contract-based medical data consent models [9,17]. Given the significant sums of money involved in various smart-contract applications, ensuring their security is of paramount importance. As a result, numerous researchers and engineers have devoted their efforts to examining and enhancing the security of smart contracts, aiming to unlock their full potential across a diverse array of use cases.
An effective approach to smart-contract security detection is the application of machine-learning techniques. Over the past few decades, the field of machine learning has made significant progress and has become one of the key areas in modern computer science. The primary goal of machine learning is to develop algorithms capable of learning autonomously from data, enabling them to make predictions, classifications, or decisions when faced with new data.
Machine-learning-based smart-contract security detection has grown into a sizable body of work. However, comprehensive reviews of the application of machine learning to smart-contract security detection remain relatively rare in the current literature. To address this gap, this review outlines the current state, challenges, and future development trends of machine-learning techniques in smart-contract security detection, providing a comprehensive and systematic reference for both academics and practitioners. Our review is guided by the following core research questions:
RQ1: What machine-learning methods are suitable for smart-contract security detection, and what are the advantages and limitations of these methods in vulnerability detection?
RQ2: What existing approaches apply machine learning to smart-contract security detection, and how do these methods perform?
RQ3: Future research directions and challenges: How can machine-learning methods be combined with other security analysis techniques to further improve the performance of smart-contract vulnerability detection?
Section 2 introduces related review studies. Section 3 presents the research background of the article, including common types of smart-contract vulnerabilities and existing non-machine-learning tools for smart-contract vulnerability detection. Section 4 delves into machine-learning techniques applicable to smart-contract security detection. Section 5 provides an overview and analysis of related work that has been practically applied in smart-contract security detection. Section 6 explores in depth the three research questions posed by this study. Finally, Section 7 summarizes the main findings and conclusions of this paper.

Related Work
In this section, we investigate recent review literature in the field of smart-contract vulnerability detection, as well as reviews related to vulnerability detection based on machine learning. During this process, we noticed a concerning gap: while machine learning has made significant progress in smart-contract vulnerability detection, very little review literature specifically addresses machine-learning-based smart-contract vulnerability detection. The aim of this paper is therefore to fill this void, providing researchers and developers with a more comprehensive and systematic theoretical framework and practical reference in this domain.
Atzei et al. [18] surveyed attacks on Ethereum smart contracts. Their study focused on various attack types and case studies, as well as ways to improve the security of smart contracts. By analyzing different attack strategies, their work provided developers with practical advice and tools for defending against these attacks. Chen et al. [19] carried out a literature review on Ethereum system security, with a particular emphasis on vulnerabilities, attacks, and defenses. Their research provided an in-depth understanding of the security challenges faced by the Ethereum platform and proposed several countermeasures. Their work thoroughly assessed the security of the Ethereum ecosystem, suggested future research directions, and highlighted the importance of defensive measures. Liu et al. [20] surveyed and summarized the security verification of blockchain smart contracts. Their work explored a range of verification techniques rather than being limited to a single technology (such as formal verification), thus offering a comprehensive overview of smart-contract security verification methods. Kabla et al. [21] conducted a comprehensive investigation on the applicability of Intrusion Detection Systems (IDS) for Ethereum attacks. This study examined the effectiveness of various IDS technologies in addressing security issues and detecting potential threats on the Ethereum blockchain. The investigation not only surveyed existing IDS methods but also provided new insights into the challenges and opportunities for enhancing intrusion-detection capabilities within the Ethereum ecosystem. Furthermore, the authors developed a multi-dimensional classification framework to assess and compare different IDS technologies, deepening the understanding of their strengths and weaknesses. Rameder et al. [22] performed a literature review on the automated vulnerability analysis of Ethereum smart contracts. This review mainly explored automated vulnerability analysis techniques for smart contracts on the Ethereum platform. It focused on the application and potential of automated analysis methods in the smart-contract security domain and proposed a multi-dimensional classification framework to analyze and evaluate different vulnerability analysis techniques. Kushwaha et al. [23] conducted a systematic survey of security vulnerabilities in Ethereum blockchain smart contracts. The study delved into various security vulnerabilities plaguing Ethereum smart contracts and discussed techniques for identifying and mitigating these issues. The novelty of this work lies in the systematic approach taken to examine security vulnerabilities and their impacts while also identifying gaps in the current state of research and proposing future directions for enhancing the security of smart contracts on the Ethereum platform. Krichen et al. [24] investigated the application of formal methods in the specification and verification of smart contracts, aiming to reduce the risk of failures and minimize costs. They also proposed future research directions, such as lowering verification costs, integrating various mathematical concepts, and enhancing the accessibility and reusability of formal methods. These insights provide novel perspectives and guidance for the smart-contract domain. Miller et al. [25] proposed utilizing formal methods for systematic auditing and verification of smart contracts to ensure their security. They also examined the platforms, high-profile vulnerabilities, and existing analysis tools within the smart-contract ecosystem and identified the research challenges faced by formal methods and program analysis applied to smart contracts.
Additionally, we surveyed review articles on vulnerability-detection models based on machine-learning techniques. Ahmed et al. [26] published a machine-learning review on software vulnerability detection at the 8th International Conference on Contemporary Information Technology and Mathematics (ICCITM) in 2022. This review primarily explored the application and outcomes of machine-learning methods in the field of software vulnerability detection. Unlike other reviews, this article delved into the practical application and effectiveness of various machine-learning techniques in software vulnerability detection and proposed a classification framework to illustrate the types of technologies employed by each method. Pan et al. [27] carried out a literature review on hardware vulnerability analysis using machine learning. This review mainly investigated the application and results of machine-learning methods in the field of hardware vulnerability analysis. They thoroughly examined the practical application and effectiveness of various machine-learning techniques in hardware vulnerability analysis and proposed a classification framework to describe the types of technologies adopted by each method. Zeng et al. [28] conducted a survey on software vulnerability analysis using deep-learning techniques. This review primarily discussed the application and effectiveness of deep-learning methods in the field of software vulnerability analysis and discovery. This research did not focus on any specific technology but rather discussed the target technology as a dimension of the classification framework. Lin et al. [29] performed a literature review on software vulnerability detection using deep-learning techniques. This review provided an in-depth discussion of the application and outcomes of deep neural network methods in software vulnerability detection, offering insightful perspectives and future research directions. Additionally, this review paid particular attention to the verification aspect, considering not only security-related issues but also functional problems.

Reentrancy Attack
Reentrancy attacks are a common vulnerability in smart contracts, where an attacker repeatedly calls functions during contract interaction, allowing them to steal assets or disrupt contract logic. To prevent such attacks, contract developers should follow the "checks-effects-interactions" principle, ensuring state updates are completed before interacting with external contracts. Employing locking mechanisms can also help avoid reentrancy. Careful code review, testing, and adherence to programming best practices can help protect against reentrancy attacks and other security vulnerabilities.
Figure 1 illustrates a simplified bank smart contract with a reentrancy vulnerability and an attacker's contract designed to exploit this vulnerability. In this banking system, users can deposit funds into their bank accounts (using the deposit() function), check their balances (using the getBalance() function), and withdraw funds (using the withdraw() function). However, there is a flaw within the withdraw() function that attackers can exploit for a reentrancy attack. When a user calls the withdraw() function, the contract sends a specified amount of Ether to the user's address. If the user's address is a contract address and that contract implements a fallback() function, this fallback() function could trigger a new withdraw() call, leading to a reentrancy attack. In the example, we outline the attacker's process: the attacker creates a contract called Attacker, passing the target Bank contract's address and their own address in the constructor. In the attack() function, the attacker first deposits a certain amount of Ether into the target Bank contract's account, then calls the withdraw() function. In the fallback() function, if the attacker receives enough Ether, they transfer the Ether to the target contract and call the withdraw() function again. This causes the target contract to repeatedly execute the withdraw() function, resulting in multiple withdrawal operations and theft of the contract's assets. Therefore, by implementing a malicious contract, the attacker exploits the reentrancy vulnerability to bypass the security mechanisms in the target contract and successfully steal its assets.
To prevent such attacks, developers should be cautious about reentrancy vulnerabilities when calling functions from other contracts and carefully consider security during contract design and implementation. Researchers and developers can use various static and dynamic analysis tools to detect and fix vulnerabilities during the vulnerability discovery and remediation process. Additionally, contract audits and security assessments are crucial steps to ensure contract safety.
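The reentrancy mechanism described above can be sketched as a small Python simulation. This is a toy model rather than the Solidity code of Figure 1: the class names Bank, Attacker, and User, the vault field, and the safe flag are illustrative assumptions, and withdraw() is simplified to pay out the caller's whole balance. The vulnerable path performs the external call before updating the balance; the safe path updates state first, as the checks-effects-interactions principle requires.

```python
class Bank:
    """Toy bank. With safe=False it pays out before zeroing the balance."""

    def __init__(self, safe=False):
        self.balances = {}   # account object -> credited amount
        self.vault = 0       # total funds held by the contract
        self.safe = safe

    def deposit(self, account, amount):
        self.balances[account] = self.balances.get(account, 0) + amount
        self.vault += amount

    def withdraw(self, account):
        amount = self.balances.get(account, 0)
        # Check: caller must have a balance and the vault must cover it.
        if amount == 0 or self.vault < amount:
            return
        if self.safe:
            # Effects before interaction: state is updated first.
            self.balances[account] = 0
            self.vault -= amount
            account.receive(self, amount)
        else:
            # Vulnerable ordering: external call happens before the update.
            self.vault -= amount
            account.receive(self, amount)
            self.balances[account] = 0


class User:
    """Ordinary account whose receive hook does nothing."""

    def receive(self, bank, amount):
        pass


class Attacker:
    """Account whose receive hook plays the role of a fallback function."""

    def __init__(self):
        self.received = 0

    def receive(self, bank, amount):
        self.received += amount
        bank.withdraw(self)  # re-enter while the balance is still credited

    def attack(self, bank, stake):
        bank.deposit(self, stake)
        bank.withdraw(self)


def simulate(safe):
    """Return (funds received by the attacker, funds left in the bank)."""
    bank = Bank(safe=safe)
    bank.deposit(User(), 10)   # honest deposit that the attacker targets
    attacker = Attacker()
    attacker.attack(bank, 1)
    return attacker.received, bank.vault


print("vulnerable:", simulate(False))
print("checks-effects-interactions:", simulate(True))
```

In the vulnerable run the attacker's one-unit stake is paid out repeatedly until the vault is empty, whereas in the safe run the second withdraw() call sees a zero balance and returns immediately.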

Integer Overflow and Underflow
Integer overflow and underflow vulnerabilities are common issues in smart contracts. An integer overflow or underflow occurs when an integer variable's value exceeds or falls below the maximum or minimum value that its data type can represent. In smart contracts, integer overflow or underflow can lead to calculation errors or unpredictable behavior, potentially compromising the contract's security. To address integer overflow and underflow problems, smart-contract developers typically use the SafeMath library, which offers a set of secure mathematical operations to prevent overflow and underflow.
In Figure 2, we present a simplified bank system smart contract and an example of an attack contract targeting this smart contract. In the bank contract on the left, users can deposit funds into the bank account (using the deposit() function), check their balance (using the getBalance() function), and withdraw funds (using the withdraw() function). However, the balances variable in the contract is of uint type, and if a user deposits more than 2^255 ether, it may cause an integer overflow in the balances variable, resulting in a significantly reduced user balance. Conversely, if a user attempts to withdraw an amount greater than balances[msg.sender], it may lead to integer underflow, turning the user's balance into a large value or even the maximum value in the contract. The attack contract shown on the right exploits this situation to attack the bank system. During the attack, the attacker first creates a contract called Attacker and passes the target contract Bank's address in the constructor. Then, in the attack() function, the attacker deposits 2^255 ether into the target contract Bank's account and subsequently calls the withdraw() function to withdraw 1 ether. Due to balances[msg.sender] becoming a large value, withdrawing 1 ether causes balances[msg.sender] to transform into a small value, leading to an integer underflow vulnerability. The attacker uses the getBalance() function to check the balance and stores the result in a public variable called overflow, allowing other users to view it. To prevent such attacks, developers should be cautious when using integer variables to avoid overflow or underflow, carefully considering security in contract design and implementation. In detecting and fixing vulnerabilities, researchers and developers can utilize various static and dynamic analysis tools to help identify and resolve issues. Additionally, contract auditing and security assessments are crucial steps in ensuring contract safety.
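The wraparound behavior behind these vulnerabilities can be illustrated with a short sketch of 256-bit unsigned arithmetic. The function names below are illustrative assumptions; safe_add and safe_sub only mimic the style of check a SafeMath-like library performs and are not the library's actual code.

```python
UINT256_MODULUS = 2**256
UINT256_MAX = UINT256_MODULUS - 1


def unchecked_add(a, b):
    """Addition that silently wraps around, as unchecked uint256 math does."""
    return (a + b) % UINT256_MODULUS


def unchecked_sub(a, b):
    """Subtraction that silently wraps around on underflow."""
    return (a - b) % UINT256_MODULUS


def safe_add(a, b):
    """SafeMath-style addition: reject results that do not fit in 256 bits."""
    result = a + b
    if result > UINT256_MAX:
        raise OverflowError("addition overflow")
    return result


def safe_sub(a, b):
    """SafeMath-style subtraction: reject results below zero."""
    if b > a:
        raise OverflowError("subtraction underflow")
    return a - b


# Underflow: withdrawing 1 from an empty balance yields the maximum value.
print(unchecked_sub(0, 1) == UINT256_MAX)

# Overflow: adding 1 to the maximum value wraps back to zero.
print(unchecked_add(UINT256_MAX, 1))

# The checked variant refuses the same operation.
try:
    safe_sub(0, 1)
except OverflowError as exc:
    print("rejected:", exc)
```

Recent Solidity compilers (version 0.8.0 and later) apply checks of this kind to arithmetic by default, so the explicit library is mainly relevant to contracts compiled with older versions.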

Uninitialized Storage Pointer
Uninitialized storage pointers in smart contracts represent a class of easily overlooked but potentially dangerous security vulnerabilities. These vulnerabilities typically occur when developers use storage pointers within smart contracts without initializing them, or when initialization is incomplete. Since storage pointers directly point to the storage space of a smart contract, attackers can exploit these pointers to access and modify data in the storage area, leading to incorrect reading, writing, or alteration of contract data.
In Figure 3, we present an example of a voting system smart contract with an uninitialized storage pointer vulnerability and an attack contract targeting it. In the voting system, users can call the vote() function to cast their votes and the getVotes() function to query the voting results. However, the votes variable in the contract is a mapping type; if it is not initialized before the getVotes() function is called, an uninitialized storage pointer vulnerability may arise. Attackers can exploit this vulnerability to access and modify data in the storage area, thereby tampering with the voting results. In the attack contract shown on the right, the attacker creates a contract named Attacker and passes the address of the target Voting contract in the constructor. In the attack() function, the attacker first calls the getVotes() function, passing their own address as a parameter. As the attacker's address has not been voted for, the value of votes[msg.sender] is 0, leading to the uninitialized storage pointer vulnerability. Next, the attacker calls the vote() function to cast their vote, causing the value of votes[msg.sender] to be set to 1, successfully manipulating the voting results. To prevent such attacks, developers should ensure proper initialization of storage pointers and carefully consider security during the design and implementation of smart contracts. In detecting and fixing vulnerabilities, researchers and developers can employ various static and dynamic analysis tools to help identify and remediate issues. Additionally, auditing and security evaluations of contracts are vital steps in ensuring contract safety.
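How an uninitialized storage pointer can corrupt contract state is easiest to see in a slot-level model. The sketch below is a deliberately simplified illustration and does not reproduce the contract of Figure 3: the Registry class, its two-slot layout, and the register functions are hypothetical. It models the behavior of Solidity compilers before version 0.5.0, in which an uninitialized local struct reference pointed at storage slot 0.

```python
class Registry:
    """Toy contract whose storage is a flat list of slots."""

    OWNER_SLOT = 0
    UNLOCKED_SLOT = 1

    def __init__(self, owner):
        # slot 0 holds the owner, slot 1 holds an "unlocked" flag
        self.storage = [owner, False]

    def register_buggy(self, name, address):
        # Models a local struct declared without initialization: the
        # reference defaults to slot 0, so field writes land on state slots.
        record_base = 0
        self.storage[record_base + 0] = name      # intended: record.name
        self.storage[record_base + 1] = address   # intended: record.address

    def register_fixed(self, name, address):
        # Models an explicit in-memory struct: contract storage is untouched.
        record = {"name": name, "address": address}
        return record


buggy = Registry(owner="deployer")
buggy.register_buggy("attacker", True)
print(buggy.storage)   # owner and flag have both been overwritten

fixed = Registry(owner="deployer")
fixed.register_fixed("attacker", True)
print(fixed.storage)   # state is unchanged
```

The buggy path lets a caller overwrite the owner and the flag simply by choosing the function arguments, which is why later compiler versions reject uninitialized storage references at compile time.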

Access Control Vulnerability
Access control vulnerabilities refer to situations where smart contracts fail to correctly implement permission control, allowing attackers to carry out unauthorized operations. In general, a smart contract should verify a user's identity before allowing certain actions. However, if the contract fails to properly implement authentication, attackers can bypass these measures by impersonating authorized users or by executing unauthorized operations through other means. Such vulnerabilities can lead to severe consequences, such as asset theft, contract tampering, or contract shutdown. Implementing proper access control in smart contracts is critical. Typically, functions within a contract should be limited to the contract owner or specified users. To enforce access control, contracts often use the require() statement in Solidity to check the caller's permissions. When checking permissions, contracts should adhere to the principle of least privilege, granting callers the minimum necessary permissions to execute an operation.
In Figure 4, we present an example of a smart contract with an access control vulnerability, along with an attacker's contract that exploits the vulnerability and the attack sequence. In this example, only the contract creator (owner) can execute the doSomething() function, and others are not permitted. The contract sets the creator as the owner in its constructor. The doSomething() function uses the require() statement to check if the caller is the contract creator, and if not, the function execution fails and returns an error message. However, if someone obtains the private key of the contract creator, they can impersonate the creator to carry out a privilege-escalation attack and execute the doSomething() function. Therefore, even if the contract has implemented authentication, attackers can still bypass it if the private key is leaked or if other vulnerabilities are present. Hence, implementing proper access control in smart contracts is of utmost importance. Researchers should focus on concurrency control, role and permission design, and implementation, including aspects such as authentication and access control, to ensure the security of smart contracts.
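The owner check described above can be sketched as follows. This is a minimal model, not the contract of Figure 4: the require helper imitates Solidity's require() by raising an exception, and the Unprotected class with its set_owner function is a hypothetical example of a privileged function that forgets the check.

```python
def require(condition, message):
    """Imitates Solidity's require(): abort the call if the condition fails."""
    if not condition:
        raise PermissionError(message)


class Owned:
    """Contract whose privileged function checks the caller."""

    def __init__(self, creator):
        self.owner = creator   # set once, at construction time
        self.calls = 0

    def do_something(self, caller):
        require(caller == self.owner, "caller is not the owner")
        self.calls += 1


class Unprotected(Owned):
    """Same contract plus an ownership setter that lacks the check."""

    def set_owner(self, caller, new_owner):
        # Missing require(caller == self.owner, ...): anyone can call this.
        self.owner = new_owner


contract = Owned(creator="alice")
contract.do_something("alice")            # permitted
try:
    contract.do_something("mallory")      # rejected by the owner check
except PermissionError as exc:
    print("rejected:", exc)

weak = Unprotected(creator="alice")
weak.set_owner("mallory", "mallory")      # takeover through the unguarded setter
weak.do_something("mallory")              # now passes the owner check
print(weak.owner, weak.calls)
```

The second half shows that a single unguarded function is enough to defeat an otherwise correct check, which is the kind of pattern that least-privilege reviews aim to catch.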

Front-End Runtime Error Vulnerability
Front-end runtime error vulnerabilities often occur in the parts of smart contracts that interact with user interfaces. As smart contracts run within front-end applications, various errors may arise, including logic errors, input validation errors, and exceptional cases. These errors can lead to smart-contract execution failures or unpredictable behavior, ultimately compromising the contract's security.
Figure 5 shows a voting system example: a smart contract called "Voting," where users can call the vote() function to vote and the getVotes() function to query the voting results. However, the vote() function in this contract has a logic error: if a user votes multiple times, an exception is triggered, causing execution to fail. This may lead to the contract not executing correctly or exhibiting unpredictable behavior. Consequently, in the attack contract, the attacker exploits this flaw by creating a contract called "Attacker" and passing the target contract Voting's address in the constructor. In the attack() function, the attacker calls the vote() function twice, causing the contract's execution to fail and potentially leading to unpredictable behavior. To prevent such attacks, developers should carefully consider the security of smart-contract front-end applications, including input validation, exception handling, and error handling. When designing and implementing contracts, developers should fully consider potential error scenarios and adopt appropriate security measures to ensure the contract's correctness and safety. Additionally, researchers and developers can use various static and dynamic analysis tools to detect and fix front-end runtime errors, helping to enhance the security of the contracts.

Time Dependency
Time dependency attacks are a type of assault targeting smart contracts by exploiting time-related operations within the contract and the inherent characteristics of blockchain. In a blockchain, timestamps usually depend on the block generation time. However, mining nodes possess a certain degree of timestamp manipulation capability. Attackers may manipulate block timestamps to influence time-dependent smart-contract logic, such as timers, auction end times, or voting deadlines. This can result in contract behavior failure or outcomes that deviate from expectations, causing losses to contract participants.
In Figure 6, we name a deposit and withdrawal smart contract "Bank." Users can deposit funds by calling the deposit() function and withdraw funds using the withdraw() function. The getBalance() function can be utilized to query user balances. However, the contract is vulnerable to time dependency attacks. If an attacker calls the deposit() function before a user invokes the withdraw() function, they can disrupt the target contract's execution order and extract the deposit before the user's withdrawal. In the corresponding attack contract, the attacker creates a contract called "Attacker" and passes the target contract Bank's address in the constructor. Within the attack() function, the attacker first calls the deposit() function to deposit funds. Then, the attacker calls the withdraw() function to extract 1 ether of funds. Since the deposit has been committed and included in a new block but not yet added to the blockchain, the attacker successfully extracts the deposit before the withdrawal, causing losses to the deposit and withdrawal system users. To prevent such attacks, developers should carefully consider the contract's time dependencies and adopt appropriate security measures during design and implementation. For instance, contracts can perform necessary state checks and input validation before transaction execution to prevent unnecessary interference or attacks. Moreover, users should be cautious about protecting their funds and information security when using contracts.
In summary, time-dependency attacks can pose a significant risk to smart contracts that rely on time-based logic. Developers need to be vigilant in addressing these vulnerabilities and implementing robust security measures when designing and building smart contracts. By taking the necessary precautions, both developers and users can help ensure the safety and integrity of smart contracts on the blockchain.
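The effect of timestamp manipulation on deadline logic, such as the auction end times mentioned above, can be sketched with a toy auction. Everything here is an illustrative assumption rather than the contract of Figure 6: the Auction class, the miner_timestamp helper, and in particular the 15-second tolerance MAX_DRIFT, which stands in for whatever window a given network accepts. Real chains impose further rules (for example, a timestamp must exceed that of the parent block) that this sketch ignores.

```python
MAX_DRIFT = 15  # assumed tolerance, in seconds, for a reported block timestamp


class Auction:
    """Toy auction whose deadline check trusts the block timestamp."""

    def __init__(self, end_time):
        self.end_time = end_time
        self.highest_bid = 0
        self.highest_bidder = None

    def bid(self, bidder, amount, block_timestamp):
        # Time-dependent check: relies entirely on the reported timestamp.
        if block_timestamp >= self.end_time:
            return False
        if amount <= self.highest_bid:
            return False
        self.highest_bid = amount
        self.highest_bidder = bidder
        return True


def miner_timestamp(real_time, desired, max_drift=MAX_DRIFT):
    """Timestamp a block producer could report: the desired value,
    clamped to the tolerated window around the real time."""
    return max(real_time - max_drift, min(desired, real_time + max_drift))


auction = Auction(end_time=1000)
auction.bid("alice", 5, block_timestamp=990)   # honest bid before the deadline

# Five seconds after the deadline, an honest timestamp rejects a late bid.
late_honest = auction.bid("bob", 6, block_timestamp=1005)

# A colluding block producer reports a timestamp just inside the deadline.
forged = miner_timestamp(real_time=1005, desired=999)
late_forged = auction.bid("bob", 6, block_timestamp=forged)

print(late_honest, forged, late_forged, auction.highest_bidder)
```

The late bid succeeds only because the contract's sole notion of time is a value the block producer controls within the tolerated window, which is why deadlines with second-level precision are fragile.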

Other Vulnerabilities
Building upon the previously discussed smart-contract vulnerabilities, there are other common vulnerability types in practice. Several representative ones are listed below.
Delegatecall Vulnerability: The delegatecall operation serves as a mechanism for cross-contract invocation in smart contracts; however, inappropriate usage may lead to security vulnerabilities. By carefully crafting the invocation parameters, attackers exploit the execution context of the target contract to perform malicious actions, resulting in asset theft or contract logic disruption.
Randomness Challenge: Due to the deterministic nature of blockchain, generating reliable random numbers within smart contracts poses a significant challenge. Attackers may predict or manipulate the random number generation process, thereby affecting the contract execution outcome and leading to asset loss or an unfair competitive environment.
Short Address Attack: A short address attack occurs when an attacker leverages the zero-padding applied during data transmission, providing incomplete address information to mislead the contract. Consequently, the attacker bypasses the validation mechanism and incorrectly transfers assets to an address under their control.
Gas Limit and Optimization Issues: The execution of smart contracts requires the consumption of Gas, with the Gas Limit serving as the budget ceiling for contract execution. When handling complex logic without optimizing the contract code, Gas resources may be depleted, preventing the contract from functioning correctly. Moreover, attackers can construct high Gas-consuming transactions to perform a denial of service, thereby compromising contract availability.
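The zero-padding effect behind the short address attack can be made concrete with a byte-level sketch. The functions encode_naive and decode_as_evm are hypothetical helpers written for this illustration: the first models a client that concatenates arguments without validating the address length, and the second models a reader that treats missing trailing calldata bytes as zeros. The four-byte prefix is the selector of the ERC-20 transfer(address,uint256) function; the address value itself is made up.

```python
TRANSFER_SELECTOR = bytes.fromhex("a9059cbb")  # transfer(address,uint256)


def encode_naive(address_bytes, amount):
    """Client-side encoding that does not check the address is 20 bytes."""
    return (TRANSFER_SELECTOR
            + b"\x00" * 12                 # left padding meant for 20 bytes
            + address_bytes
            + amount.to_bytes(32, "big"))


def decode_as_evm(calldata):
    """Read two 32-byte arguments, treating missing bytes as zeros."""
    args = calldata[4:].ljust(64, b"\x00")
    address = args[12:32]
    amount = int.from_bytes(args[32:64], "big")
    return address, amount


# Hypothetical attacker-controlled address whose last byte is zero.
full_address = bytes.fromhex("11" * 19 + "00")
short_address = full_address[:-1]          # attacker omits the trailing byte

calldata = encode_naive(short_address, 100)
decoded_address, decoded_amount = decode_as_evm(calldata)

print(len(calldata))                        # one byte shorter than expected
print(decoded_address == full_address)      # address is silently "repaired"
print(decoded_amount)                       # amount is shifted left by 8 bits
```

Because every byte after the truncated address shifts by one position, the leading zero byte of the amount completes the address and the amount gains a trailing zero byte, multiplying it by 256. Validating the argument length on the client side, or the calldata length in the contract, removes the ambiguity.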
The aforementioned list only covers some of the potential vulnerability types in smart contracts. As blockchain technology evolves, new vulnerabilities may emerge. Therefore, maintaining a focus on smart-contract security research and best practices is crucial for ensuring contract safety.

Development of Smart-Contract Security Detection
With the rapid development and widespread application of blockchain technology, smart contracts, as a core component, play a crucial role in implementing essential functions. However, the security issues of smart contracts urgently need to be addressed, as vulnerabilities may lead to severe economic losses and a crisis of trust. Against this backdrop, smart-contract security detection has gradually become a critical area of research and practice. This section reviews the development history of smart-contract security detection, from the initial methods and tools to the emerging innovative technologies, revealing their evolution trends and future challenges.
In 2016, Luu et al. [30] were the first to introduce the smart-contract vulnerability detection tool Oyente, a symbolic execution-based tool designed specifically to discover potential security vulnerabilities. Oyente can detect common vulnerabilities in smart contracts, such as reentrancy attacks, transaction order dependencies, and timestamp dependencies. This pioneering work laid the foundation for research in the Ethereum smart-contract security domain and had a profound impact on subsequent related studies.
Two years later, Tikhomirov et al. [31] proposed a static analysis-based smart-contract vulnerability detection tool called SmartCheck. This tool uses ANTLR (a powerful parser generator for building language tools) and a custom Solidity grammar to generate XML parse trees as an intermediate representation. Vulnerability patterns are then identified by running XPath queries on the Intermediate Representation (IR). That same year, numerous other research efforts made significant progress in the field of smart-contract security. Breidenbach et al. [32] introduced Securify, a symbolic execution-based security analysis tool aimed at assessing the safety of Ethereum smart contracts. Securify employed technologies such as Program Query Language (PQL) and Domain Specific Language (DSL) to achieve automated analysis of smart contracts. Additionally, the tool utilized dependency graph analysis, compliance checks, and violation pattern recognition methods to identify potential security vulnerabilities within smart contracts. Brent et al. [33] presented an Ethereum smart-contract security analysis framework called Vandal. Vandal transformed EVM bytecode into semantic logic relations and used the Soufflé language for declarative security analysis. The paper also introduced a new decompilation technique for incremental control flow reconstruction. Kalra et al. [34] proposed the ZEUS framework, which combined abstract interpretation and symbolic model-checking techniques to verify the correctness and fairness of smart contracts. By developing a Solidity-to-LLVM bytecode converter and using LLVM pass separation for transformation and verification checks, the contract security verification achieved low false-positive rates and high analysis efficiency. That same year, besides the previously mentioned static detection methods, many dynamic detection algorithms emerged, such as fuzz testing techniques. These techniques generated random input data and detected potential vulnerabilities during program execution, thus compensating for the shortcomings of static analysis methods in smart-contract security vulnerability detection. Jiang et al. [35] introduced ContractFuzzer, a fuzz-testing framework specifically designed for detecting security vulnerabilities in Ethereum smart contracts. The paper provided a detailed description of ContractFuzzer's design, including input generation and test oracle analysis strategies, and demonstrated its effectiveness in the high-precision detection of seven types of Ethereum smart-contract vulnerabilities through experimental research.
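The idea of detecting vulnerability patterns by querying an XML parse tree, as described for SmartCheck above, can be sketched with Python's standard xml.etree.ElementTree module, which supports a subset of XPath. The XML schema below (contract, function, and identifier elements) is invented for this illustration and is not SmartCheck's actual intermediate representation; the choice of tx.origin as the flagged construct is likewise only an example of a commonly discouraged pattern.

```python
import xml.etree.ElementTree as ET

# Hypothetical parse tree for a small contract; not a real tool's output.
PARSE_TREE = """
<contract name="Wallet">
  <function name="withdrawAll">
    <require>
      <identifier name="tx.origin"/>
      <identifier name="owner"/>
    </require>
    <call kind="transfer"/>
  </function>
  <function name="deposit">
    <assignment>
      <identifier name="balances"/>
      <identifier name="msg.sender"/>
    </assignment>
  </function>
</contract>
"""


def functions_using(root, identifier_name):
    """Return names of functions whose subtree mentions the identifier."""
    query = f".//identifier[@name='{identifier_name}']"
    flagged = []
    for function in root.iter("function"):
        if function.findall(query):     # XPath-style query on the subtree
            flagged.append(function.get("name"))
    return flagged


root = ET.fromstring(PARSE_TREE.strip())
print(functions_using(root, "tx.origin"))
print(functions_using(root, "msg.sender"))
```

Each vulnerability pattern then becomes one more query string, which is what makes this style of static analysis easy to extend but also limited to patterns that are visible in the syntax tree.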
One year later, Feist et al. [36] introduced a static analysis framework called Slither, designed to provide comprehensive information about Ethereum smart contracts. This framework achieves its goal by converting Solidity smart contracts into a dedicated intermediate representation called SlithIR. Employing a Static Single Assignment (SSA) form and a simplified instruction set, SlithIR streamlines the analysis process while preserving semantic information that might be lost during the conversion of Solidity to bytecode. Slither utilizes common program analysis techniques such as data flow analysis and taint tracking to uncover potential vulnerabilities and opportunities for code optimization. Chang et al. [37] proposed sCompile, a method for automatically identifying critical paths in smart contracts and ranking them by importance. This method uses symbolic execution to explore possible execution paths in smart contracts and identify those involving monetary transactions. Ranking the identified paths by importance ensures that paths that might violate safety or correctness are prioritized for analysis. In that year, the first machine-learning-based smart-contract vulnerability detection model, SmartEmbed [38], emerged. This tool consists of two phases: a model training phase and a prediction phase. The training phase comprises four main steps: tokenization, syntax parsing, code embedding, and similarity computation. Tokenization breaks down code into individual tokens; syntax parsing analyzes the code structure to identify syntactic components; code embedding maps each token and syntactic element to a high-dimensional vector space; and similarity computation compares the vectors of different code snippets to determine their similarity. In the prediction phase, SmartEmbed detects clones by identifying similar smart contracts based on embeddings and can also detect vulnerabilities by comparing contracts from the existing Ethereum blockchain or any contract provided by a developer to a vulnerability database. This tool can efficiently verify whether a given smart contract contains known vulnerabilities without the need to manually define vulnerability patterns.
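The similarity-computation step can be illustrated with a toy sketch. This is not SmartEmbed's implementation: it assumes a naive regex tokenizer and uses bag-of-tokens count vectors as a stand-in for learned embeddings, and all function names and code snippets are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(code: str) -> list[str]:
    # Naive tokenizer (assumption): identifiers, numbers, single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def embed(code: str) -> Counter:
    # Stand-in for a learned embedding: a sparse token-count vector.
    return Counter(tokenize(code))


def cosine_similarity(a: Counter, b: Counter) -> float:
    # Cosine of the angle between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical snippets: a known vulnerable pattern, a near clone, unrelated code.
known_vulnerable = "msg.sender.call.value(amount)(); balances[msg.sender] -= amount;"
candidate = "msg.sender.call.value(amt)(); balances[msg.sender] -= amt;"
unrelated = "function totalSupply() public view returns (uint256) { return supply; }"

sim_clone = cosine_similarity(embed(known_vulnerable), embed(candidate))
sim_other = cosine_similarity(embed(known_vulnerable), embed(unrelated))
```

In this sketch the near clone scores much higher than the unrelated function, which is the property a similarity threshold against a vulnerability database would rely on.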
Following this, the number of smart-contract vulnerability detection tools increased rapidly. For example, Huang et al. [39] introduced a vulnerability detection tool that combined graph embedding with bytecode. They normalized data and instructions, used simulated bytecode execution to track data flow and control flow, performed contract slicing, and designed an unsupervised graph embedding algorithm to encode code graphs as comparable vectors, identifying potentially vulnerable smart contracts. Chen et al. [40] proposed DefectChecker, a method and tool based on symbolic execution for detecting eight types of contract defects that could lead to undesirable behavior in Ethereum smart contracts, and validated its performance on open-source datasets. Chen et al. [41] automatically recovered function signatures by exploiting the way the EVM processes functions, identified parameter types using Type-Aware Symbolic Execution (TASE), and developed a tool called SigRec to recover function signatures from contract bytecode. Hu et al. [42] introduced SoliDetector, a static defect detection method based on a knowledge graph of the Solidity language, which constructed an ontology layer and an instance layer, introduced defect patterns, designed inference rules, and used SPARQL queries to locate defects.
In recent years, machine-learning-based smart-contract vulnerability detection methods have attracted widespread attention and research. These methods take full advantage of machine-learning techniques, offering more efficient and accurate solutions for smart-contract security analysis. However, as Chapter 5 of this paper explores machine-learning-based smart-contract vulnerability detection techniques in depth, further discussion is not provided in this background section. In subsequent chapters, we will elaborate on the specific implementation of these methods and their application in the security analysis of smart contracts.

Machine-Learning Techniques
Machine learning is committed to enabling computers to accumulate new experience and knowledge by mining potential patterns in data, thereby improving their intelligence and enabling them to make decisions like humans. The application of machine-learning algorithms has become increasingly crucial in the field of smart-contract vulnerability detection. As various industries experience sustained growth in data demand and an escalating need for efficient data processing and analysis, numerous tailored machine-learning algorithms have emerged. These algorithms primarily rely on mathematical and statistical approaches to address optimization problems.

The Development of Machine Learning
The development of machine learning is shown in Figure 7. In 1943, McCulloch and Pitts [43] introduced a mathematical model that depicted the fundamental structure of artificial neural networks. This research laid the groundwork for the development of the neural network field and had a profound impact on later machine-learning techniques. In 1950, Turing proposed the "Turing Test" [44], marking the beginning of artificial intelligence as an important research area. In 1957, Rosenblatt [45] introduced the perceptron, which initiated the study of computer neural networks. Hubel and Wiesel [46] discovered a neural network structure that provided deep insights into understanding biological visual systems, significantly influencing later computer vision and neural network models. In 2012, Graves et al. [49] described the LSTM model (first introduced by Hochreiter and Schmidhuber in 1997 [73]), which addresses the vanishing and exploding gradient problems in recurrent neural networks (RNNs) when processing long sequences and has profoundly influenced sequence prediction and natural language processing. In 2014, Goodfellow et al. [50] introduced the concept of GANs, which comprise two competing neural networks: a generator network and a discriminator network. This framework brought innovation to generative model research. In 2015, Mnih et al. [51] proposed an algorithm called Deep Q-Network (DQN), which combined convolutional neural networks (CNNs) with the Q-learning algorithm to estimate action-value functions directly from raw pixel inputs. This approach demonstrated the immense potential of combining deep learning with reinforcement learning for the first time.
In 2017, Vaswani et al. [52] introduced the Transformer, a novel neural network architecture based on the self-attention mechanism. This had a profound impact on the Natural Language Processing (NLP) field. In 2018, the release of the BERT model [53] brought revolutionary changes to the field. In the same year, OpenAI released GPT-1 and subsequently launched GPT-2 in 2019, GPT-3 in 2020, and GPT-4 in 2023, propelling natural language processing technology to unprecedented heights.

Machine-Learning Algorithms
We divide machine-learning techniques into four categories: Supervised Learning, Semi-Supervised Learning, Unsupervised Learning, and Reinforcement Learning. The general overview is shown in Figure 8. In what follows, we will discuss these four categories and their related methods in detail.

Supervised Learning
In the domain of smart-contract security analysis, supervised learning approaches have achieved significant success and have been extensively employed in practical scenarios. This can be primarily attributed to the powerful capabilities of supervised learning algorithms in pattern recognition and knowledge representation, as well as their effective utilization of large volumes of labeled data. By training models to identify potential security vulnerabilities, supervised learning offers robust support for the security auditing of smart contracts.
In the following, we describe some representative supervised learning approaches employed in smart-contract security analysis.
Linear regression: This algorithm is a linear method used for modeling the relationship between a dependent variable and one or more independent variables, typically employed for predicting numerical outcomes. The fundamental equation is $y = kx + b$, where $k$ represents the slope and $b$ denotes the intercept. Linear regression offers simplicity, interpretability, and computational efficiency [54], and prediction problems constitute a classic application scenario of the linear regression algorithm [55].
Logistic regression: This algorithm is a linear method introduced by Cox [56] in 1958 for modeling discrete target variables. The basic form of the model is $y = \mathrm{sigmoid}(w^{T}x + b)$, where the sigmoid function is defined as $\mathrm{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$, making logistic regression an effective method for binary classification tasks. Logistic regression is easy to interpret and can be readily extended to handle multi-class classification problems; it is currently applied extensively in areas such as credit assessment and tumor diagnosis [57].
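As a minimal sketch of how a fitted logistic regression model turns features into a binary decision, the following assumes a linear score passed through the sigmoid and a decision threshold of 0.5. The weights, bias, and feature values are hypothetical and not taken from any cited work.

```python
import math


def sigmoid(z: float) -> float:
    # sigmoid(z) = 1 / (1 + exp(-z)); maps any real score to (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    # Linear score w^T x + b followed by the sigmoid.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict(weights, bias, features, threshold=0.5):
    # Binary decision: class 1 if the estimated probability reaches the threshold.
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


# Hypothetical fitted parameters and two hypothetical feature vectors.
weights, bias = [2.0, -1.0], 0.0
label_a = predict(weights, bias, [1.0, 0.5])  # score 1.5, probability above 0.5
label_b = predict(weights, bias, [0.0, 1.0])  # score -1.0, probability below 0.5
```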
Support Vector Machines: SVMs are a family of classical supervised learning methods, initially proposed by Cortes and Vapnik [58], that aim to identify the optimal hyperplane maximizing the margin between distinct classes. An SVM is formalized as a convex optimization problem, which can be expressed as $\min_{w, b} \|w\|^{2}$ subject to $y_i (w^{T} x_i + b) \geq 1$. These techniques are highly effective for classification and regression tasks in high-dimensional spaces and are robust against overfitting; thus, they are frequently applied to gene classification in bioinformatics [59].
Random Forest: This algorithm is an ensemble learning method introduced by Breiman [60] in 2001. It improves prediction accuracy by constructing multiple decision trees and combining their predictions through voting (for classification) or averaging (for regression). This approach has been widely and successfully employed in classification and regression tasks across numerous application domains [61].
K-Nearest Neighbors: This algorithm is an instance-based learning method proposed by Cover and Hart in 1967 [62]. It operates by computing the distances between a test data point and known data points using a distance metric, such as the Euclidean distance, identifying the nearest K neighbors, and classifying the test point based on the labels of these neighbors. This method holds classical significance in the field of pattern recognition [63].
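The procedure just described (compute distances, take the K nearest points, vote on the label) can be sketched in a few lines. The data points and the labels "safe" and "vulnerable" are hypothetical, and majority voting with Euclidean distance is one common choice among several.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    # Rank training points by Euclidean distance to the query point.
    ranked = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional feature vectors with hypothetical labels.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["safe", "safe", "safe", "vulnerable", "vulnerable", "vulnerable"]

near_origin = knn_classify(points, labels, (0.5, 0.5))
near_far_cluster = knn_classify(points, labels, (5.5, 5.5))
```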
Convolutional Neural Networks: This algorithm is a deep learning architecture specifically designed for processing grid-like data, such as images, introduced by LeCun et al. in 1989 [64]. CNNs employ convolutional layers to learn local features, pooling layers to reduce spatial dimensions, and fully connected layers for classification or regression [65]. The convolution operation is defined as $(f * g)(t) = \int f(\tau)\, g(t - \tau)\, d\tau$, enabling the network to effectively capture local patterns and hierarchical features. A classic application of CNNs is LeNet-5, which achieved breakthrough results in handwritten digit recognition tasks [66].
Graph Neural Networks: This algorithm is a deep learning method for processing graph data, proposed by Scarselli et al. in 2009 [67]. GNNs learn node representations by passing information along the edges between nodes. The core idea of GNNs is to multiply the adjacency matrix $A$ of graph-structured data with the node feature matrix $X$, as in $AX$, so that each node aggregates the features of its neighbors. This approach finds broad applications in areas such as social network analysis, recommendation systems, and knowledge graphs [68].
Graph Convolutional Networks: This algorithm was introduced by Kipf et al. in 2016 [69] as a method of extending convolutional operations to graph-structured data. GCNs perform graph convolution operations using the adjacency matrix $A$ and the node feature matrix $X$, as in $Z = \mathrm{ReLU}(\hat{A} X W)$, where $\hat{A}$ is the normalized adjacency matrix, $X$ represents the node feature matrix, $W$ is the weight matrix, and ReLU is the activation function. This method holds classical significance in tasks such as node classification, graph embedding, and link prediction [70].
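A single graph-convolution layer of the form Z = ReLU(Â X W) can be sketched without external libraries. The sketch assumes the symmetric normalization with self-loops, Â = D^(-1/2) (A + I) D^(-1/2), as one common way to normalize the adjacency matrix; the three-node graph, the features, and the weights are hypothetical.

```python
import math


def matmul(a, b):
    # Plain nested-list matrix product.
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalize_adjacency(adj):
    # Assumed normalization: add self-loops, then scale by D^(-1/2) on both sides.
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj, features, weights):
    # Z = ReLU(A_hat X W)
    a_hat = normalize_adjacency(adj)
    z = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, value) for value in row] for row in z]


# Hypothetical path graph 0 - 1 - 2 with two features per node.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, -1.0], [0.5, 0.5]]

output = gcn_layer(adjacency, features, weights)
```

Each output row mixes a node's own features with those of its neighbors, which is why stacking such layers lets information propagate across several hops of the graph.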
Recurrent Neural Networks: This algorithm is a neural network method for processing sequence data, proposed by Rumelhart et al. in 1986 [71], capable of capturing temporal dependencies. The core idea of RNNs is to introduce recurrent connections in the network's hidden layer, allowing information to be passed between time steps, as in $h_t = f(W x_t + U h_{t-1})$, where $h_t$ represents the hidden state at time $t$, $x_t$ represents the input at time $t$, $W$ and $U$ are weight matrices, and $f$ is the activation function. RNNs find widespread applications in fields such as natural language processing, speech recognition, and time series prediction [72].
Long Short-Term Memory: This algorithm is a special type of Recurrent Neural Network (RNN) introduced by Hochreiter and Schmidhuber [73] in 1997 to address the vanishing gradient problem in long sequences. LSTMs control the storage and flow of information in cell states by introducing forget, input, and output gates; for example, the forget gate is computed as $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$. LSTMs are widely applied in tasks such as natural language processing, speech recognition, and time series prediction [74].
Gated Recurrent Units: This algorithm was proposed by Cho et al. in 2014 [75]. GRUs are a variant of LSTMs that reduce computational complexity by using fewer gates while maintaining similar performance in many applications. GRUs control the storage and flow of information in the hidden state by introducing update and reset gates; for example, the update gate is computed as $z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$. GRUs exhibit good performance in tasks such as natural language processing, speech recognition, and time series prediction.

Semi-Supervised Learning
Although supervised learning methods have achieved significant results in smart-contract security detection, semi-supervised learning methods have been relatively less explored in this field. Semi-supervised learning methods combine labeled and unlabeled data during the training process, aiming to overcome the dependency on large amounts of labeled data in supervised learning. However, the application of semi-supervised learning methods in smart-contract security detection is limited by several factors. First, the complexity and diversity of smart-contract vulnerabilities may lead to insufficient generalization capabilities in semi-supervised learning methods. Second, the data distribution in this field may exhibit significant imbalances, which may negatively impact the performance of semi-supervised learning algorithms. We believe that although the application of semi-supervised learning methods in smart-contract security detection is relatively limited, they have potential advantages in dealing with data scarcity and reducing annotation costs. Therefore, in the future, semi-supervised learning methods may still play a role in smart-contract vulnerability detection, providing new solutions for security audits.
Self-training: This algorithm was first introduced by Yarowsky [76]. Self-training methods initially train a base model on a small labeled dataset, then use the model to predict labels for unlabeled data; the most confident predictions are added to the labeled set, and the model is retrained until no further confident predictions remain. The main application areas of self-training methods include image classification and natural language processing tasks [77].
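A typical self-training loop trains a base model on the labeled set, pseudo-labels the unlabeled points it is confident about, adds them to the labeled set, and retrains. The sketch below assumes a deliberately simple base model (one centroid per class on one-dimensional features) and a distance margin as the confidence measure; both choices, and all data values, are illustrative assumptions.

```python
def fit_centroids(points, labels):
    # Base model (assumption): one centroid (mean) per class for 1-D features.
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_margin(centroids, x):
    # Predicted label plus a confidence margin: the gap between the distances
    # to the closest and second-closest centroids.
    ranked = sorted((abs(x - c), y) for y, c in centroids.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def self_train(labeled_x, labeled_y, unlabeled_x, min_margin=1.0, max_rounds=10):
    xs, ys, pool = list(labeled_x), list(labeled_y), list(unlabeled_x)
    for _ in range(max_rounds):
        centroids = fit_centroids(xs, ys)
        confident, remaining = [], []
        for x in pool:
            label, margin = predict_with_margin(centroids, x)
            (confident if margin >= min_margin else remaining).append((x, label))
        if not confident:
            break  # nothing confident left to pseudo-label
        for x, label in confident:
            xs.append(x)
            ys.append(label)
        pool = [x for x, _ in remaining]
    return fit_centroids(xs, ys), pool


# Two labeled points and five unlabeled points (hypothetical values).
model, still_unlabeled = self_train([0.0, 10.0], [0, 1], [1.0, 2.0, 9.0, 8.0, 5.0])
```

The point exactly halfway between the classes never receives a pseudo-label, which illustrates how the confidence threshold limits the propagation of labeling errors.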
Tri-training: This algorithm, proposed by Zhou et al. in 2005 [78], trains three models on a labeled dataset and uses each model to label unlabeled data. If two models agree on the label for a data point, the label is added to the labeled dataset, and the third model is retrained on the extended dataset. This process is repeated, improving model performance. Classic application areas of tri-training include text classification and named entity recognition [79].
BERT: Bidirectional Encoder Representations from Transformers was introduced by Devlin et al. in 2018 [80]. It is a pre-trained natural language processing model based on the Transformer architecture that undergoes pre-training on a large amount of unlabeled text using masked language modeling (MLM) and next sentence prediction (NSP) tasks. The key lies in its bidirectional context encoding, which represents an input sequence as contextual embeddings $E = (e_1, e_2, \ldots, e_n)$. It has been widely applied in areas such as text classification, named entity recognition, and question-answering systems [81].
GPT: The Generative Pre-trained Transformer was proposed by Radford et al. in 2018 [82]. This is another pre-trained natural language processing model based on the Transformer architecture; it uses only unidirectional autoregressive language modeling for pre-training. GPT models the probability of a sequence as $P(x) = \prod_{i} P(x_i \mid x_1, \ldots, x_{i-1})$. This formula denotes that, given the previous word sequence $x_1, \ldots, x_{i-1}$, the goal of GPT is to maximize the conditional probability of predicting the next word $x_i$ [83]. The main application areas of GPT include text generation, text summarization, and machine translation [84].
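The autoregressive factorization P(x) = ∏ P(x_i | x_1, ..., x_{i-1}) can be made concrete with a toy model. The sketch below is not GPT: it assumes a bigram approximation in which each word is conditioned only on the immediately preceding word, estimated by counting over a tiny hypothetical corpus.

```python
import math
from collections import Counter, defaultdict


def train_bigram(corpus):
    # Count how often each word follows each preceding word ("<s>" marks the start).
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts


def sequence_log_prob(counts, sentence):
    # log P(x) = sum_i log P(x_i | x_{i-1}) under the bigram assumption.
    tokens = ["<s>"] + sentence.split()
    log_p = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total = sum(counts[prev].values())
        if total == 0 or counts[prev][cur] == 0:
            return float("-inf")  # unseen transition, no smoothing in this sketch
        log_p += math.log(counts[prev][cur] / total)
    return log_p


# Hypothetical two-sentence corpus.
bigram_counts = train_bigram(["the contract calls the token", "the contract pays"])
log_p = sequence_log_prob(bigram_counts, "the contract pays")
```

Here the sequence probability is 1 × 2/3 × 1/2 = 1/3, the product of the three conditional probabilities along the sentence.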

Unsupervised Learning
Unsupervised learning is a type of machine-learning technique in which algorithms learn from and identify patterns in unlabeled data. In contrast to supervised learning, which relies on a dataset with labeled examples, unsupervised learning algorithms analyze the underlying structure or distribution of the data without any prior knowledge of the correct outputs.
K-means: This algorithm, proposed by MacQueen [85], is a prototype-based iterative clustering method aimed at minimizing the sum of squared distances between data points within each cluster and their respective cluster centers. This method can be expressed by the following formula: $\arg\min_{C} \sum_{i=1}^{K} \sum_{x \in C_i} \|x - \mu_i\|^{2}$. Here, $K$ represents the number of clusters, $C_i$ denotes the $i$-th cluster, $x$ signifies a data point, $\mu_i$ stands for the cluster center of the $i$-th cluster, and $\|\cdot\|$ represents the Euclidean distance. The K-means method has extensive applications in the field of market segmentation [86,87].
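The iterative minimization of the within-cluster sum of squared distances is usually carried out by alternating an assignment step and a center-update step (Lloyd's algorithm). The sketch below assumes deterministic initialization from the first K points, which is chosen only to keep the example reproducible; the data points are hypothetical.

```python
import math


def kmeans(points, k, iterations=100):
    # Deterministic initialization (assumption): use the first k points as centers.
    centers = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda i: math.dist(p, centers[i])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each center moves to the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


points = [(0, 0), (0, 1), (5, 5), (6, 5), (1, 0), (5, 6)]
centers, assignment = kmeans(points, k=2)
```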
Spectral Clustering: This algorithm is a graph-theoretic clustering method that captures the complex structure of data in low-dimensional space by performing dimensionality reduction on the eigenvectors of the data's Laplacian matrix [88]. Spectral clustering has classical significance in the field of image segmentation [89].

Density-Based Spatial Clustering of Applications with Noise (DBSCAN): This algorithm, proposed by Ester et al. [90], is a density-based clustering algorithm. This method discovers clusters of arbitrary shapes without the need to specify the number of clusters by connecting dense regions and distinguishing noise points. The implementation steps of this method are: (1) calculate the number of points within the ε-neighborhood of each data point; (2) identify core points, whose ε-neighborhood contains at least MinPts points; (3) group density-reachable core points into clusters, assign each non-core point that lies within the ε-neighborhood of a core point to that point's cluster, and label the remaining points as noise.
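The three steps above can be sketched as a short, unoptimized implementation. It assumes Euclidean distance, counts a point as a member of its own ε-neighborhood, and uses a brute-force neighborhood query; the parameter values and data points are hypothetical.

```python
import math

NOISE = -1


def region_query(points, idx, eps):
    # Indices of all points within distance eps of points[idx] (including itself).
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not yet visited"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster_id  # i is a core point: start a new cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
        cluster_id += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
cluster_labels = dbscan(points, eps=1.5, min_pts=3)
```

Note that the number of clusters is never passed in: it emerges from eps and min_pts, and the isolated point ends up labeled as noise.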
Principal Component Analysis (PCA): This algorithm, proposed by Pearson [91], is a linear dimensionality reduction technique that projects the original data onto a new orthogonal coordinate system whose axes maximize the data variance, thereby reducing the data dimensionality while preserving as much information as possible. PCA has classical significance in the field of face recognition [92].
Maximum Entropy: This algorithm, proposed by Jaynes [93], is an information-theoretic statistical modeling technique that selects the least biased probability distribution by maximizing entropy under given constraints. The Maximum Entropy method has extensive applications in the part-of-speech tagging task [94].
Autoencoders: This algorithm, formally proposed by Rumelhart et al. [47], is an unsupervised neural network model that learns to compress data with an encoder and reconstruct it with a decoder, thereby achieving dimensionality reduction and feature extraction. The implementation process of autoencoders consists of three steps: encoding, decoding, and optimization. This method has extensive applications in the field of image denoising [95].
Generative Adversarial Network (GAN): This algorithm, proposed by Goodfellow et al. [50], is a generative model based on adversarial training that learns to generate data similar to the true data distribution through a competitive process of simultaneously optimizing the generator and discriminator. GANs have extensive applications in the field of image generation [96].

Reinforcement Learning
Reinforcement learning is an important machine-learning method that has achieved significant results in multiple application fields in recent years and is also relevant to research on smart-contract security detection. By interacting with the environment, reinforcement learning enables intelligent agents to autonomously explore the optimal strategy to maximize cumulative rewards in the long term. This section briefly introduces some reinforcement learning methods.
Q-Learning: This algorithm was introduced by Watkins et al. [97] as a model-free reinforcement learning algorithm that iteratively updates the action-value function $Q(s, a)$ to estimate the expected total return of executing a particular action in a given state, enabling the agent to select the optimal action based on Q-values. The general formula for Q-Learning is: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$. Here, $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r$ is the immediate reward, and $s'$ is the next state. Q-learning has classical significance in applications such as automatic control and network transmission [98,99].
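The standard tabular update, Q(s, a) ← Q(s, a) + α[r + γ max over a' of Q(s', a') − Q(s, a)], can be sketched as a single function. The state names, actions, reward, and hyperparameter values below are hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    # Best estimated value obtainable from the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference target and in-place update of Q(s, a).
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Q-table with all values initialized to zero (hypothetical states and actions).
q_table = defaultdict(float)
actions = ["left", "right"]

first = q_update(q_table, "s0", "right", 1.0, "s1", actions)
second = q_update(q_table, "s0", "right", 1.0, "s1", actions)
```

Repeating the same transition moves the estimate from 0 to 0.1 and then to 0.19, approaching the observed reward at a pace set by the learning rate.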
Deep Q-Network: This algorithm was proposed by Mnih et al. [51]. It is a reinforcement learning algorithm that combines deep neural networks with Q-Learning, using neural networks to approximate the action-value function $Q(s, a)$ and handling high-dimensional, complex input state spaces, such as raw images. The general update formula for DQN minimizes the loss $L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right]$. Here, $\theta$ and $\theta^{-}$ represent the parameters of the current and target neural networks, respectively.
REINFORCE: This algorithm was introduced by Williams [100]. It is a reinforcement learning method that directly optimizes policy parameters by sampling trajectories to obtain unbiased estimates of the policy gradient and updating policy parameters using gradient ascent. Its core formula is: $\Delta\theta = \alpha \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, G_t$. Here, $\theta$ represents policy parameters, $\alpha$ is the learning rate, $\pi_{\theta}(a_t \mid s_t)$ is the probability of selecting action $a_t$ in state $s_t$, and $G_t$ is the cumulative reward starting from time step $t$. REINFORCE has classical significance in applications such as sequence generation and natural language processing [101].
MCTS: Monte Carlo Tree Search was proposed by Coulom [102] as a search method based on Monte Carlo simulations that constructs a search tree and gradually finds an approximately optimal solution by balancing exploration and exploitation. MCTS does not have a general formula but primarily consists of four stages: Selection, Expansion, Simulation, and Backpropagation. AlphaGo is a prominent example of combining MCTS with deep learning [103]. In 2016, it defeated the world Go champion, Lee Sedol.
GAIL: Generative Adversarial Imitation Learning was introduced by Ho et al. [104]. It is a method that combines Generative Adversarial Networks (GANs) and reinforcement learning by training agents to acquire efficient policies through imitation learning of expert policies. The general formula for GAIL mainly includes generator (policy) loss and discriminator loss.

Comparing Different Kinds of Machine Learning
As shown in Table 1, we compared the advantages and disadvantages of the four categories: supervised learning, semi-supervised learning, unsupervised learning, and reinforcement learning. A detailed analysis follows.
Supervised Learning: This method is known for its high predictive accuracy and wide applicability across different domains. However, it requires a large labeled dataset to train effectively, which can be a limiting factor for some applications.
Semi-Supervised Learning: This approach benefits from high data utilization and reduced labeling costs compared to supervised learning, as it can work with partially labeled data. Nonetheless, it faces challenges such as increased algorithm complexity and dependence on underlying assumptions, which may affect its performance.
Unsupervised Learning: The main advantages of this method are that it does not require labeled data and can discover latent structures within the data. However, its predictive performance is generally limited compared to supervised methods, and evaluating the quality of unsupervised models can be challenging.
Reinforcement Learning: This technique is particularly useful for decision-making and adaptive learning in dynamic environments. However, it comes with downsides, such as computational complexity and slow convergence, which may hinder its practicality in some situations.

Case Studies of ML-Based Smart-Contract Security Detection
In this section, we will explore the application of machine-learning techniques in the domain of smart-contract security detection. Machine-learning methods have demonstrated their powerful capabilities in various fields, consequently offering significant potential for smart-contract security detection. By reviewing relevant literature, we will delve into the application cases, advantages, and limitations of these methods in smart-contract security detection, ultimately providing valuable insights for future research and practice.

Document Retrieval
We conducted a comprehensive investigation into the application of machine learning in the domain of smart-contract security detection. In order to gain a thorough understanding of the latest advancements and trends in this field, we retrieved numerous relevant articles from authoritative databases such as Wiley, IEEE, Springer, ACM, and Elsevier. Given the limited number of such research works and their primary concentration within the last five years, we did not apply any time constraints during our search. While formulating the search strategy, we initially identified the main keywords and search terms, including "smart contract," "detection," "vulnerability," and "machine learning". To ensure the comprehensiveness of the search results, we expanded the keywords to encompass "bug," "fault," "security," "analysis," and more while also accounting for variations in tense and singular/plural forms by applying appropriate fuzziness. Based on these keywords and modifications, we constructed the following search formula:
("smart contract*") AND ("bug" OR "fault" OR "security" OR "vulnerability") AND ("detection" OR "analysis") AND ("machine learning" OR "deep learning" OR "reinforcement learning")
Employing this search formula with some adjustments, we retrieved a total of 176 articles. The number of related papers in each database over the years is shown in Figure 9. To ensure the quality and relevance of the selected literature, we adopted the following screening criteria: (1) only include English literature; (2) focus on empirical research pertaining to the topic; (3) only include papers published in reputable academic journals or conferences. After the screening process, a total of 32 articles were ultimately included in this review.
Figure 9. Related papers for each database.
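To make the structure of the search formula explicit, the sketch below applies it to a piece of text such as a title or abstract: every AND group must be satisfied by at least one of its OR terms. The matching rules are assumptions for illustration only (lowercasing, treating hyphens as spaces, and prefix matching to approximate the wildcard and plural forms); actual database search engines apply their own rules, and the example titles are hypothetical.

```python
import re

# One inner list per AND group; terms within a group are OR alternatives.
QUERY_GROUPS = [
    [r"smart contract\w*"],
    [r"bug", r"fault", r"security", r"vulnerability"],
    [r"detection", r"analysis"],
    [r"machine learning", r"deep learning", r"reinforcement learning"],
]


def matches_query(text: str) -> bool:
    # Assumed normalization: case-insensitive, hyphens treated as spaces.
    normalized = text.lower().replace("-", " ")
    return all(
        any(re.search(r"\b" + term, normalized) for term in group)
        for group in QUERY_GROUPS
    )


# Hypothetical titles used only to exercise the filter.
hit = matches_query("Smart-contract vulnerability detection with deep learning")
miss = matches_query("Smart contracts for supply chain analysis")
```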

Machine-Learning-Based Tools for Smart-Contract Vulnerability Detection
In this chapter, we will systematically present the smart-contract vulnerability detection frameworks that have successfully employed machine-learning techniques to date. Our discussion will be organized chronologically. Up to now, there have been 32 frameworks successfully utilizing machine-learning technologies for smart-contract vulnerability detection.
In 2019, Gao et al. [38] introduced SmartEmbed, the first work to employ machine-learning techniques for detecting vulnerabilities in smart contracts. Their paper presented a web service tool based on code embedding and similarity detection methods. This tool achieved vulnerability detection by comparing the similarity between existing Solidity code on the Ethereum blockchain and the code embeddings of known vulnerabilities. Two years later, the authors refined their tool [105]. In their extended work, they further explored the application of machine-learning techniques in smart-contract security analysis and proposed several innovative improvements.
In 2020, Hao et al. introduced SCscan [106], a scanning system based on Support Vector Machines (SVM) for detecting vulnerabilities in blockchain smart contracts. The system aimed to identify potential security risks in smart contracts that could be exploited by attackers for illicit gains. In the same year, Lou et al. [107] proposed a Ponzi scheme detection method in smart contracts using an improved Convolutional Neural Network. They utilized a dataset of 3774 smart contracts for model training, including 132 Ponzi scheme contracts and 3642 legitimate smart contracts. Qian et al. [108] presented a deep learning approach based on bidirectional Long Short-Term Memory networks and attention mechanisms (BiLSTM-ATT) for the precise detection of reentrancy vulnerabilities. Furthermore, they introduced a contract fragment representation for smart contracts, which aids in capturing crucial semantic information and control flow dependencies.
In 2021, Hara et al. [109] employed machine-learning algorithms to detect Honeypots in Ethereum smart contracts. They proposed two feature extraction methods: one using TF-IDF to extract word features from the bytecode of the Ethereum blockchain and another using word2vec to extract distributed representations from the same data. These features could be used to detect Honeypots, and machine learning enhanced the detection performance. In the same year, Mi et al. [110] introduced a framework called VSCL for the automatic detection of vulnerabilities in smart contracts on blockchains. First, it leveraged a novel feature vector generation technique to extract information from the bytecode of smart contracts, as the source code of smart contracts is rarely publicly available. Then, the collected vectors were fed into their deep neural network (DNN) based on metric learning to obtain detection results. Wang et al. [111] utilized deep learning techniques to automatically detect vulnerabilities in smart contracts. Their approach combined various code representations, such as code tokens, ASTs, and control flow graphs, and employed deep learning models for training and prediction. This method facilitated the learning of more comprehensive semantic information and enhanced the accuracy and completeness of vulnerability detection. Yu et al. [112] proposed a modular and systematic vulnerability detection framework based on deep learning named DeeSCVHunter. The framework focused on two types of smart-contract vulnerabilities, reentrancy and time dependence, and introduced a novel concept called Vulnerability Candidate Slices (VCS) to help the model capture key points of vulnerabilities. Zhang et al. [113] presented a new classification model based on an improved CatBoost algorithm. The model employed a novel feature extraction pattern, delving deeper into the logic of smart-contract code. This approach could be used to detect Ponzi schemes during deployment and offer better performance, ultimately helping to prevent investor losses.
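As an illustration of TF-IDF feature extraction from bytecode, of the kind mentioned above for Honeypot detection, the following sketch weights opcode tokens by how distinctive they are across contracts. It is not the method of Hara et al.: it assumes the bytecode has already been disassembled into opcode mnemonics, uses the plain weighting tf × log(N / df) without smoothing, and the three opcode sequences are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    # documents: list of non-empty token lists, one per contract.
    n_docs = len(documents)
    # Document frequency: in how many contracts each token appears.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return vectors


# Hypothetical disassembled opcode sequences for three contracts.
contracts = [
    ["PUSH1", "PUSH1", "MSTORE", "CALLVALUE", "DUP1"],
    ["PUSH1", "CALLER", "SSTORE", "SELFDESTRUCT"],
    ["PUSH1", "PUSH1", "MSTORE", "CALLDATASIZE"],
]
vectors = tf_idf(contracts)
```

An opcode that occurs in every contract, such as PUSH1 here, receives weight zero, while an opcode confined to a single contract receives the largest weight, so the resulting vectors emphasize what distinguishes one contract from the rest.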
In 2022, Andrijasa et al. [114] employed deep reinforcement learning and multi-agent fuzz testing to develop improved techniques for detecting vulnerabilities in smart contracts. In the same year, Ashizawa et al. [115] introduced a machine-learning-based static analysis tool called Eth2Vec. This tool utilized neural networks to automatically learn features of vulnerable contracts and detect vulnerabilities in smart contracts by comparing the target contract code with the learned contract code. Gupta et al. [116] trained three different deep learning models, namely LSTM, ANN, and GRU, and applied them to predict the existence of vulnerabilities in smart contracts. These models were trained on known malicious and benign smart contracts, allowing for the automatic detection of vulnerabilities in new, unknown smart contracts. Hu et al. [117] proposed SCSGuard, a framework that employed machine-learning techniques to detect fraudulent behavior in smart contracts. SCSGuard leveraged the bytecode of smart contracts as a novel feature and utilized GRU networks and attention mechanisms to capture hidden information. Hwang et al. [118] introduced a new convolutional neural network (CNN) architecture, CodeNet, for smart-contract vulnerability detection. CodeNet addressed the issue of local information loss in existing CNN models by preserving the semantic and contextual information of smart contracts, and it demonstrated higher detection performance and faster detection times across various types of vulnerabilities. Li et al. [119] presented a new smart-contract vulnerability detection model called Link-DC. This model employed deep and cross networks to construct high-order nonlinear features and passed these features to a fully connected layer to produce detection results. The model efficiently extracted features from raw data, thereby enhancing the performance and training efficiency of deep learning models. Liu et al. [120] proposed a heterogeneous graph transformation network for smart-contract anomaly detection (SHGTNs) to detect financial fraud on the Ethereum platform. They first extracted features to construct a Heterogeneous Information Network (HIN) of smart contracts, then fed the relation matrices obtained from learned meta-paths in the transformation network into a convolutional network, and finally utilized node embeddings for classification tasks. Nguyen et al. [121] proposed a novel heterogeneous graph representation approach called MANDO for learning the structure of heterogeneous contract graphs. MANDO developed a multiplex-path heterogeneous graph attention network to learn multi-layer embeddings of different types of nodes and their multiplex paths within the heterogeneous contract graph. The study extensively evaluated MANDO on a large-scale smart-contract dataset, showing that it improved vulnerability detection results at the coarse-grained contract level compared to other techniques. Shakya et al. [122] introduced a vulnerability detection model named SmartMixModel, which employs machine-learning algorithms to detect vulnerabilities in Solidity smart contracts. This model extracts features at two levels (high-level syntactic features and low-level bytecode features of the smart contracts) to achieve more precise vulnerability detection. Wang et al. [123] proposed a machine-learning model called GVD-net to detect security vulnerabilities in Ethereum smart contracts. This model takes the compiled smart-contract bytecode as input and uses a graph-embedding-based machine-learning method for classification. Wu et al. [124] presented a deep-learning-based framework for detecting vulnerabilities in Ethereum smart contracts, employing four deep learning models (CNN, LSTM, CNN-BiLSTM, and ResNets) to classify the source code of the smart contracts. Xu et al. [125] constructed a vulnerability detection model based on bidirectional long short-term memory networks (BiLSTM) and introduced a hierarchical attention mechanism. This model takes code segments and account information of smart contracts as input, divides the input samples into three levels (word level, sentence level, and document level), and applies attention mechanisms at each level. By training the model to detect reentrancy vulnerabilities in smart contracts, detection accuracy is improved and false positives are reduced. Zhang et al. [126] applied convolutional neural networks to detect vulnerabilities in smart contracts. The authors recast smart-contract vulnerability detection as an image classification problem and converted bytecode into numerical images based on predefined rules. They then trained a CNN on these numerical images and used it to classify contracts as vulnerable or not. Zheng et al. [127] built a larger dataset and extracted numerous independent features from multiple perspectives, including bytecode, semantics, and developers, that were not related to transactions. They then constructed a multi-view cascading ensemble model (MulCas) using machine-learning methods, enabling their model to identify Ponzi schemes at the time of smart-contract creation. Zhou et al. [128] first investigated the classification of security issues related to smart contracts in BIoT scenarios. To address these security issues and overcome the limitations of existing methods, they proposed a tree-based machine-learning vulnerability detection (TMLVD) approach to perform vulnerability analysis of smart contracts. TMLVD feeds an intermediate representation derived from the abstract syntax tree (AST) of smart contracts into a tree-based training network to build a prediction model. This model captures multidimensional features to identify vulnerable smart contracts. The detection phase can be run quickly with limited computational resources while preserving the accuracy of the detection results. The experimental evaluation demonstrated the effectiveness and efficiency of TMLVD on a dataset composed of Ethereum smart contracts.
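As a rough illustration of the bytecode-to-image conversion described above for Zhang et al. [126], the following Python sketch maps each byte of a contract's bytecode to one grayscale pixel. The predefined rules used by the authors are not given here, so the one-byte-per-pixel mapping, the zero padding, the function name `bytecode_to_image`, and the `width` parameter are all assumptions made for this example only.

```python
from typing import List


def bytecode_to_image(bytecode_hex: str, width: int = 8) -> List[List[int]]:
    """Map hex-encoded bytecode to a 2-D grid of grayscale values (0-255).

    Assumed rule (not taken from [126]): one byte becomes one pixel, and the
    last row is padded with zeros so that every row has `width` pixels.
    """
    # Drop an optional "0x" prefix before decoding the hex string.
    hex_str = bytecode_hex[2:] if bytecode_hex.startswith("0x") else bytecode_hex
    pixels = list(bytes.fromhex(hex_str))

    # Pad the final row with zeros so the grid is rectangular.
    remainder = len(pixels) % width
    if remainder:
        pixels.extend([0] * (width - remainder))

    # Cut the flat pixel list into rows of equal width.
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


if __name__ == "__main__":
    for row in bytecode_to_image("0x6080604052600436", width=4):
        print(row)
```

A grid of this kind could then be passed to any image classifier; in practice, images would also need to be resized or truncated to a common shape, which this sketch does not attempt.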
In 2023, Cai et al. [129] proposed a graph neural network (GNN)-based method for smart-contract vulnerability detection. First, by combining abstract syntax trees (AST), control flow graphs (CFG), and program dependency graphs (PDG), they built a graph representation for smart-contract functions that included both syntactic and semantic features. To further enhance the representational power of their method, they performed program slicing to normalize the graph and eliminate redundancy unrelated to vulnerabilities. They then employed a bidirectional gated graph neural network model with mixed attention pooling to identify potential vulnerabilities in smart-contract functions. Jiang et al. [130] introduced VDDL, which utilized a multi-layer bidirectional Transformer structure as its model framework, involving multi-head attention and masking mechanisms. Multi-head attention was applied in the encoder and decoder layers. The masking mechanism enabled deep bidirectional training representations by randomly masking input tokens and predicting the masked tokens from their context. Furthermore, VDDL incorporated CodeBERT, a large-scale bimodal pre-trained model for natural language and programming language, to enhance training results. Jie et al. [131] used a multi-modal artificial intelligence framework to detect vulnerabilities in smart contracts. The framework combined techniques such as natural language processing, image processing, and code analysis, and employed machine-learning algorithms such as support vector machines (SVM) and long short-term memory networks (LSTM) to improve vulnerability detection accuracy and efficiency. Liu et al. [132] explored the use of graph neural networks and expert knowledge for detecting smart-contract vulnerabilities. Specifically, they transformed the rich control and data flow semantics of the source code into contract graphs. To highlight key nodes in the graph, they further designed a node elimination phase to normalize the graph. They then proposed a novel temporal message propagation network to extract graph features from the normalized graph, combining these features with designed expert patterns to produce the final detection system. Extensive experiments were performed on all smart contracts with source code on the Ethereum and VNT Chain platforms. Su et al. [133] proposed a reinforcement-learning-based, vulnerability-guided fuzz testing approach called RLF, which generates vulnerable transaction sequences to detect complex vulnerabilities in smart contracts. Specifically, they first modeled the process of fuzz testing smart contracts as a Markov decision process, constructing a reinforcement learning framework. They then designed a reward that considered both vulnerabilities and code coverage to guide the fuzzer in generating specific transaction sequences that reveal vulnerabilities, especially those involving multiple functions. Sun et al. [134] introduced a novel smart-contract vulnerability detection framework called ASSBert, which combined active learning and semi-supervised learning to address the issue of insufficient labeled data. ASSBert employed bidirectional encoder representations from Transformers (BERT). Active learning was responsible for selecting highly uncertain code samples from unlabeled .sol files for manual annotation, while semi-supervised learning repeatedly selected a number of high-confidence code samples from unlabeled .sol files and added them to the training set after pseudo-labeling. Zhang et al. [135] proposed a deep-learning-based two-stage smart-contract debugger, ReVulDL, for detecting and locating reentrancy vulnerabilities. ReVulDL integrated vulnerability detection and location into a unified debugging process. The detection stage leveraged a graph-based pre-trained model to learn complex relationships in the propagation chain; the location stage applied interpretable machine learning to pinpoint suspicious statements.
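The combination of active learning and pseudo-labeling described above for ASSBert [134] can be sketched in a few lines of Python. This is only a minimal illustration of the general selection idea, not the authors' implementation: the function name `split_unlabeled`, the use of distance from 0.5 as the uncertainty measure, and the values of `n_uncertain` and `confidence` are all hypothetical choices for this example.

```python
from typing import Dict, List, Tuple


def split_unlabeled(
    probs: Dict[str, float],
    n_uncertain: int = 2,
    confidence: float = 0.9,
) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Split unlabeled samples into manual-annotation and pseudo-label sets.

    `probs` maps a (hypothetical) contract file name to the model's predicted
    probability that the contract is vulnerable.
    """
    # Active learning: predictions closest to 0.5 are the most uncertain,
    # so these samples are sent to a human annotator.
    by_uncertainty = sorted(probs, key=lambda name: abs(probs[name] - 0.5))
    to_annotate = by_uncertainty[:n_uncertain]

    # Semi-supervised learning: confident predictions receive a pseudo-label
    # (1 = vulnerable, 0 = not vulnerable) and can join the training set.
    pseudo_labeled = []
    for name, p in probs.items():
        if name in to_annotate:
            continue
        if p >= confidence:
            pseudo_labeled.append((name, 1))
        elif p <= 1.0 - confidence:
            pseudo_labeled.append((name, 0))
    return to_annotate, pseudo_labeled


if __name__ == "__main__":
    predictions = {"a.sol": 0.97, "b.sol": 0.52, "c.sol": 0.45,
                   "d.sol": 0.03, "e.sol": 0.70}
    annotate, pseudo = split_unlabeled(predictions)
    print("annotate manually:", annotate)
    print("pseudo-labeled:", pseudo)
```

Samples that are neither highly uncertain nor highly confident (such as `e.sol` here) simply remain in the unlabeled pool for a later round.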

Comparative Analysis of Existing Machine-Learning-Based Smart-Contract Vulnerability Detection Tools
Table 2 presents a comparative analysis of machine-learning-based smart-contract vulnerability detection tools proposed by different authors and teams. The table covers four attributes: the reference, so that readers can locate the source material; the machine-learning method employed; the size of the dataset, which helps in understanding the impact of data scale on model performance and reliability; and the learning paradigm used by the tool (supervised, semi-supervised, or reinforcement learning), which helps in comparing the strengths and weaknesses of different methods for smart-contract vulnerability detection. Together, these attributes give an overview of existing machine-learning-based smart-contract vulnerability detection tools and serve as a reference for further research and practice.

Discussion
In this section, we will comprehensively explore the three key research questions proposed in this paper, delving into existing methods and techniques for each question to provide valuable insights for researchers and practitioners.
RQ1: Numerous methods can be applied to smart-contract security detection, including supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Most currently implemented machine-learning-based smart-contract security detection methods rely on supervised learning, which is the most common machine-learning technique in smart-contract vulnerability detection. The main advantage of supervised learning methods is that they are trained on a large amount of labeled data, allowing the patterns and features learned from the training data to be effectively applied to new data. However, the downside of these methods is that they require a substantial amount of labeled data for training. Semi-supervised learning approaches are less commonly used in smart-contract vulnerability detection but have great potential. The advantage of these methods is that the pre-training process does not require labeled training data, enabling the development of such methods based on pre-trained large models. The limitation of semi-supervised learning methods is that they may struggle to capture specific vulnerability features. Unsupervised learning methods are rarely applied in smart-contract vulnerability detection. The advantage of these methods is that they do not require labeled training data, while their limitation lies in their potential difficulty in capturing specific vulnerability features. Reinforcement learning is a machine-learning technique based on the interaction between an agent and its environment.
RQ2: Supervised learning, semi-supervised learning, and reinforcement learning methods have all been applied in the field of smart-contract security detection. However, the vast majority of applications are based on supervised learning methods. This paper posits that this is likely due to the current maturity of supervised learning methods and because labeled datasets make it possible to achieve better results with smaller amounts of data. However, as the amount of data in smart-contract datasets continues to grow, semi-supervised learning is also becoming a promising research direction.
RQ3: In the field of smart-contract security detection, machine learning can be combined with static analysis, dynamic analysis, and fuzz testing methods. To integrate with static analysis, features (such as code patterns and function calls) can be extracted from the smart contract's source code or bytecode and used to train machine-learning models to identify potential vulnerabilities. To combine with dynamic analysis, runtime data (such as state changes and transaction flows) can be collected during contract execution and used in conjunction with machine-learning models to detect anomalous behavior and potential vulnerabilities. To integrate with fuzz testing methods, random or semi-random input data can be generated, and the contract execution results can be observed. Machine learning can then be employed to analyze the execution process and outcomes, automatically discovering new vulnerabilities or abnormal behaviors.
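To make the static-analysis integration described under RQ3 more concrete, the sketch below extracts simple opcode-count features from EVM bytecode; such vectors could serve as input to a classifier. It is a minimal sketch under stated assumptions: the opcode table is a small hand-picked subset of EVM opcodes, the names `opcode_counts` and `feature_vector` are hypothetical, and grouping all PUSH instructions under one feature is a simplification chosen for this example.

```python
from collections import Counter
from typing import List

# Small illustrative subset of EVM opcodes; a real tool would use the full table.
OPCODES = {
    0x00: "STOP", 0x32: "ORIGIN", 0x42: "TIMESTAMP", 0x54: "SLOAD",
    0x55: "SSTORE", 0xF1: "CALL", 0xF4: "DELEGATECALL", 0xFF: "SELFDESTRUCT",
}


def opcode_counts(bytecode_hex: str) -> Counter:
    """Count opcodes in hex-encoded bytecode, skipping PUSH immediate data."""
    hex_str = bytecode_hex[2:] if bytecode_hex.startswith("0x") else bytecode_hex
    code = bytes.fromhex(hex_str)
    counts: Counter = Counter()
    i = 0
    while i < len(code):
        op = code[i]
        if 0x60 <= op <= 0x7F:
            # PUSH1..PUSH32 carry 1..32 bytes of data that are not opcodes.
            counts["PUSH"] += 1
            i += 1 + (op - 0x5F)
        else:
            counts[OPCODES.get(op, "OTHER")] += 1
            i += 1
    return counts


def feature_vector(bytecode_hex: str, vocabulary: List[str]) -> List[int]:
    """Turn opcode counts into a fixed-length vector ordered by `vocabulary`."""
    counts = opcode_counts(bytecode_hex)
    return [counts[name] for name in vocabulary]


if __name__ == "__main__":
    vocab = ["PUSH", "SLOAD", "SSTORE", "CALL", "SELFDESTRUCT"]
    # PUSH1 0x01, SLOAD, PUSH1 0xff, SSTORE, CALL
    print(feature_vector("0x60015460ff55f1", vocab))
```

Skipping the PUSH data matters: in the example, the byte `0xff` is an operand rather than a SELFDESTRUCT instruction, and a naive byte-by-byte count would report it incorrectly.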

Conclusions and Future Work
Although machine learning has made progress in detecting defects in smart contracts, the literature still lacks a comprehensive review of machine-learning-based smart-contract defect detection. To address this gap, this paper examines the application of machine learning to smart-contract security detection, aiming to provide a useful reference for researchers. The paper analyzes and classifies machine-learning techniques and explores the effectiveness of different techniques in smart-contract defect detection. Furthermore, it investigates and compares existing machine-learning-based smart-contract defect detection models. In future research, we will explore the possibility of combining machine-learning techniques with formal methods.

Figure 7. A timeline of the evolution of machine learning. The first International Conference on Machine Learning in 1980 signified the global rise of the field. In 1986, Rumelhart et al. [47] introduced a method to train multilayer neural networks using the backpropagation algorithm, resulting in a major breakthrough

Figure 8. Classification overview of machine-learning methods.

Table 1. Advantages and disadvantages of four machine-learning methods.

Table 2. Comparison of machine-learning-based smart-contract vulnerability detection tools.