Abstract
Smart contracts are deployed and stored on blockchain networks as machine-readable bytecode. Only a small number of deployed smart contracts have verified, human-readable source code that is publicly accessible to blockchain users. To improve the understandability of deployed smart contracts, we explored rule-based classification of smart contracts through the iterative integration of fingerprints of relevant function interfaces and keywords. Our classification system included categories for standard contracts (ERC20, ERC721, and ERC1155) and for non-standard contracts (FinDApps, cross-chain, governance, and proxy). For the standard contracts, we first identified the core function fingerprints of each ERC token standard. We then used an adapted header extractor tool to verify that all of these fingerprints occurred among the functions implemented in the bytecode. For the non-standard contracts, we took an iterative approach, identifying contract interfaces and relevant fingerprints for each specific category. The classification rule was stricter for standard contracts: requiring 100% of the core fingerprints to occur ensures that we only identify compliant token contracts. For non-standard contracts, we required at least two occurrences of a relevant fingerprint keyword or interface, to guard against hash collisions and the unintentional use of keywords. After developing the classifier, we evaluated its performance on sample datasets. The classifier achieved an F1 score of over 99% for standard contracts and 93% for non-standard contracts. We also conducted a risk analysis to identify potential vulnerabilities that could reduce the classifier’s performance, including hash collisions, an incomplete rule set, manual verification bottlenecks, outdated data, and semantic misdirection or obfuscation of smart contract functions.
To address these risks, we proposed several solutions: continuous monitoring, continuous data crawling, and extended rule refinement. The classifier’s modular design allows these manual updates to be integrated easily. While semantic-based risks cannot be completely eliminated, symbolic execution can be used to verify the expected behavior of ERC token contract functions for a given set of inputs and thereby identify malicious contracts. Lastly, we applied the classifier to contracts deployed on the Ethereum main network.
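
The two-tier rule summarized in this abstract can be illustrated with a minimal Python sketch. It is not the classifier itself: it assumes the function selectors and function names have already been extracted from the bytecode (for example, by a header extractor), and the keyword lists for the non-standard categories, along with the names `classify` and `NON_STANDARD_KEYWORDS`, are hypothetical placeholders rather than the rule set used in this work. Only the six ERC20 selectors are the publicly known 4-byte values.

```python
# Well-known 4-byte selectors of the six core ERC20 functions (hex, no 0x prefix).
ERC20_CORE = {
    "18160ddd",  # totalSupply()
    "70a08231",  # balanceOf(address)
    "a9059cbb",  # transfer(address,uint256)
    "23b872dd",  # transferFrom(address,address,uint256)
    "095ea7b3",  # approve(address,uint256)
    "dd62ed3e",  # allowance(address,address)
}

# Hypothetical keyword lists, for illustration only; the real rule set
# was built iteratively and is not reproduced here.
NON_STANDARD_KEYWORDS = {
    "governance": {"propose", "castVote", "quorum"},
    "proxy": {"upgradeTo", "implementation", "changeAdmin"},
}


def classify(selectors, names):
    """Return category labels for one contract.

    selectors: iterable of 4-byte selectors found in the bytecode (hex strings).
    names: iterable of function or interface names recovered for the contract.
    """
    labels = []

    # Standard contracts: every core fingerprint must be present (100% rule).
    if ERC20_CORE <= set(selectors):
        labels.append("ERC20")

    # Non-standard contracts: at least two names must match a category keyword.
    for category, keywords in NON_STANDARD_KEYWORDS.items():
        hits = sum(
            1
            for name in names
            if any(keyword.lower() in name.lower() for keyword in keywords)
        )
        if hits >= 2:
            labels.append(category)

    return labels
```

Under these assumptions, a contract exposing only some of the ERC20 selectors is not labeled as ERC20, and a single keyword match is not enough to assign a non-standard category.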