Yul2Vec: Yul Code Embeddings
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The contribution presented in this manuscript is a technique for vectorizing (embedding) Yul programs so that these can later be used in ML or DL pipelines. The proposed technique is inspired by knowledge graph embeddings and converts Yul programs (Yul is an intermediate bytecode representation for Solidity programs, i.e., smart contracts on Ethereum) into a high-dimensional vector.
1) The conceptual explanation and working of the encoding technique is well described in the paper, but it does not suffice to demonstrate that the encodings are able to separate the different syntax constructs of Yul into disparate points in the embedding space. Additionally, the dimension of the vectors and the encoding length of a typical Yul program must be presented, for several test cases.
2) Similarly, the running time or the computational complexity of the algorithm should be included in the paper.
3) What are the advantages of the proposed encoding techniques over other popular encodings for arbitrary texts, e.g., Word2vec?
4) The motivation of this work is to facilitate the use of Yul programs as input to subsequent ML or AI processing tasks. Can the author suggest some possible ML/AI applications that could benefit from Yul encoding as proposed here?
5) The relationship of this encoding with compiler optimization techniques is unclear to me. Maybe the author could elaborate a little more in depth this point in the manuscript.
Author Response
Comment 1:
The conceptual explanation and working of the encoding technique is well described in the paper, but it does not suffice to demonstrate that the encodings are able to separate the different syntax constructs of Yul into disparate points in the embedding space. Additionally, the dimension of the vectors and the encoding length of a typical Yul program must be presented, for several test cases.
Response 1:
Thank you for this valuable feedback. Demonstrating the separation of distinct Yul syntax constructs in the embedding space is challenging, and rigorously evaluating it is even more so. To address this, I've added a new Test Case 1 in the Experiments section. This test utilizes a set of small Yul scripts sourced from the Solidity Compiler repository, specifically those used for Yul Optimizer unit tests. These scripts are intentionally small and designed to highlight specific syntactic elements. I've also examined and presented the results for various parameters within this new test case.
Comment 2:
Similarly, the running time or the computational complexity of the algorithm should be included in the paper.
Response 2:
Thank you for highlighting this important point. The computational complexity of the algorithm is primarily determined by the Depth-First Search (DFS) traversal used to generate the final vector from the Abstract Syntax Tree (AST); since each node is visited exactly once, the cost grows linearly with the size of the AST. However, as you rightly point out, providing empirical time measurements is crucial to demonstrate the algorithm's applicability in real-world scenarios. I have re-executed the largest test case (now Test Case 2) and measured both the execution times and the corresponding lines of code. These results are now presented in a table within the Experiments section.
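To make the traversal concrete, the following is a minimal Python sketch of the DFS aggregation step. The node structure, vocabulary lookup, and normalization are hypothetical simplifications for illustration, not the exact implementation from the paper:

```python
import numpy as np
from collections import namedtuple

# Hypothetical AST node; the real parser produces richer nodes.
Node = namedtuple("Node", ["kind", "children"])

def embed_ast(node, vocab, w_child=1.0):
    """Combine pretrained KGE token embeddings over the AST via DFS.

    `vocab` maps a node kind to its embedding vector; each node is
    visited exactly once, so the cost is linear in the AST size.
    """
    vec = vocab[node.kind].copy()              # embedding of the node itself
    for child in node.children:                # depth-first descent
        vec = vec + w_child * embed_ast(child, vocab, w_child)
    return vec / (1 + len(node.children))      # simple averaging normalization
```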
Comment 3:
What are the advantages of the proposed encoding techniques over other popular encodings for arbitrary texts, e.g., Word2vec?
Response 3:
It's important to clarify that comparing the proposed encoding technique directly with methods like Word2Vec in terms of "advantages" isn't appropriate, as they operate in entirely different contexts. My approach (like similar ones in the context of programming language embeddings), in fact, draws inspiration from techniques developed for natural language processing, particularly in how it constructs a vocabulary for Yul tokens, conceptually similar to the principles behind Word2Vec.
As we know, Word2Vec is designed to embed human language vocabulary from large text corpora. In the context of programming languages, analogous approaches from natural language processing are adapted to create a programming language vocabulary. The primary distinction often lies in the preprocessing steps. For natural language, the method I've applied (knowledge graph embeddings, KGE) is based on triplets, where a typical triplet might be object:word -> relation:next -> subject:word. However, for programming languages, defining such triplets requires more specific design. For stack-based languages like LLVM-IR, this can be relatively straightforward. For Yul, which is represented as an Abstract Syntax Tree (AST), it required a more creative approach to define these object-relation-subject relationships.
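To illustrate the difference in preprocessing, here is a small Python sketch of both triplet schemes. The relation names and the nested-tuple AST encoding are hypothetical stand-ins; the paper defines the actual object-relation-subject scheme for Yul:

```python
# Natural language: consecutive words linked by a "next" relation.
def nl_triplets(words):
    return [(words[i], "next", words[i + 1]) for i in range(len(words) - 1)]

# Yul AST (encoded here as nested (kind, children) tuples): structural
# relations replace linear word order. Relation names are illustrative.
def ast_triplets(kind, children):
    triplets = []
    for i, (child_kind, grandchildren) in enumerate(children):
        triplets.append((kind, f"arg_{i}", child_kind))
        triplets.extend(ast_triplets(child_kind, grandchildren))
    return triplets

# e.g. add(x, mload(0x40)) yields triplets such as ("add", "arg_1", "mload").
print(ast_triplets("add", [("x", []), ("mload", [("0x40", [])])]))
```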
Comment 4:
The motivation of this work is to facilitate the use of Yul programs as input to subsequent ML or AI processing tasks. Can the author suggest some possible ML/AI applications that could benefit from Yul encoding as proposed here?
Comment 5:
The relationship of this encoding with compiler optimization techniques is unclear to me. Maybe the author could elaborate a little more in depth this point in the manuscript.
Response 4 and 5:
Thank you for raising these points; they are indeed central to the motivation behind this research. I'll address both comments together as they are closely related.
The primary ML/AI application I envision for Yul encoding is in compiler optimization techniques. Currently, these optimizations, including those in the Solidity compiler, largely rely on manually defined, hardcoded sequences of steps. This "phase-ordering problem" means that a globally fixed sequence of optimizations might improve some programs but worsen others.
My initial research began with exploring how Reinforcement Learning, specifically Deep Q-Networks (DQN), could potentially discover optimal, program-specific optimization sequences. However, a fundamental prerequisite for applying neural networks in this context is having program representations as fixed-size vectors. I quickly realized that a vectorized Yul code technique was missing, as was a suitable Yul dataset.
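As a purely hypothetical sketch of why a fixed-size vector is the prerequisite, consider how pass selection would look once such a vector exists; every name and shape below is an illustrative stand-in, and no such RL loop is implemented in the paper:

```python
import numpy as np

PASSES = ["pass_a", "pass_b", "pass_c"]   # placeholder optimization steps
DIM = 8                                   # placeholder embedding size

def select_pass(program_vec, q_weights):
    """Greedy DQN-style action choice: one Q-value per candidate pass."""
    return int(np.argmax(q_weights @ program_vec))

q_weights = np.zeros((len(PASSES), DIM))  # untrained linear Q-function stub
next_pass = PASSES[select_pass(np.ones(DIM), q_weights)]
```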
This led me to take a step back: first, to create a comprehensive Yul dataset (which I've published as a preprint and referenced in this paper). Concurrently, observing existing research on embeddings for various other languages (like Java and LLVM IR), I decided to dedicate this research to developing a similar embedding approach specifically for the Yul language.
While I touched upon this in the Introduction, it seems I didn't elaborate enough. I have now expanded on this connection and the potential applications more thoroughly in the Summary section of the manuscript.
-----
I believe the revisions made in response to your feedback have significantly strengthened the manuscript and hope they address all your concerns.
Reviewer 2 Report
Comments and Suggestions for Authors
This paper introduces Yul2Vec, a novel approach for representing Yul programs as distributed embeddings in a continuous space, effectively addressing the research gap in the vectorization of the Yul intermediate language. The research follows a clear and logical framework, focusing on Yul code vectorization. It first generates embeddings for atomic components of Yul using knowledge graph embedding techniques, then aggregates these embeddings from elements within the Abstract Syntax Tree to derive vector representations for entire programs, thereby establishing a comprehensive technical architecture. The experimental section, conducted on a dataset comprising over 340,000 Yul files, employs methods such as PCA dimensionality reduction and visualization to analyze entity embeddings and program vector distributions, validating the method's effectiveness with reasonably persuasive results.
Nevertheless, the paper could be enhanced in several aspects. Firstly, a comparative analysis with existing similar vectorization methods (e.g., those applied to LLVM IR) in terms of performance and efficacy is lacking, which hinders a straightforward demonstration of Yul2Vec's advantages. Secondly, the evaluation of vector representation quality is relatively basic; validating the method in the context of specific downstream tasks (such as the phase-ordering problem in compiler optimization) would significantly strengthen its persuasiveness. Thirdly, additional experimental details are needed, including the specific compositional features of the dataset and the influence of parameter settings on outcomes.
In summary, this research offers a new perspective on Yul code vectorization, contributing positively to advancing Solidity compiler optimization and the efficient execution of Ethereum smart contracts.
Author Response
Comment 1:
Firstly, a comparative analysis with existing similar vectorization methods (e.g., those applied to LLVM IR) in terms of performance and efficacy is lacking, which hinders a straightforward demonstration of Yul2Vec's advantages.
Response 1:
Thank you for your valuable time and detailed review of my manuscript. Regarding your first comment, I appreciate the suggestion for a comparative analysis with existing vectorization methods, such as those for LLVM IR. However, I believe a direct comparison might not be entirely appropriate or even feasible given the unique context of Yul.
The core motivation of this paper is to address a significant gap in the research landscape. While well-established vectorization methods, benchmarks, and applications (especially for compiler optimizations) exist for languages like LLVM IR, nothing comparable exists for Yul. Yul is a custom EVM intermediate language used by the Solidity compiler to generate EVM bytecode. My primary objective with this work is to pioneer research in Yul program vectorization, thereby opening doors for future advancements in Solidity compiler optimization through ML/AI techniques.
Given this foundational objective, directly comparing Yul embeddings with those designed for LLVM IR or Java, across their respective benchmarks or performance metrics, would not yield meaningful insights. For researchers focusing on Solidity development, the preprocessing of Yul (or Solidity directly) is a distinct requirement, unrelated to LLVM IR. Therefore, I contend that these represent different research branches, and my aim is not to find "advantages" over LLVM IR but rather to establish a much-needed foundation for Yul. I hope this clarification, also highlighted in the Introduction and Experiments sections of the paper, addresses your concern.
---
Comment 2:
Secondly, the evaluation of vector representation quality is relatively basic; validating the method in the context of specific downstream tasks (such as the phase-ordering problem in compiler optimization) would significantly strengthen its persuasiveness.
Response 2:
I completely agree with your insightful comment regarding the need for downstream task validation, especially in the context of the phase-ordering problem in Solidity compiler optimization. Indeed, the challenges posed by the current hardcoded optimization sequences, which can sometimes negatively impact bytecode efficiency, were the initial driving force behind my research. I observed that Deep Reinforcement Learning shows great promise for tackling the phase-ordering issue in LLVM IR.
However, I quickly identified a fundamental hurdle within the Solidity (and Yul) ecosystem: the complete absence of a fixed-size vector representation for Yul programs. Furthermore, there was no publicly available dataset of Yul scripts (though I was able to leverage existing Solidity datasets to construct one for Yul). This necessitated a crucial "step back" in my research pipeline: first, to develop a robust Yul dataset, and then to establish these foundational embeddings. Recognizing the significant scope of both topics, and observing other papers solely focused on language embeddings, I made the strategic decision to present this as two distinct research efforts. This paper, therefore, lays the groundwork by providing the necessary Yul vectorization, paving the way for future work on downstream tasks like the phase-ordering problem.
---
Comment 3:
Thirdly, additional experimental details are needed, including the specific compositional features of the dataset and the influence of parameter settings on outcomes.
Response 3:
Thank you for this valuable suggestion. To provide the additional experimental detail you requested, I have introduced a new test set, now designated as Test Case 1 within the Experiments section. This test leverages a collection of small Yul scripts directly from the Solidity Compiler repository, specifically those used for Yul Optimizer unit tests. These scripts are intentionally concise and designed to highlight particular syntactic elements, which I believe will facilitate a clearer human interpretation and assessment of the captured characteristics.
Furthermore, I have thoroughly examined and presented the results of this new test case across various parameter settings. This includes how different weights are applied to operations, types, and arguments, thereby influencing their respective importance in the final vector representation. Additionally, to demonstrate the practical viability of my approach, I have included time measurements for the largest test set, ensuring that computation time will not pose a barrier to real-world applications.
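To give a concrete sense of what these weights control, here is a minimal Python sketch of the weighted combination; the weight names and default values are illustrative, not the parameter settings studied in the paper:

```python
import numpy as np

def combine(op_vec, type_vecs, arg_vecs, w_op=1.0, w_type=0.5, w_arg=0.5):
    """Weight the contributions of the operation, type, and argument
    embeddings before they are merged into the node's vector."""
    vec = w_op * op_vec
    for t in type_vecs:
        vec = vec + w_type * t
    for a in arg_vecs:
        vec = vec + w_arg * a
    return vec

# e.g. de-emphasize arguments relative to the operation itself:
v = combine(np.ones(4), [np.zeros(4)], [np.ones(4), np.ones(4)], w_arg=0.25)
```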
-----
I've also improved the English in a couple of places.
I believe the revisions made in response to your feedback have significantly strengthened the manuscript and hope they address all your concerns.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The revised version of the manuscript has adequately addressed all my previous concerns related to better explaining the vectorization procedure and adding a better experimental characterization.
Reviewer 2 Report
Comments and Suggestions for Authors
I am satisfied with the revisions made by the authors in response to the requested changes. Based on the revised content, I agree to the acceptance of this paper.