Article
Peer-Review Record

EVuLLM: Ethereum Smart Contract Vulnerability Detection Using Large Language Models

Electronics 2025, 14(16), 3226; https://doi.org/10.3390/electronics14163226
by Eleni Mandana, George Vlahavas * and Athena Vakali
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 27 June 2025 / Revised: 29 July 2025 / Accepted: 12 August 2025 / Published: 14 August 2025
(This article belongs to the Special Issue Network Security and Cryptography Applications)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper systematically summarizes three major categories of smart contract vulnerability detection methods (static analysis, machine learning, and LLM-based approaches) and clearly presents the technical paths, performance indicators, and limitations of each method through a table comparison. It conducts a detailed comparison of the principles and accuracy rates of static analysis tools and introduces the EVuLLM dataset to address the issue of insufficient diversity in evaluation data. To enhance the rigor and readability of the paper, the following modifications are suggested:

 

1. The description of the innovative value of "lightweight model local deployment" in the abstract is relatively general. Specific advantages need to be supplemented to avoid vague statements, enabling readers to more clearly understand its core value.

 

2. The paper mentions that parameter-efficient fine-tuning technology has high research value in the field of vulnerability detection, but this point requires further substantiation. It is necessary to clarify whether this technology is applied in this field for the first time; if not, a systematic comparison should be made between this study and other related works to clarify the unique contributions of this research.

 

3. There are performance differences of the same model in Table 6 and Table 7, and this phenomenon needs appropriate explanation. Adding a reasonable explanation will help readers understand the consistency of the experimental results and enhance the credibility of the conclusions.

 

4. The current description of the relationship between the RAG framework and large language model fine-tuning is not clear enough. If RAG is used as a comparative scheme (for example, to compare with the effect of fine-tuning), it is necessary to clarify the performance differences between the two on the same dataset to make the comparative logic more complete.

 

5. The hyperparameters in the fine-tuning process (such as learning rate, batch size, number of training epochs, etc.) have not been specified, and there is a lack of explanation for the selection basis. It is recommended to supplement the specific values of these parameters and explain the reasons for their selection and their impact on model performance, so as to make the experimental process more reproducible.

 

6. The sentence structure of some technical descriptions can be further organized to improve clarity. For example, "parameter-efficient fine-tuning" and "PEFT" should be expressed consistently. At the same time, attention should be paid to the consistency of tenses and subject-verb agreement throughout the text to ensure the standardization of language expression.

 

Author Response

We would like to sincerely thank the Reviewer for the thoughtful and constructive feedback provided. Your detailed comments and insightful suggestions were very helpful in identifying areas of the manuscript that needed clarification and improvement. We have carefully addressed each of your points in the revised manuscript and believe that your input has substantially improved the overall quality and clarity of our work. We greatly appreciate the time and effort you dedicated to reviewing our submission.

 

General Comment: The paper systematically summarizes three major categories of smart contract vulnerability detection methods (static analysis, machine learning, and LLM-based approaches) and clearly presents the technical paths, performance indicators, and limitations of each method through a table comparison. It conducts a detailed comparison of the principles and accuracy rates of static analysis tools and introduces the EVuLLM dataset to address the issue of insufficient diversity in evaluation data. To enhance the rigor and readability of the paper, the following modifications are suggested:

 

Comment 1: The description of the innovative value of "lightweight model local deployment" in the abstract is relatively general. Specific advantages need to be supplemented to avoid vague statements, enabling readers to more clearly understand its core value.

 

Response 1: We would like to thank the reviewer for their comment. We agree that the relevant discussion was weak in our original manuscript. Accordingly, we have added the following text to our abstract to emphasize the value of using local models:

“Moreover, we emphasize the advantages of lightweight models deployable on local hardware, such as enhanced data privacy, reduced reliance on internet connectivity, lower infrastructure costs, and improved control over model behavior, factors that are especially critical in security-sensitive blockchain applications. We also explore Retrieval-Augmented Generation (RAG) as a complementary strategy, achieving competitive results with minimal training. Our findings highlight the practicality of using locally hosted LLMs for secure, efficient, and reproducible smart contract analysis, paving the way for broader adoption of AI-driven security in blockchain ecosystems.”

 

Additionally, we have expanded the relevant text in the Introduction section. It now reads:

 

“Locally deployable models offer distinct and practical benefits that are especially relevant in security-sensitive domains. By keeping smart contract data on-premises, they ensure a higher degree of privacy and reduce the risk of sensitive information exposure. Unlike proprietary APIs, which may impose usage restrictions, suffer from unpredictable latency, or conceal internal mechanisms, open local models grant full transparency and control over the inference process. This not only enhances trust but also facilitates precise fine-tuning and reproducibility of results. Additionally, the reduced computational footprint of lightweight models allows them to operate effectively on consumer-grade hardware, making them accessible in resource-constrained environments and eliminating the dependency on external cloud infrastructure. This self-sufficiency also supports use cases where offline capability, regulatory compliance, or fine-grained customization is essential.”

 

Finally, we believe we have strengthened the relevant discussion in the Conclusions section with the following text:

 

“Our findings reinforce the practical value of running lightweight LLMs locally. By using open-source, resource-efficient models, our approach supports scenarios where privacy, transparency, and infrastructure independence are critical, such as in security-sensitive smart contract analysis. Unlike proprietary services, lightweight models can be fully controlled and customized, making them a strong fit for constrained or regulated environments.”

 

Comment 2: The paper mentions that parameter-efficient fine-tuning technology has high research value in the field of vulnerability detection, but this point requires further substantiation. It is necessary to clarify whether this technology is applied in this field for the first time; if not, a systematic comparison should be made between this study and other related works to clarify the unique contributions of this research.

 

Response 2: We thank the reviewer for the insightful comment. We agree that the research value of PEFT in this domain should be clearly substantiated in the context of related work. To address this, we have clarified in the revised manuscript that our study is not the first to apply PEFT to smart contract vulnerability detection. In particular, we already cite two relevant works, Boi et al. (2024) and Yang et al. (2024), in Table 2, both of which utilize LoRA or QLoRA techniques. We have expanded our Related Work section with the following paragraph to more explicitly compare our approach with existing PEFT-based methods:

 

“As summarized in Table 2, prior work has already explored the use of PEFT in the context of smart contract vulnerability detection. Specifically, Boi et al. (2024) applied QLoRA to fine-tune a Llama-2 7B model, achieving an accuracy of 59.9% in identifying smart contract vulnerabilities and showing that LLMs can be used for this purpose with results comparable to more traditional vulnerability detection tooling. Additionally, Yang et al. (2024) fine-tuned Llama-2 13B and CodeLlama 13B models on a dataset of labeled functions, achieving an accuracy of 34% in detecting vulnerabilities. These studies demonstrate that PEFT is a growing area of interest within this field. However, our work differs in scope and methodology: we introduce a new evaluation dataset (EVuLLM), systematically benchmark multiple PEFT-compatible models, and combine fine-tuning with ensemble prompt engineering and Retrieval-Augmented Generation (RAG) strategies. This comprehensive evaluation highlights performance gains not just from fine-tuning, but also from robust prompt and inference techniques, underscoring our contribution beyond the use of PEFT alone.”

 

We believe that these additions clarify the unique contributions of our study, particularly in terms of dataset diversity (EVuLLM), combined evaluation strategies, and emphasis on high-performing open-source models suitable for local deployment.

 

Comment 3: There are performance differences of the same model in Table 6 and Table 7, and this phenomenon needs appropriate explanation. Adding a reasonable explanation will help readers understand the consistency of the experimental results and enhance the credibility of the conclusions.

 

Response 3:

 

We thank the reviewer for pointing out the performance differences observed in Tables 6 and 7 for the CodeGemma model. We appreciate the opportunity to clarify this point. Although both experiments used models referred to as "CodeGemma", they were in fact sourced from different providers, as already cited in Tables 4 and 5, and have different properties.

 

Specifically, the CodeGemma model used for our fine-tuning experiments is the 7B-parameter CodeGemma variant, sourced from the Unsloth library on Hugging Face. This model was quantized using bitsandbytes 4-bit dynamic quantization and fine-tuned using QLoRA.

 

In contrast, the CodeGemma model used in our Retrieval-Augmented Generation (RAG) experiments was sourced from Ollama, which provides an 8B-parameter variant with Q4 static quantization. This model was used in its off-the-shelf form, without fine-tuning, and integrated into the RAG pipeline.
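For concreteness, a 4-bit bitsandbytes model load of the kind used in our fine-tuning experiments can be sketched as follows; the checkpoint name and configuration values here are illustrative rather than our exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization settings in the style used for QLoRA fine-tuning
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # double quantization, as in QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision compute dtype
)

model = AutoModelForCausalLM.from_pretrained(
    "unsloth/codegemma-7b-bnb-4bit",  # illustrative checkpoint name
    quantization_config=bnb_config,
    device_map="auto",
)
```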

 

We have updated the manuscript to make these differences clearer in the corresponding sections and in Tables 4 through 10. We also added a clarification note in the Methodology section that reads:

 

"It is important to note that the CodeGemma model used in our RAG experiments differs from the one used in our fine-tuning experiments. Specifically, the RAG experiments employ the CodeGemma-8B model sourced from the Ollama library, while the fine-tuning experiments use the smaller CodeGemma-7B variant provided by the Unsloth library via the Hugging Face repository. These models differ not only in parameter size but also in quantization method and intended usage context."

 

We hope this clarifies the issue.

 

Comment 4: The current description of the relationship between the RAG framework and large language model fine-tuning is not clear enough. If RAG is used as a comparative scheme (for example, to compare with the effect of fine-tuning), it is necessary to clarify the performance differences between the two on the same dataset to make the comparative logic more complete.

 

Response 4:

 

We thank the reviewer for highlighting this important point. We agree that the relationship between fine-tuning and RAG methods needed clearer explanation and have revised the Background section (Sections 2.2 and 2.3) accordingly to better distinguish the methodological differences between the two approaches.

 

To clarify, in our study, RAG is not presented as a direct competitor to fine-tuning, but rather as a practical alternative for scenarios where fine-tuning is not feasible, such as resource-constrained environments or settings requiring runtime adaptability. Unlike fine-tuning, which internalizes domain knowledge by updating model weights, RAG augments pretrained models at inference time by retrieving relevant external context, without any additional training.

 

Both approaches are evaluated on the same datasets (e.g., EVuLLM) to ensure comparability. While the fine-tuned models outperform RAG in absolute terms, our RAG-based pipeline achieves higher accuracy and F1-scores than previously reported non-fine-tuned baselines in the literature, highlighting its effectiveness even without model updates. We have added clarification of this comparative framing in the revised manuscript text and improved the discussion in the Experimental Setup and Results sections to emphasize that RAG is explored as an alternative strategy rather than a substitute for fine-tuning.

 

Therefore, Sections 2.2 and 2.3 now read:

 

“2.2. Fine-Tuning Large Language Models

 

Fine-tuning adapts pre-trained LLMs to specific domains or tasks by updating model parameters. Traditional full-parameter fine-tuning retrains the entire model but is computationally expensive. More efficient approaches, such as PEFT, modify only a subset of parameters, reducing costs while maintaining performance. Techniques like reinforcement learning with human feedback (RLHF) further refine models to align with user preferences and ethical guidelines.

 

PEFT optimizes fine-tuning by adjusting only select model parameters, significantly reducing memory and compute requirements. One such approach, Low-Rank Adaptation (LoRA), introduces small trainable matrices to existing model weights, achieving efficient adaptation with fewer parameters. QLoRA extends this by incorporating 4-bit quantization and double quantization techniques, further minimizing memory use while preserving model performance.

 

Fine-tuning allows the model to internalize domain-specific knowledge by adapting its weights during training. As such, it is a training-time strategy that results in a modified model capable of improved inference in that domain.

 

2.3. Retrieval-Augmented Generation

 

Retrieval-Augmented Generation (RAG) enhances LLM performance by dynamically incorporating external information into the inference process. Unlike fine-tuning, which permanently adjusts a model's internal parameters, RAG leaves the base model unchanged. Instead, it retrieves relevant context documents or examples at inference time, appending them to the model’s input prompt. This approach enables the model to produce more accurate and context-aware outputs, especially in domains that require up-to-date, task-specific, or factual information.

 

RAG is particularly valuable in scenarios where training resources are limited, or frequent model updates are impractical. By offloading part of the reasoning to an external retrieval system, RAG can achieve competitive performance without additional training, making it an attractive alternative to fine-tuning in constrained environments.

 

In this work, we evaluate RAG as an alternative strategy to fine-tuning for smart contract vulnerability detection. While RAG does not modify model weights, it benefits from integrating relevant vulnerability descriptions, code examples, or documentation at runtime.”
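As a concrete illustration of the LoRA/QLoRA setup described in Section 2.2 above, the following minimal sketch uses the Hugging Face peft library; the rank, target modules, and checkpoint name are illustrative placeholders, not our exact configuration:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load a 4-bit quantized base model (checkpoint name illustrative)
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/codegemma-7b-bnb-4bit", device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adds small trainable matrices alongside the frozen base weights
lora_config = LoraConfig(
    r=16,                 # rank of the low-rank update
    lora_alpha=32,        # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter parameters are trainable
```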

 

In addition to that, we have added the following paragraph in the methodology section:

 

“Fine-tuning and RAG represent two fundamentally different approaches: the former embeds task knowledge into the model parameters through training, while the latter injects task-relevant information at inference without retraining. We include RAG in our experiments not as a direct competitor to fine-tuning, but as a practical alternative for environments where training is infeasible or where flexibility and external knowledge integration are prioritized. Both approaches are evaluated on the same dataset to ensure a fair and transparent comparison of their effectiveness.”

 

And the respective paragraph in the Conclusions section has been revised to the following:

 

“In addition, we explored Retrieval-Augmented Generation (RAG) as a complementary technique. Although retrieval quality remains variable, the best-performing RAG model, Gemma-2, reached 79.1% accuracy and a 78.8% F1-score on EVuLLM, approaching the effectiveness of fine-tuned models with minimal training requirements. Our results show that, although RAG underperforms compared to fully fine-tuned models, it outperforms previous non-fine-tuned baselines and demonstrates the potential of retrieval-based augmentation in security-related tasks.”

 

We hope this resolves the ambiguity and strengthens the interpretability of our experimental comparisons.
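To make the retrieval step concrete, the following minimal sketch retrieves similar labeled examples and prepends them to the classification prompt; the embedding model and the two-entry knowledge base are illustrative stand-ins for our actual pipeline:

```python
from sentence_transformers import SentenceTransformer, util

# Toy knowledge base of labeled Solidity snippets (illustrative entries)
knowledge_base = [
    ("vulnerable (reentrancy)",
     "function withdraw() public { (bool ok,) = msg.sender.call{value: bal}(\"\"); balances[msg.sender] = 0; }"),
    ("safe",
     "function withdraw() public { uint bal = balances[msg.sender]; balances[msg.sender] = 0; payable(msg.sender).transfer(bal); }"),
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
kb_embeddings = embedder.encode([code for _, code in knowledge_base], convert_to_tensor=True)

def build_prompt(function_source: str, top_k: int = 1) -> str:
    """Retrieve the most similar labeled examples and prepend them to the prompt."""
    query = embedder.encode(function_source, convert_to_tensor=True)
    hits = util.semantic_search(query, kb_embeddings, top_k=top_k)[0]
    context = "\n\n".join(
        f"// label: {knowledge_base[hit['corpus_id']][0]}\n{knowledge_base[hit['corpus_id']][1]}"
        for hit in hits
    )
    return (
        "Reference examples:\n" + context +
        "\n\nClassify the following Solidity function as vulnerable or safe:\n" +
        function_source
    )
```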

 

Comment 5: The hyperparameters in the fine-tuning process (such as learning rate, batch size, number of training epochs, etc.) have not been specified, and there is a lack of explanation for the selection basis. It is recommended to supplement the specific values of these parameters and explain the reasons for their selection and their impact on model performance, so as to make the experimental process more reproducible.

 

Response 5:

 

We would like to thank the reviewer for this important observation. We have now included a table in the manuscript (now Table 5) that provides comprehensive details of the hyperparameters used during the fine-tuning process.

 

Additionally, we have included the following text in the Methodology section:

 

“The hyperparameters used for fine-tuning with the TrustLLM and EVuLLM datasets are shown in Table 5. While we did not perform an exhaustive hyperparameter search due to computational constraints, we based our choices on a combination of empirical testing, prior findings from similar PEFT-based LLM studies, and recommendations from the Unsloth library and Hugging Face documentation for 4-bit fine-tuning with QLoRA.

 

In particular, we experimented with a small number of configurations and selected the ones that offered the best trade-off between performance and training stability on a validation subset of the TrustLLM and EVuLLM datasets. While not exhaustive, our approach prioritizes practical reproducibility and reflects real-world constraints in deploying efficient fine-tuning pipelines for smart contract analysis.”
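As an illustration of how such a configuration is expressed in code, the sketch below uses Hugging Face TrainingArguments with placeholder values typical of 4-bit QLoRA runs; these are not the actual settings reported in Table 5:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="evullm-qlora",          # hypothetical output directory
    per_device_train_batch_size=4,      # small batches fit 4-bit models on one GPU
    gradient_accumulation_steps=4,      # effective batch size of 16
    learning_rate=2e-4,                 # common starting point for LoRA adapters
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    logging_steps=10,
    bf16=True,                          # mixed-precision compute
)
```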

 

Comment 6: The sentence structure of some technical descriptions can be further organized to improve clarity. For example, "parameter-efficient fine-tuning" and "PEFT" should be expressed consistently. At the same time, attention should be paid to the consistency of tenses and subject-verb agreement throughout the text to ensure the standardization of language expression.

 

Response 6: We appreciate the reviewer’s careful reading and valuable feedback regarding the clarity and consistency of the manuscript’s language.

 

We have conducted a thorough review of the manuscript to ensure that terminology is used consistently. Additionally, we revised several sentences throughout the text to improve sentence structure, clarify technical descriptions, and ensure consistency in tense and subject-verb agreement.

 

We thank the reviewer for pointing this out, and we believe these edits have enhanced the clarity and readability of the manuscript.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This article is located in the field of cybersecurity applied to blockchain, with a focus on detecting vulnerabilities in Ethereum smart contracts, using advanced artificial intelligence techniques (Large Language Models).
The article proposes the use of open-source LLMs, fine-tuned by efficient methods (PEFT with QLoRA), and introduces EVuLLM, a new dataset of vulnerable/secure Solidity functions, demonstrating that comparable or even superior performance to large commercial models can be achieved, but at a much reduced computational cost.
In order to increase the scientific value of the article and for a better understanding of the proposed methods, it would be advisable to address the following aspects:
1. It would be useful to highlight, beyond their limitations, in the section describing similar solutions, what problems the proposed solution solves or what improvements it brings.
2. For each section, explain in one sentence what is done in that section, what role it plays in the research conducted, and whether it is a review of existing work or an original contribution.
3. For tables 9 and 10, it is necessary to state the source from where those model implementations and those percentages were taken.
4. In order to integrate the proposed model into the state of the art, it is useful to extend the analysis to multi-class classification (e.g. SWC, OWASP).
5. Providing concrete examples of vulnerabilities detected correctly/incorrectly (error analysis).
6. Incorporating a cost study (runtime, memory, TCO vs commercial models).
7. Clarifying the preprocessing pipeline (e.g. regex extraction) through additional diagrams in the article.
8. Including simulations on real-world contracts, outside the dataset, for concrete validation.
9. A section highlighting the limitations of the proposed solution, possible ways to mitigate them, as well as the subdomains where this technique discussed in the article can possibly be applied would be useful.
10. The article needs to be restructured, with some sections rewritten to reduce its similarity to other specialized sources.

Author Response

Reviewer 2 Comments.

 

We would like to sincerely thank the Reviewer for the thoughtful and constructive feedback provided. Your detailed comments and insightful suggestions were very helpful in identifying areas of the manuscript that needed clarification and improvement. We have carefully addressed each of your points in the revised manuscript and believe that your input has substantially improved the overall quality and clarity of our work. We greatly appreciate the time and effort you dedicated to reviewing our submission.

 

General comment: This article is located in the field of cybersecurity applied to blockchain, with a focus on detecting vulnerabilities in Ethereum smart contracts, using advanced artificial intelligence techniques (Large Language Models). The article proposes the use of open-source LLMs, fine-tuned by efficient methods (PEFT with QLoRA), and introduces EVuLLM, a new dataset of vulnerable/secure Solidity functions, demonstrating that comparable or even superior performance to large commercial models can be achieved, but at a much reduced computational cost.

In order to increase the scientific value of the article and for a better understanding of the proposed methods, it would be advisable to address the following aspects:

 

Comment 1: It would be useful to highlight, beyond their limitations, in the section describing similar solutions, what problems the proposed solution solves or what improvements it brings.

 

Response 1:

 

We would like to thank the reviewer for their comment. To address this, we have added the following text at the end of the Related Work section:

 

“While several existing approaches to smart contract vulnerability detection have leveraged static analysis, dynamic analysis, and, more recently, large language models, limitations remain in terms of scalability, model transparency, and accessibility. Many prior LLM-based solutions rely on large proprietary models or resource-intensive full fine-tuning, which presents high financial and computational barriers, especially for researchers or developers operating in constrained environments. Our work addresses these challenges by demonstrating the effectiveness of lightweight, open-source LLMs fine-tuned using PEFT techniques for vulnerability detection. This reduces the resource burden while achieving state-of-the-art accuracy and even surpassing larger proprietary models.

 

In contrast to previous studies that often rely on narrow or private datasets, we introduce the EVuLLM dataset, a publicly available benchmark that combines and extends existing resources to support more robust evaluation. Additionally, our study explores RAG-based methods for smart contract analysis, showing that retrieval-enhanced inference can serve as a viable alternative to fine-tuning when computational budgets or data availability are limited. Together, these improvements offer a more accessible, reproducible, and privacy-preserving path toward the deployment of AI-driven security tools for blockchain ecosystems.”

 

Comment 2: For each section, explain in one sentence what is done in that section, what role it plays in the research conducted, and whether it is a review of existing work or an original contribution.

 

Response 2: We appreciate the reviewer’s suggestion and have revised the opening of each major section to include a clear, one-sentence summary that addresses what is done, the role of the section in the overall study, and whether the content reflects existing work or our own contribution. Below are the added or revised opening sentences for each section:

 

Background

 

“This section introduces key concepts related to Large Language Models (LLMs), fine-tuning strategies, and Parameter-Efficient Fine-Tuning (PEFT), providing foundational knowledge necessary to understand the methods and contributions of this work.”

 

Related Work

 

“This section reviews existing research on the use and security of smart contracts, contextualizing our work within ongoing efforts to improve reliability in blockchain-based financial systems.”

 

Methodology

 

“This section describes the methodology used in this study, focusing on the techniques and tools applied to address smart contract vulnerability detection. It begins with an overview of the datasets used, including their structure and preparation. First, TrustLLM, an existing dataset, is described. Following that, our own extended EVuLLM dataset is introduced. Next, the fine-tuning process for LLMs is detailed, covering the tools, model selection, and parameter configurations. Finally, it outlines our RAG approach, including its architecture, components, and implementation to enhance the model's performance.”

 

Results and Discussion

 

“This section presents and analyzes the results of our proposed detection methods, evaluating their effectiveness through standard classification metrics to assess the performance impact of the RAG approach and fine-tuning strategies.”

 

Conclusions and Future Work

 

“This section summarizes the key contributions and findings of our study on LLM-based smart contract vulnerability detection and outlines potential directions for future research and improvement.”

 

We hope these additions meet the reviewer's expectations and clarify the structure and contributions of each section.

 

Comment 3: For tables 9 and 10, it is necessary to state the source from where those model implementations and those percentages were taken.

 

Response 3: We thank the reviewer for pointing this out. The source implementations of the base models used in our fine-tuning experiments, whose results are presented in Tables 10 and 11 (previously Tables 9 and 10), are listed in Table 4, along with appropriate references. To improve clarity and traceability, we have updated the beginning of Section 5.3.1 to explicitly link these tables by adding the following statement:

 

“Table 10 presents the evaluation results of our fine-tuned models. Each model is based on a corresponding base implementation detailed in Table 4, which includes the original source and reference for reproducibility.”

 

And respectively at the beginning of Section 5.3.2:

 

“Table 11 shows the performance of our fine-tuned models, which are once again based on the original models listed in Table 4.”

 

We believe it is now clear which models we used for fine-tuning and that the reported results are for our own fine-tuned models. Links to our fine-tuned models on the Hugging Face repository are provided in the footnotes of the respective pages in the manuscript.

 

Comment 4: In order to integrate the proposed model into the state of the art, it is useful to extend the analysis to multi-class classification (e.g. SWC, OWASP).

 

Response 4: We appreciate the reviewer’s insightful comment. We agree that extending vulnerability detection to multi-class classification, such as categorizing vulnerabilities by SWC or OWASP types, would provide a more granular and practical analysis. However, our current work focuses on binary classification (i.e., vulnerable vs. safe), as both the TrustLLM and EVuLLM datasets used in our experiments are designed specifically for binary classification tasks. Supporting a multi-class framework would require the use or creation of additional datasets with fine-grained, type-specific vulnerability annotations, and would also necessitate substantial changes to model design, training, and evaluation pipelines.

 

We acknowledge this as a promising direction and have now included it in the Future Work section of the manuscript. Specifically, we note that future research should focus on building or leveraging datasets annotated with detailed vulnerability types and locations, and adapting models to support multi-class or even multi-label classification for enhanced precision in vulnerability detection. For that purpose, the following text was added to the Conclusions and Future Work section:

 

“Future work should also explore multi-class classification of vulnerabilities based on taxonomies such as SWC and OWASP, which would require richer datasets with fine-grained annotations and new model architectures capable of handling more complex output spaces.”

 

Comment 5: Providing concrete examples of vulnerabilities detected correctly/incorrectly (error analysis).

 

Response 5:

 

We would like to thank the reviewer for this valuable suggestion. To address this comment, we have added a new section titled "Classification Examples" within the Results section of the manuscript. This section provides illustrative examples of both correctly and incorrectly classified smart contract functions, including detailed contextual information and analysis.

 

In particular, we highlight a case where the CodeGemma-RAG model misclassified a vulnerable preSign function as safe, and contrast it with the Granite Code model, which correctly identified it as vulnerable. We also present a correctly classified example (depositETH from the QBridge contract), demonstrating how the model benefited from relevant retrieval context and standard secure coding patterns.

 

These examples help clarify how model behavior is influenced by code structure, retrieval context, and potential reliance on surface-level patterns. We believe this addition improves the transparency and interpretability of our results.

 

Comment 6: Incorporating a cost study (runtime, memory, TCO vs commercial models).

 

Response 6: We thank the reviewer for this valuable suggestion. In response, we have expanded the manuscript to provide a more detailed comparison between the use of proprietary models (specifically ChatGPT-4o) and local fine-tuned models. We now include a cost estimate for ChatGPT-4o, showing that the average cost per function inference is approximately $0.06, based on token pricing and output size.
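For illustration, the per-call cost follows from simple token arithmetic; the prices and token counts below are assumptions chosen to show the shape of the calculation, not the exact figures behind the $0.06 estimate:

```python
# Assumed API prices (USD per token); illustrative, not quoted pricing
PRICE_IN = 5.00 / 1_000_000
PRICE_OUT = 15.00 / 1_000_000

def inference_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single classification call under the assumed prices."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

# e.g. a prompt carrying the function source plus instructions, and a verbose answer
print(f"${inference_cost(2_000, 3_000):.3f} per function")  # -> $0.055
```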

 

For local models, we have added detailed performance metrics, including training time, peak memory usage, and model loading memory on A100 GPUs for both the EVuLLM and TrustLLM datasets (see revised text and Figures X and Y). We also report inference speeds on both high-end GPUs and commodity CPUs to illustrate performance variability.

 

Additionally, we acknowledge in the revised manuscript that it is difficult to assign an exact monetary cost to local inference. We have included a new paragraph explaining that local costs depend on factors such as hardware depreciation, energy consumption, and maintenance, which are highly context-dependent and not easily generalizable. This addition aims to provide a balanced and realistic view of the trade-offs between cloud-based and locally hosted models.

 

We hope these revisions address the reviewer’s concern and strengthen the cost analysis component of our study.

 

Comment 7: Clarifying the preprocessing pipeline (e.g. regex extraction) through additional diagrams in the article.

 

Response 7: We would like to thank the reviewer for their comment. We have now added a figure that shows the preprocessing pipeline for creating our EVuLLM dataset (Figure 1). We believe this makes the process clearer to the reader.
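For readers interested in the mechanics, function extraction of the kind shown in Figure 1 can be sketched with a regex plus brace counting; the pattern below is a simplification for illustration, not our exact implementation:

```python
import re

# Match a Solidity function signature up to and including its opening brace.
# Simplified: does not handle comments or bodyless declarations in all cases.
FUNC_PATTERN = re.compile(r"function\s+\w+\s*\([^)]*\)[^{;]*\{")

def extract_functions(source: str) -> list[str]:
    """Extract function bodies by brace counting from each signature match."""
    functions = []
    for match in FUNC_PATTERN.finditer(source):
        depth = 0
        for j in range(match.start(), len(source)):
            if source[j] == "{":
                depth += 1
            elif source[j] == "}":
                depth -= 1
                if depth == 0:
                    functions.append(source[match.start(): j + 1])
                    break
    return functions
```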

 

Comment 8: Including simulations on real-world contracts, outside the dataset, for concrete validation.

 

Response 8: We appreciate the reviewer’s suggestion regarding real-world validation. The datasets used in our study, TrustLLM and EVuLLM, are composed of real-world Ethereum smart contracts and contract functions, including both known vulnerable and safe examples. These datasets were curated specifically to reflect practical, security-relevant scenarios, and represent realistic usage patterns and vulnerabilities observed in deployed contracts.

 

We agree that evaluating on entirely unseen contracts could offer additional validation. However, such an effort would effectively involve creating an expanded dataset, which requires careful labeling and verification to ensure reliability, particularly in a domain as sensitive as vulnerability detection. Since model conclusions and accuracy metrics depend on clearly annotated ground truth, any meaningful evaluation must rely on well-defined datasets.

 

To clarify this point, we have revised the manuscript to emphasize the real-world origin of our evaluation data and noted the importance of dataset curation for reproducibility and fairness.

 

Finally, we have added the following text in the Conclusions and Future Work section to emphasize the need for benchmarking on fresh, real-world examples:

 

“Additionally, we acknowledge the importance of validating models on a broader and continuously updated set of contracts not included in the training or evaluation data. As part of future work, we suggest curating and annotating larger datasets from newly deployed contracts, enabling continuous benchmarking on fresh, real-world examples.”

 

Comment 9: A section highlighting the limitations of the proposed solution, possible ways to mitigate them, as well as the subdomains where this technique discussed in the article can possibly be applied would be useful.

 

Response 9:

 

We would like to thank the reviewer for their insightful comment. We have revised the Conclusions and Future Work section to include further limitations of our proposed solution (also see our response to the previous question) as well as mitigation suggestions and possible domains where a similar solution might be deployed. For that purpose, we have included the following text in the Conclusions and Future Work section:

 

“To mitigate these limitations, future work should prioritize building larger, better-labeled datasets that enable multi-class vulnerability detection and fine-grained localization. Incorporating full contract-level representations and static analysis features may further improve model comprehension. Additionally, enhancing RAG pipelines with better context filtering and semantic similarity retrieval could increase their reliability.

 

Beyond smart contract auditing, the techniques discussed in this paper, particularly lightweight fine-tuning and hybrid LLM-retrieval methods, can be applied to other security-sensitive code auditing tasks, such as detecting misconfigurations in infrastructure-as-code (e.g., Terraform) or vulnerabilities in API definitions. The methodology may also generalize to automated compliance checks or secure software development in other domains involving high-stakes code correctness.”

 

Comment 10: The article needs to be restructured, with some sections rewritten to reduce its similarity to other specialized sources.

 

Response 10: We thank the reviewer for this important observation. In response, we have carefully reviewed the manuscript and made substantial revisions to both the structure and the wording of several sections to improve originality and clarity. Specifically, we rephrased and reorganized portions of the background, methodology, and results sections to ensure the content more clearly reflects our unique contributions and avoids unnecessary overlap with prior literature or reference material. We have also made an effort to more clearly distinguish our phrasing, remove templated descriptions, and enhance the narrative flow. We believe these changes address the reviewer’s concern and improve the overall quality and originality of the manuscript.

Reviewer 3 Report

Comments and Suggestions for Authors

Nice work by the authors on smart contract vulnerability detection using LLMs; some minor suggestions are:

I. The use of the TrustLLM and EVuLLM datasets can be further explained for greater clarity.

II. Image quality can still be enhanced (for Figure 1, Figure 2).

III. A sub-section as "Limitations of the Study" can be added.

Author Response

Reviewer 3 Comments:

 

We would like to sincerely thank the Reviewer for the thoughtful and constructive feedback provided. Your detailed comments and insightful suggestions were very helpful in identifying areas of the manuscript that needed clarification and improvement. We have carefully addressed each of your points in the revised manuscript and believe that your input has substantially improved the overall quality and clarity of our work. We greatly appreciate the time and effort you dedicated to reviewing our submission.

 

General comment: Nice work by the authors on smart contract vulnerability detection using LLMs; some minor suggestions are:

Comment 1: The use of the TrustLLM and EVuLLM datasets can be further explained for greater clarity.

 

Response 1: We would like to thank the reviewer for their comment. We understand that the reasoning behind using two distinct datasets might not have been clear in our original manuscript. In response, we have added the following paragraph in the Methodology section, which we hope provides clarity:

 

“We used TrustLLM primarily because it is a well-established benchmark dataset introduced in prior literature, allowing for meaningful comparison with existing vulnerability detection models. Its structured format, real-world provenance from Code4rena audits, and support for both classification and justification tasks make it ideal for fine-tuning and evaluation. In contrast, EVuLLM was introduced in this work as a complementary dataset to enhance robustness and test generalization. It was constructed from real-world smart contract data (DeFiHacks and Top200) and adapted for function-level classification, which aligns better with the input granularity of our models. Using both datasets enables a broader and more reliable assessment: TrustLLM offers continuity with prior work, while EVuLLM adds diversity and context breadth that help validate our findings across multiple real-world settings.”


Comment 2: Image quality can still be enhanced (for Figure 1, Figure 2).

 

Response 2: We thank the reviewer for pointing this out. In response, all original figures, including Figure 1 and Figure 2, have been replaced with updated versions in vector format to ensure maximum clarity and resolution at any scale. Additionally, all new figures introduced during the revision process have also been included in vector format to maintain consistent and optimal image quality throughout the manuscript.

 

Comment 3: A sub-section as "Limitations of the Study" can be added.

 

Response 3: We would like to thank the reviewer for their insightful comment. We have revised the Conclusions and Future Work section to include further limitations of our proposed solution as well as mitigation suggestions. For that purpose, we have included the following text in the Conclusions and Future Work section:

 

“This work also surfaces several limitations that guide future directions. The binary classification framework (safe vs. vulnerable) limits the granularity of insights, and the modest dataset size with coarse annotations constrains generalization. Future work should focus on developing larger, more richly annotated datasets that include specific vulnerability types and locations. Moreover, our function-level analysis omits inter-function and contract-level context, underscoring the need for models that can capture these broader semantic relationships. Additionally, we acknowledge the importance of validating models on a broader and continuously updated set of contracts not included in the training or evaluation data. As part of future work, we suggest curating and annotating larger datasets from newly deployed contracts, enabling continuous benchmarking on fresh, real-world examples.

 

To mitigate these limitations, future work should prioritize building larger, better-labeled datasets that enable multi-class vulnerability detection and fine-grained localization. Incorporating full contract-level representations and static analysis features may further improve model comprehension. Additionally, enhancing RAG pipelines with better context filtering and semantic similarity retrieval could increase their reliability.”

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The revisions have fully addressed the concerns raised during the review, and the quality of the paper has been significantly enhanced. However, it is recommended to read through the entire text once more before finalizing to check for possible redundant expressions or minor formatting issues caused by the added content.

Author Response

Reviewer Comment: The revisions have fully addressed the concerns raised during the review, and the quality of the paper has been significantly enhanced. However, it is recommended to read through the entire text once more before finalizing to check for possible redundant expressions or minor formatting issues caused by the added content.

 

Author Response: We appreciate the reviewer’s positive feedback and the recognition of the improvements made. In response to the suggestion, we carefully reviewed the entire manuscript to identify and eliminate redundant expressions and resolve minor formatting issues introduced during revision. We have revised several paragraphs for clarity, conciseness, and consistency, ensuring that the final version maintains a high standard of readability and presentation. We thank the reviewer again for their thoughtful input, which has helped further refine the manuscript.
